Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[UNIFORM] Remove the setSchema patch #3460

Open
wants to merge 1 commit into
base: master
Choose a base branch
from

Conversation

Fokko
Copy link
Contributor

@Fokko Fokko commented Aug 1, 2024

This patch aims to align the field-ID assignment between uniform and Iceberg. This way we can use the public Iceberg APIs to evolve the schema, instead of enforcing the schema through a private API.

The difference is that Iceberg does not have true pre-order traversal, but when it encounters a struct, it reserves an ID for each field in the struct before continuing the traversal.

Iceberg:

1: struct
  2: struct
    4: long
    5: timestamp
  3: int

Delta:

1: struct
  2: struct
    3: long
    4: timestamp
  5: int

This PR changes Delta's assignment behavior, so we don't need the patch anymore.

Which Delta project/connector is this regarding?

  • Spark
  • Standalone
  • Flink
  • Kernel
  • Other (fill in here)

Description

How was this patch tested?

Does this PR introduce any user-facing changes?

@Fokko Fokko changed the title Remove the set patch Remove the setSchema patch Aug 1, 2024
@Fokko Fokko force-pushed the fd-try-removing-patch branch 3 times, most recently from 7e8e259 to 5b74b24 Compare August 5, 2024 07:53
@Fokko Fokko marked this pull request as ready for review August 5, 2024 07:55
@Fokko
Copy link
Contributor Author

Fokko commented Aug 5, 2024

cc @amogh-jahagirdar

@Fokko Fokko force-pushed the fd-try-removing-patch branch 2 times, most recently from 26e2c3b to 18f5b4c Compare August 6, 2024 20:38
@Fokko Fokko changed the title Remove the setSchema patch [UNIFORM] Remove the setSchema patch Aug 6, 2024
Copy link
Collaborator

@tomvanbussel tomvanbussel left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Left some comments on the code.

In general I am a bit worried about the brittleness of this solution. It relies on every code path assigning field ids in the exact same way as Iceberg. It's highly likely that someone adding a new code path in the future might not be aware of this requirement. At the very least we should add some verification logic to the UniForm exporter to confirm that the field ids are as expected. Would it instead be possible to add some interfaces to Iceberg that can help us with evolving the schema?

One open question for me is how this will work with existing tables that have already assigned field ids in an incompatible way.

Comment on lines +299 to +300
def transformColumns(schema: StructType,
traverseStructsAtOnce: Boolean = false)(
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Style Nit:

Suggested change
def transformColumns(schema: StructType,
traverseStructsAtOnce: Boolean = false)(
def transformColumns(
schema: StructType,
traverseStructsAtOnce: Boolean = false)(

def transformColumns(
schema: StructType)(
def transformColumns(schema: StructType,
traverseStructsAtOnce: Boolean = false)(
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

"At once" is a little bit ambiguous. Could we call this something like traverseAllStructFieldsBeforeRecursing?

StructType(traversedFields.map { traversedField =>
val newField = if (!traverseStructsAtOnce) {
tf(path, traversedField, DELTA_COL_RESOLVER)
} else traversedField
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Style Nit: Either add both branches inline, or create a block for both branches

Suggested change
} else traversedField
} else {
traversedField
}

val newSchema =
SchemaMergingUtils.transformColumns(rawSchema)((path, field, _) => {
SchemaMergingUtils.transformColumns(rawSchema, traverseStructsAtOnce = true)(
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This change could benefit from a few tests in DeltaColumnMappingSuite that create a table and add some columns and then check the assigned field ids in the log.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants