-
Notifications
You must be signed in to change notification settings - Fork 1.7k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[UNIFORM] Remove the setSchema
patch
#3460
base: master
Are you sure you want to change the base?
Conversation
7e8e259
to
5b74b24
Compare
5b74b24
to
b6889ad
Compare
26e2c3b
to
18f5b4c
Compare
18f5b4c
to
4fb469a
Compare
4fb469a
to
c581018
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Left some comments on the code.
In general I am a bit worried about the brittleness of this solution. It relies on every code path assigning field ids in the exact same way as Iceberg. It's highly likely that someone adding a new code path in the future might not be aware of this requirement. At the very least we should add some verification logic to the UniForm exporter to confirm that the field ids are as expected. Would it instead be possible to add some interfaces to Iceberg that can help us with evolving the schema?
One open question for me is how this will work with existing tables that have already assigned field ids in an incompatible way.
def transformColumns(schema: StructType, | ||
traverseStructsAtOnce: Boolean = false)( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Style Nit:
def transformColumns(schema: StructType, | |
traverseStructsAtOnce: Boolean = false)( | |
def transformColumns( | |
schema: StructType, | |
traverseStructsAtOnce: Boolean = false)( |
def transformColumns( | ||
schema: StructType)( | ||
def transformColumns(schema: StructType, | ||
traverseStructsAtOnce: Boolean = false)( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
"At once" is a little bit ambiguous. Could we call this something like traverseAllStructFieldsBeforeRecursing
?
StructType(traversedFields.map { traversedField => | ||
val newField = if (!traverseStructsAtOnce) { | ||
tf(path, traversedField, DELTA_COL_RESOLVER) | ||
} else traversedField |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Style Nit: Either add both branches inline, or create a block for both branches
} else traversedField | |
} else { | |
traversedField | |
} |
val newSchema = | ||
SchemaMergingUtils.transformColumns(rawSchema)((path, field, _) => { | ||
SchemaMergingUtils.transformColumns(rawSchema, traverseStructsAtOnce = true)( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This change could benefit from a few tests in DeltaColumnMappingSuite
that create a table and add some columns and then check the assigned field ids in the log.
This patch aims to align the field-ID assignment between uniform and Iceberg. This way we can use the public Iceberg APIs to evolve the schema, instead of enforcing the schema through a private API.
The difference is that Iceberg does not have true pre-order traversal, but when it encounters a struct, it reserves an ID for each field in the struct before continuing the traversal.
Iceberg:
Delta:
This PR changes Delta's assignment behavior, so we don't need the patch anymore.
Which Delta project/connector is this regarding?
Description
How was this patch tested?
Does this PR introduce any user-facing changes?