
Support converting Iceberg for CONVERT TO DELTA command in Delta Lake #1463

Closed

@jackierwzhang (Contributor) commented Oct 27, 2022

Description

This PR adds support for in-place conversion of Iceberg tables to Delta using the CONVERT TO DELTA command in Apache Spark. Specifically, it supports converting a Parquet-based Iceberg table inside a path/directory to the Delta Lake format.

Here's an example flow:

  1. Given a Spark environment,
  2. Follow the Iceberg setup here. Please use the Hadoop directory-based catalog so that the Iceberg table can be found at a path.
  3. Suppose we now have an Iceberg table sitting at s3://bucket/catalog/db/table.
  4. Run the following command:
     CONVERT TO DELTA iceberg.`s3://bucket/catalog/db/table`
  5. Now you have a Delta table at the same location!
  6. To bring this Delta table into any Spark catalog, simply run CREATE TABLE delta_table USING delta LOCATION 's3://bucket/catalog/db/table' (see the sketch after this list).
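
For reference, here is a minimal end-to-end sketch of the flow above in a Scala Spark session. It assumes the delta-iceberg-compat jars are already on the classpath; the catalog name iceberg, the warehouse path, and the table location mirror the example and are illustrative only.

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("iceberg-to-delta")
  // Delta Lake SQL support
  .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension")
  .config("spark.sql.catalog.spark_catalog", "org.apache.spark.sql.delta.catalog.DeltaCatalog")
  // Iceberg Hadoop directory-based catalog from step 2; used to create/read the source table
  .config("spark.sql.catalog.iceberg", "org.apache.iceberg.spark.SparkCatalog")
  .config("spark.sql.catalog.iceberg.type", "hadoop")
  .config("spark.sql.catalog.iceberg.warehouse", "s3://bucket/catalog")
  .getOrCreate()

// In-place conversion: writes a Delta log at the same location (step 4)
spark.sql("CONVERT TO DELTA iceberg.`s3://bucket/catalog/db/table`")

// Register the converted table in a Spark catalog (step 6)
spark.sql("CREATE TABLE delta_table USING delta LOCATION 's3://bucket/catalog/db/table'")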

See more detail in this ticket: #1462.

How was this patch tested?

New unit tests.

We have tested Iceberg versions from 0.13.1 to 1.0.0.

Does this PR introduce any user-facing changes?

It introduces a delta-iceberg-compat module that contains all the Iceberg + Spark dependencies. Please include this module during Spark startup so that the CONVERT TO DELTA command can work.
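
For example, here is a sketch of pulling the module in at session startup. The Maven coordinates below are guessed from the module name and may differ from the published artifact:

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("convert-to-delta")
  // Hypothetical coordinates; substitute the real group/artifact/version once published
  .config("spark.jars.packages", "io.delta:delta-iceberg-compat_2.12:<version>")
  .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension")
  .config("spark.sql.catalog.spark_catalog", "org.apache.spark.sql.delta.catalog.DeltaCatalog")
  .getOrCreate()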

@scottsand-db (Collaborator) commented Oct 27, 2022

@jackierwzhang

Does this PR introduce any user-facing changes?

This section should also discuss the new Maven artifact your PR is proposing, and how to include it in a Spark session along with Delta Lake.

Also - have you tested this on a real cluster (e.g. EMR)? Your build.sbt project definition is a little unlike the others, e.g. yours includes .cross(CrossVersion.binary) yet commonSettings already includes crossScalaVersions := all_scala_versions, so I'd want to see this tested.

Also - we should add an integration test for this. This PR is already large enough so I think we can add this in a future PR.

@jackierwzhang (Contributor, Author) commented Oct 27, 2022

Also - have you tested this on a real cluster (e.g. EMR)?

Yes, I have tested this on a real cluster.

Your build.sbt project definition is a little unlike the others, e.g. yours includes .cross(CrossVersion.binary) yet commonSettings already includes crossScalaVersions := all_scala_versions, so I'd want to see this tested.

Yes. Basically, this allows the Scala build to cross-compile the different Iceberg versions against all_scala_versions as we defined. If I don't do this, the 2.13 compiler fails (which served as an inverse test, I suppose).
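
For context, a minimal sketch of the sbt pattern under discussion; the Iceberg artifact and version are placeholders, not the exact dependency in this PR:

// build.sbt (sketch): .cross(CrossVersion.binary) appends the Scala binary
// suffix (_2.12 / _2.13), so one declaration resolves correctly for every
// entry in crossScalaVersions.
libraryDependencies +=
  ("org.apache.iceberg" % "iceberg-spark-runtime-3.3" % "1.0.0").cross(CrossVersion.binary)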

we should add an integration test for this. This PR is already large enough so I think we can add this in a future PR.

Sure, but imo the unit test already runs against a complete Spark environment with a standard Iceberg setup.

build.sbt Outdated
lazy val deltaIcebergCompat = (project in file("delta-iceberg-compat"))
.dependsOn(core % "compile->compile;test->test;provided->provided")
.settings (
name := "delta-iceberg-compat",

A Contributor commented on this definition:

@zsxwing do you think this is a good name?
compat or compatibility?

and should we have "spark" in the name?

@scottsand-db (Collaborator) commented:

we should add an integration test for this. This PR is already large enough so I think we can add this in a future PR.

Sure, but imo the unit test already runs against a complete Spark environment with a standard Iceberg setup.

Yes, but this doesn't test the packaged JAR. We've caught many bugs (e.g. shading, dependencies, Scala version issues) with our integration tests before.

@vkorukanti (Collaborator) left a comment:

+1 on adding an integration test.

Also, docs on which jars need to be copied (end-to-end steps).

Otherwise LGTM.

@vkorukanti (Collaborator) commented:

Are there any limitations on which Iceberg types/features are and aren't supported?

@jackierwzhang (Contributor, Author) commented Oct 31, 2022

Are there any limitations on which Iceberg types/features are and aren't supported?

There are a few remaining things we don't yet support:

  1. Converting Hive-based Iceberg tables
  2. Converting non-Parquet Iceberg tables
  3. Converting special partition transformation rules (ref)
  4. Converting Iceberg tables with custom name mapping (ref)
  5. Converting Iceberg tables with case-sensitive column names (ref)

@vkorukanti (Collaborator) left a comment:

lgtm
