
Extension to read and ingest Delta Lake tables #15755

Merged
merged 72 commits into from
Jan 31, 2024

Conversation

abhishekrb19
Contributor

@abhishekrb19 abhishekrb19 commented Jan 24, 2024

Delta Lake is an open source storage layer that brings reliability to data lakes. Users can ingest data stored in a Delta Lake table into Apache Druid via a new input source, delta. This is provided by the new Delta Lake extension; to use it, add druid-deltalake-extensions to the list of loaded extensions. The Delta input source reads the configured Delta Lake table and extracts all the underlying delta files in the table's latest snapshot. Delta Lake files are stored in versioned Parquet format.

This is joint work with @abhishekagarwal87, @LakshSingla and @AmatyaAvadhanula.

Example of a DML query using MSQ:

REPLACE INTO "delta-employee-datasource" OVERWRITE ALL
SELECT * FROM TABLE(
  EXTERN(
    '{"type":"delta","tablePath":"s3a://your-bucket/employee-delta-table"}',
    '{"type":"json"}'
  )
) EXTEND ("timestamp" VARCHAR, "name" VARCHAR, "age" BIGINT, "salary" DOUBLE, "bonus" FLOAT)
PARTITIONED BY ALL

Delta ioConfig in a native batch spec:

    "ioConfig": {
      "type": "index_parallel",
      "inputSource": {
        "type": "delta",
        "tablePath": "s3a://your-bucket/employee-delta-table"
      }
    }
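For context, a minimal end-to-end native batch spec built around this ioConfig might look like the sketch below. Only the inputSource shape comes from this PR; the datasource name, timestamp column, and dimension list are illustrative assumptions borrowed from the MSQ example above.

```python
import json

# Sketch of a full native batch ingestion spec using the new "delta" input
# source. The dataSchema fields here are assumptions for illustration; only
# the ioConfig/inputSource portion is taken from the PR description.
ingestion_spec = {
    "type": "index_parallel",
    "spec": {
        "ioConfig": {
            "type": "index_parallel",
            "inputSource": {
                "type": "delta",
                "tablePath": "s3a://your-bucket/employee-delta-table",
            },
        },
        "dataSchema": {
            "dataSource": "delta-employee-datasource",
            "timestampSpec": {"column": "timestamp", "format": "auto"},
            "dimensionsSpec": {"dimensions": ["name", "age", "salary", "bonus"]},
        },
        "tuningConfig": {"type": "index_parallel"},
    },
}

print(json.dumps(ingestion_spec["spec"]["ioConfig"], indent=2))
```

The nesting mirrors other Druid native batch specs: the input source plugs into ioConfig unchanged, and everything else (schema, tuning) is the same as for any other input source.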

Web-console screenshots:

[screenshot: CleanShot 2024-01-30 at 02 45 51@2x]

Sample and ingestion of a mock dataset from the cloud:

[screenshot: CleanShot 2024-01-30 at 02 47 49@2x]

Delta 3.1.0 was just released. We can look into upgrading the Delta dependency as a follow-up.

Release note

Enhanced ingestion capabilities to support ingestion of Delta Lake tables.

This PR has:

  • been self-reviewed.
  • added documentation for new or modified features or behaviors.
  • a release note entry in the PR description.
  • added Javadocs for most classes and all non-trivial methods. Linked related entities via Javadoc links.
  • added or updated version, license, or notice information in licenses.yaml
  • added comments explaining the "why" and the intent of the code wherever would not be obvious for an unfamiliar reader.
  • added unit tests or modified existing tests to cover new code paths, ensuring the threshold for code coverage is met.
  • added integration tests.
  • been tested in a test Druid cluster.

abhishekagarwal87 and others added 30 commits December 19, 2023 22:28
… transitive deps from delta-lake

Will need to sort out the dependencies later.
Contributor

@vogievetsky vogievetsky left a comment


Web console changes look good.

Member

@vtlim vtlim left a comment


Some nits but docs otherwise look good!

Review comments (resolved):

  • docs/development/extensions-contrib/delta-lake.md
  • docs/ingestion/input-sources.md
abhishekrb19 and others added 2 commits January 30, 2024 17:14
Co-authored-by: Victoria Lim <vtlim@users.noreply.github.com>
@abhishekrb19 abhishekrb19 merged commit 9f95a69 into apache:master Jan 31, 2024
83 checks passed
@abhishekrb19 abhishekrb19 added this to the 29.0.0 milestone Jan 31, 2024
abhishekrb19 added a commit to abhishekrb19/incubator-druid that referenced this pull request Jan 31, 2024
* something

* test commit

* compilation fix

* more compilation fixes (fixme placeholders)

* Comment out druid-kerberos build since it conflicts with newly added transitive deps from delta-lake

Will need to sort out the dependencies later.

* checkpoint

* remove snapshot schema since we can get schema from the row

* iterator bug fix

* json json json

* sampler flow

* empty impls for read(InputStats) and sample()

* conversion?

* conversion, without timestamp

* Web console changes to show Delta Lake

* Asset bug fix and tile load

* Add missing pieces to input source info, etc.

* fix stuff

* Use a different delta lake asset

* Delta lake extension dependencies

* Cleanup

* Add InputSource, module init and helper code to process delta files.

* Test init

* Checkpoint changes

* Test resources and updates

* some fixes

* move to the correct package

* More tests

* Test cleanup

* TODOs

* Test updates

* requirements and javadocs

* Adjust dependencies

* Update readme

* Bump up version

* fixup typo in deps

* forbidden api and checkstyle checks

* Trim down dependencies

* new lines

* Fixup Intellij inspections.

* Add equals() and hashCode()

* chain splits, intellij inspections

* review comments and todo placeholder

* fix up some docs

* null table path and test dependencies. Fixup broken link.

* run prettify

* Different test; fixes

* Upgrade pyspark and delta-spark to latest (3.5.0 and 3.0.0) and regenerate tests

* yank the old test resource.

* add a couple of sad path tests

* Updates to readme based on latest.

* Version support

* Extract Delta DateTime conversions to DeltaTimeUtils class and add test

* More comprehensive split tests.

* Some test renames.

* Cleanup and update instructions.

* add pruneSchema() optimization for table scans.

* Oops, missed the parquet files.

* Update default table and rename schema constants.

* Test setup and misc changes.

* Add class loader logic as the context class loader is unaware about extension classes

* change some table client creation logic.

* Add hadoop-aws, hadoop-common and related exclusions.

* Remove org.apache.hadoop:hadoop-common

* Apply suggestions from code review

Co-authored-by: Victoria Lim <vtlim@users.noreply.github.com>

* Add entry to .spelling to fix docs static check

---------

Co-authored-by: abhishekagarwal87 <1477457+abhishekagarwal87@users.noreply.github.com>
Co-authored-by: Laksh Singla <lakshsingla@gmail.com>
Co-authored-by: Victoria Lim <vtlim@users.noreply.github.com>
@pjain1
Member

pjain1 commented Jan 31, 2024

@abhishekrb19 thanks for this. I was just going through this and wondering if the Delta Lake kernel API supports any partitioning by time that we can leverage during indexing. Ingestion specs generally have an interval, so can that be used to prune the set of files read from Delta Lake during indexing? With my limited understanding of the Delta Lake kernel API, it supports filtering on columns while building a scan.

I have not seen many Delta Lake data lakes, but one I have seen had a base directory which hosted the metadata in a _delta_log folder; the data was partitioned by day, so it had subdirectories for each day of data, and those directories had hourly subdirectories. So if someone needs to run an hourly indexing job, will the extension end up reading all the data again and again?

@abhishekrb19
Contributor Author

Hi @pjain1, yes, Delta tables can be partitioned by date. Coincidentally, Delta 3.1.0 was released today. In 3.1.0, the kernel supports data skipping in addition to partition pruning using filter predicates, wherever applicable. I will try to find time to add this functionality soon. In its current form, the connector does full table scans, so I think upgrading to 3.1.0 and adding support for filters would be a good next step.
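As a rough illustration of what partition pruning buys here (a toy sketch in plain Python, not the Delta Kernel API; the file paths and dates are made up):

```python
from datetime import date

# Toy model of a date-partitioned Delta table: each data file carries its
# partition value in table metadata, so a time filter can skip whole files
# without ever opening them. This is the pruning a filter predicate enables.
files = [
    {"path": "date=2024-01-29/part-0.parquet", "partition": date(2024, 1, 29)},
    {"path": "date=2024-01-30/part-0.parquet", "partition": date(2024, 1, 30)},
    {"path": "date=2024-01-31/part-0.parquet", "partition": date(2024, 1, 31)},
]

def prune(files, start, end):
    """Keep only files whose partition value falls inside [start, end)."""
    return [f["path"] for f in files if start <= f["partition"] < end]

# An hourly/daily indexing job with an interval filter would only touch the
# matching partition directories instead of the whole table.
selected = prune(files, date(2024, 1, 30), date(2024, 1, 31))
print(selected)  # only the 2024-01-30 file survives the scan
```

With no filter (the connector's current behavior), every file is read; with a filter pushed into the scan, only the files whose partition values match the ingestion interval are read.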

@abhishekrb19 abhishekrb19 deleted the delta_lake_connector_ext branch January 31, 2024 07:15
@pjain1
Member

pjain1 commented Jan 31, 2024

@abhishekrb19 thanks

abhishekagarwal87 pushed a commit that referenced this pull request Jan 31, 2024
…15812)

First commit: clean backport of Extension to read and ingest Delta Lake tables #15755
Second commit: change version from 30.0.0 to 29.0.0