[Enhancement] Support cache delta lake delta log metadata #49069

Merged
merged 3 commits into from
Aug 1, 2024

@Youngwb Youngwb commented Jul 29, 2024

Why I'm doing:

Querying a Delta Lake table requires reading the Delta log, and reading a large number of JSON/Parquet files can take a long time. These metadata files do not change once written, so they are well suited to caching, which avoids repeated reads.

What I'm doing:

  1. add jsonCache for json files in delta log
  2. add checkpointCache for parquet files in delta log
  3. add catalog properties for delta log cache
     Property                                   Default value
     enable_deltalake_table_cache               true
     enable_deltalake_json_meta_cache           true
     deltalake_json_meta_cache_ttl_sec          48 * 60 * 60 (48 hours)
     deltalake_json_meta_cache_max_num          1000
     enable_deltalake_checkpoint_meta_cache     false
     deltalake_checkpoint_meta_cache_ttl_sec    48 * 60 * 60 (48 hours)
     deltalake_checkpoint_meta_cache_max_num    100
  4. add DeltaLakeEngine to override DefaultEngine; this allows us to implement a customized reading process for Parquet and JSON.
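
The JSON and checkpoint caches listed above are each bounded by an entry count (`*_max_num`) and a time-to-live (`*_ttl_sec`). A minimal sketch of that idea, using only the JDK, might look like the following. The class `TtlLruCache` and its methods are hypothetical names chosen for illustration and are not taken from this PR; the actual implementation may well rely on a caching library instead.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.function.Function;
import java.util.function.LongSupplier;

/** Hypothetical sketch: an LRU cache bounded by entry count, with a per-entry TTL. */
final class TtlLruCache<K, V> {
    /** A cached value together with its expiry time. */
    private static final class Timed<T> {
        final T value;
        final long expiresAtMillis;

        Timed(T value, long expiresAtMillis) {
            this.value = value;
            this.expiresAtMillis = expiresAtMillis;
        }
    }

    private final long ttlMillis;
    private final LongSupplier clockMillis;
    private final LinkedHashMap<K, Timed<V>> entries;

    TtlLruCache(final int maxNum, long ttlSec, LongSupplier clockMillis) {
        this.ttlMillis = ttlSec * 1000L;
        this.clockMillis = clockMillis;
        // accessOrder = true keeps the least recently used entry first.
        this.entries = new LinkedHashMap<K, Timed<V>>(16, 0.75f, true) {
            @Override
            protected boolean removeEldestEntry(Map.Entry<K, Timed<V>> eldest) {
                // Evict the least recently used entry once the bound is exceeded.
                return size() > maxNum;
            }
        };
    }

    /** Returns the cached value, or calls the loader when the entry is missing or expired. */
    synchronized V get(K key, Function<K, V> loader) {
        long now = clockMillis.getAsLong();
        Timed<V> hit = entries.get(key);
        if (hit != null && hit.expiresAtMillis > now) {
            return hit.value;
        }
        V loaded = loader.apply(key);
        entries.put(key, new Timed<>(loaded, now + ttlMillis));
        return loaded;
    }

    synchronized int size() {
        return entries.size();
    }
}
```

Under these assumptions, `new TtlLruCache<String, String>(1000, 48 * 60 * 60, System::currentTimeMillis)` would mirror the default JSON cache settings from the table, with the loader standing in for the code that reads and parses one Delta log file.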

Fixes #issue

What type of PR is this:

  • BugFix
  • Feature
  • Enhancement
  • Refactor
  • UT
  • Doc
  • Tool

Does this PR entail a change in behavior?

  • Yes, this PR will result in a change in behavior.
  • No, this PR will not result in a change in behavior.

If yes, please specify the type of change:

  • Interface/UI changes: syntax, type conversion, expression evaluation, display information
  • Parameter changes: default values, similar parameters but with different default values
  • Policy changes: use new policy to replace old one, functionality automatically enabled
  • Feature removed
  • Miscellaneous: upgrade & downgrade compatibility, etc.

Checklist:

  • I have added test cases for my bug fix or my new feature
  • This pr needs user documentation (for new or modified features or behaviors)
    • I have added documentation for my new feature or new function
  • This is a backport pr

Bugfix cherry-pick branch check:

  • I have checked the version labels which the pr will be auto-backported to the target branch
    • 3.3
    • 3.2
    • 3.1
    • 3.0
    • 2.5

@Youngwb Youngwb requested a review from a team as a code owner July 29, 2024 09:14
trueeyu previously approved these changes Jul 30, 2024
public CloseableIterator<ColumnarBatch> readParquetFiles(
        CloseableIterator<FileStatus> fileIter,
        StructType physicalSchema,
        Optional<Predicate> predicate) {
Contributor

Why don't we use the predicate?

Contributor Author

If we apply the predicate, we cannot cache the entire Parquet data.
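
The point of this reply can be illustrated with a small sketch. If rows filtered by one predicate were stored under the file path, a later read with a different predicate would receive incomplete data; caching the complete file contents and ignoring the pushed-down predicate avoids that. Everything here is hypothetical: `FullFileCache` is not the PR's `DeltaLakeParquetHandler`, integers stand in for checkpoint rows, and the sketch assumes the caller can still apply its own predicate to the returned rows.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Optional;
import java.util.function.Function;
import java.util.function.Predicate;

/** Hypothetical sketch; integers stand in for the rows of a checkpoint Parquet file. */
final class FullFileCache {
    private final Map<String, List<Integer>> cache = new HashMap<>();
    private final Function<String, List<Integer>> fileReader;
    private int physicalReads = 0;

    FullFileCache(Function<String, List<Integer>> fileReader) {
        this.fileReader = fileReader;
    }

    /**
     * Returns every row of the file. The predicate hint is deliberately ignored,
     * so the cached value is complete and valid for any later predicate.
     */
    synchronized List<Integer> read(String path, Optional<Predicate<Integer>> predicateHint) {
        List<Integer> rows = cache.get(path);
        if (rows == null) {
            physicalReads++;
            rows = Collections.unmodifiableList(new ArrayList<>(fileReader.apply(path)));
            cache.put(path, rows);
        }
        return rows;
    }

    /** Number of times the underlying file reader was actually invoked. */
    synchronized int physicalReads() {
        return physicalReads;
    }

    /** The caller applies its own predicate to the complete cached rows. */
    static List<Integer> filter(List<Integer> rows, Predicate<Integer> predicate) {
        List<Integer> out = new ArrayList<>();
        for (Integer row : rows) {
            if (predicate.test(row)) {
                out.add(row);
            }
        }
        return out;
    }
}
```

With this shape, two reads of the same path using different predicates trigger a single physical read, and each caller still gets a correct result after filtering.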

Contributor

@stephen-shelby stephen-shelby Aug 1, 2024

Is this the same as Iceberg, which caches the content of all manifest files?

sonarcloud bot commented Jul 30, 2024

Quality Gate failed

Failed conditions
E Reliability Rating on New Code (required ≥ A)

[FE Incremental Coverage Report]

pass : 161 / 173 (93.06%)

file detail

path                                                              covered_line  new_line  coverage  not_covered_line_detail
🔵 com/starrocks/connector/delta/DeltaLakeMetadataFactory.java    3             4         75.00%    [79]
🔵 com/starrocks/connector/delta/DeltaLakeParquetHandler.java     36            42        85.71%    [81, 90, 91, 92, 99, 122]
🔵 com/starrocks/connector/delta/DeltaLakeEngine.java             10            11        90.91%    [52]
🔵 com/starrocks/connector/delta/DeltaLakeMetastore.java          16            17        94.12%    [74]
🔵 com/starrocks/connector/delta/DeltaLakeJsonHandler.java        57            60        95.00%    [136, 174, 175]
🔵 com/starrocks/connector/delta/DeltaLakeInternalMgr.java        6             6         100.00%   []
🔵 com/starrocks/connector/delta/CachingDeltaLakeMetastore.java   4             4         100.00%   []
🔵 com/starrocks/connector/delta/HMSBackedDeltaMetastore.java     1             1         100.00%   []
🔵 com/starrocks/connector/delta/DeltaLakeConnector.java          1             1         100.00%   []
🔵 com/starrocks/connector/delta/DeltaLakeCatalogProperties.java  27            27        100.00%   []

[BE Incremental Coverage Report]

pass : 0 / 0 (0%)

@Youngwb Youngwb merged commit edbab22 into StarRocks:main Aug 1, 2024
51 of 53 checks passed
@Youngwb Youngwb deleted the delta_engine branch August 1, 2024 06:03
Contributor Author

Youngwb commented Aug 7, 2024

@mergify backport branch-3.3

Contributor

mergify bot commented Aug 7, 2024

backport branch-3.3

✅ Backports have been created

mergify bot pushed a commit that referenced this pull request Aug 7, 2024
wanpengfei-git pushed a commit that referenced this pull request Aug 8, 2024
…49069) (#49479)

Co-authored-by: Youngwb <yangwenbo_mailbox@163.com>
This pull request was closed.
4 participants