Skip to content

Delta Lake 3.0.0

Compare
Choose a tag to compare
@scottsand-db scottsand-db released this 17 Oct 23:04
· 6 commits to branch-3.0 since this release

We are excited to announce the final release of Delta Lake 3.0.0. This release includes several exciting new features and artifacts.

Highlights

Here are the most important aspects of 3.0.0:

Spark 3.5 Support

Unlike the initial preview release, Delta Spark is now built on top of Apache Spark™ 3.5. See the Delta Spark section below for more details.

Delta Universal Format (UniForm)

Delta Universal Format (UniForm) will allow you to read Delta tables with Hudi and Iceberg clients. Iceberg support is available with this release. UniForm takes advantage of the fact that all table storage formats, such as Delta, Iceberg, and Hudi, actually consist of Parquet data files and a metadata layer. In this release, UniForm automatically generates Iceberg metadata and commits to Hive metastore, allowing Iceberg clients to read Delta tables as if they were Iceberg tables. Create a UniForm-enabled table using the following command:

CREATE TABLE T (c1 INT) USING DELTA TBLPROPERTIES (
  'delta.universalFormat.enabledFormats' = 'iceberg');

Every write to this table will automatically keep Iceberg metadata updated. See the documentation here for more details, and the key implementations here and here.

Delta Kernel

The Delta Kernel project is a set of Java libraries (Rust will be coming soon!) for building Delta connectors that can read (and, soon, write to) Delta tables without the need to understand the Delta protocol details).

You can use this library to do the following:

  • Read data from Delta tables in a single thread in a single process.
  • Read data from Delta tables using multiple threads in a single process.
  • Build a complex connector for a distributed processing engine and read very large Delta tables.
  • [soon!] Write to Delta tables from multiple threads / processes / distributed engines.

Reading a Delta table with Kernel APIs is as follows.

TableClient myTableClient = DefaultTableClient.create() ;          // define a client
Table myTable = Table.forPath(myTableClient, "/delta/table/path"); // define what table to scan
Snapshot mySnapshot = myTable.getLatestSnapshot(myTableClient);    // define which version of table to scan
Predicate scanFilter = ...                                         // define the predicate
Scan myScan = mySnapshot.getScanBuilder(myTableClient)             // specify the scan details
        .withFilters(scanFilter)
        .build();
Scan.readData(...)                                                 // returns the table data 

Full example code can be found here.

For more information, refer to:

  • User guide on step by step process of using Kernel in a standalone Java program or in a distributed processing connector.
  • Slides explaining the rationale behind Kernel and the API design.
  • Example Java programs that illustrate how to read Delta tables using the Kernel APIs.
  • Table and default TableClient API Java documentation

This release of Delta contains the Kernel Table API and default TableClient API definitions and implementation which allow:

  • Reading Delta tables with optional Deletion Vectors enabled or column mapping (name mode only) enabled.
  • Partition pruning optimization to reduce the number of data files to read.

Welcome Delta Connectors to the Delta repository!

All previous connectors from https://github.com/delta-io/connectors have been moved to this repository (https://github.com/delta-io/delta) as we aim to unify our Delta connector ecosystem structure. This includes Delta-Standalone, Delta-Flink, Delta-Hive, PowerBI, and SQL-Delta-Import. The repository https://github.com/delta-io/connectors is now deprecated.

Delta Spark

Delta Spark 3.0.0 is built on top of Apache Spark™ 3.5. Similar to Apache Spark, we have released Maven artifacts for both Scala 2.12 and Scala 2.13. Note that the Delta Spark maven artifact has been renamed from delta-core to delta-spark.

The key features of this release are:

Other notable changes include

  • Fix for a bug in MERGE statements that contain a scalar subquery with non-deterministic results. Such a subquery can return different results during source materialization, while finding matches, and while writing modified rows. This can cause rows to be either dropped or duplicated.
  • Fix for potential resource leak when DV file not found during parquet read
  • Support protocol version downgrade
  • Fix to initial preview release to support converting null partition values in UniForm
  • Fix to WRITE command to not commit empty transactions, just like what DELETE, UPDATE, and MERGE commands do already
  • Support 3-part table name identifier. Now, commands like OPTIMIZE <catalog>.<db>.<tbl> will work.
  • Performance improvement to CDF read queries scanning in batch to reduce the number of cloud requests and to reduce Spark scheduler pressure
  • Fix for edge case in CDF read query optimization due to incorrect statistic value
  • Fix for edge case in streaming reads where having the same file with different DVs in the same batch would yield incorrect results as the wrong file and DV pair would be read
  • Prevent table corruption by disallowing overwriteSchema when partitionOverwriteMode is set to dynamic
  • Fix a bug where DELETE with DVs would not work on Column Mapping-enabled tables
  • Support automatic schema evolution in structs that are inside maps
  • Minor fix to Delta table path URI concatenation
  • Support writing parquet data files to the data subdirectory via the SQL configuration spark.databricks.delta.write.dataFilesToSubdir. This is used to add UniForm support on BigQuery.

Delta Flink

Delta-Flink 3.0.0 is built on top of Apache Flink™ 1.16.1.

The key features of this release are

  • Support for Flink SQL and Catalog. You can now use the Flink/Delta connector for Flink SQL jobs. You can CREATE Delta tables, SELECT data from them (uses the Delta Source), and INSERT new data into them (uses the Delta Sink). Note: for correct operations on the Delta tables, you must first configure the Delta Catalog using CREATE CATALOG before running a SQL command on Delta tables. For more information, please see the documentation here.
  • Significant performance improvement to Global Committer initialization - The last-successfully-committed delta version by a given Flink application is now loaded lazily significantly reducing the CPU utilization in the most common scenarios.

Other notable changes include

  • Fix a bug where Flink STRING types were incorrectly truncated to type VARCHAR with length 1

Delta Standalone

The key features in this release are:

  • Support for disabling Delta checkpointing during commits - For very large tables with millions of files, performing Delta checkpoints can become an expensive overhead during writes. Users can now disable this checkpointing by setting the hadoop configuration property io.delta.standalone.checkpointing.enabled to false. This is only safe and suggested to do if another job will periodically perform the checkpointing.
  • Performance improvement to snapshot initialization - When a delta table is loaded at a particular version, the snapshot must contain, at a minimum, the latest protocol and metadata. This PR improves the snapshot load performance for repeated table changes.
  • Support adding absolute paths to the Delta log - This now enables users to manually perform SHALLOW CLONEs and create Delta tables with external files.
  • Fix in schema evolution to prevent adding non-nullable columns to existing Delta tables

Credits

Adam Binford, Ahir Reddy, Ala Luszczak, Alex, Allen Reese, Allison Portis, Ami Oka, Andreas Chatzistergiou, Animesh Kashyap, Anonymous, Antoine Amend, Bart Samwel, Bo Gao, Boyang Jerry Peng, Burak Yavuz, CabbageCollector, Carmen Kwan, ChengJi-db, Christopher Watford, Christos Stavrakakis, Costas Zarifis, Denny Lee, Desmond Cheong, Dhruv Arya, Eric Maynard, Eric Ogren, Felipe Pessoto, Feng Zhu, Fredrik Klauss, Gengliang Wang, Gerhard Brueckl, Gopi Krishna Madabhushi, Grzegorz Kołakowski, Hang Jia, Hao Jiang, Herivelton Andreassa, Herman van Hovell, Jacek Laskowski, Jackie Zhang, Jiaan Geng, Jiaheng Tang, Jiawei Bao, Jing Wang, Johan Lasperas, Jonas Irgens Kylling, Jungtaek Lim, Junyong Lee, K.I. (Dennis) Jung, Kam Cheung Ting, Krzysztof Chmielewski, Lars Kroll, Lin Ma, Lin Zhou, Luca Menichetti, Lukas Rupprecht, Martin Grund, Min Yang, Ming DAI, Mohamed Zait, Neil Ramaswamy, Ole Sasse, Olivier NOUGUIER, Pablo Flores, Paddy Xu, Patrick Pichler, Paweł Kubit, Prakhar Jain, Pulkit Singhal, RunyaoChen, Ryan Johnson, Sabir Akhadov, Satya Valluri, Scott Sandre, Shixiong Zhu, Siying Dong, Son, Tathagata Das, Terry Kim, Tom van Bussel, Venki Korukanti, Wenchen Fan, Xinyi, Yann Byron, Yaohua Zhao, Yijia Cui, Yuhong Chen, Yuming Wang, Yuya Ebihara, Zhen Li, aokolnychyi, gurunath, jintao shen, maryannxue, noelo, panbingkun, windpiger, wwang-talend, sherlockbeard