[Feature Request] Support reading CDFs from tables with Deletion vectors #1701

Closed
1 of 3 tasks
xupefei opened this issue Apr 19, 2023 · 1 comment
Labels
enhancement New feature or request
@xupefei
Contributor

xupefei commented Apr 19, 2023

Feature request

Overview & motivation

#1591 brought deletion vectors (DVs) to Delta Lake and changed the way DELETE works from "removing an old file and adding a new file" to "removing a file and adding it back with a DV attached". This breaks an assumption in CDF generation, namely that all rows in a removed file are deleted and all rows in an added file are inserted. We must make the CDC reader handle DVs.

High-level implementation details

This FR proposes to make the CDC reader look at the DVs in each FileAction and compute a new, in-memory DV to mark deleted rows. Assuming a remove action and an add action for the same file, each with or without a DV, there are four cases:

  1. Remove without DV, add without DV: not possible. The protocol does not allow this.
  2. Remove without DV, add with DV1: rows masked by DV1 are deleted.
  3. Remove with DV1, add without DV: rows masked by DV1 are added. This may happen when restoring a table.
  4. Remove with DV1, add with DV2:
    1. Rows masked by DV2 but not DV1 are deleted.
    2. Rows masked by DV1 but not DV2 are re-added. This may happen when restoring a table.

Looking at the above cases, we could diff the two DVs and attach the result to a file scan to obtain the desired rows. For cases 3 and 4.2, we must invert the DV so that it keeps marked rows rather than removing them.
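
The four cases above can be sketched with plain sets. This is only an illustration, not the Delta Lake implementation: a DV is modeled as a set of row indexes, `None` stands for "no DV", and the names `diff_dvs` and `change_rows` are hypothetical.

```python
from typing import FrozenSet, List, Optional, Sequence, Tuple

# Assumption: a DV is modeled as a set of masked row indexes; None means "no DV".
DV = Optional[FrozenSet[int]]


def diff_dvs(remove_dv: DV, add_dv: DV) -> Tuple[FrozenSet[int], FrozenSet[int]]:
    """Return (deleted_rows, re_added_rows) for a remove/add pair on one file."""
    if remove_dv is None and add_dv is None:
        # Case 1: described in the issue as not allowed by the protocol.
        raise ValueError("remove/add pair without any DV is not expected")
    old = remove_dv or frozenset()
    new = add_dv or frozenset()
    deleted = new - old    # cases 2 and 4.1: rows that became masked
    re_added = old - new   # cases 3 and 4.2: rows that are no longer masked
    return deleted, re_added


def change_rows(rows: Sequence[str], remove_dv: DV, add_dv: DV) -> List[Tuple[str, str]]:
    """Scan one file and keep only the rows marked by the diffed DVs."""
    deleted, re_added = diff_dvs(remove_dv, add_dv)
    changes = [(rows[i], "delete") for i in sorted(deleted)]
    changes += [(rows[i], "insert") for i in sorted(re_added)]
    return changes
```

For example, with `remove_dv = frozenset({0, 1})` and `add_dv = frozenset({1, 2})`, row 2 is reported as deleted and row 0 as re-added, while row 1 (masked in both) produces no change.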

The implementation will be in two phases. The first does some preparatory work, and the second changes the CDC reader.

First phase: #1680
Second phase: TBD.

Willingness to contribute

The Delta Lake Community encourages new feature contributions. Would you or another member of your organization be willing to contribute an implementation of this feature?

  • Yes. I can contribute this feature independently.
  • Yes. I would be willing to contribute this feature with guidance from the Delta Lake community.
  • No. I cannot contribute this feature at this time.
@xupefei xupefei added the enhancement New feature or request label Apr 19, 2023
vkorukanti pushed a commit that referenced this issue Apr 20, 2023
…- Part 1/2

This PR is part of #1701. A detailed overview of changes is described at #1701.

This is the first PR to add support for reading CDC from files that have an associated DV. In this PR we do some preparation work to allow fine-grained control over how masked rows are handled: keep or drop. Later, these two types will be used by CDCReader to pull masked rows out of files.
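
The "keep or drop" distinction can be pictured with a minimal sketch. The `MaskedRows` enum and `scan_with_dv` function are hypothetical names chosen for illustration and do not reflect the actual types introduced by the PR; the DV is again modeled as a set of row indexes.

```python
from enum import Enum
from typing import Iterable, List, Sequence, TypeVar

T = TypeVar("T")


class MaskedRows(Enum):
    DROP = "drop"  # regular read: hide the rows marked in the DV
    KEEP = "keep"  # inverted read: return only the rows marked in the DV


def scan_with_dv(rows: Sequence[T], dv: Iterable[int], mode: MaskedRows) -> List[T]:
    """Filter the rows of one file according to a DV and a handling mode."""
    marked = set(dv)
    if mode is MaskedRows.KEEP:
        return [row for i, row in enumerate(rows) if i in marked]
    return [row for i, row in enumerate(rows) if i not in marked]
```

With rows `["a", "b", "c", "d"]` and a DV marking indexes 1 and 3, `DROP` yields the surviving rows `a` and `c`, while `KEEP` yields the masked rows `b` and `d`, which is what a CDC read would need to surface.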

Closes #1680

GitOrigin-RevId: d0f49ee0a11e604f089d45df1611272a81d47813
scottsand-db pushed a commit that referenced this issue May 1, 2023
This PR is part of #1701.

This is a follow-up to #1680 that adds support for reading CDC from files that have an associated DV. In this PR we modify the CDC reader to construct in-line DVs by diffing two existing DVs, and modify the corresponding FileIndex to use the in-line DV.

Closes #1704

GitOrigin-RevId: 9e3589eb576a773b9f05777521b01485ebeaf33e
@allisonport-db allisonport-db added this to the 2.4.0 milestone May 24, 2023
@allisonport-db
Collaborator

Closed by #1704 and #1680
