Compacting produces smaller row groups than expected #2386

Closed
PeterKeDer opened this issue Apr 4, 2024 · 1 comment · Fixed by #2387
Labels
bug Something isn't working

Comments

@PeterKeDer
Contributor

PeterKeDer commented Apr 4, 2024

Environment

Delta-rs version: 0.16.4

Binding: python

Environment:

  • Cloud provider: AWS
  • OS: macOS
  • Other:

Bug

What happened:

Compact produces parquet files that are larger than expected:

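from deltalake import DeltaTable, WriterProperties

dt = DeltaTable("...")
# Request row groups of up to 8192 rows in the compacted files
dt.optimize.compact(
    writer_properties=WriterProperties(
        max_row_group_size=8192,
        write_batch_size=8192,
    )
)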

The resulting parquet files have row groups with 1024 rows instead of 8192.

What you expected to happen:

Most row groups in the compacted parquet files should contain 8192 rows.

How to reproduce it:

Call dt.optimize.compact() with max_row_group_size greater than 1024.

More details:

This is caused by calling self.arrow_writer.flush() at the end of each batch in core/src/operations/writer.rs, introduced recently in #2318. This creates a new row group for each batch, even when the batch has fewer rows than max_row_group_size. Since we read batches using ParquetRecordBatchStreamBuilder with the default config (i.e. batch size 1024), row groups never exceed 1024 rows, even if we set max_row_group_size to a larger value.

I don't think calling flush is necessary since ArrowWriter does that automatically when we reach max_row_group_size rows.
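
The effect can be illustrated with a simplified model of a buffering writer. This is only a sketch of the behavior described above, not the actual ArrowWriter code: the function `row_group_sizes` and its parameters are hypothetical, and it assumes the writer emits a row group whenever the buffer reaches `max_row_group_size` rows or when it is flushed explicitly.

```python
def row_group_sizes(batch_sizes, max_row_group_size, flush_each_batch):
    """Simulate which row group sizes a buffering writer would produce.

    batch_sizes: number of rows in each incoming record batch.
    flush_each_batch: if True, flush the buffer after every batch
    (the behavior described in this issue).
    """
    groups = []
    buffered = 0
    for batch_rows in batch_sizes:
        remaining = batch_rows
        while remaining > 0:
            # Fill the buffer up to the row group limit
            take = min(remaining, max_row_group_size - buffered)
            buffered += take
            remaining -= take
            if buffered == max_row_group_size:
                # Buffer is full: emit a complete row group
                groups.append(buffered)
                buffered = 0
        if flush_each_batch and buffered > 0:
            # Explicit flush: emit a (possibly small) row group
            groups.append(buffered)
            buffered = 0
    if buffered > 0:
        # Closing the writer emits whatever is left
        groups.append(buffered)
    return groups


if __name__ == "__main__":
    batches = [1024] * 16  # 16 batches at the default reader batch size
    print(row_group_sizes(batches, 8192, flush_each_batch=True))
    print(row_group_sizes(batches, 8192, flush_each_batch=False))
```

In this model, flushing after every 1024-row batch yields sixteen row groups of 1024 rows, while relying on the size limit alone yields two row groups of 8192 rows.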

This negatively impacts our use cases by inflating our parquet sizes, sometimes by up to 4x (40 MB to 160 MB).

PeterKeDer added the bug label Apr 4, 2024
@ion-elgreco
Collaborator

@PeterKeDer feel free to make a PR to revert the change! I was testing some things there

ion-elgreco pushed a commit that referenced this issue Apr 5, 2024
# Description

Reverts #2318 by removing
`flush` after writing each batch since it was causing smaller than
expected row groups to be written during compaction.

# Related Issue(s)
- closes #2386