Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add option to Parquet writer to skip compressing individual columns #15411

Merged
merged 17 commits into from
Apr 18, 2024

Conversation

etseidl
Copy link
Contributor

@etseidl etseidl commented Mar 28, 2024

Description

#15081 added the ability to select per-column encodings in the Parquet writer. Some Parquet encodings (e.g DELTA_BINARY_PACKED) do not mix well with compression (see PARQUET-2414 for example). This PR adds the ability to turn off compression for select columns. This uses the same mechanism as encoding selection, so an example use would be:

  cudf::io::table_input_metadata table_metadata(table);
  table_metadata.column_metadata[0]
    .set_name("int_delta_binary")
    .set_encoding(cudf::io::column_encoding::DELTA_BINARY_PACKED)
    .set_skip_compression(true);

Checklist

  • I am familiar with the Contributing Guidelines.
  • New or existing tests cover these changes.
  • The documentation is up to date with these changes.

@etseidl etseidl requested a review from a team as a code owner March 28, 2024 16:46
Copy link

copy-pr-bot bot commented Mar 28, 2024

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

@github-actions github-actions bot added the libcudf Affects libcudf (C++/CUDA) code. label Mar 28, 2024
@etseidl
Copy link
Contributor Author

etseidl commented Mar 28, 2024

cc @GregoryKimball

Copy link
Member

@mhaseeb123 mhaseeb123 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good. Thank you for the effort!

@mhaseeb123
Copy link
Member

/ok to test

Copy link
Contributor

@bdice bdice left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks fine to me. @vuule Do you want to take a look before merging? If not, feel free to merge as-is.

@bdice bdice added feature request New feature or request non-breaking Non-breaking change labels Apr 18, 2024
@bdice
Copy link
Contributor

bdice commented Apr 18, 2024

/ok to test

Copy link
Contributor

@vuule vuule left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What is this, a stealth feature? Not even a CC!? :(

Cool stuff, looks good, just have a few questions.

cpp/tests/io/parquet_writer_test.cpp Show resolved Hide resolved
cpp/tests/io/parquet_writer_test.cpp Show resolved Hide resolved
cudf::io::parquet::detail::FileMetaData fmd;
read_footer(source, &fmd);

EXPECT_EQ(fmd.row_groups[0].columns[0].meta_data.codec, cudf::io::parquet::detail::UNCOMPRESSED);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this can fail if compressed size is, by chance, larger than uncompressed?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

hmm...not that line, but the line below could 😟

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Right, the other one.
Still, it shouldn't change randomly. The way the values are encoded and the way the compression works should be very stable.

cpp/src/io/parquet/page_enc.cu Show resolved Hide resolved
@vuule vuule added the cuIO cuIO issue label Apr 18, 2024
@vuule
Copy link
Contributor

vuule commented Apr 18, 2024

/merge

@rapids-bot rapids-bot bot merged commit e0c4280 into rapidsai:branch-24.06 Apr 18, 2024
75 checks passed
@etseidl etseidl deleted the select_comp branch April 19, 2024 00:19
rapids-bot bot pushed a commit that referenced this pull request May 22, 2024
…PI (#15613)

Several recent PRs (#15081, #15411, #15600) added the ability to control some aspects of Parquet file writing on a per-column basis. During discussion of #15081 it was [suggested](#15081 (comment)) that these options be exposed by cuDF-python in a manner similar to pyarrow. This PR adds the ability to control per-column encoding, compression, binary output, and fixed-length data width, using fully qualified Parquet column names. For example, given a cuDF table with an integer column 'a', and a `list<int32>` column 'b', the fully qualified column names would be 'a' and 'b.list.element'.

Addresses "Add cuDF-python API support for specifying encodings" task in #13501.

Authors:
  - Ed Seidl (https://github.com/etseidl)
  - Vukasin Milovanovic (https://github.com/vuule)
  - GALI PREM SAGAR (https://github.com/galipremsagar)

Approvers:
  - Muhammad Haseeb (https://github.com/mhaseeb123)
  - GALI PREM SAGAR (https://github.com/galipremsagar)
  - Vyas Ramasubramani (https://github.com/vyasr)

URL: #15613
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
cuIO cuIO issue feature request New feature or request libcudf Affects libcudf (C++/CUDA) code. non-breaking Non-breaking change
Projects
Archived in project
Development

Successfully merging this pull request may close these issues.

4 participants