
Support arrow:schema in Parquet writer to faithfully roundtrip duration types with Arrow #15875

Merged

Conversation

@mhaseeb123 (Member) commented on May 29, 2024

Description

Closes #15847

This PR adds support for constructing and writing a base64-encoded, serialized arrow:schema IPC message to the Parquet file footer, enabling faithful roundtrips of duration types with Arrow via Parquet.
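As a minimal sketch of the intended usage (hedged: the file name and values are illustrative, not from this PR; store_schema is the cuDF-level option discussed below):

```python
import cudf
import pyarrow.parquet as pq

# A duration column; plain Parquet has no native duration type.
df = cudf.DataFrame({"d": cudf.Series([1, 2, 3], dtype="timedelta64[ms]")})

# store_schema=True asks the writer to embed the base64-encoded,
# serialized arrow:schema IPC message in the file footer.
df.to_parquet("durations.parquet", store_schema=True)

# Arrow-aware readers can then recover the original duration type
# instead of falling back to a plain integer representation.
print(pq.read_table("durations.parquet").schema)
```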

Answered

  • Only construct and write the arrow:schema if requested by the user via the store_schema argument (cuDF) or write_arrow_schema (libcudf); both default to false otherwise.
  • The internal libcudf variable name for store_schema can stay write_arrow_schema; it disambiguates which schema (Arrow or Parquet) is being referred to.
  • Separate PR: int96_timestamps cannot be deprecated/removed in cuDF, as Spark actively uses them. Remove INT96 timestamps in cuDF Parquet writer #15901
  • The cuDF Parquet writer supports the decimal32 and decimal64 fixed-point types. Arrow does not support these directly, so we convert decimal32/decimal64 columns to decimal128 (see the sketch after this list).
  • The is_col_nullable() function moved to writer_impl_helpers.cpp along with some other helper functions.
  • A common convert_data_to_decimal128() can be factored out and used in both writer_impl.cu and to_arrow.cu. Tracking in a separate issue: [FEA] Deduplicate convert_data_to_decimal128() function #16194
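To make the decimal point above concrete, here is a hedged sketch of the expected behavior (assuming standard cudf decimal APIs; values and file name are illustrative):

```python
from decimal import Decimal

import cudf
import pyarrow.parquet as pq

# A small-precision decimal column, held as decimal32 inside cudf.
s = cudf.Series([Decimal("1.23"), Decimal("4.56")]).astype(
    cudf.Decimal32Dtype(precision=5, scale=2)
)
cudf.DataFrame({"price": s}).to_parquet("decimals.parquet", store_schema=True)

# Arrow has no decimal32/decimal64, so with the embedded arrow:schema the
# column round-trips as decimal128; the widening conversion is lossless.
print(pq.read_table("decimals.parquet").schema.field("price").type)
```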

CC @vuule @etseidl @nvdbaranec @GregoryKimball @galipremsagar for vis.

Checklist

  • I am familiar with the Contributing Guidelines.
  • New or existing tests cover these changes.
  • The documentation is up to date with these changes.

@mhaseeb123 self-assigned this on May 29, 2024
@github-actions bot added the libcudf (Affects libcudf (C++/CUDA) code), Python (Affects Python cuDF API), and CMake (CMake build issue) labels on May 29, 2024
@mhaseeb123 added the 2 - In Progress (Currently a work in progress), improvement (Improvement / enhancement to an existing function), breaking (Breaking change), and Reliability labels on May 29, 2024
@nvdbaranec (Contributor) left a comment:

Generally looks great. Just a couple of small things so far.

Review threads (resolved):
  • cpp/src/io/parquet/arrow_schema_writer.cpp
  • cpp/src/io/parquet/writer_impl.cu
  • cpp/src/io/parquet/writer_impl_helpers.cpp
@lithomas1 (Contributor) left a comment:

Python changes look good to me (just one comment).

Might want to have @vyasr or @galipremsagar take another look, though.

Review thread (resolved): python/cudf/cudf/tests/test_parquet.py
@vuule (Contributor) left a comment:

Great stuff! Just a few comments.

Review thread (resolved): cpp/src/io/parquet/writer_impl_helpers.hpp
Excerpt under review (decimal handling in the writer):

```cpp
switch (column.type().id()) {
  case type_id::DECIMAL32:
    // Convert data to decimal128 type
    d128_vectors.emplace_back(convert_data_to_decimal128<int32_t>(column, stream));
```
Contributor:

Merging these into a single kernel (per decimal type) would be great for performance with many decimal columns, but it doesn't look like it would fit well into the recursive implementation.
How about a stream pool? Using 4-8 streams in round-robin order might help when we have to convert many decimal columns.

@mhaseeb123 (Member, Author) replied:

Tracking this in a separate issue #16194.

@mhaseeb123 requested a review from vuule on July 9, 2024
Review thread (resolved): cpp/src/io/parquet/arrow_schema_writer.hpp
Docstring excerpt under review (@@ -322,6 +322,9 @@), with store_schema being the newly added parameter:

```
output_as_binary : set, optional, default None
    If a column name is present in the set, that column will be output as
    unannotated binary, rather than the default 'UTF-8'.
store_schema : bool, default False
```
Contributor:

Is there a reason we think this should be False by default? It seems like faithful roundtrips with Arrow would be a benefit by default. However, it seems like enabling this feature will cast/convert some data types (e.g. "days" aren't supported, and only decimal128 is supported, if I read the rest of this PR correctly). Are those conversions potentially lossy, or do they change metadata? If so, are those conversions worth documenting?

@mhaseeb123 (Member, Author) commented on Jul 9, 2024:

There are a couple of reasons why we chose to set it to False by default. It's great that you asked, as this gets the reasoning documented here.

First, arrow:schema should only be used when we want to round-trip certain column types (primarily durations for now) with Arrow specifically. Otherwise, cuDF roundtrips with itself and with Arrow perfectly fine.

Second, cuDF still supports int96 timestamps, which Spark actively uses. Enabling store_schema by default would break Spark's existing and future workflows, forcing them to set it to False whenever using int96 timestamps.

Third, as you mentioned, with arrow:schema we can't roundtrip decimal32 and decimal64 within cuDF itself without (losslessly) converting them to decimal128.

To summarize, things work perfectly fine without arrow:schema for the most part, until we need to faithfully round-trip durations with Arrow, in which case it can be enabled.

@galipremsagar please feel free to add any reasons that we discussed during the cuIO standup meeting a couple of weeks ago.

Contributor:

Thanks! This kind of information would be good to have in the docstrings!

@mhaseeb123 added the 5 - Ready to Merge (Testing and reviews complete, ready to merge) label and removed 4 - Needs Review (Waiting for reviewer to review or respond) on Jul 9, 2024
@mhaseeb123 (Member, Author) commented:

/merge

@rapids-bot bot merged commit 67bd366 into rapidsai:branch-24.08 on Jul 9, 2024
80 checks passed
@mhaseeb123 deleted the arrow-schema-support-pq-writer branch on July 9, 2024
galipremsagar added a commit to galipremsagar/cudf that referenced this pull request on Jul 31, 2024: Revert "Support arrow:schema in Parquet writer to faithfully roundtrip `duration` types with Arrow (rapidsai#15875)". This reverts commit 67bd366.
Labels

  • 5 - Ready to Merge (Testing and reviews complete, ready to merge)
  • breaking (Breaking change)
  • CMake (CMake build issue)
  • cuIO (cuIO issue)
  • improvement (Improvement / enhancement to an existing function)
  • libcudf (Affects libcudf (C++/CUDA) code)
  • pylibcudf (Issues specific to the pylibcudf package)
  • Python (Affects Python cuDF API)
Projects
Status: Done
Development

Successfully merging this pull request may close these issues.

[FEA] Support arrow:Schema in Parquet writer for faithful roundtrip with Arrow via Parquet
7 participants