Add test of interoperability of cuDF and arrow BYTE_STREAM_SPLIT encoders #15832

Merged
10 commits merged into rapidsai:branch-24.08 from arrow_cudf_byte_stream_split
Jun 24, 2024

Conversation

etseidl
Contributor

@etseidl etseidl commented May 22, 2024

Description

BYTE_STREAM_SPLIT encoding was recently added to cuDF (#15311). The Parquet specification was also changed (apache/parquet-format#229) to extend the data types that can be encoded as BYTE_STREAM_SPLIT, and arrow only recently implemented this extension (apache/arrow#40094). This PR adds a test checking that cuDF and arrow produce compatible files when using BYTE_STREAM_SPLIT encoding.
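For readers unfamiliar with the encoding, here is a minimal pure-Python sketch of the byte layout that BYTE_STREAM_SPLIT produces, as I understand it from the Parquet format description: byte k of every value goes into stream k, and the streams are concatenated. The function names `byte_stream_split_encode` and `byte_stream_split_decode` are hypothetical and exist only for this illustration; this is not the cuDF or arrow implementation, and it is not the test added by this PR.

```python
import struct


def byte_stream_split_encode(values, fmt="<f"):
    """Illustrative encoder: scatter the bytes of fixed-width values into streams.

    `fmt` is a struct format for a single little-endian value (an assumption
    of this sketch), for example "<f" (float32), "<d" (float64), "<i" (int32).
    """
    width = struct.calcsize(fmt)
    raw = b"".join(struct.pack(fmt, v) for v in values)
    # Stream k holds byte k of every value; streams are concatenated in order.
    return b"".join(raw[k::width] for k in range(width))


def byte_stream_split_decode(data, fmt="<f"):
    """Illustrative decoder: gather the streams back into whole values."""
    width = struct.calcsize(fmt)
    if len(data) % width != 0:
        raise ValueError("encoded length must be a multiple of the value width")
    count = len(data) // width
    raw = bytearray(len(data))
    for k in range(width):
        # Put stream k back at byte position k of each value.
        raw[k::width] = data[k * count:(k + 1) * count]
    return [struct.unpack_from(fmt, raw, i * width)[0] for i in range(count)]


# Two float32 values: 1.0 is 00 00 80 3f and 2.0 is 00 00 00 40 (little-endian).
encoded = byte_stream_split_encode([1.0, 2.0])
print(encoded.hex())  # 0000000080003f40
print(byte_stream_split_decode(encoded))  # [1.0, 2.0]
```

Because the layout is a fixed permutation of bytes, two writers that follow it should be able to read each other's pages for every supported type, which is the property an interoperability test exercises.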

Checklist

  • I am familiar with the Contributing Guidelines.
  • New or existing tests cover these changes.
  • The documentation is up to date with these changes.

@etseidl etseidl requested a review from a team as a code owner May 22, 2024 23:37
@etseidl etseidl requested review from wence- and isVoid May 22, 2024 23:37

copy-pr-bot bot commented May 22, 2024

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

@github-actions github-actions bot added the Python Affects Python cuDF API. label May 22, 2024
Contributor

vyasr commented May 23, 2024

/ok to test

Contributor

wence- commented Jun 13, 2024

/ok to test

@wence- wence- added improvement Improvement / enhancement to an existing function non-breaking Non-breaking change labels Jun 13, 2024
Contributor

@wence- wence- left a comment

Looks good, one non-blocking suggestion to extend the test to try with more than one row group as well.

python/cudf/cudf/tests/test_parquet.py (resolved)
Contributor

wence- commented Jun 24, 2024

/ok to test

Contributor

wence- commented Jun 24, 2024

/merge

Contributor

wence- commented Jun 24, 2024

Thanks @etseidl

@rapids-bot rapids-bot bot merged commit ed41668 into rapidsai:branch-24.08 Jun 24, 2024
73 checks passed
@etseidl etseidl deleted the arrow_cudf_byte_stream_split branch June 24, 2024 15:31
Labels
  • improvement: Improvement / enhancement to an existing function
  • non-breaking: Non-breaking change
  • Python: Affects Python cuDF API.