Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Adding parquet transcoding example #15420

Merged
merged 22 commits into from
May 13, 2024

Conversation

mhaseeb123
Copy link
Member

@mhaseeb123 mhaseeb123 commented Apr 1, 2024

Description

This PR adds a new example parquet_io to libcudf/cpp/examples instrumenting reading and writing parquet files with different column encodings (same for all columns for now) and compressions to close #15344. The example maybe elaborated and/or evolved as needed. #15348 should be merged before this PR to get all CMake updates needed to successfully build and run this example.

Checklist

  • I am familiar with the Contributing Guidelines.
  • New or existing tests cover these changes.
  • The documentation is up to date with these changes.

@mhaseeb123 mhaseeb123 added feature request New feature or request 2 - In Progress Currently a work in progress non-breaking Non-breaking change labels Apr 1, 2024
@mhaseeb123 mhaseeb123 requested a review from a team as a code owner April 1, 2024 19:54
@mhaseeb123 mhaseeb123 requested review from vyasr and vuule April 1, 2024 19:54
@github-actions github-actions bot added the libcudf Affects libcudf (C++/CUDA) code. label Apr 1, 2024
@github-actions github-actions bot added the CMake CMake build issue label Apr 1, 2024
@GregoryKimball
Copy link
Contributor

GregoryKimball commented Apr 16, 2024

Thank you @mhaseeb123 for writing this example! It has made it easy for me to do a ton of testing with Kaggle data 🤩

There are a few benchmarking items that we might want to address in the example, maybe with a comment, maybe with code change.. up to you.

  1. startup time on first read. I believe we do nvcomp and cufile loading on the first read, so it takes more like 3s instead of ~30-300 ms. We might want to pre-load the extra libraries or else leave the first read untimed.
  2. RMM pool growth. I believe the example uses a pool and then this grows during the first read, adding to runtime. We might choose to set a default, accept a CLI parameter, or leave the first read untimed.
  3. OS cache clearing. Currently the example will time the second read optimistically, at least if the file is in the OS cache. I'm not sure if written files are always in the OS cache or if they have to be read once first. We might want to investigate to see if cache clearing before the second read has any impact on read time.
  4. Allowing optional size metadata to be written. We added page size metadata to the writer PARQUET-2261 Size Statistics #14000 and reader Use page statistics in Parquet reader #14973. Perhaps we could enable this option via a CLI parameter.

@mhaseeb123
Copy link
Member Author

Thank you @GregoryKimball for the feedback. I am thinking of not timing the first read to avoid both RMM pool growth, and nvcomp and cufile pitfalls. I will add a CLI option to clear OS cache before second and subsequent reads plus another CLI option for optional size metadata soon.

@mhaseeb123 mhaseeb123 marked this pull request as draft April 19, 2024 19:03
@mhaseeb123 mhaseeb123 changed the title Adding artifacts for the parquet transcoding example Adding parquet transcoding example May 7, 2024
@GregoryKimball
Copy link
Contributor

Thank you @mhaseeb123 for discussing the example. I think we will address 1,2,3 above by untiming the first read, and then resolve 4 by adding a writer option setting the sizes metadata to OFF

Copy link

copy-pr-bot bot commented May 8, 2024

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

@github-actions github-actions bot added the ci label May 8, 2024
@mhaseeb123
Copy link
Member Author

/ok to test

@mhaseeb123 mhaseeb123 added 3 - Ready for Review Ready for review by team and removed 2 - In Progress Currently a work in progress labels May 8, 2024
@mhaseeb123 mhaseeb123 marked this pull request as ready for review May 8, 2024 17:47
@mhaseeb123 mhaseeb123 requested a review from a team as a code owner May 8, 2024 17:47
@mhaseeb123
Copy link
Member Author

Hi reviewers @vuule @vyasr. Just pinging you as the PR is now ready to be reviewed

@mhaseeb123
Copy link
Member Author

/ok to test

@mhaseeb123
Copy link
Member Author

/ok to test

Copy link
Contributor

@vuule vuule left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

few more suggestions, looks good otherwise 👍

cpp/examples/parquet_io/parquet_io.cpp Outdated Show resolved Hide resolved
cpp/examples/parquet_io/parquet_io.cpp Outdated Show resolved Hide resolved
mhaseeb123 and others added 3 commits May 9, 2024 12:10
Co-authored-by: Vukasin Milovanovic <vmilovanovic@nvidia.com>
Co-authored-by: Vukasin Milovanovic <vmilovanovic@nvidia.com>
@mhaseeb123
Copy link
Member Author

/ok to test

Copy link
Contributor

@vuule vuule left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

approved; just one optional suggestion

cpp/examples/parquet_io/parquet_io.hpp Outdated Show resolved Hide resolved
mhaseeb123 and others added 2 commits May 9, 2024 12:54
Co-authored-by: Vukasin Milovanovic <vmilovanovic@nvidia.com>
@mhaseeb123 mhaseeb123 requested a review from a team as a code owner May 9, 2024 19:57
@mhaseeb123
Copy link
Member Author

/ok to test

@mhaseeb123
Copy link
Member Author

/ok to test

@mhaseeb123
Copy link
Member Author

/merge

@rapids-bot rapids-bot bot merged commit bff3015 into rapidsai:branch-24.06 May 13, 2024
70 checks passed
@mhaseeb123 mhaseeb123 deleted the cuio-parquet-examples branch May 13, 2024 08:34
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
3 - Ready for Review Ready for review by team CMake CMake build issue feature request New feature or request libcudf Affects libcudf (C++/CUDA) code. non-breaking Non-breaking change
Projects
Archived in project
Development

Successfully merging this pull request may close these issues.

[FEA] Add Parquet transcoding to cpp/examples
4 participants