Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add libcudf example with large strings #15983

Merged
merged 110 commits into from
Sep 5, 2024

Conversation

davidwendt
Copy link
Contributor

@davidwendt davidwendt commented Jun 11, 2024

Description

Creating an example that shows reading large strings columns. This uses the 1 billion row challenge input data and provides three examples of loading this data:

  • brc uses the CSV reader to load the input file in one call and aggregates the results using groupby
  • brc_chunks uses the CSV reader to load the input file in chunks, aggregates each chunk, and computes the results
  • brc_pipeline same as brc_chunks but input chunks are processed in separate threads/streams.

Checklist

  • I am familiar with the Contributing Guidelines.
  • New or existing tests cover these changes.
  • The documentation is up to date with these changes.

@davidwendt davidwendt added 2 - In Progress Currently a work in progress libcudf Affects libcudf (C++/CUDA) code. improvement Improvement / enhancement to an existing function non-breaking Non-breaking change labels Jun 11, 2024
@davidwendt davidwendt self-assigned this Jun 11, 2024
@github-actions github-actions bot added the CMake CMake build issue label Jun 11, 2024
@davidwendt davidwendt requested a review from bdice August 29, 2024 20:38
@GregoryKimball GregoryKimball requested review from vuule and removed request for srinivasyadav18 August 30, 2024 15:54
@GregoryKimball
Copy link
Contributor

On GH200, the brc_pipeline example seems to be calling cudaHostRegister on the entire file for each chunk.
image
And it ends up much slower. I'm looking to see if anything in the data source handling could be changed.
image

@vuule
Copy link
Contributor

vuule commented Aug 30, 2024

On GH200, the brc_pipeline example seems to be calling cudaHostRegister on the entire file for each chunk.

That unexpected. I'll check the code and update here.

Copy link
Contributor

@bdice bdice left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Approving -- only a few small suggestions.

rapids_cuda_set_architectures(RAPIDS)

project(
billion
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we rename this to brc? Or does the project need a unique name?

It's a little confusing to have three unique names for this example:

  1. The directory is named 1billion
  2. The project is named billion
  3. The executable is named brc (and variations thereof)

Let's consolidate these.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree with renaming the directory and project the same. I want to use something easily readable/discoverable from the perspective of someone looking at the examples folder. My suggestion is to use billion_rows for the directory and project name.
I would like to keep the executable names shorter since they are built in context of the parent directory and would look less cumbersome in my opinion. I'd like to keep the brc variations also because the blog uses those names in charts that would need to be regenerated.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That’s fine! Let’s do that.

cpp/examples/1billion/README.md Outdated Show resolved Hide resolved
cpp/examples/1billion/README.md Outdated Show resolved Hide resolved
Copy link
Contributor

@vuule vuule left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good, just a few non-blocking suggestions

cpp/examples/1billion/brc.cpp Outdated Show resolved Hide resolved
cpp/examples/1billion/brc_chunks.cpp Outdated Show resolved Hide resolved
cpp/examples/1billion/brc_pipeline.cpp Outdated Show resolved Hide resolved
@ttnghia
Copy link
Contributor

ttnghia commented Sep 5, 2024

Nit:

  • IMO the folder name 1billion is vague/not clean. Typically I would avoid naming anything starting with a number.
  • I'm kind of OCD with formatting. For printing information, I would prefer to see the printed sentences as "First letter of each sentence is capitalized" instead of "everything is in lower-case".

Copy link
Contributor

@vuule vuule left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Lovely examples. Expected more complex code, especially for the pipeline.

Copy link
Contributor

@karthikeyann karthikeyann left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks great

Comment on lines +46 to +47
auto const mr_name = std::string("pool");
auto resource = create_memory_resource(mr_name);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: to keep it simple,

Suggested change
auto const mr_name = std::string("pool");
auto resource = create_memory_resource(mr_name);
auto resource = create_memory_resource("pool");

@davidwendt
Copy link
Contributor Author

/merge

@rapids-bot rapids-bot bot merged commit 715677e into rapidsai:branch-24.10 Sep 5, 2024
86 checks passed
@davidwendt davidwendt deleted the example-1billion branch September 5, 2024 22:16
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
3 - Ready for Review Ready for review by team CMake CMake build issue improvement Improvement / enhancement to an existing function libcudf Affects libcudf (C++/CUDA) code. non-breaking Non-breaking change
Projects
Status: Done
Status: Landed
Development

Successfully merging this pull request may close these issues.

6 participants