Refactor joins for conditional semis and antis #14646

Merged

Conversation

DanialJavady96
Contributor

@DanialJavady96 DanialJavady96 commented Dec 18, 2023

Contributes to #10039

Currently, conditional semi and anti joins rely on an implementation that was designed to produce results for both tables involved in the join. Semi and anti joins only return left-table indices, so this leads to a wasteful allocation that can be avoided for these two cases.

Description

Add a new kernel that is shared by the semi and anti joins.
Add new device functions that use only one shared-memory array for caching.
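
For readers who want the semantics spelled out, here is a minimal host-side sketch of what a conditional semi or anti join returns. It is not the kernel from this PR. The function name conditional_semi_anti_reference is hypothetical, and LEFT_ANTI_JOIN is assumed here as the counterpart of the join_kind::LEFT_SEMI_JOIN value that appears in the diff. The point is that only one output array (left row indices) is needed.

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <vector>

// Illustrative stand-in for the join kinds; not libcudf's definition.
enum class join_kind { LEFT_SEMI_JOIN, LEFT_ANTI_JOIN };

// Host-side reference (hypothetical helper, not libcudf's API).
// Returns the indices of left rows that have (semi) or lack (anti) at least
// one right row satisfying the predicate. Only one output array is built.
template <typename T, typename Predicate>
std::vector<std::int32_t> conditional_semi_anti_reference(std::vector<T> const& left,
                                                          std::vector<T> const& right,
                                                          Predicate pred,
                                                          join_kind kind)
{
  std::vector<std::int32_t> left_indices;
  for (std::size_t i = 0; i < left.size(); ++i) {
    // True if any right row satisfies the predicate for this left row.
    bool const found_match = std::any_of(
      right.begin(), right.end(), [&](T const& r) { return pred(left[i], r); });
    bool const keep = (kind == join_kind::LEFT_SEMI_JOIN) ? found_match : !found_match;
    if (keep) { left_indices.push_back(static_cast<std::int32_t>(i)); }
  }
  return left_indices;
}
```

Because each left row contributes at most one index, there is no right-index array to allocate or fill.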

Tests pass on my 3080.

Checklist

  • I am familiar with the Contributing Guidelines.
  • New or existing tests cover these changes.
  • The documentation is up to date with these changes.

…cate a second output when it is not needed in that context

copy-pr-bot bot commented Dec 18, 2023

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

@github-actions github-actions bot added the libcudf Affects libcudf (C++/CUDA) code. label Dec 18, 2023
@DanialJavady96
Contributor Author

DanialJavady96 commented Dec 18, 2023

CC @bdice @vyasr: please let me know if any changes need to be made or if I misunderstood anything. To keep the PR small, I assume this shouldn't touch the size APIs.

@DanialJavady96 DanialJavady96 marked this pull request as draft December 18, 2023 17:31
@PointKernel PointKernel added 3 - Ready for Review Ready for review by team improvement Improvement / enhancement to an existing function non-breaking Non-breaking change and removed 3 - Ready for Review Ready for review by team labels Dec 18, 2023
@DanialJavady96
Contributor Author

I benchmarked this branch against branch-24.02; the performance gains were negligible and statistically insignificant (1-3%). However, when I removed the compute_size kernels and instead assumed, pessimistically, that the output size is always the left table size N (trading memory for runtime), the gains were significant.
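
To illustrate the trade-off described above, here is a rough host-side sketch, not the code from this PR. Both function names are hypothetical, and keep_row stands in for the whole "scan the right table for this left row" step. The upper bound N is safe because a semi or anti join emits at most one index per left row.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Two-pass approach: count the output rows first, then allocate exactly that
// many slots and fill them. keep_row is evaluated twice per left row.
template <typename RowPredicate>
std::vector<std::int32_t> gather_two_pass(std::size_t left_num_rows, RowPredicate keep_row)
{
  std::size_t count = 0;
  for (std::size_t i = 0; i < left_num_rows; ++i) {
    if (keep_row(i)) { ++count; }
  }
  std::vector<std::int32_t> out;
  out.reserve(count);
  for (std::size_t i = 0; i < left_num_rows; ++i) {
    if (keep_row(i)) { out.push_back(static_cast<std::int32_t>(i)); }
  }
  return out;
}

// Single-pass approach: pessimistically allocate N slots (the left table
// size), write the kept indices, then shrink to the number actually written.
template <typename RowPredicate>
std::vector<std::int32_t> gather_upper_bound(std::size_t left_num_rows, RowPredicate keep_row)
{
  std::vector<std::int32_t> out(left_num_rows);
  std::size_t written = 0;
  for (std::size_t i = 0; i < left_num_rows; ++i) {
    if (keep_row(i)) { out[written++] = static_cast<std::int32_t>(i); }
  }
  out.resize(written);
  return out;
}
```

The single-pass version does half the predicate work at the cost of a temporary allocation of N indices, which is the memory-for-runtime compromise mentioned above.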

My specs are as follows
image

CPU: 12th Gen Intel(R) Core(TM) i9-12900K, 3200 MHz, 16 cores, 24 logical processors
GPU: RTX 3080
RAM: 64 GB DDR5
OS: WSL2 on a Windows 11 host

image

@vuule
Contributor

vuule commented Dec 20, 2023

/ok to test

@GregoryKimball
Contributor

@vyasr would you please take a look when you get back?

@GregoryKimball
Contributor

Please note that this PR addresses part of #10039

@PointKernel PointKernel added Performance Performance related issue 3 - Ready for Review Ready for review by team labels Jan 3, 2024
@PointKernel PointKernel changed the title [DRAFT] refactor joins for conditional semis and antis Refactor joins for conditional semis and antis Jan 3, 2024
@PointKernel PointKernel marked this pull request as ready for review January 3, 2024 20:22
@PointKernel
Member

/ok to test

@PointKernel
Member

@DanialJavady96 Making this ready for review to draw proper attention from reviewers

@bdice
Contributor

bdice commented Apr 17, 2024

Do we need any expanded tests? I'll try to look into that.

Responding to myself -- I think our testing looks okay for now. I don't know of anything that would need to be changed. https://github.com/rapidsai/cudf/blob/branch-24.06/cpp/tests/join/conditional_join_tests.cu

DanialJavady96 and others added 3 commits April 17, 2024 09:30
Co-authored-by: Bradley Dice <bdice@bradleydice.com>
Co-authored-by: Yunsong Wang <yunsongw@nvidia.com>
Member

@harrism harrism left a comment

Need to use device_async_resource_ref for MR parameters. Otherwise just a couple of nits.

@@ -348,14 +443,13 @@ std::unique_ptr<rmm::device_uvector<size_type>> conditional_left_semi_join(
rmm::mr::device_memory_resource* mr)
Member

todo: Please use rmm::device_async_resource_ref (not a pointer)

Contributor

See #15498 for details / examples. It should be a straightforward replacement.

Member

@DanialJavady96 I can launch another round of CI tests once this mr and the one below (line 366) are migrated to the new async resource ref.

@bdice
Contributor

bdice commented Apr 18, 2024

/ok to test

Contributor

@bdice bdice left a comment

A few comments, and one question about exiting early / work reduction.

Comment on lines 148 to 149
auto const right_num_rows{right.num_rows()};
auto const left_num_rows{left.num_rows()};
Contributor

Let's choose a consistent pattern between this and the changes below. Either use auto const variables or always use .num_rows(). It looks like the below was refactored to use the method, but this is still using variables.

if (join_type == join_kind::LEFT_SEMI_JOIN && !found_match) {
add_left_to_cache(outer_row_index, current_idx_shared, warp_id, join_shared_l[warp_id]);
}
found_match = true;
Contributor

@bdice bdice Apr 19, 2024

Once found_match is true for an outer_row_index, are we allowed to quit evaluating more inner rows for matches? It seems like we should be able to trigger the flush code and then go to the next outer_row_index. Both SEMI and ANTI joins check that found_match is false before adding an outer_row_index, but it seems like they would continue evaluating other inner rows anyway (but they shouldn't have to do so). Does that sound right?

Contributor

@ZelboK ZelboK Apr 20, 2024

Oh, good catch! Yeah, there is no need at all for it to keep searching that space after it finds a match. I am really curious about the speed improvement this change will make. Let me benchmark before I push.
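
The early-exit idea discussed in this thread can be sketched on the host as follows. This is a simplified, single-threaded illustration, not the warp-level kernel code. scan_result and scan_inner_rows are hypothetical names, and the evaluations counter exists only to make the saved work visible.

```cpp
#include <cstddef>
#include <vector>

// Outcome of scanning the inner (right) rows for one outer (left) row.
struct scan_result {
  bool found_match;
  std::size_t evaluations;  // number of predicate calls performed
};

// Hypothetical helper: scans the inner rows for a single outer row.
// With exit_early set, the scan stops at the first match, because both semi
// and anti joins only need to know whether any match exists.
template <typename T, typename Predicate>
scan_result scan_inner_rows(T const& outer_row,
                            std::vector<T> const& inner,
                            Predicate pred,
                            bool exit_early)
{
  scan_result result{false, 0};
  for (auto const& inner_row : inner) {
    ++result.evaluations;
    if (pred(outer_row, inner_row)) {
      result.found_match = true;
      if (exit_early) { break; }
    }
  }
  return result;
}
```

Note that an outer row with no match still requires a full scan of the inner rows, so the saving depends on how many outer rows match and how early their first match appears.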

ZelboK and others added 2 commits April 19, 2024 19:31
Co-authored-by: Bradley Dice <bdice@bradleydice.com>
Co-authored-by: Bradley Dice <bdice@bradleydice.com>
@ZelboK
Contributor

ZelboK commented Apr 22, 2024

@bdice

Benchmark                                                                                                    Time             CPU   Iterations
----------------------------------------------------------------------------------------------------------------------------------------------
ConditionalJoin<int32_t, int32_t>/conditional_left_anti_join_32bit/100000/100000/manual_time               311 ms          312 ms            2
ConditionalJoin<int32_t, int32_t>/conditional_left_anti_join_32bit/100000/400000/manual_time              1126 ms         1126 ms            1
ConditionalJoin<int32_t, int32_t>/conditional_left_anti_join_32bit/100000/1000000/manual_time             2748 ms         2748 ms            1
ConditionalJoin<int64_t, int64_t>/conditional_left_anti_join_64bit/100000/100000/manual_time               318 ms          318 ms            2
ConditionalJoin<int64_t, int64_t>/conditional_left_anti_join_64bit/100000/400000/manual_time              1147 ms         1147 ms            1
ConditionalJoin<int64_t, int64_t>/conditional_left_anti_join_64bit/100000/1000000/manual_time             2796 ms         2796 ms            1
ConditionalJoin<int32_t, int32_t>/conditional_left_anti_join_32bit_nulls/100000/100000/manual_time         415 ms          415 ms            2
ConditionalJoin<int32_t, int32_t>/conditional_left_anti_join_32bit_nulls/100000/400000/manual_time        1485 ms         1485 ms            1
ConditionalJoin<int32_t, int32_t>/conditional_left_anti_join_32bit_nulls/100000/1000000/manual_time       3605 ms         3605 ms            1
ConditionalJoin<int64_t, int64_t>/conditional_left_anti_join_64bit_nulls/100000/100000/manual_time         417 ms          417 ms            2
ConditionalJoin<int64_t, int64_t>/conditional_left_anti_join_64bit_nulls/100000/400000/manual_time        1497 ms         1497 ms            1
ConditionalJoin<int64_t, int64_t>/conditional_left_anti_join_64bit_nulls/100000/1000000/manual_time       3651 ms         3651 ms            1
ConditionalJoin<int32_t, int32_t>/conditional_left_semi_join_32bit/100000/100000/manual_time               310 ms          310 ms            2
ConditionalJoin<int32_t, int32_t>/conditional_left_semi_join_32bit/100000/400000/manual_time              1117 ms         1117 ms            1
ConditionalJoin<int32_t, int32_t>/conditional_left_semi_join_32bit/100000/1000000/manual_time             2725 ms         2725 ms            1
ConditionalJoin<int64_t, int64_t>/conditional_left_semi_join_64bit/100000/100000/manual_time               316 ms          316 ms            2
ConditionalJoin<int64_t, int64_t>/conditional_left_semi_join_64bit/100000/400000/manual_time              1142 ms         1142 ms            1
ConditionalJoin<int64_t, int64_t>/conditional_left_semi_join_64bit/100000/1000000/manual_time             2782 ms         2782 ms            1
ConditionalJoin<int32_t, int32_t>/conditional_left_semi_join_32bit_nulls/100000/100000/manual_time         412 ms          412 ms            2
ConditionalJoin<int32_t, int32_t>/conditional_left_semi_join_32bit_nulls/100000/400000/manual_time        1482 ms         1482 ms            1
ConditionalJoin<int32_t, int32_t>/conditional_left_semi_join_32bit_nulls/100000/1000000/manual_time       3615 ms         3615 ms            1
ConditionalJoin<int64_t, int64_t>/conditional_left_semi_join_64bit_nulls/100000/100000/manual_time         418 ms          418 ms            2
ConditionalJoin<int64_t, int64_t>/conditional_left_semi_join_64bit_nulls/100000/400000/manual_time        1501 ms         1501 ms            1
ConditionalJoin<int64_t, int64_t>/conditional_left_semi_join_64bit_nulls/100000/1000000/manual_time       3658 ms         3658 ms            1

Compared to the benchmarks here,

#14646 (comment)

Looks pretty good! Some of the gains are quite significant.

Member

@PointKernel PointKernel left a comment

One last styling issue. Otherwise LGTM

auto left_num_rows{left.num_rows()};
if (right_num_rows == 0) {

if (right.num_rows() == 0) {
Member

This is the place that @bdice was referring to https://github.com/rapidsai/cudf/pull/14646/files#r1573011742.

Depending on how you see it, we should consistently use either auto const ... or .num_rows() in both functions

Contributor

I'll change the entire file to be consistent

Contributor

Ah... Okay. It turns out I was pushing to the wrong fork. That's why there were some inconsistencies. My bad.

@harrism harrism dismissed their stale review April 24, 2024 00:15

So I won't block merging.

@PointKernel
Member

/ok to test

Member

@PointKernel PointKernel left a comment

Thanks for the contribution. Great work!

@bdice
Contributor

bdice commented Apr 30, 2024

/merge

@rapids-bot rapids-bot bot merged commit 5287580 into rapidsai:branch-24.06 Apr 30, 2024
70 checks passed
Labels
3 - Ready for Review Ready for review by team improvement Improvement / enhancement to an existing function libcudf Affects libcudf (C++/CUDA) code. non-breaking Non-breaking change Performance Performance related issue

7 participants