Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[FEA] Add distinct-key joins to libcudf #14948

Closed
4 of 6 tasks
GregoryKimball opened this issue Feb 1, 2024 · 2 comments
Closed
4 of 6 tasks

[FEA] Add distinct-key joins to libcudf #14948

GregoryKimball opened this issue Feb 1, 2024 · 2 comments
Labels
2 - In Progress Currently a work in progress feature request New feature or request libcudf Affects libcudf (C++/CUDA) code. Performance Performance related issue Spark Functionality that helps Spark RAPIDS

Comments

@GregoryKimball
Copy link
Contributor

GregoryKimball commented Feb 1, 2024

Is your feature request related to a problem? Please describe.
For equality joins in which the keys of one of the tables do not contain any duplicates, then we can provide a more efficient implementation based on cuco::static_set. Distinct-key joins also have more predictable output sizes and most join types can be implemented with single-pass kernels. The join APIs currently in libcudf's hash_join class use the cuco::static_multimap data structure to support duplicates.

Describe the solution you'd like
We should provide a new distinct_hash_join class that uses the cuco::static_set data structure and does not support duplicate keys in the build table. This class would have member functions for inner_join and left_join join types.

Staging the work

Additional context
See also #12261, which includes refactoring hash_join from using cuco::static_multimap to cuco::static_multiset. If we add the simpler and more efficient distinct-key joins, it will make it easier to experiment with join implementations using set-like data structures.

Distinct-key joins are common in "primary key / foreign key" joins because the primary key in a table is required to never have duplicates.

@GregoryKimball GregoryKimball added feature request New feature or request 0 - Backlog In queue waiting for assignment Performance Performance related issue Spark Functionality that helps Spark RAPIDS labels Feb 1, 2024
@GregoryKimball GregoryKimball added the libcudf Affects libcudf (C++/CUDA) code. label Feb 1, 2024
@PointKernel PointKernel changed the title [FEA] Add unique-key joins to libcudf [FEA] Add distinct-key joins to libcudf Feb 13, 2024
@GregoryKimball GregoryKimball added 2 - In Progress Currently a work in progress and removed 0 - Backlog In queue waiting for assignment labels Feb 21, 2024
rapids-bot bot pushed a commit that referenced this issue Feb 23, 2024
Contributes to #14948

This PR adds a public `cudf::distinct_hash_join` class that provides a fast code path for joins with distinct keys.

Only distinct inner join is tackled in the current PR.

Authors:
  - Yunsong Wang (https://github.com/PointKernel)

Approvers:
  - Jason Lowe (https://github.com/jlowe)
  - Bradley Dice (https://github.com/bdice)
  - Lawrence Mitchell (https://github.com/wence-)
  - David Wendt (https://github.com/davidwendt)
  - Nghia Truong (https://github.com/ttnghia)

URL: #14990
rapids-bot bot pushed a commit that referenced this issue Mar 6, 2024
Contributes to #14948

This PR adds distinct left join. It also cleans up the distinct inner code to use the terms "build" and "probe" consistently instead of "left" and "right".

Authors:
  - Yunsong Wang (https://github.com/PointKernel)
  - Nghia Truong (https://github.com/ttnghia)

Approvers:
  - Bradley Dice (https://github.com/bdice)
  - Jason Lowe (https://github.com/jlowe)
  - Nghia Truong (https://github.com/ttnghia)

URL: #15149
@PointKernel
Copy link
Member

Based on tests, explicit shared memory hash tables for joins don't help improve performance:

shared_memory:

| Key | Distribution | NumOutputs | Samples |  CPU Time  | Noise  |  GPU Time  | Noise  |  Elem/s  |
|-----|--------------|------------|---------|------------|--------|------------|--------|----------|
| I32 |       UNIQUE |        800 |  36864x |  16.658 us | 26.38% |  13.567 us | 12.03% |  58.967M |
| I32 |       UNIQUE |       2000 |  35200x |  17.238 us | 23.20% |  14.207 us |  8.56% | 140.774M |
| I32 |       UNIQUE |       8000 |  23424x |  24.394 us | 16.17% |  21.355 us |  7.31% | 374.615M |
| I32 |       UNIQUE |      80000 |   5088x | 101.504 us |  3.65% |  98.452 us |  1.74% | 812.575M |
| I32 |       UNIQUE |     800000 |   1042x | 890.035 us |  0.61% | 887.040 us |  0.50% | 901.876M |
| I32 |       UNIQUE |    8000000 |     58x |   8.730 ms |  0.17% |   8.727 ms |  0.17% | 916.703M |
| I32 |       UNIQUE |   80000000 |     11x |  87.252 ms |  0.01% |  87.250 ms |  0.01% | 916.902M |
| I64 |       UNIQUE |        800 |  30928x |  19.206 us | 20.21% |  16.174 us |  7.08% |  49.463M |
| I64 |       UNIQUE |       2000 |  34576x |  17.513 us | 22.77% |  14.463 us |  8.11% | 138.283M |
| I64 |       UNIQUE |       8000 |  16272x |  33.790 us | 10.68% |  30.754 us |  3.78% | 260.125M |
| I64 |       UNIQUE |      80000 |   2640x | 192.500 us |  1.73% | 189.497 us |  0.67% | 422.169M |
| I64 |       UNIQUE |     800000 |    283x |   1.774 ms |  0.51% |   1.771 ms |  0.47% | 451.705M |
| I64 |       UNIQUE |    8000000 |     29x |  17.573 ms |  0.07% |  17.570 ms |  0.07% | 455.313M |
| I64 |       UNIQUE |   80000000 |     11x | 175.586 ms |  0.02% | 175.585 ms |  0.02% | 455.619M |

global memory:

| Key | Distribution | NumOutputs | Samples |  CPU Time  | Noise  |  GPU Time  | Noise  |  Elem/s  |
|-----|--------------|------------|---------|------------|--------|------------|--------|----------|
| I32 |       UNIQUE |        800 |  66784x |  10.678 us | 47.51% |   7.487 us | 18.63% | 106.854M |
| I32 |       UNIQUE |       2000 |  64320x |  10.932 us | 44.70% |   7.774 us | 16.48% | 257.269M |
| I32 |       UNIQUE |       8000 |  62640x |  11.161 us | 44.79% |   7.982 us | 18.55% |   1.002G |
| I32 |       UNIQUE |      80000 |  35520x |  17.181 us | 26.87% |  14.080 us | 14.23% |   5.682G |
| I32 |       UNIQUE |     800000 |   7552x |  69.285 us |  5.07% |  66.239 us |  2.00% |  12.077G |
| I32 |       UNIQUE |    8000000 |   2016x | 622.167 us |  1.21% | 619.066 us |  1.09% |  12.923G |
| I32 |       UNIQUE |   80000000 |    848x |   6.216 ms |  0.82% |   6.213 ms |  0.82% |  12.876G |
| I64 |       UNIQUE |        800 |  64640x |  10.884 us | 46.03% |   7.736 us | 19.71% | 103.415M |
| I64 |       UNIQUE |       2000 |  63936x |  11.054 us | 45.48% |   7.822 us | 16.70% | 255.690M |
| I64 |       UNIQUE |       8000 |  59792x |  11.565 us | 42.66% |   8.363 us | 16.62% | 956.566M |
| I64 |       UNIQUE |      80000 |  33040x |  18.327 us | 26.25% |  15.140 us | 14.08% |   5.284G |
| I64 |       UNIQUE |     800000 |   7152x |  74.976 us |  5.02% |  71.897 us |  2.42% |  11.127G |
| I64 |       UNIQUE |    8000000 |   2016x | 671.884 us |  1.34% | 668.698 us |  1.24% |  11.964G |
| I64 |       UNIQUE |   80000000 |   1104x |   6.662 ms |  0.86% |   6.659 ms |  0.86% |  12.014G |

This issue can be closed with confirmation from the spark side and the follow up work will be tracked via #15502

@PointKernel
Copy link
Member

Closing as completed.

NVIDIA/spark-rapids#7529 is more related to high-multiplicity tuning and shared memory hash table won't help.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
2 - In Progress Currently a work in progress feature request New feature or request libcudf Affects libcudf (C++/CUDA) code. Performance Performance related issue Spark Functionality that helps Spark RAPIDS
Projects
Archived in project
Development

No branches or pull requests

2 participants