Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Skip decode steps in Parquet reader when nullable columns have no nulls #15332

Merged
merged 35 commits into from
May 8, 2024

Conversation

etseidl
Copy link
Contributor

@etseidl etseidl commented Mar 18, 2024

Description

Closes #15266.

Some block and warp operations and some atomics can be avoided during Parquet page decoding when nullable columns are known to not contain any null values. One issue, however, is that since the columns are nullable, the null mask has to be set. My approach in this PR was to initialize nullable column buffers with an ALL_VALID mask rather than ALL_NULL. This will work when nulls are present because store_validity() sets the bitmask to all zeros before ORing in the passed valid_mask.

Benchmarks modified to not emit nulls showed a good improvement in decoding time for fixed-width data types.

## [0] NVIDIA RTX A6000

|  data_type  |    io_type    |  cardinality  |  run_length  |   Ref Time |   Ref Noise |   Cmp Time |   Cmp Noise |         Diff |   %Diff |  Status  |
|-------------|---------------|---------------|--------------|------------|-------------|------------|-------------|--------------|---------|----------|
|  INTEGRAL   | DEVICE_BUFFER |       0       |      1       |   9.955 ms |       2.31% |   9.361 ms |       3.55% |  -594.395 us |  -5.97% |   FAIL   |
|  INTEGRAL   | DEVICE_BUFFER |     1000      |      1       |   9.964 ms |       3.01% |   8.981 ms |       3.98% |  -982.965 us |  -9.87% |   FAIL   |
|  INTEGRAL   | DEVICE_BUFFER |       0       |      32      |  10.222 ms |       3.32% |   9.207 ms |       5.47% | -1014.597 us |  -9.93% |   FAIL   |
|  INTEGRAL   | DEVICE_BUFFER |     1000      |      32      |   9.930 ms |       3.83% |   8.607 ms |       3.37% | -1323.326 us | -13.33% |   FAIL   |
|    FLOAT    | DEVICE_BUFFER |       0       |      1       |   5.999 ms |       3.59% |   5.752 ms |       3.87% |  -246.635 us |  -4.11% |   FAIL   |
|    FLOAT    | DEVICE_BUFFER |     1000      |      1       |   6.576 ms |       4.40% |   5.839 ms |       4.43% |  -737.338 us | -11.21% |   FAIL   |
|    FLOAT    | DEVICE_BUFFER |       0       |      32      |   5.828 ms |       4.59% |   4.940 ms |       4.41% |  -887.375 us | -15.23% |   FAIL   |
|    FLOAT    | DEVICE_BUFFER |     1000      |      32      |   6.198 ms |       3.91% |   5.271 ms |       3.54% |  -927.602 us | -14.97% |   FAIL   |
|   DECIMAL   | DEVICE_BUFFER |       0       |      1       |  20.199 ms |       1.66% |  20.014 ms |       2.23% |  -184.710 us |  -0.91% |   PASS   |
|   DECIMAL   | DEVICE_BUFFER |     1000      |      1       |   7.068 ms |       3.99% |   6.479 ms |       4.08% |  -588.856 us |  -8.33% |   FAIL   |
|   DECIMAL   | DEVICE_BUFFER |       0       |      32      |   9.287 ms |       3.45% |   8.656 ms |       2.94% |  -631.348 us |  -6.80% |   FAIL   |
|   DECIMAL   | DEVICE_BUFFER |     1000      |      32      |   5.641 ms |       4.39% |   5.021 ms |       3.31% |  -620.122 us | -10.99% |   FAIL   |
|  TIMESTAMP  | DEVICE_BUFFER |       0       |      1       |  27.488 ms |       1.57% |  27.235 ms |       1.74% |  -253.277 us |  -0.92% |   PASS   |
|  TIMESTAMP  | DEVICE_BUFFER |     1000      |      1       |   6.656 ms |       4.73% |   6.049 ms |       4.61% |  -607.760 us |  -9.13% |   FAIL   |
|  TIMESTAMP  | DEVICE_BUFFER |       0       |      32      |   9.974 ms |       3.22% |   9.204 ms |       2.84% |  -770.247 us |  -7.72% |   FAIL   |
|  TIMESTAMP  | DEVICE_BUFFER |     1000      |      32      |   5.998 ms |       5.17% |   5.203 ms |       3.06% |  -794.943 us | -13.25% |   FAIL   |
|  DURATION   | DEVICE_BUFFER |       0       |      1       |   8.816 ms |       3.61% |   8.538 ms |       3.26% |  -278.877 us |  -3.16% |   PASS   |
|  DURATION   | DEVICE_BUFFER |     1000      |      1       |   5.989 ms |       4.76% |   5.446 ms |       4.57% |  -542.636 us |  -9.06% |   FAIL   |
|  DURATION   | DEVICE_BUFFER |       0       |      32      |   6.822 ms |       4.96% |   6.042 ms |       3.74% |  -779.786 us | -11.43% |   FAIL   |
|  DURATION   | DEVICE_BUFFER |     1000      |      32      |   5.706 ms |       5.40% |   4.930 ms |       3.39% |  -775.607 us | -13.59% |   FAIL   |
|   STRING    | DEVICE_BUFFER |       0       |      1       |  36.616 ms |       1.74% |  36.483 ms |       1.31% |  -132.191 us |  -0.36% |   PASS   |
|   STRING    | DEVICE_BUFFER |     1000      |      1       |  12.006 ms |       4.15% |  11.989 ms |       3.53% |   -16.278 us |  -0.14% |   PASS   |
|   STRING    | DEVICE_BUFFER |       0       |      32      |  36.587 ms |       1.99% |  36.514 ms |       1.38% |   -73.737 us |  -0.20% |   PASS   |
|   STRING    | DEVICE_BUFFER |     1000      |      32      |  11.235 ms |       4.25% |  11.228 ms |       3.62% |    -7.041 us |  -0.06% |   PASS   |
|    LIST     | DEVICE_BUFFER |       0       |      1       |  36.929 ms |       1.88% |  36.988 ms |       1.42% |    59.350 us |   0.16% |   PASS   |
|    LIST     | DEVICE_BUFFER |     1000      |      1       |  36.510 ms |       1.91% |  36.558 ms |       1.66% |    48.536 us |   0.13% |   PASS   |
|    LIST     | DEVICE_BUFFER |       0       |      32      |  35.513 ms |       2.00% |  35.490 ms |       1.77% |   -23.411 us |  -0.07% |   PASS   |
|    LIST     | DEVICE_BUFFER |     1000      |      32      |  35.755 ms |       1.99% |  35.728 ms |       1.64% |   -27.564 us |  -0.08% |   PASS   |
|   STRUCT    | DEVICE_BUFFER |       0       |      1       |  43.456 ms |       1.35% |  43.537 ms |       1.16% |    81.405 us |   0.19% |   PASS   |
|   STRUCT    | DEVICE_BUFFER |     1000      |      1       |  25.549 ms |       2.54% |  25.698 ms |       1.90% |   149.295 us |   0.58% |   PASS   |
|   STRUCT    | DEVICE_BUFFER |       0       |      32      |  43.103 ms |       1.87% |  43.019 ms |       1.59% |   -84.825 us |  -0.20% |   PASS   |
|   STRUCT    | DEVICE_BUFFER |     1000      |      32      |  23.462 ms |       2.81% |  23.432 ms |       1.88% |   -30.434 us |  -0.13% |   PASS   |

Checklist

  • I am familiar with the Contributing Guidelines.
  • New or existing tests cover these changes.
  • The documentation is up to date with these changes.

Copy link

copy-pr-bot bot commented Mar 18, 2024

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

@github-actions github-actions bot added the libcudf Affects libcudf (C++/CUDA) code. label Mar 18, 2024
@etseidl
Copy link
Contributor Author

etseidl commented Mar 18, 2024

@abellina when you have a chance I'd be curious to see how this works with your benchmarks.

cc @nvdbaranec

@github-actions github-actions bot added Python Affects Python cuDF API. CMake CMake build issue labels Mar 19, 2024
@github-actions github-actions bot added Python Affects Python cuDF API. and removed Python Affects Python cuDF API. CMake CMake build issue labels Mar 19, 2024
@github-actions github-actions bot removed the Python Affects Python cuDF API. label Mar 20, 2024
@etseidl etseidl requested a review from vuule May 3, 2024 15:32
@vuule vuule requested a review from mhaseeb123 May 3, 2024 17:14
@abellina
Copy link
Contributor

abellina commented May 3, 2024

@etseidl I'll take a look at this and we will try to run with our benchmarks, but I think it might be early next week. I am sorry about the delay in responding, I have to fix some communication settings on my end.

I would not block this PR as we will detect regressions quickly in our benchmark runs.

@mhaseeb123
Copy link
Member

/ok to test

@mhaseeb123
Copy link
Member

The idea and changes look good to me. I would like to see improvement results from benchmarks as well. As always, thanks Ed for a great effort! 🙂

Copy link
Contributor

@vuule vuule left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

few comments got left behind, otherwise 💯

cpp/src/io/parquet/decode_fixed.cu Outdated Show resolved Hide resolved
cpp/src/io/parquet/decode_fixed.cu Outdated Show resolved Hide resolved
cpp/src/io/parquet/decode_fixed.cu Outdated Show resolved Hide resolved
Co-authored-by: Vukasin Milovanovic <vmilovanovic@nvidia.com>
@mhaseeb123
Copy link
Member

/ok to test

@vuule
Copy link
Contributor

vuule commented May 3, 2024

Apologies for leaving a trailing whitespace in one of the comments. As they say, "this little maneuver will cost us 51 years"

@vuule
Copy link
Contributor

vuule commented May 3, 2024

/ok to test

@etseidl
Copy link
Contributor Author

etseidl commented May 3, 2024

we will detect regressions quickly in our benchmark runs

😬

I'm fine with waiting to see how this works in the spark environment. It's easier to not commit than to have to revert 😄

@vuule vuule added the 5 - DO NOT MERGE Hold off on merging; see PR for details label May 3, 2024
@abellina
Copy link
Contributor

abellina commented May 8, 2024

I should have numbers for this today @etseidl sorry for the delay.

@etseidl
Copy link
Contributor Author

etseidl commented May 8, 2024

Thanks for testing this @abellina 🙏

@abellina
Copy link
Contributor

abellina commented May 8, 2024

@etseidl I am not seeing a difference in our tests. I am having trouble getting baselines to be as fast as normal in this environment, so take this with a grain of salt, but if I compare last night's nightly against this branch it's just noise, some are higher some are lower, and the queries I expect to see impact from scan (query9, query88) don't really see a change.

query1: Previous (2176.6666666666665 ms) vs Current (2005.3333333333333 ms) Diff 171 E2E 1.09x
query2: Previous (2722.3333333333335 ms) vs Current (2636.3333333333335 ms) Diff 86 E2E 1.03x
query3: Previous (707.6666666666666 ms) vs Current (624.0 ms) Diff 83 E2E 1.13x
query4: Previous (13285.0 ms) vs Current (13070.666666666666 ms) Diff 214 E2E 1.02x
query5: Previous (2572.0 ms) vs Current (2638.0 ms) Diff -66 E2E 0.97x
query6: Previous (1279.3333333333333 ms) vs Current (1225.6666666666667 ms) Diff 53 E2E 1.04x
query7: Previous (1466.0 ms) vs Current (1474.0 ms) Diff -8 E2E 0.99x
query8: Previous (1426.3333333333333 ms) vs Current (1437.3333333333333 ms) Diff -11 E2E 0.99x
query9: Previous (6973.0 ms) vs Current (7032.666666666667 ms) Diff -59 E2E 0.99x
query10: Previous (1989.0 ms) vs Current (1944.3333333333333 ms) Diff 44 E2E 1.02x
query11: Previous (7948.666666666667 ms) vs Current (7326.0 ms) Diff 622 E2E 1.08x
query12: Previous (875.0 ms) vs Current (796.6666666666666 ms) Diff 78 E2E 1.10x
query13: Previous (2205.6666666666665 ms) vs Current (2520.0 ms) Diff -314 E2E 0.88x
query14_part1: Previous (8609.0 ms) vs Current (8694.0 ms) Diff -85 E2E 0.99x
query14_part2: Previous (6744.666666666667 ms) vs Current (6877.333333333333 ms) Diff -132 E2E 0.98x
query15: Previous (1987.0 ms) vs Current (1739.3333333333333 ms) Diff 247 E2E 1.14x
query16: Previous (8691.333333333334 ms) vs Current (8774.0 ms) Diff -82 E2E 0.99x
query17: Previous (2610.0 ms) vs Current (2823.6666666666665 ms) Diff -213 E2E 0.92x
query18: Previous (5004.666666666667 ms) vs Current (4977.666666666667 ms) Diff 27 E2E 1.01x
query19: Previous (3083.6666666666665 ms) vs Current (2287.0 ms) Diff 796 E2E 1.35x
query20: Previous (921.0 ms) vs Current (877.6666666666666 ms) Diff 43 E2E 1.05x
query21: Previous (696.6666666666666 ms) vs Current (657.0 ms) Diff 39 E2E 1.06x
query22: Previous (1357.3333333333333 ms) vs Current (1401.0 ms) Diff -43 E2E 0.97x
query23_part1: Previous (14399.666666666666 ms) vs Current (14628.666666666666 ms) Diff -229 E2E 0.98x
query23_part2: Previous (19844.333333333332 ms) vs Current (21094.666666666668 ms) Diff -1250 E2E 0.94x
query24_part1: Previous (8953.333333333334 ms) vs Current (8582.333333333334 ms) Diff 371 E2E 1.04x
query24_part2: Previous (8605.333333333334 ms) vs Current (8619.0 ms) Diff -13 E2E 1.00x
query25: Previous (2104.0 ms) vs Current (2054.0 ms) Diff 50 E2E 1.02x
query26: Previous (1308.6666666666667 ms) vs Current (1368.3333333333333 ms) Diff -59 E2E 0.96x
query27: Previous (1700.0 ms) vs Current (1564.0 ms) Diff 136 E2E 1.09x
query28: Previous (7561.666666666667 ms) vs Current (7674.333333333333 ms) Diff -112 E2E 0.99x
query29: Previous (3166.6666666666665 ms) vs Current (3588.0 ms) Diff -421 E2E 0.88x
query30: Previous (3949.3333333333335 ms) vs Current (3904.0 ms) Diff 45 E2E 1.01x
query31: Previous (3821.3333333333335 ms) vs Current (3881.6666666666665 ms) Diff -60 E2E 0.98x
query32: Previous (1924.3333333333333 ms) vs Current (1938.6666666666667 ms) Diff -14 E2E 0.99x
query33: Previous (1751.0 ms) vs Current (2398.3333333333335 ms) Diff -647 E2E 0.73x
query34: Previous (2795.3333333333335 ms) vs Current (2733.3333333333335 ms) Diff 62 E2E 1.02x
query35: Previous (2844.6666666666665 ms) vs Current (2730.6666666666665 ms) Diff 114 E2E 1.04x
query36: Previous (1724.6666666666667 ms) vs Current (1674.6666666666667 ms) Diff 50 E2E 1.03x
query37: Previous (1497.3333333333333 ms) vs Current (1746.6666666666667 ms) Diff -249 E2E 0.86x
query38: Previous (4900.333333333333 ms) vs Current (4734.666666666667 ms) Diff 165 E2E 1.03x
query39_part1: Previous (2502.6666666666665 ms) vs Current (2202.3333333333335 ms) Diff 300 E2E 1.14x
query39_part2: Previous (1533.6666666666667 ms) vs Current (1446.3333333333333 ms) Diff 87 E2E 1.06x
query40: Previous (1693.3333333333333 ms) vs Current (1732.3333333333333 ms) Diff -39 E2E 0.98x
query41: Previous (439.6666666666667 ms) vs Current (438.3333333333333 ms) Diff 1 E2E 1.00x
query42: Previous (418.0 ms) vs Current (366.3333333333333 ms) Diff 51 E2E 1.14x
query43: Previous (897.0 ms) vs Current (867.0 ms) Diff 30 E2E 1.03x
query44: Previous (649.0 ms) vs Current (655.3333333333334 ms) Diff -6 E2E 0.99x
query45: Previous (1324.3333333333333 ms) vs Current (1342.3333333333333 ms) Diff -18 E2E 0.99x
query46: Previous (2373.3333333333335 ms) vs Current (2242.0 ms) Diff 131 E2E 1.06x
query47: Previous (2232.0 ms) vs Current (2442.3333333333335 ms) Diff -210 E2E 0.91x
query48: Previous (1145.3333333333333 ms) vs Current (1141.3333333333333 ms) Diff 4 E2E 1.00x
query49: Previous (2930.6666666666665 ms) vs Current (3163.6666666666665 ms) Diff -233 E2E 0.93x
query50: Previous (8678.666666666666 ms) vs Current (9419.333333333334 ms) Diff -740 E2E 0.92x
query51: Previous (3151.6666666666665 ms) vs Current (3247.6666666666665 ms) Diff -96 E2E 0.97x
query52: Previous (476.3333333333333 ms) vs Current (530.6666666666666 ms) Diff -54 E2E 0.90x
query53: Previous (928.6666666666666 ms) vs Current (895.3333333333334 ms) Diff 33 E2E 1.04x
query54: Previous (1908.0 ms) vs Current (1903.0 ms) Diff 5 E2E 1.00x
query55: Previous (503.6666666666667 ms) vs Current (516.3333333333334 ms) Diff -12 E2E 0.98x
query56: Previous (1215.6666666666667 ms) vs Current (1126.0 ms) Diff 89 E2E 1.08x
query57: Previous (2100.3333333333335 ms) vs Current (2084.0 ms) Diff 16 E2E 1.01x
query58: Previous (1526.3333333333333 ms) vs Current (1477.3333333333333 ms) Diff 49 E2E 1.03x
query59: Previous (2023.0 ms) vs Current (2086.3333333333335 ms) Diff -63 E2E 0.97x
query60: Previous (2261.3333333333335 ms) vs Current (2260.3333333333335 ms) Diff 1 E2E 1.00x
query61: Previous (1288.0 ms) vs Current (1299.3333333333333 ms) Diff -11 E2E 0.99x
query62: Previous (1367.3333333333333 ms) vs Current (1425.3333333333333 ms) Diff -58 E2E 0.96x
query63: Previous (1263.6666666666667 ms) vs Current (1502.6666666666667 ms) Diff -239 E2E 0.84x
query64: Previous (17341.333333333332 ms) vs Current (17432.0 ms) Diff -90 E2E 0.99x
query65: Previous (3955.6666666666665 ms) vs Current (4175.333333333333 ms) Diff -219 E2E 0.95x
query66: Previous (5169.0 ms) vs Current (4960.0 ms) Diff 209 E2E 1.04x
query67: Previous (29768.333333333332 ms) vs Current (30835.0 ms) Diff -1066 E2E 0.97x
query68: Previous (1882.6666666666667 ms) vs Current (1898.0 ms) Diff -15 E2E 0.99x
query69: Previous (1604.3333333333333 ms) vs Current (1847.0 ms) Diff -242 E2E 0.87x
query70: Previous (2043.3333333333333 ms) vs Current (2306.6666666666665 ms) Diff -263 E2E 0.89x
query71: Previous (3429.6666666666665 ms) vs Current (3922.0 ms) Diff -492 E2E 0.87x
query72: Previous (3785.0 ms) vs Current (3582.3333333333335 ms) Diff 202 E2E 1.06x
query73: Previous (1237.3333333333333 ms) vs Current (1234.0 ms) Diff 3 E2E 1.00x
query74: Previous (5593.0 ms) vs Current (5690.0 ms) Diff -97 E2E 0.98x
query75: Previous (7276.333333333333 ms) vs Current (7282.666666666667 ms) Diff -6 E2E 1.00x
query76: Previous (3253.6666666666665 ms) vs Current (2433.3333333333335 ms) Diff 820 E2E 1.34x
query77: Previous (1832.0 ms) vs Current (1706.3333333333333 ms) Diff 125 E2E 1.07x
query78: Previous (9297.666666666666 ms) vs Current (9277.666666666666 ms) Diff 20 E2E 1.00x
query79: Previous (1654.3333333333333 ms) vs Current (2191.3333333333335 ms) Diff -537 E2E 0.75x
query80: Previous (4295.666666666667 ms) vs Current (4578.0 ms) Diff -282 E2E 0.94x
query81: Previous (3208.0 ms) vs Current (3237.0 ms) Diff -29 E2E 0.99x
query82: Previous (2592.0 ms) vs Current (2693.3333333333335 ms) Diff -101 E2E 0.96x
query83: Previous (7616.0 ms) vs Current (7254.333333333333 ms) Diff 361 E2E 1.05x
query84: Previous (1964.0 ms) vs Current (2037.0 ms) Diff -73 E2E 0.96x
query85: Previous (2161.6666666666665 ms) vs Current (2078.3333333333335 ms) Diff 83 E2E 1.04x
query86: Previous (1161.3333333333333 ms) vs Current (959.0 ms) Diff 202 E2E 1.21x
query87: Previous (5044.666666666667 ms) vs Current (5856.333333333333 ms) Diff -811 E2E 0.86x
query88: Previous (5407.0 ms) vs Current (5484.333333333333 ms) Diff -77 E2E 0.99x
query89: Previous (1029.6666666666667 ms) vs Current (1088.0 ms) Diff -58 E2E 0.95x
query90: Previous (1492.0 ms) vs Current (1427.0 ms) Diff 65 E2E 1.05x
query91: Previous (1029.3333333333333 ms) vs Current (980.3333333333334 ms) Diff 48 E2E 1.05x
query92: Previous (1071.6666666666667 ms) vs Current (1274.3333333333333 ms) Diff -202 E2E 0.84x
query93: Previous (11156.333333333334 ms) vs Current (10653.0 ms) Diff 503 E2E 1.05x
query94: Previous (4380.333333333333 ms) vs Current (4398.0 ms) Diff -17 E2E 1.00x
query95: Previous (6585.0 ms) vs Current (6690.0 ms) Diff -105 E2E 0.98x
query96: Previous (1088.3333333333333 ms) vs Current (1263.6666666666667 ms) Diff -175 E2E 0.86x
query97: Previous (2195.0 ms) vs Current (2136.6666666666665 ms) Diff 58 E2E 1.03x
query98: Previous (2777.6666666666665 ms) vs Current (2703.3333333333335 ms) Diff 74 E2E 1.03x
query99: Previous (2136.0 ms) vs Current (2083.0 ms) Diff 53 E2E 1.03x
benchmark: Previous (393666.6666666667 ms) vs Current (397000.0 ms) Diff -3333 E2E 0.99x

@vuule vuule added 5 - Ready to Merge Testing and reviews complete, ready to merge and removed 5 - DO NOT MERGE Hold off on merging; see PR for details labels May 8, 2024
@vuule
Copy link
Contributor

vuule commented May 8, 2024

Thanks @abellina. Looks like there are no regressions, this is good to merge.

@vuule
Copy link
Contributor

vuule commented May 8, 2024

/merge

@rapids-bot rapids-bot bot merged commit ffbdd24 into rapidsai:branch-24.06 May 8, 2024
70 checks passed
@etseidl etseidl deleted the no_nulls branch May 8, 2024 20:01
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
5 - Ready to Merge Testing and reviews complete, ready to merge cuIO cuIO issue improvement Improvement / enhancement to an existing function libcudf Affects libcudf (C++/CUDA) code. non-breaking Non-breaking change Performance Performance related issue
Projects
Archived in project
Development

Successfully merging this pull request may close these issues.

[FEA] skip parquet decode steps for nullable all-valid pages
4 participants