Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Improve performance for cudf::strings::count_re #15578

Merged
merged 3 commits into from
Apr 25, 2024

Conversation

davidwendt
Copy link
Contributor

@davidwendt davidwendt commented Apr 22, 2024

Description

Improves performance of cudf::strings::count_re when pattern starts with a literal character.
Although this is a specific use case, the regex code has special logic to help speed up the search in this case.

Since the pattern indicates the target must contain this character as the start of the matching sequence, it first does a normal find for the character before continuing matching the remaining pattern. The find() function can be inefficient for long strings since it is character based and must resolve the character's byte-position by counting from the beginning of the string. For a function like count_re() all occurrences are matched within a target meaning longer target strings can incur expensive counting.

The solution included here is to introduce a more efficient find_char() utility that accepts a string_view::const_iterator() which automatically keeps track of its byte and character positions. This helps minimize byte/character counting in between calls from count_re() and other similar functions that make repeated calls for all matches (e.g. replace_re() and split_re()).

Close #15567

Checklist

  • I am familiar with the Contributing Guidelines.
  • New or existing tests cover these changes.
  • The documentation is up to date with these changes.

@davidwendt davidwendt added 2 - In Progress Currently a work in progress libcudf Affects libcudf (C++/CUDA) code. strings strings issues (C++ and Python) improvement Improvement / enhancement to an existing function non-breaking Non-breaking change labels Apr 22, 2024
@davidwendt davidwendt self-assigned this Apr 22, 2024
@davidwendt
Copy link
Contributor Author

Some performance numbers for using count_re('a')

## [0] Quadro GV100

| row_width | num_rows |   Ref Time |   Cmp Time |          Diff |   %Diff |
|-----------|----------|------------|------------|---------------|---------|
|    32     |   4096   |  44.871 us |  38.855 us |     -6.016 us | -13.41% |
|    64     |   4096   |  58.717 us |  46.681 us |    -12.035 us | -20.50% |
|    128    |   4096   | 115.386 us |  77.133 us |    -38.253 us | -33.15% |
|    256    |   4096   | 264.371 us | 166.308 us |    -98.063 us | -37.09% |
|    512    |   4096   | 538.269 us | 297.873 us |   -240.396 us | -44.66% |
|   1024    |   4096   |   1.278 ms | 615.012 us |   -662.502 us | -51.86% |
|   2048    |   4096   |   3.964 ms |   1.381 ms |  -2582.890 us | -65.17% |
|    32     |  32768   |  51.643 us |  43.267 us |     -8.377 us | -16.22% |
|    64     |  32768   |  65.373 us |  53.333 us |    -12.041 us | -18.42% |
|    128    |  32768   | 139.921 us |  93.561 us |    -46.360 us | -33.13% |
|    256    |  32768   | 307.079 us | 194.002 us |   -113.076 us | -36.82% |
|    512    |  32768   | 714.200 us | 431.820 us |   -282.380 us | -39.54% |
|   1024    |  32768   |   2.037 ms | 949.529 us |  -1087.021 us | -53.38% |
|   2048    |  32768   |   6.012 ms |   2.019 ms |  -3992.349 us | -66.41% |
|    32     |  262144  | 148.891 us | 124.328 us |    -24.563 us | -16.50% |
|    64     |  262144  | 244.408 us | 185.365 us |    -59.043 us | -24.16% |
|    128    |  262144  | 503.423 us | 346.905 us |   -156.518 us | -31.09% |
|    256    |  262144  |   1.224 ms | 821.019 us |   -403.339 us | -32.94% |
|    512    |  262144  |   3.486 ms |   2.131 ms |  -1354.260 us | -38.85% |
|   1024    |  262144  |  10.196 ms |   4.771 ms |  -5424.827 us | -53.20% |
|   2048    |  262144  |  29.228 ms |   9.893 ms | -19334.925 us | -66.15% |
|    32     | 2097152  | 940.278 us | 760.596 us |   -179.682 us | -19.11% |
|    64     | 2097152  |   1.624 ms |   1.204 ms |   -420.264 us | -25.88% |
|    128    | 2097152  |   3.402 ms |   2.333 ms |  -1068.791 us | -31.42% |
|    256    | 2097152  |   8.322 ms |   5.559 ms |  -2763.156 us | -33.20% |
|    512    | 2097152  |  24.399 ms |  15.039 ms |  -9360.116 us | -38.36% |
|    32     | 16777216 |   7.274 ms |   5.862 ms |  -1411.535 us | -19.41% |
|    64     | 16777216 |  12.675 ms |   9.352 ms |  -3323.108 us | -26.22% |

A general improvement overall but 2-3x improvement for long strings.

@davidwendt davidwendt added 3 - Ready for Review Ready for review by team and removed 2 - In Progress Currently a work in progress labels Apr 22, 2024
@davidwendt davidwendt marked this pull request as ready for review April 22, 2024 17:22
@davidwendt davidwendt requested a review from a team as a code owner April 22, 2024 17:22
@davidwendt
Copy link
Contributor Author

Also should update here the results from the code snippet in the original issue.
Before

count on string len 30, took 0.00039577484130859375 seconds
count on string len 300, took 0.0024445056915283203 seconds
count on string len 3000, took 0.1573934555053711 seconds
count on string len 30000, took 11.848689079284668 seconds

After

count on string len 30, took 0.0003459453582763672 seconds
count on string len 300, took 0.0008559226989746094 seconds
count on string len 3000, took 0.006361961364746094 seconds
count on string len 30000, took 0.061533212661743164 seconds

Copy link
Member

@PointKernel PointKernel left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good with some non-blocking nits.

cpp/benchmarks/string/count.cpp Outdated Show resolved Hide resolved
cpp/src/strings/regex/regex.inl Outdated Show resolved Hide resolved
@davidwendt
Copy link
Contributor Author

/merge

@rapids-bot rapids-bot bot merged commit 4dc9ebb into rapidsai:branch-24.06 Apr 25, 2024
70 checks passed
@davidwendt davidwendt deleted the perf-count-re branch April 25, 2024 12:53
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
3 - Ready for Review Ready for review by team improvement Improvement / enhancement to an existing function libcudf Affects libcudf (C++/CUDA) code. non-breaking Non-breaking change strings strings issues (C++ and Python)
Projects
Archived in project
Development

Successfully merging this pull request may close these issues.

[BUG] Poor performance scaling of str.count with long strings
3 participants