
GH-39565: [C++] Do not concatenate chunked values of fixed-width types to run "array_take" #41700

Open
wants to merge 20 commits into main

Conversation

felipecrv
Contributor

@felipecrv felipecrv commented May 17, 2024

Rationale for this change

Concatenating a chunked array into a single array before running the array_take kernels is very inefficient and can lead to out-of-memory crashes. See also #25822.

What changes are included in this PR?

  • Implementation of kernels for "array_take" that can receive a ChunkedArray as values and produce an output without concatenating these chunks
  • Improvements in the dispatching logic of TakeMetaFunction("take") to make "array_take" able to have a chunked_exec kernel for all types (some specialized and some based on concatenation)

Are these changes tested?

By existing tests. Some tests were added in previous PRs that introduced some of the infrastructure to support this.

@felipecrv felipecrv changed the title GH-39565: [C++] GH-39565: [C++] Do not concatenate ChunkedArray values to run "array_take" May 17, 2024
@felipecrv felipecrv changed the title GH-39565: [C++] Do not concatenate ChunkedArray values to run "array_take" GH-39565: [C++] Do not concatenate chunked values of fixed-width types to run "array_take" May 17, 2024
@mapleFU
Member

mapleFU commented May 17, 2024

May I ask an unrelated question: when should we call assert and when DCHECK? I think they would likely be the same.

@felipecrv
Contributor Author

May I ask an unrelated question: when should we call assert and when DCHECK? I think they would likely be the same.

We call assert in headers because we don't want to pay the cost of including logging.h everywhere. Think of assert as a lighter-weight debug check. But if you see an assert in a .cc file, tell me to change it to DCHECK*.

@felipecrv felipecrv force-pushed the take_chunked_fixed branch 2 times, most recently from fbd97a3 to f4b4e12 Compare June 10, 2024 15:35
felipecrv added a commit that referenced this pull request Jun 13, 2024
… make them private (#42127)

### Rationale for this change

Move TakeXXX free functions into `TakeMetaFunction` and make them private

### What changes are included in this PR?

Code move and some small refactorings in preparation for #41700.

### Are these changes tested?

By existing tests.
* GitHub Issue: #42126

Authored-by: Felipe Oliveira Carvalho <felipekde@gmail.com>
Signed-off-by: Felipe Oliveira Carvalho <felipekde@gmail.com>
@felipecrv felipecrv marked this pull request as ready for review June 15, 2024 15:12
@@ -60,6 +60,7 @@ void RegisterSelectionFunction(const std::string& name, FunctionDoc doc,
{std::move(kernel_data.value_type), std::move(kernel_data.selection_type)},
OutputType(FirstType));
base_kernel.exec = kernel_data.exec;
base_kernel.exec_chunked = kernel_data.chunked_exec;
Contributor Author


The member variable is called exec_chunked but the type is called ChunkedExec (so confusing). In this PR I ended up sticking to chunked_exec. Once everything is reviewed and merged, I could try to unify things in whichever direction people prefer.

@felipecrv felipecrv force-pushed the take_chunked_fixed branch 4 times, most recently from 28da5e6 to 2ff6789 Compare June 20, 2024 13:54
@pitrou
Member

pitrou commented Aug 21, 2024

I ran the new benchmarks (those with a small selection factor) and the results are more varied, see below:

-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
Non-regressions: (13)
-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
                                                    benchmark           baseline          contender  change %                                                                                                                                                                                                                                                  counters
             TakeChunkedFlatInt64FewMonotonicIndices/524288/0   3.487G items/sec   6.131G items/sec    75.818             {'family_index': 5, 'per_family_instance_index': 4, 'run_name': 'TakeChunkedFlatInt64FewMonotonicIndices/524288/0', 'repetitions': 1, 'repetition_index': 0, 'threads': 1, 'iterations': 4676, 'null_percent': 0.0, 'selection_factor': 0.05}
          TakeChunkedChunkedInt64FewMonotonicIndices/524288/0   3.397G items/sec   5.278G items/sec    55.353          {'family_index': 1, 'per_family_instance_index': 4, 'run_name': 'TakeChunkedChunkedInt64FewMonotonicIndices/524288/0', 'repetitions': 1, 'repetition_index': 0, 'threads': 1, 'iterations': 4603, 'null_percent': 0.0, 'selection_factor': 0.05}
             TakeChunkedFlatInt64FewMonotonicIndices/524288/1   3.019G items/sec   4.508G items/sec    49.291           {'family_index': 5, 'per_family_instance_index': 3, 'run_name': 'TakeChunkedFlatInt64FewMonotonicIndices/524288/1', 'repetitions': 1, 'repetition_index': 0, 'threads': 1, 'iterations': 4000, 'null_percent': 100.0, 'selection_factor': 0.05}
          TakeChunkedChunkedInt64FewMonotonicIndices/524288/1   2.802G items/sec   4.135G items/sec    47.589        {'family_index': 1, 'per_family_instance_index': 3, 'run_name': 'TakeChunkedChunkedInt64FewMonotonicIndices/524288/1', 'repetitions': 1, 'repetition_index': 0, 'threads': 1, 'iterations': 3740, 'null_percent': 100.0, 'selection_factor': 0.05}
          TakeChunkedFlatInt64FewMonotonicIndices/524288/1000   2.316G items/sec   2.910G items/sec    25.628          {'family_index': 5, 'per_family_instance_index': 0, 'run_name': 'TakeChunkedFlatInt64FewMonotonicIndices/524288/1000', 'repetitions': 1, 'repetition_index': 0, 'threads': 1, 'iterations': 3104, 'null_percent': 0.1, 'selection_factor': 0.05}
       TakeChunkedChunkedInt64FewMonotonicIndices/524288/1000   2.254G items/sec   2.684G items/sec    19.075       {'family_index': 1, 'per_family_instance_index': 0, 'run_name': 'TakeChunkedChunkedInt64FewMonotonicIndices/524288/1000', 'repetitions': 1, 'repetition_index': 0, 'threads': 1, 'iterations': 3082, 'null_percent': 0.1, 'selection_factor': 0.05}
            TakeChunkedFlatInt64FewMonotonicIndices/524288/10   2.257G items/sec   2.635G items/sec    16.733           {'family_index': 5, 'per_family_instance_index': 1, 'run_name': 'TakeChunkedFlatInt64FewMonotonicIndices/524288/10', 'repetitions': 1, 'repetition_index': 0, 'threads': 1, 'iterations': 3052, 'null_percent': 10.0, 'selection_factor': 0.05}
         TakeChunkedChunkedInt64FewMonotonicIndices/524288/10   2.179G items/sec   2.466G items/sec    13.175        {'family_index': 1, 'per_family_instance_index': 1, 'run_name': 'TakeChunkedChunkedInt64FewMonotonicIndices/524288/10', 'repetitions': 1, 'repetition_index': 0, 'threads': 1, 'iterations': 2878, 'null_percent': 10.0, 'selection_factor': 0.05}
   TakeChunkedChunkedStringFewRandomIndicesWithNulls/524288/1   5.541G items/sec   5.568G items/sec     0.490 {'family_index': 2, 'per_family_instance_index': 3, 'run_name': 'TakeChunkedChunkedStringFewRandomIndicesWithNulls/524288/1', 'repetitions': 1, 'repetition_index': 0, 'threads': 1, 'iterations': 7344, 'null_percent': 100.0, 'selection_factor': 0.05}
         TakeChunkedChunkedStringFewMonotonicIndices/524288/1   2.753G items/sec   2.736G items/sec    -0.625       {'family_index': 3, 'per_family_instance_index': 3, 'run_name': 'TakeChunkedChunkedStringFewMonotonicIndices/524288/1', 'repetitions': 1, 'repetition_index': 0, 'threads': 1, 'iterations': 3674, 'null_percent': 100.0, 'selection_factor': 0.05}
             TakeChunkedFlatInt64FewMonotonicIndices/524288/2   1.880G items/sec   1.852G items/sec    -1.442            {'family_index': 5, 'per_family_instance_index': 2, 'run_name': 'TakeChunkedFlatInt64FewMonotonicIndices/524288/2', 'repetitions': 1, 'repetition_index': 0, 'threads': 1, 'iterations': 2520, 'null_percent': 50.0, 'selection_factor': 0.05}
          TakeChunkedChunkedInt64FewMonotonicIndices/524288/2   1.818G items/sec   1.766G items/sec    -2.872         {'family_index': 1, 'per_family_instance_index': 2, 'run_name': 'TakeChunkedChunkedInt64FewMonotonicIndices/524288/2', 'repetitions': 1, 'repetition_index': 0, 'threads': 1, 'iterations': 2403, 'null_percent': 50.0, 'selection_factor': 0.05}
TakeChunkedChunkedStringFewRandomIndicesWithNulls/524288/1000 269.164M items/sec 256.848M items/sec    -4.576 {'family_index': 2, 'per_family_instance_index': 0, 'run_name': 'TakeChunkedChunkedStringFewRandomIndicesWithNulls/524288/1000', 'repetitions': 1, 'repetition_index': 0, 'threads': 1, 'iterations': 358, 'null_percent': 0.1, 'selection_factor': 0.05}

----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
Regressions: (17)
----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
                                                   benchmark           baseline          contender  change %                                                                                                                                                                                                                                                  counters
        TakeChunkedChunkedStringFewMonotonicIndices/524288/2 854.187M items/sec 810.956M items/sec    -5.061        {'family_index': 3, 'per_family_instance_index': 2, 'run_name': 'TakeChunkedChunkedStringFewMonotonicIndices/524288/2', 'repetitions': 1, 'repetition_index': 0, 'threads': 1, 'iterations': 1124, 'null_percent': 50.0, 'selection_factor': 0.05}
        TakeChunkedChunkedStringFewMonotonicIndices/524288/0 420.243M items/sec 393.347M items/sec    -6.400          {'family_index': 3, 'per_family_instance_index': 4, 'run_name': 'TakeChunkedChunkedStringFewMonotonicIndices/524288/0', 'repetitions': 1, 'repetition_index': 0, 'threads': 1, 'iterations': 561, 'null_percent': 0.0, 'selection_factor': 0.05}
  TakeChunkedChunkedStringFewRandomIndicesWithNulls/524288/2 751.346M items/sec 700.885M items/sec    -6.716   {'family_index': 2, 'per_family_instance_index': 2, 'run_name': 'TakeChunkedChunkedStringFewRandomIndicesWithNulls/524288/2', 'repetitions': 1, 'repetition_index': 0, 'threads': 1, 'iterations': 982, 'null_percent': 50.0, 'selection_factor': 0.05}
     TakeChunkedChunkedStringFewMonotonicIndices/524288/1000 383.643M items/sec 353.206M items/sec    -7.934       {'family_index': 3, 'per_family_instance_index': 0, 'run_name': 'TakeChunkedChunkedStringFewMonotonicIndices/524288/1000', 'repetitions': 1, 'repetition_index': 0, 'threads': 1, 'iterations': 511, 'null_percent': 0.1, 'selection_factor': 0.05}
  TakeChunkedChunkedStringFewRandomIndicesWithNulls/524288/0 308.601M items/sec 283.361M items/sec    -8.179    {'family_index': 2, 'per_family_instance_index': 4, 'run_name': 'TakeChunkedChunkedStringFewRandomIndicesWithNulls/524288/0', 'repetitions': 1, 'repetition_index': 0, 'threads': 1, 'iterations': 411, 'null_percent': 0.0, 'selection_factor': 0.05}
 TakeChunkedChunkedStringFewRandomIndicesWithNulls/524288/10 327.708M items/sec 300.772M items/sec    -8.220  {'family_index': 2, 'per_family_instance_index': 1, 'run_name': 'TakeChunkedChunkedStringFewRandomIndicesWithNulls/524288/10', 'repetitions': 1, 'repetition_index': 0, 'threads': 1, 'iterations': 440, 'null_percent': 10.0, 'selection_factor': 0.05}
       TakeChunkedChunkedStringFewMonotonicIndices/524288/10 452.054M items/sec 393.971M items/sec   -12.849        {'family_index': 3, 'per_family_instance_index': 1, 'run_name': 'TakeChunkedChunkedStringFewMonotonicIndices/524288/10', 'repetitions': 1, 'repetition_index': 0, 'threads': 1, 'iterations': 602, 'null_percent': 10.0, 'selection_factor': 0.05}
   TakeChunkedChunkedInt64FewRandomIndicesWithNulls/524288/2   1.276G items/sec 509.634M items/sec   -60.060   {'family_index': 0, 'per_family_instance_index': 2, 'run_name': 'TakeChunkedChunkedInt64FewRandomIndicesWithNulls/524288/2', 'repetitions': 1, 'repetition_index': 0, 'threads': 1, 'iterations': 1691, 'null_percent': 50.0, 'selection_factor': 0.05}
      TakeChunkedFlatInt64FewRandomIndicesWithNulls/524288/2   1.323G items/sec 517.354M items/sec   -60.894      {'family_index': 4, 'per_family_instance_index': 2, 'run_name': 'TakeChunkedFlatInt64FewRandomIndicesWithNulls/524288/2', 'repetitions': 1, 'repetition_index': 0, 'threads': 1, 'iterations': 1750, 'null_percent': 50.0, 'selection_factor': 0.05}
  TakeChunkedChunkedInt64FewRandomIndicesWithNulls/524288/10   1.615G items/sec 549.705M items/sec   -65.960  {'family_index': 0, 'per_family_instance_index': 1, 'run_name': 'TakeChunkedChunkedInt64FewRandomIndicesWithNulls/524288/10', 'repetitions': 1, 'repetition_index': 0, 'threads': 1, 'iterations': 2171, 'null_percent': 10.0, 'selection_factor': 0.05}
     TakeChunkedFlatInt64FewRandomIndicesWithNulls/524288/10   1.709G items/sec 553.717M items/sec   -67.595     {'family_index': 4, 'per_family_instance_index': 1, 'run_name': 'TakeChunkedFlatInt64FewRandomIndicesWithNulls/524288/10', 'repetitions': 1, 'repetition_index': 0, 'threads': 1, 'iterations': 2274, 'null_percent': 10.0, 'selection_factor': 0.05}
TakeChunkedChunkedInt64FewRandomIndicesWithNulls/524288/1000   2.026G items/sec 587.265M items/sec   -71.021 {'family_index': 0, 'per_family_instance_index': 0, 'run_name': 'TakeChunkedChunkedInt64FewRandomIndicesWithNulls/524288/1000', 'repetitions': 1, 'repetition_index': 0, 'threads': 1, 'iterations': 2670, 'null_percent': 0.1, 'selection_factor': 0.05}
   TakeChunkedFlatInt64FewRandomIndicesWithNulls/524288/1000   2.078G items/sec 588.468M items/sec   -71.680    {'family_index': 4, 'per_family_instance_index': 0, 'run_name': 'TakeChunkedFlatInt64FewRandomIndicesWithNulls/524288/1000', 'repetitions': 1, 'repetition_index': 0, 'threads': 1, 'iterations': 2771, 'null_percent': 0.1, 'selection_factor': 0.05}
   TakeChunkedChunkedInt64FewRandomIndicesWithNulls/524288/0   3.429G items/sec 717.438M items/sec   -79.076    {'family_index': 0, 'per_family_instance_index': 4, 'run_name': 'TakeChunkedChunkedInt64FewRandomIndicesWithNulls/524288/0', 'repetitions': 1, 'repetition_index': 0, 'threads': 1, 'iterations': 4639, 'null_percent': 0.0, 'selection_factor': 0.05}
      TakeChunkedFlatInt64FewRandomIndicesWithNulls/524288/0   3.582G items/sec 721.296M items/sec   -79.861       {'family_index': 4, 'per_family_instance_index': 4, 'run_name': 'TakeChunkedFlatInt64FewRandomIndicesWithNulls/524288/0', 'repetitions': 1, 'repetition_index': 0, 'threads': 1, 'iterations': 4799, 'null_percent': 0.0, 'selection_factor': 0.05}
   TakeChunkedChunkedInt64FewRandomIndicesWithNulls/524288/1   3.873G items/sec 759.915M items/sec   -80.381  {'family_index': 0, 'per_family_instance_index': 3, 'run_name': 'TakeChunkedChunkedInt64FewRandomIndicesWithNulls/524288/1', 'repetitions': 1, 'repetition_index': 0, 'threads': 1, 'iterations': 5316, 'null_percent': 100.0, 'selection_factor': 0.05}
      TakeChunkedFlatInt64FewRandomIndicesWithNulls/524288/1   4.173G items/sec 761.551M items/sec   -81.751     {'family_index': 4, 'per_family_instance_index': 3, 'run_name': 'TakeChunkedFlatInt64FewRandomIndicesWithNulls/524288/1', 'repetitions': 1, 'repetition_index': 0, 'threads': 1, 'iterations': 5385, 'null_percent': 100.0, 'selection_factor': 0.05}

Putting aside the String perf changes which are minor and probably irrelevant, we can see that on monotonic indices, this PR actually increases performance, while still hurting it on random indices.

@felipecrv
Contributor Author

...while still hurting it on random indices.

The binary searches for chunk index are a costly part of the process that is avoided when value chunks are concatenated.

I have ideas to make the search interleaved so that multiple binary searches can run in parallel (i.e. likely sharing the same cache-line accesses to make progress).

@felipecrv
Contributor Author

felipecrv commented Oct 7, 2024

@pitrou wouldn't it make sense to keep the responsibility for concatenation to a layer above the kernels? Like a query optimizer? They are in a better position to make memory/time trade-offs than the context-less kernel.

The worst regression (-81%) has the kernel still at 4G items/sec.

TakeChunkedFlatInt64FewRandomIndicesWithNulls/524288/1   4.173G items/sec 761.551M items/sec   -81.751

I find it very inelegant to put these heuristics at the compute kernel level.

Imagine a pipeline trying to save on memory allocations by keeping the array chunked as much as possible, and then a simple filter operation suddenly requires allocating enough memory to hold everything contiguously.

Another case would be a pipeline where the caller is consolidating a big contiguous array for more operations than just array_take. They should be the ones concatenating.

@pitrou
Member

pitrou commented Oct 7, 2024

@pitrou wouldn't it make sense to keep the responsibility for concatenation to a layer above the kernels? Like a query optimizer? They are in a better position to make memory/time trade-offs than the context-less kernel.

Ideally perhaps. In practice this assumes that 1) there is a query optimizer 2) it has enough information about implementation details to make an informed decision.

In practice, take is probably often called directly in the context of PyArrow's sort methods, as in this snippet from arrow/python/pyarrow/table.pxi (lines 1139 to 1143 in e62fbaa):

    indices = _pc().sort_indices(
        self,
        options=_pc().SortOptions(sort_keys=[("", order)], **kwargs)
    )
    return self.take(indices)

The worst regression (-81%) has the kernel still at 4G items/sec.

I might be misreading, but this is the worst regression on the new benchmarks, right (those with a small selection factor)? On the benchmarks with a 1.0 selection factor (such as when sorting), the worst absolute results are around 25 Mitems/sec AFAICT. Or are those results obsolete?

Imagine a pipeline trying to save on memory allocations by keeping the array chunked as much as possible and then a simple filter operation requires allocating enough memory to keep it all in memory.

Well, currently "take" would always concatenate array chunks, so at least there is no regression in that regard.

Still, I understand the concern. We might want to expose an additional option to entirely disable concatenation when possible. But that might be overkill as well.

We will ensure "array_take" returns a ChunkedArray if at least one input is chunked, just like "take" does, even when the output fits in a single chunk.

…::exec_chunked

Before this commit, only the "take" meta function could handle ChunkedArray parameters. This is not a time-saver yet because in TakeCC kernels, every call to TakeCA will create a new ValuesSpan instance, but this will change in the next commits.