Arrow chunk_size as keyword argument #3084
Conversation
@mewim what do you think about moving the
Codecov Report: all modified and coverable lines are covered by tests ✅

Additional details and impacted files:

```
@@           Coverage Diff           @@
##           master    #3084   +/-  ##
======================================
  Coverage   92.71%   92.71%
======================================
  Files        1162     1162
  Lines       43140    43140
======================================
+ Hits        39997    39999      +2
+ Misses       3143     3141      -2
```
I just made those changes and built & tested locally, works well 👌🏽.
FYI: I believe they were specified this way in order to validate result acquisition across chunk boundaries. Also, the meaning of an integer `chunk_size` would change. As per #2998 (comment), you could instead allow opt-in to the adaptive behaviour by supporting a dedicated sentinel value. This would allow for adaptive behaviour without changing the meaning of integer `chunk_size` values.
Hi @alexander-beedie, great points.
In any case, I think we're converging towards an agreeable and better solution, so thanks again :)
I would probably keep the chunk size as referring to the number of rows, and use an adaptive chunk size only if the chunk size is set to None / 0 / -1.
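The resolution rule proposed above can be sketched as a small helper. This is a hypothetical illustration of the discussed semantics, not Kùzu's actual implementation; the names `resolve_chunk_size` and `DEFAULT_CHUNK_SIZE` are invented here:

```python
# Hypothetical sketch of the chunk_size semantics discussed above:
# positive integers keep their meaning of "rows per record batch",
# while None / 0 / -1 opt in to an adaptive size derived from the
# result's row count.

DEFAULT_CHUNK_SIZE = 1_000_000  # mirrors the 1M default mentioned in the PR

def resolve_chunk_size(chunk_size, num_rows):
    """Return the effective number of rows per Arrow record batch."""
    if chunk_size in (None, 0, -1):
        # Adaptive: one batch covering the whole result, capped at the default.
        return max(1, min(num_rows, DEFAULT_CHUNK_SIZE))
    if chunk_size < 0:
        raise ValueError("chunk_size must be a positive int, or None / 0 / -1")
    return chunk_size  # explicit row count: semantics unchanged

print(resolve_chunk_size(None, 42))   # adaptive -> 42
print(resolve_chunk_size(1000, 42))   # explicit -> 1000
```

Note that explicit small values remain valid, which keeps the existing chunk-boundary tests meaningful.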
@mewim I reworked it according to your latest comment. What do you think?
I'd reserve `-1` for producing an unchunked result. Just that the way the PR was initially written changed the meaning of integer `chunk_size` values.
@alexander-beedie how does -1 produce an "unchunked" result? To my understanding arrow will always return record batches, i.e., chunks of records. The cases where it's set to None / 0 are clear to me. Could you maybe post a snippet of how you'd use it here?
It doesn't/can't at the moment, as the current low-level code will always produce chunked results (the adaptive mode improves the situation, but isn't a guarantee). The idea would be to always produce a single chunk that represents the complete result set. We'd use such a mode in Polars, as otherwise we typically rechunk Arrow data that arrives with n_chunks > 1 (it's more optimal for us to operate on contiguous data).
Hmm, so it's a polars-specific setting where we'd document that setting `chunk_size` to -1 produces a single chunk containing the full result?
I think c26fc74 addresses your comments @alexander-beedie. We can use the method there to combine the chunks into one. This will have to be documented carefully, though!
It's a no-op unless n_chunks > 1, so no need to set it explicitly; we can leave it as-is.
@prrao87 Note that when merging to master, the commits need to be squashed into one.
Shall I go ahead and squash-merge?
Yeah, I think it is ready to merge.
@prrao87: Somehow I completely missed this method; great solution ✌️😎
Closes #2998.
As per @ray6080's comment, setting the arrow record batch size (which we call `chunk_size`) to 1M is a reasonable default because DuckDB does the same. In the majority of cases a larger record batch size is favourable, and the user can always bring down the `chunk_size` if necessary.
I also fixed the pyarrow tests so they no longer specify the `chunk_size` argument as low integer values.