Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Pandas Pyarrow Backend Bugfix and Tests #3152

Merged
merged 5 commits into from
Mar 27, 2024
Merged

Conversation

mxwli
Copy link
Collaborator

@mxwli mxwli commented Mar 26, 2024

Changes:

  • We no longer own the dataframe in pyarrow scan so that we don't sometimes segfault on exit
  • Null mask logic for unions & lists has been corrected
  • Add tests for lists

MAP scanning will be added in the next PR.

Copy link

codecov bot commented Mar 26, 2024

Codecov Report

Attention: Patch coverage is 66.66667% with 6 lines in your changes are missing coverage. Please review.

Project coverage is 92.08%. Comparing base (cb4d757) to head (fc0f7a8).
Report is 26 commits behind head on master.

Files Patch % Lines
src/common/arrow/arrow_null_mask_tree.cpp 28.57% 5 Missing ⚠️
src/common/arrow/arrow_array_scan.cpp 80.00% 1 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##           master    #3152      +/-   ##
==========================================
+ Coverage   91.91%   92.08%   +0.16%     
==========================================
  Files        1169     1168       -1     
  Lines       43736    44065     +329     
==========================================
+ Hits        40202    40576     +374     
+ Misses       3534     3489      -45     

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

@mxwli mxwli requested a review from acquamarin March 27, 2024 16:07
@@ -52,7 +52,7 @@ void ListAuxiliaryBuffer::resizeDataVector(ValueVector* dataVector) {
auto buffer = std::make_unique<uint8_t[]>(capacity * dataVector->getNumBytesPerValue());
memcpy(buffer.get(), dataVector->valueBuffer.get(), size * dataVector->getNumBytesPerValue());
dataVector->valueBuffer = std::move(buffer);
dataVector->nullMask->resize(capacity);
dataVector->nullMask->resize(capacity); // note: allocating 64 times what is needed
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i am a little bit confused about the comment? What do you mean by 64times?

Copy link
Collaborator Author

@mxwli mxwli Mar 27, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

capacity refers to the number of values inside our vector. However, to nullMask, capacity refers to the number of uint64_ts it should allocate to the buffer. NullMask is a bitmap, so directly resizing to capacity will allocate 64 times the number of bits necessary.

Update: I applied the change locally and ran tests. They passed.

@mxwli mxwli merged commit 73ed1ea into master Mar 27, 2024
16 of 17 checks passed
@mxwli mxwli deleted the pandas-pyarrow-backend branch March 27, 2024 17:49
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

2 participants