Pandas Pyarrow Backend Bugfix and Tests #3152

mxwli · 2024-03-26T22:00:15Z

Changes:

The pyarrow scan no longer takes ownership of the dataframe, which fixes an intermittent segfault on exit.
Corrected the null mask logic for unions and lists.
Added tests for lists.

MAP scanning will be added in the next PR.

codecov · 2024-03-26T22:15:19Z

Codecov Report

Attention: Patch coverage is 66.66667%, with 6 lines in your changes missing coverage. Please review.

Project coverage is 92.08%. Comparing base (cb4d757) to head (fc0f7a8).
Report is 26 commits behind head on master.

Files	Patch %	Lines
src/common/arrow/arrow_null_mask_tree.cpp	28.57%	5 Missing ⚠️
src/common/arrow/arrow_array_scan.cpp	80.00%	1 Missing ⚠️

Additional details and impacted files

@@            Coverage Diff             @@
##           master    #3152      +/-   ##
==========================================
+ Coverage   91.91%   92.08%   +0.16%     
==========================================
  Files        1169     1168       -1     
  Lines       43736    44065     +329     
==========================================
+ Hits        40202    40576     +374     
+ Misses       3534     3489      -45

acquamarin · 2024-03-27T16:16:19Z

src/common/vector/auxiliary_buffer.cpp

@@ -52,7 +52,7 @@ void ListAuxiliaryBuffer::resizeDataVector(ValueVector* dataVector) {
    auto buffer = std::make_unique<uint8_t[]>(capacity * dataVector->getNumBytesPerValue());
    memcpy(buffer.get(), dataVector->valueBuffer.get(), size * dataVector->getNumBytesPerValue());
    dataVector->valueBuffer = std::move(buffer);
-    dataVector->nullMask->resize(capacity);
+    dataVector->nullMask->resize(capacity); // note: allocating 64 times what is needed


I am a little confused by this comment. What do you mean by 64 times?

capacity is the number of values in the vector. nullMask->resize, however, interprets its argument as the number of uint64_t entries to allocate for its buffer. Since NullMask is a bitmap with one bit per value, each uint64_t already covers 64 values, so resizing directly to capacity allocates 64 times as many bits as necessary.

Update: I applied the change locally and ran tests. They passed.
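
To illustrate the arithmetic in the reply above, here is a minimal sketch of a bitmap null mask whose resize takes a count of uint64_t entries. SketchNullMask, numEntriesFor, and the other member names are hypothetical and written only for this example; this is not the project's actual NullMask class, and it assumes one bit per value packed into 64-bit entries.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Hypothetical stand-in for a bitmap null mask (assumption: one bit per value,
// packed into uint64_t entries). Not the project's real NullMask.
class SketchNullMask {
public:
    static constexpr uint64_t NUM_BITS_PER_ENTRY = 64;

    // Number of uint64_t entries needed to hold one bit for each of numValues values.
    static uint64_t numEntriesFor(uint64_t numValues) {
        return (numValues + NUM_BITS_PER_ENTRY - 1) / NUM_BITS_PER_ENTRY;
    }

    // Resize in units of uint64_t entries, not values. New entries start as "not null".
    void resize(uint64_t numEntries) { entries.resize(numEntries, 0); }

    void setNull(uint64_t pos, bool isNull) {
        const uint64_t bit = uint64_t{1} << (pos % NUM_BITS_PER_ENTRY);
        if (isNull) {
            entries[pos / NUM_BITS_PER_ENTRY] |= bit;
        } else {
            entries[pos / NUM_BITS_PER_ENTRY] &= ~bit;
        }
    }

    bool isNull(uint64_t pos) const {
        const uint64_t bit = uint64_t{1} << (pos % NUM_BITS_PER_ENTRY);
        return (entries[pos / NUM_BITS_PER_ENTRY] & bit) != 0;
    }

    uint64_t numEntries() const { return entries.size(); }
    uint64_t numBytes() const { return entries.size() * sizeof(uint64_t); }

private:
    std::vector<uint64_t> entries;
};
```

Under these assumptions, a vector with a capacity of 2048 values needs numEntriesFor(2048) = 32 entries (256 bytes), whereas calling resize(2048) directly allocates 2048 entries (16384 bytes), which is the 64x over-allocation the code comment points out.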

mxwli added 3 commits March 26, 2024 17:22

add more test coverage & fixes to pyarrow

c18bff6

formatting fixes

2f05fd5

clang-tidy

cdf6113

clang fix

fba6bce

mxwli requested a review from acquamarin March 27, 2024 16:07

add missing GIL acquire

fc0f7a8

acquamarin approved these changes Mar 27, 2024

mxwli merged commit 73ed1ea into master Mar 27, 2024
16 of 17 checks passed

mxwli deleted the pandas-pyarrow-backend branch March 27, 2024 17:49
