Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Pandas pyarrow backend #3058

Merged
merged 7 commits into from
Mar 22, 2024
Merged

Pandas pyarrow backend #3058

merged 7 commits into from
Mar 22, 2024

Conversation

mxwli
Copy link
Contributor

@mxwli mxwli commented Mar 14, 2024

Progress:

Scanning Implementation Testing Notes
Primitives (eg int) Partial Partial Everything done besides custom bitwidth decimals
Temporals (eg timestamp) Partial Partial Everything is done besides timezone processing
BLOB Done Done View testing could use improvement
String Done Done View testing could use improvement
List Done Done View testing could use improvement
Structs Done WIP Null is handled poorly by pandas. Testing will require more effort as a result
Union Done WIP
Maps Not Done Not Done Storage Layout Specifications Not Found
Dictionary Encoding Done Done
Run-End Encoding Done WIP complex types are not supported by run-end encoding

Areas for General Improvement

  • all null/nonnull flag for null mask trees
  • zero copying validity buffer for null mask trees
  • inefficient scans (one by one scanning occurs in run-end scanning, dictionary scanning, and union scanning)
  • Since the dictionary array can be shared across multiple chunks, we can get away with scanning it once for all chunks. This is currently not done. The implementation here can be tricky.
  • overflow checking for temporal type scanning (eg. when converting from seconds to microseconds, our values can overflow)
  • validity checking can have some edge causes (eg. when a struct is null, its actual contents in storage can be arbritrary. Therefore if we attempt to treat that address as semantically valid, we can run into errors.) This is untested because it involves the direct manipulation of physical buffers

Extra changes:

  • ValueVector::getValue previously returned non const reference to Value, despite being a const method. It now returns a const reference if the class is const, and a non const reference if the class is non const
  • ValueVector::setNullFromBits now has invert parameter that allows for copying of inverted bitmaps.

Edit: I forgot to mention that most of the tests are randomly generated. However, the random number generator is seeded, so the data generation is deterministic.

Edit 2: More tests will be added in future PRs.

Copy link
Collaborator

@acquamarin acquamarin left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As we discussed, we should resolve the type of the dataframe during compile time(in replacement function) instead of run time.

@mxwli mxwli force-pushed the pandas-pyarrow-backend branch 3 times, most recently from add4965 to e281d5c Compare March 14, 2024 21:04
Copy link

codecov bot commented Mar 14, 2024

Codecov Report

Attention: Patch coverage is 49.50071% with 354 lines in your changes are missing coverage. Please review.

Project coverage is 91.91%. Comparing base (6d39076) to head (cb4d757).

Files Patch % Lines
src/common/arrow/arrow_array_scan.cpp 35.52% 216 Missing ⚠️
src/common/arrow/arrow_null_mask_tree.cpp 28.15% 74 Missing ⚠️
src/common/arrow/arrow_type.cpp 45.16% 51 Missing ⚠️
tools/python_api/src_cpp/pyarrow/pyarrow_scan.cpp 86.66% 6 Missing ⚠️
src/binder/bind/bind_reading_clause.cpp 92.50% 3 Missing ⚠️
src/include/common/arrow/arrow_nullmask_tree.h 33.33% 2 Missing ⚠️
src/include/function/table_functions.h 0.00% 1 Missing ⚠️
tools/python_api/src_cpp/pandas/pandas_scan.cpp 96.42% 1 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##           master    #3058      +/-   ##
==========================================
- Coverage   92.61%   91.91%   -0.70%     
==========================================
  Files        1161     1169       +8     
  Lines       43107    43736     +629     
==========================================
+ Hits        39923    40202     +279     
- Misses       3184     3534     +350     

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

Copy link
Contributor

@ray6080 ray6080 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I still see quite a few types missing test coverage, can you see if you can add more tests?
@acquamarin should take a look at changes on function side.

tools/python_api/src_cpp/py_database.cpp Outdated Show resolved Hide resolved
src/include/common/vector/value_vector.h Show resolved Hide resolved
src/common/arrow/arrow_type.cpp Outdated Show resolved Hide resolved
src/common/arrow/arrow_type.cpp Outdated Show resolved Hide resolved
src/common/arrow/arrow_type.cpp Outdated Show resolved Hide resolved
src/common/arrow/arrow_array_scan.cpp Outdated Show resolved Hide resolved
src/common/arrow/arrow_array_scan.cpp Outdated Show resolved Hide resolved
src/include/common/arrow/arrow_scan.h Outdated Show resolved Hide resolved
tools/python_api/test/test_df_pyarrow.py Outdated Show resolved Hide resolved
src/common/arrow/arrow_array_scan.cpp Outdated Show resolved Hide resolved
@mxwli
Copy link
Contributor Author

mxwli commented Mar 18, 2024

I still see quite a few types missing test coverage, can you see if you can add more tests? @acquamarin should take a look at changes on function side.

Sure. I'm going to try making every relevant change in this PR, though some changes may have to be implemented in a future PR.

@mxwli mxwli force-pushed the pandas-pyarrow-backend branch 3 times, most recently from 2e49f63 to ab267a1 Compare March 21, 2024 18:31
Makefile Outdated Show resolved Hide resolved
src/common/arrow/arrow_array_scan.cpp Outdated Show resolved Hide resolved
src/common/arrow/arrow_array_scan.cpp Outdated Show resolved Hide resolved
src/common/arrow/arrow_array_scan.cpp Outdated Show resolved Hide resolved
src/common/arrow/arrow_array_scan.cpp Outdated Show resolved Hide resolved
tools/python_api/src_cpp/include/pyarrow/pyarrow_scan.h Outdated Show resolved Hide resolved
tools/python_api/src_cpp/include/pyarrow/pyarrow_scan.h Outdated Show resolved Hide resolved
tools/python_api/src_cpp/pandas/pandas_scan.cpp Outdated Show resolved Hide resolved
tools/python_api/src_cpp/pyarrow/pyarrow_scan.cpp Outdated Show resolved Hide resolved
andyfengHKU and others added 7 commits March 22, 2024 11:38
formatting checks

remove disabled tests

more clang format fixes...

py lint check

clang tidy

more clang tidy and py lint checks

more and more clang tidy

explicit pyarrow scan ctor

possibly fixed tests not running?

CI fixes

fix pytest

non portable type resolution solution

apple clang test fix?

add some requested changes

apply backend switching

remove fixed list

update httpfs version & remove python-debug

clang tidy

fix array arithmetic on void

revert extension version change

apply clang-tidy

fix compiler errors

... more clang-tidy fixes

added requested changes

clang-tidy
... thanks clang
@mxwli mxwli merged commit d4b261b into master Mar 22, 2024
24 of 25 checks passed
@mxwli mxwli deleted the pandas-pyarrow-backend branch March 22, 2024 16:35
@andyfengHKU andyfengHKU mentioned this pull request Apr 15, 2024
32 tasks
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants