Rework npy copy to integrate with query processor pipeline #1734

Merged
merged 1 commit into from
Jul 8, 2023

aziz-mu (Contributor) commented Jun 28, 2023

This PR implements #1670 by removing the NPY-specific reader classes that are no longer needed, adding a read_npy operator to match the existing read_csv and read_parquet operators, and changing CopyNode so that copying can still be done column by column.
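
To make the column-by-column idea concrete, here is a minimal Python sketch of a reader stage that hands fixed-size batches to a copy stage, one column at a time. It is only an illustration of the general pattern, not the code in this PR: `read_column_batches`, `copy_columns`, and the `BATCH_SIZE` value are hypothetical names and assumptions introduced for the example.

```python
from typing import Iterator, List, Sequence, Tuple

BATCH_SIZE = 2048  # illustrative batch size, an assumption of this sketch


def read_column_batches(
    values: Sequence[int], batch_size: int = BATCH_SIZE
) -> Iterator[Tuple[int, Sequence[int]]]:
    """Hypothetical reader stage: yield (start_row, batch) for one column."""
    for start in range(0, len(values), batch_size):
        yield start, values[start:start + batch_size]


def copy_columns(
    columns: Sequence[Sequence[int]], batch_size: int = BATCH_SIZE
) -> List[List[int]]:
    """Hypothetical copy stage: fill the destination one column at a time."""
    if not columns:
        return []
    num_rows = len(columns[0])
    if any(len(col) != num_rows for col in columns):
        raise ValueError("all columns must have the same number of rows")
    # Pre-size the destination so each batch can be written at its row offset.
    table: List[List[int]] = [[0] * num_rows for _ in columns]
    for col_idx, values in enumerate(columns):
        for start, batch in read_column_batches(values, batch_size):
            table[col_idx][start:start + len(batch)] = batch
    return table
```

Because each batch carries its starting row offset, the copy stage does not depend on batches arriving in order, which is one plausible reason to pair a batch-producing reader with a per-column sink.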

src/include/processor/operator/physical_operator.h (outdated, resolved)
src/processor/mapper/map_ddl.cpp (outdated, resolved)
src/processor/operator/copy/read_npy.cpp (outdated, resolved)
src/processor/operator/copy/copy_node.cpp (outdated, resolved)
src/processor/operator/copy/copy_node.cpp (outdated, resolved)
src/processor/operator/copy/copy_node.cpp (outdated, resolved)

aziz-mu (Contributor, Author) commented Jul 6, 2023

Note that there is still a bug in this PR: reading large (>2048 rows), multidimensional (e.g. a column of type INT32[10]) .npy files causes an error. I'm currently working on a fix, but if it's urgent to merge this PR so that it can integrate with the storage changes for the next release, I propose removing the failing test (which I've already done) and opening an issue so the bug gets fixed soon. Happy to discuss this further.
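
For anyone who wants to reproduce a file of the shape described above, the following Python sketch writes a version 1.0 .npy file with more than 2048 rows of 10 int32 values each, then reads it back in row batches. It uses only the standard library and follows the published NPY 1.0 layout; the function names, the 5000-row default, and the 2048-row batch size are assumptions chosen for illustration, and none of this is the reader used in the PR.

```python
import ast
import os
import struct
import tempfile
from typing import Iterator, List

MAGIC = b"\x93NUMPY"


def write_npy_int32(path: str, rows: List[List[int]]) -> None:
    """Write a 2-D list of ints as a version 1.0 .npy file with dtype '<i4'."""
    width = len(rows[0]) if rows else 0
    header = "{'descr': '<i4', 'fortran_order': False, 'shape': (%d, %d), }" % (
        len(rows), width)
    # Pad with spaces so magic + version + length + header is a multiple of 64.
    pad = -(len(MAGIC) + 2 + 2 + len(header) + 1) % 64
    header_bytes = (header + " " * pad + "\n").encode("latin1")
    with open(path, "wb") as f:
        f.write(MAGIC + bytes([1, 0]))                 # magic, version 1.0
        f.write(struct.pack("<H", len(header_bytes)))  # little-endian header length
        f.write(header_bytes)
        for row in rows:
            f.write(struct.pack("<%di" % width, *row))


def read_npy_batches(path: str, batch_rows: int = 2048) -> Iterator[List[List[int]]]:
    """Yield lists of rows, at most batch_rows rows per batch."""
    with open(path, "rb") as f:
        if f.read(6) != MAGIC:
            raise ValueError("not an .npy file")
        major, _minor = f.read(2)
        if major != 1:
            raise ValueError("this sketch only handles format version 1.x")
        (header_len,) = struct.unpack("<H", f.read(2))
        header = ast.literal_eval(f.read(header_len).decode("latin1"))
        if header["descr"] != "<i4" or header["fortran_order"]:
            raise ValueError("this sketch only handles C-ordered little-endian int32")
        num_rows, width = header["shape"]
        remaining = num_rows
        while remaining > 0:
            n = min(batch_rows, remaining)
            flat = struct.unpack("<%di" % (n * width), f.read(n * width * 4))
            yield [list(flat[i * width:(i + 1) * width]) for i in range(n)]
            remaining -= n


def make_repro_file(num_rows: int = 5000, width: int = 10) -> str:
    """Create a temporary .npy file with num_rows rows of `width` int32 values."""
    fd, path = tempfile.mkstemp(suffix=".npy")
    os.close(fd)
    write_npy_int32(path, [[r * width + c for c in range(width)]
                           for r in range(num_rows)])
    return path
```

A file produced this way crosses the 2048-row mark with a fixed-width list column, which matches the conditions reported for the failure; whether it triggers the same error depends on details not stated in this thread.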

@aziz-mu aziz-mu marked this pull request as ready for review July 6, 2023 20:26

mewim (Collaborator) commented Jul 6, 2023

@aziz-mu Let's wait until the bug is fixed. The main use case of NPY copy is currently handling large, high-dimensional data files for PyG workloads. If there is a bug in reading large multidimensional files, this feature will not be very useful.

ray6080 (Contributor) left a comment

These changes should fix the failing tests.

src/processor/operator/copy/read_npy.cpp (outdated, resolved)
test/test_files/tinysnb/explain/explain.test (outdated, resolved)
test/test_files/copy/copy_npy_large.test (outdated, resolved)
src/storage/in_mem_storage_structure/in_mem_column.cpp (outdated, resolved)
src/processor/operator/copy/copy_node.cpp (outdated, resolved)
src/storage/copier/npy_reader.cpp (resolved)
src/include/processor/operator/copy/read_npy.h (outdated, resolved)
src/processor/mapper/map_ddl.cpp (outdated, resolved)
src/storage/copier/npy_reader.cpp (resolved)
src/processor/operator/copy/read_npy.cpp (outdated, resolved)

codecov bot commented Jul 7, 2023

Codecov Report

Patch coverage: 97.78% and project coverage change: +0.13 🎉

Comparison is base (97e3b8e) 90.92% compared to head (c01d26f) 91.05%.

❗ Current head c01d26f differs from pull request most recent head ee9a98c. Consider uploading reports for the commit ee9a98c to get more accurate results

Additional details and impacted files
@@            Coverage Diff             @@
##           master    #1734      +/-   ##
==========================================
+ Coverage   90.92%   91.05%   +0.13%     
==========================================
  Files         774      773       -1     
  Lines       28371    28311      -60     
==========================================
- Hits        25795    25779      -16     
+ Misses       2576     2532      -44     
| Impacted Files | Coverage Δ |
|---|---|
| src/common/vector/value_vector.cpp | 100.00% <ø> (ø) |
| src/include/main/connection.h | 100.00% <ø> (ø) |
| ...gical_plan/logical_operator/logical_create_table.h | 100.00% <ø> (ø) |
| src/include/processor/operator/copy/read_file.h | 100.00% <ø> (+8.33%) ⬆️ |
| src/include/processor/operator/physical_operator.h | 100.00% <ø> (ø) |
| src/include/processor/physical_plan.h | 100.00% <ø> (ø) |
| src/processor/mapper/map_ddl.cpp | 100.00% <ø> (+2.22%) ⬆️ |
| src/storage/copier/npy_reader.cpp | 90.67% <92.30%> (-0.07%) ⬇️ |
| src/processor/mapper/map_copy.cpp | 95.71% <95.71%> (ø) |
| src/processor/operator/copy/copy_node.cpp | 97.02% <96.96%> (+0.09%) ⬆️ |
| ... and 10 more | |

... and 5 files with indirect coverage changes


@ray6080 ray6080 force-pushed the npy-copy branch 3 times, most recently from 5c6599c to 13a4f59 Compare July 8, 2023 11:18
@ray6080 ray6080 changed the title from "Npy copy" to "Rework npy copy to integrate query processor pipeline" Jul 8, 2023
@ray6080 ray6080 changed the title from "Rework npy copy to integrate query processor pipeline" to "Rework npy copy to integrate with query processor pipeline" Jul 8, 2023
@ray6080 ray6080 merged commit bfb4fc6 into master Jul 8, 2023
7 checks passed
@ray6080 ray6080 deleted the npy-copy branch July 8, 2023 12:37
3 participants