Support Parquet filetype on COPY TO #1893

Merged on Sep 18, 2023 (7 commits)

Conversation

@rfdavid (Collaborator) commented Aug 5, 2023

Implementation

This PR extends #1716 to support copying query results to .parquet files. The Arrow library offers three main ways to write Parquet files, in decreasing order of abstraction: StreamWriter, WriteTable, and WriteBatch (reference here). StreamWriter doesn't support nested types, so it doesn't suit our needs. WriteTable provides builders for logical data types (e.g., an int64 builder), whereas the low-level API uses WriteBatch to write batches of values, and the repetition and definition levels [1] [2] must be calculated in our code. This PR concentrates on the low-level API, which gives us more flexibility in the implementation and is a step toward creating our own Parquet writer.
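
To make the repetition/definition level calculation concrete, here is a minimal standalone sketch for a single nullable list-of-nullable-int64 column. It is not the code in this PR and does not call Arrow; `LeveledColumn`, `Int64List`, and `computeLevels` are hypothetical names used only for illustration. It assumes the usual three-level list schema (optional list > repeated group > optional element), so the maximum definition level is 3 and the maximum repetition level is 1.

```cpp
#include <cassert>
#include <cstdint>
#include <optional>
#include <vector>

// Hypothetical container: the three parallel buffers a low-level
// column write needs (levels for every slot, values only when present).
struct LeveledColumn {
    std::vector<int16_t> defLevels;
    std::vector<int16_t> repLevels;
    std::vector<int64_t> values;
};

// A nullable element inside a list.
using Int64List = std::vector<std::optional<int64_t>>;

// Assumed schema: optional list > repeated group > optional int64 element.
// Definition level: 0 = list is null, 1 = list is empty,
//                   2 = element is null, 3 = element is present.
// Repetition level: 0 = first slot of a new row, 1 = continues the same list.
LeveledColumn computeLevels(const std::vector<std::optional<Int64List>>& rows) {
    LeveledColumn out;
    for (const auto& row : rows) {
        if (!row.has_value()) {
            // Null list: one slot, nothing defined below the root.
            out.defLevels.push_back(0);
            out.repLevels.push_back(0);
            continue;
        }
        if (row->empty()) {
            // Empty list: the list itself is defined, but it has no elements.
            out.defLevels.push_back(1);
            out.repLevels.push_back(0);
            continue;
        }
        bool first = true;
        for (const auto& element : *row) {
            out.repLevels.push_back(first ? 0 : 1);
            first = false;
            if (element.has_value()) {
                out.defLevels.push_back(3);
                out.values.push_back(*element);
            } else {
                // Null element: a level is recorded, but no value is stored.
                out.defLevels.push_back(2);
            }
        }
    }
    // Every slot gets exactly one definition and one repetition level.
    assert(out.defLevels.size() == out.repLevels.size());
    return out;
}
```

For the rows `[1, null, 3]`, `null`, `[]`, `[4]`, this yields definition levels `3 2 3 0 1 3`, repetition levels `0 1 1 0 0 0`, and values `1 3 4`. Deeper nesting (e.g. a list inside a struct) adds one definition level per optional or repeated ancestor and one repetition level per repeated ancestor.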

Support COPY (...) TO 'out.parquet'

To be addressed:

  • Implement the interval, union, internal_id, fixed_list, and map data types
  • Only int64 works inside var_list and struct; the other types must still be implemented
  • Fix a bug inside List>Struct
  • Test on large datasets and check how flushing actually behaves
  • Improve the schema by marking more fields as required, which saves bits in the definition levels
  • Implement BufferedRowGroup
  • Handle null values inside a nested data type
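
On the required-fields item above, a rough sketch of why it saves bits: the maximum definition level of a leaf column counts its non-required ancestors (including itself), and the level's bit width grows with that maximum. The names below (`Repetition`, `MaxLevels`, `maxLevels`, `bitWidth`) are hypothetical and not taken from this PR or from Arrow; a real writer also run-length encodes the levels, so actual savings depend on the data.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Hypothetical mirror of the three Parquet field repetition kinds.
enum class Repetition { Required, Optional, Repeated };

struct MaxLevels {
    int16_t definition;
    int16_t repetition;
};

// `path` lists the repetition kind of every node from the top-level
// field down to the leaf column.
MaxLevels maxLevels(const std::vector<Repetition>& path) {
    MaxLevels levels{0, 0};
    for (Repetition r : path) {
        // Optional and repeated nodes each add one definition level.
        if (r != Repetition::Required) {
            levels.definition++;
        }
        // Only repeated nodes add a repetition level.
        if (r == Repetition::Repeated) {
            levels.repetition++;
        }
    }
    return levels;
}

// Number of bits needed to represent any level in [0, maxLevel].
int bitWidth(int16_t maxLevel) {
    assert(maxLevel >= 0);
    int bits = 0;
    while ((1 << bits) <= maxLevel) {
        bits++;
    }
    return bits;
}
```

For example, a list column declared as optional > repeated > optional has a maximum definition level of 3 (2 bits per slot before run-length encoding), while required > repeated > required has a maximum of 1 (1 bit). A flat required column has a maximum of 0 and needs no definition levels at all.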

@rfdavid changed the title from "Support Parquet files on COPY TO" to "Support Parquet filetype on COPY TO" on Aug 5, 2023
codecov bot commented Aug 11, 2023

Codecov Report

Patch coverage: 84.01% and project coverage change: +0.04% 🎉

Comparison is base (314934a) 90.20% compared to head (ea41a5f) 90.24%.

Additional details and impacted files
@@            Coverage Diff             @@
##           master    #1893      +/-   ##
==========================================
+ Coverage   90.20%   90.24%   +0.04%     
==========================================
  Files         945      950       +5     
  Lines       33561    33811     +250     
==========================================
+ Hits        30274    30514     +240     
- Misses       3287     3297      +10     
Files Changed | Coverage Δ
src/include/common/types/types.h | 100.00% <ø> (ø)
...ssor/operator/persistent/parquet_column_writer.cpp | 77.77% <77.77%> (ø)
.../processor/operator/persistent/csv_file_writer.cpp | 85.07% <85.18%> (+1.20%) ⬆️
...cessor/operator/persistent/parquet_file_writer.cpp | 85.36% <85.36%> (ø)
...rc/include/processor/operator/persistent/copy_to.h | 93.33% <92.30%> (-6.67%) ⬇️
src/binder/bind/bind_copy.cpp | 91.11% <100.00%> (+0.13%) ⬆️
src/common/types/types.cpp | 92.40% <100.00%> (+0.07%) ⬆️
src/include/common/copier_config/copier_config.h | 100.00% <100.00%> (ø)
src/include/common/exception/message.h | 60.00% <100.00%> (+10.00%) ⬆️
...de/processor/operator/persistent/csv_file_writer.h | 100.00% <100.00%> (ø)
... and 5 more

... and 7 files with indirect coverage changes


@rfdavid force-pushed the copy_to_parquet branch 2 times, most recently from 8ff867d to eae6431 on August 31, 2023 14:24
@rfdavid requested a review from ray6080 on August 31, 2023 16:46
@rfdavid force-pushed the copy_to_parquet branch 2 times, most recently from 031529e to 552c3ba on September 5, 2023 22:26
@rfdavid marked this pull request as ready for review on September 6, 2023 02:01
@rfdavid force-pushed the copy_to_parquet branch 3 times, most recently from 888fdb6 to e5bd0fd on September 8, 2023 18:34
Review thread on src/include/common/vector/value_vector.h (outdated, resolved)
Review thread on src/include/processor/operator/persistent/copy_to.h (outdated, resolved)
Review thread on src/include/processor/operator/persistent/copy_to.h (outdated, resolved)
Review thread on src/processor/processor.cpp (outdated, resolved)
@rfdavid force-pushed the copy_to_parquet branch 3 times, most recently from a30c208 to bd668b2 on September 17, 2023 20:06
@rfdavid merged commit 9ff7a2a into kuzudb:master on Sep 18, 2023
11 checks passed
@rfdavid deleted the copy_to_parquet branch on September 18, 2023 18:36