
Test Framework: Support CSV to Parquet conversion #1611

Merged (7 commits) into kuzudb:master on Jun 7, 2023

Conversation

@rfdavid (Collaborator) commented Jun 2, 2023

Introduction

This commit adds the ability to convert CSV datasets to Parquet datasets on the fly for tests, simply by adding the following command to the test header:

-DATASET PARQUET CSV_TO_PARQUET(dataset)

To load a dataset without any conversion:

-DATASET CSV tinysnb
-DATASET NPY npy-1d
-DATASET PARQUET demo-db/parquet
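For context, a full test-file header combining these directives might look like the sketch below. The -GROUP name is a made-up example, and the `--` separator follows the test framework's usual header/body convention; only the -DATASET line is taken from this PR:

```
-GROUP TinySnbConversionTest
-DATASET PARQUET CSV_TO_PARQUET(tinysnb)
--
```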

How it works

  1. Create a directory dataset/parquet_temp/tinysnb
  2. Copy schema.cypher to the created directory
  3. Read and parse copy.cypher, extract the csv file names, header/no header information, csv delimiter
  4. Create a new copy.cypher with the new COPY commands and paths
  5. Convert the .csv files to .parquet files in the parquet temp directory
  6. Set dataset path to parquet temp directory
  7. Remove parquet temp directory after all tests run.
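Step 3 above can be sketched roughly as follows. This is a standalone Python illustration, not the PR's actual C++ implementation, and the COPY option names (HEADER, DELIM) are assumptions based on common Cypher COPY syntax:

```python
import re

# Hypothetical sketch of step 3: pull the CSV file name, header flag, and
# delimiter out of each COPY statement in a dataset's copy.cypher file.
COPY_RE = re.compile(
    r'COPY\s+(\w+)\s+FROM\s+"([^"]+\.csv)"'  # table name and csv path
    r'(?:\s*\(([^)]*)\))?',                  # optional options list
    re.IGNORECASE)

def parse_copy_statements(cypher_text):
    statements = []
    for match in COPY_RE.finditer(cypher_text):
        table, csv_path, options = match.groups()
        opts = options or ""
        entry = {
            "table": table,
            "csv": csv_path,
            # defaults when an option is absent
            "has_header": "header=true" in opts.replace(" ", "").lower(),
            "delimiter": ",",
        }
        delim = re.search(r'delim\s*=\s*"(.)"', opts, re.IGNORECASE)
        if delim:
            entry["delimiter"] = delim.group(1)
        statements.append(entry)
    return statements

example = 'COPY person FROM "dataset/tinysnb/vPerson.csv" (HEADER=true, DELIM="|")'
print(parse_copy_statements(example))
```

With the parsed entries in hand, steps 4–5 rewrite each COPY command to point at the converted .parquet path.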

Related to #1521

@rfdavid rfdavid force-pushed the csv_to_parquet_on_tests branch 2 times, most recently from dd0889c to 509ed75 on June 5, 2023 14:59
@rfdavid rfdavid changed the title WIP: Test Framework: Support CSV to Parquet conversion Test Framework: Support CSV to Parquet conversion Jun 5, 2023
@rfdavid rfdavid requested a review from ray6080 June 5, 2023 18:54
@rfdavid rfdavid marked this pull request as ready for review June 5, 2023 18:54
Review threads (outdated, resolved):
dataset/tinysnb/copy.cypher
test/include/test_runner/test_group.h
test/runner/e2e_test.cpp
Contributor commented:
Is this script still useful? Should we consider removing it?

Collaborator (author) replied:

It might be useful if you want to generate a dataset to store in our codebase; other than that, I can't think of a use. We could also remove demo-db/parquet.

Review threads (outdated, resolved):
src/include/common/file_utils.h
test/include/test_runner/csv_to_parquet_converter.h
test/test_runner/csv_to_parquet_converter.cpp (2 threads)
src/common/string_utils.cpp
std::shared_ptr<arrow::Table> csvTable;
ARROW_ASSIGN_OR_RAISE(infile, arrow::io::ReadableFile::Open(inputFile));
auto readOptions = arrow::csv::ReadOptions::Defaults();
auto parseOptions = arrow::csv::ParseOptions::Defaults();
Contributor commented:
My understanding is that you are relying on arrow's auto csv reader to figure out data types for each csv column. I'm not sure if it's always as expected, especially when it comes to dates/timestamps/nested data types. I believe arrow converts some nested data types into strings. Can you double check the metadata of generated parquet files?
Ideally, we need to figure out data types for each column, and let the arrow reader be aware of that.
I'm fine with this auto reader for now if it works mostly as expected.

Actually I think the best way is we should add native support of exporting tables to parquet files in Cypher.
Then the conversion would be much easier.

Besides this, I'm curious what's the default row group size when we dump data into parquet?

Collaborator (author) replied:
That's right, I'm relying on arrow's auto CSV reader, and it is not very precise: dates and timestamps are being converted into strings.

Commit message:
"This commit adds the capability of converting CSV datasets to parquet datasets on the fly for the tests, by simply using the following command in the test header: -DATASET CSV CSV_TO_PARQUET(dataset)"
@codecov bot commented Jun 6, 2023

Codecov Report

Patch coverage: 90.90% and project coverage change: +0.01 🎉

Comparison is base (dbb2552) 91.49% compared to head (19e8077) 91.50%.

Additional details and impacted files
@@            Coverage Diff             @@
##           master    #1611      +/-   ##
==========================================
+ Coverage   91.49%   91.50%   +0.01%     
==========================================
  Files         725      725              
  Lines       26326    26334       +8     
==========================================
+ Hits        24086    24098      +12     
+ Misses       2240     2236       -4     
Impacted Files Coverage Δ
src/include/common/file_utils.h 100.00% <ø> (ø)
src/common/file_utils.cpp 73.03% <90.90%> (+1.42%) ⬆️

... and 1 file with indirect coverage changes

☔ View full report in Codecov by Sentry.

@rfdavid rfdavid merged commit 20d696a into kuzudb:master Jun 7, 2023
8 checks passed
@rfdavid rfdavid deleted the csv_to_parquet_on_tests branch June 7, 2023 14:50
yuchenZhangTG pushed a commit to yuchenZhangTG/kuzu that referenced this pull request Jun 8, 2023
Test Framework: Support CSV to Parquet conversion (kuzudb#1611)

Convert CSV dataset to PARQUET dataset inside .test files by using:
-DATASET PARQUET CSV_TO_PARQUET(dataset)