Test Framework: Support CSV to Parquet conversion #1611
Conversation
scripts/parquet/csv_to_parquet.py
Is this script still useful? Should we consider removing it?
It might be useful if you want to generate a dataset to store in our codebase. Other than that, I can't think of anything. We can also remove demo-db/parquet too.
std::shared_ptr<arrow::Table> csvTable;
ARROW_ASSIGN_OR_RAISE(infile, arrow::io::ReadableFile::Open(inputFile));
auto readOptions = arrow::csv::ReadOptions::Defaults();
auto parseOptions = arrow::csv::ParseOptions::Defaults();
My understanding is that you are relying on arrow's auto CSV reader to figure out the data type of each CSV column. I'm not sure it always behaves as expected, especially for dates/timestamps/nested data types. I believe arrow converts some nested data types into strings. Can you double-check the metadata of the generated parquet files?
Ideally, we should figure out the data types for each column ourselves and make the arrow reader aware of them.
I'm fine with this auto reader for now if it works mostly as expected.
Actually, I think the best approach would be to add native support for exporting tables to Parquet files in Cypher. Then the conversion would be much easier.
Besides this, I'm curious what the default row group size is when we dump data into parquet?
Besides this, I'm curious what's the default row group size when we dump data into parquet?
That's right, I'm relying on arrow's auto CSV reader, and its type inference is not very strict. Dates and timestamps are being converted into strings.
This commit adds the capability of converting CSV datasets to parquet datasets on the fly for the tests, simply by using the following command in the test header: -DATASET PARQUET CSV_TO_PARQUET(dataset)
Codecov Report

@@           Coverage Diff            @@
##           master    #1611    +/-  ##
=======================================
+ Coverage   91.49%   91.50%   +0.01%
=======================================
  Files         725      725
  Lines       26326    26334       +8
=======================================
+ Hits        24086    24098      +12
+ Misses       2240     2236       -4

View full report in Codecov by Sentry.
Test Framework: Support CSV to Parquet conversion (kuzudb#1611) Convert CSV dataset to PARQUET dataset inside .test files by using: -DATASET PARQUET CSV_TO_PARQUET(dataset)
Introduction
This commit adds the capability of converting CSV datasets to parquet datasets on the fly for the tests, simply by using the following command in the test header: -DATASET PARQUET CSV_TO_PARQUET(dataset)
To load a dataset without any conversion:
How it works
- Create the directory dataset/parquet_temp/tinysnb
- Copy schema.cypher to the created directory
- Parse copy.cypher, extract the CSV file names, header/no header information, and the CSV delimiter

Related to #1521
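The copy.cypher parsing step could look roughly like this stdlib-only sketch (the COPY statement format, the regex, and the option names are assumptions for illustration, not the framework's actual implementation):

```python
import re

# Illustrative COPY statement shape; the real copy.cypher syntax may differ.
COPY_RE = re.compile(
    r'COPY\s+(\w+)\s+FROM\s+"([^"]+)"(?:\s*\(([^)]*)\))?', re.IGNORECASE
)

def parse_copy_statement(stmt: str):
    """Extract table name, CSV path, and options (e.g. header, delimiter)."""
    m = COPY_RE.search(stmt)
    if m is None:
        raise ValueError(f"not a COPY statement: {stmt!r}")
    table, path, raw_opts = m.group(1), m.group(2), m.group(3) or ""
    options = {}
    for part in filter(None, (p.strip() for p in raw_opts.split(","))):
        key, _, value = part.partition("=")
        options[key.strip().upper()] = value.strip().strip('"')
    return table, path, options
```

With the extracted path and options in hand, the framework can hand each CSV file to the converter and write the Parquet output into the temporary directory.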