
Test Framework: Support CSV to Parquet conversion #1611

Merged (7 commits) into kuzudb:master on Jun 7, 2023

Conversation

@rfdavid (Collaborator) commented Jun 2, 2023

Introduction

This commit adds the ability to convert CSV datasets to Parquet datasets on the fly for tests, simply by adding the following command to the test header:

-DATASET PARQUET CSV_TO_PARQUET(dataset)

To load a dataset without any conversion:

-DATASET CSV tinysnb
-DATASET NPY npy-1d
-DATASET PARQUET demo-db/parquet
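For context, a full test-file header combining these directives might look like the sketch below. The -GROUP name is a made-up example, and the `--` separator follows the test framework's usual header/body convention; only the -DATASET line is taken from this PR:

```
-GROUP TinySnbConversionTest
-DATASET PARQUET CSV_TO_PARQUET(tinysnb)
--
```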

How it works

  1. Create a directory dataset/parquet_temp/tinysnb
  2. Copy schema.cypher to the created directory
  3. Read and parse copy.cypher, extract the csv file names, header/no header information, csv delimiter
  4. Create a new copy.cypher with the new COPY commands and paths
  5. Convert the .csv files to .parquet files in the parquet temp directory
  6. Set dataset path to parquet temp directory
  7. Remove parquet temp directory after all tests run.
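Step 3 above can be sketched roughly as follows. This is a standalone Python illustration, not the PR's actual C++ implementation, and the COPY option names (HEADER, DELIM) are assumptions based on common Cypher COPY syntax:

```python
import re

# Hypothetical sketch of step 3: pull the CSV file name, header flag, and
# delimiter out of each COPY statement in a dataset's copy.cypher file.
COPY_RE = re.compile(
    r'COPY\s+(\w+)\s+FROM\s+"([^"]+\.csv)"'  # table name and csv path
    r'(?:\s*\(([^)]*)\))?',                  # optional options list
    re.IGNORECASE)

def parse_copy_statements(cypher_text):
    statements = []
    for match in COPY_RE.finditer(cypher_text):
        table, csv_path, options = match.groups()
        opts = options or ""
        entry = {
            "table": table,
            "csv": csv_path,
            # defaults when an option is absent
            "has_header": "header=true" in opts.replace(" ", "").lower(),
            "delimiter": ",",
        }
        delim = re.search(r'delim\s*=\s*"(.)"', opts, re.IGNORECASE)
        if delim:
            entry["delimiter"] = delim.group(1)
        statements.append(entry)
    return statements

example = 'COPY person FROM "dataset/tinysnb/vPerson.csv" (HEADER=true, DELIM="|")'
print(parse_copy_statements(example))
```

With the parsed entries in hand, steps 4–5 rewrite each COPY command to point at the converted .parquet path.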

Related to #1521

@rfdavid rfdavid force-pushed the csv_to_parquet_on_tests branch 2 times, most recently from dd0889c to 509ed75 on June 5, 2023 14:59
@rfdavid rfdavid changed the title WIP: Test Framework: Support CSV to Parquet conversion Test Framework: Support CSV to Parquet conversion Jun 5, 2023
@rfdavid rfdavid requested a review from ray6080 June 5, 2023 18:54
@rfdavid rfdavid marked this pull request as ready for review June 5, 2023 18:54
Review threads (outdated, resolved):
dataset/tinysnb/copy.cypher
test/include/test_runner/test_group.h
test/runner/e2e_test.cpp
Contributor commented:
Is this script still useful? Should we consider removing it?

Collaborator (author) replied:

It might be useful if you want to generate a dataset to store in our codebase; other than that, I can't think of a use. We could also remove demo-db/parquet.

Review threads (outdated, resolved):
src/include/common/file_utils.h
test/include/test_runner/csv_to_parquet_converter.h
test/test_runner/csv_to_parquet_converter.cpp (2 threads)
src/common/string_utils.cpp
std::shared_ptr<arrow::Table> csvTable;
ARROW_ASSIGN_OR_RAISE(infile, arrow::io::ReadableFile::Open(inputFile));
auto readOptions = arrow::csv::ReadOptions::Defaults();
auto parseOptions = arrow::csv::ParseOptions::Defaults();
Contributor commented:
My understanding is that you are relying on arrow's auto csv reader to figure out data types for each csv column. I'm not sure if it's always as expected, especially when it comes to dates/timestamps/nested data types. I believe arrow converts some nested data types into strings. Can you double check the metadata of generated parquet files?
Ideally, we need to figure out data types for each column, and let the arrow reader be aware of that.
I'm fine with this auto reader for now if it works mostly as expected.

Actually I think the best way is we should add native support of exporting tables to parquet files in Cypher.
Then the conversion would be much easier.

Besides this, I'm curious what's the default row group size when we dump data into parquet?

Collaborator (author) replied:
That's right, I'm relying on arrow's auto CSV reader, and it is not very precise: dates and timestamps are being converted into strings.

Commit message:
"This commit adds the capability of converting CSV datasets to parquet datasets on the fly for the tests, by simply using the following command in the test header: -DATASET CSV CSV_TO_PARQUET(dataset)"
@codecov bot commented Jun 6, 2023

Codecov Report

Patch coverage: 90.90% and project coverage change: +0.01 🎉

Comparison is base (dbb2552) 91.49% compared to head (19e8077) 91.50%.

Additional details and impacted files
@@            Coverage Diff             @@
##           master    #1611      +/-   ##
==========================================
+ Coverage   91.49%   91.50%   +0.01%     
==========================================
  Files         725      725              
  Lines       26326    26334       +8     
==========================================
+ Hits        24086    24098      +12     
+ Misses       2240     2236       -4     
Impacted Files Coverage Δ
src/include/common/file_utils.h 100.00% <ø> (ø)
src/common/file_utils.cpp 73.03% <90.90%> (+1.42%) ⬆️

... and 1 file with indirect coverage changes

☔ View full report in Codecov by Sentry.

@rfdavid rfdavid merged commit 20d696a into kuzudb:master Jun 7, 2023
8 checks passed
@rfdavid rfdavid deleted the csv_to_parquet_on_tests branch June 7, 2023 14:50
yuchenZhangTG pushed a commit to yuchenZhangTG/kuzu that referenced this pull request Jun 8, 2023
Test Framework: Support CSV to Parquet conversion (kuzudb#1611)

Convert CSV dataset to PARQUET dataset inside .test files by using:
-DATASET PARQUET CSV_TO_PARQUET(dataset)