-
Notifications
You must be signed in to change notification settings - Fork 94
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
reader: implement parallel CSV reading #2070
Conversation
In testing, LDBC 100 loads in 3 seconds on ac4 (with 128 cores). However, this does not account for the overhead of counting the number of rows. If we are not going to make hash indexes resizable soon, we should add a dedicated CSV row counting function. Furthermore, the 3 seconds does not account for the overhead of writing to disk, since I ran:
On LDBC-10, which is much nicer to benchmark because the serial CSV reader loads it quickly enough, I got these numbers: Serial: ~12s. Again, in practice, we see only a 2x speedup since we pay the price of counting the rows. |
00500dc
to
1a48115
Compare
@@ -19,18 +19,21 @@ struct CSVReaderConfig { | |||
char listBeginChar; | |||
char listEndChar; | |||
bool hasHeader; | |||
bool parallel; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not really related to this PR. I wonder if we should merge CSVReaderConfig with ReaderConfig at some point. Each reader can access a subset of fields from ReaderConfig class.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Some attributes can be shared. Parallel
definitely should be. Others though... don't make sense, right?
0846d9a
to
18c7e9f
Compare
Codecov ReportAttention:
Additional details and impacted files@@ Coverage Diff @@
## master #2070 +/- ##
==========================================
+ Coverage 89.55% 89.60% +0.04%
==========================================
Files 981 985 +4
Lines 35901 35745 -156
==========================================
- Hits 32151 32028 -123
+ Misses 3750 3717 -33
☔ View full report in Codecov by Sentry. |
18c7e9f
to
e68d69e
Compare
695fa48
to
df729ba
Compare
src/include/processor/operator/persistent/reader/csv/base_csv_reader.h
Outdated
Show resolved
Hide resolved
src/include/processor/operator/persistent/reader/csv/parallel_csv_reader.h
Show resolved
Hide resolved
src/include/processor/operator/persistent/reader/csv/serial_csv_reader.h
Outdated
Show resolved
Hide resolved
src/include/processor/operator/persistent/reader/csv/parallel_csv_reader.h
Outdated
Show resolved
Hide resolved
src/processor/operator/persistent/reader/csv/parallel_csv_reader.cpp
Outdated
Show resolved
Hide resolved
1cdb9b5
to
33b8ff4
Compare
This also refactors the CSVReader class to enable this change.
33b8ff4
to
495e283
Compare
@@ -179,9 +179,10 @@ void CopyNode::checkNonNullConstraint(NullColumnChunk* nullChunk, offset_t numNo | |||
} | |||
|
|||
void CopyNode::finalize(ExecutionContext* context) { | |||
auto numNodes = StorageUtils::getStartOffsetOfNodeGroup(sharedState->getCurNodeGroupIdx()) + |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why do we make this change? The change looks like a bug to me. This could just lead to 0 numNodes in statistics.
This also refactors the CSVReader class to enable this change.