reader/csv: avoid lseek for getting offset #2569

Merged 1 commit on Dec 11, 2023

@Riolku Riolku commented Dec 11, 2023

Currently we use lseek() after every row in the parallel CSV reader to get the file offset. I thought this would be very cheap, and figured it was easier than tracking it ourselves. However, it is not cheap at all, as demonstrated by a flamegraph, and it is not hard to track ourselves either.
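
A minimal sketch of the idea, not the actual reader code: `OffsetTrackingReader`, `offsetViaLseek`, and `countRows` are hypothetical names, and the 4096-byte buffer and newline-only row splitting are assumptions made for illustration. The reader adds the number of bytes it consumes per row to a counter, so asking for the offset needs no system call. `offsetViaLseek` shows what an lseek-based lookup would have to compute for a buffered reader, and is kept only for comparison.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <fcntl.h>
#include <unistd.h>

// Hypothetical buffered row reader (illustration only).
// It tracks the byte offset of the next unread row by itself.
class OffsetTrackingReader {
public:
    explicit OffsetTrackingReader(int fd, uint64_t startOffset = 0)
        : fd_(fd), rowOffset_(startOffset) {}

    // Consumes one row: everything up to and including '\n', or up to EOF
    // for a final row without a trailing newline. Returns false at EOF.
    bool nextRow() {
        bool sawAny = false;
        while (true) {
            if (pos_ == len_ && !fill()) {
                return sawAny;
            }
            const char* start = buf_ + pos_;
            const void* nl = std::memchr(start, '\n', len_ - pos_);
            if (nl != nullptr) {
                size_t consumed = static_cast<const char*>(nl) - start + 1;
                pos_ += consumed;
                rowOffset_ += consumed; // plain addition, no syscall
                return true;
            }
            // Row continues past the buffer: count what we have and refill.
            rowOffset_ += len_ - pos_;
            pos_ = len_;
            sawAny = true;
        }
    }

    // Offset of the next unread row, tracked in user space.
    uint64_t offset() const { return rowOffset_; }

    // Comparison only: the same value obtained with one lseek() call.
    // The kernel position is ahead by the bytes still sitting in the buffer.
    uint64_t offsetViaLseek() const {
        off_t kernelPos = lseek(fd_, 0, SEEK_CUR);
        assert(kernelPos >= 0);
        return static_cast<uint64_t>(kernelPos) - (len_ - pos_);
    }

private:
    // Refills the buffer; a read error is treated like EOF in this sketch.
    bool fill() {
        ssize_t n = read(fd_, buf_, sizeof(buf_));
        if (n <= 0) {
            return false;
        }
        len_ = static_cast<size_t>(n);
        pos_ = 0;
        return true;
    }

    int fd_;
    uint64_t rowOffset_;
    char buf_[4096];
    size_t pos_ = 0;
    size_t len_ = 0;
};

// Usage sketch: count rows and report the final offset without any lseek().
// Returns 0 rows if the file cannot be opened.
uint64_t countRows(const char* path, uint64_t* endOffset) {
    int fd = open(path, O_RDONLY);
    if (fd < 0) {
        return 0;
    }
    OffsetTrackingReader reader(fd);
    uint64_t rows = 0;
    while (reader.nextRow()) {
        ++rows;
    }
    if (endOffset != nullptr) {
        *endOffset = reader.offset();
    }
    close(fd);
    return rows;
}
```

With per-row tracking like this, the only cost of knowing the offset is an integer addition per row, whereas the lseek-based variant issues one system call per row.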


codecov bot commented Dec 11, 2023

Codecov Report

All modified and coverable lines are covered by tests ✅

Comparison is base (15f79e6) 92.90% compared to head (5d0ca82) 92.90%.

Additional details and impacted files
@@           Coverage Diff           @@
##           master    #2569   +/-   ##
=======================================
  Coverage   92.90%   92.90%           
=======================================
  Files        1026     1026           
  Lines       38591    38588    -3     
=======================================
- Hits        35853    35851    -2     
+ Misses       2738     2737    -1     



Riolku commented Dec 11, 2023

Benchmark:

Query: LOAD FROM "100mil_pk.csv" RETURN COUNT(*);

The file contains 100 million integers, and nothing else.

master: 9.284s.
this branch: 2.427s (about 3.8x faster).


Riolku commented Dec 11, 2023

LDBC100 Comment:

master: 80.67s.
this branch: 57.85s (about 28% less time).

@Riolku Riolku merged commit 572a69f into master Dec 11, 2023
14 checks passed
@Riolku Riolku deleted the csv-reader-no-lseek branch December 11, 2023 21:05