feat: Out of core CSV support using Apache Arrow CSV reader (fast 🔥!) #1028

Merged
merged 2 commits into from
Sep 23, 2022


maartenbreddels
Member

@maartenbreddels maartenbreddels commented Oct 28, 2020

@JovanVeljanoski we need to discuss how we expose this.
Questions

  • Do we always want lazy CSV reading? Or, if the file is below, say, 20% of available RAM, do we load it into memory directly? Or do we provide special methods, e.g. vaex.io.open_csv_lazy and vaex.io.open_csv_memory (better to be explicit?)
  • How do we expose the pandas route?
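
To make the first question concrete, here is a minimal sketch of the "load into memory if small, otherwise stay lazy" heuristic. The function name choose_csv_strategy, its parameters, and the returned labels are hypothetical and not part of vaex; the 0.2 default only mirrors the "say 20%" figure above, and the available RAM is passed in by the caller rather than measured.

```python
import os
import tempfile


def choose_csv_strategy(path, available_ram_bytes, memory_fraction=0.2):
    """Return "memory" if the file is small relative to available RAM, else "lazy".

    Hypothetical helper: memory_fraction=0.2 is a placeholder, not a tuned value.
    """
    file_size = os.path.getsize(path)
    if file_size < memory_fraction * available_ram_bytes:
        return "memory"
    return "lazy"


# Demo with a small temporary file (4004 bytes) and made-up RAM figures.
with tempfile.NamedTemporaryFile("w", suffix=".csv", delete=False, newline="") as f:
    f.write("x,y\n" + "1,2\n" * 1000)
    demo_path = f.name

# The file is larger than 20% of 10 kB, so reading stays lazy.
strategy_small_ram = choose_csv_strategy(demo_path, available_ram_bytes=10_000)
# The file is far below 20% of 10 MB, so it would be loaded into memory.
strategy_large_ram = choose_csv_strategy(demo_path, available_ram_bytes=10_000_000)
os.remove(demo_path)
```

An implicit switch like this keeps a single entry point, but the behaviour then depends on the machine; two explicit functions avoid that at the cost of a larger API.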

I want to move some input/output functions from __init__.py into io.py; let me know if you like it.

Stats on a 70 GB CSV file (on nyx, 64-core AMD Ryzen):

  • Opening: 4-6 seconds (fast row count estimate) $ time py.test tests/csv_test.py -v -k test_large_csv_count_array_lengths
  • Reading a single column: 9-10 seconds $ time py.test tests/csv_test.py -v -k test_large_csv_count_array_lengths
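
For readers wondering how a row count can be estimated without reading the whole file, one possible approach is to sample the start of the file and extrapolate the newline density to the full file size. This is only an illustrative sketch and not necessarily what this PR does; estimate_row_count and sample_bytes are hypothetical names, and the estimate is off when row widths vary a lot or when quoted fields contain newlines.

```python
import os
import tempfile


def estimate_row_count(path, sample_bytes=1 << 20):
    """Estimate the number of lines by extrapolating from the first sample_bytes."""
    file_size = os.path.getsize(path)
    if file_size == 0:
        return 0
    with open(path, "rb") as f:
        sample = f.read(sample_bytes)
    newlines = sample.count(b"\n")
    if len(sample) == file_size:
        # The whole file fit in the sample, so the count is exact.
        return newlines
    # Scale the newline density of the sample up to the full file size.
    return round(newlines * file_size / len(sample))


# Demo: 10000 rows of identical width (12 bytes each, no header).
with tempfile.NamedTemporaryFile("wb", suffix=".csv", delete=False) as f:
    f.write(b"12345,67890\n" * 10_000)
    demo_path = f.name

estimated = estimate_row_count(demo_path, sample_bytes=1200)  # reads 1% of the file
exact = estimate_row_count(demo_path, sample_bytes=1 << 20)   # reads the whole file
os.remove(demo_path)
```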

TODO

@yohplala
Contributor

Hi @maartenbreddels
Sorry, I am only now seeing this PR.
A question out of curiosity: is the Apache Arrow out-of-core CSV reader able to work with zipped CSV files?
Having CSV files zipped is common (at least, pandas reads them, transparently unzipping them I guess).
Does Apache Arrow do the same?
Not understanding memory mapping in depth, I would guess that zipping makes things more complex. Does it?
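
As background to this question: a plain CSV file can be memory mapped and read at arbitrary byte offsets, whereas a standard gzip stream has to be decompressed from the start, so it is naturally consumed front to back in chunks. The sketch below shows that streaming pattern with Python's standard gzip module only; it does not use Arrow or vaex, and iter_gzip_chunks and count_rows_gzip are hypothetical helper names.

```python
import gzip
import os
import tempfile


def iter_gzip_chunks(path, chunk_size=1 << 16):
    """Yield decompressed byte chunks from a gzip file, front to back."""
    with gzip.open(path, "rb") as f:
        while True:
            chunk = f.read(chunk_size)
            if not chunk:
                break
            yield chunk


def count_rows_gzip(path):
    """Count newline-terminated lines without holding the whole file in memory."""
    return sum(chunk.count(b"\n") for chunk in iter_gzip_chunks(path))


# Demo: a gzipped CSV with one header line and 5000 data rows.
with tempfile.NamedTemporaryFile(suffix=".csv.gz", delete=False) as tmp:
    demo_path = tmp.name
with gzip.open(demo_path, "wb") as f:
    f.write(b"x,y\n" + b"1,2\n" * 5000)

line_count = count_rows_gzip(demo_path)
os.remove(demo_path)
```

Only one chunk is held in memory at a time, so this works out of core, but random access to a row in the middle of the file still requires decompressing everything before it.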

@JovanVeljanoski
Member

JovanVeljanoski commented Sep 20, 2022

In this PR we will also try to support reading gzipped CSV files. Here are some relevant threads or comments:

@maartenbreddels
Member Author

  • Python package / main (macOS-latest, 3.6) (pull_request)

This one hangs quite regularly

@maartenbreddels maartenbreddels merged commit 5403c03 into master Sep 23, 2022