Recommended file format for large files #129
Comments
Why not the npz format for a numpy matrix?
Our input data isn't usually a numpy array, and can't easily be converted to one, because in most cases the different columns have different data types. Of course, we can store that as a set of different numpy arrays, but that's a bit clunky, so npz is just one of several possible alternatives with different tradeoffs.
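To make the tradeoff concrete, here is a minimal sketch (hypothetical column names, not the actual census schema) of what storing a mixed-dtype table as npz looks like: no single homogeneous array can hold all the columns, but npz can hold one array per column.

```python
import io

import numpy as np

# Columns with different dtypes: positions are floats, the category column
# is short fixed-width strings; no single homogeneous array fits all three.
columns = {
    "easting": np.array([1.5, 2.5, 3.5]),     # float64
    "northing": np.array([4.5, 5.5, 6.5]),    # float64
    "race": np.array([b"w", b"b", b"a"]),     # 1-byte strings
}

# npz is just a zip of .npy files, so it can store one array per column.
buf = io.BytesIO()
np.savez(buf, **columns)
buf.seek(0)

loaded = np.load(buf)
print(sorted(loaded.files))   # ['easting', 'northing', 'race']
```

This works, but as noted above it is clunky: the grouping into a "table" is only by convention, and per-column dtype handling is up to the caller.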
Is bcolz an option? It seems fast, compact and somewhat pydata-friendly (numpy + import/export to pandas, HDF5), plus it intends to support GPU. It is based on Blosc like Castra.
Sure, bcolz is a good option worth considering. We'll need to try out the various options and see if we can find a good balance of speed, file size, and cross-platform support. Feather is another possibility.
For now, we are going to continue to use the castra py2.7 file in the census notebook. Meanwhile, you can install feather using:
A set of utilities for measuring read and write performance for various columnar file formats has been added to the examples directory. To use them, follow these steps:
The output from step 3 should be something like:
Here the output from the first command shows the writing times for each format, in seconds, which are very short for this small file. For reading, files can be read into either a Pandas or a dask dataframe, which perform differently, so both versions are tested where supported. For each read, various measurements are shown:
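The measurement pattern behind these utilities can be sketched with stdlib tools alone; this is a simplified stand-in, not the actual filetimes.py code (pickle stands in for the real columnar formats, and the "aggregation" is a trivial sum):

```python
import pickle
import tempfile
import time
from pathlib import Path

def timed(fn):
    """Run fn() and return (elapsed_seconds, result)."""
    start = time.perf_counter()
    result = fn()
    return time.perf_counter() - start, result

# Stand-in "dataset"; in filetimes.py this would be the census dataframe.
data = {"x": list(range(100_000)), "y": list(range(100_000))}
path = Path(tempfile.mkdtemp()) / "tiny.pkl"

def write():
    with open(path, "wb") as f:
        pickle.dump(data, f)

def read():
    with open(path, "rb") as f:
        return pickle.load(f)

w_time, _ = timed(write)          # writing time for this "format"
r_time, loaded = timed(read)      # loading time
a_time, _ = timed(lambda: sum(loaded["x"]))   # aggregation time after load

print(f"write={w_time:.3f}s read={r_time:.3f}s agg={a_time:.3f}s")
```

Each real format would supply its own write() and read(), with the timing wrapper shared.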
The results from tinycensus are important mainly to verify that everything is working; presumably file reading speed is only actually crucial for larger files. The results for the full 300-million-row census dataset are currently:
or from a different run with a few different software library versions:
These numbers are real, in the sense that they were the actual timings on this MacBook Pro laptop, but they should all be considered preliminary. For one thing, castra is currently very slow compared to the other formats, yet in previous manual tests it was much faster than HDF5, so there may be some configuration settings that need to be applied to get good performance in this case. Updated results will be posted here as these settings are worked out.
Update: The above results were for Dask with no cache enabled. Enabling it (as in the current master of filetimes.sh) greatly improves performance with dask on Aggregate2, as would be expected:
bcolz+dask is still by far the fastest on every measure. Other tests are underway that change the chunk sizes used by dask, which affect how well the operations parallelize; those results will be posted here when available.
Updated times, reading only, no compression, for fastparquet on my system:
Because the full set of files takes hours to generate, the previous full-census results were from a saved set of files with parameters that are now unknown. Here's the result from an overnight run of everything on my system:
Both of the HDF5 times have really increased; not sure why. The rest seem comparable, so presumably I changed some setting for how I generated the HDF5 file. Note that the "feather" results are still invalid -- the output PNG is blank (which is detectable in the "Out" column). @martindurant, those times sound great -- much better than what I'm currently seeing. How do I need to change my script, and which versions of dask and fastparquet do I need, to be able to reproduce those?
The above results are all for files generated using pandas originally. The filetimes.py script now supports generating files in some formats using dask instead, which does affect the time (presumably because of different chunking options):
For all the filetypes currently supported by filetimes.py (which doesn't yet include everything dask supports), the reading time is lower when the files are created using these options for dask, though typically not dramatically so.
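Why chunking options affect read time can be illustrated with a stdlib-only sketch (raw bytes standing in for dataframe partitions): each chunk becomes a separate unit of work, so very small chunks add per-chunk overhead while very large ones limit parallelism.

```python
import os
import tempfile
import time

# A 4 MB file of random bytes stands in for an on-disk dataset.
path = os.path.join(tempfile.mkdtemp(), "blob.bin")
with open(path, "wb") as f:
    f.write(os.urandom(4 * 1024 * 1024))

def read_in_chunks(path, chunksize):
    """Read the whole file chunk by chunk; return the number of chunks."""
    n_chunks = 0
    with open(path, "rb") as f:
        while f.read(chunksize):
            n_chunks += 1
    return n_chunks

# The same file, read with different chunk sizes: same bytes, different
# per-chunk overhead -- analogous to choosing dask partition sizes.
for chunksize in (4 * 1024, 256 * 1024, 4 * 1024 * 1024):
    start = time.perf_counter()
    n = read_in_chunks(path, chunksize)
    elapsed = time.perf_counter() - start
    print(f"chunksize={chunksize:>8}: {n:>5} chunks in {elapsed:.4f}s")
```

In dask the analogous knob is the partition/blocksize chosen when the files are written or read, which is why dask-generated files can perform differently from pandas-generated ones.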
The best parameters I could get to work:
On load, include the keyword
Thanks! Using latest fastparquet master?
Yes, which is the same as 0.0.3 in conda-forge (I think OSX is still waiting to build) |
@martindurant, what is the effect of doing
This stores the data as byte-strings rather than assumed-utf8 strings, so decoding is not necessary on read. Decoding time is insignificant for categoricals, but if you use the fixed-string encoding it is significant, so you may as well avoid it. That means that your labels after aggregation will be b'w', b'l', etc.
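The consequence for downstream code can be sketched in plain Python (hypothetical labels, mimicking the census race column): aggregation keys come back as byte-strings, and decoding becomes an explicit, optional step rather than a cost paid on every read.

```python
# Labels as read back from a fixed-string (bytes) encoding -- no utf8
# decoding happened during the read itself.
labels = [b"w", b"b", b"a", b"h", b"o", b"w"]

# Aggregation keys are byte-strings...
counts = {}
for label in labels:
    counts[label] = counts.get(label, 0) + 1
print(counts)          # keys are b'w', b'b', ...

# ...and decoding is an explicit extra step, paid only if str labels are needed.
decoded = {k.decode("utf8"): v for k, v in counts.items()}
print(decoded)         # keys are 'w', 'b', ...
```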
You might want to repeat your benchmarks on a system that has plenty of memory for all data and intermediates, so that you are not affected by swapping, but also measure the peak memory usage of loading the various formats. The caching story is separate... Note that if you use the distributed scheduler (even on a single machine), you can explicitly persist the data, which, being explicit, gives you better control over what gets stored.
Good idea. In principle, I would think there is plenty of memory available, since the .csv file takes 9GB on disk, and any reasonable in-memory format for 600 million floats and 300 million one-character strings should be smaller than that. So if there are swapping issues, they would surely indicate that intermediates are being stored unnecessarily or that the in-memory format is very inefficient. Explicitly persisting the data would help sort out some of these issues, though it needs to be done per column, because we want to reveal the advantages of per-column reading for formats that support it (in other tests, not shown here, for data with many unused columns).
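A stdlib-only stand-in for the per-column idea (not dask's actual persist API; `ColumnCache` and `fake_loader` are hypothetical names) shows why persistence per column matters: columns an aggregation never touches are never loaded.

```python
class ColumnCache:
    """Load and cache each column independently, so only touched columns
    are ever read from disk -- the benefit of columnar formats."""

    def __init__(self, loader):
        self._loader = loader      # loader(name) reads one column
        self._cache = {}

    def __getitem__(self, name):
        if name not in self._cache:          # load each column at most once
            self._cache[name] = self._loader(name)
        return self._cache[name]

loads = []
def fake_loader(name):
    loads.append(name)             # record which columns were actually read
    return list(range(5))          # stand-in column data

cache = ColumnCache(fake_loader)
cache["easting"]
cache["easting"]          # second access is served from cache, not reloaded
print(loads)              # only one real load so far
```

Persisting the whole dataframe at once would instead pay the load cost for every column up front, hiding the per-column advantage being measured here.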
Thanks. I made the changes you suggested above for parquet and pushed them to master, so hopefully I got them correct. The results now are indeed better for Parquet:
Still, bcolz+dask remains by far the fastest, even after the initial loading. I'll try to figure out how to make the dataframe persist in a clean way, which should make all the different dask filetypes have similar timings for Aggregate2 (like they do for the different pandas filetypes). (These results are for files created using dask where supported, and if not then created using pandas, and skipping .gz.parq because it was taking too long to write that file.)
Work on fastparquet continues apace, and re-running the benchmark with the Dec 22 version of fastparquet (now specified in the filetimes.sh instructions) gives greatly improved speeds for the Parquet format:
Here the display format has changed slightly: the old "Total" column has been replaced with "Aggregate1", with the loading and aggregating portions shown in parentheses, to emphasize that the only meaningful number across dataframe types is the combined time. The final column (confusingly now named "Total") is now computed from within Python, and thus ignores the startup time for Python itself; this change was necessary due to problems with how /usr/bin/time was handling return values. Here's an easier-to-compare version listing these times against the ones from the previous comment:
All of the Parquet times are now reduced, some greatly so, apart from the Aggregate2-with-pandas times (which should always be independent of file format in any case). There is still a surprisingly large range in Aggregate2 times across file formats with dask, indicating that the dask cache is not being very effective in this case. Oddly, the bcolz, castra, and h5 performance has dropped significantly. The underlying code for all of those formats should not have changed, but changes in dask could have affected the speed. More likely, the difference is due to changes in filetimes.py itself, perhaps from choosing different options for encoding the categorical column, which should be investigated. In any case, the Parquet results are now approaching (for Aggregate1) or exceeding (for Aggregate2) the best ever seen from bcolz and castra, and since Parquet is a well-established format, it seems like a good idea to switch to it.
I'm looking into the performance mystery over the next week or so; I'll likely be posting questions and/or observations to this issue as I make progress.
Interesting results with PR #309. These seem to suggest that the
As a baseline, here are the results with Pandas:
@jbednar - we can achieve the ~2-second Agg2 time when using the |
Updated results (on PR #309), after reordering the
With dask-generated files:
With pandas-generated files:
Updated results from PR #344 , after several rounds of performance optimizations, fixing issues with the various file formats in filetimes.py, passing in cached range calculations for the second aggregation, and using threaded scheduler + dask's
Just noticed that the feather format did not work properly. Please disregard those rows.
Those times we can live with, right?
Yes! Those are excellent times. :-) Thanks for all your help, both of you! |
Closing this; the remaining work to make sure we follow through on it is listed in #325.
For reference, here's how we are creating the new
And then it gets read in using:
This is based on @gbrener's writeup on holoviz#129 and holoviz#313. Related to holoviz#325
@jbednar any chance you can rerun the test with ORC+snappy? Our use case favors a format that supports ACID queries and I'm curious if ORC outperforms parquet. |
@benepo, I don't know of any way to read ORC files in python. python-orc actually reads in java and passes records over a socket; there's no way you'll get performance out of that.
@martindurant - good point. I hadn't realized ORC support was limited to Java. On a separate note, I wonder how the Dask read performance shown above compares to Apache Arrow: https://arrow.apache.org/docs/python/parquet.html |
dask is able to read parquet via either fastparquet or arrow's parquet-cpp. Arrow has shown superior performance in some tests (read only), but the difference is not great, and fastparquet has a wider set of options and capabilities.
That's good to know - thanks. I was having some difficulty finding performance benchmarks between them. ref: https://github.com/elastacloud/parquet-dotnet/issues/171 |
@jbednar Could you point me to the hdf5 example? googling for it took me to this page |
Datashader is agnostic about file formats, working with anything that can be loaded into a dataframe-like object (currently supporting Pandas and Dask dataframes). But because datashader focuses on having good performance for large datasets, the performance of the file format is a major factor in the usability of the library. Thus we should use examples that serve to guide users towards good solutions for their own problems, recommending and demonstrating approaches that we find to work well.
Right now, our examples use CSV and castra or HDF5 formats. It is of course important to show a CSV example, since nearly every dataset can be obtained in CSV for import into the library. However, CSV is highly inefficient in both file size and reading speed, and it also truncates floating-point precision in ways that are problematic when zooming in closely to a dataset.
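The size and precision drawbacks of CSV can be illustrated with a small stdlib-only sketch (synthetic data, not the census file; the 6-digit rounding mimics a typical export setting):

```python
import csv
import io
import random
import struct

random.seed(0)
values = [random.random() for _ in range(10_000)]

# CSV: text representation; exports commonly round to a fixed digit count.
text = io.StringIO()
writer = csv.writer(text)
for v in values:
    writer.writerow([f"{v:.6f}"])        # 6 decimal digits per value
csv_size = len(text.getvalue().encode())

# Binary float64: 8 bytes per value, with an exact round-trip.
packed = struct.pack(f"{len(values)}d", *values)
roundtrip = struct.unpack(f"{len(values)}d", packed)

print(f"csv bytes: {csv_size}, binary bytes: {len(packed)}")
print("binary exact:", roundtrip[0] == values[0])
print("csv exact:   ", float(f"{values[0]:.6f}") == values[0])
```

The binary form is both smaller and lossless, which is exactly the property lost when zooming far into CSV-sourced data: nearby points collapse onto the rounded values.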
Castra is a relatively high-performance binary format that works well for the large datasets in the examples, but it is not yet a mature project, and is not available on the main conda channel. Should we invest in making castra be more fully supported? If not, I think we should choose another binary format (HDF5?) to use for our examples.