
Recommended file format for large files #129

Closed · jbednar opened this issue Mar 29, 2016 · 40 comments

@jbednar (Member) commented Mar 29, 2016

Datashader is agnostic about file formats, working with anything that can be loaded into a dataframe-like object (currently supporting Pandas and Dask dataframes). But because datashader focuses on having good performance for large datasets, the performance of the file format is a major factor in the usability of the library. Thus we should use examples that serve to guide users towards good solutions for their own problems, recommending and demonstrating approaches that we find to work well.

Right now, our examples use CSV and castra or HDF5 formats. It is of course important to show a CSV example, since nearly every dataset can be obtained in CSV for import into the library. However, CSV is highly inefficient in both file size and reading speed, and it also truncates floating-point precision in ways that become problematic when zooming in closely on a dataset.

Castra is a relatively high-performance binary format that works well for the large datasets in the examples, but it is not yet a mature project, and it is not available on the main conda channel. Should we invest in making castra more fully supported? If not, I think we should choose another binary format (HDF5?) to use for our examples.
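
For reference, a minimal sketch of the kind of CSV-to-HDF5 conversion being discussed, using pandas (the file names, key, and use of the table format are illustrative assumptions, not the exact files used in the examples):

import pandas as pd

# Read the CSV once, then store it as a binary HDF5 table for faster subsequent loads
df = pd.read_csv('data/example.csv')
df.to_hdf('data/example.h5', key='example', format='table')

# Later runs can load the binary file directly
df = pd.read_hdf('data/example.h5', key='example')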

@apaytuvi commented May 3, 2016

Why not the npz format for a numpy matrix?

@jbednar (Member, Author) commented May 4, 2016

Our input data isn't usually a numpy array, and can't easily be converted to one, because in most cases the different columns have different data types. Of course, we could store the data as a set of separate numpy arrays, but that's a bit clunky, so npz is just one of several possible alternatives, each with different tradeoffs.
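
As a concrete illustration of that clunkiness, here is a minimal sketch (column names hypothetical) of saving a mixed-dtype dataframe as an .npz archive of per-column arrays and then reassembling it:

import numpy as np
import pandas as pd

df = pd.DataFrame({'easting': [0.1, 0.2], 'northing': [1.5, 2.5], 'race': ['w', 'b']})

# Each column becomes its own array in the archive, since the dtypes differ
np.savez('example.npz', **{col: df[col].values for col in df.columns})

# Reading requires stitching the arrays back into a dataframe by hand
# (allow_pickle is needed because the string column is stored as an object array)
with np.load('example.npz', allow_pickle=True) as data:
    df2 = pd.DataFrame({name: data[name] for name in data.files})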

@miktoki commented May 5, 2016

Is bcolz an option? It seems fast, compact, and reasonably pydata-friendly (numpy-based, with import/export to pandas and HDF5), and it intends to support GPUs. Like castra, it is based on Blosc.
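
For reference, a minimal sketch of the pandas round trip with bcolz (paths and column names are hypothetical):

import bcolz
import pandas as pd

df = pd.DataFrame({'easting': [0.1, 0.2], 'race': ['w', 'b']})

# Write the dataframe as an on-disk, Blosc-compressed column store
bcolz.ctable.fromdataframe(df, rootdir='example.bcolz')

# Read it back into pandas
df2 = bcolz.open('example.bcolz').todataframe()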

@jbednar (Member, Author) commented May 5, 2016

Sure, bcolz is a good option worth considering. We'll need to try out the various options and see if we can find a good balance of speed, file size, and cross-platform support. Feather is another possibility.

@brendancol (Collaborator) commented Jun 22, 2016

For now, we are going to continue to use the castra py2.7 file in the census notebook, but fall back to the census.h5 version, which has better cross-platform support. Moving forward, it looks like feather-format would be a great solution to the performance and interoperability issues.

Meanwhile, you can install feather using: conda install feather-format -c conda-forge and modify the examples appropriately.
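
For reference, with that package installed the read/write calls look roughly like this (paths hypothetical; a sketch rather than the exact example code):

import feather
import pandas as pd

df = pd.DataFrame({'easting': [0.1, 0.2], 'race': ['w', 'b']})

# Write and read a Feather file; both directions are close to a raw memory copy
feather.write_dataframe(df, 'example.feather')
df2 = feather.read_dataframe('example.feather')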

@jbednar (Member, Author) commented Nov 30, 2016

A set of utilities for measuring read and write performance for various columnar file formats has been added to the examples directory. To use them, follow these steps:

  1. Pull the latest datashader github master and cd to the examples/ directory
  2. Grab a copy of tinycensus.zip and unpack it into examples/data/
  3. Follow the instructions at the top of examples/filetimes.sh to:
    a. set up an environment with all the right packages and versions needed
    b. create versions of tinycensus.csv in various formats
    c. load and time each version of the file using a datashader aggregation
  4. Once that works, you can replace "tinycensus.csv" in the instructions with "census.h5" and run through it again, which will take a lot longer (at least half an hour, all told)

The output from step 3 should be something like:

bash-3.2$ source activate filetimes
(filetimes) bash-3.2$ rm -rf times/tinycen* ; python -c "import filetimes ; filetimes.base='census' ; filetimes.categories=['race']; filetimes.timed_write('data/tinycensus.csv')"
times/tinycensus.csv         00.13
times/tinycensus.h5          00.04
times/tinycensus.castra      00.02
times/tinycensus.bcolz       00.01
times/tinycensus.feather     00.00
times/tinycensus.parq        00.39
times/tinycensus.snappy.parq 00.02
times/tinycensus.gz.parq     00.77
(filetimes) bash-3.2$ ./filetimes.sh times/tinycensus
times/tinycensus.csv         dask    Total:000.25  Load:000.01  Aggregate1:000.24  Aggregate2:000.06  In:00001014035  Out:00000033122         2.10
times/tinycensus.h5          dask    Total:000.28  Load:000.01  Aggregate1:000.27  Aggregate2:000.09  In:00000831916  Out:00000033122         2.12
times/tinycensus.castra      dask    Total:000.20  Load:000.00  Aggregate1:000.20  Aggregate2:000.02  In:00000588729  Out:00000033122         1.87
times/tinycensus.bcolz       dask    Total:000.20  Load:000.01  Aggregate1:000.19  Aggregate2:000.02  In:00000247776  Out:00000033122         1.93
times/tinycensus.feather     dask    Not supported         1.52
times/tinycensus.parq        dask    Total:001.33  Load:000.00  Aggregate1:001.32  Aggregate2:000.02  In:00000749022  Out:00000033122         3.05
times/tinycensus.gz.parq     dask    Total:001.36  Load:000.00  Aggregate1:001.35  Aggregate2:000.04  In:00000290573  Out:00000033122         3.15
times/tinycensus.snappy.parq dask    Total:001.32  Load:000.00  Aggregate1:001.32  Aggregate2:000.03  In:00000491241  Out:00000033122         3.05
times/tinycensus.csv         pandas  Total:000.20  Load:000.02  Aggregate1:000.19  Aggregate2:000.00  In:00001014035  Out:00000033122         1.85
times/tinycensus.h5          pandas  Total:000.21  Load:000.03  Aggregate1:000.18  Aggregate2:000.00  In:00000831916  Out:00000033122         1.87
times/tinycensus.castra      pandas  Not supported         1.55
times/tinycensus.bcolz       pandas  Not supported         1.56
times/tinycensus.feather     pandas  Total:000.18  Load:000.00  Aggregate1:000.18  Aggregate2:000.00  In:00000644376  Out:00000033122         1.82
times/tinycensus.parq        pandas  Total:001.15  Load:000.97  Aggregate1:000.18  Aggregate2:000.00  In:00000749022  Out:00000033122         2.84
times/tinycensus.gz.parq     pandas  Total:001.21  Load:001.03  Aggregate1:000.18  Aggregate2:000.00  In:00000290573  Out:00000033122         2.90
times/tinycensus.snappy.parq pandas  Total:001.12  Load:000.95  Aggregate1:000.17  Aggregate2:000.00  In:00000491241  Out:00000033122         2.73
(filetimes) bash-3.2$ 

Here the output from the first command shows the writing times for each format, in seconds, which are very short for this small file. For reading, files can be loaded into either a Pandas or a Dask dataframe, which perform differently, so both versions are tested where supported. For each read, several measurements are shown:

  • Load: The time taken by the load step.
  • Aggregate1: The time taken by an aggregated plot of the entire dataset into a small aggregate array.
  • Total: Load+Aggregate1, i.e., the time to get the first result from a fresh start. This is usually the number to look at, because dask's lazy loading means that loading time can show up under either of the two preceding values.
  • Aggregate2: The time taken by an aggregated plot of the entire dataset into a somewhat larger array, after the dataset has already passed through memory in the previous step.
  • In: Size of the input file (which varies a lot between formats).
  • Out: Size of the final .png output. It is crucial to check that this value is the same for all formats, and to visually compare the PNGs in times/censuspng, to ensure that the results are actually correct. Problems with certain formats will often produce blank PNGs, which have a different (and very small) size.
  • (last column): Real time for the entire filetimes.py script to complete. For large files this should be approximately equal to Total+Aggregate2, indicating that other processing time is minimal.

The results from tinycensus are important mainly to verify that everything is working; presumably file reading speed is only actually crucial for larger files. The results for the full 300-million-row census dataset are currently:

times/census.csv             dask    Total:129.54  Load:000.14  Aggregate1:129.40  Aggregate2:133.04  In:10139934963  Out:00000625044       267.59
times/census.h5              dask    Total:054.71  Load:000.05  Aggregate1:054.66  Aggregate2:053.04  In:07760840029  Out:00000625044       113.71
times/census.castra          dask    Total:118.69  Load:000.00  Aggregate1:118.69  Aggregate2:135.53  In:05250720658  Out:00000625044       272.19
times/census.bcolz           dask    Total:024.13  Load:005.94  Aggregate1:018.19  Aggregate2:015.72  In:01795702236  Out:00000625044        44.79
times/census.feather         dask    Not supported         1.66
times/census.parq            dask    Total:076.72  Load:000.01  Aggregate1:076.71  Aggregate2:085.92  In:07475213089  Out:00000625044       169.12
times/census.snappy.parq     dask    Total:155.97  Load:000.01  Aggregate1:155.96  Aggregate2:165.00  In:03582480323  Out:00000625044       327.42
times/census.csv             pandas  Total:213.91  Load:191.70  Aggregate1:022.21  Aggregate2:014.13  In:10139934963  Out:00000625044       236.20
times/census.h5              pandas  Total:088.64  Load:058.08  Aggregate1:030.56  Aggregate2:013.45  In:07760840029  Out:00000625044       109.82
times/census.castra          pandas  Not supported         3.02
times/census.bcolz           pandas  Not supported         1.71
times/census.feather         pandas  Total:040.69  Load:022.69  Aggregate1:018.00  Aggregate2:015.33  In:01533375368  Out:00000002056        81.73
times/census.parq            pandas  Total:068.71  Load:049.84  Aggregate1:018.87  Aggregate2:013.26  In:07475213089  Out:00000625044        88.63
times/census.snappy.parq     pandas  Total:098.23  Load:079.18  Aggregate1:019.04  Aggregate2:013.06  In:03582480323  Out:00000625044       116.67

or from a different run with a few different software library versions:

times/census.csv             dask    Total:116.69  Load:000.15  Aggregate1:116.54  Aggregate2:115.79  In:10139934963  Out:00000625044       236.19
times/census.h5              dask    Total:052.65  Load:000.05  Aggregate1:052.60  Aggregate2:042.76  In:07760840029  Out:00000625044       100.40
times/census.castra          dask    Total:131.04  Load:000.00  Aggregate1:131.04  Aggregate2:141.17  In:05250720658  Out:00000625044       312.30
times/census.bcolz           dask    Total:025.33  Load:005.95  Aggregate1:019.38  Aggregate2:016.08  In:01795702236  Out:00000625044        46.46
times/census.feather         dask    Not supported         1.57
times/census.parq            dask    Total:091.25  Load:000.01  Aggregate1:091.24  Aggregate2:088.31  In:07475213089  Out:00000625044       186.99
times/census.snappy.parq     dask    Total:145.52  Load:000.01  Aggregate1:145.51  Aggregate2:163.57  In:03582480323  Out:00000625044       314.87
times/census.csv             pandas  Total:210.19  Load:183.96  Aggregate1:026.24  Aggregate2:013.53  In:10139934963  Out:00000625044       230.89
times/census.h5              pandas  Total:102.16  Load:063.98  Aggregate1:038.18  Aggregate2:013.74  In:07760840029  Out:00000625044       124.10
times/census.castra          pandas  Not supported         2.68
times/census.bcolz           pandas  Not supported         1.61
times/census.feather         pandas  Total:043.65  Load:023.43  Aggregate1:020.22  Aggregate2:015.92  In:01533375368  Out:00000002056        85.99
times/census.parq            pandas  Total:073.02  Load:053.20  Aggregate1:019.82  Aggregate2:013.70  In:07475213089  Out:00000625044        93.16
times/census.snappy.parq     pandas  Total:098.05  Load:077.45  Aggregate1:020.60  Aggregate2:013.55  In:03582480323  Out:00000625044       117.38

These numbers are real, in the sense that they were the actual timings on this MacBook Pro laptop, but they should all be considered preliminary. For one thing, castra is currently very slow compared to the other formats, yet in previous manual tests it was much faster than HDF5, so there may be some configuration settings that need to be applied for good performance in this case. Updated results will be posted here as these settings are worked out.

@jbednar (Member, Author) commented Nov 30, 2016

Update: The above results were for Dask with no cache enabled. Enabling it (as in the current master of filetimes.sh) greatly improves performance with dask on Aggregate2, as would be expected:

(filetimes) bash-3.2$ /usr/bin/time ./filetimes.sh times/census
times/census.csv             dask    Total:124.91  Load:000.14  Aggregate1:124.77  Aggregate2:054.95  In:10139934963  Out:00000625044       185.00
times/census.h5              dask    Total:056.64  Load:000.05  Aggregate1:056.59  Aggregate2:015.70  In:07760840029  Out:00000625044        81.85
times/census.castra          dask    Total:118.02  Load:000.00  Aggregate1:118.01  Aggregate2:003.97  In:05250720658  Out:00000625044       137.81
times/census.bcolz           dask    Total:027.87  Load:006.20  Aggregate1:021.67  Aggregate2:001.74  In:01795702236  Out:00000625044        36.23
times/census.feather         dask    Not supported         2.41
times/census.parq            dask    Total:105.49  Load:000.01  Aggregate1:105.49  Aggregate2:053.31  In:07475213089  Out:00000625044       167.00
times/census.snappy.parq     dask    Total:168.46  Load:000.01  Aggregate1:168.45  Aggregate2:078.17  In:03582480323  Out:00000625044       255.90
times/census.csv             pandas  Total:215.78  Load:191.66  Aggregate1:024.11  Aggregate2:013.50  In:10139934963  Out:00000625044       236.78
times/census.h5              pandas  Total:101.69  Load:066.39  Aggregate1:035.30  Aggregate2:014.50  In:07760840029  Out:00000625044       123.70
times/census.castra          pandas  Not supported         2.72
times/census.bcolz           pandas  Not supported         1.62
times/census.feather         pandas  Total:045.90  Load:025.97  Aggregate1:019.93  Aggregate2:015.59  In:01533375368  Out:00000002056        89.52
times/census.parq            pandas  Total:072.06  Load:052.15  Aggregate1:019.92  Aggregate2:013.50  In:07475213089  Out:00000625044        91.91
times/census.snappy.parq     pandas  Total:103.29  Load:081.55  Aggregate1:021.73  Aggregate2:013.24  In:03582480323  Out:00000625044       122.05
     1540.17 real      1689.98 user      1000.99 sys

bcolz+dask is still by far the fastest on every measure. Other tests are underway changing the chunking sizes used by dask, which affect how well the operations parallelize, and those results will be posted here when available.
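
The cache mentioned above is dask's opportunistic (cachey-based) cache; enabling it is roughly a one-liner (the 9e9-byte size is just an example value, not necessarily the setting used in filetimes.sh):

from dask.cache import Cache

# Register a global opportunistic cache so that repeated aggregations (such as
# Aggregate2) can reuse intermediate results instead of re-reading from disk
Cache(9e9).register()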

@martindurant

Updated times, reading only, no compression, for fastparquet on my system:
Original sequential read with fastparquet: 24.5s
Improved sequential read as of 0.0.3: 22.2s
Using fixed-length dtype for 'race' column, no index and no nulls: 9.7s
With dask-distributed and three processes: 5.3s

@jbednar (Member, Author) commented Dec 2, 2016

Because the full set of files takes hours to generate, the previous full-census results were from a saved set of files whose generation parameters are by now unknown. Here's the result of an overnight run of everything on my system:

$ rm -rf times/cen* ; python -c "import filetimes ; filetimes.base='census' ; filetimes.categories=['race']; filetimes.timed_write('data/census.h5',dftype='pandas')"
times/census.feather         pandas  35.89
times/census.h5              pandas  394.01
times/census.bcolz           pandas  49.64
times/census.parq            pandas  134.26
times/census.csv             pandas  1329.53
times/census.castra          pandas  317.59
times/census.snappy.parq     pandas  210.03
times/census.gz.parq         pandas  11804.33
    14520.56 real     13644.26 user       565.59 sys
$ /usr/bin/time ./filetimes.sh times/census
times/census.csv             dask    Total:118.72  Load:000.14  Aggregate1:118.58  Aggregate2:051.88  In:10139934963  Out:00000625044  177.85
times/census.h5              dask    Total:541.59  Load:000.02  Aggregate1:541.58  Aggregate2:255.61  In:07705345127  Out:00000625044  832.28
times/census.castra          dask    Total:112.48  Load:000.01  Aggregate1:112.47  Aggregate2:003.93  In:05250720658  Out:00000625044  134.10
times/census.bcolz           dask    Total:028.38  Load:005.87  Aggregate1:022.51  Aggregate2:001.66  In:01795702236  Out:00000625044   36.52
times/census.feather         dask    Not supported         2.47
times/census.parq            dask    Total:087.59  Load:000.01  Aggregate1:087.58  Aggregate2:046.06  In:07475213089  Out:00000625044  142.89
times/census.gz.parq         dask    Total:158.87  Load:000.01  Aggregate1:158.86  Aggregate2:072.41  In:01860120218  Out:00000625044  240.10
times/census.snappy.parq     dask    Total:146.52  Load:000.01  Aggregate1:146.51  Aggregate2:073.10  In:03585367907  Out:00000625044  230.40
times/census.csv             pandas  Total:204.22  Load:178.74  Aggregate1:025.47  Aggregate2:013.31  In:10139934963  Out:00000625044  224.74
times/census.h5              pandas  Total:313.12  Load:264.30  Aggregate1:048.82  Aggregate2:013.34  In:07705345127  Out:00000625044  336.00
times/census.castra          pandas  Not supported         2.77
times/census.bcolz           pandas  Not supported         1.53
times/census.feather         pandas  Total:043.99  Load:023.70  Aggregate1:020.29  Aggregate2:014.98  In:01533375368  Out:00000002056   84.77
times/census.parq            pandas  Total:063.88  Load:044.71  Aggregate1:019.17  Aggregate2:012.91  In:07475213089  Out:00000625044   83.22
times/census.gz.parq         pandas  Total:127.13  Load:105.74  Aggregate1:021.39  Aggregate2:012.98  In:01860120218  Out:00000625044  146.85
times/census.snappy.parq     pandas  Total:088.93  Load:069.00  Aggregate1:019.93  Aggregate2:012.98  In:03585367907  Out:00000625044  108.57
     2785.36 real      2932.73 user      1302.05 sys
$ /usr/bin/time ./filetimes.sh times/census
times/census.csv             dask    Total:116.77  Load:000.14  Aggregate1:116.63  Aggregate2:052.46  In:10139934963  Out:00000625044  174.64
times/census.h5              dask    Total:482.55  Load:000.02  Aggregate1:482.54  Aggregate2:241.44  In:07705345127  Out:00000625044  746.60
times/census.castra          dask    Total:110.94  Load:000.01  Aggregate1:110.93  Aggregate2:003.94  In:05250720658  Out:00000625044  130.96
times/census.bcolz           dask    Total:026.75  Load:005.93  Aggregate1:020.82  Aggregate2:001.64  In:01795702236  Out:00000625044   34.95
times/census.feather         dask    Not supported         2.47
times/census.parq            dask    Total:082.89  Load:000.01  Aggregate1:082.88  Aggregate2:059.14  In:07475213089  Out:00000625044  150.18
times/census.gz.parq         dask    Total:156.57  Load:000.01  Aggregate1:156.56  Aggregate2:084.09  In:01860120218  Out:00000625044  249.50
times/census.snappy.parq     dask    Total:147.44  Load:000.01  Aggregate1:147.43  Aggregate2:082.33  In:03585367907  Out:00000625044  240.28
times/census.csv             pandas  Total:207.12  Load:180.03  Aggregate1:027.09  Aggregate2:013.08  In:10139934963  Out:00000625044  227.72
times/census.h5              pandas  Total:301.65  Load:261.25  Aggregate1:040.40  Aggregate2:013.39  In:07705345127  Out:00000625044  326.59
times/census.castra          pandas  Not supported         2.90
times/census.bcolz           pandas  Not supported         1.50
times/census.feather         pandas  Total:045.00  Load:024.45  Aggregate1:020.56  Aggregate2:014.93  In:01533375368  Out:00000002056   85.78
times/census.parq            pandas  Total:065.19  Load:046.27  Aggregate1:018.92  Aggregate2:012.92  In:07475213089  Out:00000625044   84.61
times/census.gz.parq         pandas  Total:127.05  Load:106.20  Aggregate1:020.85  Aggregate2:012.90  In:01860120218  Out:00000625044  146.62
times/census.snappy.parq     pandas  Total:088.29  Load:068.94  Aggregate1:019.36  Aggregate2:013.08  In:03585367907  Out:00000625044  108.02
     2713.56 real      2838.16 user      1330.10 sys

Both of the HDF5 times have increased substantially; I'm not sure why. The rest seem comparable, so presumably I changed some setting for how I generated the HDF5 file. Note that the "feather" results are still invalid -- the output PNG is blank (which is detectable in the "Out" column).

@martindurant, those times sound great -- much better than what I'm currently seeing. How do I need to change my script, and which versions of dask and fastparquet do I need, to be able to reproduce those?

@jbednar (Member, Author) commented Dec 2, 2016

The above results are all for files generated using pandas originally. The filetimes.py script now supports generating files in some formats using dask instead, which does affect the time (presumably because of different chunking options):

$ rm -rf times/cen* ; python -c "import filetimes ; filetimes.base='census' ; filetimes.categories=['race']; filetimes.timed_write('data/census.h5',dftype='dask')"
times/census.feather         dask    Not supported
times/census.parq            dask    129.65
times/census.h5              dask    553.42
times/census.bcolz           dask    Not supported
times/census.castra          dask    Not supported
times/census.csv             dask    Not supported
times/census.snappy.parq     dask    478.76
times/census.gz.parq         dask    2220.69
     3360.93 real      8145.34 user      1120.01 sys
$ /usr/bin/time ./filetimes.sh times/census
times/census.csv             dask    Not supported                                                                                           
times/census.h5              dask    Total:497.82  Load:000.03  Aggregate1:497.79  Aggregate2:232.90  In:07722898138  Out:00000625044  756.61 
times/census.castra          dask    Not supported                                                                                           
times/census.bcolz           dask    Not supported                                                                                           
times/census.feather         dask    Not supported         1.63                                                                              
times/census.parq            dask    Total:074.52  Load:000.00  Aggregate1:074.52  Aggregate2:045.22  In:05021807724  Out:00000625044  130.96
times/census.gz.parq         dask    Total:117.52  Load:000.00  Aggregate1:117.52  Aggregate2:063.45  In:01397611604  Out:00000625044  191.23
times/census.snappy.parq     dask    Total:111.86  Load:000.00  Aggregate1:111.86  Aggregate2:059.17  In:02357899187  Out:00000625044  184.54
times/census.csv             pandas  Not supported                                                                                           
times/census.h5              pandas  Total:312.11  Load:263.17  Aggregate1:048.94  Aggregate2:013.37  In:07722898138  Out:00000625044  335.17
times/census.castra          pandas  Not supported         2.90                                                                              
times/census.bcolz           pandas  Not supported         1.58                                                                              
times/census.feather         pandas  Not supported                                                                                           
times/census.parq            pandas  Total:057.55  Load:035.45  Aggregate1:022.10  Aggregate2:013.07  In:05021807724  Out:00000625044   75.74
times/census.gz.parq         pandas  Total:101.43  Load:081.17  Aggregate1:020.25  Aggregate2:013.39  In:01397611604  Out:00000625044  121.05
times/census.snappy.parq     pandas  Total:075.29  Load:053.47  Aggregate1:021.83  Aggregate2:012.90  In:02357899187  Out:00000625044   94.48
     1908.12 real      1528.57 user       702.84 sys
$ /usr/bin/time ./filetimes.sh times/census
times/census.csv             dask    Not supported                                                                                           
times/census.h5              dask    Total:485.79  Load:000.02  Aggregate1:485.76  Aggregate2:227.66  In:07722898138  Out:00000625044  741.37
times/census.castra          dask    Not supported                                                                                           
times/census.bcolz           dask    Not supported                                                                                           
times/census.feather         dask    Not supported         1.55                                                                              
times/census.parq            dask    Total:070.43  Load:000.00  Aggregate1:070.43  Aggregate2:047.29  In:05021807724  Out:00000625044  128.19
times/census.gz.parq         dask    Total:118.02  Load:000.01  Aggregate1:118.02  Aggregate2:062.95  In:01397611604  Out:00000625044  190.70
times/census.snappy.parq     dask    Total:110.53  Load:000.00  Aggregate1:110.53  Aggregate2:062.98  In:02357899187  Out:00000625044  184.11
times/census.csv             pandas  Not supported                                                                                           
times/census.h5              pandas  Total:319.84  Load:271.32  Aggregate1:048.52  Aggregate2:013.36  In:07722898138  Out:00000625044  343.52
times/census.castra          pandas  Not supported         2.71                                                                              
times/census.bcolz           pandas  Not supported         1.47                                                                              
times/census.feather         pandas  Not supported                                                                                           
times/census.parq            pandas  Total:056.44  Load:035.18  Aggregate1:021.26  Aggregate2:013.38  In:05021807724  Out:00000625044   74.96
times/census.gz.parq         pandas  Total:101.46  Load:080.98  Aggregate1:020.48  Aggregate2:013.03  In:01397611604  Out:00000625044  120.90
times/census.snappy.parq     pandas  Total:074.62  Load:053.64  Aggregate1:020.98  Aggregate2:013.08  In:02357899187  Out:00000625044   93.93
     1895.15 real      1531.25 user       683.85 sys

For all the file types currently supported by filetimes.py (which doesn't yet include everything dask supports), the reading time is lower when the files are created using dask with these options, though usually not dramatically so.

@martindurant

The best parameters I could get to work:

write["parq"]["pandas"] = lambda df,filepath: fp.write(filepath, df, file_scheme='hive')
=>
write["parq"]["pandas"] = lambda df,filepath: fp.write(filepath, df, file_scheme='hive', fixed_text={'race': 1}, has_nulls=0, write_index=False)
where the dtype of the 'race' column is object/bytes, i.e. df['race'] = df['race'].str.encode('utf8').

On load, include the keyword index=False.
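
On the dask side, that read would look roughly like the following (column names taken from the census example; a sketch rather than the exact filetimes.py code):

import dask.dataframe as dd

# index=False avoids reconstructing an index from the parquet metadata,
# which is unnecessary for datashader aggregations
df = dd.read_parquet('data/census.parq', index=False,
                     columns=['meterswest', 'metersnorth', 'race'])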

@jbednar (Member, Author) commented Dec 2, 2016

Thanks! Using latest fastparquet master?

@martindurant

Yes, which is the same as 0.0.3 in conda-forge (I think OSX is still waiting to build)

@jbednar (Member, Author) commented Dec 2, 2016

@martindurant, what is the effect of doing .str.encode('utf8') instead of .astype(str)?

@martindurant

This makes the data byte-strings rather than strings assumed to be utf8, so decoding is not necessary on read. Decoding time is insignificant for categoricals, but if you use the fixed-string encoding it is significant, so you may as well avoid it. That means that your labels after aggregation will be b'w', b'l', etc.
(I realize that the labels are identical whether encoded or not, but there is no way to simply "cast" between byte-strings and strings in Python - perhaps that is something Numba or Cython could do if we assume the encoding is trivial.)

@martindurant

You might want to repeat your benchmarks on a system that has plenty of memory for all the data and intermediates, so that you are not affected by swapping, and also measure the peak memory usage of loading each of the various formats. The caching story is separate... Note that if you use the distributed scheduler (even on a single machine), you can explicitly persist the data, which, since it is explicit, gives you better control over what gets stored.
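
A minimal sketch of that explicit persistence with the distributed scheduler (the three worker processes echo the fastparquet timings above; the file path and worker count are just placeholders):

from dask.distributed import Client
import dask.dataframe as dd

# Local "cluster" of three worker processes on this machine
client = Client(n_workers=3)

df = dd.read_parquet('data/census.parq', index=False)

# Explicitly keep the lazily defined dataframe in distributed memory,
# so later aggregations do not re-read from disk
df = client.persist(df)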

@jbednar (Member, Author) commented Dec 2, 2016

Good idea. In principle, I would think there is plenty of memory available, since the .csv file takes 9 GB on disk, and any reasonable in-memory format for 600 million floats and 300 million one-character strings should take less than that. So if there are swapping issues, they would surely indicate that intermediates are being stored unnecessarily or that the in-memory format is very inefficient. Explicitly persisting the data would help sort out some of these issues, though it needs to be done per column, because we want to reveal the advantages of per-column reading for formats that support it (in other tests, not shown here, on data with many unused columns).

@jbednar (Member, Author) commented Dec 2, 2016

Thanks. I made the changes you suggested above for parquet and pushed them to master, so hopefully I got them correct. The results now are indeed better for Parquet:

$ /usr/bin/time ./filetimes.sh times/census
times/census.csv             dask    Total:136.28  Load:000.14  Aggregate1:136.14  Aggregate2:060.08  In:10139934963  Out:00000625044       201.18
times/census.h5              dask    Total:067.60  Load:000.05  Aggregate1:067.55  Aggregate2:027.37  In:07760840029  Out:00000625044       104.74
times/census.castra          dask    Total:122.03  Load:000.01  Aggregate1:122.02  Aggregate2:004.04  In:05250720658  Out:00000625044       140.84
times/census.bcolz           dask    Total:028.19  Load:006.23  Aggregate1:021.96  Aggregate2:001.72  In:01795702236  Out:00000625044        37.07
times/census.feather         dask    Not supported         2.49
times/census.parq            dask    Total:067.30  Load:000.01  Aggregate1:067.29  Aggregate2:027.79  In:05021810665  Out:00000625044       105.03
times/census.snappy.parq     dask    Total:085.43  Load:000.01  Aggregate1:085.43  Aggregate2:037.79  In:02355045362  Out:00000625044       132.63
times/census.csv             pandas  Total:230.92  Load:204.59  Aggregate1:026.33  Aggregate2:013.39  In:10139934963  Out:00000625044       251.66
times/census.h5              pandas  Total:095.84  Load:060.94  Aggregate1:034.91  Aggregate2:013.78  In:07760840029  Out:00000625044       116.71
times/census.castra          pandas  Not supported         2.87
times/census.bcolz           pandas  Not supported         1.69
times/census.feather         pandas  Total:043.86  Load:023.53  Aggregate1:020.33  Aggregate2:016.51  In:01533375368  Out:00000002056        90.23
times/census.parq            pandas  Total:054.44  Load:034.10  Aggregate1:020.33  Aggregate2:014.03  In:05021810665  Out:00000625044        74.72
times/census.gz.parq         pandas  Not supported         2.46
times/census.snappy.parq     pandas  Total:073.86  Load:052.23  Aggregate1:021.64  Aggregate2:013.87  In:02355045362  Out:00000625044        93.24
     1318.26 real      1651.54 user       681.86 sys

Still, bcolz+dask remains by far the fastest, even after the initial loading. I'll try to figure out how to make the dataframe persist in a clean way, which should make all the different dask filetypes have similar timings for Aggregate2 (like they do for the different pandas filetypes).

(These results are for files created using dask where supported, and if not then created using pandas, and skipping .gz.parq because it was taking too long to write that file.)

@jbednar (Member, Author) commented Dec 22, 2016

Work on fastparquet continues apace, and re-running the benchmark with the Dec 22 version of fastparquet (now specified in the filetimes.sh instructions) gives greatly improved speeds for the Parquet format:

times/census.parq            dask    Aggregate1:037.16 (000.01+037.15)  Aggregate2:001.18  In:05213482659  Out:00000625044  Total:042.54
times/census.snappy.parq     dask    Aggregate1:070.36 (000.01+070.35)  Aggregate2:001.05  In:02376471320  Out:00000625044  Total:077.14
times/census.parq            pandas  Aggregate1:042.39 (023.11+019.28)  Aggregate2:013.35  In:05213482659  Out:00000625044  Total:060.83
times/census.snappy.parq     pandas  Aggregate1:059.63 (040.14+019.49)  Aggregate2:013.03  In:02376471320  Out:00000625044  Total:077.39
times/census.castra          dask    Aggregate1:130.50 (000.01+130.50)  Aggregate2:004.03  In:05250720658  Out:00000625044  Total:139.68
times/census.bcolz           dask    Aggregate1:048.71 (007.05+041.65)  Aggregate2:008.84  In:01795702236  Out:00000625044  Total:062.66
times/census.h5              dask    Aggregate1:450.60 (000.02+450.58)  Aggregate2:142.60  In:07705345127  Out:00000625044  Total:599.94
times/census.h5              pandas  Aggregate1:308.01 (265.23+042.77)  Aggregate2:013.24  In:07705345127  Out:00000625044  Total:326.66
times/census.csv             dask    Aggregate1:119.69 (000.14+119.55)  Aggregate2:015.55  In:10139934963  Out:00000625044  Total:140.86
times/census.csv             pandas  Aggregate1:211.87 (191.30+020.57)  Aggregate2:013.58  In:10139934963  Out:00000625044  Total:230.43

Here the display format has changed slightly: the old "Total" column has been replaced by "Aggregate1", with the loading and aggregating portions shown in parentheses, to emphasize that the only number meaningful across dataframe types is the combined time. The final column (confusingly now named "Total") is also now computed from within Python, and thus ignores the startup time of Python itself; this change was necessary because of problems with how /usr/bin/time was handling return values.

Here's an easier-to-compare version listing these times against the ones from the previous comment:

times/census.parq            dask    Aggregate1:037.16 (was 067.30)  Aggregate2:001.18  (was 027.79)
times/census.snappy.parq     dask    Aggregate1:070.36 (was 085.43)  Aggregate2:001.05  (was 037.79)
times/census.parq            pandas  Aggregate1:042.39 (was 054.44)  Aggregate2:013.35  (was 014.03)
times/census.snappy.parq     pandas  Aggregate1:059.63 (was 073.86)  Aggregate2:013.03  (was 013.87)
times/census.castra          dask    Aggregate1:130.50 (was 122.03)  Aggregate2:004.03  (was 004.04)
times/census.bcolz           dask    Aggregate1:048.71 (was 028.19)  Aggregate2:008.84  (was 001.72)
times/census.h5              dask    Aggregate1:450.60 (was 067.60)  Aggregate2:142.60  (was 027.37)
times/census.h5              pandas  Aggregate1:308.01 (was 095.84)  Aggregate2:013.24  (was 013.78)
times/census.csv             dask    Aggregate1:119.69 (was 136.28)  Aggregate2:015.55  (was 060.08)
times/census.csv             pandas  Aggregate1:211.87 (was 230.92)  Aggregate2:013.58  (was 013.39)

All of the Parquet times are now reduced, some greatly so, apart from the Aggregate2 with pandas times (which should always be independent of file format in any case). There is still a surprisingly large range in Aggregate2 times across file formats with dask, indicating that the dask cache is not being very effective in this case.

Oddly, the bcolz, castra, and h5 performance has dropped significantly. The underlying code for those formats should not have changed, though changes in dask could have affected the speed. More likely, the difference is due to changes in filetimes.py itself, perhaps from choosing different options for encoding the categorical column, which should be investigated. In any case, the Parquet results are now approaching (for Aggregate1) or exceeding (for Aggregate2) the best ever seen from bcolz and castra, and since Parquet is a well-established format, it seems like a good idea to switch to it.

@gbrener (Contributor) commented Apr 7, 2017

I'm looking into the performance mystery over the next week or so; I'll likely be posting questions and/or observations to this issue as I make progress.

@gbrener (Contributor) commented Apr 13, 2017

Interesting results with PR #309. These seem to suggest that the df.persist() method in Dask is preloading the data but (still) loading from disk during Agg2, whereas the dask.cache.Cache(cachesize).register() function is (more properly) caching the data:

$ python filetimes.py times/census.parq dask census meterswest metersnorth race --debug
DEBUG: Cache disabled
DEBUG: Memory usage (before read):      131.2768 MB
DEBUG: read_parquet(times/census.parq, columns=['meterswest', 'metersnorth', 'race'], index=False)
DEBUG: Memory usage (after read):       131.694592 MB
DEBUG: DataFrame size:                  5213.477938 MB
DEBUG: Memory usage (after agg1):       7849.742336 MB
DEBUG: Memory usage (after agg2):       7849.742336 MB
times/census.parq            dask    Aggregate1:015.22 (000.01+015.22)  Aggregate2:016.32  In:05213483268  Out:00000625044  Total:043.40
$ python filetimes.py times/census.parq dask census meterswest metersnorth race --debug --cache persist
DEBUG: Cache "persist" mode enabled
DEBUG: Memory usage (before read):      131.649536 MB
DEBUG: read_parquet(times/census.parq, columns=['meterswest', 'metersnorth', 'race'], index=False)
DEBUG: Force-loading Dask dataframe
DEBUG: Memory usage (after read):       7142.629376 MB
DEBUG: DataFrame size:                  5213.477938 MB
DEBUG: Memory usage (after agg1):       8020.066304 MB
DEBUG: Memory usage (after agg2):       8020.066304 MB
times/census.parq            dask    Aggregate1:024.66 (006.16+018.50)  Aggregate2:013.83  In:05213483268  Out:00000625044  Total:049.71
$ python filetimes.py times/census.parq dask census meterswest metersnorth race --debug --cache cachey
DEBUG: Cache "cachey" mode enabled
DEBUG: Memory usage (before read):      132.272128 MB
DEBUG: read_parquet(times/census.parq, index=False, columns=['meterswest', 'metersnorth', 'race'])
DEBUG: Memory usage (after read):       132.763648 MB
DEBUG: DataFrame size:                  5213.477938 MB
DEBUG: Memory usage (after agg1):       8800.44032 MB
DEBUG: Memory usage (after agg2):       8800.44032 MB
times/census.parq            dask    Aggregate1:026.67 (000.01+026.67)  Aggregate2:001.67  In:05213483268  Out:00000625044  Total:035.23

As a baseline, here are the results with Pandas:

python filetimes.py times/census.parq pandas census meterswest metersnorth race --debug
DEBUG: Cache disabled
DEBUG: Memory usage (before read):      132.66944 MB
DEBUG: read_parq_pandas(times/census.parq)
DEBUG: Memory usage (after read):       5700.501504 MB
DEBUG: DataFrame size:                  5213.475478 MB
DEBUG: Memory usage (after agg1):       8110.46912 MB
DEBUG: Memory usage (after agg2):       8114.21696 MB
times/census.parq            pandas  Aggregate1:023.63 (006.52+017.11)  Aggregate2:014.41  In:05213483268  Out:00000625044  Total:043.40

@jbednar - we can achieve the ~2-second Agg2 time when using the dask.cache.Cache(cachesize).register() method.

@gbrener (Contributor) commented Apr 17, 2017

Updated results (on PR #309), after reordering the _optimize and update calls in the dask_pipeline function. There's an issue with CSV file size detection, but besides that the times look promising:

With dask-generated files:

./filetimes.sh times/census persist
times/census.parq            dask    Aggregate1:011.63 (005.98+005.65)  Aggregate2:004.24  In:05213483268  Out:00000625044  Total:020.41
times/census.snappy.parq     dask    Aggregate1:013.97 (009.45+004.51)  Aggregate2:004.65  In:02377124733  Out:00000625044  Total:023.84
times/census.parq            pandas  Aggregate1:021.58 (006.54+015.04)  Aggregate2:014.32  In:05213483268  Out:00000625044  Total:040.94
times/census.snappy.parq     pandas  Aggregate1:033.89 (018.85+015.04)  Aggregate2:014.41  In:02377124733  Out:00000625044  Total:053.53
times/census.castra          dask    Operation not supported
times/census.castra          pandas  Operation not supported
times/census.bcolz           dask    File does not exist
times/census.h5              dask    Aggregate1:233.56 (224.97+008.60)  Aggregate2:005.55  In:07723157059  Out:00000625044  Total:243.79
times/census.h5              pandas  Aggregate1:277.92 (246.40+031.52)  Aggregate2:014.34  In:07723157059  Out:00000625044  Total:298.16
times/census*.csv            dask    Aggregate1:059.85 (055.10+004.75)  Aggregate2:004.38  In:00000000000  Out:00000625044  Total:070.48
times/census0.csv            pandas  Aggregate1:019.57 (017.49+002.08)  Aggregate2:001.93  In:01051457188  Out:00000480166  Total:025.06
times/census.feather         pandas  File does not exist

With pandas-generated files:

$ ./filetimes.sh times/census persist
times/census.parq            dask    Aggregate1:014.07 (008.15+005.92)  Aggregate2:005.06  In:05213483213  Out:00000625044  Total:024.79
times/census.snappy.parq     dask    Aggregate1:014.04 (009.27+004.77)  Aggregate2:004.65  In:02377123600  Out:00000625044  Total:024.02
times/census.parq            pandas  Aggregate1:023.36 (006.58+016.78)  Aggregate2:013.86  In:05213483213  Out:00000625044  Total:042.32
times/census.snappy.parq     pandas  Aggregate1:031.99 (017.84+014.15)  Aggregate2:013.80  In:02377123600  Out:00000625044  Total:050.92
times/census.castra          dask    Operation not supported
times/census.castra          pandas  Operation not supported
times/census.bcolz           dask    Aggregate1:031.03 (027.31+003.73)  Aggregate2:004.09  In:01660936168  Out:00000625044  Total:039.61
times/census.h5              dask    Aggregate1:223.18 (214.88+008.29)  Aggregate2:004.59  In:07705345127  Out:00000625044  Total:232.28
times/census.h5              pandas  Aggregate1:280.76 (250.59+030.17)  Aggregate2:014.64  In:07705345127  Out:00000625044  Total:301.09
times/census*.csv            dask    Aggregate1:060.40 (055.90+004.50)  Aggregate2:004.88  In:00000000000  Out:00000625044  Total:072.09
times/census.csv             pandas  Aggregate1:172.49 (150.78+021.71)  Aggregate2:014.94  In:07184296032  Out:00000625044  Total:192.69
times/census.feather         pandas  Aggregate1:053.86 (034.67+019.19)  Aggregate2:014.22  In:01533375368  Out:00000002056  Total:093.13

@jbednar jbednar added this to the 0.5.0 milestone Apr 26, 2017
@jbednar jbednar assigned gbrener and unassigned jbednar May 2, 2017
@gbrener (Contributor) commented May 6, 2017

Updated results from PR #344, after several rounds of performance optimizations, fixing issues with the various file formats in filetimes.py, passing in cached range calculations for the second aggregation, and using the threaded scheduler plus dask's df.persist():

$ ./filetimes.sh times/census persist
times/census.parq            dask    Aggregate1:021.94 (010.49+011.45)  Aggregate2:000.28  In:05213483268  Out:00000626699  Total:027.03
times/census.snappy.parq     dask    Aggregate1:018.01 (009.77+008.24)  Aggregate2:000.29  In:02377124733  Out:00000626699  Total:022.88
times/census.gz.parq         dask    Aggregate1:041.77 (033.62+008.15)  Aggregate2:000.29  In:01394613858  Out:00000626699  Total:046.61
times/census.bcolz           dask    Aggregate1:036.29 (027.95+008.34)  Aggregate2:000.91  In:01660936168  Out:00000626699  Total:041.70
times/census.h5              dask    Aggregate1:244.40 (224.95+019.45)  Aggregate2:000.49  In:07723157059  Out:00000626699  Total:249.41
times/census.csv             dask    Aggregate1:061.99 (053.82+008.17)  Aggregate2:000.53  In:07184296200  Out:00000626699  Total:067.74
times/census.feather         dask    Aggregate1:082.08 (062.49+019.60)  Aggregate2:000.30  In:01533375368  Out:00000002047  Total:105.06
times/census.parq            pandas  Aggregate1:008.17 (006.20+001.96)  Aggregate2:001.71  In:05213483268  Out:00000626699  Total:014.26
times/census.snappy.parq     pandas  Aggregate1:020.72 (018.73+001.99)  Aggregate2:001.68  In:02377124733  Out:00000626699  Total:026.75
times/census.gz.parq         pandas  Aggregate1:055.14 (053.11+002.03)  Aggregate2:001.73  In:01394613858  Out:00000626699  Total:061.21
times/census.bcolz           pandas  Aggregate1:044.52 (036.31+008.21)  Aggregate2:001.70  In:01660936168  Out:00000626699  Total:051.21
times/census.h5              pandas  Aggregate1:263.38 (246.69+016.69)  Aggregate2:001.78  In:07723157059  Out:00000626699  Total:269.70
times/census.csv             pandas  Aggregate1:159.89 (151.30+008.60)  Aggregate2:001.70  In:07184296200  Out:00000626699  Total:166.32
times/census.feather         pandas  Aggregate1:042.70 (034.43+008.26)  Aggregate2:001.84  In:01533375368  Out:00000002047  Total:068.14

Just noticed that the feather format did not work properly. Please disregard those rows.

@martindurant

Those times we can live with, right?

@jbednar (Member, Author) commented May 7, 2017

Yes! Those are excellent times. :-) Thanks for all your help, both of you!

@jbednar jbednar mentioned this issue May 8, 2017
@jbednar (Member, Author) commented May 8, 2017

Closing this; the remaining work to make sure we follow these recommendations is listed in #325.

@jbednar jbednar closed this as completed May 8, 2017
@jbednar jbednar removed the ready label May 8, 2017
@jbednar (Member, Author) commented May 10, 2017

For reference, here's how we are creating the new census.snappy.parq file that we will distribute, based on the findings above:

import dask.dataframe as dd
import numpy as np

# Read only the three needed columns from the HDF5 file, in large chunks (dask partitions)
df = dd.read_hdf("census.h5", key="census", columns=['meterswest', 'metersnorth', 'race'], chunksize=76668751)

# Rename to standard easting/northing column names and shrink the dtypes
df = df.rename(columns={'metersnorth': 'northing', 'meterswest': 'easting'})
df['northing'] = df['northing'].astype(np.float32)
df['easting'] = df['easting'].astype(np.float32)
df['race'] = df['race'].astype('category')

# Sanity-check the data before writing
print(df.tail())
print(df.dtypes)

print("write parquet")
dd.to_parquet("census.snappy.parq", df, compression="SNAPPY")

And then it gets read in using:

import dask.dataframe as dd

df = dd.io.parquet.read_parquet('data/census.snappy.parq')
df = df.persist()

The df.persist() call ensures that data is kept in memory, which is the right thing to do if it will be used repeatedly and if it fits, but it can be omitted if there isn't enough memory or if you're only doing one aggregation anyway.
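
For completeness, a typical datashader aggregation over that persisted dataframe looks something like the following (the canvas size is arbitrary; this is an illustrative sketch, not the notebook's exact code):

import datashader as ds
import datashader.transfer_functions as tf

# Aggregate the full dataset onto a 900x525 grid, counting points per race category
cvs = ds.Canvas(plot_width=900, plot_height=525)
agg = cvs.points(df, 'easting', 'northing', ds.count_cat('race'))

# Shade the aggregate array into an image
img = tf.shade(agg)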

jbcrail added a commit to jbcrail/datashader that referenced this issue May 11, 2017
This is based on @gbrener's writeup on holoviz#129 and holoviz#313.

Related to holoviz#325
@benepo commented Sep 22, 2017

@jbednar any chance you can rerun the test with ORC+snappy? Our use case favors a format that supports ACID queries and I'm curious if ORC outperforms parquet.

@martindurant

@benepo, I don't know of any way to read ORC files in Python. python-orc actually reads in Java and passes records over a socket, so there's no way you'll get good performance out of that.

@benepo commented Sep 23, 2017

@martindurant - good point. I hadn't realized ORC support was limited to Java.

On a separate note, I wonder how the Dask read performance shown above compares to Apache Arrow: https://arrow.apache.org/docs/python/parquet.html

@martindurant

dask is able to read parquet via either fastparquet or Arrow's parquet-cpp. Arrow has shown superior performance in some tests (read only), but the difference is not great, and fastparquet has a wider set of options and capabilities.

@benepo commented Sep 23, 2017

That's good to know - thanks. I was having some difficulty finding performance benchmarks between them.

ref: https://github.com/elastacloud/parquet-dotnet/issues/171

@ahundt commented Jun 9, 2018

Right now, our examples use CSV and castra or HDF5 formats.

@jbednar Could you point me to the HDF5 example? Googling for it took me to this page.
