Modin doesn't partition DataFrames read from a parquet file if the file itself isn't partitioned #5296
Labels: P1 (important tasks that we should complete soon), pandas.io, Performance 🚀 (performance-related issues and pull requests)
The current implementation of `.read_parquet()` relies entirely on the partitioning provided by the parquet file itself (row groups, column chunking), but this partitioning is not always a good fit. Apache's recommended configuration considers the optimal row-group size to be 1GB [1], which works terribly for Modin: the whole dataframe can end up in only one or two partitions, so well-parallelized implementations perform far worse than they would with proper partitioning.
Consider this example: I read a single-row-group parquet file with only 100_000 rows and then apply a simple function to the dataframe. The dataframe read with the parquet partitioning scheme performs 4x slower than a properly partitioned one, and the performance gap grows as the dataset gets bigger.
Reproducer
I wonder if we could avoid fully relying on the partitioning provided by a parquet file and instead fall back to our own partitioning scheme when the provided one is not good enough.
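As a rough illustration of what such a fallback could look like, here is a minimal sketch of a row-splitting heuristic that ignores the file's row groups and splits rows evenly across up to CPU-count partitions. The function name, parameters, and thresholds are all hypothetical, not part of Modin's API:

```python
import math
import os


def balanced_row_splits(total_rows, min_partition_rows=10_000, max_partitions=None):
    """Hypothetical heuristic: split `total_rows` evenly into up to
    `max_partitions` (start, end) row ranges (default: CPU count),
    keeping each partition at least `min_partition_rows` long."""
    if max_partitions is None:
        max_partitions = os.cpu_count() or 1
    # Don't create more partitions than the row count supports.
    n_parts = min(max_partitions, max(1, total_rows // min_partition_rows))
    chunk = math.ceil(total_rows / n_parts)
    splits = []
    start = 0
    while start < total_rows:
        end = min(start + chunk, total_rows)
        splits.append((start, end))
        start = end
    return splits
```

With this, the 100_000-row single-row-group file from the example above would be read as eight 12_500-row partitions on an 8-core machine instead of one giant partition, regardless of how the writer chose its row groups.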