[DOCS] Document tuning Spark for writing large raster geometries to Parquet #833

Merged 2 commits on May 17, 2023
4 changes: 4 additions & 0 deletions docs/tutorial/raster.md
@@ -17,3 +17,7 @@ Starting from `v1.1.0`, Sedona SQL supports raster data sources and raster operations
## Tutorials

[Python Jupyter Notebook](https://github.com/apache/sedona/blob/master/binder/ApacheSedonaRaster.ipynb)

## Performance

[Storing large raster geometries in Parquet files](../storing-blobs-in-parquet)
51 changes: 51 additions & 0 deletions docs/tutorial/storing-blobs-in-parquet.md
@@ -0,0 +1,51 @@
# Storing large raster geometries in Parquet files

!!!warning
    Always convert raster geometries to a well-known format with the RS_AsXXX functions before saving them.
    It is possible to save the raw bytes of the raster geometries, but they will be stored in an internal Sedona format that is not guaranteed to be stable across versions.
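A rough sketch of that conversion, assuming a SparkSession `spark` with Sedona's SQL functions registered and a Sedona version that provides `RS_FromGeoTiff` and `RS_AsGeoTiff` (substitute the RS_AsXXX function for your target format); the paths are placeholders:

```python
# Sketch only: `spark` is assumed to be a Sedona-enabled SparkSession, and the
# input path and RS_AsGeoTiff are placeholders for your data and target format.
rasters = (
    spark.read.format("binaryFile")
    .load("/data/rasters/*.tif")
    .selectExpr("path", "RS_FromGeoTiff(content) AS rast")
)

# Convert back to a well-known format before writing to Parquet.
(
    rasters.selectExpr("path", "RS_AsGeoTiff(rast) AS geotiff")
    .write.mode("overwrite")
    .parquet("/data/rasters_geotiff_parquet")
)
```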

The default settings in Spark are not well suited for storing large binaries such as raster geometries.
It is well worth the time to tune and benchmark your settings.
Writing large binaries with the default settings results in poorly structured Parquet files that are very expensive to read.
Some basic tuning can improve read performance by several orders of magnitude.

## Background

Parquet files are divided into one or several row groups.
Each column in a row group is stored in a column chunk.
Each column chunk is further divided into pages.
A page is conceptually an indivisible unit in terms of compression and encoding.
The default size for a page is 1 MB.
Data is buffered until the page is full and then written to disk.
The page size limit is checked at an interval of between `parquet.page.size.row.check.min` and `parquet.page.size.row.check.max` rows (100 and 10000 by default).

If you write 5 MB image files to Parquet with the default settings, the first page size check only happens after 100 rows.
You will end up with pages of roughly 500 MB instead of 1 MB.
Reading such a file requires a lot of memory and will be slow.

## Reading poorly structured Parquet files

Snappy-compressed files are especially sensitive to oversized pages.
No compression or zstd compression are more performant options.
You can set `spark.buffer.size` to a value larger than the default of 64k to improve read performance.
Be aware that increasing `spark.buffer.size` might add an I/O penalty for other columns in the Parquet file.
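A minimal read-side sketch; the 1 MB value for `spark.buffer.size` and the input path are only illustrative and should be benchmarked against your own workload:

```python
from pyspark.sql import SparkSession

# Sketch only: the buffer size below is an illustrative value, not a recommendation.
spark = (
    SparkSession.builder
    .appName("read-large-rasters")
    # Larger than the 64k default to reduce the cost of reading oversized pages.
    .config("spark.buffer.size", 1048576)
    .getOrCreate()
)

df = spark.read.parquet("/data/rasters_geotiff_parquet")
```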

## Writing better structured Parquet files for blobs

Ideally you want to write Parquet files with a sane page size to get better and more consistent read performance across different clients.
Since version 1.12.0 of parquet-hadoop, which is bundled with Spark 3.2 and later, you can set Hadoop properties that control the page size checks.
Better values for writing blobs are:

```
spark.sql.parquet.compression.codec=zstd
spark.hadoop.parquet.page.size.row.check.min=2
spark.hadoop.parquet.page.size.row.check.max=10
```
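
For example, these properties can be applied when building the session (or passed with `--conf` to `spark-submit`); this is only a sketch using the values shown above:

```python
from pyspark.sql import SparkSession

# Sketch only: applies the write-side settings shown above.
spark = (
    SparkSession.builder
    .appName("write-large-rasters")
    .config("spark.sql.parquet.compression.codec", "zstd")
    .config("spark.hadoop.parquet.page.size.row.check.min", 2)
    .config("spark.hadoop.parquet.page.size.row.check.max", 10)
    .getOrCreate()
)
```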

Zstd generally performs better than snappy, and even more so for large pages.
With these settings, the first page size check happens after 2 rows.
If the page is not full after 2 rows, the next check happens after another 2-10 rows, depending on the size of the two rows already written.

Spark sets Hadoop properties from Spark properties prefixed with `spark.hadoop.`.
For a full list of Parquet Hadoop properties, see the [parquet-hadoop README](https://github.com/apache/parquet-mr/blob/master/parquet-hadoop/README.md).
1 change: 1 addition & 0 deletions mkdocs.yml
@@ -40,6 +40,7 @@ nav:
- Performance tuning:
- Benchmark: tutorial/benchmark.md
- Tune RDD application: tutorial/Advanced-Tutorial-Tune-your-Application.md
- Storing large raster geometries in Parquet files: tutorial/storing-blobs-in-parquet.md
- Sedona with Apache Flink:
- Spatial SQL app: tutorial/flink/sql.md
- Examples: