[DOCS] Document tuning Spark for writing large raster geometries to Parquet #833

Merged 2 commits on May 17, 2023
4 changes: 4 additions & 0 deletions docs/tutorial/raster.md
@@ -17,3 +17,7 @@ Starting from `v1.1.0`, Sedona SQL supports raster data sources and raster operations
## Tutorials

[Python Jupyter Notebook](https://github.com/apache/sedona/blob/master/binder/ApacheSedonaRaster.ipynb)

## Performance

[Storing large raster geometries in Parquet files](../storing-blobs-in-parquet)
51 changes: 51 additions & 0 deletions docs/tutorial/storing-blobs-in-parquet.md
@@ -0,0 +1,51 @@
# Storing large raster geometries in Parquet files

!!!warning
    Always convert raster geometries to a well-known format with the RS_AsXXX functions before saving them.
    It is possible to save the raw bytes of the raster geometries, but they will be stored in an internal Sedona format that is not guaranteed to be stable across versions.
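A rough sketch of that conversion, assuming a SparkSession `spark` with Sedona's SQL functions registered and a Sedona version that provides `RS_FromGeoTiff` and `RS_AsGeoTiff` (substitute the RS_AsXXX function for your target format); the paths are placeholders:

```python
# Sketch only: `spark` is assumed to be a Sedona-enabled SparkSession, and the
# input path and RS_AsGeoTiff are placeholders for your data and target format.
rasters = (
    spark.read.format("binaryFile")
    .load("/data/rasters/*.tif")
    .selectExpr("path", "RS_FromGeoTiff(content) AS rast")
)

# Convert back to a well-known format before writing to Parquet.
(
    rasters.selectExpr("path", "RS_AsGeoTiff(rast) AS geotiff")
    .write.mode("overwrite")
    .parquet("/data/rasters_geotiff_parquet")
)
```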

The default settings in Spark are not well suited for storing large binaries such as raster geometries.
It is well worth the time to tune and benchmark your settings.
Writing large binaries with the default settings results in poorly structured Parquet files that are very expensive to read.
Some basic tuning can improve read performance by several orders of magnitude.

## Background

Parquet files are divided into one or several row groups.
Each column in a row group is stored in a column chunk.
Each column chunk is further divided into pages.
A page is conceptually an indivisible unit in terms of compression and encoding.
The default size for a page is 1 MB.
Data is buffered until the page is full and then written to disk.
The page size limit is checked at an interval of between `parquet.page.size.row.check.min` and `parquet.page.size.row.check.max` rows (100 and 10000 by default).

If you write 5 MB image files to Parquet with the default settings, the first page size check only happens after 100 rows.
You will end up with pages of roughly 500 MB instead of 1 MB.
Reading such a file requires a lot of memory and will be slow.

## Reading poorly structured Parquet files

Snappy-compressed files are especially sensitive to oversized pages.
No compression or zstd compression are more performant options.
You can set `spark.buffer.size` to a value larger than the default of 64k to improve read performance.
Be aware that increasing `spark.buffer.size` might add an I/O penalty for other columns in the Parquet file.
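A minimal read-side sketch; the 1 MB value for `spark.buffer.size` and the input path are only illustrative and should be benchmarked against your own workload:

```python
from pyspark.sql import SparkSession

# Sketch only: the buffer size below is an illustrative value, not a recommendation.
spark = (
    SparkSession.builder
    .appName("read-large-rasters")
    # Larger than the 64k default to reduce the cost of reading oversized pages.
    .config("spark.buffer.size", 1048576)
    .getOrCreate()
)

df = spark.read.parquet("/data/rasters_geotiff_parquet")
```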

## Writing better structured Parquet files for blobs

Ideally you want to write Parquet files with a sane page size to get better and more consistent read performance across different clients.
Since version 1.12.0 of parquet-hadoop, which is bundled with Spark 3.2 and later, you can set Hadoop properties that control the page size checks.
Better values for writing blobs are:

```
spark.sql.parquet.compression.codec=zstd
spark.hadoop.parquet.page.size.row.check.min=2
spark.hadoop.parquet.page.size.row.check.max=10
```
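
For example, these properties can be applied when building the session (or passed with `--conf` to `spark-submit`); this is only a sketch using the values shown above:

```python
from pyspark.sql import SparkSession

# Sketch only: applies the write-side settings shown above.
spark = (
    SparkSession.builder
    .appName("write-large-rasters")
    .config("spark.sql.parquet.compression.codec", "zstd")
    .config("spark.hadoop.parquet.page.size.row.check.min", 2)
    .config("spark.hadoop.parquet.page.size.row.check.max", 10)
    .getOrCreate()
)
```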

Zstd generally performs better than snappy, and even more so for large pages.
With these settings, the first page size check happens after 2 rows.
If the page is not full after 2 rows, the next check happens after another 2-10 rows, depending on the size of the two rows already written.

Spark sets Hadoop properties from Spark properties prefixed with `spark.hadoop.`.
For a full list of Parquet Hadoop properties, see the [parquet-hadoop README](https://github.com/apache/parquet-mr/blob/master/parquet-hadoop/README.md).
1 change: 1 addition & 0 deletions mkdocs.yml
@@ -40,6 +40,7 @@ nav:
- Performance tuning:
- Benchmark: tutorial/benchmark.md
- Tune RDD application: tutorial/Advanced-Tutorial-Tune-your-Application.md
- Storing large raster geometries in Parquet files: tutorial/storing-blobs-in-parquet.md
- Sedona with Apache Flink:
- Spatial SQL app: tutorial/flink/sql.md
- Examples: