
Writing to multiple GeoParquet files will not output _metadata #1296

Open
jwass opened this issue Mar 29, 2024 · 2 comments

Expected behavior

When writing a GeoParquet dataframe that results in multiple files, the _metadata summary file is not created even when the writer is configured to produce it.

from sedona.spark import *
sedona = SedonaContext.create(spark)

print("spark version: {}".format(spark.version))
print("sedona version: {}".format(sedona.version))
spark.conf.set("parquet.summary.metadata.level", "ALL")

def write_geoparquet(df, path):
    df.write.format("geoparquet") \
        .option("geoparquet.version", "1.0.0") \
        .option("geoparquet.crs", "") \
        .option("compression", "zstd") \
        .option("parquet.block.size", 16 * 1024 * 1024) \
        .option("maxRecordsPerFile", 10000000) \
        .mode("overwrite").save(path)

df = sedona.read.format("geoparquet").option("mergeSchema", "true").load(input_path)
write_geoparquet(df, output_path)

If the number of records exceeds maxRecordsPerFile so that more than one file is written, the _metadata and _common_metadata files are not written. When there are few enough records that only one file is written, _metadata and _common_metadata are created.

However, if I change the above to write plain parquet instead of geoparquet:

def write_parquet(df, path):
    df.write.format("parquet") \
        .option("compression", "zstd") \
        .option("parquet.block.size", 16 * 1024 * 1024) \
        .option("maxRecordsPerFile", 10000000) \
        .mode("overwrite").save(path)

write_parquet(df, output_path)

Then _metadata and _common_metadata are written even with multiple files. Is there a setting or another way to enable writing the summary metadata files?

I'd like these files to be written because readers such as pyarrow can then load full datasets without scanning every file's footer, which can be time-consuming for large datasets.
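
In the meantime, a _metadata file can be assembled after the fact with pyarrow. The sketch below is a hypothetical workaround, not Sedona functionality: output_path is a placeholder, it assumes all part files share one schema, and the resulting summary carries the first footer's key-value metadata (including its single-file geo entry), so the geo metadata inside it should not be trusted.

import glob
import os

import pyarrow.parquet as pq

output_path = "/path/to/output"  # placeholder dataset directory
part_files = sorted(glob.glob(os.path.join(output_path, "*.parquet")))

# Merge every part file's footer into one FileMetaData, recording the
# relative path each set of row groups came from so readers can locate
# the data files from the summary alone.
merged = None
for f in part_files:
    md = pq.read_metadata(f)
    md.set_file_path(os.path.relpath(f, output_path))
    if merged is None:
        merged = md
    else:
        merged.append_row_groups(md)

merged.write_metadata_file(os.path.join(output_path, "_metadata"))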

Settings

Sedona version = 3.4.1
Apache Spark version = 3.4.1

Environment = Databricks

Kontinuation (Member) commented

The geo metadata in the Parquet footers may not be the same for all written GeoParquet files, especially the bbox field. This makes the default Parquet footer metadata merging process fail with the following exception:

java.lang.RuntimeException: could not merge metadata: key geo has conflicting values: [{"version":"1.0.0","primary_column":"geom","columns":{"geom":{"encoding":"WKB","geometry_types":["Polygon"],"bbox":[1.0,1.0,9998.0,9998.0],"crs":null}}}, {"version":"1.0.0","primary_column":"geom","columns":{"geom":{"encoding":"WKB","geometry_types":["Polygon"],"bbox":[0.0,0.0,10000.0,10000.0],"crs":null}}}]
	at org.apache.parquet.hadoop.metadata.StrictKeyValueMetadataMergeStrategy.merge(StrictKeyValueMetadataMergeStrategy.java:36)
	at org.apache.parquet.hadoop.metadata.GlobalMetaData.merge(GlobalMetaData.java:106)
	at org.apache.parquet.hadoop.ParquetFileWriter.mergeFooters(ParquetFileWriter.java:1451)
	at org.apache.parquet.hadoop.ParquetFileWriter.mergeFooters(ParquetFileWriter.java:1422)
	at org.apache.parquet.hadoop.ParquetFileWriter.writeMetadataFile(ParquetFileWriter.java:1383)
	at org.apache.parquet.hadoop.ParquetOutputCommitter.writeMetaDataFile(ParquetOutputCommitter.java:84)
	at org.apache.parquet.hadoop.ParquetOutputCommitter.commitJob(ParquetOutputCommitter.java:50)
	at org.apache.spark.internal.io.HadoopMapReduceCommitProtocol.commitJob(HadoopMapReduceCommitProtocol.scala:192)

We would have to implement an output committer for GeoParquet to merge geo metadata properly. If your use case does not need to read the geo metadata from the _common_metadata or _metadata files, we can simply ignore geo metadata when generating them.
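
For illustration only, "merging geo metadata properly" would mean unioning each file's bbox and geometry_types per column. A rough Python sketch of that logic (merge_geo_metadata is a hypothetical helper, not Sedona or parquet-mr code; it assumes all footers agree on version, primary_column, encoding, and crs):

import json

def merge_geo_metadata(geo_values):
    # Merge the 'geo' footer JSON strings from several GeoParquet files
    # by unioning bounding boxes and geometry types per column.
    merged = json.loads(geo_values[0])
    for raw in geo_values[1:]:
        other = json.loads(raw)
        for name, col in other["columns"].items():
            m = merged["columns"][name]
            # Union of [xmin, ymin, xmax, ymax] bounding boxes.
            m["bbox"] = [
                min(m["bbox"][0], col["bbox"][0]),
                min(m["bbox"][1], col["bbox"][1]),
                max(m["bbox"][2], col["bbox"][2]),
                max(m["bbox"][3], col["bbox"][3]),
            ]
            m["geometry_types"] = sorted(
                set(m["geometry_types"]) | set(col["geometry_types"])
            )
    return json.dumps(merged)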

jwass (Author) commented Apr 3, 2024

@Kontinuation Thanks. I think it would be totally fine to leave off the geo metadata in the combined _metadata and/or _common_metadata files - as long as it is still present in the individual geoparquet files.

Since GeoParquet doesn't define these single _metadata summary files, I don't think it would be an issue at all. The spec may standardize a definition in the future, but for now the file will only be used for row-group filtering, where the geo metadata is not needed.
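
To make the row-group-filtering use case concrete, here is a sketch of how a reader could consume the summary file with pyarrow's dataset API (the path and the id column are placeholders; it assumes a valid _metadata file exists):

import pyarrow.dataset as ds

# Build a dataset from the _metadata summary file alone; the row-group
# statistics it contains let pyarrow plan the scan without opening
# every part file's footer first.
dataset = ds.parquet_dataset("/path/to/output/_metadata")

# Predicate pushdown: only row groups whose statistics can match the
# filter are actually read.
table = dataset.to_table(filter=ds.field("id") > 1000)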
