[SEDONA-429][SEDONA-430] Support specifying GeoParquet spec version number and CRS #1162

Kontinuation · 2023-12-27T04:08:19Z

Note: This PR depends on #1161

Did you read the Contributor Guide?

Yes, I have read Contributor Rules and Contributor Development Guide

Is this PR related to a JIRA ticket?

Yes, the associated JIRA tickets are https://issues.apache.org/jira/browse/SEDONA-429 and https://issues.apache.org/jira/browse/SEDONA-430. The PR name follows the format [SEDONA-XXX] my subject.

What changes were proposed in this PR?

Bumped the default GeoParquet version number from 1.0.0-beta.1 to 1.0.0
Allowed specifying the GeoParquet version number using the geoparquet.version option
Allowed specifying CRS metadata for geometry columns using the geoparquet.crs option
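
For illustration, here is a minimal Python sketch of how these two options might shape the `geo` metadata stored in the Parquet footer. It is a model built from the commit titles in this PR (the default `crs` is null, and an empty string omits the field), not Sedona's actual writer code; `build_geo_metadata` and the `_UNSET` sentinel are hypothetical names introduced only for this example.

```python
import json
from typing import Any, Dict

# Hypothetical sentinel meaning "the geoparquet.crs option was not passed at all".
_UNSET = object()


def build_geo_metadata(geometry_column: str = "geometry",
                       version: str = "1.0.0",
                       crs: Any = _UNSET) -> Dict[str, Any]:
    """Model the GeoParquet `geo` metadata for a single geometry column.

    Assumed behavior, taken from the commit titles in this PR:
      * option not set   -> "crs": null  (CRS is unknown)
      * option set to "" -> "crs" key omitted (the GeoParquet spec then
                            treats the column as OGC:CRS84)
      * any other value  -> parsed as a PROJJSON string
    """
    column: Dict[str, Any] = {"encoding": "WKB", "geometry_types": []}
    if crs is _UNSET:
        column["crs"] = None
    elif crs == "":
        pass  # leave the "crs" key out entirely
    else:
        column["crs"] = json.loads(crs)
    return {
        "version": version,
        "primary_column": geometry_column,
        "columns": {geometry_column: column},
    }


if __name__ == "__main__":
    # Default: version 1.0.0 and an explicit null CRS.
    print(json.dumps(build_geo_metadata(), indent=2))
    # Empty string: the crs field is dropped from the column metadata.
    print(json.dumps(build_geo_metadata(crs=""), indent=2))
```

In Sedona itself the values would be supplied as writer options named `geoparquet.version` and `geoparquet.crs` rather than as function arguments.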

How was this patch tested?

Added new tests for GeoParquet metadata.

Did this PR include necessary documentation updates?

Yes, I have updated the documentation.

jiayuasu · 2023-12-27T06:51:36Z

...n/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/GeoParquetMetaData.scala

@@ -14,12 +14,7 @@
 package org.apache.spark.sql.execution.datasources.parquet



Can we add a new data source that reads only the metadata of a Parquet file? This is crucial for entry-level users exploring an unknown Parquet file, including GeoParquet files. In our GeoParquet case, it will help users see the projjson value, since we are not able to properly parse it into a known EPSG code.

I understand that the only metadata a Spark DataFrame can carry is its schema, which cannot hold such information.

So I suggest that we add a new data source, namely geoparquet.metadata, which loads this metadata using ParquetFileReader. One good example is from DuckDB: https://duckdb.org/docs/data/parquet/metadata.html

This can be addressed in a separate PR.
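
As a rough illustration of what such a metadata reader could surface, the Python sketch below summarizes the `geo` entry of a Parquet footer's key-value metadata. It is not the proposed geoparquet.metadata data source: `describe_geo_metadata` is a hypothetical helper, and it assumes the footer key-value pairs have already been loaded into a `dict` of bytes by some Parquet reader.

```python
import json
from typing import Dict, Optional


def describe_geo_metadata(key_value_metadata: Dict[bytes, bytes]) -> Optional[dict]:
    """Summarize the GeoParquet `geo` footer entry, or return None if absent.

    `key_value_metadata` is assumed to be the footer's key-value pairs as
    bytes -> bytes (for example, what pyarrow exposes as
    `pyarrow.parquet.read_metadata(path).metadata`).
    """
    raw = key_value_metadata.get(b"geo")
    if raw is None:
        return None  # plain Parquet file without GeoParquet metadata
    geo = json.loads(raw.decode("utf-8"))
    columns = {}
    for name, col in geo.get("columns", {}).items():
        columns[name] = {
            "encoding": col.get("encoding"),
            # Distinguish an omitted crs from an explicit null.
            "crs_present": "crs" in col,
            "crs": col.get("crs"),
        }
    return {
        "version": geo.get("version"),
        "primary_column": geo.get("primary_column"),
        "columns": columns,
    }


if __name__ == "__main__":
    # A hand-built footer standing in for a real file's metadata.
    footer = {b"geo": json.dumps({
        "version": "1.0.0",
        "primary_column": "geometry",
        "columns": {"geometry": {"encoding": "WKB", "geometry_types": [], "crs": None}},
    }).encode("utf-8")}
    print(describe_geo_metadata(footer))
```

Reporting `crs_present` separately keeps the raw projjson (or its absence) visible to the user without trying to map it to an EPSG code.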

Created a JIRA ticket for this: https://issues.apache.org/jira/browse/SEDONA-455

Let's address this in a separate PR.

Kontinuation changed the title ~~[SEDONA-429][SEDONA-430] Support specifying GeoParquet spec version number and CRS~~ [SEDONA-429,SEDONA-430] Support specifying GeoParquet spec version number and CRS Dec 27, 2023

Kontinuation changed the title ~~[SEDONA-429,SEDONA-430] Support specifying GeoParquet spec version number and CRS~~ [SEDONA-429][SEDONA-430] Support specifying GeoParquet spec version number and CRS Dec 27, 2023

Kontinuation force-pushed the geoparquet-fixes branch from 1717a76 to 4f856f5 December 27, 2023 04:40

Kontinuation marked this pull request as ready for review December 27, 2023 05:16

jiayuasu requested changes Dec 27, 2023

jiayuasu added attention needed affect public APIs labels Dec 27, 2023

jiayuasu added this to the sedona-1.5.1 milestone Dec 27, 2023

jiayuasu added improvement sedona-spark labels Dec 27, 2023

jiayuasu requested changes Dec 27, 2023

...c/main/scala/org/apache/spark/sql/execution/datasources/parquet/GeoParquetWriteSupport.scala (outdated)

Kontinuation marked this pull request as draft December 29, 2023 01:03

Kontinuation added 7 commits December 29, 2023 17:29

Support geoparquet.version and geoparquet.crs option for Spark 3.0~3.3

866cde0

Add tests for geoparquet.version and geoparquet.crs options for Spark 3.0~3.3

431ff4c

Add documentation for geoparquet.version and geoparquet.crs options

62e29d9

Apply this patch on Spark 3.4 and Spark 3.5

4a9c101

Remove Configuration.getPropsWithPrefix to be compatible with lower versions of Hadoop

fc28df8

Add notes about crs metadata in GeoParquet files

cff270d

Allow omitting CRS by setting geoparquet.crs to "" (empty string)

921d79b

Kontinuation force-pushed the geoparquet-fixes branch from 8d108a1 to 921d79b December 29, 2023 09:42

Kontinuation added 2 commits December 29, 2023 19:30

Set default crs metadata to null

083b088

Apply to Spark 3.4 and Spark 3.5

ea54f36

Kontinuation marked this pull request as ready for review December 29, 2023 14:03

jiayuasu requested changes Dec 29, 2023

docs/tutorial/sql.md

Explain the behavior of geoparquet.crs option

9bf5522

jiayuasu approved these changes Jan 2, 2024

jiayuasu merged commit 4140594 into apache:master Jan 2, 2024
47 checks passed
