
[SEDONA-156] Support spatial filter push-down for GeoParquetFileFormat #744

Merged · 3 commits into apache:master on Jan 7, 2023

Conversation

Kontinuation
Member

Did you read the Contributor Guide?

Is this PR related to a JIRA ticket?

What changes were proposed in this PR?

Add spatial filter push-down to GeoParquetFileFormat. Spatial filters running on GeoParquet datasets will skip unrelated files. The file pruning process uses the per-column bbox property in GeoParquet metadata to determine if the file should be skipped or not.
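
For illustration, here is a minimal sketch of a query that benefits from this push-down (the dataset path and the query window are placeholders, and `spark` is assumed to be a SparkSession with Sedona registered):

```scala
// Read a GeoParquet dataset; Sedona's "geoparquet" data source keeps the
// per-column bbox metadata available for file pruning.
val df = spark.read.format("geoparquet").load("/path/to/geoparquet/dataset")

// The spatial predicate is pushed down: files whose bbox metadata does not
// intersect the query window are skipped without being read.
val result = df.where(
  "ST_Intersects(geometry, ST_GeomFromWKT('POLYGON ((0 0, 10 0, 10 10, 0 10, 0 0))'))")
```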

How was this patch tested?

Added new unit tests for this patch.

Did this PR include necessary documentation updates?

  • No. This PR does not affect any public API, so no documentation changes are required. We could still mention that filter push-down is supported when processing GeoParquet datasets, even though it works transparently.

@Kontinuation marked this pull request as ready for review January 5, 2023 03:46
@jiayuasu
Member

jiayuasu commented Jan 5, 2023

@Kontinuation This looks good to me! Thank you! Can you please add some documentation?

  1. Update the GeoParquet read/write documentation in [1] and [2], since users no longer need to specify the geometry field thanks to your recent PRs.
  2. Add an explanation to the optimizer page [3] describing how filter push-down works for GeoParquet.

You could do the doc update in a separate PR if needed.

[1] https://sedona.apache.org/1.3.1-incubating/tutorial/sql/#load-geoparquet
[2] https://sedona.apache.org/1.3.1-incubating/tutorial/sql/#save-geoparquet
[3] https://sedona.apache.org/1.3.1-incubating/api/sql/Optimizer/#sedonasql-query-optimizer

@Kontinuation
Member Author

[1] and [2] were already handled in PR 740.

I'll add a section to describe predicate push-down for GeoParquet to [3] in this PR.

@neontty

neontty commented Jan 5, 2023

@Kontinuation your code is so clean! Thank you for putting in this work.

@jiayuasu merged commit 112596a into apache:master on Jan 7, 2023
@jiayuasu
Member

jiayuasu commented Jan 9, 2023

@Kontinuation Quick follow-up question: since Sedona already offers ST_GeoHash (https://sedona.apache.org/1.3.1-incubating/api/flink/Function/?h=geohash#st_geohash), I believe one could generate a GeoHash for each geometry, sort all geometries by GeoHash, and then save the DataFrame as GeoParquet. This way, the pruning power of GeoParquet would be much stronger, given the smaller bboxes in each Parquet partition?

@Kontinuation
Member Author

Kontinuation commented Jan 10, 2023

> Quick follow-up question: since Sedona already offers ST_GeoHash (https://sedona.apache.org/1.3.1-incubating/api/flink/Function/?h=geohash#st_geohash), I believe one could generate a GeoHash for each geometry, sort all geometries by GeoHash, and then save the DataFrame as GeoParquet. This way, the pruning power of GeoParquet would be much stronger, given the smaller bboxes in each Parquet partition?

Yes. The data has to be clustered by spatial proximity to maximize the efficiency of spatial predicate push-down. Sorting the data using GeoHash or other space-filling curves is a good approach to reorganizing the dataset for query efficiency.
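
A minimal sketch of this reorganization in Scala, assuming a DataFrame `df` with a `geometry` column (the precision 12 and the output path are arbitrary choices):

```scala
import org.apache.spark.sql.functions.expr

// Compute a high-precision GeoHash per row, sort so that spatially close
// geometries land in the same output files, then drop the helper column
// and write the result as GeoParquet.
val sorted = df
  .withColumn("geohash", expr("ST_GeoHash(geometry, 12)"))
  .sort("geohash")
  .drop("geohash")

sorted.write.format("geoparquet").save("/path/to/sorted/output")
```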

The following figures show the results of sorting geometries by high-precision GeoHash values, writing the data as GeoParquet files, and plotting the bboxes of the resulting files. Though the partitionings are not perfect, they still provide an opportunity to prune some files depending on the query window.

[Figures: bounding boxes of the resulting GeoParquet files for AREALM US (50 files and 200 files) and MS Buildings NYC (50 files and 200 files)]

The Spatial Parquet paper also experimented with the effect of sorting the data using the Z-order curve and the Hilbert curve. Their approach works for file-level pruning as well, though the paper primarily deals with the row-group/page-level data skipping enabled by Spatial Parquet.
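
For reference, a Z-order (Morton) sort key can be computed with standard bit interleaving; the sketch below assumes coordinates already normalized to [0, 1) and is not taken from the paper:

```scala
// Spread the low 32 bits of v so that one zero bit separates each pair of
// adjacent input bits (the "binary magic numbers" technique).
def interleave(v: Long): Long = {
  var x = v & 0xffffffffL
  x = (x | (x << 16)) & 0x0000ffff0000ffffL
  x = (x | (x << 8))  & 0x00ff00ff00ff00ffL
  x = (x | (x << 4))  & 0x0f0f0f0f0f0f0f0fL
  x = (x | (x << 2))  & 0x3333333333333333L
  x = (x | (x << 1))  & 0x5555555555555555L
  x
}

// Interleave 31-bit quantized x and y into a single 62-bit Z-order key;
// sorting rows by this key clusters spatially close points together.
def mortonKey(x: Double, y: Double): Long = {
  val xi = (x * ((1L << 31) - 1)).toLong
  val yi = (y * ((1L << 31) - 1)).toLong
  interleave(xi) | (interleave(yi) << 1)
}
```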

Another approach is to partition the dataset by low-precision GeoHash values, which produces non-overlapping bounding boxes for point datasets. I took this approach when creating the examples in the documentation. Carto's Analytics Toolbox for Databricks supports partitioning datasets by the Z2 index of tile coordinates, which is a similar approach. A hedged sketch of this variant follows.
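
In this sketch, the precision 4, column name, and path are assumptions; Spark's partitionBy writes one directory per GeoHash cell:

```scala
import org.apache.spark.sql.functions.expr

// Partition by a low-precision GeoHash so that each output partition covers
// a small, non-overlapping cell; this works especially well for point data.
df.withColumn("geohash4", expr("ST_GeoHash(geometry, 4)"))
  .write
  .partitionBy("geohash4")
  .format("geoparquet")
  .save("/path/to/partitioned/output")
```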

@Kontinuation deleted the geoparquet-predicate-pushdown branch August 23, 2023 15:02