
[SEDONA-156] Support spatial filter push-down for GeoParquetFileFormat #744

Merged · 3 commits into apache:master on Jan 7, 2023

Conversation

Kontinuation
Member

Did you read the Contributor Guide?

Is this PR related to a JIRA ticket?

What changes were proposed in this PR?

Add spatial filter push-down to GeoParquetFileFormat. Spatial filters running on GeoParquet datasets will skip unrelated files. The file pruning process uses the per-column bbox property in GeoParquet metadata to determine if the file should be skipped or not.
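
For illustration, here is a minimal sketch of a query that benefits from this push-down (the dataset path and the query window are placeholders, and `spark` is assumed to be a SparkSession with Sedona registered):

```scala
// Read a GeoParquet dataset; Sedona's "geoparquet" data source keeps the
// per-column bbox metadata available for file pruning.
val df = spark.read.format("geoparquet").load("/path/to/geoparquet/dataset")

// The spatial predicate is pushed down: files whose bbox metadata does not
// intersect the query window are skipped without being read.
val result = df.where(
  "ST_Intersects(geometry, ST_GeomFromWKT('POLYGON ((0 0, 10 0, 10 10, 0 10, 0 0))'))")
```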

How was this patch tested?

Added new unit tests for this patch.

Did this PR include necessary documentation updates?

  • No. This PR does not affect any public API, so no documentation changes are required. We could still mention that filter push-down is supported when processing GeoParquet datasets, even though it works transparently.

@Kontinuation marked this pull request as ready for review January 5, 2023 03:46
@jiayuasu
Member

jiayuasu commented Jan 5, 2023

@Kontinuation This looks good to me! Thank you! Can you please add some documentation?

  1. Update the GeoParquet read/write documentation in [1] and [2], since users no longer need to specify the geometry field thanks to your recent PRs.
  2. Add an explanation to the optimizer page [3] describing how filter push-down works for GeoParquet.

You could do the doc update in a separate PR if needed.

[1] https://sedona.apache.org/1.3.1-incubating/tutorial/sql/#load-geoparquet
[2] https://sedona.apache.org/1.3.1-incubating/tutorial/sql/#save-geoparquet
[3] https://sedona.apache.org/1.3.1-incubating/api/sql/Optimizer/#sedonasql-query-optimizer

@Kontinuation
Member Author

[1] and [2] were already handled in PR 740.

I'll add a section to describe predicate push-down for GeoParquet to [3] in this PR.

@neontty

neontty commented Jan 5, 2023

@Kontinuation your code is so clean! Thank you for putting in this work.

@jiayuasu merged commit 112596a into apache:master on Jan 7, 2023
@jiayuasu
Member

jiayuasu commented Jan 9, 2023

@Kontinuation Quick follow-up question: since Sedona already offers ST_GeoHash (https://sedona.apache.org/1.3.1-incubating/api/flink/Function/?h=geohash#st_geohash), I believe one could generate a GeoHash for each geometry, sort all geometries by GeoHash, and then save the DataFrame as GeoParquet. This way, the pruning power of GeoParquet would be much stronger, given the smaller bboxes in each Parquet partition?

@Kontinuation
Member Author

Kontinuation commented Jan 10, 2023

> Quick follow-up question: since Sedona already offers ST_GeoHash (https://sedona.apache.org/1.3.1-incubating/api/flink/Function/?h=geohash#st_geohash), I believe one could generate a GeoHash for each geometry, sort all geometries by GeoHash, and then save the DataFrame as GeoParquet. This way, the pruning power of GeoParquet would be much stronger, given the smaller bboxes in each Parquet partition?

Yes. The data has to be clustered by spatial proximity to maximize the efficiency of spatial predicate push-down. Sorting the data using GeoHash or other space-filling curves is a good approach to reorganizing the dataset for query efficiency.
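
A minimal sketch of this reorganization in Scala, assuming a DataFrame `df` with a `geometry` column (the precision 12 and the output path are arbitrary choices):

```scala
import org.apache.spark.sql.functions.expr

// Compute a high-precision GeoHash per row, sort so that spatially close
// geometries land in the same output files, then drop the helper column
// and write the result as GeoParquet.
val sorted = df
  .withColumn("geohash", expr("ST_GeoHash(geometry, 12)"))
  .sort("geohash")
  .drop("geohash")

sorted.write.format("geoparquet").save("/path/to/sorted/output")
```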

The following figures show the results of sorting geometries by high-precision GeoHash values, writing the data as GeoParquet files, and plotting the bboxes of the resulting files. Though the partitionings are not perfect, they still provide an opportunity to prune some files depending on the query window.

[Figures: bounding boxes of the resulting GeoParquet files for AREALM US (50 files and 200 files) and MS Buildings NYC (50 files and 200 files)]

The Spatial Parquet paper also experimented with the effect of sorting the data using the Z-order curve and the Hilbert curve. Their approach works for file-level pruning as well, though the paper primarily deals with the row-group/page-level data skipping enabled by Spatial Parquet.
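
For reference, a Z-order (Morton) sort key can be computed with standard bit interleaving; the sketch below assumes coordinates already normalized to [0, 1) and is not taken from the paper:

```scala
// Spread the low 32 bits of v so that one zero bit separates each pair of
// adjacent input bits (the "binary magic numbers" technique).
def interleave(v: Long): Long = {
  var x = v & 0xffffffffL
  x = (x | (x << 16)) & 0x0000ffff0000ffffL
  x = (x | (x << 8))  & 0x00ff00ff00ff00ffL
  x = (x | (x << 4))  & 0x0f0f0f0f0f0f0f0fL
  x = (x | (x << 2))  & 0x3333333333333333L
  x = (x | (x << 1))  & 0x5555555555555555L
  x
}

// Interleave 31-bit quantized x and y into a single 62-bit Z-order key;
// sorting rows by this key clusters spatially close points together.
def mortonKey(x: Double, y: Double): Long = {
  val xi = (x * ((1L << 31) - 1)).toLong
  val yi = (y * ((1L << 31) - 1)).toLong
  interleave(xi) | (interleave(yi) << 1)
}
```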

Another approach is to partition the dataset by low-precision GeoHash values, which produces non-overlapping bounding boxes for point datasets. I took this approach when creating the examples in the documentation. Carto's Analytics Toolbox for Databricks supports partitioning datasets by the Z2 index of tile coordinates, which is a similar approach. A hedged sketch of this variant follows.
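
In this sketch, the precision 4, column name, and path are assumptions; Spark's partitionBy writes one directory per GeoHash cell:

```scala
import org.apache.spark.sql.functions.expr

// Partition by a low-precision GeoHash so that each output partition covers
// a small, non-overlapping cell; this works especially well for point data.
df.withColumn("geohash4", expr("ST_GeoHash(geometry, 4)"))
  .write
  .partitionBy("geohash4")
  .format("geoparquet")
  .save("/path/to/partitioned/output")
```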

@Kontinuation deleted the geoparquet-predicate-pushdown branch August 23, 2023 15:02