Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[SEDONA-207] Implemented a new geometry serde for supporting Z/M dimensions and better performance #739

Merged
merged 6 commits into from
Dec 29, 2022

Conversation

Kontinuation
Copy link
Member

@Kontinuation Kontinuation commented Dec 28, 2022

Did you read the Contributor Guide?

Is this PR related to a JIRA ticket?

What changes were proposed in this PR?

New Geometry Serde

The new geometry serde was implemented in common/src/main/java/org/apache/sedona/common/geometrySerde/. The ShapeSerde used by Kryo serializer and the WKB based serde used by GeometryUDT were replaced by this new serde. Please refer to SEDONA-207 for a detailed explanation of this new geometry serde.

GeoParquet

GeoParquet stores geometry objects as WKB binary values, which happens to be the old serialization format of GeometryUDT thus no special treatment was needed. This PR changed the serialization format of GeometryUDT, so geometry values in GeoParquet files need to be explicitly parsed and serialized.

GeometryUDT in Python

We've implemented the new serialization format in pure python, it is 2~3x slower than shapely.wkb.loads/dumps, which would impact the performance of collect, toPandas and Python UDFs in pyspark. We'll explore ways to implement it as a CPython extension to achieve good performance.

Python version of the geometry serde does not support M dimension and SRID due to the restrictions of shapely. We'll support M dimension once shapely/shapely#1648 was resolved.

How was this patch tested?

Unit tests were added to test this patch. This patch was also manually tested on a Spark standalone cluster.

The geometry serde code for Python was manually tested with shapely 2.0. We need to update the python unit tests to be compatible with shapely 2.0 in the future.

Did this PR include necessary documentation updates?

  • No, this PR does not affect any public API so no need to change the docs.

@Kontinuation
Copy link
Member Author

Example project build failed since the snapshot artifact for Sedona 1.3.1-incubating was gone. Do we need to switch the dependency to a released version?

@jiayuasu
Copy link
Member

@Kontinuation I just switched all versions to 1.3.1-incubating. If you pull the latest change, you should be fine.

A quick question: does this solve SEDONA-222?

@Kontinuation
Copy link
Member Author

Kontinuation commented Dec 28, 2022

@Kontinuation I just switched all versions to 1.3.1-incubating. If you pull the latest change, you should be fine.

A quick question: does this solve SEDONA-222?

It does not resolve SEDONA-222. I'm working on another branch to resolve issues related to GeoParquet, and this issue will be resolved in an upcoming PR.


package org.apache.sedona.common.geometrySerde;

import org.apache.spark.unsafe.Platform;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Noticed my build started failing for #735 because I guess it uses a merged codebase to run off of. The common module is specifically not supposed to have any reference to Spark because:

  • It's not versioned by the Scala release
  • It's used by Flink as well so this won't even be available

This will need to be removed, either by porting any of the necessary code here directly, or moving this to the core module and making at least the UnsafeGeometryBuffer (or just the whole serializer) a Spark only thing for the time being

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

#741 was submitted to resolve this issue, where spark.unsafe.Platform was replaced by direct references to sun.misc.Unsafe.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants