-
Notifications
You must be signed in to change notification settings - Fork 654
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[SEDONA-207] Implemented a new geometry serde for supporting Z/M dimensions and better performance #739
Conversation
Example project build failed since the snapshot artifact for Sedona 1.3.1-incubating was gone. Do we need to switch the dependency to a released version? |
@Kontinuation I just switched all versions to 1.3.1-incubating. If you pull the latest change, you should be fine. A quick question: does this solve SEDONA-222? |
…ored unsafe geometry buffer to use spark.unsafe.Platform
e4d9945
to
c284c5c
Compare
It does not resolve SEDONA-222. I'm working on another branch to resolve issues related to GeoParquet, and this issue will be resolved in an upcoming PR. |
|
||
package org.apache.sedona.common.geometrySerde; | ||
|
||
import org.apache.spark.unsafe.Platform; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Noticed my build started failing for #735 because I guess it uses a merged codebase to run off of. The common module is specifically not supposed to have any reference to Spark because:
- It's not versioned by the Scala release
- It's used by Flink as well so this won't even be available
This will need to be removed, either by porting any of the necessary code here directly, or moving this to the core module and making at least the UnsafeGeometryBuffer (or just the whole serializer) a Spark only thing for the time being
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
#741 was submitted to resolve this issue, where spark.unsafe.Platform
was replaced by direct references to sun.misc.Unsafe
.
Did you read the Contributor Guide?
Is this PR related to a JIRA ticket?
[SEDONA-XXX] my subject
.What changes were proposed in this PR?
New Geometry Serde
The new geometry serde was implemented in
common/src/main/java/org/apache/sedona/common/geometrySerde/
. The ShapeSerde used by Kryo serializer and the WKB based serde used byGeometryUDT
were replaced by this new serde. Please refer to SEDONA-207 for a detailed explanation of this new geometry serde.GeoParquet
GeoParquet stores geometry objects as WKB binary values, which happens to be the old serialization format of
GeometryUDT
thus no special treatment was needed. This PR changed the serialization format ofGeometryUDT
, so geometry values in GeoParquet files need to be explicitly parsed and serialized.GeometryUDT in Python
We've implemented the new serialization format in pure python, it is 2~3x slower than
shapely.wkb.loads/dumps
, which would impact the performance ofcollect
,toPandas
and Python UDFs in pyspark. We'll explore ways to implement it as a CPython extension to achieve good performance.Python version of the geometry serde does not support M dimension and SRID due to the restrictions of shapely. We'll support M dimension once shapely/shapely#1648 was resolved.
How was this patch tested?
Unit tests were added to test this patch. This patch was also manually tested on a Spark standalone cluster.
The geometry serde code for Python was manually tested with shapely 2.0. We need to update the python unit tests to be compatible with shapely 2.0 in the future.
Did this PR include necessary documentation updates?