Save / Load indexed spatial & partitioned Rdd #1213

Open
vbmacher opened this issue Jan 22, 2024 · 2 comments
vbmacher commented Jan 22, 2024

Expected behavior

Maybe this is already possible, but I haven't found it anywhere. I'm relatively new to Sedona and geo-processing.
I'd like to be able to save, and later load, a spatial RDD that has already been analyzed and partitioned, and possibly indexed. Our use case is a dataset shared by many jobs (they all use the same spatial data), and it's time-consuming to recreate the partitioning and rebuild the index every time.
I'm not sure whether this is possible, though.

For example:

// save once:
val spatialRdd = Adapter.toSpatialRdd(df, ...)
spatialRdd.analyze()
// The partition count is capped at half the record count; related error:
// IllegalArgumentException: [Sedona] Number of partitions cannot be larger than half of total records num
spatialRdd.spatialPartitioning(GridType.KDBTREE, math.min(Integer.MAX_VALUE, df.count() / 2).toInt)
spatialRdd.buildIndex(IndexType.RTREE, true)
SomeSedonaUtility.saveSpatialRdd(spatialRdd, path) // <-- save with index and partitioned

// load & use multiple times:
val rdd = SomeSedonaUtility.loadSpatialRdd(path)

// and usage:
val otherRdd = Adapter.toSpatialRdd(otherDs, ...)
otherRdd.spatialPartitioning(rdd.getPartitioner)

val useIndex = true
val spatialPredicate = SpatialPredicate.COVERS
val params = new JoinQuery.JoinParams(useIndex, spatialPredicate, IndexType.RTREE, JoinBuildSide.LEFT)

val joined = JoinQuery.spatialJoin(rdd, otherRdd, params)

Actual behavior

To my knowledge, the partitioning and the index must be built at runtime in every job.

Steps to reproduce the problem

The feature is missing, so it's not possible to reproduce it.

Settings

Sedona version = 1.5.1

Apache Spark version = 3.5

API type = Scala

Scala version = 2.12

JRE version = 1.8

Environment = EMR

jiayuasu (Member) commented

@vbmacher Unfortunately, a spatial partitioned RDD cannot be saved and loaded back because it will lead to wrong results. See the explanation here: https://sedona.apache.org/1.5.1/tutorial/rdd/#save-an-spatialrdd-spatialpartitioned-wo-indexed

vbmacher (Author) commented

Thanks @jiayuasu. I also read there that it is possible to save an indexed RDD (https://sedona.apache.org/1.5.1/tutorial/rdd/#save-an-spatialrdd-indexed), but to my knowledge building an index requires spatial partitioning. So when I save the indexed RDD and then reload it, the partitioning won't be set up, but the index will still work?

I'd also like to know more details about this statement from the docs, if possible:

"We are working on some solutions. Stay tuned!"

Is this something we can expect in the next release? Thanks!
