[SEDONA-453] Fix Quadtree index efficiency degrade when indexing points #1158

Closed

Kontinuation
Member

@Kontinuation Kontinuation commented Dec 21, 2023

Did you read the Contributor Guide?

Is this PR related to a JIRA ticket?

What changes were proposed in this PR?

We found that the index efficiency of Quadtree degrades drastically when indexing datasets made up of points: querying the Quadtree with envelopes returns far more candidates than expected. The reason is that JTS Quadtree automatically expands indexed envelopes by 0.5 if the envelope has zero width and height (see Quadtree.java#L61-L96). This makes the indexed envelopes of points far larger than necessary, especially when the indexed points are WGS84 coordinates, where an expansion of 0.5 degrees covers a large area.
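To make the effect concrete, here is a minimal, self-contained sketch. It does not call JTS: `PaddedEnvelopeSketch`, `pad`, `intersects`, `grid` and `countCandidates` are hypothetical names, envelopes are plain `double[]` arrays, and the sketch assumes the 0.5 expansion is applied on each side of a zero-extent dimension.

```java
import java.util.ArrayList;
import java.util.List;

class PaddedEnvelopeSketch {
    // Envelopes are {minX, maxX, minY, maxY}; a stand-in for a real envelope class.

    // Pads every zero-extent dimension by halfPad on each side (assumed behaviour).
    static double[] pad(double[] env, double halfPad) {
        double minX = env[0], maxX = env[1], minY = env[2], maxY = env[3];
        if (minX == maxX) { minX -= halfPad; maxX += halfPad; }
        if (minY == maxY) { minY -= halfPad; maxY += halfPad; }
        return new double[] {minX, maxX, minY, maxY};
    }

    static boolean intersects(double[] a, double[] b) {
        return a[0] <= b[1] && b[0] <= a[1] && a[2] <= b[3] && b[2] <= a[3];
    }

    // Points on an n x n grid over [0, 1] x [0, 1], each stored as a degenerate envelope.
    static List<double[]> grid(int n) {
        List<double[]> points = new ArrayList<>();
        for (int i = 0; i < n; i++) {
            for (int j = 0; j < n; j++) {
                double x = i / (double) (n - 1);
                double y = j / (double) (n - 1);
                points.add(new double[] {x, x, y, y});
            }
        }
        return points;
    }

    // Counts the points whose (padded) envelope intersects the query envelope.
    static int countCandidates(List<double[]> points, double[] query, double halfPad) {
        int count = 0;
        for (double[] p : points) {
            if (intersects(pad(p, halfPad), query)) count++;
        }
        return count;
    }

    public static void main(String[] args) {
        List<double[]> points = grid(11);           // 121 points, 0.1 apart
        double[] query = {0.45, 0.55, 0.45, 0.55};  // covers only the point (0.5, 0.5)
        System.out.println("exact matches: " + countCandidates(points, query, 0.0));
        System.out.println("candidates with 0.5 padding: " + countCandidates(points, query, 0.5));
    }
}
```

In this toy dataset the query covers a single point, yet every one of the 121 padded envelopes intersects it, so an index built on the padded envelopes cannot prune anything.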

Suppose that we are indexing the following dataset using Quadtree:

Screenshot 2023-12-21 at 9 10 58 PM

The envelopes indexed by Quadtree happen to be something like this:

Screenshot 2023-12-21 at 9 12 16 PM

This PR works around the problem by manually extending envelopes with zero width or height by 1e-3. This prevents JTS Quadtree from extending the envelopes by 0.5, and 1e-3 is small enough to cope with the most common use cases. However, it will significantly increase the size of the Quadtree, since the tree will be deeper.
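A rough sketch of the idea follows. It is not the code in this PR: `PreExpandSketch`, `expandIfDegenerate` and `hasZeroExtent` are hypothetical names, envelopes are plain `double[]` arrays rather than JTS objects, and the sketch assumes the 1e-3 is applied on each side of a zero-extent dimension, which the description above does not specify.

```java
class PreExpandSketch {
    // Envelopes are {minX, maxX, minY, maxY}; a stand-in for a real envelope class.

    // Widens any zero-extent dimension by eps on each side before the envelope is indexed.
    static double[] expandIfDegenerate(double[] env, double eps) {
        double minX = env[0], maxX = env[1], minY = env[2], maxY = env[3];
        if (minX == maxX) { minX -= eps; maxX += eps; }
        if (minY == maxY) { minY -= eps; maxY += eps; }
        return new double[] {minX, maxX, minY, maxY};
    }

    // True when the index would still see a zero width or height and apply its own padding.
    static boolean hasZeroExtent(double[] env) {
        return env[0] == env[1] || env[2] == env[3];
    }

    public static void main(String[] args) {
        double[] point = {-122.33, -122.33, 47.61, 47.61}; // a point as a degenerate envelope
        double[] expanded = expandIfDegenerate(point, 1e-3);
        System.out.println("zero extent before: " + hasZeroExtent(point));
        System.out.println("zero extent after:  " + hasZeroExtent(expanded));
        System.out.println("width after:  " + (expanded[1] - expanded[0]));
        System.out.println("height after: " + (expanded[3] - expanded[2]));
    }
}
```

Because the expanded envelope no longer has a zero width or height, the index has no reason to pad it further, and the envelope stays roughly 0.002 wide instead of 1.0. Envelopes that already have a non-zero extent in both dimensions pass through unchanged.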

How was this patch tested?

Added a test to verify Quadtree index efficiency for PointRDD.

Did this PR include necessary documentation updates?

  • No, this PR does not affect any public API so no need to change the docs.

@Kontinuation
Member Author

Manually reducing the size of the envelopes for points may significantly increase the size of the Quadtree. This may cause executors to run out of memory or exceed the Kryo buffer limit when serializing the spatial index. STRtree is a better option for spatial indexing. We've already made STRtree the default configuration in #1159, so I'd like to keep the Quadtree indexing unchanged.
