Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Support st_contains using H3 index #8498

Merged
merged 15 commits into from
Apr 30, 2022

Conversation

WangCHX
Copy link
Contributor

@WangCHX WangCHX commented Apr 9, 2022

This is a follow up PR for : #7252 since that one seems doesn't get update for long time.

The idea is to convert input geometry into a list of h3 cells by using polyfill. But h3 polyfill only fills with the hexagons whose centers are contained by the geometry.
so creating a method coverGeometryInH3 to return the set of H3 cells at the specified resolution which completely cover the input shape.
Instructions:

  1. The PR has to be tagged with at least one of the following labels (*):
    1. feature

(**) Use release-notes label for scenarios like:

  • New configuration options
  • Deprecation of configurations
  • Signature changes to public methods/interfaces
  • New plugins added or old plugins removed

@codecov-commenter
Copy link

codecov-commenter commented Apr 9, 2022

Codecov Report

Merging #8498 (094d2f1) into master (fc12dce) will increase coverage by 5.18%.
The diff coverage is 63.77%.

@@             Coverage Diff              @@
##             master    #8498      +/-   ##
============================================
+ Coverage     63.60%   68.79%   +5.18%     
- Complexity     4322     4326       +4     
============================================
  Files          1648     1694      +46     
  Lines         86917    88960    +2043     
  Branches      13265    13497     +232     
============================================
+ Hits          55281    61197    +5916     
+ Misses        27595    23431    -4164     
- Partials       4041     4332     +291     
Flag Coverage Δ
integration2 25.61% <3.93%> (?)
unittests1 66.51% <63.77%> (+<0.01%) ⬆️
unittests2 14.15% <0.00%> (-0.03%) ⬇️

Flags with carried forward coverage won't be shown. Click here to find out more.

Impacted Files Coverage Δ
.../org/apache/pinot/segment/local/utils/H3Utils.java 40.38% <38.29%> (-19.62%) ⬇️
...ava/org/apache/pinot/core/plan/FilterPlanNode.java 82.53% <66.66%> (-4.13%) ⬇️
...perator/filter/H3InclusionIndexFilterOperator.java 83.92% <83.92%> (ø)
...he/pinot/segment/local/segment/store/IndexKey.java 75.00% <0.00%> (-5.00%) ⬇️
.../impl/dictionary/BaseOffHeapMutableDictionary.java 81.33% <0.00%> (-3.34%) ⬇️
.../java/org/apache/pinot/spi/data/TimeFieldSpec.java 88.63% <0.00%> (-2.28%) ⬇️
...e/pinot/segment/local/io/util/PinotDataBitSet.java 95.62% <0.00%> (-1.46%) ⬇️
...server/starter/helix/HelixInstanceDataManager.java 78.64% <0.00%> (ø)
.../pinot/plugin/metrics/yammer/YammerMetricName.java 33.33% <0.00%> (ø)
.../pinot/server/api/resources/MmapDebugResource.java 0.00% <0.00%> (ø)
... and 338 more

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update fc12dce...094d2f1. Read the comment docs.

Copy link
Member

@richardstartin richardstartin left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍🏻

@@ -210,11 +226,47 @@ public void testH3Index()
Assert.assertNotNull(aggregationResult);
Assert.assertEquals((long) aggregationResult.get(0), NUM_RECORDS);
}

// Test st contains in polygon
testQueryStContain("SELECT COUNT(*) FROM testTable WHERE ST_Contains(ST_GeomFromText('POLYGON ((\n"
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

shall we add some tests of the corner cases, such as on the line/vertex of the polygon?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

sure.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The ST_Contains seems doesn't include the point or line lying in polygon boundary. https://postgis.net/docs/ST_Contains.html
but let me try create a test with points very close to the boundary.

for (long h3IndexId : _h3Ids) {
partialMatchDocIds[i++] = _h3IndexReader.getDocIds(h3IndexId);
}
MutableRoaringBitmap mutableRoaringBitmap = BufferFastAggregation.or(partialMatchDocIds);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this is just approximation, but it doesn't return same result of ST_Contains, right?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i think

expressionFilterOperator.getNextBlock().getBlockDocIdSet().iterator();
    MutableRoaringBitmap result = docIdIterator.applyAnd(partialMatchDocIds);

will do the final check on the potential match set.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

okay, it could be further optimized to skip the cells fully contained in the polygon. But we can leave it for futture (add a TODO) if you find it too difficult.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

sure. will create a TODO.

Copy link
Contributor Author

@WangCHX WangCHX Apr 15, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Created an issue to track here: #8547

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

actually I did the optimization. but still will use the issue to track if H3 support different polyfill natively.

@yupeng9
Copy link
Contributor

yupeng9 commented Apr 13, 2022

High-level comment is that using H3 index can speed up query evaluation, but we need to ensure it does not change query results. H3's Polyfill returns a set of H3 for approximation, but we need one more round of evaluation to check the borderlines.

@WangCHX WangCHX force-pushed the cwang/stcontiansindex branch 4 times, most recently from 2717ee1 to d8f86c4 Compare April 16, 2022 10:07
Copy link
Contributor

@yupeng9 yupeng9 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM. Thanks for adding the tests and glad to see the improved index handling

// arg0 is the literal
geometry = GeometrySerializer.deserialize(BytesUtils.toBytes(arguments.get(0).getLiteral()));
// must be some h3 index
assert _h3IndexReader != null;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

throw some meaningful error message for better debugging?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

sure. add error message for the assert.

@WangCHX WangCHX force-pushed the cwang/stcontiansindex branch 2 times, most recently from dc7b903 to 15e03bb Compare April 23, 2022 11:16
Copy link
Contributor

@Jackie-Jiang Jackie-Jiang left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM with minor comments

@@ -56,6 +59,10 @@


public class FilterPlanNode implements PlanNode {

private static final Set<String> CAN_APPLY_H3_INCLUSION_INDEX_FUNCTION_NAMES =
ImmutableSet.of("st_within", "stwithin", "st_contains", "stcontains");
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(minor) No need to check st_within and st_contains because the function name is already canonicalized (we keep 2 function names for st_distance for backward compatibility). We may directly use the value equals check into the canApplyH3IndexForInclusionCheck() instead of using a set for 2 values.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

sure.

if (arguments.get(0).getType() == ExpressionContext.Type.IDENTIFIER
&& arguments.get(1).getType() == ExpressionContext.Type.LITERAL) {
String columnName = arguments.get(0).getIdentifier();
return columnName != null && _indexSegment.getDataSource(columnName).getH3Index() != null;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(minor) columnName will never be null

Suggested change
return columnName != null && _indexSegment.getDataSource(columnName).getH3Index() != null;
return _indexSegment.getDataSource(columnName).getH3Index() != null;

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

sure.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

update on line 183 as well.

@Override
protected FilterBlock getNextBlock() {
// get the set of H3 cells at the specified resolution which completely cover the input shape and potential cover.
final Pair<LongSet, LongSet> fullCoverAndPotentialCoverCells =
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(minor) We don't usually put final for local variables

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

got it.

// return filtered num_docs
ImmutableRoaringBitmap[] potentialMatchDocIds = new ImmutableRoaringBitmap[potentialCoverH3Cells.size()];
int i = 0;
for (long h3IndexId : potentialCoverH3Cells) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Use iterator (LongIterator.nextLong()) instead to avoid the unnecessary boxing/unboxing, same for other places when iterating over LongSet

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

sure. updated.

LongSet polyfilledSet = new LongOpenHashSet(H3_CORE.polyfill(
Arrays.stream(polygon.getExteriorRing().getCoordinates())
.map(coordinate -> new GeoCoord(coordinate.y, coordinate.x))
.collect(Collectors.toList()), ImmutableList.of(), resolution));
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(minor)

Suggested change
.collect(Collectors.toList()), ImmutableList.of(), resolution));
.collect(Collectors.toList()), Collections.emptyList(), resolution));

@Jackie-Jiang Jackie-Jiang merged commit 498387a into apache:master Apr 30, 2022
@WangCHX WangCHX deleted the cwang/stcontiansindex branch May 10, 2022 12:36
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants