-
Notifications
You must be signed in to change notification settings - Fork 1.3k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Enhance json index to support regexp and range predicate evaluation #12568
Conversation
Codecov ReportAttention: Patch coverage is
Additional details and impacted files@@ Coverage Diff @@
## master #12568 +/- ##
============================================
- Coverage 61.75% 61.73% -0.02%
- Complexity 207 211 +4
============================================
Files 2436 2451 +15
Lines 133233 133853 +620
Branches 20636 20752 +116
============================================
+ Hits 82274 82639 +365
- Misses 44911 45106 +195
- Partials 6048 6108 +60
Flags with carried forward coverage won't be shown. Click here to find out more. ☔ View full report in Codecov by Sentry. |
cc: @itschrispeck and @chenboat for reviewing changes related to mutable json index moving to TreeMap since it changes the implementation of json_extract_index functions. IMO, iterating over all the keys of the mutable json index's map is wasteful, and we can do better by filtering the keys to iterate over by using a TreeMap. The implementation of range and regexp evaluation also benefits from this move. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is awesome!
|
||
Pattern pattern = ((RegexpLikePredicate) predicate).getPattern(); | ||
MutableRoaringBitmap result = null; | ||
char nextChar = BaseJsonIndexCreator.KEY_VALUE_SEPARATOR + 1; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
(minor) We can make this char constant
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ack
Pattern pattern = ((RegexpLikePredicate) predicate).getPattern(); | ||
MutableRoaringBitmap result = null; | ||
char nextChar = BaseJsonIndexCreator.KEY_VALUE_SEPARATOR + 1; | ||
SortedMap<String, RoaringBitmap> subMap = |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We can extract a method to calculate the subMap.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ack
@@ -291,6 +297,88 @@ private RoaringBitmap getMatchingFlattenedDocIds(Predicate predicate) { | |||
} else { | |||
return new RoaringBitmap(); | |||
} | |||
} else if (predicateType == Predicate.Type.REGEXP_LIKE) { | |||
String ceilingKey = _postingListMap.ceilingKey(key + BaseJsonIndexCreator.KEY_VALUE_SEPARATOR); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
(minor) This is actually the lower boundary, consider naming it properly. Currently it is slightly confusing.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ack
@@ -291,6 +297,88 @@ private RoaringBitmap getMatchingFlattenedDocIds(Predicate predicate) { | |||
} else { | |||
return new RoaringBitmap(); | |||
} | |||
} else if (predicateType == Predicate.Type.REGEXP_LIKE) { | |||
String ceilingKey = _postingListMap.ceilingKey(key + BaseJsonIndexCreator.KEY_VALUE_SEPARATOR); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do we post key along? If so, we can directly try to find the key, and when calculating the subMap, exclude the lower bound
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sorry didn't follow this. Can you elaborate?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I was suggesting to handle it similar to the immutable one, where we lookup the key
from the map (_postingListMap.get(key)
), if it returns null
it means the key does not exist; if it exists, we can use key
as lower bound to calculate the subMap
, and set the lower bound to be exclusive
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ack
continue; | ||
} | ||
if (result == null) { | ||
result = entry.getValue().toMutableRoaringBitmap(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Suggest not converting it to MutableRoaringBitmap
. RoaringBitmap
itself is mutable, and is faster
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ack
MutableRoaringBitmap result = null; | ||
char nextChar = BaseJsonIndexCreator.KEY_VALUE_SEPARATOR + 1; | ||
SortedMap<String, RoaringBitmap> subMap = | ||
_postingListMap.subMap(ceilingKey, true, _postingListMap.floorKey(key + nextChar), true); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do we need to calculate floor here? Can we directly put key + "\u0001"
and make higher bound exclusive? This way we don't need to worry about key ending with "\u0001"
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
key + "\u0001"
doesn't exist in the map correct? The idea here is to find the key just below key + "\u0001"
such that it's the that last entry for the key we're searchig.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
When calling subMap()
, does it require the lower and upper bound exist in the map? If not, we can simply use key
as lower bound, key + '\u0001'
as upper bound, exclusive on both side to get the sub-map
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Makes sense. subMap does not require the key to be present
@@ -67,15 +69,17 @@ public RangePredicate(ExpressionContext lhs, String range) { | |||
int upperLength = upper.length(); | |||
_upperInclusive = upper.charAt(upperLength - 1) == UPPER_INCLUSIVE; | |||
_upperBound = upper.substring(0, upperLength - 1); | |||
_rangeDataType = null; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Consider making this UNKNOWN
? We should also put some comments to prevent people from using the data type if the predicate is constructed this way (which is the case most of the time).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ack
_postingListMap.subMap(ceilingKey, true, _postingListMap.floorKey(key + nextChar), true); | ||
|
||
RangePredicate rangePredicate = (RangePredicate) predicate; | ||
FieldSpec.DataType rangeDataType = rangePredicate.getRangeDataType(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In JSON, only number (double) and string are supported. We can probably simplify the handling to only differentiate double comparison and string comparison. This can prevent the corner case where value is double (e.g. 1.5
) in JSON, but being converted to integer when doing predicate key > 1
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Makes sense. I've simplified to differentiate b/w numeric and string types
+1 we thought about this but weren't sure how much insert/exact match performance would be impacted. I am happy to see this change, TreeMap fits our typical use cases much better! |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM with minor comments
case JSON: | ||
return StringUtils.compare((String) value1, (String) value2); | ||
case BYTES: | ||
return new ByteArray((byte[]) value1).compareTo(new ByteArray((byte[]) value2)); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
return new ByteArray((byte[]) value1).compareTo(new ByteArray((byte[]) value2)); | |
return ByteArray.compare((byte[]) value1, (byte[]) value2); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ack
return Long.compare((long) value1, (long) value2); | ||
case STRING: | ||
case JSON: | ||
return StringUtils.compare((String) value1, (String) value2); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
return StringUtils.compare((String) value1, (String) value2); | |
return ((String) value1).compareTo((String) value2); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ack
return _postingListMap.subMap(key + BaseJsonIndexCreator.KEY_VALUE_SEPARATOR, false, | ||
key + BaseJsonIndexCreator.KEY_VALUE_SEPARATOR_NEXT_CHAR, false); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
return _postingListMap.subMap(key + BaseJsonIndexCreator.KEY_VALUE_SEPARATOR, false, | |
key + BaseJsonIndexCreator.KEY_VALUE_SEPARATOR_NEXT_CHAR, false); | |
return _postingListMap.subMap(key + JsonIndexCreator.KEY_VALUE_SEPARATOR, false, | |
key + JsonIndexCreator.KEY_VALUE_SEPARATOR_NEXT_CHAR, false); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Also modify ImmutableJsonIndexReader
to use this constant
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ack
Added support for REGEXP_LIKE and RANGE predicate evaluation on a json path's values using json index
Release notes