-
Notifications
You must be signed in to change notification settings - Fork 1.3k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Implement mutable index using index SPI #10687
Conversation
0c90fe9
to
eb430c7
Compare
...al/src/main/java/org/apache/pinot/segment/local/indexsegment/mutable/MutableSegmentImpl.java
Show resolved
Hide resolved
...al/src/main/java/org/apache/pinot/segment/local/indexsegment/mutable/MutableSegmentImpl.java
Show resolved
Hide resolved
...al/src/main/java/org/apache/pinot/segment/local/indexsegment/mutable/MutableSegmentImpl.java
Outdated
Show resolved
Hide resolved
...al/src/main/java/org/apache/pinot/segment/local/indexsegment/mutable/MutableSegmentImpl.java
Show resolved
Hide resolved
...al/src/main/java/org/apache/pinot/segment/local/indexsegment/mutable/MutableSegmentImpl.java
Show resolved
Hide resolved
...al/src/main/java/org/apache/pinot/segment/local/indexsegment/mutable/MutableSegmentImpl.java
Show resolved
Hide resolved
...al/src/main/java/org/apache/pinot/segment/local/indexsegment/mutable/MutableSegmentImpl.java
Show resolved
Hide resolved
...al/src/main/java/org/apache/pinot/segment/local/indexsegment/mutable/MutableSegmentImpl.java
Outdated
Show resolved
Hide resolved
...al/src/main/java/org/apache/pinot/segment/local/indexsegment/mutable/MutableSegmentImpl.java
Show resolved
Hide resolved
...al/src/main/java/org/apache/pinot/segment/local/realtime/impl/geospatial/MutableH3Index.java
Show resolved
Hide resolved
28a9150
to
e4baf84
Compare
...n/java/org/apache/pinot/segment/local/realtime/impl/invertedindex/NativeMutableFSTIndex.java
Show resolved
Hide resolved
...n/java/org/apache/pinot/segment/local/realtime/impl/invertedindex/NativeMutableFSTIndex.java
Show resolved
Hide resolved
...src/main/java/org/apache/pinot/segment/local/segment/index/datasource/MutableDataSource.java
Show resolved
Hide resolved
...cal/src/main/java/org/apache/pinot/segment/local/segment/index/forward/ForwardIndexType.java
Show resolved
Hide resolved
pinot-segment-spi/src/main/java/org/apache/pinot/segment/spi/index/AbstractIndexType.java
Outdated
Show resolved
Hide resolved
...egment-spi/src/main/java/org/apache/pinot/segment/spi/index/mutable/MutableForwardIndex.java
Show resolved
Hide resolved
pinot-segment-spi/src/main/java/org/apache/pinot/segment/spi/index/mutable/MutableIndex.java
Show resolved
Hide resolved
...al/src/main/java/org/apache/pinot/segment/local/indexsegment/mutable/MutableSegmentImpl.java
Show resolved
Hide resolved
...al/src/main/java/org/apache/pinot/segment/local/indexsegment/mutable/MutableSegmentImpl.java
Show resolved
Hide resolved
import org.apache.pinot.spi.data.FieldSpec; | ||
|
||
|
||
public interface MutableIndexContext { | ||
PinotDataBufferMemoryManager getMemoryManager(); | ||
public class MutableIndexContext { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Like we did in IndexCreationContext during index-spi
, the idea here is:
- Make the interface a class
- Move the logic from Common to the class
- Remove Wrapper and all extra flavors.
Instead of using the extra flavors, each index type receives this context and their own configuration.
I also added 3 extra attributes to the class:
- _estimatedColSize
- _estimatedCardinality
- _avgNumMultiValues
As they were used by some indexes and may be useful for future indexes.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
looks like IndexCreationContext is still an interface, but it keeps the inner Common classes to hold the context fields. should it be a class too to be consistent with here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think I forgot to push something. It should be fixed now
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
pardon me, what's the new fixes you pushed? got a link to them?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can we consider this resolved?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Oh, sorry, I thought you mean to change MutableIndexContext from interface to class. I didn't touch IndexCreationContext
. In case we change it, it should be done in another PR
875d4ff
to
a11a68c
Compare
Codecov Report
@@ Coverage Diff @@
## master #10687 +/- ##
=============================================
- Coverage 70.32% 31.60% -38.73%
+ Complexity 6473 453 -6020
=============================================
Files 2157 2159 +2
Lines 116016 116028 +12
Branches 17552 17563 +11
=============================================
- Hits 81586 36665 -44921
- Misses 28731 76046 +47315
+ Partials 5699 3317 -2382
Flags with carried forward coverage won't be shown. Click here to find out more.
... and 1347 files with indirect coverage changes 📣 We’re building smart automated test selection to slash your CI/CD build times. Learn more |
62da41e
to
8253a6f
Compare
...l/src/main/java/org/apache/pinot/segment/local/segment/index/inverted/InvertedIndexType.java
Show resolved
Hide resolved
import org.apache.pinot.segment.spi.index.IndexReader; | ||
|
||
|
||
public interface MutableIndex extends IndexReader { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Would it make sense for a MutableIndex to implement both IndexReader and IndexCreator? The add methods can be reused I believe?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I do see you need docId in mutable index's add call. I remember discussing passing docId to IndexCreator add too.. Can we unify both mutable and immutable creator to unify the add signature?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
does it have to extend IndexReader? As this is the counter part of IndexCreator
but for creating index on mutable segments, maybe we call this MutableIndexCreator
and extends Closable
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I see this is extending IndexReader because the implementations like RealtimeInvertedIndex are acting as Reader as well as Creator.
- would it be a tall order to split that up and clean it up for all impls?
- If 1 is too much, can we take Saurabh's suggestion and have this extend IndexCreator too by unifying the signatures?
Either way, please add javadocs to the interface, it is confusing to understand without that as to why this interface has add methods but is extending Reader
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
To be honest I don't understand the concern here. In the theoretical void it makes sense to make MutableIndex
to extend IndexReader
and IndexCreator
. I mean, at high level we see a MutableIndex as something that is a reader and a creator at the same time. But at lower level it doesn't make sense. It is clear once we focus on the semantics these interfaces.
As suggested by Neha, I will try to add a javadoc in MutableIndex trying to summarize what is said in this (quite long) text.
IndexReader
Given that each specific IndexReader uses their own methods to read, we weren't able to generalize more methods in IndexReader, this interface is actually just a marker interface that extends Closeable. As you can expect, there is nothing in common in how a bloom filter and range index are used, for example.
does it have to extend IndexReader?
That is a good question. Given that IndexReader is just a marker interface, they don't have to extend IndexReader. What we would actually like to say is something like the following:
public interface MutableIndex<IR extends IndexReader> extends IR
Then MutableInvertedIndex should be declared as:
public interface MutableInvertedIndex extends MutableIndex<InvertedIndexReader>
If I correctly understood Neha, that is what she suggest when she said:
would it be a tall order to split that up and clean it up for all impls?
But Java typesystem is not expressive enough to say that and the code above doesn't compile. In public interface Whatever<E> extends E
, E must be an interface and the Java generic system cannot ensure that (there is no way to say that E has to be a interface and not a class).
Another option would be to change IndexType interface from:
public interface IndexType<C extends IndexConfig, IR extends IndexReader, IC extends IndexCreator> {
...
MutableIndex createMutableIndex(MutableIndexContext context, C config);
...
}
to:
public interface IndexType<C extends IndexConfig, IR extends IndexReader, IC extends IndexCreator, IM extends IR & IC> {
...
IM createMutableIndex(MutableIndexContext context, C config);
...
}
That has two problems. First, it breaks compatibility. We should change all usages of IndexType<?, ?, ?>
with IndexType<?, ?, ?, ?>
. This is not desired, but acceptable. The main problem is that, again, it doesn't compile
Therefore, as far as I know, the most we can do in MutableIndex in the reading part is to extend IndexReader. This means that each time we want to use a MutableWhateverIndex as a reader we actually need to cast it to the specific index reader we need to use. Do we have to extend IndexReader? No, we don't. But as a marker interface, it marks some intention: MutableIndexes should implement some specific index reader. This is not true in the case of IndexCreators.
About IndexCreator
IndexCreator isn't just the interface used to create indexes. It is the interface used to create indexes in offline tables. MutableIndexes is the interface used to read and create indexes in realtime segments. That detail is important. Being focused on offline segments IndexCreator imposes some contraints that are not fulfilled in MutableIndexes. More specifically, rows must be added in docId order starting from docId 0. This is is a constraint we always had, but was just explicitly added in its javadoc in index-spi
.
On the other hand, MutableIndexes are used in realtime segments where the docId isn't just increasing each time a new row is added. TBH I don't fully understand how MutableSegmentImpl
deals with docIds when they can change once segment is committed and therefore reordered if there is a sorted column, but it is clear that it isn't just an auto-increasing int, which is the IndexCreator constraint. Given that MutableIndexes cannot assume that constraint, they cannot implement IndexCreator.
This isn't something new. Before this PR, each MutableWhateverIndexes
implemented WhateverIndexReader
but they did not implement WhateverIndexCreator
.
I do see you need docId in mutable index's add call. I remember discussing passing docId to IndexCreator add too.. Can we unify both mutable and immutable creator to unify the add signature?
I hope the text above makes it clear, but in order to be more explicit, when we discussed about docId in IndexCreator, the question was don't we need to include the docId as a parameter? and the answer was no, the docId is an auto-increasing value.
We cannot remove the docId in MutableIndex and we should not add the docId in IndexCreator. The first is already reasoned here, the later was reasoned in the other PR, but I think it is worth to repeat it here. We should not add docId as a parameter in IndexCreator because implementations of IndexCreator assume the docId is an auto-increasing value. In case we add the parameter we would need to add a runtime precondition in IndexCreators that verify the constraint we have at compilation level right now. This will make the code less table and more confusing. In that scenario a reader would think: why do IndexCreators accept a parameter that is always deterministic but add a precondition to it?. We would lose compile time checks and substitute them with a runtime check. And the only reason to do that is to make interfaces fit in our high level mental model where a MutableIndex is used to create and read indexes, when there is no actual semantical equivalent between these interfaces.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I still think it would be clearer if we don't even extend IndexReader. That creates the clear separation between the Immutable index's IndexCreator / IndexReader and the Mutable index's MutableIndex. I think all the confusion from the reviewers stemmed from the fact that the boundaries got blurred in one case but we're calling out separation in another case. But will leave it up to you.
...al/src/main/java/org/apache/pinot/segment/local/indexsegment/mutable/MutableSegmentImpl.java
Show resolved
Hide resolved
pinot-segment-spi/src/main/java/org/apache/pinot/segment/spi/index/AbstractIndexType.java
Outdated
Show resolved
Hide resolved
pinot-segment-spi/src/main/java/org/apache/pinot/segment/spi/index/mutable/MutableIndex.java
Outdated
Show resolved
Hide resolved
import org.apache.pinot.segment.spi.index.IndexReader; | ||
|
||
|
||
public interface MutableIndex extends IndexReader { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
does it have to extend IndexReader? As this is the counter part of IndexCreator
but for creating index on mutable segments, maybe we call this MutableIndexCreator
and extends Closable
import org.apache.pinot.spi.data.FieldSpec; | ||
|
||
|
||
public interface MutableIndexContext { | ||
PinotDataBufferMemoryManager getMemoryManager(); | ||
public class MutableIndexContext { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
looks like IndexCreationContext is still an interface, but it keeps the inner Common classes to hold the context fields. should it be a class too to be consistent with here.
...cal/src/main/java/org/apache/pinot/segment/local/segment/index/forward/ForwardIndexType.java
Outdated
Show resolved
Hide resolved
...gment-local/src/main/java/org/apache/pinot/segment/local/segment/index/fst/FstIndexType.java
Outdated
Show resolved
Hide resolved
...ent-local/src/main/java/org/apache/pinot/segment/local/segment/index/text/TextIndexType.java
Outdated
Show resolved
Hide resolved
...al/src/main/java/org/apache/pinot/segment/local/indexsegment/mutable/MutableSegmentImpl.java
Show resolved
Hide resolved
pinot-segment-spi/src/main/java/org/apache/pinot/segment/spi/index/IndexType.java
Show resolved
Hide resolved
import org.apache.pinot.segment.spi.index.IndexReader; | ||
|
||
|
||
public interface MutableIndex extends IndexReader { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I see this is extending IndexReader because the implementations like RealtimeInvertedIndex are acting as Reader as well as Creator.
- would it be a tall order to split that up and clean it up for all impls?
- If 1 is too much, can we take Saurabh's suggestion and have this extend IndexCreator too by unifying the signatures?
Either way, please add javadocs to the interface, it is confusing to understand without that as to why this interface has add methods but is extending Reader
...main/java/org/apache/pinot/controller/recommender/realtime/provisioning/MemoryEstimator.java
Outdated
Show resolved
Hide resolved
if (config.isDisabled()) { | ||
return null; | ||
} | ||
if (!context.getFieldSpec().isSingleValueField()) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
is this check needed? (for all indexes that dont support MV). Shouldn't this be caught by validation itself?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That is an interesting question.
What previous code does when if founds a multi-valued column indexed with an index it doesnt support is:
- Create the mutable index.
- Never call
add
on it.
AFAIU this means that at query time:
- Either we check check if the column is multivalued and in that case do not apply the index.
- Or we apply the index and we return false results because the MutableIndex will be empty.
Instead of creating an empty mutable index, the new code do not create the index at all. That is compatible with option number 1, which I'm assuming is the one we follow.
Shouldn't this be caught by validation itself?
A derivative question is: These indexes that do not support mutable multi-valued versions... do support immutable multi-valued versions? For example, range index supports multi-valued columns in offline tables but do not support them in realtime. Therefore it makes sense to define a range index on a multi-valued column.
I'm assuming that what the older code does in that case is to create a MutableRangeIndex, never add elements to it and when the segment is committed, the segment is actually indexed by the RangeIndexCreator. Therefore we cannot fail when a MutableRangeIndex is trying to be created on a multi-valued column.
In case an index is not supported neither in realltime and in offline tables on multi-valued columns, we should fail in the getConfig
function, which is the one that should do the actual validation.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM to my best knowledge, and left a few non-blocking comments.
...al/src/main/java/org/apache/pinot/segment/local/indexsegment/mutable/MutableSegmentImpl.java
Show resolved
Hide resolved
...al/src/main/java/org/apache/pinot/segment/local/indexsegment/mutable/MutableSegmentImpl.java
Show resolved
Hide resolved
import org.apache.pinot.spi.data.FieldSpec; | ||
|
||
|
||
public interface MutableIndexContext { | ||
PinotDataBufferMemoryManager getMemoryManager(); | ||
public class MutableIndexContext { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
pardon me, what's the new fixes you pushed? got a link to them?
Is there something else we need to modify here? or can we merge it? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
lgtm
import org.apache.pinot.segment.spi.index.IndexReader; | ||
|
||
|
||
public interface MutableIndex extends IndexReader { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I still think it would be clearer if we don't even extend IndexReader. That creates the clear separation between the Immutable index's IndexCreator / IndexReader and the Mutable index's MutableIndex. I think all the confusion from the reviewers stemmed from the fact that the boundaries got blurred in one case but we're calling out separation in another case. But will leave it up to you.
I understand the confusion, but I think this is the correct design:
IndexCreator, as all Java interfaces, is a contract that imposes some constraints. MutableIndex implementations do not fulfil this contract, so MutableIndex cannot implement IndexCreator. I would say that the confusion is not due to MutableIndex implementing IndexReader but not IndexCreator. The thing that creates the confusion is the name of IndexCreator. Given that IndexCreator is the interface that is used to create offline indexes, it should be called something like We can change the name of IndexCreator in this or another PR if you think it would be better. I would prefer to do that in another PR in order to do not merge the changes from a rename here. |
… not implement IndexCreator
…egment.spi.index.IndexUtil
…nfig says otherwise
Tags should be:
feature
backward-incompat
When #10183 was added, mutable indexes were explicitly not migrated to this new system in order to reduce the changes and make the initial feature easier to merge.
This new PR introduces mutable indexes in the Index Service Interface.
IndexType
subclasses can optionally implementcreateMutableIndex
. In case they return a value different than null, theMutableIndex.add
method will be called whenever theMutableSegmentImpl
receives a new row.Things to note:
MutableSegmentImpl
that talks about a check that doesn't exist. It seems that the comment was introduced in 2019 and the code was changed between 2022 and 2022. I guess we want to delete it, but I'm leaving the// TODO (mutable-index-spi)
there to make it easier to reviewers to find the place where the comment is.FieldSpec.IndexType
enum was used inMutableSegmentImpl
to show some errors and export some metrics. That enum is partial. It doesn't include all possible types and it will never do (given that now index types are not know at compile time). Therefore I've changed that code to useIndexType
. There is a compatibility issue with that: metrics will have a slightly different name. Therefore that is a change that breaks compatibility and whybackward-incompat
has to be added.MemoryEstimator
requires aSchema
now. That means thatRealtimeProvisioningHelperCommand
needs to provide one.RealtimeProvisioningHelperCommand
was already able to read the schema when --schemaWithMetadataFile
was supplied. In fact that argument was mandatory when-sampleCompletedSegmentDir
was not supplied. The easiest way I've found to make everything compile now is to always require-schemaWithMetadataFile
even when-sampleCompletedSegmentDir
was used. I don't know how much this tool is used or the impact it may have. The file was mainly touched by @npawar and @walterddr, so please take a look in case I break something.