-
Notifications
You must be signed in to change notification settings - Fork 1.3k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Allow to overwrite index configs at tier level #10553
Conversation
Codecov Report
@@ Coverage Diff @@
## master #10553 +/- ##
============================================
+ Coverage 67.84% 70.44% +2.59%
- Complexity 6001 6504 +503
============================================
Files 1575 2106 +531
Lines 81930 113068 +31138
Branches 12870 17041 +4171
============================================
+ Hits 55586 79648 +24062
- Misses 22450 27837 +5387
- Partials 3894 5583 +1689
Flags with carried forward coverage won't be shown. Click here to find out more.
... and 835 files with indirect coverage changes 📣 We’re building smart automated test selection to slash your CI/CD build times. Learn more |
pinot-segment-spi/src/main/java/org/apache/pinot/segment/spi/index/AbstractIndexType.java
Outdated
Show resolved
Hide resolved
...-tools/src/main/resources/examples/batch/airlineStats/airlineStats_offline_table_config.json
Show resolved
Hide resolved
...rc/test/java/org/apache/pinot/segment/local/segment/index/loader/IndexLoadingConfigTest.java
Outdated
Show resolved
Hide resolved
In order to add the ability to override configs by adding a tiered field, here you follow a pattern that is quite common in Pinot but I tried to avoid in
A clear example of this (used in the index spi PEP) is the bloom filter config. The first Pinot versions that supported bloom filter expected to configure them in In this PR I think we are falling in the same pattern. The new concept is the ability to override configs by tier. The customer specific syntax is still the TableConfig and instead of centralize the transformation between how the user writes something and how the code reacts to it, we are again spreading the logic in several parts of the code. In this case, index types and startree indexes need to know about tiered (when in fact they don't actually care). IICU the code, encodingType is not actually being override, so we should also add this logic in at least another place. Instead of doing this, I suggest to try to follow another pattern. For example:
I think it is easier to see this with an example. In this PR you modifies the airlineStats_offline_table_config.json in order to add the following in FieldConfigList (I'm going to ignore the changes in Startree indexes, but it would also apply there): {
"name": "ArrTimeBlk",
"encodingType": "DICTIONARY",
"indexes": {
"inverted": {
"enabled": "true"
}
},
"tierOverwrites": {
"hotTier": {
"encodingType": "DICTIONARY",
"indexes": {
"bloom": {
"enabled": "true"
}
}
},
"coldTier": {
"encodingType": "RAW",
"indexes": {
"text": {
"enabled": "true"
}
}
}
} This means that we have 2 tiers plus the default one, so the preprocessor would generate 3 tables that will be equal except for the FieldConfig named The default would be: "name": "ArrTimeBlk",
"encodingType": "DICTIONARY",
"indexes": {
"inverted": {
"enabled": "true"
}
} The hotTier would be: "name": "ArrTimeBlk",
"encodingType": "DICTIONARY",
"indexes": {
"inverted": {
"enabled": "true"
},
"bloom": {
"enabled": "true"
}
} The coldTier would be: "name": "ArrTimeBlk",
"encodingType": "RAW", // This was also modified!
"indexes": {
"inverted": {
"enabled": "true"
},
"indexes": {
"text": {
"enabled": "true"
}
}
} In these 3 new TableConfigs there is no tierOverride, so we can read indexes (and any other config!) without actually needed to know about them. The preprocessor should not be very complex. It needs to know how to merge jsons (which is not that difficult) and needs to verify that the thing that has been override is correct (for example in this case we could forbid to override the name). In the abstract void, this approach seems more elegant than the other. The separation of responsibilities is clear: The preprocessor takes care of the tieredOverride, the IndexType.getConfig() take a TableConfig without overrides and generates an index config. The main issue here is that I don't know how much it may affect the rest of the code. In fact I would like to use different classes for TableConfigWithTieredOverride and TieredConfigWithoutOverride, but that would imply so many changes in the code that I don't think it would be feasible right now. |
@gortiz thanks for the comments. I like your idea of generating different TableConfig instances for tiers for better separation. The Index.getConfig() encapsulates a lot of details of assembling index configs from everywhere in TableConfig, so I'd +1 to reuse it than make it more complex for being aware of tier (which is the idea of current change). Below are a few points about current change, but I'll think more about your idea anyway.
The current change allows to overwrite the encodingType via
The current change can use any other config as well. If tier is provided, the tier specific ones take precedence. |
503010d
to
ef1a4c4
Compare
} | ||
Schema schema = new Schema(); | ||
for (String column : getAllKnownColumns()) { | ||
schema.addField(new DimensionFieldSpec(column, FieldSpec.DataType.STRING, true)); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I guess this is only going to be executed in tests, but using a random type here looks... problematic.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yeah, I was a bit confused while seeing this. This logic was there already and I didn't change how it's used other than moving it to this helper method.
The column names put in this dummy schema are used in some prod code. This dummy STRING type must not. It looks like the sole purpose of this dummy schema is to pass the column names to FieldIndexConfigsUtil.createIndexConfigsByColName(), when _schema is null (which can happen e.g. in MultipleTreesBuilder).
I would like to take a look to the tests tomorrow, but it looks good to me. Let's see what actual committers think about it :) |
@klsince Is there a reason we're not setting |
Thanks for reviewing @saurabhd336, and as to your question: the overwriting logic is more like merging, as it is done for those top level config keys. So need to keep the original _tableConfig to derive tier specific index configs correctly, otherwise we may accumulate index configs from different tiers. |
* }, | ||
* ... | ||
* ] | ||
* } |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
how does this work given tableIndexConfig is at table level, vs fieldConfigList is at column level? What exactly is one overriding when adding in tableIndexing? Or is the instruction to just repeat whatever json fields one is interested in ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The precedence between using fieldConfigList or tableIndexConfig is defined by IndexType.getConfig() from index-spi, which basically favors fieldConfigList. But this is the detail of IndexType.getConfig() and may get changed as needed in future. So changes in this PR tried to decouple IndexType.getConfig() from the logic of applying tier specific configs.
But for IndexType.getConfig() to access tier specific configs, we make a temp TableConfig by copying tier specific configs from tierOverwrites
out and overwrite those defined for default tier, and then pass the TableConfig object to IndexType.getConfig().
... is the instruction to just repeat whatever json fields...
mostly right, but only for top-level
config keys defined in IndexingConfig or FieldConfig, e.g. StarTreeIndexConfig is overwritten as a whole, instead of overwriting the inner fields of StarTreeIndexConfig.
@@ -460,6 +467,7 @@ public TableConfig build() { | |||
indexingConfig.setAggregateMetrics(_aggregateMetrics); | |||
indexingConfig.setOptimizeDictionaryForMetrics(_optimizeDictionaryForMetrics); | |||
indexingConfig.setNoDictionarySizeRatioThreshold(_noDictionarySizeRatioThreshold); | |||
indexingConfig.setTierOverwrites(_tierOverwrites); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
why is this only being set in indexingConfig and not in fieldConfigList? or rather - how does one know which overwrites on is providing via the setter?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Both tableIndexConfig (i.e. the indexingConfig object here) and fieldConfigList have their own tierOverwrites
. The fieldConfigList (with tierOverwrites on individual fieldConfig as needed by the caller) is provided to this TableConfigBuilder separately.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
clarified from offline discussion (method name is a bit confusing as it's on top level TableConfigBuilder, but can only be used to set indexingConfig). This can be addressed separately as it only affects tests.
This PR tries to make index config tier overwriting #10553 more robust. Turns out the fix in #10642 is not enough, e.g. if one updates tierConfigs in table config and calls rebalance API to move segments around and it happens before SegmentRelocator task kicks in to update the segment target tiers, it's still possible that segments are loaded on the new servers without using the tier specific index configs. So moving the logic that updates segment target tier into rebalanceTable method to be shared by both table rebalance API and SegmentRelocator periodic task.
This PR allows one to overwrite index configs at tier level, e.g. using many index for hot tier for better query perf but a lot less index for cold tier for more cost saving.
Basically, both
tableIndexConfig
andfieldConfig
sections are extended with a new section calledtierOverwrites
to specify tier specific index configs. A tier specific TableConfig is generated on the fly when loading segments, so that segments are processed with tier specific index configs. This transient tier specific TableConfig decouples consumers of TableConfig, like IndexType.getConfig() from index-spi, with the logic of looking for tier specific configs.The MultiDirQuickstart example has been extended a bit to demo this feature, and example config can be found in its table config file, for example:
fieldConfigList
tableIndexConfig
Note: it's also working to overwrite single-column index with
tableIndexConfig
but recommended to use fieldConfigList for clarity.