Allow to overwrite index configs at tier level #10553

klsince · 2023-04-05T18:13:18Z

This PR allows one to overwrite index configs at tier level, e.g. using many index for hot tier for better query perf but a lot less index for cold tier for more cost saving.

Basically, both tableIndexConfig and fieldConfig sections are extended with a new section called tierOverwrites to specify tier specific index configs. A tier specific TableConfig is generated on the fly when loading segments, so that segments are processed with tier specific index configs. This transient tier specific TableConfig decouples consumers of TableConfig, like IndexType.getConfig() from index-spi, with the logic of looking for tier specific configs.

The MultiDirQuickstart example has been extended a bit to demo this feature, and example config can be found in its table config file, for example:

Overwriting single-column index configs using fieldConfigList

{
  ...
  "fieldConfigList": [    
    {
      "name": "ArrTimeBlk",
      "encodingType": "DICTIONARY",
      "indexes": {
        "inverted": {
          "enabled": "true"
        }
      },
      "tierOverwrites": {
        "hotTier": {
          "encodingType": "DICTIONARY",
          "indexes": { // can change index types to use on this tier
            "bloom": {
              "enabled": "true"
            }
          }
        },
        "coldTier": {
          "encodingType": "RAW", // can change encoding type to use on this tier
          "indexes": { } // remove all indexes
        }
      }
    }
  ],

Overwriting StarTree index configs using tableIndexConfig

  "tableIndexConfig": {
    "starTreeIndexConfigs": [
      {
        "dimensionsSplitOrder": [
          "AirlineID",
          "Origin",
          "Dest"
        ],
        "skipStarNodeCreationForDimensions": [],
        "functionColumnPairs": [
          "COUNT__*",
          "MAX__ArrDelay"
        ],
        "maxLeafRecords": 10
      }
    ],
...
    "tierOverwrites": {
      "hotTier": {
        "starTreeIndexConfigs": [ // use a different ST index on this tier
          {
            "dimensionsSplitOrder": [
              "Carrier",
              "CancellationCode",
              "Origin",
              "Dest"
            ],
            "skipStarNodeCreationForDimensions": [],
            "functionColumnPairs": [
              "MAX__CarrierDelay",
              "AVG__CarrierDelay"
            ],
            "maxLeafRecords": 10
          }
        ]
      },
      "coldTier": {
        "starTreeIndexConfigs": [] // removes ST index for this tier
      }
    }
  },
 ...

Note: it's also working to overwrite single-column index with tableIndexConfig but recommended to use fieldConfigList for clarity.

codecov-commenter · 2023-04-05T18:50:29Z

Codecov Report

Merging #10553 (ef1a4c4) into master (513fa31) will increase coverage by 2.59%.
The diff coverage is 85.71%.

@@             Coverage Diff              @@
##             master   #10553      +/-   ##
============================================
+ Coverage     67.84%   70.44%   +2.59%     
- Complexity     6001     6504     +503     
============================================
  Files          1575     2106     +531     
  Lines         81930   113068   +31138     
  Branches      12870    17041    +4171     
============================================
+ Hits          55586    79648   +24062     
- Misses        22450    27837    +5387     
- Partials       3894     5583    +1689

Flag	Coverage Δ
integration1	`24.53% <1.19%> (?)`
integration2	`24.20% <1.19%> (?)`
unittests1	`67.99% <85.71%> (+0.15%)`	⬆️
unittests2	`13.88% <0.00%> (?)`

Flags with carried forward coverage won't be shown. Click here to find out more.

Impacted Files	Coverage Δ
...org/apache/pinot/spi/config/table/TableConfig.java	`63.80% <ø> (ø)`
...local/segment/index/loader/IndexLoadingConfig.java	`83.61% <75.75%> (-1.32%)`	⬇️
...he/pinot/common/utils/config/TableConfigUtils.java	`81.25% <89.74%> (+4.91%)`	⬆️
...org/apache/pinot/spi/config/table/FieldConfig.java	`97.05% <100.00%> (+0.18%)`	⬆️
.../apache/pinot/spi/config/table/IndexingConfig.java	`95.69% <100.00%> (+0.19%)`	⬆️
...he/pinot/spi/utils/builder/TableConfigBuilder.java	`87.27% <100.00%> (+0.23%)`	⬆️

... and 835 files with indirect coverage changes

📣 We’re building smart automated test selection to slash your CI/CD build times. Learn more

pinot-segment-spi/src/main/java/org/apache/pinot/segment/spi/index/AbstractIndexType.java

...-tools/src/main/resources/examples/batch/airlineStats/airlineStats_offline_table_config.json

...rc/test/java/org/apache/pinot/segment/local/segment/index/loader/IndexLoadingConfigTest.java

gortiz · 2023-04-11T07:52:33Z

In order to add the ability to override configs by adding a tiered field, here you follow a pattern that is quite common in Pinot but I tried to avoid in index-spi. This pattern is the following:

We add a new concept that modify several different parts of the code.
This new concept change how different parts of the code behave.
We drag this concept until the specific parts of the code that are actually modified by the new concept.

A clear example of this (used in the index spi PEP) is the bloom filter config. The first Pinot versions that supported bloom filter expected to configure them in bloomFilterColumns attribute. Then we wanted to be able to configure the bloom filter on each column, so we added bloomFilterConfigs. Then each time we wanted to know the bloom filter config for a column we had to first look for the column in bloomFIlterConfig and in case we don't find it, look for it in bloomFilterColumns (and if we found it there we generate a default config). Not being centralized, we needed to do that several times in different parts of the code. This is a problem because each time we add a new concept like that we have to be sure we touch all the places where this logic is affected. In index-spi PEP I said that this is a problem I associate with not having different models for user specific syntax (in this case, the different ways a user may define a bloom filter) and internal and developer oriented syntax (where we don't care about the different ways the user may configure a bloom filter, we just want them to be unified in a single model). index-spi solved this problem by extracting all the different ways an index can be configured in the TableConfig (customer specific syntax/model) and returning a single config object (developer specific model).

In this PR I think we are falling in the same pattern. The new concept is the ability to override configs by tier. The customer specific syntax is still the TableConfig and instead of centralize the transformation between how the user writes something and how the code reacts to it, we are again spreading the logic in several parts of the code. In this case, index types and startree indexes need to know about tiered (when in fact they don't actually care). IICU the code, encodingType is not actually being override, so we should also add this logic in at least another place.

Instead of doing this, I suggest to try to follow another pattern. For example:

We still add tiered concept in different places in TableConfig
We add a preprocessor layer where the original TableConfig with N tiered configs is transformed into N+1 TableConfig objects without tiered.
For each N+1 TableConfig, we apply the logic we have right now.

I think it is easier to see this with an example. In this PR you modifies the airlineStats_offline_table_config.json in order to add the following in FieldConfigList (I'm going to ignore the changes in Startree indexes, but it would also apply there):

{
      "name": "ArrTimeBlk",
      "encodingType": "DICTIONARY",
      "indexes": {
        "inverted": {
          "enabled": "true"
        }
      },
      "tierOverwrites": {
        "hotTier": {
          "encodingType": "DICTIONARY",
          "indexes": {
            "bloom": {
              "enabled": "true"
            }
          }
        },
        "coldTier": {
          "encodingType": "RAW",
          "indexes": {
            "text": {
              "enabled": "true"
            }
          }
        }
      }

This means that we have 2 tiers plus the default one, so the preprocessor would generate 3 tables that will be equal except for the FieldConfig named ArrTimeBlk, where we will have 3 different versions:

The default would be:

      "name": "ArrTimeBlk",
      "encodingType": "DICTIONARY",
      "indexes": {
        "inverted": {
          "enabled": "true"
        }
      }

The hotTier would be:

      "name": "ArrTimeBlk",
      "encodingType": "DICTIONARY",
      "indexes": {
        "inverted": {
          "enabled": "true"
        },
        "bloom": {
          "enabled": "true"
        }
      }

The coldTier would be:

      "name": "ArrTimeBlk",
      "encodingType": "RAW", // This was also modified!
      "indexes": {
        "inverted": {
          "enabled": "true"
        },
        "indexes": {
          "text": {
            "enabled": "true"
          }
        }
      }

In these 3 new TableConfigs there is no tierOverride, so we can read indexes (and any other config!) without actually needed to know about them.

The preprocessor should not be very complex. It needs to know how to merge jsons (which is not that difficult) and needs to verify that the thing that has been override is correct (for example in this case we could forbid to override the name).

In the abstract void, this approach seems more elegant than the other. The separation of responsibilities is clear: The preprocessor takes care of the tieredOverride, the IndexType.getConfig() take a TableConfig without overrides and generates an index config. The main issue here is that I don't know how much it may affect the rest of the code. In fact I would like to use different classes for TableConfigWithTieredOverride and TieredConfigWithoutOverride, but that would imply so many changes in the code that I don't think it would be feasible right now.

klsince · 2023-04-11T21:34:33Z

@gortiz thanks for the comments. I like your idea of generating different TableConfig instances for tiers for better separation. The Index.getConfig() encapsulates a lot of details of assembling index configs from everywhere in TableConfig, so I'd +1 to reuse it than make it more complex for being aware of tier (which is the idea of current change). Below are a few points about current change, but I'll think more about your idea anyway.

... encodingType is not actually being override, so we should also add this logic in at least another place...

The current change allows to overwrite the encodingType via FieldConfig.withTierOverwrites(tier)

... In these 3 new TableConfigs there is no tierOverride, so we can read indexes (and any other config!) without actually needed to know about them...

The current change can use any other config as well. If tier is provided, the tier specific ones take precedence.

npawar · 2023-04-12T02:01:28Z

@klsince would you please add some more description to the PR, giving examples of the final config you are going with (based on @gortiz feedback)?

klsince · 2023-04-12T18:56:52Z

@gortiz @npawar I updated PR, trying to implement the new idea proposed by @gortiz pls take another look.
btw, PR description is updated for the new way of implementing the feature.

gortiz · 2023-04-13T18:09:46Z

...al/src/main/java/org/apache/pinot/segment/local/segment/index/loader/IndexLoadingConfig.java

+    }
+    Schema schema = new Schema();
+    for (String column : getAllKnownColumns()) {
+      schema.addField(new DimensionFieldSpec(column, FieldSpec.DataType.STRING, true));


I guess this is only going to be executed in tests, but using a random type here looks... problematic.

yeah, I was a bit confused while seeing this. This logic was there already and I didn't change how it's used other than moving it to this helper method.

The column names put in this dummy schema are used in some prod code. This dummy STRING type must not. It looks like the sole purpose of this dummy schema is to pass the column names to FieldIndexConfigsUtil.createIndexConfigsByColName(), when _schema is null (which can happen e.g. in MultipleTreesBuilder).

gortiz · 2023-04-13T18:11:22Z

I would like to take a look to the tests tomorrow, but it looks good to me. Let's see what actual committers think about it :)

saurabhd336 · 2023-04-17T06:24:46Z

@klsince Is there a reason we're not setting _tableConfig to the TableConfigUtils.overwriteTableConfigForTier(_tableConfig, _segmentTier) inside IndexLoadingConfig?
Is there a reason we'd want to preserve original tableConfig?

klsince · 2023-04-17T17:53:31Z

Thanks for reviewing @saurabhd336, and as to your question: the overwriting logic is more like merging, as it is done for those top level config keys. So need to keep the original _tableConfig to derive tier specific index configs correctly, otherwise we may accumulate index configs from different tiers.

npawar · 2023-04-17T20:27:05Z

pinot-common/src/main/java/org/apache/pinot/common/utils/config/TableConfigUtils.java

+   *     },
+   *     ...
+   *   ]
+   * }


how does this work given tableIndexConfig is at table level, vs fieldConfigList is at column level? What exactly is one overriding when adding in tableIndexing? Or is the instruction to just repeat whatever json fields one is interested in ?

The precedence between using fieldConfigList or tableIndexConfig is defined by IndexType.getConfig() from index-spi, which basically favors fieldConfigList. But this is the detail of IndexType.getConfig() and may get changed as needed in future. So changes in this PR tried to decouple IndexType.getConfig() from the logic of applying tier specific configs.

But for IndexType.getConfig() to access tier specific configs, we make a temp TableConfig by copying tier specific configs from tierOverwrites out and overwrite those defined for default tier, and then pass the TableConfig object to IndexType.getConfig().

... is the instruction to just repeat whatever json fields...

mostly right, but only for top-level config keys defined in IndexingConfig or FieldConfig, e.g. StarTreeIndexConfig is overwritten as a whole, instead of overwriting the inner fields of StarTreeIndexConfig.

npawar · 2023-04-17T20:28:33Z

pinot-spi/src/main/java/org/apache/pinot/spi/utils/builder/TableConfigBuilder.java

@@ -460,6 +467,7 @@ public TableConfig build() {
    indexingConfig.setAggregateMetrics(_aggregateMetrics);
    indexingConfig.setOptimizeDictionaryForMetrics(_optimizeDictionaryForMetrics);
    indexingConfig.setNoDictionarySizeRatioThreshold(_noDictionarySizeRatioThreshold);
+    indexingConfig.setTierOverwrites(_tierOverwrites);


why is this only being set in indexingConfig and not in fieldConfigList? or rather - how does one know which overwrites on is providing via the setter?

Both tableIndexConfig (i.e. the indexingConfig object here) and fieldConfigList have their own tierOverwrites. The fieldConfigList (with tierOverwrites on individual fieldConfig as needed by the caller) is provided to this TableConfigBuilder separately.

clarified from offline discussion (method name is a bit confusing as it's on top level TableConfigBuilder, but can only be used to set indexingConfig). This can be addressed separately as it only affects tests.

This PR tries to make index config tier overwriting #10553 more robust. Turns out the fix in #10642 is not enough, e.g. if one updates tierConfigs in table config and calls rebalance API to move segments around and it happens before SegmentRelocator task kicks in to update the segment target tiers, it's still possible that segments are loaded on the new servers without using the tier specific index configs. So moving the logic that updates segment target tier into rebalanceTable method to be shared by both table rebalance API and SegmentRelocator periodic task.

allow to overwrite index configs at tier level

78f56e0

klsince force-pushed the tier_overwrites branch from 6fda5b9 to 78f56e0 Compare April 5, 2023 22:43

add UT

cb8d355

klsince changed the title ~~[WIP] allow to overwrite index configs at tier level~~ Allow to overwrite index configs at tier level Apr 6, 2023

gortiz reviewed Apr 11, 2023

View reviewed changes

pinot-segment-spi/src/main/java/org/apache/pinot/segment/spi/index/AbstractIndexType.java Outdated Show resolved Hide resolved

gortiz reviewed Apr 11, 2023

View reviewed changes

...-tools/src/main/resources/examples/batch/airlineStats/airlineStats_offline_table_config.json Show resolved Hide resolved

gortiz reviewed Apr 11, 2023

View reviewed changes

...rc/test/java/org/apache/pinot/segment/local/segment/index/loader/IndexLoadingConfigTest.java Outdated Show resolved Hide resolved

create tier specific TableConfig to be used by IndexType.getConfig

ef1a4c4

klsince force-pushed the tier_overwrites branch from 503010d to ef1a4c4 Compare April 12, 2023 18:46

gortiz reviewed Apr 13, 2023

View reviewed changes

saurabhd336 approved these changes Apr 17, 2023

View reviewed changes

npawar reviewed Apr 17, 2023

View reviewed changes

npawar merged commit d03eab2 into apache:master Apr 17, 2023

klsince deleted the tier_overwrites branch April 17, 2023 22:00

klsince mentioned this pull request Apr 19, 2023

update target tier for segments if tierConfigs is provided #10642

Merged

klsince mentioned this pull request Apr 27, 2023

update segment target tier before table rebalance #10695

Merged

saurabhd336 added the feature label Sep 5, 2023

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Allow to overwrite index configs at tier level #10553

Allow to overwrite index configs at tier level #10553

klsince commented Apr 5, 2023 •

edited

Loading

codecov-commenter commented Apr 5, 2023 •

edited

Loading

gortiz commented Apr 11, 2023

klsince commented Apr 11, 2023

npawar commented Apr 12, 2023

klsince commented Apr 12, 2023

gortiz Apr 13, 2023

klsince Apr 13, 2023

gortiz commented Apr 13, 2023

saurabhd336 commented Apr 17, 2023

klsince commented Apr 17, 2023

npawar Apr 17, 2023

klsince Apr 17, 2023 •

edited

Loading

npawar Apr 17, 2023

klsince Apr 17, 2023

npawar Apr 17, 2023

Allow to overwrite index configs at tier level #10553

Allow to overwrite index configs at tier level #10553

Conversation

klsince commented Apr 5, 2023 • edited Loading

codecov-commenter commented Apr 5, 2023 • edited Loading

Codecov Report

gortiz commented Apr 11, 2023

klsince commented Apr 11, 2023

npawar commented Apr 12, 2023

klsince commented Apr 12, 2023

gortiz Apr 13, 2023

Choose a reason for hiding this comment

klsince Apr 13, 2023

Choose a reason for hiding this comment

gortiz commented Apr 13, 2023

saurabhd336 commented Apr 17, 2023

klsince commented Apr 17, 2023

npawar Apr 17, 2023

Choose a reason for hiding this comment

klsince Apr 17, 2023 • edited Loading

Choose a reason for hiding this comment

npawar Apr 17, 2023

Choose a reason for hiding this comment

klsince Apr 17, 2023

Choose a reason for hiding this comment

npawar Apr 17, 2023

Choose a reason for hiding this comment

klsince commented Apr 5, 2023 •

edited

Loading

codecov-commenter commented Apr 5, 2023 •

edited

Loading

klsince Apr 17, 2023 •

edited

Loading