Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Allow to overwrite index configs at tier level #10553

Merged
merged 3 commits into from
Apr 17, 2023

Conversation

klsince
Copy link
Contributor

@klsince klsince commented Apr 5, 2023

This PR allows one to overwrite index configs at tier level, e.g. using many index for hot tier for better query perf but a lot less index for cold tier for more cost saving.

Basically, both tableIndexConfig and fieldConfig sections are extended with a new section called tierOverwrites to specify tier specific index configs. A tier specific TableConfig is generated on the fly when loading segments, so that segments are processed with tier specific index configs. This transient tier specific TableConfig decouples consumers of TableConfig, like IndexType.getConfig() from index-spi, with the logic of looking for tier specific configs.

The MultiDirQuickstart example has been extended a bit to demo this feature, and example config can be found in its table config file, for example:

  1. Overwriting single-column index configs using fieldConfigList
{
  ...
  "fieldConfigList": [    
    {
      "name": "ArrTimeBlk",
      "encodingType": "DICTIONARY",
      "indexes": {
        "inverted": {
          "enabled": "true"
        }
      },
      "tierOverwrites": {
        "hotTier": {
          "encodingType": "DICTIONARY",
          "indexes": { // can change index types to use on this tier
            "bloom": {
              "enabled": "true"
            }
          }
        },
        "coldTier": {
          "encodingType": "RAW", // can change encoding type to use on this tier
          "indexes": { } // remove all indexes
        }
      }
    }
  ],
  1. Overwriting StarTree index configs using tableIndexConfig
  "tableIndexConfig": {
    "starTreeIndexConfigs": [
      {
        "dimensionsSplitOrder": [
          "AirlineID",
          "Origin",
          "Dest"
        ],
        "skipStarNodeCreationForDimensions": [],
        "functionColumnPairs": [
          "COUNT__*",
          "MAX__ArrDelay"
        ],
        "maxLeafRecords": 10
      }
    ],
...
    "tierOverwrites": {
      "hotTier": {
        "starTreeIndexConfigs": [ // use a different ST index on this tier
          {
            "dimensionsSplitOrder": [
              "Carrier",
              "CancellationCode",
              "Origin",
              "Dest"
            ],
            "skipStarNodeCreationForDimensions": [],
            "functionColumnPairs": [
              "MAX__CarrierDelay",
              "AVG__CarrierDelay"
            ],
            "maxLeafRecords": 10
          }
        ]
      },
      "coldTier": {
        "starTreeIndexConfigs": [] // removes ST index for this tier
      }
    }
  },
 ...

Note: it's also working to overwrite single-column index with tableIndexConfig but recommended to use fieldConfigList for clarity.

@codecov-commenter
Copy link

codecov-commenter commented Apr 5, 2023

Codecov Report

Merging #10553 (ef1a4c4) into master (513fa31) will increase coverage by 2.59%.
The diff coverage is 85.71%.

@@             Coverage Diff              @@
##             master   #10553      +/-   ##
============================================
+ Coverage     67.84%   70.44%   +2.59%     
- Complexity     6001     6504     +503     
============================================
  Files          1575     2106     +531     
  Lines         81930   113068   +31138     
  Branches      12870    17041    +4171     
============================================
+ Hits          55586    79648   +24062     
- Misses        22450    27837    +5387     
- Partials       3894     5583    +1689     
Flag Coverage Δ
integration1 24.53% <1.19%> (?)
integration2 24.20% <1.19%> (?)
unittests1 67.99% <85.71%> (+0.15%) ⬆️
unittests2 13.88% <0.00%> (?)

Flags with carried forward coverage won't be shown. Click here to find out more.

Impacted Files Coverage Δ
...org/apache/pinot/spi/config/table/TableConfig.java 63.80% <ø> (ø)
...local/segment/index/loader/IndexLoadingConfig.java 83.61% <75.75%> (-1.32%) ⬇️
...he/pinot/common/utils/config/TableConfigUtils.java 81.25% <89.74%> (+4.91%) ⬆️
...org/apache/pinot/spi/config/table/FieldConfig.java 97.05% <100.00%> (+0.18%) ⬆️
.../apache/pinot/spi/config/table/IndexingConfig.java 95.69% <100.00%> (+0.19%) ⬆️
...he/pinot/spi/utils/builder/TableConfigBuilder.java 87.27% <100.00%> (+0.23%) ⬆️

... and 835 files with indirect coverage changes

📣 We’re building smart automated test selection to slash your CI/CD build times. Learn more

@klsince klsince changed the title [WIP] allow to overwrite index configs at tier level Allow to overwrite index configs at tier level Apr 6, 2023
@gortiz
Copy link
Contributor

gortiz commented Apr 11, 2023

In order to add the ability to override configs by adding a tiered field, here you follow a pattern that is quite common in Pinot but I tried to avoid in index-spi. This pattern is the following:

  1. We add a new concept that modify several different parts of the code.
  2. This new concept change how different parts of the code behave.
  3. We drag this concept until the specific parts of the code that are actually modified by the new concept.

A clear example of this (used in the index spi PEP) is the bloom filter config. The first Pinot versions that supported bloom filter expected to configure them in bloomFilterColumns attribute. Then we wanted to be able to configure the bloom filter on each column, so we added bloomFilterConfigs. Then each time we wanted to know the bloom filter config for a column we had to first look for the column in bloomFIlterConfig and in case we don't find it, look for it in bloomFilterColumns (and if we found it there we generate a default config). Not being centralized, we needed to do that several times in different parts of the code. This is a problem because each time we add a new concept like that we have to be sure we touch all the places where this logic is affected. In index-spi PEP I said that this is a problem I associate with not having different models for user specific syntax (in this case, the different ways a user may define a bloom filter) and internal and developer oriented syntax (where we don't care about the different ways the user may configure a bloom filter, we just want them to be unified in a single model). index-spi solved this problem by extracting all the different ways an index can be configured in the TableConfig (customer specific syntax/model) and returning a single config object (developer specific model).

In this PR I think we are falling in the same pattern. The new concept is the ability to override configs by tier. The customer specific syntax is still the TableConfig and instead of centralize the transformation between how the user writes something and how the code reacts to it, we are again spreading the logic in several parts of the code. In this case, index types and startree indexes need to know about tiered (when in fact they don't actually care). IICU the code, encodingType is not actually being override, so we should also add this logic in at least another place.

Instead of doing this, I suggest to try to follow another pattern. For example:

  1. We still add tiered concept in different places in TableConfig
  2. We add a preprocessor layer where the original TableConfig with N tiered configs is transformed into N+1 TableConfig objects without tiered.
  3. For each N+1 TableConfig, we apply the logic we have right now.

I think it is easier to see this with an example. In this PR you modifies the airlineStats_offline_table_config.json in order to add the following in FieldConfigList (I'm going to ignore the changes in Startree indexes, but it would also apply there):

{
      "name": "ArrTimeBlk",
      "encodingType": "DICTIONARY",
      "indexes": {
        "inverted": {
          "enabled": "true"
        }
      },
      "tierOverwrites": {
        "hotTier": {
          "encodingType": "DICTIONARY",
          "indexes": {
            "bloom": {
              "enabled": "true"
            }
          }
        },
        "coldTier": {
          "encodingType": "RAW",
          "indexes": {
            "text": {
              "enabled": "true"
            }
          }
        }
      }

This means that we have 2 tiers plus the default one, so the preprocessor would generate 3 tables that will be equal except for the FieldConfig named ArrTimeBlk, where we will have 3 different versions:

The default would be:

      "name": "ArrTimeBlk",
      "encodingType": "DICTIONARY",
      "indexes": {
        "inverted": {
          "enabled": "true"
        }
      }

The hotTier would be:

      "name": "ArrTimeBlk",
      "encodingType": "DICTIONARY",
      "indexes": {
        "inverted": {
          "enabled": "true"
        },
        "bloom": {
          "enabled": "true"
        }
      }

The coldTier would be:

      "name": "ArrTimeBlk",
      "encodingType": "RAW", // This was also modified!
      "indexes": {
        "inverted": {
          "enabled": "true"
        },
        "indexes": {
          "text": {
            "enabled": "true"
          }
        }
      }

In these 3 new TableConfigs there is no tierOverride, so we can read indexes (and any other config!) without actually needed to know about them.

The preprocessor should not be very complex. It needs to know how to merge jsons (which is not that difficult) and needs to verify that the thing that has been override is correct (for example in this case we could forbid to override the name).

In the abstract void, this approach seems more elegant than the other. The separation of responsibilities is clear: The preprocessor takes care of the tieredOverride, the IndexType.getConfig() take a TableConfig without overrides and generates an index config. The main issue here is that I don't know how much it may affect the rest of the code. In fact I would like to use different classes for TableConfigWithTieredOverride and TieredConfigWithoutOverride, but that would imply so many changes in the code that I don't think it would be feasible right now.

@klsince
Copy link
Contributor Author

klsince commented Apr 11, 2023

@gortiz thanks for the comments. I like your idea of generating different TableConfig instances for tiers for better separation. The Index.getConfig() encapsulates a lot of details of assembling index configs from everywhere in TableConfig, so I'd +1 to reuse it than make it more complex for being aware of tier (which is the idea of current change). Below are a few points about current change, but I'll think more about your idea anyway.

... encodingType is not actually being override, so we should also add this logic in at least another place...

The current change allows to overwrite the encodingType via FieldConfig.withTierOverwrites(tier)

... In these 3 new TableConfigs there is no tierOverride, so we can read indexes (and any other config!) without actually needed to know about them...

The current change can use any other config as well. If tier is provided, the tier specific ones take precedence.

@npawar
Copy link
Contributor

npawar commented Apr 12, 2023

@klsince would you please add some more description to the PR, giving examples of the final config you are going with (based on @gortiz feedback)?

@klsince
Copy link
Contributor Author

klsince commented Apr 12, 2023

@gortiz @npawar I updated PR, trying to implement the new idea proposed by @gortiz pls take another look.
btw, PR description is updated for the new way of implementing the feature.

}
Schema schema = new Schema();
for (String column : getAllKnownColumns()) {
schema.addField(new DimensionFieldSpec(column, FieldSpec.DataType.STRING, true));
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I guess this is only going to be executed in tests, but using a random type here looks... problematic.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yeah, I was a bit confused while seeing this. This logic was there already and I didn't change how it's used other than moving it to this helper method.

The column names put in this dummy schema are used in some prod code. This dummy STRING type must not. It looks like the sole purpose of this dummy schema is to pass the column names to FieldIndexConfigsUtil.createIndexConfigsByColName(), when _schema is null (which can happen e.g. in MultipleTreesBuilder).

@gortiz
Copy link
Contributor

gortiz commented Apr 13, 2023

I would like to take a look to the tests tomorrow, but it looks good to me. Let's see what actual committers think about it :)

@saurabhd336
Copy link
Contributor

@klsince Is there a reason we're not setting _tableConfig to the TableConfigUtils.overwriteTableConfigForTier(_tableConfig, _segmentTier) inside IndexLoadingConfig?
Is there a reason we'd want to preserve original tableConfig?

@klsince
Copy link
Contributor Author

klsince commented Apr 17, 2023

Thanks for reviewing @saurabhd336, and as to your question: the overwriting logic is more like merging, as it is done for those top level config keys. So need to keep the original _tableConfig to derive tier specific index configs correctly, otherwise we may accumulate index configs from different tiers.

* },
* ...
* ]
* }
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

how does this work given tableIndexConfig is at table level, vs fieldConfigList is at column level? What exactly is one overriding when adding in tableIndexing? Or is the instruction to just repeat whatever json fields one is interested in ?

Copy link
Contributor Author

@klsince klsince Apr 17, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The precedence between using fieldConfigList or tableIndexConfig is defined by IndexType.getConfig() from index-spi, which basically favors fieldConfigList. But this is the detail of IndexType.getConfig() and may get changed as needed in future. So changes in this PR tried to decouple IndexType.getConfig() from the logic of applying tier specific configs.

But for IndexType.getConfig() to access tier specific configs, we make a temp TableConfig by copying tier specific configs from tierOverwrites out and overwrite those defined for default tier, and then pass the TableConfig object to IndexType.getConfig().

... is the instruction to just repeat whatever json fields...

mostly right, but only for top-level config keys defined in IndexingConfig or FieldConfig, e.g. StarTreeIndexConfig is overwritten as a whole, instead of overwriting the inner fields of StarTreeIndexConfig.

@@ -460,6 +467,7 @@ public TableConfig build() {
indexingConfig.setAggregateMetrics(_aggregateMetrics);
indexingConfig.setOptimizeDictionaryForMetrics(_optimizeDictionaryForMetrics);
indexingConfig.setNoDictionarySizeRatioThreshold(_noDictionarySizeRatioThreshold);
indexingConfig.setTierOverwrites(_tierOverwrites);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why is this only being set in indexingConfig and not in fieldConfigList? or rather - how does one know which overwrites on is providing via the setter?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Both tableIndexConfig (i.e. the indexingConfig object here) and fieldConfigList have their own tierOverwrites. The fieldConfigList (with tierOverwrites on individual fieldConfig as needed by the caller) is provided to this TableConfigBuilder separately.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

clarified from offline discussion (method name is a bit confusing as it's on top level TableConfigBuilder, but can only be used to set indexingConfig). This can be addressed separately as it only affects tests.

@npawar npawar merged commit d03eab2 into apache:master Apr 17, 2023
@klsince klsince deleted the tier_overwrites branch April 17, 2023 22:00
npawar pushed a commit that referenced this pull request May 2, 2023
This PR tries to make index config tier overwriting #10553 more robust.

Turns out the fix in #10642 is not enough, e.g. if one updates tierConfigs in table config and calls rebalance API to move segments around and it happens before SegmentRelocator task kicks in to update the segment target tiers, it's still possible that segments are loaded on the new servers without using the tier specific index configs. So moving the logic that updates segment target tier into rebalanceTable method to be shared by both table rebalance API and SegmentRelocator periodic task.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants