Proposal: Test framework to simulate segment balancing #12822

Closed
kfaraz opened this issue Jul 26, 2022 · 1 comment · Fixed by #13074

kfaraz commented Jul 26, 2022

Motivation

This proposal is inspired by the original thoughts in #9087

From a point of view of testing, segment balancing poses the following challenges:

  • In practice, balancing is a slow process and issues (or their resolution) may take several hours
    or even days to manifest themselves.
  • A bug may occur only in a very specific cluster setup, thus making it difficult to reproduce.
  • The level of confidence in any change made to the balancing logic or a strategy is low, as
    there are no ITs around this behaviour and the unit tests verify only very basic behaviour.
  • Owing to the large number of moving parts and underlying async execution, balancing is prone to
    erroneous behaviour resulting from race conditions. These race conditions are difficult to discover
    using typical unit tests.

We can begin to address these concerns with a framework that allows developers to simulate typical
segment balancing scenarios with ease, preferably in a low duty environment, such as a unit test
or an integration test. Such a framework can also help identify performance bottlenecks and
potential bugs in the current system and even compare different balancing strategies.

Possible approaches

Given the requirements, we could choose any one of the following setups for a simulator:

  1. A running service
    • Pros:
      • Exposes APIs to specify inputs
      • Supports live visualizations in a browser
    • Cons:
      • Requires a large amount of manual intervention
      • No way to verify the results of a run automatically
      • No way to save the input parameters of an ad-hoc test run (a DB is not an option)
      • The only real value-add compared to the other options is visualization, which would be overkill for the task at hand
  2. An integration test framework
    • Pros:
      • Closely resembles a production setup
      • Live interaction between all of Druid's moving parts
    • Cons: The very fact that it closely resembles a production setup makes it a poor candidate,
      as it would suffer from the same reproduction challenges:
      • difficult to recreate scenarios involving a large number of servers or segments
      • not possible to verify the effect of multiple coordinator runs in a short span of time
      • resource-intensive
  3. A unit test framework
    • Pros:
      • Great degree of control
      • Easy to add a new combination of input parameters as a new test case
      • No manual intervention required for verification
      • Easy to recreate even the most elaborate of cluster setups and actions on the fly
      • The underlying framework can be extended to power visualizations if needed
    • Cons:
      • Not a perfect representation of the production environment (a vice allowed to all tests)

Proposed changes

As seen above, a unit-test framework would be the ideal candidate for these simulations.
The framework should be able to:

  • recreate varied cluster setups
  • run simulations that can cycle through a large number of coordinator runs in a short amount of time
  • test the interaction of the main coordinator pieces involved
  • take the actions listed below at pre-determined points during the course of the simulation
  • verify results of the simulation

Programmer inputs

The framework should give control over the following aspects of the setup:

| Input | Details | Actions |
|-------|---------|---------|
| cluster | server name, type, tier, size | add a server, remove a server |
| segment | datasource, interval, version, partition num, size | add/remove from server, mark used/unused, publish new segments |
| rules | type (foreverLoad, drop, etc.), replica count per tier | add a rule for a datasource, add default rule |
| configs | coordinator period, load queue type, load queue size, max segments to balance, and a bunch of other configs | set or update a config |
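
For illustration, the information captured by the cluster input above could be represented by a simple value object such as the sketch below. This is only a hypothetical shape; the framework may instead reuse Druid's existing DruidServer and DataSegment classes directly.

// Hypothetical sketch only: one possible representation of the "cluster" input.
// The actual framework may simply reuse org.apache.druid.client.DruidServer instead.
final class SimulatedServerSpec
{
  final String name;      // e.g. "historical_1"
  final String type;      // e.g. "historical"
  final String tier;      // e.g. "_default_tier"
  final long sizeBytes;   // total segment cache size of the server

  SimulatedServerSpec(String name, String type, String tier, long sizeBytes)
  {
    this.name = name;
    this.type = type;
    this.tier = tier;
    this.sizeBytes = sizeBytes;
  }
}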

Basic setup

The following classes are the objects under test and must not be mocked:

  • DruidCoordinator
  • LoadQueuePeon
  • various coordinator duties: BalanceSegments, RunRules, UnloadUnusedSegments, etc.

The following behaviour needs to be mocked:

  • loading of a segment on a server
  • interactions with metadata store

Since some behaviour is mocked, we might miss out on some actual production scenarios,
but the mock entities can always be augmented to account for them in specific test cases, say
by adding a delay before completing a segment load or by failing a metadata store operation,
as in the sketch below.
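
For instance, a minimal sketch of such an augmentation, assuming the framework exposes some callback for segment loading (SegmentLoadHandler and the class names here are hypothetical, not actual framework classes):

// Hypothetical sketch: SegmentLoadHandler stands in for whatever callback the
// framework uses to mock the loading of a segment on a server.
interface SegmentLoadHandler
{
  void loadSegment(String serverName, String segmentId);
}

// Decorator that delays every segment load by a fixed duration, simulating
// slow historicals in a specific test case.
class DelayingSegmentLoadHandler implements SegmentLoadHandler
{
  private final SegmentLoadHandler delegate;
  private final long delayMillis;

  DelayingSegmentLoadHandler(SegmentLoadHandler delegate, long delayMillis)
  {
    this.delegate = delegate;
    this.delayMillis = delayMillis;
  }

  @Override
  public void loadSegment(String serverName, String segmentId)
  {
    try {
      // Simulate a slow segment load before delegating to the default mock behaviour
      Thread.sleep(delayMillis);
    }
    catch (InterruptedException e) {
      Thread.currentThread().interrupt();
      throw new RuntimeException(e);
    }
    delegate.loadSegment(serverName, segmentId);
  }
}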

Interim actions

A custom coordinator duty can be used to invoke the specified actions at the end of every coordinator run.

// ActionRunnableCoordinatorDuty

// Assumed fields of the duty: a counter of coordinator runs completed so far, and
// the actions keyed by the run number after which each of them should be invoked.
private final AtomicInteger runCount = new AtomicInteger();
private final Map<Integer, Action> actions;

@Override
public DruidCoordinatorRuntimeParams run(DruidCoordinatorRuntimeParams params) {
    final int currentRun = runCount.incrementAndGet();

    // Invoke the action registered against this run number, if any
    if (actions.containsKey(currentRun)) {
        actions.get(currentRun).invoke();
    }

    return params;
}

The actions for a particular simulation could be specified like this:

ActionRunnableCoordinatorDuty actionsDuty = ActionRunnableCoordinatorDuty
    .createDutyWithActions()
    .afterEveryRun(runNumber -> {collectStats(runNumber); stopIfBalanced();})
    .afterRun(10, () -> killHistorical("historicalXyz"))
    .afterRun(20, () -> addHistoricalTier("reporting_tier"))
    .afterRun(30, () -> updateLoadRules(...))
    .afterRun(50, () -> completeCoordinatorRun(...))
    .build();
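
One possible shape for this builder is sketched below; internally it only needs to register each action against its run number in the map consulted by the duty's run() method above (Action and the class name are illustrative, not the final implementation).

import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch of the builder internals
interface Action
{
  void invoke();
}

class ActionScheduleBuilder
{
  // Actions keyed by the coordinator run number after which they should fire
  private final Map<Integer, Action> actions = new HashMap<>();

  ActionScheduleBuilder afterRun(int runNumber, Action action)
  {
    actions.put(runNumber, action);
    return this;
  }

  // In the real builder this would construct the ActionRunnableCoordinatorDuty with the
  // registered actions; returning the raw map here keeps the sketch self-contained.
  Map<Integer, Action> build()
  {
    return actions;
  }
}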
    

Typical test case

@Test
public void testBalancingThenTierShift() {

   // Initial rule, 2 replicas on _default_tier
   List<Rule> initialRules = Collections.singletonList(
      new ForeverLoadRule(Collections.singletonMap("_default_tier", 2))
   );
   
   // Updated rule, 3 replicas on reporting_tier
   List<Rule> tierShiftRules = Collections.singletonList(
      new ForeverLoadRule(Collections.singletonMap("reporting_tier", 3))
   );   

   // Balance segments first across "_default_tier", then shift them to "reporting_tier"
   ActionRunnableCoordinatorDuty actionsDuty = ActionRunnableCoordinatorDuty
         .createDutyWithActions()
         .afterRun(20, () -> metadataRuleManager.overrideRule("wikitest", tierShiftRules, null))
         .afterRun(50, () -> completeCoordinatorRun(...))
         .build();
   
   // Create segments with a bit of fluid syntax sugar (or syntax syrup if you will)
   List<DataSegment> segments =
              createSegmentsForDatasource("wikitest")
                         .overInterval("2022-01-01/2022-03-01")
                         .withGranularity(Granularities.DAY)
                         .andPartitions(10)
                         .eachOfSizeInMb(500);
                         
   // Create servers
   List<DruidServer> allServers = 
              createHistoricals(
                     createTier("_default_tier").withServers(5).eachOfSizeInGb(100)
                     createTier("reporting_tier").withServers(3).eachOfSizeInGb(50)
              );
   
   // Build and run the simulation                      
   buildSimulation()
          .withActionsDuty(actionsDuty)
          .withRules(initialRules)
          .withUsedSegments(segments)
          .withServers(allServers)
          .withCoordinatorPeriod("PT1s")
          .run();
      
   assertThatDatasourceIsFullyLoaded("wikitest");
   assertThatClusterIsBalanced();
}

Work status

I am currently working on a PR which contains the above features except

  • assertion of balanced state
  • starting a simulation with a given segment distribution without having to wait for initial balancing

I have also been able to discover some race conditions thanks to this framework and
intend to create subsequent PRs for those.

Future work

Measure of balance

In order to verify the success of a simulated run, we need to define some criteria of success.
The motivation to balance segments and the major underlying strategy have been discussed
at length in the following:

  1. Overview of coordinator process
  2. Overview of balancing
  3. New interval cost function

Taking this as our starting point, our measures of success should be that:

  • segments are evenly distributed across servers
  • time-adjacent segments are not co-located
  • the system is able to achieve this state as quickly as possible

These parameters would need to be quantified and measured at the end of every run to get a clear sense
of the balancing progress.
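
As one illustration of such a quantified measure, the evenness of segment distribution could be
summarized by the coefficient of variation of segment counts across servers. This is only a
sketch of a possible metric, not necessarily the one the framework will use:

import java.util.Collection;

// Sketch of a possible balance metric: coefficient of variation (stddev / mean) of the
// number of segments on each server. Lower is better; 0 means a perfectly even spread.
final class BalanceMetrics
{
  static double segmentCountCv(Collection<Integer> segmentCountPerServer)
  {
    final int numServers = segmentCountPerServer.size();
    if (numServers == 0) {
      return 0.0;
    }

    double mean = 0.0;
    for (int count : segmentCountPerServer) {
      mean += count;
    }
    mean /= numServers;
    if (mean == 0.0) {
      return 0.0;
    }

    double sumOfSquaredDeviations = 0.0;
    for (int count : segmentCountPerServer) {
      sumOfSquaredDeviations += (count - mean) * (count - mean);
    }

    return Math.sqrt(sumOfSquaredDeviations / numServers) / mean;
  }
}

A similar score could be computed per tier, and the second criterion (co-location of
time-adjacent segments) would need its own measure derived from the interval cost function.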

@abhishekagarwal87
Contributor

Thank you for putting this together @kfaraz. Looking forward to the PR. Having a unit test framework will make it easier to write and run more tests. As it stands today, writing ITs in Druid is non-trivial work. Even if we add a framework for running ITs for segment balancing, we are unlikely to achieve coverage similar to what we can get via the unit test framework.

kfaraz added a commit that referenced this issue Sep 21, 2022
Fixes #12822 

The framework added here makes it easy to write tests that verify the behaviour and interactions
of the following entities under various conditions:
- `DruidCoordinator`
- `HttpLoadQueuePeon`, `LoadQueueTaskMaster`
- coordinator duties: `BalanceSegments`, `RunRules`, `UnloadUnusedSegments`, etc.
- datasource retention rules: `LoadRule`, `DropRule`

Changes:
Add the following main classes:
- `CoordinatorSimulation` and related interfaces to dictate behaviour of simulation
- `CoordinatorSimulationBuilder` to build a simulation.
- `BlockingExecutorService` to keep submitted tasks in queue and execute them
  only when explicitly invoked.

Add tests:
- `CoordinatorSimulationBaseTest`, `SegmentLoadingTest`, `SegmentBalancingTest`
- `SegmentLoadingNegativeTest` to contain tests which assert the existing erroneous behaviour
of segment loading. Once the behaviour is fixed, these tests will be moved to the regular
`SegmentLoadingTest`.

Please refer to the README.md in `org.apache.druid.server.coordinator.simulate` for more details