new interval based cost function #2972
Conversation
Addresses issues with balancing of segments in the existing cost function:
- `gapPenalty` led to clusters of segments ~30 days apart
- `recencyPenalty` caused imbalance among recent segments
- size-based cost could be skewed by compression

The new cost function is purely based on segment intervals:
- assumes each time-slice of a partition is a constant cost
- cost is additive, i.e. cost(A, B union C) = cost(A, B) + cost(A, C)
- cost decays exponentially based on distance between time-slices
    return 0;
  }
}

static final double HALF_LIFE = 24.0; // cost function half-life in hours
it'd be really nice to have some comments about what everything means and how the algo works
half life is by definition ln(2) / lambda
i.e. the time difference that will make the joint cost go down by half
👍 if we have more comments about how the algorithm works
@fjy added a few more comments + links to the PR description in the code
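For reference, the relationship between the half-life and the decay rate can be written out as follows. This is a minimal sketch; the class and the `LAMBDA` constant name are illustrative and not necessarily what the PR's code uses.

```java
class CostConstants
{
  // Half-life of the exponential decay, in hours (from the diff above).
  static final double HALF_LIFE = 24.0;

  // Decay rate chosen so the joint cost halves every HALF_LIFE hours:
  //   exp(-LAMBDA * HALF_LIFE) = 0.5   =>   LAMBDA = ln(2) / HALF_LIFE
  static final double LAMBDA = Math.log(2) / HALF_LIFE;
}
```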
return intervalCost(y0, y0, y1) +
       intervalCost(beta, beta, gamma) +
       // cost of exactly overlapping intervals of size beta
       2 * (beta + FastMath.exp(-beta) - 1);
Can this be a constant?
Or I suppose this is just the solution to the integral?
Can the comment be described as such?
@drcrallen this can't be a constant, since beta depends on interval start / end. And yes, this is the solution to
\int_0^{\beta} \int_{0}^{\beta} e^{-|x-y|}\,dx\,dy = 2 \cdot (\beta + e^{-\beta} - 1)
I'll add some comments
I meant the constant out front, before I realized it's just the solution to the integral.
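As a quick sanity check of that closed form, the double integral can be approximated numerically and compared against `2 * (beta + exp(-beta) - 1)`. This is a standalone sketch, not part of the PR; it uses plain `Math` rather than `FastMath`, and the value of `beta` is arbitrary.

```java
public class OverlapCostCheck
{
  public static void main(String[] args)
  {
    final double beta = 3.5;   // arbitrary interval length for the check
    final int n = 2_000;       // grid resolution
    final double h = beta / n;

    // Midpoint-rule approximation of \int_0^beta \int_0^beta e^{-|x - y|} dx dy
    double numeric = 0.0;
    for (int i = 0; i < n; i++) {
      for (int j = 0; j < n; j++) {
        final double x = (i + 0.5) * h;
        final double y = (j + 0.5) * h;
        numeric += Math.exp(-Math.abs(x - y)) * h * h;
      }
    }

    final double closedForm = 2 * (beta + Math.exp(-beta) - 1);
    System.out.printf("numeric=%.6f closedForm=%.6f%n", numeric, closedForm);
  }
}
```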
{
  final long gapMillis = gapMillis(segment1.getInterval(), segment2.getInterval());
you can probably also delete static method gapMillis now
I already removed it.
👍
* new interval based cost function

  Addresses issues with balancing of segments in the existing cost function:
  - `gapPenalty` led to clusters of segments ~30 days apart
  - `recencyPenalty` caused imbalance among recent segments
  - size-based cost could be skewed by compression

  New cost function is purely based on segment intervals:
  - assumes each time-slice of a partition is a constant cost
  - cost is additive, i.e. cost(A, B union C) = cost(A, B) + cost(A, C)
  - cost decays exponentially based on distance between time-slices

* comments and formatting
* add more comments to explain the calculation
Backport balancing improvements apache#2910 apache#2964 apache#2972
Awesome work! Can't wait to migrate. As a little teaser, could you maybe let me know how much impact this new balancing strategy has on query performance?
@xvrl could you comment on why you used `FastMath`?
It's interesting because according to https://issues.apache.org/jira/browse/MATH-1422 "FastMath" may not actually be faster, but OTOH it is supposed to be more precise. FYI @dgolitsyn
@leventov I only used `FastMath.exp`
Our current segment balancing algorithm uses a cost function which has some unintended effects on the distribution of segments in the cluster. While it keeps the total number of segments roughly balanced across nodes, it does not evenly distribute segments within a given interval. This leads to many queries hitting only a fraction of the historical nodes in the cluster.
This is a proposal to revamp the cost function to address those shortcomings and improve overall segment distribution.
TL;DR here's how segment distribution has improved in the days since we rolled out this improved cost function to one of our Druid clusters (each dot is a segment, each color represents a different data source).
Issues with our existing cost function
Here is a breakdown of the issues with the current cost function
Segments tend to form clusters with large gaps in between
If you take two segments roughly 30 days apart, the existing `gapPenalty` causes the cost of putting a segment anywhere between those two to be higher than adding a segment that overlaps with them. This causes the balancing to converge to clusters of segments ~30 days apart, as seen below.

Segments for recent time intervals tend to be distributed unevenly
The recencyPenalty, in combination with the gapPenalty, makes the cost function non-trivial for recent segments and causes strange gaps of data within the most recent 7 days worth of data.
Cost is skewed by segment byte sizes

The base joint cost between two segments is tied to segment byte sizes. This means the cost is affected by different levels of compression and is not necessarily tied to the actual cost of scanning a segment.
Improved cost function
Our improved cost function is designed to be simple and have the following properties:

1. the cost of a segment should depend only on its time interval, not its size
2. segments close in time should have a higher joint cost than segments that are far apart
3. the joint cost should be additive, i.e. cost(A, B union C) = cost(A, B) + cost(A, C)
To satisfy 1. we will assume that the cost of scanning a unit time-slice of data is constant. This means our cost function will only depend on segment intervals.
To model the joint cost of querying two time-slices and satisfy 2., we use an exponential decay `exp(-lambda * t)`, where lambda defines the rate at which the interaction decays and t is the relative time difference between the two time-slices. Currently lambda is set to give the decay a half-life of 24 hours.

Since we assume the cost of querying each time-slice is constant, we can compute the joint cost of two segments by simply integrating over the segment intervals, i.e. for two segments X and Y covering intervals [x_0, x_1) and [y_0, y_1) respectively, the joint cost becomes:

cost(X, Y) = \int_{x_0}^{x_1} \int_{y_0}^{y_1} e^{-\lambda |x-y|} \,dy\,dx
This has the nice property of being additive as in 3. and of having a closed form solution.
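To spell out why: additivity follows from splitting the inner integral over disjoint intervals, and for the simple case of two disjoint intervals (rescaling time so that lambda = 1) the integral reduces to a product of exponentials:

\int_{A} \int_{B \cup C} e^{-\lambda |x-y|} \,dy\,dx = \int_{A} \int_{B} e^{-\lambda |x-y|} \,dy\,dx + \int_{A} \int_{C} e^{-\lambda |x-y|} \,dy\,dx

\int_{x_0}^{x_1} \int_{y_0}^{y_1} e^{-(y-x)} \,dy\,dx = (e^{x_1} - e^{x_0}) (e^{-y_0} - e^{-y_1}) \quad \text{for } y_0 \ge x_1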
One nice side-effect of additivity is that re-indexing the same data at different segment granularities (e.g. going from hourly segments to daily segments) will not affect the joint cost, assuming the cost of scanning a unit of time stays the same per shard.
Segments of the same data source are of course more likely to be queried simultaneously, so we multiply the cost by a constant factor (2 in this case) if the two segments are from the same data source. This ensures that if two data sources are distributed equally, we are more likely to spread their segments across servers.
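Putting the pieces together, here is a self-contained sketch of how the joint cost might be computed for two disjoint segments and then weighted by the data-source factor. The class, method names, and example data-source strings are illustrative assumptions, not the exact code in the PR; the closed form used here is the disjoint-interval case shown above with the decay rate folded back in.

```java
public class JointCostSketch
{
  // Decay rate giving the joint cost a half-life of 24 hours (time measured in hours).
  static final double LAMBDA = Math.log(2) / 24.0;

  /**
   * Closed-form joint cost of two disjoint intervals [x0, x1) and [y0, y1) in hours,
   * with y0 >= x1:  \int_{x0}^{x1} \int_{y0}^{y1} e^{-LAMBDA * (y - x)} dy dx
   */
  static double disjointIntervalCost(double x0, double x1, double y0, double y1)
  {
    return (Math.exp(LAMBDA * x1) - Math.exp(LAMBDA * x0))
           * (Math.exp(-LAMBDA * y0) - Math.exp(-LAMBDA * y1))
           / (LAMBDA * LAMBDA);
  }

  /** Weight same-data-source pairs more heavily so they get spread across servers. */
  static double jointCost(double intervalCost, String dataSource1, String dataSource2)
  {
    return dataSource1.equals(dataSource2) ? 2.0 * intervalCost : intervalCost;
  }

  public static void main(String[] args)
  {
    // Two daily segments of the same data source, one day apart.
    double intervalCost = disjointIntervalCost(0, 24, 48, 72);
    System.out.println(jointCost(intervalCost, "events", "events"));
  }
}
```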
Hopefully this all makes sense, but comments are of course welcome.
Regarding performance, the new cost function is more expensive to compute than the existing one, but thanks to the optimizations in #2910 it will not be worse than what we have in our current release, and maybe even a little bit faster still.
The results of rolling out this new cost function speak for themselves. You can see how segment distribution has massively improved in our Druid cluster for the problem cases depicted above.