
Advanced Circuit Breaker

reisenberger edited this page Apr 15, 2016 · 20 revisions

The original CircuitBreaker

The original Polly CircuitBreaker takes the number of consecutive exceptions thrown as its indicator of the health of the underlying actions. This remains highly effective in many scenarios, is easy to understand, and simple to configure. We recommend it as the starting point for most situations.

There are, however, situations where a breaker with more detailed configuration parameters is useful. In high-throughput (and variable-throughput) scenarios in particular, the proportion of failures can be a more consistent indicator of circuit health than a count of consecutive failures, which can fluctuate with load.

The AdvancedCircuitBreaker (v4.2)

The AdvancedCircuitBreaker offers a circuit-breaker which:

  • Reacts to the proportion of failures, the failureThreshold; e.g. break if 50% or more of actions result in a handled failure
  • Measures that proportion over a rolling samplingDuration, so that older failures are excluded and have no effect
  • Imposes a minimumThroughput before acting, so that the circuit reacts only when the statistics are significant, and does not trip in 'slow' periods

Syntax

Policy
   .Handle<Whatever>(...)
   .AdvancedCircuitBreaker(
        double failureThreshold, 
        TimeSpan samplingDuration, 
        int minimumThroughput, 
        TimeSpan durationOfBreak)

Definition

The circuit will break if, within any timespan of duration samplingDuration, the proportion of actions resulting in a handled exception reaches or exceeds failureThreshold, provided also that the number of actions through the circuit in that timespan is at least minimumThroughput.
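The definition can be read as a simple predicate over the counts collected in the current sampling window. The sketch below is illustrative only: `BreakerMath.ShouldBreak` and its `failures` and `total` parameters are hypothetical names, not part of Polly's API, and the sketch assumes the threshold comparison is inclusive ("50% or more"), as the failureThreshold parameter description words it.

```csharp
using System;

public static class BreakerMath
{
    // Hypothetical helper, not Polly's API: decides whether a circuit should
    // break, given the counts observed within the current samplingDuration.
    public static bool ShouldBreak(
        int failures, int total, double failureThreshold, int minimumThroughput)
    {
        // Too few calls in the window: the statistics are not yet significant.
        if (total == 0 || total < minimumThroughput) return false;

        // Assumption: the comparison is inclusive ("50% or more").
        return (double)failures / total >= failureThreshold;
    }
}
```

For example, with a failureThreshold of 0.5 and a minimumThroughput of 20, 12 failures out of 19 calls would not break the circuit (throughput too low), while 10 failures out of 20 calls would.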

Parameters

failureThreshold

failureThreshold: the proportion of failures at which to break. A double between 0 and 1. For example, 0.5 means the circuit breaks when 50% or more of actions through the circuit result in a handled failure.

samplingDuration

samplingDuration: the period over which the failure rate is measured. Successes and failures older than this period are discarded from the metrics.

minimumThroughput

minimumThroughput: this many calls must have passed through the circuit within the active samplingDuration for the circuit to consider breaking.

Configuration recommendations

A starting configuration for governing RESTful calls to a downstream system might be:

Policy
   .Handle<TException>(...)
   .AdvancedCircuitBreaker(
        failureThreshold: 0.5,
        samplingDuration: TimeSpan.FromSeconds(5),
        minimumThroughput: 20, 
        durationOfBreak: TimeSpan.FromSeconds(30))

... if you want your circuit to respond to a faulty underlying subsystem within at most 10 seconds (see discussion below), assuming your calls meet this minimum throughput and you want to break for 30 seconds at a time.
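As a usage sketch, a policy configured this way might be held in a long-lived field, with calls executed through it. The `DownstreamClient` class, the choice of `HttpRequestException` as the handled exception, and the `callDownstream` and `fallback` parameters are illustrative assumptions; `Execute` and `BrokenCircuitException` are part of Polly's circuit-breaker API.

```csharp
using System;
using System.Net.Http;
using Polly;
using Polly.CircuitBreaker;

public class DownstreamClient
{
    // A single breaker instance is shared across calls: the instance holds
    // the circuit state and the failure statistics.
    private static readonly CircuitBreakerPolicy Breaker = Policy
        .Handle<HttpRequestException>() // assumption: substitute the exceptions you handle
        .AdvancedCircuitBreaker(
            failureThreshold: 0.5,
            samplingDuration: TimeSpan.FromSeconds(5),
            minimumThroughput: 20,
            durationOfBreak: TimeSpan.FromSeconds(30));

    public string Get(Func<string> callDownstream, string fallback)
    {
        try
        {
            // Handled exceptions thrown by the delegate are counted by the
            // breaker and then rethrown to the caller.
            return Breaker.Execute(callDownstream);
        }
        catch (BrokenCircuitException)
        {
            // The circuit is open: the downstream call was not attempted.
            return fallback;
        }
    }
}
```

Returning a fallback while the circuit is open is only one option; callers may equally let the `BrokenCircuitException` propagate.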

There is no substitute for tuning circuit configuration in light of the performance characteristics of your individual system. A good strategy for tuning can be to take circuit settings from a configuration file, have code monitor that file for changes, and replace the live circuit with a newly-configured one any time you detect changes in the underlying config file.
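One way to sketch this replace-on-change strategy is to keep the live breaker behind a holder whose reference is swapped atomically. `BreakerSettings` and `BreakerHolder` below are hypothetical types. Reading and watching the configuration file (for example with `FileSystemWatcher`) is left out, and is assumed to call `Reconfigure` with freshly parsed values.

```csharp
using System;
using System.Threading;
using Polly;
using Polly.CircuitBreaker;

// Hypothetical settings type, assumed to be populated from a configuration file.
public class BreakerSettings
{
    public double FailureThreshold { get; set; }
    public TimeSpan SamplingDuration { get; set; }
    public int MinimumThroughput { get; set; }
    public TimeSpan DurationOfBreak { get; set; }
}

// Hypothetical holder for the live circuit breaker.
public class BreakerHolder
{
    private CircuitBreakerPolicy _current;

    public BreakerHolder(BreakerSettings initial)
    {
        _current = Build(initial);
    }

    // Callers fetch the live breaker for each call, so they pick up replacements.
    public CircuitBreakerPolicy Current
    {
        get { return Volatile.Read(ref _current); }
    }

    // Called by whatever code detects a change in the configuration file.
    public void Reconfigure(BreakerSettings updated)
    {
        // The replacement breaker starts closed, with empty statistics.
        Interlocked.Exchange(ref _current, Build(updated));
    }

    private static CircuitBreakerPolicy Build(BreakerSettings s)
    {
        return Policy
            .Handle<Exception>() // assumption: substitute the exceptions you handle
            .AdvancedCircuitBreaker(
                s.FailureThreshold, s.SamplingDuration,
                s.MinimumThroughput, s.DurationOfBreak);
    }
}
```

Note that swapping in a new breaker discards the old circuit's state and statistics, so a circuit that was open at the moment of replacement is replaced by a closed one.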

The following points are also worth bearing in mind:

failureThreshold

  • Middling values such as 0.5 (50% failure or above) or 0.7 (70% failure) are good starting points from which to adjust.
  • A very low value (0.1) will naturally cause a circuit to break very easily, likely too easily.
  • The right tolerance depends on many factors, such as whether you have alternative systems for actions to fail over to if this circuit breaks.

samplingDuration

The samplingDuration is the duration over which statistics are measured; any statistic aging beyond this period is forgotten.

  • Keep in mind that this translates into how quickly your circuit will respond to failure. For a responsive circuit, configure samplingDuration in the order of seconds (rather than minutes or hours).

To understand this, consider how a 'long tail' of successes may affect statistics. Take a circuit with reasonable throughput, configured with a sampling duration of 5 minutes and a 50% failure threshold. Suppose the actions governed by the circuit have been working perfectly (100% success) for the past 5 minutes, and then 100% failure starts occurring. Such a circuit would take (other things being equal) around 2.5 minutes to reach a 50% failure rate, since it must first 'work off' the statistics from the 100% success era.

Generalising, a circuit with sampling duration T requiring a failure ratio r to trip would (other things being equal) need time T * r to react when moving from the best case (100% success) to the worst case (100% failure). While the real world may not be as regular as this, circuit responsiveness remains broadly proportional to samplingDuration in this manner. To ensure that a circuit will respond to its failureThreshold within the order of, say, 10 or 30 seconds, use a samplingDuration of a similar order of magnitude.

  • The minimum permitted for samplingDuration is 20 milliseconds, reflecting the minimum resolution for the circuit's timers.
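The T * r reaction time discussed above can be derived under the same idealised assumptions: constant throughput, 100% success up to time 0 and 100% failure afterwards. At a time t (with 0 ≤ t ≤ T) after failures begin, failed calls make up a fraction t / T of the window of length T, so:

```latex
\text{failure rate}(t) = \frac{t}{T},
\qquad
\frac{t}{T} \ge r \;\Longleftrightarrow\; t \ge T \cdot r
```

With T = 5 minutes and r = 0.5 this gives 2.5 minutes, matching the example; with T = 5 seconds and r = 0.5 it gives 2.5 seconds.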

Detailed operation of failure statistics

Internally, the circuit-breaker measures statistics with a rolling statistical window. The configured samplingDuration T is divided into ten slices, so that the statistics for the oldest 10% of the period T are discarded every T / 10. This smooths the calculation of failure rates to acceptable tolerances (compared to discarding 100% of the statistics every T).

The samplingDuration is not however further subdivided into slices if samplingDuration is set below 200 milliseconds. This prevents excessively frequent recalculation when there is little responsiveness gain.
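The sliced window described above can be sketched as follows. This is a simplified illustration, not Polly's internal implementation: `RollingFailureStats` and all its members are hypothetical, the sketch is not thread-safe, and the current time is passed in as ticks (rather than read from a clock) so that the behaviour is easy to follow. Passing `sliceCount: 1` models the unsliced case.

```csharp
using System;
using System.Collections.Generic;

// Illustrative sketch only; hypothetical type, not Polly's implementation.
public class RollingFailureStats
{
    private class Slice
    {
        public long StartTicks;
        public int Successes;
        public int Failures;
    }

    private readonly long _samplingTicks;
    private readonly long _sliceTicks;
    private readonly Queue<Slice> _slices = new Queue<Slice>();
    private Slice _current;

    public RollingFailureStats(TimeSpan samplingDuration, int sliceCount = 10)
    {
        _samplingTicks = samplingDuration.Ticks;
        _sliceTicks = samplingDuration.Ticks / sliceCount;
    }

    // Records one outcome in the slice that is current at 'nowTicks'.
    public void Record(bool success, long nowTicks)
    {
        Roll(nowTicks);
        if (success) _current.Successes++; else _current.Failures++;
    }

    // Sums the counts over all slices still inside the sampling window.
    public void GetCounts(long nowTicks, out int failures, out int total)
    {
        Roll(nowTicks);
        failures = 0;
        total = 0;
        foreach (var slice in _slices)
        {
            failures += slice.Failures;
            total += slice.Failures + slice.Successes;
        }
    }

    private void Roll(long nowTicks)
    {
        // Start a new slice once the current one has covered T / sliceCount.
        if (_current == null || nowTicks - _current.StartTicks >= _sliceTicks)
        {
            _current = new Slice { StartTicks = nowTicks };
            _slices.Enqueue(_current);
        }

        // Discard whole slices that began a full samplingDuration (or more) ago.
        while (_slices.Count > 0 && nowTicks - _slices.Peek().StartTicks >= _samplingTicks)
        {
            _slices.Dequeue();
        }
    }
}
```

Because statistics are dropped one slice at a time, the measured failure rate changes in small steps rather than resetting completely at the end of each samplingDuration.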

minimumThroughput

Choose any minimum throughput you consider appropriate: set a value to keep statistics significant, and to eliminate hard-breaking in 'slower' periods if desired.

  • Keep in mind that the value should be a true minimum, well below the circuit's typical throughput. If the value is in practice too close to the circuit's typical throughput, the circuit may spend a significant proportion of time waiting to meet the minimum throughput threshold, and therefore not break as often as you might want or expect.

  • At the lower end of the scale, bear in mind how minimumThroughput translates into the minimum resolution of the breaker's failure statistics. A low minimumThroughput results in a coarse initial resolution of the statistics. For example, a minimumThroughput value of 2 means your circuit's minimum/initial resolution is the set of values 0%, 50%, 100%. This may be too coarse, depending on your configured failureThreshold.
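The resolution point can be made concrete. At the moment the circuit first becomes eligible to break, exactly n = minimumThroughput calls have been counted, so the measured failure rate can take only one of n + 1 values:

```latex
\text{failure rate} \in \left\{ \frac{k}{n} : k = 0, 1, \dots, n \right\},
\qquad
\text{step} = \frac{1}{n}
```

For n = 2 this is the set 0%, 50%, 100% given above; for n = 20 the step is 5%.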
