
Update default values in CoordinatorDynamicConfig #14269

Merged 4 commits into apache:master on May 30, 2023

Conversation

@kfaraz kfaraz commented May 12, 2023

Changes

This PR updates the defaults of the following config values in the CoordinatorDynamicConfig.

1. maxSegmentsInNodeLoadingQueue = 500 (previous = 100)

Rationale: With round-robin segment assignment now being the default assignment strategy, the Coordinator can assign a large number of under-replicated/unavailable segments very quickly. Before round-robin assignment, a large queue size would cause the Coordinator to get stuck in the RunRules duty due to very slow strategy-based cost computations.

2. replicationThrottleLimit = 500 (previous = 10)

Rationale: In addition to the reasoning given for maxSegmentsInNodeLoadingQueue, a very low replicationThrottleLimit can make clusters very slow to reach full replication, even when there are loading threads sitting idle.

Note: It is okay to keep this value equal to maxSegmentsInNodeLoadingQueue. Even with equal values, load queues will not fill up with replicas alone, and segments that are completely unavailable will still get a fair chance to load. This is because maxSegmentsInNodeLoadingQueue applies to a single server, whereas replicationThrottleLimit applies to an entire tier.
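The interaction above can be sketched with some simple arithmetic. This is an illustrative Python sketch, not Druid code; the constant names mirror the configs, but the helper functions are hypothetical:

```python
# maxSegmentsInNodeLoadingQueue caps the queue of each individual server,
# while replicationThrottleLimit caps queued *replicas* across a whole tier.

MAX_SEGMENTS_IN_NODE_LOADING_QUEUE = 500  # applies per server
REPLICATION_THROTTLE_LIMIT = 500          # applies per tier

def tier_queue_capacity(num_servers: int) -> int:
    """Total number of segments that can be queued across a tier."""
    return num_servers * MAX_SEGMENTS_IN_NODE_LOADING_QUEUE

def min_capacity_for_unavailable(num_servers: int) -> int:
    """Queue capacity that replicas can never consume, since at most
    REPLICATION_THROTTLE_LIMIT replicas may be queued in the tier."""
    return tier_queue_capacity(num_servers) - REPLICATION_THROTTLE_LIMIT

# A 10-server tier can queue 10 * 500 = 5000 segments in total, but
# replicas can occupy at most 500 of those slots, leaving 4500 slots
# free for completely unavailable segments.
print(min_capacity_for_unavailable(10))  # 4500
```

So even with the two limits set equal, replicas can saturate a tier's queues only in the degenerate single-server case.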

3. maxSegmentsToMove = 100 (previous = 5)

Rationale: A very low value of this config (say 5) turns out to be very ineffective at balancing, especially if there are a large number of segments in the cluster or a large skew between the usage levels of two historical servers.
On the other hand, a very large value can cause excessive moves every minute, which has the following disadvantages:

  • Load of moving segments competing with load of unavailable/under-replicated segments
  • Unnecessary network costs due to constant download and delete of segments

These defaults will be revisited after #13197 is merged.
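For reference, operators who want these values before upgrading can set them at runtime through the Coordinator's dynamic configuration endpoint (`/druid/coordinator/v1/config`). A sketch of the payload, matching the new defaults:

```json
{
  "maxSegmentsInNodeLoadingQueue": 500,
  "replicationThrottleLimit": 500,
  "maxSegmentsToMove": 100
}
```

This could be posted with, for example, `curl -X POST -H 'Content-Type: application/json' -d @config.json http://<coordinator-host>:8081/druid/coordinator/v1/config` (host and port are placeholders for your deployment).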

Testing

These values have been tried on production clusters of different sizes and have been found to give satisfactory results.

Release note

Update default values of the following coordinator dynamic configs:

  • maxSegmentsInNodeLoadingQueue = 500
  • maxSegmentsToMove = 100
  • replicationThrottleLimit = 500

This PR has:

  • been self-reviewed.
  • added documentation for new or modified features or behaviors.
  • a release note entry in the PR description.
  • added Javadocs for most classes and all non-trivial methods. Linked related entities via Javadoc links.
  • added or updated version, license, or notice information in licenses.yaml
  • added comments explaining the "why" and the intent of the code wherever it would not be obvious to an unfamiliar reader.
  • added unit tests or modified existing tests to cover new code paths, ensuring the threshold for code coverage is met.
  • added integration tests.
  • been tested in a test Druid cluster.

@AmatyaAvadhanula

Changes may be needed in coordinator-dynamic-config.tsx as well.

|Property|Description|Default Value|
|--------|-----------|-------------|
|`balancerComputeThreads`|Thread pool size for computing moving cost of segments in segment balancing. Consider increasing this if you have a lot of segments and moving segments starts to get stuck.|1|
|`emitBalancingStats`|Boolean flag for whether or not we should emit balancing stats. This is an expensive operation.|false|
|`killDataSourceWhitelist`|List of specific data sources for which kill tasks are sent if property `druid.coordinator.kill.on` is true. This can be a list of comma-separated data source names or a JSON array.|none|
|`killPendingSegmentsSkipList`|List of data sources for which pendingSegments are _NOT_ cleaned up if property `druid.coordinator.kill.pendingSegments.on` is true. This can be a list of comma-separated data sources or a JSON array.|none|
|`maxSegmentsInNodeLoadingQueue`|The maximum number of segments that could be queued for loading to any given server. This parameter could be used to speed up segments loading process, especially if there are "slow" nodes in the cluster (with low loading speed) or if too much segments scheduled to be replicated to some particular node (faster loading could be preferred to better segments distribution). Desired value depends on segments loading speed, acceptable replication time and number of nodes. Value 1000 could be a start point for a rather big cluster. Default value is 100. |100|
|`maxSegmentsInNodeLoadingQueue`|The maximum number of segments that could be queued for loading to any given server. This parameter could be used to speed up segments loading process, especially if there are "slow" nodes in the cluster (with low loading speed) or if too much segments scheduled to be replicated to some particular node (faster loading could be preferred to better segments distribution). Desired value depends on segments loading speed, acceptable replication time and number of nodes. Value 1000 could be a start point for a rather big cluster. |500|
Suggested change
|`maxSegmentsInNodeLoadingQueue`|The maximum number of segments that could be queued for loading to any given server. This parameter could be used to speed up segments loading process, especially if there are "slow" nodes in the cluster (with low loading speed) or if too much segments scheduled to be replicated to some particular node (faster loading could be preferred to better segments distribution). Desired value depends on segments loading speed, acceptable replication time and number of nodes. Value 1000 could be a start point for a rather big cluster. |500|
|`maxSegmentsInNodeLoadingQueue`|The maximum number of segments allowed in a loading queue for any given server. Use this parameter to load the segments faster—for example, if the cluster contains slow-loading nodes, or if there are too many segments to be replicated to a particular node (when faster loading is preferred to better segments distribution). Desired value depends on the loading speed of segments, acceptable replication time, and number of nodes. Value 1000 is a good starting point for a big cluster. |500|


@ektravel ektravel left a comment


Added some suggestions to improve readability.

kfaraz commented May 19, 2023

Thanks for the review, @ektravel ! I have incorporated your feedback.


@ektravel ektravel left a comment


Looks good from the docs standpoint.

@kfaraz kfaraz merged commit 8091c6a into apache:master May 30, 2023
@kfaraz kfaraz deleted the update_default_dynamic_config_values branch May 30, 2023 03:21
@abhishekagarwal87 abhishekagarwal87 added this to the 27.0 milestone Jul 19, 2023