Handle shard over allocation during partial zone/rack or independent node failures #1149

Merged (10 commits) on Sep 20, 2021

@Bukhtawar Bukhtawar commented Aug 24, 2021

Signed-off-by: Bukhtawar Khan <bukhtawa@amazon.com>

Description

These changes ensure that, in the event of a partial zone failure, the surviving nodes in the minority zone don't get overloaded with shards. The amount of over-allocation allowed on a node is governed by a skewness limit.
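
To make the idea of a skewness limit concrete, here is a minimal sketch of how such a cap could be computed. It is not the code from this PR: the class name, method names, and the formula (expected average shards per node, scaled by a percentage skew) are assumptions made for illustration only.

```java
// Illustrative sketch only: the names and the formula are hypothetical,
// not taken from the decider implemented in this PR.
final class SkewLimitSketch {
    private SkewLimitSketch() {}

    /**
     * Hypothetical per-node shard cap: the expected average number of shards
     * per provisioned node, inflated by an allowed skew percentage.
     */
    static int perNodeLimit(int totalShards, int provisionedCapacity, double skewPercent) {
        if (provisionedCapacity <= 0) {
            throw new IllegalArgumentException("provisionedCapacity must be positive");
        }
        double expectedAvg = Math.ceil((double) totalShards / provisionedCapacity);
        return (int) Math.ceil(expectedAvg * (1 + skewPercent / 100.0));
    }

    /** A node may take another shard only while it is below the cap. */
    static boolean canAllocate(int shardsOnNode, int limit) {
        return shardsOnNode < limit;
    }

    public static void main(String[] args) {
        // 60 shards, 6 provisioned nodes, 50% skew allowed (all values assumed).
        int limit = perNodeLimit(60, 6, 50.0);
        System.out.println("per-node limit: " + limit);
        System.out.println("node with 12 shards can take more: " + canAllocate(12, limit));
        System.out.println("node with 15 shards can take more: " + canAllocate(15, limit));
    }
}
```

With these assumed numbers, a surviving node stops accepting shards once it reaches the cap, so the remaining shards stay unassigned rather than piling onto it.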

Issues Resolved

#938

Check List

  • Add support for absolute skewness limits
  • Add more unit tests
  • Add integ tests
  • New functionality includes testing.
    • All tests pass
  • New functionality has been documented.
    • New functionality has javadoc added
  • Commits are signed per the DCO using --signoff

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.
For more information on following Developer Certificate of Origin and signing off your commits, please check here.

Signed-off-by: Bukhtawar Khan <bukhtawa@amazon.com>
@opensearch-ci-bot
✅   Gradle Wrapper Validation success fdd86f6

@opensearch-ci-bot
✅   DCO Check Passed fdd86f6

Signed-off-by: Bukhtawar Khan <bukhtawa@amazon.com>
@opensearch-ci-bot
✅   Gradle Wrapper Validation success c8a2066

@opensearch-ci-bot
✅   DCO Check Passed c8a2066

@opensearch-ci-bot
❌   Gradle Precommit failure fdd86f6
Log 975

@opensearch-ci-bot
✅   Gradle Precommit success c8a2066

@opensearch-ci-bot
✅   DCO Check Passed 4561a95547df43c28a8a04b3477e8e421b5730b3

@opensearch-ci-bot
✅   Gradle Wrapper Validation success 4561a95547df43c28a8a04b3477e8e421b5730b3

@opensearch-ci-bot
✅   Gradle Precommit success 4561a95547df43c28a8a04b3477e8e421b5730b3

@opensearch-ci-bot
✅   Gradle Wrapper Validation success 1a2a6a4ed37c5be73f75fbfc9dca050b3be145d2

@opensearch-ci-bot
✅   DCO Check Passed 1a2a6a4ed37c5be73f75fbfc9dca050b3be145d2

@opensearch-ci-bot
❌   Gradle Precommit failure 1a2a6a4ed37c5be73f75fbfc9dca050b3be145d2
Log 978

@opensearch-ci-bot
✅   DCO Check Passed cd20052d7d1b7dcf4089a96159602ab65c38d067

@opensearch-ci-bot
✅   Gradle Wrapper Validation success cd20052d7d1b7dcf4089a96159602ab65c38d067

@opensearch-ci-bot
✅   Gradle Precommit success cd20052d7d1b7dcf4089a96159602ab65c38d067

@opensearch-ci-bot
✅   DCO Check Passed 4eaf4b5c9773a73472a0f64a689b526f4688d7a2

@opensearch-ci-bot
✅   Gradle Wrapper Validation success 4eaf4b5c9773a73472a0f64a689b526f4688d7a2

@opensearch-ci-bot
✅   DCO Check Passed 9d29e79aa55b0832dfeef505e7b38208a13dabad

@opensearch-ci-bot
✅   Gradle Wrapper Validation success 9d29e79aa55b0832dfeef505e7b38208a13dabad

@opensearch-ci-bot
✅   Gradle Precommit success 4eaf4b5c9773a73472a0f64a689b526f4688d7a2

@opensearch-ci-bot
✅   Gradle Precommit success 9d29e79aa55b0832dfeef505e7b38208a13dabad

@opensearch-ci-bot
✅   Gradle Wrapper Validation success 00c7f10e1962c668f1aa007240fbba0da53d0f8e

@Bukhtawar Bukhtawar marked this pull request as draft September 2, 2021 16:52
@opensearch-ci-bot
✅   DCO Check Passed 00c7f10e1962c668f1aa007240fbba0da53d0f8e

@Bukhtawar Bukhtawar changed the title from "Initial changes to handle skewness" to "Handle shard skewness during partial zone/rack failures" Sep 2, 2021
@opensearch-ci-bot
✅   Gradle Precommit success 00c7f10e1962c668f1aa007240fbba0da53d0f8e

@opensearch-ci-bot
✅   DCO Check Passed 6ce9f9b477afe599889c2e7ac9d38df9557d2c30

@opensearch-ci-bot
✅   Gradle Wrapper Validation success 6ce9f9b477afe599889c2e7ac9d38df9557d2c30

@opensearch-ci-bot
✅   Gradle Precommit success 7bfb392

@opensearch-ci-bot
✅   Gradle Check success 7bfb392
Log 535
Reports 535

@muralikpbhat muralikpbhat left a comment

Thanks for the changes. One overall comment is to refactor the tests. Those are really lengthy and painful to review and will be hard to maintain.

* due to node failures or otherwise on the surviving nodes. The allocation limits
* are decided by the user provisioned capacity, to determine if there were lost nodes
* <pre>
* cluster.routing.allocation.overload_awareness.provisioned_capacity: N

can you please document the expectation from admin that this setting is supposed to be updated whenever the cluster is scaled up or down?
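
To illustrate why a stale provisioned capacity matters, here is a small sketch. It reuses a hypothetical limit formula (expected average shards per provisioned node, inflated by a percentage skew); the formula, class name, and numbers are assumptions for illustration, not the decider's actual logic.

```java
// Illustrative sketch only: hypothetical formula and names, used to show the
// effect of leaving the provisioned capacity setting stale after scaling.
final class StaleCapacitySketch {
    private StaleCapacitySketch() {}

    /** Hypothetical per-node cap derived from the configured capacity. */
    static int limit(int totalShards, int provisionedCapacity, double skewPercent) {
        if (provisionedCapacity <= 0) {
            throw new IllegalArgumentException("provisionedCapacity must be positive");
        }
        double expectedAvg = Math.ceil((double) totalShards / provisionedCapacity);
        return (int) Math.ceil(expectedAvg * (1 + skewPercent / 100.0));
    }

    /** Shards left without a home if every live node is filled up to the cap. */
    static long unplaceable(int totalShards, int liveNodes, int perNodeLimit) {
        return Math.max(0L, totalShards - (long) liveNodes * perNodeLimit);
    }

    public static void main(String[] args) {
        int totalShards = 60;
        double skew = 50.0;

        // Cluster deliberately scaled down from 6 to 3 nodes, setting left at 6.
        int staleLimit = limit(totalShards, 6, skew);
        System.out.println("stale setting:   limit=" + staleLimit
            + ", unplaceable=" + unplaceable(totalShards, 3, staleLimit));

        // Same cluster after the admin updates the setting to 3.
        int freshLimit = limit(totalShards, 3, skew);
        System.out.println("updated setting: limit=" + freshLimit
            + ", unplaceable=" + unplaceable(totalShards, 3, freshLimit));
    }
}
```

Under these assumptions, a planned scale-down with the old value looks like a node failure to the limit calculation, which is why the setting would need to track the intended cluster size.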

Signed-off-by: Bukhtawar Khan <bukhtawa@amazon.com>
@opensearch-ci-bot
✅   Gradle Wrapper Validation success 24f29ee

@opensearch-ci-bot
✅   DCO Check Passed 24f29ee

@opensearch-ci-bot
✅   Gradle Precommit success 24f29ee

@muralikpbhat muralikpbhat left a comment

nice refactor, thanks.

@@ -25,7 +25,9 @@
 /**
  * This {@link NodeLoadAwareAllocationDecider} controls shard over-allocation
  * due to node failures or otherwise on the surviving nodes. The allocation limits
- * are decided by the user provisioned capacity, to determine if there were lost nodes
+ * are decided by the user provisioned capacity, to determine if there were lost nodes.
+ * The provisioned capacity as defined by the below settings needs to updated one every

typo

@@ -323,20 +278,14 @@ public void testExistingPrimariesAllocationOnOverload() {
 assertThat(newState.getRoutingNodes().node("node4").size(), equalTo(12));

 logger.info("--> Remove node4 from zone holding primaries");
-newState = removeNode(newState, "node4", strategy);
+newState = removeNodes(newState, strategy,"node4");

nit: space after ,

Signed-off-by: Bukhtawar Khan <bukhtawa@amazon.com>
@opensearch-ci-bot
✅   DCO Check Passed 38cfe08

@opensearch-ci-bot
✅   Gradle Wrapper Validation success 38cfe08

@opensearch-ci-bot
✅   Gradle Precommit success 38cfe08

@shwetathareja (Member)
start gradle check

@adnapibar (Contributor)
start gradle check

@opensearch-ci-bot
✅   Gradle Check success 38cfe08
Log 547
Reports 547

@adnapibar adnapibar merged commit 390e678 into opensearch-project:main Sep 20, 2021
@adnapibar adnapibar added the v1.2.0 and pending backport labels Sep 20, 2021
@Bukhtawar Bukhtawar deleted the zone-aware branch September 22, 2021 06:26
Bukhtawar added a commit to Bukhtawar/OpenSearch that referenced this pull request Sep 22, 2021
…node failures (opensearch-project#1149)

The changes ensure that in the event of a partial zone failure, the surviving nodes in the minority zone don't get overloaded with shards, this is governed by a skewness limit.

Signed-off-by: Bukhtawar Khan <bukhtawa@amazon.com>
@dblock dblock removed the pending backport label Dec 6, 2021
dblock pushed a commit that referenced this pull request Feb 7, 2022
…ndependent … (#1268)

* Handle shard over allocation during partial zone/rack or independent node failures  (#1149)

The changes ensure that in the event of a partial zone failure, the surviving nodes in the minority zone don't get overloaded with shards, this is governed by a skewness limit.

Signed-off-by: Bukhtawar Khan <bukhtawa@amazon.com>

* Fix up imports

Signed-off-by: Bukhtawar Khan <bukhtawa@amazon.com>

* Fix up imports

Signed-off-by: Bukhtawar Khan <bukhtawa@amazon.com>

* Fix up imports

Signed-off-by: Bukhtawar Khan <bukhtawa@amazon.com>

* Fix up check style

Signed-off-by: Bukhtawar Khan <bukhtawa@amazon.com>
Labels
v1.2.0 Issues related to version 1.2.0
8 participants