Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add node "roles" to allocation explain response #98550

5 changes: 5 additions & 0 deletions docs/changelog/98550.yaml
Original file line number Diff line number Diff line change
@@ -0,0 +1,5 @@
pr: 98550
summary: Report a node's "roles" setting in the /_cluster/allocation/explain response
area: Allocation
type: enhancement
issues: []
Original file line number Diff line number Diff line change
Expand Up @@ -20,12 +20,12 @@ In order to diagnose the unassigned shards, follow the next steps:
. Log in to the {ess-console}[{ecloud} console].
+

. On the **Elasticsearch Service** panel, click the name of your deployment.
. On the **Elasticsearch Service** panel, click the name of your deployment.
+

NOTE: If the name of your deployment is disabled your {kib} instances might be
unhealthy, in which case please contact https://support.elastic.co[Elastic Support].
If your deployment doesn't include {kib}, all you need to do is
If your deployment doesn't include {kib}, all you need to do is
{cloud}/ec-access-kibana.html[enable it first].

. Open your deployment's side navigation menu (placed under the Elastic logo in the upper left corner)
Expand Down Expand Up @@ -106,6 +106,7 @@ The response will look like this:
"node_id" : "8qt2rY-pT6KNZB3-hGfLnw",
"node_name" : "node-0",
"transport_address" : "127.0.0.1:9401",
"roles": [data, data_cold, data_content, data_frozen, data_hot, data_warm, ingest, master, ml, remote_cluster_client, transform]
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: (not picked up by the test framework because this example is skipped)

Suggested change
"roles": [data, data_cold, data_content, data_frozen, data_hot, data_warm, ingest, master, ml, remote_cluster_client, transform]
"roles": [data, data_cold, data_content, data_frozen, data_hot, data_warm, ingest, master, ml, remote_cluster_client, transform],

You could also reasonably list fewer roles here, maybe just data_content, data_hot, I don't think it helps the docs to have them all.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done

"node_attributes" : {},
"node_decision" : "no", <4>
"weight_ranking" : 1,
Expand Down Expand Up @@ -151,7 +152,7 @@ settings>> and <<cluster-update-settings,cluster update settings>> APIs to the
correct values in order to allow the index to be allocated.

For more guidance on fixing the most common causes for unassinged shards please follow
<<fix-red-yellow-cluster-status, this guide>> or contact https://support.elastic.co[Elastic Support].
<<fix-red-yellow-cluster-status, this guide>> or contact https://support.elastic.co[Elastic Support].

//end::kibana-api-ex[]
// end::cloud[]
Expand Down Expand Up @@ -231,6 +232,7 @@ The response will look like this:
"node_id" : "8qt2rY-pT6KNZB3-hGfLnw",
"node_name" : "node-0",
"transport_address" : "127.0.0.1:9401",
"roles": [data, data_cold, data_content, data_frozen, data_hot, data_warm, ingest, master, ml, remote_cluster_client, transform]
"node_attributes" : {},
"node_decision" : "no", <4>
"weight_ranking" : 1,
Expand Down Expand Up @@ -276,7 +278,7 @@ settings>> and <<cluster-update-settings,cluster update settings>> APIs to the
correct values in order to allow the index to be allocated.

For more guidance on fixing the most common causes for unassinged shards please follow
<<fix-red-yellow-cluster-status, this guide>>.
<<fix-red-yellow-cluster-status, this guide>>.

// end::self-managed[]

Original file line number Diff line number Diff line change
Expand Up @@ -92,3 +92,21 @@
- is_true: can_rebalance_cluster
- is_true: can_rebalance_to_other_node
- is_true: rebalance_explanation

---
"Cluster allocation explanation response includes node's roles":
- skip:
version: " - 8.10.99"
reason: The roles field was introduced in 8.11.0

- do:
indices.create:
index: test

- match: { acknowledged: true }

- do:
cluster.allocation_explain:
body: { "index": "test", "shard": 0, "primary": true }

- is_true: current_node.roles
Copy link
Contributor

@idegtiarenko idegtiarenko Aug 21, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Roles are added to other places as well:

  • allocate unassigned section (if present)
  • move shard section (if present)
  • each node of the per node explanation

I believe the first 2 are not present since this explain is executed against allocated shard that could stay on the current node. I do not think they require additional test case and could be covered by serialization test, see https://github.com/elastic/elasticsearch/pull/98550/files#r1299861004

I wonder about the last one. I think it could be covered with additional:
- is_true: node_allocation_decisions.0.roles
unless I miss something.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I tried adding - is_true: node_allocation_decisions.0.roles and node_allocation_decisions is null in this test case. I saw current_node testing in this file, so that was easy to add. I haven't looked into how node_allocation_decisions might be evoked: left that to the the Java unit/integration tests.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The YAML tests (sometimes) run in a one-node cluster so we can't assert anything about the other changed locations here.

I think we could check the node_allocation_decisions bit in ClusterAllocationExplainIT#testExplainRolesOutput if you'd like, by starting a second node with different roles (after creating the index, so we know which node has the shard). I didn't see an easy way to get nodes in the AllocateUnassignedDecision or MoveDecision here, it might be easier to do that in a unit test, but tbh it's the node_allocation_decision that really matters.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Extended ClusterAllocationExplainIT#testExplainRolesOutput to test node_allocation_decisions 👍

Original file line number Diff line number Diff line change
Expand Up @@ -15,6 +15,7 @@
import org.elasticsearch.cluster.health.ClusterHealthStatus;
import org.elasticsearch.cluster.metadata.IndexMetadata;
import org.elasticsearch.cluster.node.DiscoveryNode;
import org.elasticsearch.cluster.node.DiscoveryNodeRole;
import org.elasticsearch.cluster.routing.ShardRoutingState;
import org.elasticsearch.cluster.routing.UnassignedInfo;
import org.elasticsearch.cluster.routing.UnassignedInfo.AllocationStatus;
Expand Down Expand Up @@ -1150,6 +1151,42 @@ public void testCannotAllocateStaleReplicaExplanation() throws Exception {
}
}

public void testExplainRolesOutput() throws Exception {
logger.info("--> starting 1 node with 'roles' setting specified");
Settings roleSettings = Settings.builder().putList("node.roles", List.of("master", "data_hot", "ingest")).build();
internalCluster().startNode(Settings.builder().put(roleSettings).build());
prepareIndex(1, 0);

boolean includeYesDecisions = randomBoolean();
boolean includeDiskInfo = randomBoolean();
ClusterAllocationExplanation explanation = runExplain(true, includeYesDecisions, includeDiskInfo);

assertEquals(
Set.of(DiscoveryNodeRole.DATA_HOT_NODE_ROLE, DiscoveryNodeRole.INGEST_ROLE, DiscoveryNodeRole.MASTER_ROLE),
explanation.getCurrentNode().getRoles()
);

try (XContentParser parser = getParser(explanation)) {
// Fast-forward to the "current_node" object, which contains "roles".
do {
parser.nextToken();
assertNotEquals(Token.END_OBJECT, parser.currentToken());
// The START_OBJECT has a null currentName(), so check for that before de-referencing.
} while (parser.currentName() == null || (parser.currentName().equals("current_node")) == false);
assertEquals(Token.START_OBJECT, parser.nextToken());

// Fast-forward to "roles" field in the "current_node" object.
do {
parser.nextToken();
assertNotEquals(Token.END_OBJECT, parser.currentToken());
} while ((parser.currentName().equals("roles")) == false);

// Check that the "roles" reported are those explicitly set via the node Settings.
// Note: list() implicitly consumes the parser START_ARRAY and END_ARRAY tokens.
assertEquals(List.of("data_hot", "ingest", "master"), parser.list());
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Wow, I am surprised we have to do this in this test.
I wonder if it is worth opening a tech debt issue to have AbstractXContentSerializingTestCase for ClusterAllocationExplanation instead of doing parsing in IT?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not currently familiar with how AbstractXContentSerializingTestCase works, so can't really weigh in on this without some context.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah I think there's scope for better testing in this area indeed. Not sure AbstractXContentSerializingTestCase is quite right, that is for checking that things faithfully round-trip through XContent which is a much stronger property than we really need here. There's probably a nicer way to verify specific features of the XContent output tho, but I'm ok with following the existing style in this area right now.

}

private void verifyClusterInfo(ClusterInfo clusterInfo, boolean includeDiskInfo, int numNodes) {
if (includeDiskInfo) {
assertThat(clusterInfo.getNodeMostAvailableDiskUsages().size(), greaterThanOrEqualTo(0));
Expand Down Expand Up @@ -1309,8 +1346,11 @@ private void verifyShardInfo(XContentParser parser, boolean primary, boolean inc
parser.currentName().equals("id")
|| parser.currentName().equals("name")
|| parser.currentName().equals("transport_address")
|| parser.currentName().equals("roles")
|| parser.currentName().equals("weight_ranking")
);
} else if (token == Token.START_ARRAY || token == Token.END_ARRAY) {
assertEquals("roles", parser.currentName());
} else {
assertTrue(token.isValue());
assertNotNull(parser.text());
Expand Down Expand Up @@ -1436,6 +1476,10 @@ private String verifyNodeDecisionPrologue(XContentParser parser) throws IOExcept
parser.nextToken();
assertNotNull(parser.text());
parser.nextToken();
assertEquals("roles", parser.currentName());
parser.nextToken();
assertNotEquals(0, parser.list().size());
parser.nextToken();
assertEquals("node_decision", parser.currentName());
parser.nextToken();
return nodeName;
Expand Down
Original file line number Diff line number Diff line change
Expand Up @@ -9,6 +9,7 @@
package org.elasticsearch.cluster.routing.allocation;

import org.elasticsearch.cluster.node.DiscoveryNode;
import org.elasticsearch.cluster.node.DiscoveryNodeRole;
import org.elasticsearch.cluster.routing.allocation.decider.Decision.Type;
import org.elasticsearch.common.io.stream.StreamInput;
import org.elasticsearch.common.io.stream.StreamOutput;
Expand Down Expand Up @@ -110,6 +111,11 @@ public static XContentBuilder discoveryNodeToXContent(DiscoveryNode node, boolea
}
builder.endObject();
}
builder.startArray("roles");
for (DiscoveryNodeRole role : node.getRoles()) {
builder.value(role.roleName());
}
builder.endArray();
return builder;
}

Expand Down
Original file line number Diff line number Diff line change
Expand Up @@ -35,6 +35,7 @@
import java.time.Instant;
import java.util.Collections;
import java.util.Locale;
import java.util.stream.Collectors;

import static org.elasticsearch.action.admin.cluster.allocation.TransportClusterAllocationExplainAction.findShardToExplain;
import static org.hamcrest.Matchers.allOf;
Expand Down Expand Up @@ -109,6 +110,7 @@ public ShardAllocationDecision decideShardAllocation(ShardRouting shard, Routing
cae.getCurrentNode().getId(),
cae.getCurrentNode().getName(),
cae.getCurrentNode().getAddress(),
cae.getCurrentNode().getRoles().stream().map(r -> '"' + r.roleName() + '"').collect(Collectors.joining(", ", "[", "]")),
explanation };
assertEquals(XContentHelper.stripWhitespace(Strings.format("""
{
Expand All @@ -120,7 +122,8 @@ public ShardAllocationDecision decideShardAllocation(ShardRouting shard, Routing
"current_node": {
"id": "%s",
"name": "%s",
"transport_address": "%s"
"transport_address": "%s",
"roles": %s
},
"explanation": "%s"
}""", args)), Strings.toString(builder));
Expand Down
Original file line number Diff line number Diff line change
Expand Up @@ -93,6 +93,7 @@ public void testExplanationToXContent() throws Exception {
"id": "node-0",
"name": "",
"transport_address": "%s",
"roles": [],
"weight_ranking": 3
},
"can_remain_on_current_node": "yes",
Expand Down Expand Up @@ -123,6 +124,7 @@ public void testRandomShardExplanationToXContent() throws Exception {
"id": "node-0",
"name": "",
"transport_address": "%s",
"roles": [],
"weight_ranking": 3
},
"can_remain_on_current_node": "yes",
Expand Down