-
Notifications
You must be signed in to change notification settings - Fork 24.7k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add node "roles" to allocation explain response #98550
Add node "roles" to allocation explain response #98550
Conversation
30b1085
to
7f79eb6
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The overall structure is great; I left a few superficial comments.
version: " - 8.9.99" | ||
reason: The roles field was introduced in 8.10.0 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This will (eventually) land in 8.11 so this bit will need adjusting.
Normally I'd recommend avoiding rebasing PR branches once they're under review, but at the moment the 8.10 freeze has uncovered some issues in CI. You should be able to get a green CI run if you rebase this branch back onto the commit before the version bump (bb2a30f), leaving these lines as they are. Once CI is passing on main
again you'll need to merge it in and fix these lines up.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hmm. I believe a rebase backwards would require a force push to the PR -- as opposed to a merge of the latest that's benign to push. Would not be great to wipe out your comments 😅 Perhaps I should just update this code to 8.11 and wait? Or I might be missing something about the process.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah a force-push would be required, which might confuse Github into losing review context. It doesn't always, but it's a risk. I'm ok with that, it's extremely unusual for main
to be this unstable. Or you can indeed just wait for things to settle down before you get a passing CI run.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
main
has stabilised now so I would expect you can just merge it in and this'll pass CI.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done
...rTest/java/org/elasticsearch/action/admin/cluster/allocation/ClusterAllocationExplainIT.java
Outdated
Show resolved
Hide resolved
logger.info("--> restarting the stopped node with 'roles' setting added"); | ||
Settings roleSettings = Settings.builder().putList("node.roles", Arrays.asList("master", "data_hot", "ingest")).build(); | ||
internalCluster().startNode(Settings.builder().put(settings).put(roleSettings).build()); | ||
ensureStableCluster(1); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
startNode()
calls validateClusterFormed()
as below, so there's no real need to ensureStableCluster()
too (although it doesn't hurt)
at org.elasticsearch.test.InternalTestCluster.validateClusterFormed(InternalTestCluster.java:1220) ~[framework-8.11.0-SNAPSHOT.jar:8.11.0-SNAPSHOT]
at org.elasticsearch.test.InternalTestCluster.startAndPublishNodesAndClients(InternalTestCluster.java:1750) ~[framework-8.11.0-SNAPSHOT.jar:8.11.0-SNAPSHOT]
at org.elasticsearch.test.InternalTestCluster.startNodes(InternalTestCluster.java:2181) ~[framework-8.11.0-SNAPSHOT.jar:8.11.0-SNAPSHOT]
at org.elasticsearch.test.InternalTestCluster.startNode(InternalTestCluster.java:2113) ~[framework-8.11.0-SNAPSHOT.jar:8.11.0-SNAPSHOT]
at org.elasticsearch.test.InternalTestCluster.startNode(InternalTestCluster.java:2106) ~[framework-8.11.0-SNAPSHOT.jar:8.11.0-SNAPSHOT]
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Oh, nice. So it's only worth calling when restarting a subset of the nodes in a cluster, or stopping nodes, maybe (depending those functions' implementations).
Is it typical for patches to remain focused exclusively on one topic, or are touchups elsewhere OK? For example, I could remove the check here, too, since it's redundant and I've noticed it. Though perhaps validateClusterFormed() and ensureStableCluster() vary enough that the real fix would be a refactor -- which seems too much of a distraction/off topic to do in this patch.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes those other calls to ensureStableCluster()
also look redundant to me. I think they were necessary when they were added in #22182, but #49248 changed that.
And yes I generally prefer to keep PRs focussed on a single change where possible. Easier on the reviewers, and on future-us if ever we need to look back at the history of changes in this area. A separate PR which removes all those unnecessary ensureStableCluster()
calls would be worthwhile.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for the context.
Separate PR to remove: #98694
...rTest/java/org/elasticsearch/action/admin/cluster/allocation/ClusterAllocationExplainIT.java
Outdated
Show resolved
Hide resolved
...rTest/java/org/elasticsearch/action/admin/cluster/allocation/ClusterAllocationExplainIT.java
Outdated
Show resolved
Hide resolved
...rTest/java/org/elasticsearch/action/admin/cluster/allocation/ClusterAllocationExplainIT.java
Outdated
Show resolved
Hide resolved
...r/src/main/java/org/elasticsearch/cluster/routing/allocation/AbstractAllocationDecision.java
Outdated
Show resolved
Hide resolved
"data", // begin roles array | ||
"data_cold", | ||
"data_content", | ||
"data_frozen", | ||
"data_hot", | ||
"data_warm", | ||
"index", | ||
"ingest", | ||
"master", | ||
"ml", | ||
"remote_cluster_client", | ||
"search", | ||
"transform", | ||
"voting_only", // end roles array |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This would need updating if we ever add a new node role, but I don't think the test really cares about the exact roles. How about constructing the expected string dynamically:
"data", // begin roles array | |
"data_cold", | |
"data_content", | |
"data_frozen", | |
"data_hot", | |
"data_warm", | |
"index", | |
"ingest", | |
"master", | |
"ml", | |
"remote_cluster_client", | |
"search", | |
"transform", | |
"voting_only", // end roles array | |
cae.getCurrentNode().getRoles().stream().map(r -> '"' + r.roleName() + '"').collect(Collectors.joining(", ", "[", "]")), |
(and then the format string just needs "roles": %s
)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sounds great, figuring out the formatting to get everything to work in Java was kicking my ass :)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for the feedback, David! Responded and pushed an update.
...rTest/java/org/elasticsearch/action/admin/cluster/allocation/ClusterAllocationExplainIT.java
Outdated
Show resolved
Hide resolved
logger.info("--> restarting the stopped node with 'roles' setting added"); | ||
Settings roleSettings = Settings.builder().putList("node.roles", Arrays.asList("master", "data_hot", "ingest")).build(); | ||
internalCluster().startNode(Settings.builder().put(settings).put(roleSettings).build()); | ||
ensureStableCluster(1); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Oh, nice. So it's only worth calling when restarting a subset of the nodes in a cluster, or stopping nodes, maybe (depending those functions' implementations).
Is it typical for patches to remain focused exclusively on one topic, or are touchups elsewhere OK? For example, I could remove the check here, too, since it's redundant and I've noticed it. Though perhaps validateClusterFormed() and ensureStableCluster() vary enough that the real fix would be a refactor -- which seems too much of a distraction/off topic to do in this patch.
...rTest/java/org/elasticsearch/action/admin/cluster/allocation/ClusterAllocationExplainIT.java
Outdated
Show resolved
Hide resolved
...rTest/java/org/elasticsearch/action/admin/cluster/allocation/ClusterAllocationExplainIT.java
Outdated
Show resolved
Hide resolved
...rTest/java/org/elasticsearch/action/admin/cluster/allocation/ClusterAllocationExplainIT.java
Outdated
Show resolved
Hide resolved
"data", // begin roles array | ||
"data_cold", | ||
"data_content", | ||
"data_frozen", | ||
"data_hot", | ||
"data_warm", | ||
"index", | ||
"ingest", | ||
"master", | ||
"ml", | ||
"remote_cluster_client", | ||
"search", | ||
"transform", | ||
"voting_only", // end roles array |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sounds great, figuring out the formatting to get everything to work in Java was kicking my ass :)
...r/src/main/java/org/elasticsearch/cluster/routing/allocation/AbstractAllocationDecision.java
Outdated
Show resolved
Hide resolved
version: " - 8.9.99" | ||
reason: The roles field was introduced in 8.10.0 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hmm. I believe a rebase backwards would require a force push to the PR -- as opposed to a merge of the latest that's benign to push. Would not be great to wipe out your comments 😅 Perhaps I should just update this code to 8.11 and wait? Or I might be missing something about the process.
Pinging @elastic/es-distributed (Team:Distributed) |
@@ -1150,6 +1152,42 @@ public void testCannotAllocateStaleReplicaExplanation() throws Exception { | |||
} | |||
} | |||
|
|||
public void testExplainRolesOutput() throws Exception { | |||
logger.info("--> starting 1 node with 'roles' setting specified"); | |||
Settings roleSettings = Settings.builder().putList("node.roles", Arrays.asList("master", "data_hot", "ingest")).build(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
NIT: This is not important here, but please generally prefer List.of(..)
to Arrays.asList(..)
as first one creates immutable list while the later allows some modifications
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done. Thanks!
// Check that the "roles" reported are those explicitly set via the node Settings. | ||
// Note: list() implicitly consumes the parser START_ARRAY and END_ARRAY tokens. | ||
assertEquals(Arrays.asList("data_hot", "ingest", "master"), parser.list()); | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Wow, I am surprised we have to do this in this test.
I wonder if it is worth opening a tech debt issue to have AbstractXContentSerializingTestCase
for ClusterAllocationExplanation instead of doing parsing in IT?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not currently familiar with how AbstractXContentSerializingTestCase works, so can't really weigh in on this without some context.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah I think there's scope for better testing in this area indeed. Not sure AbstractXContentSerializingTestCase
is quite right, that is for checking that things faithfully round-trip through XContent which is a much stronger property than we really need here. There's probably a nicer way to verify specific features of the XContent output tho, but I'm ok with following the existing style in this area right now.
cluster.allocation_explain: | ||
body: { "index": "test", "shard": 0, "primary": true } | ||
|
||
- is_true: current_node.roles |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Roles are added to other places as well:
- allocate unassigned section (if present)
- move shard section (if present)
- each node of the per node explanation
I believe the first 2 are not present since this explain is executed against allocated shard that could stay on the current node. I do not think they require additional test case and could be covered by serialization test, see https://github.com/elastic/elasticsearch/pull/98550/files#r1299861004
I wonder about the last one. I think it could be covered with additional:
- is_true: node_allocation_decisions.0.roles
unless I miss something.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I tried adding - is_true: node_allocation_decisions.0.roles
and node_allocation_decisions
is null in this test case. I saw current_node
testing in this file, so that was easy to add. I haven't looked into how node_allocation_decisions
might be evoked: left that to the the Java unit/integration tests.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The YAML tests (sometimes) run in a one-node cluster so we can't assert anything about the other changed locations here.
I think we could check the node_allocation_decisions
bit in ClusterAllocationExplainIT#testExplainRolesOutput
if you'd like, by starting a second node with different roles (after creating the index, so we know which node has the shard). I didn't see an easy way to get nodes in the AllocateUnassignedDecision
or MoveDecision
here, it might be easier to do that in a unit test, but tbh it's the node_allocation_decision
that really matters.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Extended ClusterAllocationExplainIT#testExplainRolesOutput
to test node_allocation_decisions
👍
Hi @DiannaHohensee, I've created a changelog YAML for you. |
@@ -106,6 +106,7 @@ The response will look like this: | |||
"node_id" : "8qt2rY-pT6KNZB3-hGfLnw", | |||
"node_name" : "node-0", | |||
"transport_address" : "127.0.0.1:9401", | |||
"roles": [data, data_cold, data_content, data_frozen, data_hot, data_warm, ingest, master, ml, remote_cluster_client, transform] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: (not picked up by the test framework because this example is skipped)
"roles": [data, data_cold, data_content, data_frozen, data_hot, data_warm, ingest, master, ml, remote_cluster_client, transform] | |
"roles": [data, data_cold, data_content, data_frozen, data_hot, data_warm, ingest, master, ml, remote_cluster_client, transform], |
You could also reasonably list fewer roles here, maybe just data_content, data_hot
, I don't think it helps the docs to have them all.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
👍 from my side. Leaving it to David to give a final approval
Hi @DiannaHohensee, I've updated the changelog YAML for you. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM great work @DiannaHohensee
Report node "roles" in the /_cluster/allocation/explain response.
Nodes with limited sets of roles may affect shard distribution in ways
users did not originally consider, so it is helpful to surface this
information along with node allocation decision explanations.