Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Allow CCR on nodes with legacy roles only #60093

Merged
merged 12 commits into from
Jul 29, 2020

Conversation

dnhatn
Copy link
Member

@dnhatn dnhatn commented Jul 22, 2020

CCR will stop functioning if the master node is on 7.8, but data nodes are before that version because the master node considers that all data nodes do not have the remote cluster client role. This commit allows CCR work on data nodes with legacy roles only.

Relates #54146
Relates #59375

@dnhatn dnhatn added >bug :Distributed/CCR Issues around the Cross Cluster State Replication features v7.8.2 v7.10.0 v7.9.1 labels Jul 22, 2020
@dnhatn dnhatn requested a review from jasontedor July 22, 2020 21:18
@elasticmachine
Copy link
Collaborator

Pinging @elastic/es-distributed (:Distributed/CCR)

@elasticmachine elasticmachine added the Team:Distributed Meta label for distributed team label Jul 22, 2020
Copy link
Member

@jasontedor jasontedor left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does it work out if instead in the CCR primary allocation decider, we tactically make this assumption there about old nodes (node.node().getVersion())?

@dnhatn
Copy link
Member Author

dnhatn commented Jul 23, 2020

Yes, that can be an option. However, we check the remote_cluster_client role in other places; hence, I put this assumption in DiscoveryNode instead. Let me know if you are ok with this approach.

@dnhatn
Copy link
Member Author

dnhatn commented Jul 23, 2020

testThatLoadingWithNonExistingIndexWorks (tracked at #54096)

@elasticmachine run elasticsearch-ci/2

@dnhatn dnhatn requested a review from jasontedor July 23, 2020 01:41
@jasontedor
Copy link
Member

The concern I have with this approach is that it assumes the role is held, even for nodes that have cluster.remote.connect set to false.

@dnhatn
Copy link
Member Author

dnhatn commented Jul 23, 2020

@jasontedor Thank you for your feedback. I've pushed c75f661. Can you take another look?

@dnhatn dnhatn changed the title Assume old nodes have remote cluster client role Allow CCR on nodes with legacy roles only Jul 28, 2020
Copy link
Member

@jasontedor jasontedor left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I left a few comments.

clusterState,
((Predicate<DiscoveryNode>) DiscoveryNode::isDataNode).and(DiscoveryNode::isRemoteClusterClient)
node -> node.isDataNode() && (node.isRemoteClusterClient() || node.getVersion().before(DiscoveryNode.PLUGGABLE_ROLES_VERSION))
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Assigning to a node that is a data node but before version 7.3.0 could still fail (if the node doesn't have the remote cluster client role). I wonder if we should implement this as a last resort? So try to select the least loaded node that is a data node and is a remote cluster client. If that doesn't turn up any nodes, try to assign to a node that is a data node and is before the version that we formalized the remote cluster client role. Note that that is a different version than when we made roles pluggable.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I implemented this in fda3927, but I backed it out because I thought it could lead to hotspots on upgraded nodes. I am okay with either option. I pushed 7eeaa14 to restore this behavior.

return allocation.decision(Decision.YES, NAME,
"shard is a primary follower and node has the " + DiscoveryNodeRole.REMOTE_CLUSTER_CLIENT_ROLE.roleName() + " role");
}
if (node.node().getVersion().before(DiscoveryNode.PLUGGABLE_ROLES_VERSION)) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wonder if this should be the version that we introduced the remote cluster client role in as opposed to the version that we made node roles pluggable.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good catch, fixed in 3a33f0a

@dnhatn
Copy link
Member Author

dnhatn commented Jul 29, 2020

@jasontedor Thanks for reviewing. Would you mind taking another look?

@dnhatn dnhatn requested a review from jasontedor July 29, 2020 02:57
Copy link
Member

@jasontedor jasontedor left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks good to me now. Thanks for the iterations @dnhatn.

@dnhatn
Copy link
Member Author

dnhatn commented Jul 29, 2020

Thanks Jason!

@dnhatn dnhatn merged commit 9d4a64e into elastic:7.x Jul 29, 2020
@dnhatn dnhatn deleted the 7x-remote-cluster-role branch July 29, 2020 14:57
dnhatn added a commit that referenced this pull request Aug 11, 2020
CCR will stop functioning if the master node is on 7.8, but data nodes 
are before that version because the master node considers that all data
nodes do not have the remote cluster client role. This commit allows CCR
work on data nodes with legacy roles only.

Relates #54146
Relates #59375
dnhatn added a commit that referenced this pull request Aug 11, 2020
CCR will stop functioning if the master node is on 7.8, but data nodes 
are before that version because the master node considers that all data
nodes do not have the remote cluster client role. This commit allows CCR
work on data nodes with legacy roles only.

Relates #54146
Relates #59375
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
>bug :Distributed/CCR Issues around the Cross Cluster State Replication features Team:Distributed Meta label for distributed team v7.8.2 v7.9.1 v7.10.0
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants