-
Notifications
You must be signed in to change notification settings - Fork 1k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Fixed: Akka.Cluster.Singleton no longer moves when node with matching role and higher AppVersion
joins cluster
#7197
Fixed: Akka.Cluster.Singleton no longer moves when node with matching role and higher AppVersion
joins cluster
#7197
Conversation
AppVersion
joins cluster
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Detailed my changes
|
||
[Fact(DisplayName = | ||
"Singletons should not move to higher AppVersion nodes until after older incarnation is downed")] | ||
public async Task Bugfix7196Spec() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
End to End reproduction spec:
- Forms 2 node cluster: 1 with the role enabled for singleton (
_hostNodeV1
), the other without (Sys
). Both nodes are running withAppVersion
== 1.0.0. Sys
will create theClusterSingletonProxy
- which will point to theSingletonManger
and the resultingSingleton
actor hosted on_hostNodeV1
._otherNodeV2
joins the cluster withAppVersion
== 1.0.2. The bug from Akka.Cluster.Tools.Singleton: singleton moves earlier than expected - as soon as new node joins #7196 meant that this node would IMMEDIATELY become theOldest
and the singleton would move (and the old one would not die, creating a split brain.)- We validate that the singleton is still on
_hostNodeV1
- which means we've fixed the bug, but we still need to make sure the oldest gets replaced if the current one leaves. - We down
_hostNodeV1
and observe that the singleton migrates over to_otherNodeV2
afterwards.
This validates, end to end, that the issue is resolved. The only way we could make this more robust is adding a second _hostNodeV1
instance and observe that _otherNodeV2
gets elected in favor over it, but we have other tests below that ensure this.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Tests for my OldestChangedBuffersState
abstraction, which is now shared by both the SingletonProxy
and the OldestChangedBuffer
@@ -5,6 +5,7 @@ | |||
// </copyright> | |||
//----------------------------------------------------------------------- | |||
|
|||
#nullable enable |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Enable nullability here - just trying to do that as I go back through older code.
@@ -102,7 +99,8 @@ public ClusterSingletonProxy(string singletonManagerPath, ClusterSingletonProxyS | |||
_memberAgeComparer = settings.ConsiderAppVersion | |||
? MemberAgeOrdering.DescendingWithAppVersion | |||
: MemberAgeOrdering.Descending; | |||
_membersByAge = ImmutableSortedSet<Member>.Empty.WithComparer(_memberAgeComparer); | |||
_state = new OldestChangedBufferState(ImmutableSortedSet<Member>.Empty.WithComparer(_memberAgeComparer), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Use the new OldestChangedBufferState
instead of separate fields
{ | ||
var (newState, oldestChanged) = _state.AddMember(member); | ||
_state = newState; | ||
if (oldestChanged) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Check to see if the oldest has changed, if so retry identifying the singleton (consistent with the behavior we had before, just no Action
delegates and the TrackChanges
method to do it.)
|
||
State = State with { MembersByAge = newMembersByAge }; | ||
|
||
// compute the initial oldest - doesn't matter if it's not "safe" or not. That's our parent's problem. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
See comment for details - behavior is the same as before, just computed by the State
_membersByAge = _membersByAge.Remove(member); | ||
_membersByAge = _membersByAge.Add(member); | ||
}); | ||
_changes = _changes.Enqueue(new OldestChanged(State.CurrentOldest?.UniqueAddress)); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Same behavior as before, but this is just now internally handled by the State
.
/// <summary> | ||
/// Immutable data object that represents the state of the oldest changed buffer. | ||
/// </summary> | ||
internal sealed record OldestChangedBufferState |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
New immutable data structure designed to do two things:
- Predict who the next leader should be
- Not actually allow the leader to be changed until the previous leader has exited the cluster <-- THIS IS THE KEY BUGFIX.
if (MatchingRole(state, member)) | ||
{ | ||
// remove then add node to replace it, as it's possible that the upNumber is changed | ||
return ComputeNextOldest(state with { MembersByAge = state.MembersByAge.Remove(member).Add(member) }); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Same "replacement" behavior that we used inside the OldestChanged
buffer before.
public static (OldestChangedBufferState newState, bool oldestChanged) ComputeNextOldest(this OldestChangedBufferState state) | ||
{ | ||
// if the current oldest has not been removed, then it remains the oldest | ||
if(state.CurrentOldest is not null && state.MembersByAge.Contains(state.CurrentOldest)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Key bugfix happens right here - we don't allow a new "oldest" node to be decided until AFTER the previous one is removed from the cluster. This prevents split brains from happening AND it stops the ClusterSingletonManager
from moving immediately after a new with a high AppVersion
and matching role joins the cluster.
I have a racy failure / instability of some kind in my test. Working on ruling that out before I get anyone else to look at this PR. I think it might have been related to the auto-downing setting I copied and pasted from the |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM
So this "fix" will actually create additional problems - I'm thinking that the right thing to do here is to remove the "consider app version" support from Akka.Cluster.Singleton altogether. That problem doesn't seem workable without adding another communication layer around singletons, which I am not inclined to do. Better that we have a single heuristic for deciding where singletons live so we can keep the design of the infrastructure as simple as possible and eliminate any possibility of split brains. |
f1b8b9d
to
91094a2
Compare
Superseded by #7297 |
Changes
Fixes #7196 - not done yet.
Checklist
For significant changes, please ensure that the following have been completed (delete if not relevant):