ClusterSingleton does not start after oldest node shutdown #1784
Comments
Could you show us the code used to instantiate the cluster singleton on each node?
They use the same code.
Where is this code executed?
This may indeed be a bug; however, it's also possible that:
As I've said, both nodes have identical code, so the parent actor and this code are executed on both nodes.
I've run into this as well, and I've tracked it to these issues:
@zbynek001 Wow. This is really great info to hear. Of course, PRs are always welcome!
@zbynek001 Cool, thanks a lot.
Btw, right now all ClusterSingletonManager instances are reporting the "Oldest" state. With this fix, only the oldest will report "Oldest" and all others will report "Younger".
@zbynek001 That's strange. In my case I see that 2 out of 3 nodes report "Oldest".
Resolved by #1799
I'm still experiencing this issue. Not only does the singleton not get created again until the previously hosting node comes back up, but the singleton also isn't brought up on the oldest node in my cluster. I have a three-node setup with identical code. I start the two seed nodes followed by a third node. I then do the following:
I perform the previously mentioned steps in arbitrary order and note that even though node 3 never goes down, and is therefore the oldest, it is never handed the singleton. Any insight into this would be appreciated. My code is below:

```java
ActorSystem system = ActorSystem.create("time-magus");

ClusterSingletonManagerSettings settings = ClusterSingletonManagerSettings.create(system);
system.actorOf(
        ClusterSingletonManager.props(
                Props.create(MasterScheduler.class, new MasterSchedulerConfiguration()),
                PoisonPill.getInstance(),
                settings),
        "test");
```
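Since the complaint hinges on which member the cluster considers oldest, one way to investigate is to log membership events from each node and compare. Below is a minimal diagnostic sketch against the Akka (JVM) Java API used in the snippet above; the MemberLogger class is hypothetical and not from this thread.

```java
import akka.actor.AbstractActor;
import akka.actor.Props;
import akka.cluster.Cluster;
import akka.cluster.ClusterEvent;
import akka.cluster.ClusterEvent.MemberEvent;

// Hypothetical helper: logs cluster membership events so each node can
// record who joins and leaves, and in what order ("oldest" follows join order).
public class MemberLogger extends AbstractActor {

    private final Cluster cluster = Cluster.get(getContext().getSystem());

    public static Props props() {
        return Props.create(MemberLogger.class, MemberLogger::new);
    }

    @Override
    public void preStart() {
        // Replay the current membership as events, then stream live ones.
        cluster.subscribe(getSelf(), ClusterEvent.initialStateAsEvents(),
                MemberEvent.class);
    }

    @Override
    public void postStop() {
        cluster.unsubscribe(getSelf());
    }

    @Override
    public Receive createReceive() {
        return receiveBuilder()
                .match(MemberEvent.class, event ->
                        System.out.println("Cluster member event: " + event))
                .build();
    }
}
```

Starting this on every node (system.actorOf(MemberLogger.props(), "member-logger")) and diffing the logs would show whether all three nodes agree on membership, and which member is actually oldest, when the singleton fails to move.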
@enzeart coincidental timing on this, but I just found a bug with the way
@Aaronontheweb Awesome, thanks for the updates. Any idea when this will be merged or released? Or is there a snapshot of sorts that I can try out?
@enzeart going to do a release of Akka.NET v1.2.2 tomorrow, but the current nightly has it as well: http://getakka.net/docs/akka-developers/nightly-builds Give that a try and let me know if it works. In the meantime, I'll take a look at our test suite more closely and see if we need to add a spec to verify this fix.
The ClusterSingletonManagerLeave and ClusterSingletonManager chaos specs seem to verify this, although they have been racy in the past, possibly because of this bug I fixed in the #2794 PR. Give the nightly a try and let me know if this is resolved.
@Aaronontheweb I tried out the 2.5-snapshot from http://repo.akka.io/snapshots/. The problem seemed to be fixed at first, as the third, non-seed cluster member was receiving the singleton as expected, but after a few random restarts of the three cluster members, the singleton started to juggle strictly between the two seed nodes, even though neither of them should have been the oldest member in the cluster (node 3 was the only node that hadn't been killed and restarted). This was a quick test. Let me know if there's any information I can provide that would be more helpful.
Edit: probably not, this doesn't fit the scenario. I think this might even be by design. UpNumber does not guarantee strict ordering:
@enzeart yeah, it sounds like the bug was fixed here (moving the singleton to the third node in the first place), but if the two seed nodes are restarted at the same time, they're not going to re-form the cluster with the third node. If you restart only one seed at a time, it'll rejoin the original cluster and the SingletonManager should stay with the third node. Some logs showing the activity inside the cluster from the perspective of the third node would be helpful!
@enzeart flipping this to 1.3 for now in case it's still a bug; if you can provide me with those logs, that would be helpful.
@Aaronontheweb I'll try to run some tests, generate some logs, and annotate them with the sequence of events that triggered them tonight.
I just realized that I was using "Stop" in IntelliJ during my tests, so I tried testing with "Exit" instead. The difference, I'm guessing, is between a SIGKILL in the former and a SIGTERM in the latter (a hard shutdown of the JVM vs. a graceful one that allows proper cluster signaling). When using "Exit", the migration of the singleton behaves as expected and it is transferred to the expected node in a timely fashion. When using "Stop", I run into the previously mentioned behavior of the singleton juggling between seed nodes. I just want to confirm what the expectation should be in the case of an "ungraceful" shutdown: is the migration of the singleton still expected to happen without the previously hosting process coming back up, and is "oldest gets it" still a valid expectation? I want to make sure I'm not going down a rabbit hole based on invalid expectations while dragging you along for the ride.
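For what it's worth, the "graceful" path corresponds to the node asking to leave the cluster before its process exits, which is what a SIGTERM-handling shutdown hook can do; after a SIGKILL the node simply vanishes and must be detected as unreachable and downed before the singleton can be handed over. A hedged sketch of the graceful variant, again against the Akka (JVM) Java API from the earlier snippet (the GracefulExit class is mine, not from the thread):

```java
import akka.actor.ActorSystem;
import akka.cluster.Cluster;
import java.util.concurrent.CountDownLatch;

public class GracefulExit {

    // Leave the cluster cleanly, wait until this member is fully removed,
    // then shut down the actor system. Hand-over of the singleton happens
    // while the member passes through the Leaving/Exiting states.
    public static void leaveAndTerminate(ActorSystem system) throws InterruptedException {
        Cluster cluster = Cluster.get(system);
        CountDownLatch removed = new CountDownLatch(1);

        // Invoked once this node's member status reaches Removed.
        cluster.registerOnMemberRemoved(removed::countDown);

        cluster.leave(cluster.selfAddress());

        removed.await();
        system.terminate();
    }
}
```

After an ungraceful kill, the singleton is still expected to migrate to the new oldest node, but only after failure detection marks the dead member unreachable and it is downed (manually or via auto-down); until that happens the cluster cannot know the old singleton instance is really gone.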
@enzeart this is the repository for Akka.NET, not for Akka (JVM).
I have a cluster of three nodes (one seed and two workers).
The worker nodes run identical code, have the same worker role, and have a ClusterSingleton actor configured for that role.
At first everything is fine: the actor starts on the first worker node and accepts messages (through the proxy nodes).
Then I terminate the first worker node.
I expect the singleton actor to come up on the second worker node, but this just doesn't happen.
The last log message I get on worker-2 is:
[Information] Previous oldest removed ["akka.tcp://Cluster@worker1:43314"]
and that is all.
If I bring the first worker node up again, it starts the singleton, which is strange and unexpected.
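For reference, a setup like the one described would look roughly like the following, sketched against the Akka (JVM) Java API that matches the code earlier in the thread; the Worker class, the "worker" role string, and the actor names are placeholders, not taken from this report.

```java
import akka.actor.AbstractActor;
import akka.actor.ActorRef;
import akka.actor.ActorSystem;
import akka.actor.PoisonPill;
import akka.actor.Props;
import akka.cluster.singleton.ClusterSingletonManager;
import akka.cluster.singleton.ClusterSingletonManagerSettings;
import akka.cluster.singleton.ClusterSingletonProxy;
import akka.cluster.singleton.ClusterSingletonProxySettings;

public class SingletonSetup {

    // Placeholder for the actual singleton actor.
    public static class Worker extends AbstractActor {
        @Override
        public Receive createReceive() {
            return receiveBuilder().matchAny(msg -> { /* handle work */ }).build();
        }
    }

    // Run on every node with the "worker" role: the manager keeps exactly
    // one Worker alive, on the oldest member carrying that role.
    static void startManager(ActorSystem system) {
        ClusterSingletonManagerSettings settings =
                ClusterSingletonManagerSettings.create(system).withRole("worker");
        system.actorOf(
                ClusterSingletonManager.props(
                        Props.create(Worker.class),
                        PoisonPill.getInstance(),
                        settings),
                "worker-singleton");
    }

    // Run on any node that talks to the singleton: the proxy tracks where
    // the singleton currently lives and forwards messages to it.
    static ActorRef startProxy(ActorSystem system) {
        ClusterSingletonProxySettings settings =
                ClusterSingletonProxySettings.create(system).withRole("worker");
        return system.actorOf(
                ClusterSingletonProxy.props("/user/worker-singleton", settings),
                "worker-singleton-proxy");
    }
}
```

The withRole setting matters in this scenario: the manager only runs on members that carry the role, and the proxy only looks for the singleton among those members, so a role mismatch between the two can also look like a singleton that never starts.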