ClusterSingleton does not start after oldest node shutdown #1784

kantora · 2016-03-13T09:23:58Z

I have a cluster of three nodes (one seed and two workers).
The worker nodes run identical code, share the same worker role, and have a ClusterSingleton actor configured for that role.
At first everything is OK: the actor starts on the first worker node and accepts messages (through the proxy nodes).
Then I terminate the first worker node.
I expect the singleton actor to start on the second worker node, but this does not happen.
The last log message I get on the worker-2 is:
[Information] Previous oldest removed ["akka.tcp://Cluster@worker1:43314"]
and that is all.

If I bring the first worker node up again, it starts the singleton, which is strange and unexpected.

Horusiath · 2016-03-13T10:43:16Z

Could you show us the code used to instantiate cluster singleton on each node?

kantora · 2016-03-13T11:05:50Z

They use the same code.

    context.ActorOf(
        ClusterSingletonManager.Props(
            context.System.DI().Props(type),
            new ClusterSingletonManagerSettings(
                singletonName,
                role,
                actorConfig.GetTimeSpan("removal-margin", TimeSpan.FromSeconds(1), false),
                actorConfig.GetTimeSpan("handover-retry-interval", TimeSpan.FromSeconds(5), false))),
        pathName);

Here, type is the concrete type of the singleton actor.
The ClusterSingletonManager is instantiated as a child of another actor, but on both nodes they are created under the same path.

kantora · 2016-03-13T11:07:15Z

singletonName, role and pathName are all the same for both nodes.

Horusiath · 2016-03-13T11:18:28Z

This may indeed be a bug; however, it's also possible that:

- The parent actor (the one creating the singleton actor) is not present on each node under the same path.
- Your node is not being marked as down. In that case the other nodes may not be aware that the singleton has died, so they won't try to respawn it on another node.

kantora · 2016-03-13T11:39:27Z

As I said, both nodes run identical code, so the parent actor exists and this code is executed on both nodes.
Also, the seed node received a ClusterEvent.MemberRemoved event for worker1.
And I suppose that the log message [Information] Previous oldest removed ["akka.tcp://Cluster@worker1:43314"] on worker2 also indicates that it is aware of worker1's state.

zbynek001 · 2016-03-17T09:50:23Z

I've run into this as well, and I've tracked it to these issues:

- Member.UpNumber is not being propagated across the cluster.
- MemberAgeComparer sorts in the reverse of the order it should.

I have fixes for both, so I can create a PR if you are interested.
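
To illustrate why a reversed age comparison matters, here is a minimal, self-contained sketch. It is not the Akka.NET implementation: the `Member` class, the comparator names, and the sample `upNumber` values are hypothetical stand-ins, and the only assumption carried over from the comment above is that a lower `upNumber` means the member joined earlier.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

public class MemberAgeSketch {

    // Hypothetical stand-in for a cluster member: an address plus the
    // counter assigned when the member moved to Up.
    static final class Member {
        final String address;
        final int upNumber;

        Member(String address, int upNumber) {
            this.address = address;
            this.upNumber = upNumber;
        }
    }

    // Oldest first: a lower upNumber is assumed to mean "became Up earlier".
    static final Comparator<Member> OLDEST_FIRST =
            Comparator.comparingInt((Member m) -> m.upNumber);

    // The same comparison reversed, so the youngest member sorts first.
    static final Comparator<Member> YOUNGEST_FIRST = OLDEST_FIRST.reversed();

    // Returns the address of the member that sorts first under the comparator.
    static String first(List<Member> members, Comparator<Member> order) {
        List<Member> sorted = new ArrayList<>(members);
        sorted.sort(order);
        return sorted.get(0).address;
    }

    public static void main(String[] args) {
        List<Member> workers = Arrays.asList(
                new Member("worker3", 4),
                new Member("worker1", 2),
                new Member("worker2", 3));

        // With the intended ordering, worker1 is picked as the oldest.
        System.out.println(first(workers, OLDEST_FIRST));
        // With the reversed ordering, the youngest (worker3) is picked instead.
        System.out.println(first(workers, YOUNGEST_FIRST));
    }
}
```

If every node sorts with the reversed comparison, the node treated as "oldest" is really the newest one, so the choice of singleton host no longer matches the members' actual join order.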

Horusiath · 2016-03-17T09:56:57Z

@zbynek001 Wow. This is really great info to hear. Ofc PRs are always welcome!

kantora · 2016-03-17T13:41:35Z

@zbynek001 Cool, thanks a lot.

zbynek001 · 2016-03-17T16:17:41Z

Btw, right now all ClusterSingletonManager instances are reporting "Oldest" state. With this fix only the oldest will report "Oldest" and all others will report "Younger".

kantora · 2016-03-17T17:04:26Z

@zbynek001 That's strange. In my case, 2 out of 3 nodes report
ClusterSingletonManager state change [Start -> Younger] and only one reports ClusterSingletonManager state change [Start -> Oldest] (the one where the singleton started).

Aaronontheweb · 2016-03-17T21:20:06Z

Resolved by #1799

enzeart · 2017-06-26T02:52:31Z

I'm still experiencing this issue. Not only does the singleton not get created again until the previously hosting node comes back up, but the singleton also isn't being brought up on the oldest node in my cluster. I have a 3 node setup with identical code. I start the two seed nodes followed by a 3rd node. I then do the following:

1. Take down seed node 1.
2. Log messages indicating that "association with the remote system has failed" appear.
3. Bring seed node 1 back up.
4. The singleton may or may not instantiate on seed node 2. This is inconsistent in my testing. Once in a while, it will come up on node 1.
5. Take down seed node 2.
6. Bring seed node 2 back up.
7. The singleton appears on seed node 1.

I perform the previously mentioned steps in arbitrary order and note that even though node 3 never goes down, and is therefore the oldest, it is never handed the singleton. Any insight into this would be appreciated. My code is below:

    ActorSystem system = ActorSystem.create("time-magus");
    ClusterSingletonManagerSettings settings = ClusterSingletonManagerSettings.create(system);

    system.actorOf(
            ClusterSingletonManager.props(
                    Props.create(MasterScheduler.class, new MasterSchedulerConfiguration()),
                    PoisonPill.getInstance(),
                    settings),
            "test");

Aaronontheweb · 2017-06-26T04:10:13Z

@enzeart coincidental timing on this, but I just found a bug with the way upNumber is incremented in Akka.Cluster while looking at another issue. Result was that two different nodes can have the same upNumber, which is wrong and could cause this behavior. I'll take a look at the ClusterSingletonManager sources once I have this other issue I'm working on sorted out, but I'll reopen this until I can confirm a fix.
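
A small sketch of why duplicate upNumber values could produce this kind of behavior. This is an illustration under stated assumptions, not the Akka.Cluster code: the `Member` class, the comparator names, and the address-based tie-break are hypothetical, chosen only to show that an ordering on upNumber alone is ambiguous when two members share a value.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

public class UpNumberTieSketch {

    // Hypothetical stand-in for a cluster member.
    static final class Member {
        final String address;
        final int upNumber;

        Member(String address, int upNumber) {
            this.address = address;
            this.upNumber = upNumber;
        }
    }

    // Orders by upNumber only; members with equal upNumber compare as equal.
    static final Comparator<Member> BY_UP_NUMBER =
            Comparator.comparingInt((Member m) -> m.upNumber);

    // Assumed tie-break for illustration: fall back to the address string.
    static final Comparator<Member> BY_UP_NUMBER_THEN_ADDRESS =
            BY_UP_NUMBER.thenComparing((Member m) -> m.address);

    // Returns the address that sorts first. The sort is stable, so members
    // that compare as equal keep the order they had in the input view.
    static String oldest(List<Member> view, Comparator<Member> order) {
        List<Member> sorted = new ArrayList<>(view);
        sorted.sort(order);
        return sorted.get(0).address;
    }

    public static void main(String[] args) {
        Member node1 = new Member("node1", 1);
        Member node2 = new Member("node2", 1); // same upNumber as node1

        // Two nodes that happen to hold the same members in a different order.
        List<Member> viewA = Arrays.asList(node1, node2);
        List<Member> viewB = Arrays.asList(node2, node1);

        // Without a tie-break the two views disagree on who is oldest.
        System.out.println(oldest(viewA, BY_UP_NUMBER) + " vs " + oldest(viewB, BY_UP_NUMBER));
        // With a deterministic tie-break both views agree.
        System.out.println(oldest(viewA, BY_UP_NUMBER_THEN_ADDRESS) + " vs "
                + oldest(viewB, BY_UP_NUMBER_THEN_ADDRESS));
    }
}
```

The point of the sketch is only that unique upNumber values (or some deterministic tie-break) are needed for every node to agree on which member is the oldest.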

Aaronontheweb · 2017-06-26T18:54:39Z

@enzeart possible fix on #2794

enzeart · 2017-06-27T04:26:15Z

@Aaronontheweb Awesome, thanks for the updates. Any idea on when this will be merged or released? Or is there a snapshot of sorts that I can try out?

Aaronontheweb · 2017-06-27T21:26:11Z

@enzeart going to do a release of Akka.NET v1.2.2 tomorrow, but the current nightly has it as well:

http://getakka.net/docs/akka-developers/nightly-builds

Give that a try and let me know if it works. In the meantime, I'll take a look at our test suite more closely and see if we need to add a spec to verify this fix.

Aaronontheweb · 2017-06-27T22:03:58Z

The ClusterSingletonManagerLeave and ClusterSingletonManager chaos specs seem to verify this, although they have been racy in the past, possibly because of the bug I fixed in PR #2794.

Give the nightly a try and let me know if this is resolved.

enzeart · 2017-06-28T04:09:36Z

@Aaronontheweb I tried out the 2.5-snapshot from http://repo.akka.io/snapshots/. The problem seemed to be fixed at first as the 3rd, non-seed, cluster member was receiving the singleton as expected, but after a few random restarts of the 3 cluster members, the singleton started to juggle strictly between the two seed nodes even though neither of them should have been the oldest member in the cluster (node 3 was the only node that hadn't been killed and restarted). This was a quick test. Let me know if there's any information I can provide that would be more helpful.

zbynek001 · 2017-06-28T12:21:36Z

edit: probably not, this doesn't fit the scenario

I think this might even be by design. UpNumber does not guarantee strict ordering:

    /**
     * Is this member older, has been part of cluster longer, than another
     * member. It is only correct when comparing two existing members in a
     * cluster. A member that joined after removal of another member may be
     * considered older than the removed member.
     */
    def isOlderThan(other: Member): Boolean

Aaronontheweb · 2017-06-28T14:48:33Z

@enzeart yeah, it sounds like the bug was fixed here (the singleton moved to the third node in the first place), but if the two seed nodes are restarted at the same time, they're not going to re-form the cluster with the third node. If you restart only one seed at a time, it'll rejoin the original cluster and the SingletonManager should stay with the third node.

A thing that would be helpful would be some logs showing the action inside the cluster from the perspective of the third node - that would help!

Aaronontheweb · 2017-06-28T20:01:48Z

@enzeart flipping this to 1.3 for now in case it's still a bug; if you can provide me with those logs that would be helpful.

enzeart · 2017-06-29T22:46:45Z

@Aaronontheweb I'll try to run some tests, generate some logs, and annotate them with the sequence of events that triggered them tonight.

enzeart · 2017-06-30T01:39:45Z

I just realized that I was using "Stop" in IntelliJ during my tests, so I tried testing this with "Exit". The difference, I'm guessing, is between a SIGKILL in the former and a SIGTERM in the latter (a hard shutdown of the JVM vs. a graceful one that allows proper cluster signaling). When using "Exit", the migration of the singleton behaves as expected and it is transferred to the expected node in a timely fashion. When using "Stop", I run into the previously mentioned behavior of the singleton juggling between seed nodes.

I just wanted to confirm what the expectation should be in the case of an "ungraceful" shutdown. Is the migration of the singleton still expected to happen without the previously hosting process coming back up, and is the concept of "oldest gets it" still a valid expectation? I want to make sure that I'm not going down the rabbit hole based on invalid expectations on my part while dragging you along with me for the ride.

alexvaluyskiy · 2017-07-22T11:04:03Z

@enzeart it is a repository for Akka.NET, not for Akka (JVM)

Horusiath added the akka-cluster and potential bug labels Mar 13, 2016

zbynek001 mentioned this issue Mar 17, 2016

Member.UpNumber fix #1799

Merged

Aaronontheweb closed this as completed Mar 17, 2016

Aaronontheweb reopened this Jun 26, 2017

Aaronontheweb self-assigned this Jun 26, 2017

Aaronontheweb added this to the 1.2.2 milestone Jun 26, 2017

Aaronontheweb mentioned this issue Jun 26, 2017

Fix cluster node stuck joining #2794

Merged

Aaronontheweb modified the milestones: 1.3.0, 1.2.2 Jun 28, 2017

alexvaluyskiy closed this as completed Jul 22, 2017
