ClusterSingleton does not start after oldest node shutdown #1784

Closed
kantora opened this issue Mar 13, 2016 · 24 comments

Comments

@kantora
Contributor

kantora commented Mar 13, 2016

I have a cluster of three nodes (one seed and two workers).
The worker nodes run identical code, have the same worker role, and have a ClusterSingleton actor configured for this role.
At first everything is OK: the actor starts on the first worker node and accepts messages (through the proxy nodes).
Then I terminate the first worker node.
I expect the singleton actor to start on the second worker node, but this does not happen.
The last log message I get on worker-2 is:
[Information] Previous oldest removed ["akka.tcp://Cluster@worker1:43314"]
and that is all.

If I bring the first worker node up again, it starts the singleton, which is strange and unexpected.

@Horusiath
Contributor

Could you show us the code used to instantiate cluster singleton on each node?

@kantora
Contributor Author

kantora commented Mar 13, 2016

They use the same code.

context.ActorOf(
    ClusterSingletonManager.Props(
        context.System.DI().Props(type),
        new ClusterSingletonManagerSettings(
            singletonName,
            role,
            actorConfig.GetTimeSpan("removal-margin", TimeSpan.FromSeconds(1), false),
            actorConfig.GetTimeSpan("handover-retry-interval", TimeSpan.FromSeconds(5), false))),
    pathName);

Where type is the concrete type of the singleton actor.
The SingletonManager is instantiated as a child of some other actor, but they are both created under the same path.

@kantora
Contributor Author

kantora commented Mar 13, 2016

singletonName, role and pathName are all the same for both nodes.

@Horusiath
Contributor

This may be a bug indeed, however it's also possible that:

  1. Parent actor (creating a singleton actor) is not present on each node under the same path.
  2. Maybe your node is not being marked as down. In that case the other nodes may not be aware that the singleton has died, so they won't try to respawn it on another node.
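
Point 2 can be illustrated with a small sketch. It is a hypothetical model in plain Java, not the Akka.NET API (`HandOverSketch`, `Status` and `singletonOwner` are made-up names). It assumes the singleton only moves once the previous oldest member has actually been removed from the cluster, as the "Previous oldest removed" log line suggests:

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical model, not the Akka.NET API.
class HandOverSketch {
    enum Status { UP, UNREACHABLE, REMOVED }

    static final class Node {
        final String name;
        final Status status;

        Node(String name, Status status) {
            this.name = name;
            this.status = status;
        }
    }

    // Assumption: nodes are listed oldest first, and the first node that has not
    // been removed is still considered responsible for the singleton.
    static String singletonOwner(List<Node> oldestFirst) {
        for (Node node : oldestFirst) {
            if (node.status != Status.REMOVED) {
                return node.name;
            }
        }
        return null; // no members left
    }

    public static void main(String[] args) {
        List<Node> crashedNotDowned = Arrays.asList(
                new Node("worker1", Status.UNREACHABLE),
                new Node("worker2", Status.UP));
        List<Node> downedAndRemoved = Arrays.asList(
                new Node("worker1", Status.REMOVED),
                new Node("worker2", Status.UP));

        // worker1 is dead but still a member, so worker2 does not take over.
        System.out.println(singletonOwner(crashedNotDowned)); // prints "worker1"
        // Once worker1 is removed, worker2 is the oldest and may start the singleton.
        System.out.println(singletonOwner(downedAndRemoved)); // prints "worker2"
    }
}
```

In this model a crashed node that is never downed keeps "owning" the singleton, which would match a singleton that only reappears when the old host comes back.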

@kantora
Contributor Author

kantora commented Mar 13, 2016

As I've said, both nodes run identical code, so the parent actor exists and this code is executed on both nodes.
Also, the seed node received a ClusterEvent.MemberRemoved event for worker1.
And I suppose the log message [Information] Previous oldest removed ["akka.tcp://Cluster@worker1:43314"] on worker2 also indicates that it is aware of worker1's state.

@zbynek001
Contributor

I've run into this as well, and I've tracked it to these issues:

  1. Member.UpNumber is not being propagated across the cluster
  2. MemberAgeComparer is sorting in the reverse order of what it should

I have fixes for both, so I can create a PR if you are interested.
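
To make the second point concrete, here is a minimal sketch in plain Java. It is not the Akka.NET `MemberAgeComparer`; the `Member` class, the node names and the up numbers are hypothetical. It only assumes that a lower up number means the member became Up earlier, so the "oldest" member is the first one when sorting by ascending up number:

```java
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

// Hypothetical model, not the Akka.NET MemberAgeComparer.
class AgeOrderingSketch {
    static final class Member {
        final String address;
        final int upNumber;

        Member(String address, int upNumber) {
            this.address = address;
            this.upNumber = upNumber;
        }
    }

    // Oldest first: a lower upNumber means the member became Up earlier.
    static final Comparator<Member> OLDEST_FIRST = Comparator.comparingInt(m -> m.upNumber);

    // The member that sorts first under the given ordering is treated as the oldest.
    static String oldest(List<Member> members, Comparator<Member> order) {
        return members.stream().min(order).map(m -> m.address).orElse(null);
    }

    public static void main(String[] args) {
        List<Member> members = Arrays.asList(
                new Member("node-a", 1),
                new Member("node-b", 2),
                new Member("node-c", 3));

        System.out.println(oldest(members, OLDEST_FIRST));            // prints "node-a"
        // A reversed comparator picks the youngest member as "oldest".
        System.out.println(oldest(members, OLDEST_FIRST.reversed())); // prints "node-c"
    }
}
```

With a reversed ordering, the node chosen to host the singleton would be the most recently joined one rather than the oldest.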

@Horusiath
Contributor

@zbynek001 Wow. This is really great info to hear. Ofc PRs are always welcome!

@kantora
Contributor Author

kantora commented Mar 17, 2016

@zbynek001 Cool, thanks a lot.

@zbynek001
Contributor

Btw, right now all ClusterSingletonManager instances are reporting "Oldest" state. With this fix only the oldest will report "Oldest" and all others will report "Younger".

@kantora
Contributor Author

kantora commented Mar 17, 2016

@zbynek001 That's strange. In my case I see that 2 out of 3 nodes report
ClusterSingletonManager state change [Start -> Younger] and only one reports ClusterSingletonManager state change [Start -> Oldest] (the one with the started singleton).

@Aaronontheweb
Member

Resolved by #1799

@enzeart

enzeart commented Jun 26, 2017

I'm still experiencing this issue. Not only does the singleton not get created again until the previously hosting node comes back up, but the singleton also isn't being brought up on the oldest node in my cluster. I have a 3 node setup with identical code. I start the two seed nodes followed by a 3rd node. I then do the following:

  1. Take down seed node 1.
  2. Log messages indicating that "association with the remote system has failed" appear.
  3. Bring seed node 1 back up.
  4. The singleton may or may not instantiate on seed node 2. This has been inconsistent in my testing. Once in a while, it will come up on node 1.
  5. Take down seed node 2.
  6. Bring seed node 2 back up.
  7. The singleton appears on seed node 1.

I perform the previously mentioned steps in arbitrary order and note that even though node 3 never goes down, and is therefore the oldest, it is never handed the singleton. Any insight into this would be appreciated. My code is below:

ActorSystem system = ActorSystem.create("time-magus");
ClusterSingletonManagerSettings settings = ClusterSingletonManagerSettings.create(system);

system.actorOf(
        ClusterSingletonManager.props(
                Props.create(MasterScheduler.class, new MasterSchedulerConfiguration()),
                PoisonPill.getInstance(),
                settings),
        "test");

Aaronontheweb reopened this Jun 26, 2017
@Aaronontheweb
Member

@enzeart coincidental timing on this, but I just found a bug with the way upNumber is incremented in Akka.Cluster while looking at another issue. Result was that two different nodes can have the same upNumber, which is wrong and could cause this behavior. I'll take a look at the ClusterSingletonManager sources once I have this other issue I'm working on sorted out, but I'll reopen this until I can confirm a fix.
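
One way duplicate up numbers could lead to disagreement is sketched below. This is a hypothetical plain-Java model, not the Akka.Cluster code and not necessarily what the fix changes: `Member`, `BY_UP_NUMBER`, `WITH_TIE_BREAK` and the node names are made up. It assumes each node sorts its own view of the members, and that the views may list members in different orders:

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

// Hypothetical model, not the Akka.Cluster implementation.
class UpNumberTieSketch {
    static final class Member {
        final String address;
        final int upNumber;

        Member(String address, int upNumber) {
            this.address = address;
            this.upNumber = upNumber;
        }
    }

    // Compares by upNumber only, so two members with the same number are "equal".
    static final Comparator<Member> BY_UP_NUMBER = Comparator.comparingInt(m -> m.upNumber);

    // Adds a deterministic tie-break on the address.
    static final Comparator<Member> WITH_TIE_BREAK = BY_UP_NUMBER.thenComparing(m -> m.address);

    static String oldest(List<Member> view, Comparator<Member> order) {
        List<Member> sorted = new ArrayList<>(view);
        sorted.sort(order); // stable sort: ties keep the order of this node's view
        return sorted.get(0).address;
    }

    public static void main(String[] args) {
        Member seed1 = new Member("seed1", 1);
        Member seed2 = new Member("seed2", 1); // same upNumber as seed1
        Member node3 = new Member("node3", 2);

        List<Member> viewA = Arrays.asList(seed1, seed2, node3);
        List<Member> viewB = Arrays.asList(seed2, seed1, node3);

        // Without a tie-break the two views disagree on who is oldest.
        System.out.println(oldest(viewA, BY_UP_NUMBER));   // prints "seed1"
        System.out.println(oldest(viewB, BY_UP_NUMBER));   // prints "seed2"
        // With a tie-break both views agree.
        System.out.println(oldest(viewA, WITH_TIE_BREAK)); // prints "seed1"
        System.out.println(oldest(viewB, WITH_TIE_BREAK)); // prints "seed1"
    }
}
```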

Aaronontheweb self-assigned this Jun 26, 2017
Aaronontheweb added this to the 1.2.2 milestone Jun 26, 2017
@Aaronontheweb
Member

@enzeart possible fix on #2794

@enzeart

enzeart commented Jun 27, 2017

@Aaronontheweb Awesome, thanks for the updates. Any idea on when this will be merged or released? Or is there a snapshot of sorts that I can try out?

@Aaronontheweb
Member

@enzeart going to do a release of Akka.NET v1.2.2 tomorrow, but the current nightly has it as well:

http://getakka.net/docs/akka-developers/nightly-builds

Give that a try and let me know if it works. In the meantime, I'll take a look at our test suite more closely and see if we need to add a spec to verify this fix.

@Aaronontheweb
Member

The ClusterSingletonManagerLeave and ClusterSingletonManager chaos specs seem to verify this, although they have been racy in the past, possibly because of the bug I fixed in the #2794 PR.

Give the nightly a try and let me know if this is resolved.

@enzeart

enzeart commented Jun 28, 2017

@Aaronontheweb I tried out the 2.5-snapshot from http://repo.akka.io/snapshots/. The problem seemed to be fixed at first as the 3rd, non-seed, cluster member was receiving the singleton as expected, but after a few random restarts of the 3 cluster members, the singleton started to juggle strictly between the two seed nodes even though neither of them should have been the oldest member in the cluster (node 3 was the only node that hadn't been killed and restarted). This was a quick test. Let me know if there's any information I can provide that would be more helpful.

@zbynek001
Contributor

zbynek001 commented Jun 28, 2017

edit: probably not, this doesn't fit the scenario

I think this might even be by design. UpNumber does not guarantee strict ordering:

/**
   * Is this member older, has been part of cluster longer, than another
   * member. It is only correct when comparing two existing members in a
   * cluster. A member that joined after removal of another member may be
   * considered older than the removed member.
   */
def isOlderThan(other: Member): Boolean
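
The caveat in that comment can be reproduced with a small plain-Java model. It is a sketch under a stated assumption, not the library code: a member that becomes Up is given an up number one higher than the highest up number among the current members, so numbers freed by removed members can be reused. `Member`, `nextUpNumber` and the node names are hypothetical:

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical model of up-number assignment, not the library implementation.
class UpNumberReuseSketch {
    static final class Member {
        final String address;
        final int upNumber;

        Member(String address, int upNumber) {
            this.address = address;
            this.upNumber = upNumber;
        }

        // Only meaningful when both members are currently in the cluster.
        boolean isOlderThan(Member other) {
            return upNumber < other.upNumber;
        }
    }

    // Assumption: the next up number is 1 + the highest number among current members.
    static int nextUpNumber(List<Member> current) {
        int max = 0;
        for (Member m : current) {
            max = Math.max(max, m.upNumber);
        }
        return max + 1;
    }

    public static void main(String[] args) {
        List<Member> cluster = new ArrayList<>();
        Member a = new Member("node-a", nextUpNumber(cluster)); // upNumber 1
        cluster.add(a);
        Member b = new Member("node-b", nextUpNumber(cluster)); // upNumber 2
        cluster.add(b);
        Member c = new Member("node-c", nextUpNumber(cluster)); // upNumber 3
        cluster.add(c);

        // node-b and node-c are removed, then node-d joins.
        cluster.remove(b);
        cluster.remove(c);
        Member d = new Member("node-d", nextUpNumber(cluster)); // upNumber 2 again

        System.out.println(d.upNumber);       // prints 2
        // node-d joined after node-c was removed, yet compares as older than it.
        System.out.println(d.isOlderThan(c)); // prints true
        // Among current members the ordering is still correct.
        System.out.println(a.isOlderThan(d)); // prints true
    }
}
```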

@Aaronontheweb
Member

@enzeart yeah, it sounds like the bug was fixed here (moving the third node in the first place) but if the two seed nodes are restarted at the same time, they're not going to reform the cluster with the third node. If you restart only one seed at a time, it'll rejoin the original cluster and the SingletonManager should stay with the third node.

A thing that would be helpful would be some logs showing the action inside the cluster from the perspective of the third node - that would help!

Aaronontheweb modified the milestones: 1.3.0, 1.2.2 Jun 28, 2017
@Aaronontheweb
Member

@enzeart flipping this to 1.3 for now in case it's still a bug; if you can provide me with those logs that would be helpful.

@enzeart

enzeart commented Jun 29, 2017

@Aaronontheweb I'll try to run some tests, generate some logs, and annotate them with the sequence of events that triggered them tonight.

@enzeart

enzeart commented Jun 30, 2017

I just realized that I was using "Stop" in IntelliJ during my tests. So, I tried testing this with "Exit". The difference, I'm guessing, is that the former sends a SIGKILL and the latter a SIGTERM (a hard shutdown of the JVM vs. a graceful one that allows proper cluster signaling). When using "Exit", the migration of the singleton behaves as expected and it is transferred to the expected node in a timely fashion. When using "Stop", I run into the previously mentioned behavior of the singleton juggling between seed nodes.

I just wanted to confirm what the expectation should be in the case of an "ungraceful" shutdown. Is the migration of the singleton still expected to happen without the previously hosting process coming back up, and is the concept of "oldest gets it" still a valid expectation? I wanted to make sure that I'm not going down a rabbit hole based on invalid expectations on my part while dragging you along for the ride.

@alexvaluyskiy
Contributor

@enzeart it is a repository for Akka.NET, not for Akka (JVM)
