
Member not being removed to allow new incarnation of that member to join. #1458

Closed
cgstevens opened this issue Nov 23, 2015 · 46 comments

Comments

@cgstevens

I have 2 websites that are part of my cluster. Sometimes one website becomes unreachable, and the logic on all of the other members determines that after 120 seconds they should do a Cluster.Down(ThatWebsiteAddress). I can see that all of the members get a MemberRemoved for that website and then it is reported as Down.
My problem is that when my other service, Solarwinds, detects that this website is down and has been removed from the cluster, it tries to restart that website. I can see it trying to join and I get this message: [[akka://MyService/system/cluster/core/daemon]] - New incarnation of existing member [UniqueAddress: (akka.tcp://MyService@1.1.9.2:57771, 303375918)] is trying to join. Existing will be removed from the cluster and then new member will be allowed to join.
But the website member never gets removed from the cluster, so the member that is coming back up is never able to join. I can try to have a member .Leave(ThatWebsiteAddress) and do another .Down(ThatWebsiteAddress) but it never gets removed. Basically I have a ClusterStatus actor that monitors the status of the cluster from its view and determines whether it needs to restart itself, or whether a member has been unreachable for x seconds and should be downed. If so, it shuts that service down and logs to the event log so that Solarwinds can determine if the service or website needs to be started back up.
In order to recover from this issue I have to restart the entire cluster. Note: this has also happened to one of my services, so it isn't website specific, but it happens more on the websites as they become unreachable more often.

If you are interested, here is a snapshot of my cluster status page, which shows the member being down and never getting removed. When I view the cluster information about the website there are no members or cluster leader.

image

I can force a down on a ServiceWorker, which causes it to be removed from the cluster just fine. When it started back up it joined the cluster just fine.

Here you can see I downed the service worker.
image

Here you can see the service worker joining the cluster after being detected that it was down.... still the website isn't joining...
image

@Aaronontheweb Aaronontheweb added this to the Akka.NET v1.1 milestone Nov 23, 2015
@Aaronontheweb
Member

Thanks @cgstevens - this is extremely detailed, which I appreciate. We'll be looking into this shortly!

@cgstevens
Author

No problem for the details. My cluster is still running and I want to see if I can find more detail when in debug mode.
I don't have DEBUG logging enabled in that environment, so I am about to take a lighthouse and restart it with more logging to see if it is reporting any issues. I just hate doing that because it introduces a memory issue which I haven't pinpointed yet.

@cgstevens
Author

I should note that I am running 1.0.4.12-beta.

Here are the messages I get with DEBUG enabled on a lighthouse. I was able to force that lighthouse to Leave using my tool, update the config, and start the service back up so that it rejoined the cluster.
First the website member tries to join, so the lighthouse logs that action (InitJoin). Then the lighthouse logs that the website wants to join, and then states that the website is a new incarnation of the existing member and that the existing member will be removed... which it is not.

[[akka://MyService/system/cluster/core/daemon]] - [Initialized] Received Akka.Cluster.InternalClusterAction+InitJoin
[[akka://MyService/system/cluster/core/daemon]] - [Initialized] Received Akka.Cluster.InternalClusterAction+Join: UniqueAddress: (akka.tcp://MyService@1.1.9.2:57771, 303375918) wants to join on Roles [Website]
[[akka://MyService/system/cluster/core/daemon]] - New incarnation of existing member [UniqueAddress: (akka.tcp://MyService@1.1.9.2:57771, 303375918)] is trying to join. Existing will be removed from the cluster and then new member will be allowed to join.

@easuter
Contributor

easuter commented Dec 7, 2015

@cgstevens have you had any luck with your problem? I experienced (maybe) the same thing today (sorry for the SO post, too tired to write up something else here):

http://stackoverflow.com/questions/34145111/node-doesnt-rejoin-cluster-after-being-downed

Not sure if it's exactly the same thing, but it's suspiciously similar...

@Aaronontheweb
Member

@cgstevens @easuter can both of you try this with 1.0.5? Just came out late last week.

@Aaronontheweb
Member

ah, you already are @easuter

@Aaronontheweb Aaronontheweb self-assigned this Dec 8, 2015
@Aaronontheweb
Member

I'll take this one on - probably covered by a multi-node test we haven't ported yet.

@easuter
Contributor

easuter commented Dec 8, 2015

@Aaronontheweb thanks! Please let me know if there is any more information I can provide.

@cgstevens
Author

I have just completed the upgrade from 1.0.4 to 1.0.5.
Since I have logged this issue in great detail I have not had it happen to me.
We have been making progress on the rest of the app so the cluster itself really hasn't had an opportunity to run for any length of time. I will post if I run across this again. If I do I am hoping to get the debug build on the dev server so that I can debug it.

@cgstevens
Author

@Aaronontheweb
This is also happening on 1.0.5 for me; it just occurred when I tried to redeploy my 2 websites.
Both left and only one was able to rejoin.

Taking a peek at the code finally... After logging "New incarnation of existing member [{0}] is trying to join..." it should call the method Downing(localMember.Address);
To me it doesn't matter which code path is taken in the Downing method; one of these other messages should be logged...
* Ignoring down of unknown node [{0}]
* Marking node [{0}] as Down
* Marking unreachable node [{0}] as Down

But in fact I don't see anything being logged from that method, so it is as if that local member's status is already Down, Leaving or Exiting. But the fact is up... all of my members are listed as Up except my Website, which is marked as Down.

    if (localMember.Status != MemberStatus.Down && localMember.Status != MemberStatus.Leaving && localMember.Status != MemberStatus.Exiting)
        Downing(localMember.Address);

Was not able to debug it as I don't have the tools on those servers and for some reason the website would not connect to my local box.

Will try again next time it happens.

@Aaronontheweb
Member

Thanks guys - you've both been super detailed. Now that 1.0.5 is out and I've hit some of my other development goals for this quarter, my full focus is coming back to hardening Akka.Cluster and Akka.Remote and eliminating these bugs.

@HakanL

HakanL commented Dec 10, 2015

I like that cluster status page, is that something that's available in Akka.Cluster, or something you built yourself?

@easuter
Contributor

easuter commented Dec 10, 2015

So I've been doing a bit more digging trying to figure out why my "worker" nodes stop logging anything new after a reboot.

I added a few more debug statements and it seems that the worker becomes stuck in JoinSeenNodeProcess.OnReceive() and is endlessly sending new JoinSeenNode (to self) / InitJoin to the "master" node.

With a _log.Debug() after that line, and another at InitJoin() to see the ack being sent, this is the output I get (master on the left, worker on the right):

compute_emulator_akka_cluster_debug_2

So somewhere along the line that InitJoinAck message from the master is getting lost and not making it to the worker.

@cgstevens
Author

@HakanL it is something I built myself. Turned out pretty cool though. I am using AngularJS, SignalR (with SQL Backplane), MVC and then of course Akka!
I created a ClusterStatusActor so that each member can monitor itself and other members, then put our business rules around it so I can have Solarwinds pick up events to restart the service. If a service becomes disassociated it will basically shut itself down and attempt to leave the cluster gracefully.
My webpage asks a member for its ClusterEvent.CurrentClusterState; that member uses ScheduleTell to send the state to the SignalR Hub, which then routes it to any web clients that have subscribed to that member. I love that there is no refresh and I can watch my members join and exit from a specific node's view. This opened the door for us to display real-time work stats for long-running or continuous processes that we are currently developing ;)

I haven't had this happen yet again so I am waiting and hopefully I can get some more details out of it.

I am currently trying to figure out why the heck I keep running into this which kills my tasker....

Akka.Remote.EndpointException: Failed to write message to the transport ---> System.InvalidOperationException: Collection was modified; enumeration operation may not execute.
   at System.ThrowHelper.ThrowInvalidOperationException(ExceptionResource resource)
   at System.Collections.Generic.List`1.Enumerator.MoveNextRare()
   at System.Collections.Generic.List`1.Enumerator.MoveNext()
   at Newtonsoft.Json.Serialization.JsonSerializerInternalWriter.SerializeList(JsonWriter writer, IEnumerable values, JsonArrayContract contract, JsonProperty member, JsonContainerContract collectionContract, JsonProperty containerProperty)
   at Newtonsoft.Json.Serialization.JsonSerializerInternalWriter.SerializeValue(JsonWriter writer, Object value, JsonContract valueContract, JsonProperty member, JsonContainerContract containerContract, JsonProperty containerProperty)
   at Newtonsoft.Json.Serialization.JsonSerializerInternalWriter.SerializeObject(JsonWriter writer, Object value, JsonObjectContract contract, JsonProperty member, JsonContainerContract collectionContract, JsonProperty containerProperty)
   at Newtonsoft.Json.Serialization.JsonSerializerInternalWriter.SerializeValue(JsonWriter writer, Object value, JsonContract valueContract, JsonProperty member, JsonContainerContract containerContract, JsonProperty containerProperty)
   at Newtonsoft.Json.Serialization.JsonSerializerInternalWriter.Serialize(JsonWriter jsonWriter, Object value, Type objectType)
   at Newtonsoft.Json.JsonSerializer.SerializeInternal(JsonWriter jsonWriter, Object value, Type objectType)
   at Newtonsoft.Json.JsonConvert.SerializeObjectInternal(Object value, Type type, JsonSerializer jsonSerializer)
   at Akka.Serialization.NewtonSoftJsonSerializer.ToBinary(Object obj)
   at Akka.Serialization.Serializer.<>c__DisplayClass7_0.<ToBinaryWithAddress>b__0()
   at Akka.Serialization.Serialization.SerializeWithTransport[T](ActorSystem system, Address address, Func`1 action)
   at Akka.Serialization.

@easuter
Contributor

easuter commented Dec 10, 2015

@cgstevens I've only seen that error from Json.NET before when it is trying to serialize a collection which someone else is modifying at the same time. Odd this should happen in an actor...perhaps a Task that was allowed to run at the same time as the actor is processing new messages?

EDIT:

I'm also going to have to steal your idea in the future...that cluster dashboard looks awesome :)

@cgstevens
Author

@easuter
Steal away... if it wasn't wrapped up in my business app right now I would share it, but it really shouldn't be hard for you to do... it is just unwrapping that ClusterEvent.CurrentClusterState object, making it presentable and then wiring up the Leave and Down events.

The serialization error... I went as far as making the only 2 lists I have in that actor to be ConcurrentDictionary and then doing a .ToList when looping and then creating a new Message to send and I still get this issue so right now I am confused. I am doing something screwy... :(

@Aaronontheweb
Member

@easuter @cgstevens yep, that's definitely odd - super unlikely that it's something internal to Akka.Cluster doing that. We use immutable collections everywhere. Check to see if you have a user-defined message with a collection that is being modified.

@HakanL

HakanL commented Dec 10, 2015

@cgstevens could you break it out and put it in a separate github repo, or you can't break it out (at some point)? It would be nice to have that as a starting point, including your UI.

@HakanL

HakanL commented Dec 10, 2015

Btw, I also have issue with my cluster going berserk when a node drops out (unreachable) and trying to re-join. And if my cluster is in that state then I can't re-join any new nodes so I have to restart the whole cluster. I don't have it isolated as nicely as OP, but I'm reading these comments to see how I can best report/isolate it. I seem to have messages about the leader being unreachable as well, could that cause issues? I was hoping the cluster would just vote for a new leader, so there's not a single point of failure, have I misunderstood the leader mechanism?

@Aaronontheweb
Member

sidebar here. I think there's a potential bug on this thread but I'm also hearing some conceptual whiffs.

@HakanL in that case, the issue is that the Unreachable nodes aren't also being Down-ed from the cluster, which you have to do manually right now. Akka.Cluster's failure detection mechanism relies on a human being to intercede when a node goes permanently offline in an unexpected manner - Apache Cassandra, Riak, and all other Dynamo-based cluster implementations work this way.

Most developers incorrectly assume the cluster will automatically kick out unreachable nodes without them having to do anything. This is wrong and, ironically, not partition-tolerant. I think I'm going to need to write a blog post explaining this because this issue comes up a lot - but if you want nodes to be removed from the cluster automatically without you doing anything, then you need to turn auto-down=on inside your cluster.

We consider auto-down to be harmful though, because in the event of temporary network disruption your cluster will fragment into a bunch of small mini-clusters and never reform. This is called a split brain.

The default behavior is to put the cluster into a state where the leader can't achieve convergence until the unreachable nodes are all back in communication or are manually downed by a user via the Cluster.Down method.

There are some algorithms out there, more sophisticated than the current auto-down strategy, that can perform more accurate failure detection and resolution analysis than our current manual system, but they're still very much at the cutting edge of computer science.

@cgstevens
Author

@HakanL Yes, I had the same issue with my cluster going berserk and that is why I ended up creating this monitor so that I can watch it. This allows me to kick members, which goes through a process of logging it so that Solarwinds will start them back up. I am trying to make sure that the services will recover if something happens, to avoid human intervention.
I had bad luck with auto-down=on, so in my ClusterStatus actor I am monitoring unreachable members.
If I have an unreachable member it will Cluster.Down it after 120 seconds. This seems to keep my cluster from going berserk. It will also monitor itself by checking whether it sees the correct members to be a cluster. If it does not meet this after x time, it decides that it can't perform its work, does a Cluster.Leave and shuts down the service or website. The deciding factors can be: a missing Lighthouse, or being the only member in the cluster. In the case where it goes berserk it will actually just correct itself.

NOTE: The Solarwinds piece is still a work in progress, so the starting of the services or website doesn't work yet.

Now with all of that I have only run into this issue a handful of times.
Sounds like you have been able to reproduce it a lot. One thing I did find: when I connect from my local computer to my dev environment, the cluster freaks out even more due to the network dropping packets and whatnot. All of those members start to freak out because I am unreachable.
So if I keep it on that network segment the cluster runs more smoothly, and it might be better once we go from Dev to Prod.

@rogeralsing
Contributor

Any chance you can share something about the Solarwinds monitor?
We use solarwinds at work, so it might be useful :)


@HakanL

HakanL commented Dec 10, 2015

@cgstevens @Aaronontheweb good info, thank you. First I'm sorry if this is not the right spot to discuss this, and I realize I'm a noob when it comes to Akka/Akka.Net. In my cluster I have one "master" and 3-4 "slaves". The master has its own role, and the slaves all share the same role. I also tried to add a Lighthouse. The slaves run on RPi (not that it should matter too much). The master is actually running in VS in debug (it's my controller for my xmas lights, so I restart it often while I develop the controller features). So my "master" (realizing there isn't a "master" in the cluster, but in my setup it's the master of the system) goes in and out of unreachable often, but I don't need to take it permanently down (if I understood that concept correctly). But even with this, my cluster goes berserk quite often: nodes that have been reachable all along are flagged as unreachable and the logs fill with cluster messages. I'm not sure how to isolate this for a bug report/fix, or if it's something that I've misconfigured (very possible). I appreciate any advice on approach.

@cgstevens
Author

@rogeralsing
This isn't in place yet and we are supposed to start working on that piece in the next week.
Basically my understanding, so correct me if I am wrong, is that we want to have Solarwinds monitor the Eventlog and/or services and use powershell scripts to start them back up based on the type of events being logged.

@HakanL
If you are taking out your master, which sounds like it is your seed node, and you don't have another seed node, then your cluster can't form again and you have caused it to partition. If you used a lighthouse as the seed node instead of your master, your master could leave and join via that seed node.
This is why I have 2 lighthouses and it is recommended to have 2 or more for this reason. At least that is how I understand it....

@HakanL

HakanL commented Dec 10, 2015

@cgstevens I have a lighthouse (just one though) which is the seed for all nodes. When I say master, it's just master in my own system, there's nothing "master-ery" about it as far as Akka.net-cluster is concerned. Maybe I'm wrong, but it seems that if that node becomes the leader when it runs then everything is going haywire when it's restarted (but I haven't actually determined that that is the case, just a theory). I'd prefer to keep the lighthouse as a leader, but as far as I can tell there's no place to indicate that a node should be a leader (or better, a node should not be considered a leader may make more sense). On the other hand, I don't add/remove nodes to my cluster, same machines, same network name/ip, same port number, it's just that (at least one) node(s) become unreachable sometimes. Does the leader have to be up to accept a node going back to reachable?

@cgstevens
Author

@HakanL
No new members can join your cluster if you have any that are in a weird state because all members have to vote that member in. Your cluster leader is elected and it can really be anyone. I would love for my leader to always be one of my lighthouses but in the case last night I came in and I have 3 members that downed themselves due to UNREACHABLE. So looking into that now.

Akka.Cluster.ClusterCoreDaemon  [[akka://MyService/system/cluster/core/daemon]] - Cluster Node [akka.tcp://MyService@1.1.9.34:59991] - Marking node(s) as UNREACHABLE [Member(address = akka.tcp://MyService@1.1.9.34:4053, status = Up, Member(address = akka.tcp://MyService@1.1.9.41:61111, status = Up]

This is the first logged event and has been one of my greatest pains. The ServiceTasker is marking my Lighthouse (same server) and Worker (different server) as unreachable, then the Worker and Lighthouse report the ServiceTasker as UNREACHABLE. It is NOT always the same member; it can be any of them.
I started up my Lighthouse and it joined and took over as the Cluster Leader, whereas one of the Websites had been leader after it became unreachable... Nice 👍
Started up the other Worker and Tasker which joined just fine and everything started to work again.

Now... These unreachable members were triggered to be Cluster.Down by the other members that were still running, which caused the services on these unreachable members to stop. It was supposed to take 2 minutes, but that is obviously not what happened, since the timestamps of the bad member becoming unreachable and the good member telling it to leave the cluster are only 5 seconds apart. I am reporting that it was 120 seconds, but perhaps I have a bug there... Going to play with that today...
At a quick look I found I am missing the handling for when a member becomes reachable again, based on the ClusterEvent.ReachableMember message, and will add that.
Also glad that I found I need to check the ClusterEvent.CurrentClusterState for unreachable members and remove the ones I stored 👎
I am tracking how long they are unreachable and other states so I have more info to make a decision if I just need to restart or someone really needs to look at the cluster and fire off an email or what not.
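
The bookkeeping described above (remember when a member became unreachable, forget it when it becomes reachable again, and only down it once the timeout has really elapsed) could be sketched roughly as below. This is plain C# with no Akka.NET dependency. `UnreachableTracker` and all of its members are hypothetical names, addresses are simplified to strings, and `Reconcile` is only one reading of "remove the ones I stored", so treat it as an illustration rather than the monitor's actual code.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical helper, not part of Akka.NET: remembers when each member
// was first seen as unreachable and reports the ones overdue for downing.
public sealed class UnreachableTracker
{
    private readonly Dictionary<string, DateTime> _since =
        new Dictionary<string, DateTime>();
    private readonly TimeSpan _timeout;

    public UnreachableTracker(TimeSpan timeout)
    {
        _timeout = timeout;
    }

    // Intended for ClusterEvent.UnreachableMember. The first timestamp is
    // kept, so a repeated report does not restart the clock.
    public void MarkUnreachable(string address, DateTime now)
    {
        if (!_since.ContainsKey(address))
            _since[address] = now;
    }

    // Intended for ClusterEvent.ReachableMember and MemberRemoved.
    public void Clear(string address)
    {
        _since.Remove(address);
    }

    // Drops stored entries that the current cluster state no longer
    // lists as unreachable.
    public void Reconcile(IEnumerable<string> currentlyUnreachable)
    {
        var live = new HashSet<string>(currentlyUnreachable);
        foreach (var stale in _since.Keys.Where(a => !live.Contains(a)).ToList())
            _since.Remove(stale);
    }

    // Members that have been unreachable for at least the timeout.
    public IReadOnlyList<string> DueForDowning(DateTime now)
    {
        return _since.Where(kv => now - kv.Value >= _timeout)
                     .Select(kv => kv.Key)
                     .OrderBy(a => a, StringComparer.Ordinal)
                     .ToList();
    }
}
```

An actor would call `DueForDowning(DateTime.UtcNow)` on a timer and pass each result to Cluster.Down. Because a member that became reachable again is cleared first, a 5-second blip would not lead to a down.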

I have not actually had my full cluster run for more than a day... this either happens or I need to deploy an update or blah blah blah... Hopefully this change will help and prevent it from prematurely shutting down.

@HakanL

HakanL commented Dec 15, 2015

Thanks for the additional information. My cluster actually stays up if I leave it alone, but I picked akka.net cluster because I wanted to have flexibility on taking nodes up/down and not have to hard-code IP addresses. I wish I could see what it would take to help out with a fix, but there's a lot I'm not completely understanding in the cluster messages, like why does the log line say a node is unreachable, but the Status=Up?
At least I can make it all run by just restarting my whole cluster (which means SSH into 3 machines and manually taking everything down and then back up again, not great), and once it's up it seems to stay up. But I'd love to help out with a fix so that
a) I can bounce a node any time I want and the cluster would not become unstable, it should just not deliver messages to that node until it's back up.
b) prevent a node from being a leader (or have a preferred leader setting). I know which nodes are "stable" (lighthouse) and which ones are not (my "master" since I run it in debug in VS).

Finally I'd love to get @cgstevens cluster status monitoring site built into lighthouse... :)

@cgstevens
Author

@HakanL thanks for the love of my status page. Honestly with all of the work I have put into this my manager wants me to do a Presentation on Akka.Net at CodeStock next year.
Once I get our business application done I plan on gutting it and making a smaller demo version so that anyone can grab my code. It is cool to watch the members all change when you restart a service. Though I have found it interesting that the "SeenBy" members change when I restart a service. Not sure yet what that means as I haven't looked at that part of the code yet.

@easuter
I finally found my issue which was causing my Tasker to fail when sending status updates.
This error "System.InvalidOperationException: Collection was modified; enumeration operation may not execute." was actually being caused by my message containing an IActorRef.
Basically I was just doing a .Copy() which just returned a new object. I added a simple DeepClone which uses a BinaryFormatter to serialize and deserialize. That required me to put [Serializable] on my classes, and I then had to put [NonSerialized()] on that IActorRef property. All I need is the path, so I just created a string to pass along instead of the whole object. Since I made this change I no longer have errors and my Tasker has actually stayed up for a couple of days now.
Honestly I don't understand how that causes the error that I was seeing, and had been seeing for the past 2 months... The error happened only 2-3 times an hour but would cause my Tasker to become unreachable. I knew which collection it was occurring with and I refactored it in so many different ways, including making it a ConcurrentDictionary, casting it out to a new List and then copying the object to send.
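
One way to rule out shared mutable state in a message, in line with the earlier advice to check user-defined messages for collections that are being modified, is to build each message from a private snapshot of the collection and to carry the actor path as a plain string. The sketch below is plain C#. `StatusUpdate`, `WorkerPath` and `Jobs` are hypothetical names that do not come from the application discussed here, and this is not a confirmed explanation of the error.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical status message: immutable once constructed.
public sealed class StatusUpdate
{
    public StatusUpdate(string workerPath, IEnumerable<string> jobs)
    {
        WorkerPath = workerPath;
        // ToArray() copies the elements now, so later changes to the
        // caller's collection cannot affect this message while a
        // serializer is enumerating it.
        Jobs = Array.AsReadOnly(jobs.ToArray());
    }

    // Actor path carried as a string instead of an IActorRef.
    public string WorkerPath { get; }

    // Read-only view over the private copy.
    public IReadOnlyList<string> Jobs { get; }
}
```

A message built from a two-item list still reports two jobs after the sender adds a third item to its own list, so whatever serializes the message never enumerates a collection that another thread is changing.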

Now if I can just figure out this ticket's issue and get my member to actually be removed from the cluster even though it is down, I will be golden!

@cgstevens
Author

@HakanL or anyone else... I also have the following, which is my original one. It is pretty simple, and I can remove the other tabs and clean it up, but it will give you the same thing except that it is done in WPF and not Angular/SignalR. It has all the same features for forcing members to leave or be downed, including itself, by just clicking on the row and clicking leave. Pretty cool 👍
Give me this week to export it and put it on GitHub if anyone is interested. The reason I use my website instead is that when I have this app join my cluster, my cluster freaks out!!!! The website is on the same network segment, so the cluster is happier.
As you can see, as soon as I connect my tester, one of my websites is removed from the cluster because that website keeps getting a "Tried to associate with unreachable remote address" error and then eventually just fails... Soon all of them will crash due to the same thing 👎
So basically my intention of monitoring with a Windows application that can join and connect to a cluster to view what is going on failed due to network issues that are out of my control. So I put the monitor on the same server and the issue went away. I manually started the website back up and it rejoined just fine.

image

image

@easuter
Contributor

easuter commented Dec 15, 2015

@cgstevens ah, that's good to know! AFAIK IActorRefs should be serializable; my own cluster does have messages which include IActorRefs and I haven't run into this issue.

IIRC older versions of Json.NET could fail to serialize objects if their classes were decorated with those types of attributes you mentioned, didn't think it was still a problem now.

Regarding that monitoring tool, thanks for taking the time to make it public! :) I'll take a peek once it's out.

Some time ago I was thinking about creating a management tool that doesn't actually join the cluster since I don't want something that is essentially a "client" to actually behave like a full-blown node...but at that time I ran into this issue: #748

Don't know if the problem is still present in 1.0.5 though.

Regarding the actual problem reported in this thread I haven't made any progress since my last tests...had too much other work to be digging into Akka's internals. I'll take another shot at this as soon as I can though.

@HakanL

HakanL commented Dec 15, 2015

Very cool, yes please share! I'm with you, I want to be able to run a client like this, join the cluster and then later drop out without it causing any problems with the cluster. I'm ok that I have to force-remove that node if I want it out of lists, but it shouldn't mess up the cluster.

@cgstevens
Author

@HakanL and @easuter
I put this together this morning.... Hopefully it works! It contains the monitor app and a lighthouse service running topshelf for a demo.
https://github.com/cgstevens/Akka.Cluster.Monitor

@HakanL

HakanL commented Dec 16, 2015

Cool, will check it out!

@cgstevens
Author

@HakanL @Aaronontheweb
You said;
But I'd love to help out with a fix so that
a) I can bounce a node any time I want and the cluster would not become unstable, it should just not deliver messages to that node until it's back up.

I will have to say that you need to make sure your members are cleanly leaving the cluster.
Example with the current lighthouse when running in interactive mode:
https://github.com/petabridge/lighthouse/blob/master/src/Lighthouse/LighthouseService.cs#L41

The only way this service exits the cluster cleanly is when it runs as a service. Which is fine, but during testing it will become unreachable if you are in Interactive Mode.

So if you look at what I added to mine, I can click on exit when running as an application and it will still leave the cluster gracefully. Anything you connect to the cluster needs to leave the cluster gracefully, or it just screws things up and then you have to be able to tell it to leave or be downed.
https://github.com/cgstevens/Akka.Cluster.Monitor/blob/master/Lighthouse/Program.cs#L50
You could do something similar in your other services if you are running them in interactive mode.

Here I make sure the windows application leaves the cluster.
https://github.com/cgstevens/Akka.Cluster.Monitor/blob/master/Akka.Cluster.Monitor/Main.cs#L21

Then in the website add it to the Global.asax file for Application_Stop()
NOTE: I have been having issues with just restarting the website. I find that I have to restart the AppPool as well, as it doesn't always work right when just restarting the website. Still investigating that part...

Basically if you don't tell your cluster member to leave when it shuts down then all the other members will freak out and you will not be able to have another member join until that member has been downed.
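
A shutdown hook along these lines could look roughly like the sketch below. It is not the code from the linked repositories. It assumes an Akka.NET version that provides Cluster.RegisterOnMemberRemoved and ActorSystem.Terminate (as far as I know 1.1 and later; on 1.0.x the shutdown call was ActorSystem.Shutdown()). `GracefulExit` and `LeaveAndShutdown` are hypothetical names.

```csharp
using System;
using System.Threading.Tasks;
using Akka.Actor;

// Hypothetical helper: call it from a service Stop handler, a window
// closing event, or Application_End in a website.
public static class GracefulExit
{
    public static void LeaveAndShutdown(ActorSystem system, TimeSpan timeout)
    {
        var cluster = Akka.Cluster.Cluster.Get(system);
        var removed = new TaskCompletionSource<bool>();

        // Completes once this node has been removed from the cluster.
        cluster.RegisterOnMemberRemoved(() => removed.TrySetResult(true));

        // Ask the cluster to move this node through Leaving and Exiting.
        cluster.Leave(cluster.SelfAddress);

        // Bounded wait, so a broken cluster cannot block shutdown forever.
        removed.Task.Wait(timeout);

        // Stop the actor system whether or not removal was confirmed.
        system.Terminate().Wait(timeout);
    }
}
```

The wait is bounded on purpose. If the removal is never confirmed, the process still exits, and the node then has to be downed by another member as described above.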

Nice thing about the ClusterMonitor is that it can down an unreachable member such as itself before joining another one to the cluster. Test this by launching 2 ClusterMonitors and ending task, NOT closing and then start up another ClusterMonitor. This will be in a join state. Down the unreachable member and watch the joining one become up.

Hope this helps!

@cgstevens
Author

@HakanL and @easuter

Added a new project called Worker which has a WebApiEndpoint. Creating services similar to this will allow you to view that member's current state. You can access the ClusterEvent.CurrentClusterState by hitting the endpoint http://localhost:8080/api/ClusterStatus/1. This opens the door to sending a Leave/Down message to that member through that API. With this access to the member, my ClusterMonitor has no need to actually be part of the cluster just to view its state; it can instead request the state from a specific member. This disconnects the client from the cluster, which perhaps avoids any interruptions within the cluster.
Start all of the projects up, then open up your favorite browser and navigate to http://localhost:8080/api/ClusterStatus/1. This will give you the JSON results of the worker member.

@easuter
Contributor

easuter commented Dec 17, 2015

@cgstevens this is awesome, thanks so much!

I'll probably drop this into my next project test-run :)

@cgstevens
Author

@easuter nice! I updated the notes about the website project I will be adding in the next few days.

@Aaronontheweb
Member

Give this a try with the latest nightly build - my recent changes in #1596 should have resolved this.

http://getakka.net/docs/akka-developers/nightly-builds

@easuter
Contributor

easuter commented Jan 8, 2016

@Aaronontheweb awesome, will try this ASAP!

Do I need to install the nightly of the entire stack (Akka, Akka.Remote, Akka.Cluster) or will only Akka.Cluster suffice?

@Aaronontheweb
Member

@easuter you need to install the nightly of Akka.Remote specifically - that's where the issue was.

@cgstevens
Author

Thanks Aaron!

@easuter
Contributor

easuter commented Jan 10, 2016

@Aaronontheweb I'm using the Akka+Akka.Remote nightly and so far things are looking good. It's still early to tell if the problem is completely solved, so I'll post an update later on when I can confirm that the fix works for sure!

Again, thanks very much for your work on this!

@garrardkitchen

@cgstevens My cluster is also getting bent out of shape if a node stops (e.g. IIS app pool recycled or times out) or one of several win services are restarted. None of these can rejoin. Causing major headache! Are you still experiencing this or has one of the nightly builds rectified this now?

Anyways, saw this while going through your code:

Application_Stop() in your Global.asax file

Shouldn't this be Application_End()?

@Aaronontheweb
Member

This is fixed - confirmed by our Akka.Remote MNTR specs and fixes to the EndpointManager as well as our cluster node restart specs.

@gkolok

gkolok commented Aug 31, 2017

Hi, in which version was the issue fixed?

@Danthar
Member

Danthar commented Aug 31, 2017

1.1.0 I believe.
