Member not being removed to allow new incarnation of that member to join. #1458
Comments
Thanks @cgstevens - this is extremely detailed, which I appreciate. We'll be looking into this shortly! |
No problem for the details. My cluster is still running and I want to see if I can find more detail when in debug mode. |
I should note that I am running 1.0.4.12-beta. Here are the messages I get with DEBUG enabled on a lighthouse. I was able to force that lighthouse to Leave using my tool, update the config, and start the service back up so that it rejoined the cluster.
|
@cgstevens have you had any luck with your problem? I experienced (maybe) the same thing today (sorry for the SO post, too tired to write up something else here): http://stackoverflow.com/questions/34145111/node-doesnt-rejoin-cluster-after-being-downed Not sure if it's exactly the same thing, but it's suspiciously similar... |
@cgstevens @easuter can both of you try this with 1.0.5? Just came out late last week. |
ah, you already are @easuter |
I'll take this one on - probably covered by a multi-node test we haven't ported yet. |
@Aaronontheweb thanks! Please let me know if there is any more information I can provide. |
I have just completed the upgrade from 1.0.4 to 1.0.5. |
@Aaronontheweb Taking a peek at the code finally... After logging "New incarnation of existing member [{0}] is trying to join...", it should call the method Downing(localMember.Address). But I don't see anything being logged from that method, so it is as if the local member's status is already Down, Leaving or Exiting. In fact it is Up... all of my members are listed as Up except my website, which is marked as Down.
Was not able to debug it as I don't have the tools on those servers and for some reason the website would not connect to my local box. Will try again next time it happens. |
Thanks guys - you've both been super detailed. Now that 1.0.5 is out and I've hit some of my other development goals for this quarter, my full focus is coming back to hardening Akka.Cluster and Akka.Remote and eliminating these bugs. |
I like that cluster status page, is that something that's available in Akka.Cluster, or something you built yourself? |
So I've been doing a bit more digging trying to figure out why my "worker" nodes stop logging anything new after a reboot. I added a few more debug statements and it seems that the worker becomes stuck in With a So somewhere along the line that |
@HakanL it is something I built myself. Turned out pretty cool though. I am using AngularJS, SignalR (with SQL Backplane), MVC and then of course Akka! I haven't had this happen again yet, so I am waiting and hopefully I can get some more details out of it. I am currently trying to figure out why the heck I keep running into this, which kills my tasker....
|
@cgstevens I've only seen that error from Json.NET before when it is trying to serialize a collection which someone else is modifying at the same time. Odd this should happen in an actor...perhaps a EDIT: I'm also going to have to steal your idea in the future...that cluster dashboard looks awesome :) |
@easuter The serialization error... I went as far as making the only 2 lists I have in that actor to be ConcurrentDictionary and then doing a .ToList when looping and then creating a new Message to send and I still get this issue so right now I am confused. I am doing something screwy... :( |
@easuter @cgstevens yep, that's definitely odd - super unlikely that it's something internal to Akka.Cluster doing that. We use immutable collections everywhere. Check to see if you have a user-defined message with a collection that is being modified. |
@cgstevens could you break it out and put it in a separate github repo, or you can't break it out (at some point)? It would be nice to have that as a starting point, including your UI. |
Btw, I also have issue with my cluster going berserk when a node drops out (unreachable) and trying to re-join. And if my cluster is in that state then I can't re-join any new nodes so I have to restart the whole cluster. I don't have it isolated as nicely as OP, but I'm reading these comments to see how I can best report/isolate it. I seem to have messages about the leader being unreachable as well, could that cause issues? I was hoping the cluster would just vote for a new leader, so there's not a single point of failure, have I misunderstood the leader mechanism? |
sidebar here. I think there's a potential bug on this thread but I'm also hearing some conceptual whiffs. @HakanL in that case, the issue is the Most developers incorrectly assume the cluster will automatically kick out unreachable nodes without them having to do anything. This is wrong and ironically, not partition-tolerant. I'm going to need to write a blog post explaining this I think because this issue comes up a lot - but if you want the ability to automatically have nodes leave the cluster without you doing anything then you need to turn We consider The default behavior is to put the cluster into a state where the leader can't achieve convergence until the unreachable nodes are all back in communication or are manually downed by a user via the There are some algorithms out there that are more sophisticated than the current auto-down strategy that can perform more accurate failure detection and resolution analysis than our current manual system, but they're still very much at the cutting edge of computer science. |
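For readers following along: in Akka.Cluster of this era, the auto-down strategy mentioned above is configured through the HOCON setting `akka.cluster.auto-down-unreachable-after`, which is off by default. A minimal fragment is sketched below. The 120s value is only an illustration, borrowed from the timeout discussed elsewhere in this thread, not a recommendation.

```hocon
akka {
  cluster {
    # Off by default: unreachable members stay in the cluster until
    # they become reachable again or are explicitly downed.
    # Setting a duration lets the leader down them automatically.
    auto-down-unreachable-after = 120s
  }
}
```

Auto-downing trades partition tolerance for convenience, since a network split can leave each side downing the other, so it should be enabled deliberately.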
@HakanL Yes, I had the same issue with my cluster going berserk, and that is why I ended up creating this monitor so that I can watch it. This allows me to kick the members, which goes through a process of logging it so that Solarwinds will start it back up. I am trying to make sure that the services will recover if something happens, to avoid human intervention. NOTE: The Solarwinds piece is still a work in progress, so the starting of the services or website doesn't work yet. Now with all of that, I have only run into this issue a handful of times. |
Any chance you can share something about the Solarwinds monitor?
|
@cgstevens @Aaronontheweb good info, thank you. First, I'm sorry if this is not the right spot to discuss this, and I realize I'm a noob when it comes to Akka/Akka.Net. In my cluster I have one "master" and 3-4 "slaves". The master has its own role; the slaves all share the same role. I also tried to add a Lighthouse. The slaves run on RPi (not that it should matter too much). The master is actually running in VS in debug (it's my controller for my xmas lights, so I restart it often while I develop the controller features). So my "master" (realizing there isn't a "master" in the cluster, but in my setup it's a master of the system) goes in/out of unreachable often, but I don't need to take it permanently down (if I understood that concept correctly). But even with this, my cluster goes berserk quite often: nodes that have been reachable all along are flagged as unreachable and logs fill with cluster messages. I'm not sure how to isolate this for a bug report/fix, or if it's something that I've misconfigured (very possible). I appreciate any advice on approach. |
@rogeralsing @HakanL |
@cgstevens I have a lighthouse (just one though) which is the seed for all nodes. When I say master, it's just master in my own system, there's nothing "master-ery" about it as far as Akka.net-cluster is concerned. Maybe I'm wrong, but it seems that if that node becomes the leader when it runs then everything is going haywire when it's restarted (but I haven't actually determined that that is the case, just a theory). I'd prefer to keep the lighthouse as a leader, but as far as I can tell there's no place to indicate that a node should be a leader (or better, a node should not be considered a leader may make more sense). On the other hand, I don't add/remove nodes to my cluster, same machines, same network name/ip, same port number, it's just that (at least one) node(s) become unreachable sometimes. Does the leader have to be up to accept a node going back to reachable? |
@HakanL
This is the first logged event and has been one of my greatest pains. The ServiceTasker is marking my Lighthouse (same server) and Worker (different server) as unreachable, then the Worker and Lighthouse report the ServiceTasker as UNREACHABLE. It is NOT always the same member; it can be any of them. Now... These unreachable members were triggered to be Cluster.Down by the other members that were still running, which caused the unreachable members' services to stop. The delay was supposed to be 2 minutes, but that is obviously wrong, since the timestamps are only 5 seconds apart between the bad member becoming unreachable and the good member telling it to leave the cluster. I am reporting that it was 120 seconds, but perhaps I have a bug there... Going to play with that today... I have not actually had my full cluster run for more than a day... this either happens or I need to deploy an update or blah blah blah... Hopefully this change will help and prevent it from prematurely shutting down. |
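A small sketch of the "down after N seconds unreachable" bookkeeping described above, written in Python for brevity rather than C#. The class and method names are hypothetical; this is not the ClusterStatus actor from this thread and not an Akka.NET API. It only illustrates one way to make sure the clock starts when a member first becomes unreachable, is not restarted by repeated unreachable events, and is cleared when the member becomes reachable again.

```python
class UnreachableTracker:
    """Hypothetical helper: remembers when each member first became unreachable."""

    def __init__(self, down_after_seconds=120.0):
        self.down_after_seconds = down_after_seconds
        self._unreachable_since = {}  # address -> time of first unreachable event

    def on_unreachable(self, address, now):
        # setdefault keeps the FIRST timestamp, so repeated unreachable
        # events for the same member do not restart the countdown.
        self._unreachable_since.setdefault(address, now)

    def on_reachable(self, address):
        # The member recovered, so forget its countdown.
        self._unreachable_since.pop(address, None)

    def on_removed(self, address):
        # The member left the cluster, so there is nothing left to down.
        self._unreachable_since.pop(address, None)

    def members_to_down(self, now):
        # Only members unreachable for at least the full window qualify.
        return sorted(
            address
            for address, since in self._unreachable_since.items()
            if now - since >= self.down_after_seconds
        )
```

In an actor, `now` would come from a monotonic clock and `members_to_down` would be polled from a scheduled tick. Passing the time in explicitly keeps the logic easy to unit test, which helps when chasing a "5 seconds instead of 120" discrepancy like the one above.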
Thanks for the additional information. My cluster actually stays up if I leave it alone, but I picked akka.net cluster because I wanted to have flexibility on taking nodes up/down and not have to hard-code IP addresses. I wish I could see what it would take to help out with a fix, but there's a lot I'm not completely understanding in the cluster messages, like why does the log line say a node is unreachable, but the Status=Up? Finally I'd love to get @cgstevens cluster status monitoring site built into lighthouse... :) |
@HakanL thanks for the love of my status page. Honestly, with all of the work I have put into this, my manager wants me to do a presentation on Akka.Net at CodeStock next year. @easuter Now if I can just figure out this ticket's issue and get my member to actually be removed from the cluster even though it is down, I will be golden! |
@HakanL or anyone else... I do have the following as well, which is my original one. This is pretty simple and I can remove the other tabs and clean it up, but this will give you the same thing except this is done in WPF and not Angular/SignalR. It has all the same features to force members to leave or be downed, including itself, by just clicking on the row and clicking leave. Pretty cool 👍 |
@cgstevens ah, that's good to know! AFAIK IActorRefs should be serializable; my own cluster does have messages which include IActorRefs and I haven't run into this issue. IIRC older versions of Json.NET could fail to serialize objects if their classes were decorated with those types of attributes you mentioned; I didn't think it was still a problem now. Regarding that monitoring tool, thanks for taking the time to make it public! :) I'll take a peek once it's out. Some time ago I was thinking about creating a management tool that doesn't actually join the cluster, since I don't want something that is essentially a "client" to actually behave like a full-blown node...but at that time I ran into this issue: #748 Don't know if the problem is still present in 1.0.5 though. Regarding the actual problem reported in this thread, I haven't made any progress since my last tests...had too much other work to be digging into Akka's internals. I'll take another shot at this as soon as I can though. |
Very cool, yes please share! I'm with you, I want to be able to run a client like this, join the cluster and then later drop out without it causing any problems with the cluster. I'm ok that I have to force-remove that node if I want it out of lists, but it shouldn't mess up the cluster. |
@HakanL and @easuter |
Cool, will check it out! |
@HakanL @Aaronontheweb I will have to say that you need to make sure your members are cleanly leaving the cluster. This service only exits the cluster cleanly when it runs as a service. Which is fine, but during testing it will become unreachable if you are in Interactive Mode. So if you look at what I added to mine, I can click on exit when running as an application and it will still leave the cluster gracefully. Anything you connect to the cluster needs to leave the cluster gracefully or it just screws things up, and then you have to be able to tell it to leave or be down. Here I make sure the Windows application leaves the cluster. Then in the website add it to the Global.asax file for Application_Stop() Basically, if you don't tell your cluster member to leave when it shuts down then all the other members will freak out and you will not be able to have another member join until that member has been downed. Nice thing about the ClusterMonitor is that it can down an unreachable member such as itself before joining another one to the cluster. Test this by launching 2 ClusterMonitors and ending task, NOT closing, and then start up another ClusterMonitor. This will be in a joining state. Down the unreachable member and watch the joining one become Up. Hope this helps! |
Added a new project called Worker which has a WebApiEndpoint. Creating services similar to this will allow you to view that member's current state. This will allow you to access the ClusterEvent.CurrentClusterState by hitting the endpoint http://localhost:8080/api/ClusterStatus/1 It also opens the door to being able to send a Leave/Down message to that member through that API. With this access to the member, my ClusterMonitor has no need to actually be part of the cluster just to view its state, but instead can request the state from a specific member. This keeps the client disconnected from the cluster, which perhaps avoids any interruptions within it. |
@cgstevens this is awesome, thanks so much! I'll probably drop this into my next project test-run :) |
@easuter nice! I updated the notes about the website project I will be adding in the next few days. |
Give this a try with the latest nightly build - my recent changes in #1596 should have resolved this. |
@Aaronontheweb awesome, will try this ASAP! Do I need to install the nightly of the entire stack (Akka, Akka.Remote, Akka.Cluster) or will only Akka.Cluster suffice? |
@easuter you need to install the nightly of Akka.Remote specifically - that's where the issue was. |
Thanks Aaron! |
@Aaronontheweb I'm using the Akka+Akka.Remote nightly and so far things are looking good. It's still early to tell if the problem is completely solved, so I'll post an update later on when I can confirm that the fix works for sure! Again, thanks very much for your work on this! |
@cgstevens My cluster is also getting bent out of shape if a node stops (e.g. IIS app pool recycled or times out) or one of several win services are restarted. None of these can rejoin. Causing major headache! Are you still experiencing this or has one of the nightly builds rectified this now? Anyways, saw this while going through your code: Application_Stop() in your Global.asax file Shouldn't this be Application_End()? |
This is fixed - confirmed by our Akka.Remote MNTR specs and fixes to the |
Hi, in which version was the issue fixed? |
1.1.0 I believe. |
I have 2 websites that are part of my cluster. Sometimes one website will become unreachable, and my logic for all of the other members determines that after 120 seconds they should do a Cluster.Down(ThatWebsiteAddress). I can see that all of the members get a MemberRemoved for that website and then it is reported as Down.
My problem is that when my other service, Solarwinds, detects that this website is down and has been removed from the cluster, it tries to restart that website. I can see it trying to join and I get this message: [[akka://MyService/system/cluster/core/daemon]] - New incarnation of existing member [UniqueAddress: (akka.tcp://MyService@1.1.9.2:57771, 303375918)] is trying to join. Existing will be removed from the cluster and then new member will be allowed to join.
But the website member never gets removed from the cluster so the member that is coming back up is never able to join. I can try to have a member .Leave(ThatWebsiteAddress) and do another .Down(ThatWebsiteAddress) but it never gets removed. Basically I have a ClusterStatus actor that monitors the status of the cluster from its view and determines if it needs to restart itself or if a member has been unreachable for x seconds to down it. If so it shuts that service down and logs to the event log for Solarwinds to determine if the service or website needs to be started back up.
In order for me to recover from this issue I have to restart the entire cluster. Note... This has happened to one of my services so it isn't website specific but happens more on the websites as they become unreachable more often.
If you are interested here is a snapshot of my cluster status page which shows the member being down and never gets removed. When I view the cluster information about the website there are no members or cluster leader.
I can force a down on a ServiceWorker, which will cause it to be removed from the cluster just fine. When it started back up, it joined the cluster just fine.
Here you can see I downed the service worker.
Here you can see the service worker joining the cluster after being detected that it was down.... still the website isn't joining...