Support master election #1542
I think the harder problem here is failover. Most quorum-based consensus protocols can steer through a member outage by running rounds of leader election (e.g. etcd, ZooKeeper). I think a simple petset + some form of leader-aware Service should suffice here (#28443, #10456). However, classic master/slave does not fail over gracefully. Managing such applications usually requires a babysitter. Take MySQL, for example: the babysitter needs to be aware of the master's log position and make sure it doesn't promote a lagging slave. At a high level the steps might involve:
So assuming we implement a leader-aware Service, what other patterns help master/slave failover? Or should we just leave everything to the babysitter? @jberkus now that we have basic PetSet, maybe you can give your views on Postgres failover? @kubernetes/sig-apps |
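The "don't promote a lagging slave" rule described above could be sketched roughly as follows. This is a toy illustration, not a real babysitter: `Replica`, `pickCandidate`, and the byte-offset lag model are all hypothetical stand-ins for real replication state.

```go
package main

import "fmt"

// Replica is a hypothetical view of one slave as the babysitter sees it.
type Replica struct {
	Name        string
	LogPosition int64 // bytes of the master's log this replica has applied
}

// pickCandidate returns the most caught-up replica, refusing promotion when
// even the best candidate lags the last-known master position by more than
// maxLag. This mirrors the "don't promote a lagging slave" rule.
func pickCandidate(replicas []Replica, masterPos, maxLag int64) (Replica, bool) {
	var best Replica
	found := false
	for _, r := range replicas {
		if !found || r.LogPosition > best.LogPosition {
			best, found = r, true
		}
	}
	if !found || masterPos-best.LogPosition > maxLag {
		return Replica{}, false // no safe candidate; a human has to intervene
	}
	return best, true
}

func main() {
	replicas := []Replica{{"mysql-1", 980}, {"mysql-2", 1000}}
	if c, ok := pickCandidate(replicas, 1005, 10); ok {
		fmt.Println("promote", c.Name) // mysql-2 is within the lag budget
	} else {
		fmt.Println("no safe candidate")
	}
}
```

The interesting design point is the second return value: the babysitter must be able to answer "promote nobody", which is exactly the case a naive leader election gets wrong.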
@bprashanth that's being far too kind for stateful distributed systems. You would need a hook-container for literally every lifecycle event to determine the correct action. At that point you're almost better off creating a controller. |
@bprashanth why would we elect masters with k8s when the app elects masters? I am not quite getting what you are proposing. Stateful apps should do this w/o the help of k8s. Or maybe I am not getting what you are saying. |
@bprashanth I guess I don't understand why we need this when Etcd/Consul/Zookeeper work perfectly well? Leader election for Postgres is 2-stage, and stage #2 is highly application specific (it has to do with PostgreSQL log positions). Given that, I don't really see a way that any generic API would be able to satisfy it. Right now, we either use state machine knowledge (Flynn.io) where no leader election is necessary, or we use the 2-stage process, and only the first stage is a standard leader election. Also, importantly, the master failover must originate on the nodes themselves, not on the Kube Controller. Otherwise you get lots of lovely race conditions. The current PetSet pretty much gives me what I need in terms of endpoints, since I can define a service for the master, and update it. |
The only thing I could see having is a Kube API for generic DCS functions so that applications written for Kubernetes don't have to know whether they're addressing Etcd, Consul or Zookeeper. That would require dumbing down the interfaces, though. |
Yeah, that's one of the questions on the table: what API do you need, so you don't have to run your own instance of an HA database, to bootstrap and augment a non-HA database? We're already running a CP store behind the apiserver. The other issue is that you've written a bunch of code that handles the persnickety edge cases. We have a spectrum of applications ranging from more legacy (MySQL/Postgres master/slave, unclustered, no automatic failover, etc.) to modern (etcd) and "getting there" (ZooKeeper). If someone asks me how to deploy master/slave, I can point them at a simple petset; the next question that comes up is going to be how to fail over, to which we'd pretty much say: write a Kubernetes controller that watches the apiserver and acts as an init system. That's a pretty high bar. If there are common patterns we can distill into the API, we should; if everything is site-specific, we should let it be. Ideally sysadmins would not have to care about how to watch the apiserver, but that might be farfetched.
This is for applications that don't elect masters. |
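As a rough picture of what such a "watch the apiserver and fail over" controller reduces to, here is a toy loop driven by a channel of events. In a real controller the events would come from an apiserver watch and the election would be the lag-aware kind; `PodEvent`, `runFailover`, and the "first healthy pod wins" rule are all simplifying assumptions.

```go
package main

import "fmt"

// PodEvent is a hypothetical, simplified stand-in for an apiserver watch event.
type PodEvent struct {
	Pod     string
	Healthy bool
}

// runFailover consumes events and returns the final master, starting a new
// election (here: any healthy pod wins) whenever the current master goes
// unhealthy. A real controller would relabel pods instead of tracking names
// in memory.
func runFailover(events <-chan PodEvent) string {
	healthy := map[string]bool{}
	master := ""
	for ev := range events {
		healthy[ev.Pod] = ev.Healthy
		if ev.Pod == master && !ev.Healthy {
			master = "" // master died; trigger a new election
		}
		if master == "" {
			for pod, ok := range healthy {
				if ok {
					master = pod
					break
				}
			}
		}
	}
	return master
}

func main() {
	events := make(chan PodEvent, 3)
	events <- PodEvent{"pg-0", true}
	events <- PodEvent{"pg-1", true}
	events <- PodEvent{"pg-0", false} // master dies; pg-1 takes over
	close(events)
	fmt.Println("master:", runFailover(events))
}
```

Even this toy shows why the bar is high: the controller has to carry state across events and decide when an election is needed, which is exactly the part people keep rewriting.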
To clarify, I actually do see people writing the same controller repeatedly: 2 services, 1 for writes and 1 for reads. The write service has just 1 endpoint, the master; the read service has all endpoints. Then they move the "role=master" label on the write service between pods after every new master election. This is a perfect example of something we should aim to make easier, if it's possible. |
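The two-service pattern above can be simulated in a few lines. `Pod`, `endpoints`, and `promote` are toy stand-ins for real Services and pods, but the selector semantics match how a Service picks endpoints by exact label match:

```go
package main

import "fmt"

// Pod models just the labels of a pod; a Service selects endpoints by
// requiring every key/value in its selector to match.
type Pod struct {
	Name   string
	Labels map[string]string
}

func endpoints(pods []Pod, selector map[string]string) []string {
	var out []string
	for _, p := range pods {
		match := true
		for k, v := range selector {
			if p.Labels[k] != v {
				match = false
				break
			}
		}
		if match {
			out = append(out, p.Name)
		}
	}
	return out
}

// promote moves the role=master label so the write Service follows the new master.
func promote(pods []Pod, newMaster string) {
	for i := range pods {
		if pods[i].Name == newMaster {
			pods[i].Labels["role"] = "master"
		} else {
			delete(pods[i].Labels, "role")
		}
	}
}

func main() {
	pods := []Pod{
		{"db-0", map[string]string{"app": "db", "role": "master"}},
		{"db-1", map[string]string{"app": "db"}},
	}
	writeSel := map[string]string{"app": "db", "role": "master"}
	readSel := map[string]string{"app": "db"}
	promote(pods, "db-1")
	fmt.Println("write:", endpoints(pods, writeSel)) // write: [db-1]
	fmt.Println("read:", endpoints(pods, readSel))   // read: [db-0 db-1]
}
```

The label move is the whole trick: neither Service definition changes during failover, only the pod labels do.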
100% agree. What we have today in the field is not good enough. We probably want to proceed in simple steps as noted, but these are |
@bprashanth well, there's two different features you're talking about here then. Feature 1 is access to the Kubernetes-backing DCS via the apiserver. This seems like a reasonable enough thing to have that it's almost not worth talking about; someone should just write it. Feature 2 is some kind of PetSet extension that understands single-master services. This is more of a special case, down to probably being a Kubernetes extension. I can picture this and even suggest a design, which I'll sketch out when I have a chance. These are two different features though, so we should have two issues for them. |
We're not going to expose the raw k/v store. It will have to be through a well-defined API. Ideally other components in the system should be able to leverage the same pattern, and an admin should be able to figure things out with kubectl. What's the exact requirement here? Can you simply write to annotations, or do you need TTL/etcd leases, etc.? We did at one point have a proposal for a lease/lock API (#13037) but found that a lot of use cases are satisfied by writing to annotations. This is a good summary: #13037 (comment).
Yes please. Ideally I'd take the controller thingy you have for Postgres and separate it into: a generic master/slave failover framework, and site-specific scripts. Then I can swap in site-specific scripts for MySQL (probably).
SG, will fork them when I've collected some thoughts. |
The leader election client used by the scheduler and controller manager implements a lease system on top of annotations. It can probably be modified to meet new requirements, or at least used for inspiration. https://github.com/kubernetes/kubernetes/blob/master/pkg/client/leaderelection/leaderelection.go |
@bgrant0607 This issue is pretty old, and I'm not sure the original intent applies given @mikedanese's comment. I'd move to close this more general issue, and open specific ones where they apply. |
Forked from #260. Aka leader election.
Applications can use their own master-election/lock services, such as etcd or zookeeper. However, a problem remains, which is how clients of those services find the current master. Some applications, such as databases and key-value stores, provide their own client libraries or special protocols for this. Clients of HTTP APIs could potentially use load balancing plus redirects.
The question is should we do anything to make this easier?
We could:
If we decide to support this, I think the main design decision is whether the application should participate directly in the master election process. We have an etcd-based master-election implementation. We could expose an API for master election that would be independent of the underlying key-value store, but which would couple the application to that API. Or, we could potentially use container status and/or readiness probes (#620) to decide when to transfer mastership to another instance. Using readiness probes (or a variant thereof) would have the nice property that the application could use that mechanism to influence which instance was elected master, such as in the case that the application just wanted to take advantage of the addressing/naming mechanism.
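A toy version of the readiness-influenced election mentioned above: instances that want to bow out of mastership simply report not-ready, and the election itself is reduced to "first ready instance by name". `Instance` and `electMaster` are illustrative names, not a proposed API.

```go
package main

import (
	"fmt"
	"sort"
)

// Instance pairs an instance's name with its readiness-probe result.
type Instance struct {
	Name  string
	Ready bool
}

// electMaster picks the lexicographically-first ready instance. The point is
// not the tie-break rule but the input: readiness is the only signal the
// application exposes, yet it fully controls who is electable.
func electMaster(instances []Instance) (string, bool) {
	var ready []string
	for _, in := range instances {
		if in.Ready {
			ready = append(ready, in.Name)
		}
	}
	if len(ready) == 0 {
		return "", false
	}
	sort.Strings(ready)
	return ready[0], true
}

func main() {
	instances := []Instance{
		{"pet-0", false}, // e.g. still catching up, declines mastership
		{"pet-1", true},
		{"pet-2", true},
	}
	if m, ok := electMaster(instances); ok {
		fmt.Println("elected:", m)
	} else {
		fmt.Println("no electable instance")
	}
}
```

This captures the nice property noted above: the application influences the election purely through a mechanism it already implements (readiness), without being coupled to a master-election API.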