
Support master election #1542

Closed
bgrant0607 opened this issue Oct 2, 2014 · 14 comments
Labels
area/api, area/downward-api, area/HA, priority/backlog, sig/network

Comments

@bgrant0607
Member

Forked from #260. Aka leader election.

Applications can use their own master-election/lock services, such as etcd or zookeeper. However, a problem remains, which is how clients of those services find the current master. Some applications, such as databases and key-value stores, provide their own client libraries or special protocols for this. Clients of HTTP APIs could potentially use load balancing plus redirects.

The question is: should we do anything to make this easier?

We could:

  • assign an IP address and DNS name to the current master
  • reflect the current mastership status via the cluster API and via the downward API (Container downward/upward API umbrella issue #386)
    • applications that need true exclusion could use resourceVersion as the "lock sequence number" for their own concurrency control
  • create lifecycle hooks for acquiring and losing mastership

If we decide to support this, I think the main design decision is whether the application should participate directly in the master election process. We have an etcd-based master-election implementation. We could expose an API for master election that would be independent of the underlying key-value store, but which would couple the application to that API. Or, we could potentially use container status and/or readiness probes (#620) to decide when to transfer mastership to another instance. Using readiness probes (or a variant thereof) would have the nice property that the application could use that mechanism to influence which instance was elected master, such as in the case that the application just wanted to take advantage of the addressing/naming mechanism.
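For illustration, the "resourceVersion as the lock sequence number" idea above could look roughly like the sketch below. This is not an existing API: the ConfigMap lock object, the example.io/master annotation key, and the client-go calls (written with present-day names) are all assumptions. Contenders race to stamp their identity into an annotation on a shared object and rely on the apiserver's optimistic concurrency check, so only one update per resourceVersion can succeed.

```go
package main

import (
	"context"
	"fmt"

	"k8s.io/apimachinery/pkg/api/errors"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/rest"
)

// tryAcquire attempts to stamp this instance's identity into an annotation on a
// shared ConfigMap. The Update carries the resourceVersion we read, so the
// apiserver rejects it with a Conflict if anyone else won the race first.
func tryAcquire(ctx context.Context, cs kubernetes.Interface, ns, lockName, me string) (bool, error) {
	cm, err := cs.CoreV1().ConfigMaps(ns).Get(ctx, lockName, metav1.GetOptions{})
	if err != nil {
		return false, err
	}
	if holder := cm.Annotations["example.io/master"]; holder != "" && holder != me {
		return false, nil // someone else already holds mastership
	}
	if cm.Annotations == nil {
		cm.Annotations = map[string]string{}
	}
	cm.Annotations["example.io/master"] = me
	if _, err := cs.CoreV1().ConfigMaps(ns).Update(ctx, cm, metav1.UpdateOptions{}); err != nil {
		if errors.IsConflict(err) {
			return false, nil // lost the race; the resourceVersion moved on
		}
		return false, err
	}
	return true, nil
}

func main() {
	cfg, err := rest.InClusterConfig()
	if err != nil {
		panic(err)
	}
	cs := kubernetes.NewForConfigOrDie(cfg)
	won, err := tryAcquire(context.Background(), cs, "default", "my-app-master-lock", "pod-0")
	fmt.Println("acquired:", won, "err:", err)
}
```

This only gives exclusion; clients would still need one of the addressing options above (a master IP/DNS name, or readiness-based Service membership) to find the winner.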

@bgrant0607 added the kind/design, sig/network, area/api, and area/downward-api labels on Oct 2, 2014
@bgrant0607 added this to the v1.0 milestone on Oct 4, 2014
@bgrant0607 added the priority/awaiting-more-evidence label on Dec 3, 2014
@bgrant0607 removed this from the v1.0 milestone on Dec 3, 2014
@bgrant0607 added the sig/api-machinery and area/example labels on Feb 28, 2015
@bgrant0607 changed the title from "Decide whether/how to support master election" to "Support master election" on Aug 19, 2015
@bgrant0607
Member Author

cc @timothysc @jayunit100

@bgrant0607 added the team/api and priority/backlog labels and removed the sig/api-machinery, kind/design, area/example, and priority/awaiting-more-evidence labels on Aug 19, 2015
@timothysc
Member

So the lock/lease API is being updated, but I consider the algorithm for the actual leader election a separate concern, planned to be addressed over time as needed.

/cc @rrati @davidopp

@bprashanth
Contributor

I think the harder problem here is failover. Most quorum-based consensus protocols can steer through a member outage by running rounds of leader election (e.g. etcd, zookeeper). I think a simple petset plus some form of leader-aware Service should suffice here (#28443, #10456).

However, classic master/slave does not fail over gracefully. Managing such applications usually requires a babysitter. Take MySQL, for example: the babysitter needs to be aware of the master's log position and make sure it doesn't promote a lagging slave.

At a high level the steps might involve:

  1. Elect a leader (pet-0 might work)
  2. Advertise the leader (update the leader election Service with DNS name of pet-0)
  3. Detect master failure (simple health check from leader election Service might work)
  4. Elect a healthy slave (babysitter needed)
  5. Advertise healthy slave as new master (babysitter updates the leader election Service)

So assuming we implement a leader-aware Service, what other patterns help master/slave failover? Or should we just leave everything to the babysitter?
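For concreteness, here is a rough sketch of what a generic babysitter loop for steps 3 to 5 might look like, with the site-specific pieces (choosing a promotable slave, updating the leader-election Service) hidden behind an interface. Nothing here is an existing Kubernetes API; Babysitter, PickHealthiestSlave, and Advertise are hypothetical names.

```go
package babysitter

import (
	"log"
	"time"
)

// Babysitter captures the site-specific pieces of master/slave failover.
type Babysitter interface {
	MasterHealthy() bool                  // step 3: detect master failure
	PickHealthiestSlave() (string, error) // step 4: e.g. compare MySQL log positions
	Advertise(pod string) error           // steps 2/5: repoint the leader-election Service
}

// RunFailoverLoop polls master health and promotes a slave when the master fails.
func RunFailoverLoop(b Babysitter, interval time.Duration) {
	for range time.Tick(interval) {
		if b.MasterHealthy() {
			continue
		}
		next, err := b.PickHealthiestSlave()
		if err != nil {
			log.Printf("no promotable slave yet: %v", err)
			continue
		}
		if err := b.Advertise(next); err != nil {
			log.Printf("failed to advertise %s as master: %v", next, err)
		}
	}
}
```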

@jberkus now that we have basic PetSet, maybe you can give your views on postgres failover? @kubernetes/sig-apps

@timothysc
Member

@bprashanth that's being far too kind for stateful distributed systems. You would need a hook-container for literally every lifecycle event to determine the correct action. At that point you're almost better off creating a controller.

@chrislovecnm
Contributor

@bprashanth why would we elect masters with k8s when the app elects masters itself? I'm not quite getting what you're proposing; stateful apps should do this without the help of k8s. Or maybe I'm missing what you're saying.

@jberkus

jberkus commented Jul 14, 2016

@bprashanth I guess I don't understand why we need this when Etcd/Consul/Zookeeper work perfectly well?

Leader election for Postgres is two-stage, and stage 2 is highly application-specific (it has to do with PostgreSQL log positions). Given that, I don't really see a way that any generic API would be able to satisfy it. Right now, we either use state-machine knowledge (Flynn.io), where no leader election is necessary, or we use the two-stage process, and only the first stage is a standard leader election.

Also, importantly, the master failover must originate on the nodes themselves, not on the Kube Controller. Otherwise you get lots of lovely race conditions.

The current PetSet pretty much gives me what I need in terms of endpoints, since I can define a service for the master, and update it.

@jberkus

jberkus commented Jul 14, 2016

The only thing I could see having is a Kube API for generic DCS functions so that applications written for Kubernetes don't have to know whether they're addressing Etcd, Consul or Zookeeper. That would require dumbing down the interfaces, though.

@bprashanth
Contributor

The only thing I could see having is a Kube API for generic DCS functions so that applications written for Kubernetes don't have to know whether they're addressing Etcd, Consul or Zookeeper.

Yeah, that's one of the questions on the table: what API do you need so that you don't have to run your own instance of an HA database just to bootstrap and augment a non-HA database? We're already running a CP store behind the apiserver.

The other issue is that you've written a bunch of code that handles the persnickety edge cases. We have a spectrum of applications ranging from more legacy (mysql/postgres master/slave, unclustered, no automatic failover, etc.) to modern (etcd) and "getting there" (zookeeper). If someone asks me how to deploy master/slave, I can point them at a simple petset; the next question that comes up is going to be how to fail over, to which we'd pretty much say: write a controller/init system that watches the Kubernetes apiserver. That's a pretty high bar.

If there are common patterns we can distill into the API, we should; if everything is site-specific, we should let it be. Ideally sysadmins would not have to care about how to watch the apiserver, but that might be far-fetched.

I am not quite getting what you are proposing. Stateful apps should do this w/o the help of k8s. Or maybe I am not getting what you are saying.

This is for applications that don't elect masters.

@bprashanth
Contributor

To clarify, I actually do see people writing the same controller repeatedly: 2 Services, 1 for writes and 1 for reads. The write Service has just 1 endpoint, the master; the read Service has all endpoints. Then, after every new master election, they move the "role=master" label that the write Service selects on from the old master pod to the new one.

This is a perfect example of something we should aim to make easier, if it's possible.
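As a concrete (hypothetical) sketch of that controller: assume the write Service selects app=db,role=master and the read Service selects app=db only. After an election, the controller strips role=master from the old master pod and adds it to the new one, which repoints the write Service's single endpoint. The label key/value and the client-go calls (present-day names) are assumptions, not an existing API.

```go
package promote

import (
	"context"
	"fmt"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/types"
	"k8s.io/client-go/kubernetes"
)

// Promote moves the role=master label from oldMaster to newMaster, so the
// write Service (selector app=db,role=master) follows the newly elected pod.
func Promote(ctx context.Context, cs kubernetes.Interface, ns, oldMaster, newMaster string) error {
	patchLabel := func(pod, jsonValue string) error {
		patch := []byte(fmt.Sprintf(`{"metadata":{"labels":{"role":%s}}}`, jsonValue))
		_, err := cs.CoreV1().Pods(ns).Patch(ctx, pod, types.StrategicMergePatchType, patch, metav1.PatchOptions{})
		return err
	}
	if oldMaster != "" {
		// Setting a label to null in a strategic merge patch removes it.
		if err := patchLabel(oldMaster, `null`); err != nil {
			return err
		}
	}
	return patchLabel(newMaster, `"master"`)
}
```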

@smarterclayton
Contributor

100% agree. What we have today in the field is not good enough (underscore, exclamation points, etc).

We probably want to proceed in simple steps as noted, but these are standard patterns that benefit from designed automation tools.

@jberkus

jberkus commented Jul 15, 2016

@bprashanth well, there are two different features you're talking about here, then.

Feature 1 is access to the Kubernetes-backing DCS via the apiserver. This seems like a reasonable enough thing to have that it's almost not worth talking about; someone should just write it.

Feature 2 is some kind of PetSet extension that understands single-master services. This is more of a special case, possibly to the point of being a Kubernetes extension. I can picture this and even suggest a design, which I'll sketch out when I have a chance.

These are two different features though, so we should have two issues for them.

@bprashanth
Contributor

Feature 1 is access to the Kubernetes-backing DCS via the apiserver. This seems like a reasonable enough thing to have that it's almost not worth talking about; someone should just write it.

We're not going to expose the raw k/v store. It will have to be through a well-defined API. Ideally other components in the system should be able to leverage the same pattern, and an admin should be able to figure things out with kubectl. What's the exact requirement here? Can you simply write to annotations, or do you need to use TTLs/etcd leases, etc.?

We did at one point have a proposal for a lease/lock API (#13037), but found that a lot of use cases are satisfied by writing to annotations. This is a good summary: #13037 (comment).

Feature 2 is some kind of PetSet extension that understands single-master services. This is more of a special case, down to probably being a Kubernetes extension. I can picture this and even suggest a design, which I'll sketch out when I have a chance.

Yes please. Ideally I'd take the controller thingy you have for Postgres and separate it into a generic master/slave failover framework plus site-specific scripts. Then I can swap in the site-specific scripts for MySQL (probably).

These are two different features though, so we should have two issues for them.

SG, will fork them when I've collected some thoughts.

@mikedanese
Member

The leader election client used by the scheduler and controller manager implements a lease system on top of annotations. It can probably be modified to meet new requirements, or at least used for inspiration.

https://github.com/kubernetes/kubernetes/blob/master/pkg/client/leaderelection/leaderelection.go
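A rough usage sketch of that client follows. The names are taken from the present-day k8s.io/client-go/tools/leaderelection package, which descends from the file linked above and may differ in detail; the lock name, namespace, and identity are hypothetical.

```go
package main

import (
	"context"
	"log"
	"time"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/rest"
	"k8s.io/client-go/tools/leaderelection"
	"k8s.io/client-go/tools/leaderelection/resourcelock"
)

func main() {
	cfg, err := rest.InClusterConfig()
	if err != nil {
		log.Fatal(err)
	}
	cs := kubernetes.NewForConfigOrDie(cfg)

	// The lease record lives in a coordination.k8s.io Lease object here; the
	// in-tree client linked above stored the same record in annotations.
	lock := &resourcelock.LeaseLock{
		LeaseMeta:  metav1.ObjectMeta{Name: "my-app-leader", Namespace: "default"},
		Client:     cs.CoordinationV1(),
		LockConfig: resourcelock.ResourceLockConfig{Identity: "pod-0"},
	}

	leaderelection.RunOrDie(context.Background(), leaderelection.LeaderElectionConfig{
		Lock:          lock,
		LeaseDuration: 15 * time.Second,
		RenewDeadline: 10 * time.Second,
		RetryPeriod:   2 * time.Second,
		Callbacks: leaderelection.LeaderCallbacks{
			OnStartedLeading: func(ctx context.Context) { log.Println("acquired mastership") },
			OnStoppedLeading: func() { log.Println("lost mastership") },
			OnNewLeader:      func(id string) { log.Printf("current leader: %s", id) },
		},
	})
}
```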

@timothysc
Copy link
Member

@bgrant0607 This issue is pretty old, and I'm not sure the original intent still applies given @mikedanese's comment. I'd move to close this more general issue and open specific ones where they apply.
