
Support master election #1542

Closed
bgrant0607 opened this issue Oct 2, 2014 · 14 comments
Labels
area/api, area/downward-api, area/HA, priority/backlog, sig/network

Comments

@bgrant0607
Member

Forked from #260. Aka leader election.

Applications can use their own master-election/lock services, such as etcd or zookeeper. However, a problem remains, which is how clients of those services find the current master. Some applications, such as databases and key-value stores, provide their own client libraries or special protocols for this. Clients of HTTP APIs could potentially use load balancing plus redirects.

The question is: should we do anything to make this easier?

We could:

  • assign an IP address and DNS name to the current master
  • reflect the current mastership status via the cluster API and via the downward API (Container downward/upward API umbrella issue #386)
    • applications that need true exclusion could use resourceVersion as the "lock sequence number" for their own concurrency control
  • create lifecycle hooks for acquiring and losing mastership

If we decide to support this, I think the main design decision is whether the application should participate directly in the master election process. We have an etcd-based master-election implementation. We could expose an API for master election that would be independent of the underlying key-value store, but which would couple the application to that API. Or, we could potentially use container status and/or readiness probes (#620) to decide when to transfer mastership to another instance. Using readiness probes (or a variant thereof) would have the nice property that the application could use that mechanism to influence which instance was elected master, such as in the case that the application just wanted to take advantage of the addressing/naming mechanism.
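For illustration, the "resourceVersion as the lock sequence number" idea above could look roughly like the sketch below. This is not an existing API: the ConfigMap lock object, the example.io/master annotation key, and the client-go calls (written with present-day names) are all assumptions. Contenders race to stamp their identity into an annotation on a shared object and rely on the apiserver's optimistic concurrency check, so only one update per resourceVersion can succeed.

```go
package main

import (
	"context"
	"fmt"

	"k8s.io/apimachinery/pkg/api/errors"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/rest"
)

// tryAcquire attempts to stamp this instance's identity into an annotation on a
// shared ConfigMap. The Update carries the resourceVersion we read, so the
// apiserver rejects it with a Conflict if anyone else won the race first.
func tryAcquire(ctx context.Context, cs kubernetes.Interface, ns, lockName, me string) (bool, error) {
	cm, err := cs.CoreV1().ConfigMaps(ns).Get(ctx, lockName, metav1.GetOptions{})
	if err != nil {
		return false, err
	}
	if holder := cm.Annotations["example.io/master"]; holder != "" && holder != me {
		return false, nil // someone else already holds mastership
	}
	if cm.Annotations == nil {
		cm.Annotations = map[string]string{}
	}
	cm.Annotations["example.io/master"] = me
	if _, err := cs.CoreV1().ConfigMaps(ns).Update(ctx, cm, metav1.UpdateOptions{}); err != nil {
		if errors.IsConflict(err) {
			return false, nil // lost the race; the resourceVersion moved on
		}
		return false, err
	}
	return true, nil
}

func main() {
	cfg, err := rest.InClusterConfig()
	if err != nil {
		panic(err)
	}
	cs := kubernetes.NewForConfigOrDie(cfg)
	won, err := tryAcquire(context.Background(), cs, "default", "my-app-master-lock", "pod-0")
	fmt.Println("acquired:", won, "err:", err)
}
```

This only gives exclusion; clients would still need one of the addressing options above (a master IP/DNS name, or readiness-based Service membership) to find the winner.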

@bgrant0607 added the kind/design, sig/network, area/api, and area/downward-api labels on Oct 2, 2014
@bgrant0607 added this to the v1.0 milestone on Oct 4, 2014
@bgrant0607 added the priority/awaiting-more-evidence label on Dec 3, 2014
@bgrant0607 removed this from the v1.0 milestone on Dec 3, 2014
@bgrant0607 added the sig/api-machinery and area/example labels on Feb 28, 2015
@bgrant0607 changed the title from "Decide whether/how to support master election" to "Support master election" on Aug 19, 2015
@bgrant0607
Member Author

cc @timothysc @jayunit100

@bgrant0607 added the team/api and priority/backlog labels and removed the sig/api-machinery, kind/design, area/example, and priority/awaiting-more-evidence labels on Aug 19, 2015
@timothysc
Member

So the lock/lease API is being updated, but I consider the algorithm for the actual leader election a separate concern, planned to be addressed over time as needed.

/cc @rrati @davidopp

@bprashanth
Contributor

I think the harder problem here is failover. Most quorum-based consensus protocols can steer through a member outage by running rounds of leader election (e.g. etcd, zookeeper). I think a simple petset plus some form of leader-aware Service should suffice here (#28443, #10456).

However, classic master/slave does not fail over gracefully. Managing such applications usually requires a babysitter. Take MySQL, for example: the babysitter needs to be aware of the master's log position and make sure it doesn't promote a lagging slave.

At a high level the steps might involve:

  1. Elect a leader (pet-0 might work)
  2. Advertise the leader (update the leader election Service with DNS name of pet-0)
  3. Detect master failure (simple health check from leader election Service might work)
  4. Elect a healthy slave (babysitter needed)
  5. Advertise healthy slave as new master (babysitter updates the leader election Service)

So assuming we implement a leader-aware Service, what other patterns help master/slave failover? Or should we just leave everything to the babysitter?
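For concreteness, here is a rough sketch of what a generic babysitter loop for steps 3 to 5 might look like, with the site-specific pieces (choosing a promotable slave, updating the leader-election Service) hidden behind an interface. Nothing here is an existing Kubernetes API; Babysitter, PickHealthiestSlave, and Advertise are hypothetical names.

```go
package babysitter

import (
	"log"
	"time"
)

// Babysitter captures the site-specific pieces of master/slave failover.
type Babysitter interface {
	MasterHealthy() bool                  // step 3: detect master failure
	PickHealthiestSlave() (string, error) // step 4: e.g. compare MySQL log positions
	Advertise(pod string) error           // steps 2/5: repoint the leader-election Service
}

// RunFailoverLoop polls master health and promotes a slave when the master fails.
func RunFailoverLoop(b Babysitter, interval time.Duration) {
	for range time.Tick(interval) {
		if b.MasterHealthy() {
			continue
		}
		next, err := b.PickHealthiestSlave()
		if err != nil {
			log.Printf("no promotable slave yet: %v", err)
			continue
		}
		if err := b.Advertise(next); err != nil {
			log.Printf("failed to advertise %s as master: %v", next, err)
		}
	}
}
```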

@jberkus now that we have basic PetSet, maybe you can give your views on postgres failover? @kubernetes/sig-apps

@timothysc
Member

@bprashanth that's being far too kind for stateful distributed systems. You would need a hook-container for literally every lifecycle event to determine the correct action. At that point you're almost better off creating a controller.

@chrislovecnm
Contributor

@bprashanth why would we elect masters with k8s when the app elects masters itself? I'm not quite getting what you're proposing; stateful apps should do this without the help of k8s. Or maybe I'm missing what you're saying.

@jberkus

jberkus commented Jul 14, 2016

@bprashanth I guess I don't understand why we need this when Etcd/Consul/Zookeeper work perfectly well?

Leader election for Postgres is two-stage, and stage 2 is highly application-specific (it has to do with PostgreSQL log positions). Given that, I don't really see a way that any generic API would be able to satisfy it. Right now, we either use state-machine knowledge (Flynn.io), where no leader election is necessary, or we use the two-stage process, and only the first stage is a standard leader election.

Also, importantly, the master failover must originate on the nodes themselves, not on the Kube Controller. Otherwise you get lots of lovely race conditions.

The current PetSet pretty much gives me what I need in terms of endpoints, since I can define a service for the master, and update it.

@jberkus

jberkus commented Jul 14, 2016

The only thing I could see having is a Kube API for generic DCS functions so that applications written for Kubernetes don't have to know whether they're addressing Etcd, Consul or Zookeeper. That would require dumbing down the interfaces, though.

@bprashanth
Contributor

The only thing I could see having is a Kube API for generic DCS functions so that applications written for Kubernetes don't have to know whether they're addressing Etcd, Consul or Zookeeper.

Yeah, that's one of the questions on the table: what API do you need so that you don't have to run your own instance of an HA database just to bootstrap and augment a non-HA database? We're already running a CP store behind the apiserver.

The other issue is that you've written a bunch of code that handles the persnickety edge cases. We have a spectrum of applications ranging from more legacy (mysql/postgres master/slave, unclustered, no automatic failover, etc.) to modern (etcd) and "getting there" (zookeeper). If someone asks me how to deploy master/slave, I can point them at a simple petset; the next question that comes up is going to be how to fail over, to which we'd pretty much say: write a controller/init system that watches the Kubernetes apiserver. That's a pretty high bar.

If there are common patterns we can distill into the API, we should; if everything is site-specific, we should let it be. Ideally sysadmins would not have to care about how to watch the apiserver, but that might be far-fetched.

I am not quite getting what you are proposing. Stateful apps should do this w/o the help of k8s. Or maybe I am not getting what you are saying.

This is for applications that don't elect masters.

@bprashanth
Contributor

To clarify, I actually do see people writing the same controller repeatedly: 2 Services, 1 for writes and 1 for reads. The write Service has just 1 endpoint, the master; the read Service has all endpoints. Then, after every new master election, they move the "role=master" label that the write Service selects on from the old master pod to the new one.

This is a perfect example of something we should aim to make easier, if it's possible.
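As a concrete (hypothetical) sketch of that controller: assume the write Service selects app=db,role=master and the read Service selects app=db only. After an election, the controller strips role=master from the old master pod and adds it to the new one, which repoints the write Service's single endpoint. The label key/value and the client-go calls (present-day names) are assumptions, not an existing API.

```go
package promote

import (
	"context"
	"fmt"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/types"
	"k8s.io/client-go/kubernetes"
)

// Promote moves the role=master label from oldMaster to newMaster, so the
// write Service (selector app=db,role=master) follows the newly elected pod.
func Promote(ctx context.Context, cs kubernetes.Interface, ns, oldMaster, newMaster string) error {
	patchLabel := func(pod, jsonValue string) error {
		patch := []byte(fmt.Sprintf(`{"metadata":{"labels":{"role":%s}}}`, jsonValue))
		_, err := cs.CoreV1().Pods(ns).Patch(ctx, pod, types.StrategicMergePatchType, patch, metav1.PatchOptions{})
		return err
	}
	if oldMaster != "" {
		// Setting a label to null in a strategic merge patch removes it.
		if err := patchLabel(oldMaster, `null`); err != nil {
			return err
		}
	}
	return patchLabel(newMaster, `"master"`)
}
```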

@smarterclayton
Contributor

100% agree. What we have today in the field is not good enough (underscore, exclamation points, etc).

We probably want to proceed in simple steps as noted, but these are standard patterns that benefit from designed automation tools.

@jberkus

jberkus commented Jul 15, 2016

@bprashanth well, there are two different features you're talking about here, then.

Feature 1 is access to the Kubernetes-backing DCS via the apiserver. This seems like a reasonable enough thing to have that it's almost not worth talking about; someone should just write it.

Feature 2 is some kind of PetSet extension that understands single-master services. This is more of a special case, possibly to the point of being a Kubernetes extension. I can picture this and even suggest a design, which I'll sketch out when I have a chance.

These are two different features though, so we should have two issues for them.

@bprashanth
Contributor

Feature 1 is access to the Kubernetes-backing DCS via the apiserver. This seems like a reasonable enough thing to have that it's almost not worth talking about; someone should just write it.

We're not going to expose the raw k/v store. It will have to be through a well-defined API. Ideally other components in the system should be able to leverage the same pattern, and an admin should be able to figure things out with kubectl. What's the exact requirement here? Can you simply write to annotations, or do you need to use TTLs/etcd leases, etc.?

We did at one point have a proposal for a lease/lock API (#13037), but found that a lot of use cases are satisfied by writing to annotations. This is a good summary: #13037 (comment).

Feature 2 is some kind of PetSet extension that understands single-master services. This is more of a special case, down to probably being a Kubernetes extension. I can picture this and even suggest a design, which I'll sketch out when I have a chance.

Yes please. Ideally I'd take the controller thingy you have for Postgres and separate it into a generic master/slave failover framework plus site-specific scripts. Then I can swap in the site-specific scripts for MySQL (probably).

These are two different features though, so we should have two issues for them.

SG, will fork them when I've collected some thoughts.

@mikedanese
Member

The leader election client used by the scheduler and controller manager implements a lease system on top of annotations. It can probably be modified to meet new requirements, or at least used for inspiration.

https://github.com/kubernetes/kubernetes/blob/master/pkg/client/leaderelection/leaderelection.go
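A rough usage sketch of that client follows. The names are taken from the present-day k8s.io/client-go/tools/leaderelection package, which descends from the file linked above and may differ in detail; the lock name, namespace, and identity are hypothetical.

```go
package main

import (
	"context"
	"log"
	"time"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/rest"
	"k8s.io/client-go/tools/leaderelection"
	"k8s.io/client-go/tools/leaderelection/resourcelock"
)

func main() {
	cfg, err := rest.InClusterConfig()
	if err != nil {
		log.Fatal(err)
	}
	cs := kubernetes.NewForConfigOrDie(cfg)

	// The lease record lives in a coordination.k8s.io Lease object here; the
	// in-tree client linked above stored the same record in annotations.
	lock := &resourcelock.LeaseLock{
		LeaseMeta:  metav1.ObjectMeta{Name: "my-app-leader", Namespace: "default"},
		Client:     cs.CoordinationV1(),
		LockConfig: resourcelock.ResourceLockConfig{Identity: "pod-0"},
	}

	leaderelection.RunOrDie(context.Background(), leaderelection.LeaderElectionConfig{
		Lock:          lock,
		LeaseDuration: 15 * time.Second,
		RenewDeadline: 10 * time.Second,
		RetryPeriod:   2 * time.Second,
		Callbacks: leaderelection.LeaderCallbacks{
			OnStartedLeading: func(ctx context.Context) { log.Println("acquired mastership") },
			OnStoppedLeading: func() { log.Println("lost mastership") },
			OnNewLeader:      func(id string) { log.Printf("current leader: %s", id) },
		},
	})
}
```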

@timothysc
Copy link
Member

@bgrant0607 This issue is pretty old, and I'm not sure the original intent still applies given @mikedanese's comment. I'd move to close this more general issue and open specific ones where they apply.
