
Elasticsearch readiness probe might fail if a single node is stuck #2248

Closed
DaveCTurner opened this issue Dec 11, 2019 · 9 comments · Fixed by #2272
Labels
>bug Something isn't working v1.0.0

Comments

@DaveCTurner

It looks like we run GET _cat/nodes?local in order to check a pod's readiness:

https://github.com/elastic/cloud-on-k8s/blob/master/pkg/controller/elasticsearch/nodespec/readiness_probe.go#L32

This API call does two things: it gets a list of nodes from the local cluster state, and then it reaches out to all the nodes in the cluster and obtains some extra info from them. If a single node in the cluster fails to respond within a few seconds during this second step, then I think we consider all the pods to be failing their readiness checks.

I think it's a bug that this API accepts ?local at all given this gotcha (see elastic/elasticsearch#50088), but I also think we should be using a different API for our readiness checks. For instance, GET _cluster/health?timeout=0s returns 200 OK iff the node is part of a cluster with a master, and GET / returns (something) if the node is actually alive and responding.

(I say "something" because in certain versions GET / returns 503 Service Unavailable if there is no master, but more recent versions always return 200 OK here.)
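For concreteness, the two candidate checks could be driven from a small shell helper like the one below. This is a sketch only: the plain-HTTP localhost:9200 base URL and the probe_url name are illustrative assumptions, not what ECK actually ships.

```shell
# Sketch: build the URL for each candidate readiness check.
# The plain-HTTP localhost:9200 endpoint is an assumption for illustration.
probe_url() {
  case "$1" in
    # Returns 200 only when the node is part of a cluster with an elected master:
    health) echo "http://localhost:9200/_cluster/health?timeout=0s" ;;
    # Liveness only; pre-7.0 nodes answer 503 here when there is no master:
    root)   echo "http://localhost:9200/" ;;
    *)      return 1 ;;
  esac
}

# A probe would then check the HTTP status code, e.g.:
#   status=$(curl -s -o /dev/null -w '%{http_code}' "$(probe_url health)")
#   [ "$status" = "200" ]
```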

@pebrc pebrc added the discuss We need to figure this out label Dec 11, 2019
@pebrc
Collaborator

pebrc commented Dec 11, 2019

We changed this in #1748. The intention was to bind pod readiness to a node-local request so that we can evaluate pod readiness independently of overall cluster health and cluster membership: we are only interested in the status of the single node being probed, and want to know whether it is in principle ready to enter into operation.

From what you are saying, GET / seems to be the request that gets us closest to that?

[Update] Actually, we had the health check at GET / originally and changed it because of the 503s we were seeing.

but more recent versions always return 200 OK here

Do you know which versions those are? I think we saw it on 6.x clusters, IIRC.

@DaveCTurner
Author

Do you know which versions those are? I think we saw it on 6.x clusters, IIRC.

It changed in 7.0.0, see elastic/elasticsearch#29045.

From what you are saying, GET / seems to be the request that gets us closest to that?

Maybe, but note that this will consider a running node to be ready even if it cannot contact the master of the cluster, and arguably a node without a master isn't "ready" in any meaningful sense since it can't really serve any useful requests.

@DaveCTurner
Author

In light of #1748 maybe we do indeed want GET /, treating 503 as a benign misspelling of 200. Alternatively, GET _cluster/health?local would suffice here, since that is a local request that always returns 200.

I will admit I don't fully understand the surrounding context -- deleting temporarily-unhealthy pods so enthusiastically seems like it might cause other problems. For instance a node could stop responding even to GET / for a while if it enters a long GC pause, and in that case it seems best to wait.

@LoadingZhang

I'm encountering this readiness problem.
I started 3 nodes in a benchmark, and each pod fails its readiness probe every three to five minutes (not at the same time). If one pod fails, the ingress (Traefik or nginx) returns a 503 Service Unavailable error, which causes the write process to terminate immediately.

@pebrc
Collaborator

pebrc commented Dec 11, 2019

I will admit I don't fully understand the surrounding context -- deleting temporarily-unhealthy pods so enthusiastically seems like it might cause other problems

#1748 is a bit outdated in that regard. We have since rewritten the logic that deletes pods during upgrades and no longer delete pods that become temporarily unhealthy. It is maybe also worth mentioning that the k8s readiness probe by default retries 3 times at 5-second intervals before marking a pod as not ready.

The purpose of this readiness check as I see it is only to see whether a node has booted successfully to the extent that it is ready to accept API calls. This readiness state then determines whether the node is included in the k8s Service so that clients can reach it.
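For context, a readiness probe with the retry behaviour described above might be declared like this. The values for failureThreshold and periodSeconds come from the comment; the script path and remaining timings are illustrative assumptions, not necessarily ECK's exact spec.

```yaml
# Illustrative pod spec fragment; the probe script path is hypothetical.
readinessProbe:
  exec:
    command: ["bash", "-c", "/usr/share/elasticsearch/probe/readiness.sh"]
  failureThreshold: 3   # retried 3 times before the pod is marked not ready
  periodSeconds: 5      # probe every 5 seconds
  timeoutSeconds: 5     # assumption
  initialDelaySeconds: 10  # assumption
```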

@DaveCTurner
Author

The purpose of this readiness check as I see it is only to see whether a node has booted successfully to the extent that it is ready to accept API calls. This readiness state then determines whether the node is included in the k8s Service so that clients can reach it.

These are two different purposes. A node without a master can technically accept API calls, but can't really handle most of them, so I think it's debatable whether a node without a master should be included in the service so that clients can reach it.

@anyasabo
Contributor

If I'm understanding elastic/elasticsearch#29045 correctly, we might want to:

  • in 7.0+, check that / returns 200
  • in 6.x, check that / returns 200 or 503 and that the body looks like something we expect, e.g.
{
  "name" : "quickstart-es-default-0",
  "cluster_name" : "quickstart",
  "cluster_uuid" : "7Apxo0XXSD6CTNsY_UNNew",
...

Does that make sense?
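The version-dependent rule proposed above could be sketched as a small shell helper. The function name is hypothetical, and the 6.x body check is left as a comment; this is not ECK's actual probe script.

```shell
# Sketch of the proposed version-aware readiness decision.
# Usage: ready_status MAJOR_VERSION HTTP_STATUS -> exit 0 (ready) / 1 (not ready).
ready_status() {
  local major="$1" status="$2"
  if [ "$major" -ge 7 ]; then
    # 7.0+ always answers GET / with 200 when alive (elastic/elasticsearch#29045).
    [ "$status" -eq 200 ]
  else
    # On 6.x, GET / answers 503 when there is no elected master, so accept
    # both codes; a real probe would also sanity-check the response body.
    [ "$status" -eq 200 ] || [ "$status" -eq 503 ]
  fi
}
```

A real implementation would obtain the status with curl and, on 6.x, additionally verify that the body contains the expected fields (name, cluster_name, cluster_uuid).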

@pebrc
Collaborator

pebrc commented Dec 12, 2019

We discussed offline and are going to implement as proposed in #2248 (comment)

@pebrc pebrc added >bug Something isn't working v1.0.0 and removed discuss We need to figure this out labels Dec 12, 2019
@barkbay barkbay self-assigned this Dec 13, 2019
@jpountz

jpountz commented Dec 13, 2019

We made this an issue on the Elasticsearch repo: elastic/elasticsearch#50187.

primeroz pushed a commit to mintel/es-image that referenced this issue Jan 15, 2020
* Add port 9200 to master statefulset to simplify curling the cluster when only master is started
* Change readiness probes see elastic/cloud-on-k8s#2248
* Disable pvc and use empty dir
* Upgrade all images to test 7.5.1