[elasticsearch] Revisit readinessProbe #553

jmlrt · 2020-04-02T14:53:22Z

We should revisit the readinessProbe. See "Elasticsearch readiness probe might fail if a single node is stuck" (cloud-on-k8s#2248) for what ECK does and why. The TL;DR version is to just call /, but note that there are some subtleties with HTTP response codes and the ES version.

Originally posted by @pugnascotia in elastic/elasticsearch#53426 (comment)

jmlrt · 2020-04-17T11:20:19Z

Helm Chart readiness probe

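We use a script that queries two endpoints. Until the first successful probe, it calls /_cluster/health?wait_for_status=green&timeout=1s (configurable through clusterHealthCheckParams) to check that the Elasticsearch cluster is OK. Once that has succeeded, it calls /_cluster/health?timeout=0s to check that the node is still healthy and that master nodes are available: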

helm-charts/elasticsearch/templates/statefulset.yaml

Lines 215 to 227 in a0e8d77

    
    if [ -f "${START_FILE}" ]; then
        echo 'Elasticsearch is already running, lets check the node is healthy and there are master nodes available'
        http "/_cluster/health?timeout=0s"
    else
        echo 'Waiting for elasticsearch cluster to become ready (request params: "{{ .Values.clusterHealthCheckParams }}" )'
        if http "/_cluster/health?{{ .Values.clusterHealthCheckParams }}" ; then
            touch ${START_FILE}
            exit 0
        else
            echo 'Cluster is not yet ready (request params: "{{ .Values.clusterHealthCheckParams }}" )'
            exit 1
        fi
    fi
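
To make the two phases of the probe script above easier to experiment with outside a pod, here is a minimal sketch with the HTTP call stubbed out. The names readiness_probe, check_health, HEALTH_PARAMS and OK_PATHS are hypothetical and not part of the chart; in the chart, the http helper performs a real request against the local node instead.

```shell
#!/bin/sh
# Sketch of the two-phase readiness logic. All names below are illustrative
# assumptions, not the chart's actual helpers.

START_FILE="${START_FILE:-/tmp/.es_start_file}"
HEALTH_PARAMS="${HEALTH_PARAMS:-wait_for_status=green&timeout=1s}"

# Stub for the HTTP call: succeeds only if the requested path is listed in
# OK_PATHS (space separated). A real probe would call curl here.
check_health() {
    case " ${OK_PATHS} " in
        *" $1 "*) return 0 ;;
        *) return 1 ;;
    esac
}

readiness_probe() {
    if [ -f "${START_FILE}" ]; then
        # Phase 2: the strict check passed once already, so only run the
        # lighter check with no wait_for_status condition.
        check_health "/_cluster/health?timeout=0s"
    else
        # Phase 1: require the strict check, then remember that it passed.
        if check_health "/_cluster/health?${HEALTH_PARAMS}"; then
            touch "${START_FILE}"
            return 0
        fi
        return 1
    fi
}
```

Setting OK_PATHS to different values simulates a cluster that is green, reachable but not green, or unreachable. The marker file is what makes the strict check a one-time gate: once it exists, a cluster that later turns yellow no longer marks the pod as Unready.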

Note that the information for these checks is retrieved from the master node, because the local query parameter is not set (https://www.elastic.co/guide/en/elasticsearch/reference/current/cluster-health.html#request-params).

The desired behaviour here is that if the data nodes are unable to talk to their master nodes for whatever reason, then the data nodes will become Unready and therefore be removed from the Service load-balancer until the master nodes are available again (quoting @fatmcgav from #380 (comment)).

ECK readiness probe

ECK takes a different approach: it checks the / endpoint, only to ensure that the Elasticsearch node has started:

https://github.com/elastic/cloud-on-k8s/blob/71a3725335596acdb6b7a13c917f9018d9953a6f/pkg/controller/elasticsearch/nodespec/readiness_probe.go#L67-L77

The intention is to check only the single node, independently of overall cluster health and cluster membership, to know whether it is in principle ready to enter into operation (elastic/cloud-on-k8s#2248 (comment)). This avoids an issue during rolling upgrades where all nodes lose their ready state and are deleted while master nodes are rolled (more detail in elastic/cloud-on-k8s#1748 (comment)).
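
As a rough illustration of a node-local check against /, the sketch below separates fetching the status code from deciding readiness. The function names is_ready and probe_root, the default URL, and the timeout are hypothetical. The rule used here (200 is ready, 503 is tolerated only on 6.x) is an assumption meant to illustrate the "subtleties with HTTP response codes and the ES version" mentioned at the top of this issue; the linked ECK source is the reference for the actual rule.

```shell
#!/bin/sh
# is_ready STATUS VERSION
# Decide readiness from the HTTP status returned by GET /.
# Assumption for illustration: 200 is always ready, 503 only on 6.x.
is_ready() {
    status="$1"
    version="$2"
    case "${status}" in
        200) return 0 ;;
        503)
            case "${version}" in
                6.*) return 0 ;;
            esac
            return 1 ;;
        *) return 1 ;;
    esac
}

# probe_root [URL]
# Print only the HTTP status code of GET / on the local node.
# The default URL and the 3 second timeout are placeholder values.
probe_root() {
    curl -s -k -o /dev/null -w '%{http_code}' --max-time 3 \
        "${1:-http://127.0.0.1:9200/}"
}
```

A probe script could then end with something like is_ready "$(probe_root)" "${ES_VERSION}", where ES_VERSION is again a hypothetical variable. Because nothing here asks the master for cluster state, the result depends only on the local node.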

Note that, as far as I know, a failed readiness probe should only remove the pod from the Service so that no traffic is sent to it until the readiness probe succeeds again. It shouldn't kill the pod (unless the ECK operator force-kills pods that are not ready).

jmlrt · 2020-04-21T12:02:19Z

Closed by #586

jmlrt added the enhancement (New feature or request) label Apr 2, 2020

jmlrt mentioned this issue Apr 2, 2020: Elasticsearch Helm Chart review elastic/elasticsearch#53426 (Closed)

jmlrt self-assigned this Apr 17, 2020

jmlrt mentioned this issue Apr 21, 2020: [elasticsearch] update readiness probe endpoint #586 (Merged)

jmlrt closed this as completed Apr 21, 2020
