
Instrument the operator with metrics #212

Closed
sebgl opened this issue Dec 12, 2018 · 7 comments

sebgl (Contributor) commented Dec 12, 2018

Metrics we're interested in (a rough sketch of how these could be declared follows the lists below):

  • rate of reconciliation loop executions, for:
    • stack controller
    • elasticsearchcluster controller
    • kibana controller
  • rate of reconciliation loop errors (for each of the three controllers), labeled with the error "type"
  • reconciliation loop duration/latency (for each of the three controllers): average, ideally p95 & p99

ES communication metrics (labeled with the stack name):

  • rate of requests to a stack's ES endpoint
  • duration/latency of those requests

K8s request metrics:

  • rate of requests to the apiserver, labeled by verb, route and status code
  • rate of request errors to the apiserver, labeled by verb, route and status code
  • request duration: average, percentiles

Optional (to discuss?):

  • gauge of the number of stack/ES/Kibana resources the operator is responsible for
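To make the list above concrete, here is a rough sketch of how these metrics could be declared with the Prometheus Go client (prometheus/client_golang). All metric and label names below are illustrative, not what the operator actually registers:

package metrics

import "github.com/prometheus/client_golang/prometheus"

var (
	// Rate of reconciliation loop executions, per controller
	// (stack, elasticsearchcluster, kibana).
	reconciliationsTotal = prometheus.NewCounterVec(
		prometheus.CounterOpts{
			Name: "operator_reconciliations_total", // illustrative name
			Help: "Total number of reconciliation loop executions.",
		},
		[]string{"controller"},
	)

	// Rate of reconciliation loop errors, per controller and error type.
	reconciliationErrorsTotal = prometheus.NewCounterVec(
		prometheus.CounterOpts{
			Name: "operator_reconciliation_errors_total",
			Help: "Total number of reconciliation loop errors.",
		},
		[]string{"controller", "error_type"},
	)

	// Reconciliation loop duration; averages and percentiles are derived
	// from the histogram buckets at query time.
	reconciliationDuration = prometheus.NewHistogramVec(
		prometheus.HistogramOpts{
			Name:    "operator_reconciliation_duration_seconds",
			Help:    "Duration of reconciliation loop executions.",
			Buckets: prometheus.DefBuckets,
		},
		[]string{"controller"},
	)

	// Optional: number of stack/ES/Kibana resources the operator manages.
	managedResources = prometheus.NewGaugeVec(
		prometheus.GaugeOpts{
			Name: "operator_managed_resources",
			Help: "Number of resources currently managed by the operator.",
		},
		[]string{"kind"},
	)
)

func init() {
	prometheus.MustRegister(
		reconciliationsTotal,
		reconciliationErrorsTotal,
		reconciliationDuration,
		managedResources,
	)
}

Incrementing reconciliationsTotal.WithLabelValues("kibana").Inc() at the start of each reconcile call (and the error counter on failures) would then be enough to support rate() queries on the consumer side.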

Metrics collector

1. Prometheus lib instrumentation <- Metricbeat -> Elasticsearch

  • rates are usually expressed with counters, which we visualize by applying a rate() function. Counters can be reset when the process restarts. How well does that fit ES/Kibana? We could use a derivative aggregation, but does it handle restarts well? Edit: yes, we can tweak this in the Time Series Visual Builder: https://www.youtube.com/watch?v=CNR-4kZ6v_E

  • latencies are usually histograms, with values falling into buckets that we define (see the short Go sketch after these bullets):

# TYPE stack_operator_stack_reconciliations_duration_seconds histogram
stack_operator_stack_reconciliations_duration_seconds_bucket{le="0.005"} 1
stack_operator_stack_reconciliations_duration_seconds_bucket{le="0.01"} 1
stack_operator_stack_reconciliations_duration_seconds_bucket{le="0.025"} 1
stack_operator_stack_reconciliations_duration_seconds_bucket{le="0.05"} 1
stack_operator_stack_reconciliations_duration_seconds_bucket{le="0.1"} 1
stack_operator_stack_reconciliations_duration_seconds_bucket{le="0.25"} 1
stack_operator_stack_reconciliations_duration_seconds_bucket{le="0.5"} 1
stack_operator_stack_reconciliations_duration_seconds_bucket{le="1"} 1
stack_operator_stack_reconciliations_duration_seconds_bucket{le="2.5"} 1
stack_operator_stack_reconciliations_duration_seconds_bucket{le="5"} 1
stack_operator_stack_reconciliations_duration_seconds_bucket{le="10"} 1
stack_operator_stack_reconciliations_duration_seconds_bucket{le="+Inf"} 1
stack_operator_stack_reconciliations_duration_seconds_sum 0.000187009
stack_operator_stack_reconciliations_duration_seconds_count 1

Not sure how to visualize that with ES/Kibana.

  • gauges represent values that can go up/down (a good example for that is CPU usage).
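To show where the bucket counts in the sample output above come from, here is a minimal instrumentation-side sketch, again assuming prometheus/client_golang and an illustrative reconcile function (prometheus.DefBuckets corresponds exactly to the le boundaries shown above):

var stackReconciliationDuration = prometheus.NewHistogram(prometheus.HistogramOpts{
	Name:    "stack_operator_stack_reconciliations_duration_seconds",
	Help:    "Duration of stack reconciliation loop executions.",
	Buckets: prometheus.DefBuckets, // .005 ... 10, plus the implicit +Inf bucket
})

func reconcileStack() error {
	timer := prometheus.NewTimer(stackReconciliationDuration)
	defer timer.ObserveDuration() // adds the elapsed seconds to _sum, _count and the matching buckets
	// ... actual reconciliation logic ...
	return nil
}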

2. go-metrics <- Metricbeat -> Elasticsearch

Main benefit: histograms are simpler than the Prometheus alternative, since they emit average and percentile values directly.
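For illustration (the issue doesn't say which go-metrics implementation is meant; the sketch below assumes rcrowley/go-metrics and made-up metric names), a Timer exposes mean and percentile values directly, ready for Metricbeat to ship:

import (
	"time"

	metrics "github.com/rcrowley/go-metrics"
)

func reconcileWithTimer() {
	// A Timer combines a histogram (for percentiles) with a meter (for rates).
	t := metrics.GetOrRegisterTimer("stack.reconciliations.duration", metrics.DefaultRegistry)

	start := time.Now()
	// ... actual reconciliation logic ...
	t.UpdateSince(start)

	// Pre-aggregated values are available without any rate()/histogram query:
	_ = t.Mean()           // average duration (in nanoseconds)
	_ = t.Percentile(0.95) // p95
}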

3. logs <- filebeat -> Elasticsearch

If each reconciliation loop execution is logged anyway, it's quite easy to include all metrics in the logs we produce and build dashboards from that. In other words: leave the entire aggregation to ES and don't pre-aggregate.
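A sketch of what that could look like with structured logging (logr is used here purely as an illustration, and the field names are made up); each reconciliation emits one log line, Filebeat ships it, and ES computes rates, averages and percentiles at query time:

import (
	"time"

	"github.com/go-logr/logr"
)

func logReconciliation(log logr.Logger, controller string, start time.Time, err error) {
	// One structured document per reconciliation; a Kibana visualization
	// can then aggregate over duration_seconds, controller and error.
	log.Info("reconciliation finished",
		"controller", controller,
		"duration_seconds", time.Since(start).Seconds(),
		"error", err != nil,
	)
}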

sebgl self-assigned this Dec 12, 2018
sebgl (Contributor, Author) commented Dec 13, 2018

Basic metrics are set up in PR #214, but not really exposed yet. They should probably be exposed behind an optional service, possibly with the Prometheus scrape annotations set.

pebrc added this to the Alpha milestone Feb 8, 2019
pebrc modified the milestones: Alpha, Beta Mar 14, 2019
pebrc removed this from the Beta milestone May 10, 2019
donbowman commented:

https://github.com/vvanholl/elasticsearch-prometheus-exporter
would be nice

anyasabo (Contributor) commented:

@donbowman that exporter is for the actual ES clusters themselves, whereas this issue is about instrumenting the operator. That said, since that ES exporter installs as a plugin, you can install it with an init container as described here:
https://github.com/elastic/cloud-on-k8s/blob/master/docs/snapshots.asciidoc

charith-elastic (Contributor) commented:

It would be very useful to have an instrumented Elasticsearch client and gather metrics about the API calls. One of the observations from scale testing (#357) was that the operator seems to spend most of its time on API calls when managing a large number of Elasticsearch clusters. Having metrics to back up these observations would help us measure the effect of any optimization efforts on that end.
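One way to get there (a sketch only, not necessarily how the operator wires its ES client) is to wrap the http.RoundTripper used by the Elasticsearch client and record per-request, per-cluster metrics; the metric name below is illustrative:

import (
	"net/http"
	"strconv"
	"time"

	"github.com/prometheus/client_golang/prometheus"
)

// esRequestDuration is an illustrative metric, labeled by cluster, verb and status.
var esRequestDuration = prometheus.NewHistogramVec(
	prometheus.HistogramOpts{
		Name:    "operator_es_client_request_duration_seconds",
		Help:    "Duration of requests from the operator to Elasticsearch.",
		Buckets: prometheus.DefBuckets,
	},
	[]string{"cluster", "method", "status"},
)

type instrumentedTransport struct {
	cluster string
	next    http.RoundTripper
}

func (t *instrumentedTransport) RoundTrip(req *http.Request) (*http.Response, error) {
	start := time.Now()
	resp, err := t.next.RoundTrip(req)

	status := "error"
	if err == nil {
		status = strconv.Itoa(resp.StatusCode)
	}
	esRequestDuration.WithLabelValues(t.cluster, req.Method, status).Observe(time.Since(start).Seconds())
	return resp, err
}

Plus a prometheus.MustRegister(esRequestDuration) somewhere at startup; the histogram then covers both the request rate (via _count) and latency percentiles per cluster.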

sebgl (Contributor, Author) commented Dec 3, 2019

Relates to #1189.

dulltz commented Jan 10, 2020

Usage data about elastic-licensing might also be worth exposing as Prometheus metrics for admins.
This would help prevent users from inadvertently exceeding their license limits.
https://www.elastic.co/guide/en/cloud-on-k8s/master/k8s-licensing.html#k8s_getting_usage_data
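As a rough sketch only (the gauge name and the update hook below are hypothetical; the actual usage data lives in the licensing ConfigMap described at the link above), this could be exposed as a gauge that the operator refreshes whenever it recomputes usage:

import "github.com/prometheus/client_golang/prometheus"

// licensingTotalMemoryGiB is a hypothetical gauge; the underlying value
// (total managed memory) would come from the operator's licensing usage data.
var licensingTotalMemoryGiB = prometheus.NewGauge(prometheus.GaugeOpts{
	Name: "elastic_licensing_total_managed_memory_gib",
	Help: "Total memory managed by the operator, as reported in the licensing usage data.",
})

// reportLicensingUsage would be called by the (hypothetical) hook that
// recomputes licensing usage.
func reportLicensingUsage(totalMemoryGiB float64) {
	licensingTotalMemoryGiB.Set(totalMemoryGiB)
}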

pebrc removed the loe:medium label Apr 27, 2020
botelastic bot added the triage label Apr 27, 2020
thbkrkr added the >feature label and removed the triage label Apr 29, 2020
pebrc (Collaborator) commented May 25, 2020

We have controller-runtime metrics as of #214, and the Elasticsearch client is instrumented as of #1189. The question about support for Prometheus histograms is answered as of Elasticsearch 7.6 with the new histogram field mapper.

I just created #3140 to follow up on the last comment here and am suggesting to close this issue for now. Please reopen if you disagree.

pebrc closed this as completed May 25, 2020