
Instrument the operator with metrics #212

Closed
sebgl opened this issue Dec 12, 2018 · 7 comments

sebgl (Contributor) commented Dec 12, 2018

Metrics we're interested in (a rough sketch of how these could be declared follows the lists below):

  • rate of reconciliation loop executions, for:
    • stack controller
    • elasticsearchcluster controller
    • kibana controller
  • rate of reconciliation loop errors (for each of the three controllers), labeled with the error "type"
  • reconciliation loop duration/latency (for each of the three controllers): average, ideally p95 & p99

ES communication metrics (labeled with the stack name):

  • rate of requests to a stack's ES endpoint
  • duration/latency of those requests

K8s request metrics:

  • rate of requests to the apiserver, labeled by verb, route and status code
  • rate of request errors to the apiserver, labeled by verb, route and status code
  • request duration: average, percentiles

Optional (to discuss?):

  • gauge of the number of stack/ES/Kibana resources the operator is responsible for
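To make the list above concrete, here is a rough sketch of how these metrics could be declared with the Prometheus Go client (prometheus/client_golang). All metric and label names below are illustrative, not what the operator actually registers:

package metrics

import "github.com/prometheus/client_golang/prometheus"

var (
	// Rate of reconciliation loop executions, per controller
	// (stack, elasticsearchcluster, kibana).
	reconciliationsTotal = prometheus.NewCounterVec(
		prometheus.CounterOpts{
			Name: "operator_reconciliations_total", // illustrative name
			Help: "Total number of reconciliation loop executions.",
		},
		[]string{"controller"},
	)

	// Rate of reconciliation loop errors, per controller and error type.
	reconciliationErrorsTotal = prometheus.NewCounterVec(
		prometheus.CounterOpts{
			Name: "operator_reconciliation_errors_total",
			Help: "Total number of reconciliation loop errors.",
		},
		[]string{"controller", "error_type"},
	)

	// Reconciliation loop duration; averages and percentiles are derived
	// from the histogram buckets at query time.
	reconciliationDuration = prometheus.NewHistogramVec(
		prometheus.HistogramOpts{
			Name:    "operator_reconciliation_duration_seconds",
			Help:    "Duration of reconciliation loop executions.",
			Buckets: prometheus.DefBuckets,
		},
		[]string{"controller"},
	)

	// Optional: number of stack/ES/Kibana resources the operator manages.
	managedResources = prometheus.NewGaugeVec(
		prometheus.GaugeOpts{
			Name: "operator_managed_resources",
			Help: "Number of resources currently managed by the operator.",
		},
		[]string{"kind"},
	)
)

func init() {
	prometheus.MustRegister(
		reconciliationsTotal,
		reconciliationErrorsTotal,
		reconciliationDuration,
		managedResources,
	)
}

Incrementing reconciliationsTotal.WithLabelValues("kibana").Inc() at the start of each reconcile call (and the error counter on failures) would then be enough to support rate() queries on the consumer side.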

Metrics collector

1. Prometheus lib instrumentation <- Metricbeat -> Elasticsearch

  • rates are usually expressed with counters, which we visualize by applying a rate() function. Counters can be reset when the process restarts. How well does that fit ES/Kibana? We could use a derivative aggregation, but does it handle restarts well? Edit: yes, we can tweak this in the Time Series Visual Builder: https://www.youtube.com/watch?v=CNR-4kZ6v_E

  • latencies are usually histograms, with values falling into buckets that we define (see the short Go sketch after these bullets):

# TYPE stack_operator_stack_reconciliations_duration_seconds histogram
stack_operator_stack_reconciliations_duration_seconds_bucket{le="0.005"} 1
stack_operator_stack_reconciliations_duration_seconds_bucket{le="0.01"} 1
stack_operator_stack_reconciliations_duration_seconds_bucket{le="0.025"} 1
stack_operator_stack_reconciliations_duration_seconds_bucket{le="0.05"} 1
stack_operator_stack_reconciliations_duration_seconds_bucket{le="0.1"} 1
stack_operator_stack_reconciliations_duration_seconds_bucket{le="0.25"} 1
stack_operator_stack_reconciliations_duration_seconds_bucket{le="0.5"} 1
stack_operator_stack_reconciliations_duration_seconds_bucket{le="1"} 1
stack_operator_stack_reconciliations_duration_seconds_bucket{le="2.5"} 1
stack_operator_stack_reconciliations_duration_seconds_bucket{le="5"} 1
stack_operator_stack_reconciliations_duration_seconds_bucket{le="10"} 1
stack_operator_stack_reconciliations_duration_seconds_bucket{le="+Inf"} 1
stack_operator_stack_reconciliations_duration_seconds_sum 0.000187009
stack_operator_stack_reconciliations_duration_seconds_count 1

Not sure how to visualize that with ES/Kibana.

  • gauges represent values that can go up/down (a good example for that is CPU usage).
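To show where the bucket counts in the sample output above come from, here is a minimal instrumentation-side sketch, again assuming prometheus/client_golang and an illustrative reconcile function (prometheus.DefBuckets corresponds exactly to the le boundaries shown above):

var stackReconciliationDuration = prometheus.NewHistogram(prometheus.HistogramOpts{
	Name:    "stack_operator_stack_reconciliations_duration_seconds",
	Help:    "Duration of stack reconciliation loop executions.",
	Buckets: prometheus.DefBuckets, // .005 ... 10, plus the implicit +Inf bucket
})

func reconcileStack() error {
	timer := prometheus.NewTimer(stackReconciliationDuration)
	defer timer.ObserveDuration() // adds the elapsed seconds to _sum, _count and the matching buckets
	// ... actual reconciliation logic ...
	return nil
}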

2. go-metrics <- Metricbeat -> Elasticsearch

Main benefit: histograms are simpler than the Prometheus alternative, since they emit average and percentile values directly.
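For illustration (the issue doesn't say which go-metrics implementation is meant; the sketch below assumes rcrowley/go-metrics and made-up metric names), a Timer exposes mean and percentile values directly, ready for Metricbeat to ship:

import (
	"time"

	metrics "github.com/rcrowley/go-metrics"
)

func reconcileWithTimer() {
	// A Timer combines a histogram (for percentiles) with a meter (for rates).
	t := metrics.GetOrRegisterTimer("stack.reconciliations.duration", metrics.DefaultRegistry)

	start := time.Now()
	// ... actual reconciliation logic ...
	t.UpdateSince(start)

	// Pre-aggregated values are available without any rate()/histogram query:
	_ = t.Mean()           // average duration (in nanoseconds)
	_ = t.Percentile(0.95) // p95
}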

3. logs <- filebeat -> Elasticsearch

If each reconciliation loop execution is logged anyway, it's quite easy to include all metrics in the logs we produce and build dashboards from that. In other words: leave the entire aggregation to ES and don't pre-aggregate.
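A sketch of what that could look like with structured logging (logr is used here purely as an illustration, and the field names are made up); each reconciliation emits one log line, Filebeat ships it, and ES computes rates, averages and percentiles at query time:

import (
	"time"

	"github.com/go-logr/logr"
)

func logReconciliation(log logr.Logger, controller string, start time.Time, err error) {
	// One structured document per reconciliation; a Kibana visualization
	// can then aggregate over duration_seconds, controller and error.
	log.Info("reconciliation finished",
		"controller", controller,
		"duration_seconds", time.Since(start).Seconds(),
		"error", err != nil,
	)
}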

sebgl self-assigned this Dec 12, 2018
sebgl (Contributor, Author) commented Dec 13, 2018

Basic metrics are set up in PR #214, but not really exposed yet. They should probably be exposed behind an optional service, possibly with the Prometheus scrape annotations set.

pebrc added this to the Alpha milestone Feb 8, 2019
pebrc modified the milestones: Alpha, Beta Mar 14, 2019
pebrc removed this from the Beta milestone May 10, 2019
donbowman commented:

https://github.com/vvanholl/elasticsearch-prometheus-exporter
would be nice

anyasabo (Contributor) commented:

@donbowman that exporter is for the actual ES clusters themselves, whereas this issue is about instrumenting the operator. That said, since that ES exporter installs as a plugin, you can install it with an init container as described here:
https://github.com/elastic/cloud-on-k8s/blob/master/docs/snapshots.asciidoc

charith-elastic (Contributor) commented:

It would be very useful to have an instrumented Elasticsearch client and gather metrics about the API calls. One of the observations from scale testing (#357) was that the operator seems to spend most of its time on API calls when managing a large number of Elasticsearch clusters. Having metrics to back up these observations would help us measure the effect of any optimization efforts on that end.
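One way to get there (a sketch only, not necessarily how the operator wires its ES client) is to wrap the http.RoundTripper used by the Elasticsearch client and record per-request, per-cluster metrics; the metric name below is illustrative:

import (
	"net/http"
	"strconv"
	"time"

	"github.com/prometheus/client_golang/prometheus"
)

// esRequestDuration is an illustrative metric, labeled by cluster, verb and status.
var esRequestDuration = prometheus.NewHistogramVec(
	prometheus.HistogramOpts{
		Name:    "operator_es_client_request_duration_seconds",
		Help:    "Duration of requests from the operator to Elasticsearch.",
		Buckets: prometheus.DefBuckets,
	},
	[]string{"cluster", "method", "status"},
)

type instrumentedTransport struct {
	cluster string
	next    http.RoundTripper
}

func (t *instrumentedTransport) RoundTrip(req *http.Request) (*http.Response, error) {
	start := time.Now()
	resp, err := t.next.RoundTrip(req)

	status := "error"
	if err == nil {
		status = strconv.Itoa(resp.StatusCode)
	}
	esRequestDuration.WithLabelValues(t.cluster, req.Method, status).Observe(time.Since(start).Seconds())
	return resp, err
}

Plus a prometheus.MustRegister(esRequestDuration) somewhere at startup; the histogram then covers both the request rate (via _count) and latency percentiles per cluster.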

sebgl (Contributor, Author) commented Dec 3, 2019

Relates to #1189.

dulltz commented Jan 10, 2020

Usage data about elastic-licensing might also be worth exposing as Prometheus metrics for admins.
This would help prevent users from inadvertently exceeding their license limits.
https://www.elastic.co/guide/en/cloud-on-k8s/master/k8s-licensing.html#k8s_getting_usage_data
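As a rough sketch only (the gauge name and the update hook below are hypothetical; the actual usage data lives in the licensing ConfigMap described at the link above), this could be exposed as a gauge that the operator refreshes whenever it recomputes usage:

import "github.com/prometheus/client_golang/prometheus"

// licensingTotalMemoryGiB is a hypothetical gauge; the underlying value
// (total managed memory) would come from the operator's licensing usage data.
var licensingTotalMemoryGiB = prometheus.NewGauge(prometheus.GaugeOpts{
	Name: "elastic_licensing_total_managed_memory_gib",
	Help: "Total memory managed by the operator, as reported in the licensing usage data.",
})

// reportLicensingUsage would be called by the (hypothetical) hook that
// recomputes licensing usage.
func reportLicensingUsage(totalMemoryGiB float64) {
	licensingTotalMemoryGiB.Set(totalMemoryGiB)
}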

pebrc removed the loe:medium label Apr 27, 2020
botelastic bot added the triage label Apr 27, 2020
thbkrkr added the >feature label and removed the triage label Apr 29, 2020
pebrc (Collaborator) commented May 25, 2020

We have controller-runtime metrics as of #214, and the Elasticsearch client is instrumented as of #1189. The question about support for Prometheus histograms is answered as of Elasticsearch 7.6 with the new histogram field mapper.

I just created #3140 to follow up on the last comment here and am suggesting to close this issue for now. Please reopen if you disagree.

pebrc closed this as completed May 25, 2020