
Kafka-Minion as alternative to Burrow for consumer lag monitoring #259

Merged · 17 commits · May 6, 2019

Conversation

@solsson (Contributor) commented Apr 4, 2019

See #255 (comment)

@weeco FYI

Only tested in Minikube so far, with three replicas. With two Kafka replicas I got {"error":"kafka server: Replication-factor is invalid.","level":"panic","msg":"failed to get partition count","time":"2019-04-04T03:23:38Z","topic":"__consumer_offsets"}, which might have been a config error (default replication factor not updated, etc.).

@weeco commented Apr 4, 2019

Kafka Minion tries to launch a partition consumer for each partition of the consumer offsets topic, so it first has to get the topic's partition count. I can imagine two reasons why it failed:

  1. You haven't committed any consumer offsets yet, so the topic does not exist
  2. The consumer offsets topic has a different name (the topic name can be configured)

Is one of these two conditions true? If not, I'll investigate further.
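Schematically, that startup sequence — fetch the partition count for __consumer_offsets, then launch one consumer per partition — looks like the sketch below. This is a hypothetical Python illustration, not Kafka Minion's actual Go implementation; `get_partitions` and `consume` stand in for real Kafka client calls.

```python
import threading

def start_offset_consumers(get_partitions, consume, topic="__consumer_offsets"):
    """Launch one consumer thread per partition of the offsets topic.

    get_partitions(topic) stands in for fetching topic metadata; like the
    real client call, it raises if the topic does not exist yet (i.e. no
    consumer offsets have been committed), which aborts startup.
    """
    partitions = get_partitions(topic)  # fails hard when the topic is missing
    threads = []
    for partition in partitions:
        t = threading.Thread(target=consume, args=(partition,), daemon=True)
        t.start()
        threads.append(t)
    return threads
```

If the metadata lookup raises because the topic is absent, no consumer is started at all, which matches the panic in the log above.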

@solsson (Contributor, Author) commented Apr 5, 2019

> You haven't got any committed consumer offsets and therefore the topic does not exist yet

Yes, that could have been the case. It was a new cluster. I'll check more closely next time.

@weeco commented Apr 22, 2019

Did you have a chance to give it a spin? I just released v0.1.2 with some more features :-)

@weeco left a comment

I added some comments, but I'll be working on the Kafka Minion helm chart during the next few days and I'll submit more comments about the K8s manifests.

failureThreshold: 1
httpGet:
port: http
path: /metrics
@weeco:

Kafka Minion v0.1.2 introduces a dedicated readiness check, which returns 200 once Kafka Minion has initially consumed the __consumer_offsets topic, the point at which it starts exposing metrics. This is required to run Kafka Minion in high availability with multiple replicas, which is recommended if you intend to set up alerting on these metrics.

Since the initial consumption can take some time, it requires loose timeouts:

          readinessProbe:
            initialDelaySeconds: 10
            periodSeconds: 10
            timeoutSeconds: 5
            failureThreshold: 60 # 60 * 10s equals 10min, should be adapted depending on the given resources and size of consumer offsets topic
            httpGet:
              path: /readycheck
              port: http
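As a sanity check on those numbers: the kubelet waits initialDelaySeconds before the first probe, then needs failureThreshold consecutive failures spaced periodSeconds apart before marking the pod unready. The worst-case budget works out roughly as (a quick illustrative calculation, not part of the chart):

```python
def probe_failure_budget(initial_delay, period, failure_threshold):
    """Approximate seconds from container start until a never-ready
    container exhausts its readiness probe budget."""
    return initial_delay + period * failure_threshold

# Values from the readinessProbe above: 10 + 10 * 60 = 610 seconds (~10 min)
print(probe_failure_budget(10, 10, 60))
```

As the inline comment in the manifest notes, this should be adapted to the resources available and the size of the consumer offsets topic.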

@solsson (Contributor, Author):

5fc33f4 switches to this endpoint but keeps everything else at the default.

failureThreshold: 3
httpGet:
port: http
path: /metrics
@weeco:

We have a separate endpoint which checks whether it's still connected to at least one Kafka broker:

          livenessProbe:
            initialDelaySeconds: 10
            periodSeconds: 10
            timeoutSeconds: 5
            failureThreshold: 3
            httpGet:
              path: /healthcheck
              port: http

@solsson (Contributor, Author):

5fc33f4 switches to this endpoint but keeps everything else at the default.

## Consumer lag monitoring

See [Burrow](../linkedin-burrow)
or [Kafka Minion](../consumers-prometheus/)
@weeco:

Maybe add some comments on what one may prefer depending on the use case / environment?

  • Many Kafka clusters to monitor with just one exporter? => Burrow
  • Only interested in consumer health checks? => Burrow
  • Want metrics in Prometheus? => Kafka Minion
  • Looking for HA support? => Kafka Minion
  • Using versioning in group IDs (e.g. consumer group name "email-sender-5", where 5 indicates the version)? => Kafka Minion

In fact they can complement each other, and it may be perfectly valid to operate both of them.

@solsson (Contributor, Author):

There's lots and lots of research to be done by anyone who wants to set up a Kafka stack, and I see this repository as a collection of examples rather than a place to discuss the choices.

@solsson (Contributor, Author) commented May 6, 2019

I've validated with a dev Prometheus stack now. Adding the Grafana dashboard will be a separate PR, because I didn't want to deal with the jsonnet stuff in https://github.com/coreos/kube-prometheus when Kustomize can produce config maps.

@weeco The current v0.1.2 image has an oddity. One layer is ENV VERSION=0.1.1.

(The v0.1.2 build had an env saying 0.1.1; later fixed in 5a9b9f3.)

@solsson merged commit d5fc680 into master on May 6, 2019
@weeco commented May 6, 2019

> @weeco The current v0.1.2 image has an oddity. One layer is ENV VERSION=0.1.1.

I am aware; that's a fault of mine. It'll be fixed with v0.1.3.
