
Jaeger operational observability #1054

Open
vprithvi opened this issue Sep 7, 2018 · 13 comments

Comments

@vprithvi
Contributor

vprithvi commented Sep 7, 2018

Requirement - what kind of business use case are you trying to solve?

Verifying whether all components of Jaeger have been deployed successfully without needing to run special applications, etc.

I would like confirmation of Jaeger agent and client health, version, reachability, and their ability to send spans.

Problem - what in Jaeger blocks you from solving the requirement?

  • none

Proposal - what do you suggest to solve the problem or improve the existing situation?

Have a page in the UI that allows users to see the following:

  • Hosts where Jaeger agents are running, their configuration values, version, and the last time the agent interacted with Jaeger Collector
  • Service names of services reporting to individual agents, their configuration values, version, and the last time a span was sent from that service
  • Effective sampling rate per service

This page is read-only, maintains no history, and only shows this information in real time.

Any open questions to address

Determine how to get this data effectively with minimal changes. One approach may be to have Jaeger agents send spans when they start up and shut down.

@jpkrohling
Contributor

I think the main source of information should be the collector. A new module (say, "admin") would then query all known collectors and aggregate this information when needed.

Ideally, this admin module would have an extra management endpoint, to get notified when new collectors are added/removed from the cluster.

@pavolloffay
Member

pavolloffay commented Sep 7, 2018

Is it related to #789 (Could we provide pre-made Grafana dashboards for Jaeger backend components?)?

@yurishkuro
Member

Isn't there a way to solve this with existing Observability tools instead of building a bespoke solution?

@vprithvi
Contributor Author

@yurishkuro What are you suggesting?

@yurishkuro
Member

Expose metrics so that all these signals can be observed via existing tools. Maybe provide base Grafana dashboards.
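One common Prometheus convention for exposing this kind of data is an info-style gauge whose value is always 1, with the variable data carried in labels rather than in the metric value (the metric and label names below are illustrative, not existing Jaeger metrics):

```text
# HELP jaeger_agent_build_info Build information about the running jaeger-agent.
# TYPE jaeger_agent_build_info gauge
jaeger_agent_build_info{version="1.8.0",revision="abc1234",hostname="agent-01"} 1
```

Dashboards can then group or count by the version label without the metrics backend having to store free-form strings as metric values.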

@vprithvi
Contributor Author

vprithvi commented Oct 29, 2018

I agree certain things can be served by Grafana, but I'm not convinced it is a good solution.

Let's take the problem of determining which hosts jaeger-agents run on and what versions these agents are on. What metrics would we emit to surface this?

Note that version numbers, host names, and IPv6 addresses are alphanumeric and might not be easily stored by most metrics backends. I'm not sure that storing these as tags is a good idea either, because tags usually don't have the same kind of lifecycle management as the metrics themselves.

@yurishkuro
Member

Note that version numbers, host names, and IPv6 addresses are alphanumeric and might not be easily stored by most metrics backends.

These attributes are often provided by metrics collection pipeline. E.g. if Prometheus discovers a running service and scrapes its metrics, it already knows version, host, etc. of the process and can include them as tags without our code trying to figure out what they are (which is harder and sometimes impossible).
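For instance, a Prometheus scrape configuration using Kubernetes service discovery can copy discovery metadata onto every scraped series; the pod label names below are assumptions for illustration, not Jaeger conventions:

```yaml
scrape_configs:
  - job_name: jaeger-agent
    kubernetes_sd_configs:
      - role: pod
    relabel_configs:
      # Keep only pods labeled as jaeger-agent (label name is illustrative).
      - source_labels: [__meta_kubernetes_pod_label_app]
        regex: jaeger-agent
        action: keep
      # Attach the pod's version label and node name as metric labels.
      - source_labels: [__meta_kubernetes_pod_label_version]
        target_label: version
      - source_labels: [__meta_kubernetes_pod_node_name]
        target_label: host
```

Here the version and host labels come from the environment at scrape time, so the agent's own code never has to discover them.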

My view is that we need to provide minimal required observability that is under control of the components themselves, and not venture into trying to capture runtime environments.

@vprithvi
Contributor Author

These attributes are often provided by metrics collection pipeline.

I'm not sure - re-reading the ticket, my intent was to determine whether all components are wired up correctly and are connected.

In this context, I believe the question we are answering is whether there is a jaeger-agent with a compatible version range connected to jaeger-collector, and which host/etc that it is running on.

I'm opposed to solely relying on metrics because of the following:

  1. Tracing, an observability tool, shouldn't depend on another observability tool to report on its own configuration and deployment. This also makes local verification more involved, because people need to set up a metrics reporter, collector, and UI correctly before they can check anything.
  2. Assumptions are being made by the metrics pipeline - what happens for users not using Prometheus? Or for users who have so many hosts that they are unable to label them in Prometheus?

@jpkrohling
Contributor

my intent was to determine whether all components are wired up correctly and are connected

This is absolutely something we need, but I think @yurishkuro's question is valid. Depending on how users provision their Jaeger in production, they would have a service map and/or inventory defined somewhere already. As an alternative, I think @isaachier mentioned in another ticket a nice approach using a credit-based system, where clients would spend credits to check via HTTP whether a certain debug span was received by the agent. So, clients could emit this debug span upon bootstrap if a certain env var/config option is set.

About the metrics part, we expose the metrics in Prometheus format, but that doesn't mean that only Prometheus can read it: pretty much all metrics systems nowadays can read Prometheus "format" (OpenMetrics).

I think this request did not come from any end user, so we are not sure we actually need this UI. I think @objectiser is working on a Grafana dashboard. Perhaps it would be better to wait for that task to complete and then assess what is missing.

@vprithvi
Contributor Author

vprithvi commented Apr 4, 2019

Depending on how users provision their Jaeger in production, they would have a service map and/or inventory defined somewhere already.

This is very likely, but there is an assumption here that end users of tracing have access to this list, which might not always be the case. Additionally, in some organizations, the jaeger-agent on a host might be operated by a different group of people than those who operate jaeger-collector. (This might extend to metrics systems.)

Currently, debugging any connectivity issues is extremely painful even without crossing organizational boundaries. When there are organizational boundaries and limited access to hosts running Jaeger components, it is very time consuming to figure out whether spans fail to reach collectors due to misconfiguration or connectivity problems between Jaeger components.

Tools like Flink have a status page that shows connected components and their status, which really aids in debugging. I feel that this could be quite useful for us.

@yurishkuro At a minimum, I would like to capture the following:

  • agent hostname / ip
  • agent version
  • timestamp of last successful span submission

Any objections?

@vprithvi
Contributor Author

vprithvi commented Apr 8, 2019

@yurishkuro bump

@yurishkuro
Member

I don't have a fundamental objection to this, but I prefer not to duplicate the data already present in the metrics. We recently did troubleshooting of connectivity from another DC, and I'm not sure that this extra page would have helped over just looking at metrics (which the other team couldn't do because our internal binary does not allow switching to Prometheus).

@vprithvi
Contributor Author

vprithvi commented Apr 9, 2019

I don't have a fundamental objection to this, but I prefer not to duplicate the data already present in the metrics.

I also prefer not to duplicate metrics, but I feel that some duplication can have a lot of benefits. In fact, we are already in the process of doing this in #1465 where we are duplicating uptime and start time metrics.

I feel that removing the metrics dependency to answer the question of which jaeger-agents have been configured and started up correctly is a good introspection capability to have at the collector.
