Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

metricdog: anonymous bottlerocket metrics #1006

Merged
merged 1 commit into from
Feb 12, 2021
Merged

metricdog: anonymous bottlerocket metrics #1006

merged 1 commit into from
Feb 12, 2021

Conversation

webern
Copy link
Contributor

@webern webern commented Jul 31, 2020

Issue Number

Closes #1000

Description

Adds a program and systemd service that checks whether certain critical services are running. Reports this to a URL via GET request.

Testing

Unit Testing

I organized the code in such a way as to facilitate unit testing everything except the calls to systemctl.

Happy Integ

I set up an S3 bucket and confirmed that metricdog requests are working. I ran metricdog manually from the command line under various scenarious and traced the URL that was produced.

Healthy Health Ping

metricdog --log-level trace send-health-ping

Result:

https://somedomain.com/metricstest?sender=metricdog&event=health-ping&version=0.4.1&variant=aws-k8s-1.15&arch=x86_64&region=us-west-2&seed=1712&version-lock=latest&ignore-waves=false&failed_services=&is_healthy=true

Boot Success

metricdog --log-level trace send-boot-success
https://somedomain.com/metricstest?sender=metricdog&event=boot-success&version=0.4.1&variant=aws-k8s-1.15&arch=x86_64&region=us-west-2&seed=1712&version-lock=latest&ignore-waves=false

One Failed Service

I 'broke' kubelet and ran a health ping, and received:

https://somedomain.com/metricstest?sender=metricdog&event=health-ping&version=0.4.1&variant=aws-k8s-1.15&arch=x86_64&region=us-west-2&seed=1712&version-lock=latest&ignore-waves=false&failed_services=kubelet%3A255&is_healthy=false

The failed services list de-escapes to kubelet:255

Two Failed Services

I 'broke' an additional service:

https://somedomain.com/metricstest?sender=metricdog&event=health-ping&version=0.4.1&variant=aws-k8s-1.15&arch=x86_64&region=us-west-2&seed=1605&version-lock=latest&ignore-waves=false&failed_services=containerd%3A1%2Ckubelet%3A255&is_healthy=false

The failed services list de-escapes to containerd:1,kubelet:255.

Bad URL Integ

To make sure that a bad URL doesn't cause problems, I ran an instance with a fake URL. I found that that, send-boot-success and send-health-ping each caused an error to be reported in the system journal, but no other ill affects were observed.

Systemd

I ran hosts and observed that the systemd unit and timer work as expected.

Edit 8/28/2020: I tested an ECS variant with ab6930c and found the the URLs are still being constructed properly, including the reporting of failed services.

Migration

I tested the migration through upgrade and downgrade, confirming that settings were added/removed and the system was healthy at all three waypoints.

Terms of contribution:

By submitting this pull request, I agree that this contribution is dual-licensed under the terms of both the Apache License, version 2.0, and the MIT license.

Copy link
Contributor

@tjkirch tjkirch left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Some comments from the surrounding bits; still need to review healthdog itself.

sources/models/src/aws-k8s-1.15/mod.rs Outdated Show resolved Hide resolved
packages/os/healthdog-toml Outdated Show resolved Hide resolved
packages/os/healthdog.service Outdated Show resolved Hide resolved
packages/os/healthdog.service Outdated Show resolved Hide resolved
packages/os/healthdog.timer Outdated Show resolved Hide resolved
packages/os/healthdog.timer Outdated Show resolved Hide resolved
sources/models/defaults.toml Outdated Show resolved Hide resolved
sources/models/defaults.toml Outdated Show resolved Hide resolved
sources/models/src/aws-k8s-1.15/override-defaults.toml Outdated Show resolved Hide resolved
@webern
Copy link
Contributor Author

webern commented Aug 4, 2020

webern force-pushed the webern:healthdog branch from ec0abbf to f18d841

  • renamed settings.healthdog to settings.metrics.
  • renamed region to aws-region.
  • added (and tested) settings for ecs and dev variants.
  • made suggested changes to systemd configs (and tested them)

Copy link
Contributor

@tjkirch tjkirch left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not quite done with service_check, but have to stop for the moment...

sources/healthdog/Cargo.toml Outdated Show resolved Hide resolved
sources/healthdog/Cargo.toml Outdated Show resolved Hide resolved
sources/healthdog/src/main_test.rs Outdated Show resolved Hide resolved
sources/healthdog/src/main_test.rs Outdated Show resolved Hide resolved
sources/models/src/aws-dev/override-defaults.toml Outdated Show resolved Hide resolved
sources/healthdog/src/healthdog.rs Outdated Show resolved Hide resolved
sources/healthdog/src/healthdog.rs Outdated Show resolved Hide resolved
sources/healthdog/src/service_check.rs Outdated Show resolved Hide resolved
sources/healthdog/src/service_check.rs Outdated Show resolved Hide resolved
sources/healthdog/src/service_check.rs Outdated Show resolved Hide resolved
sources/models/src/aws-dev/override-defaults.toml Outdated Show resolved Hide resolved
sources/models/defaults.toml Outdated Show resolved Hide resolved
sources/models/src/aws-ecs-1/override-defaults.toml Outdated Show resolved Hide resolved
@tjkirch
Copy link
Contributor

tjkirch commented Aug 7, 2020

Maybe also chrony, containerd, host-containerd, since they're all related to default settings.

host-containerd sure, but containerd might not be used in all variants..?

@webern webern marked this pull request as draft August 10, 2020 18:05
@webern
Copy link
Contributor Author

webern commented Aug 24, 2020

webern force-pushed the webern:healthdog branch from f18d841 to 4401e25

First attempt at a slightly precarious rebase.

@webern
Copy link
Contributor Author

webern commented Aug 24, 2020

webern force-pushed the webern:healthdog branch from 4401e25 to 809f98f

Fix horrible things from the first rebase attempt.

@webern
Copy link
Contributor Author

webern commented Aug 24, 2020

webern force-pushed the webern:healthdog branch from 809f98f to 3fae608

Undo a tiny re-order that was a rebase artifact.

@webern
Copy link
Contributor Author

webern commented Aug 25, 2020

webern force-pushed the webern:healthdog branch from 3fae608 to 1083d9b

Use aws_region instead of region and normalize on snake_case for params.

@webern
Copy link
Contributor Author

webern commented Aug 25, 2020

webern force-pushed the webern:healthdog branch from 1083d9b to d62f75d

Rebase the swift-moving develop again.

@webern
Copy link
Contributor Author

webern commented Aug 25, 2020

webern force-pushed the webern:healthdog branch from d62f75d to f515d38

Create a join-strings helper in schnauzer to simplify the healthdog-toml template.

@webern
Copy link
Contributor Author

webern commented Aug 26, 2020

webern force-pushed the webern:healthdog branch from f515d38 to 2a5aef0

Fixes most of the lowest-hanging fruit in the PR comments.

@webern
Copy link
Contributor Author

webern commented Aug 26, 2020

webern force-pushed the webern:healthdog branch from 2a5aef0 to 9dc047c

Weeee... rebase the migration versions again from develop! :)

@webern
Copy link
Contributor Author

webern commented Aug 26, 2020

webern force-pushed the webern:healthdog branch from 9dc047c to a58c5e2

Use structopt.

@webern
Copy link
Contributor Author

webern commented Aug 26, 2020

webern force-pushed the webern:healthdog branch from a58c5e2 to cbc5d1d

Rename the services setting to service-checks.

@webern
Copy link
Contributor Author

webern commented Aug 28, 2020

Maybe also chrony, containerd, host-containerd, since they're all related to default settings.

host-containerd sure, but containerd might not be used in all variants..?

It's in all current variants. I think it would be good to add these to the top-level defaults.toml. A variant that doesn't use containerd can still override.

P.S. We're actually overriding for all current variants now (on my WIP branch), so defaults.toml more like an example anyway.

@webern
Copy link
Contributor Author

webern commented Aug 28, 2020

webern force-pushed the webern:healthdog branch from cbc5d1d to 11d0d7b

  • Enormous improvement to the systemctl exit code parsing (thank you, Tom)
  • Add containerd and chronyd to the default service checks
  • fix some issues with my renaming of the service-checks setting

Copy link
Contributor

@jamieand jamieand left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looking specifically at the interactions with systemd:

It'd be nice to eventually add support for non-lingering oneshot services, but I think this is a good place to start.

@webern
Copy link
Contributor Author

webern commented Aug 28, 2020

webern force-pushed the webern:healthdog branch from 328b475 to 3909ee7

Fix remaining issues, esp. providing construction defaults in two places.

@webern
Copy link
Contributor Author

webern commented Aug 28, 2020

webern force-pushed the webern:healthdog branch from 3909ee7 to ab6930c

Fix a bug with the exit code parsing.

@webern webern requested a review from etungsten August 28, 2020 22:41
Copy link
Contributor

@tjkirch tjkirch left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we should have the - prefix to not block startup, but otherwise these are basically nits. Thanks!

packages/os/mark-successful-boot.service Outdated Show resolved Hide resolved
sources/Cargo.toml Outdated Show resolved Hide resolved
sources/api/schnauzer/src/helpers.rs Outdated Show resolved Hide resolved
sources/api/schnauzer/src/helpers.rs Outdated Show resolved Hide resolved
sources/models/defaults.toml Outdated Show resolved Hide resolved
sources/metricdog/src/metricdog.rs Outdated Show resolved Hide resolved
sources/metricdog/src/metricdog.rs Outdated Show resolved Hide resolved
sources/metricdog/src/service_check.rs Show resolved Hide resolved
sources/metricdog/src/service_check.rs Outdated Show resolved Hide resolved
sources/metricdog/src/service_check.rs Outdated Show resolved Hide resolved
@webern
Copy link
Contributor Author

webern commented Oct 5, 2020

webern force-pushed the webern:healthdog branch from 153acff to b8b5c3a

Addresses some of the latest comments.

@webern
Copy link
Contributor Author

webern commented Oct 6, 2020

webern force-pushed the webern:healthdog branch from b8b5c3a to c966f19

Addresses more PR comments. Everything is addressed except for the version bumping of the migration. Will do that after our next release and rebase then.

sources/api/schnauzer/src/lib.rs Outdated Show resolved Hide resolved
sources/metricdog/src/main.rs Show resolved Hide resolved
sources/api/schnauzer/src/helpers.rs Outdated Show resolved Hide resolved
Copy link
Contributor

@tjkirch tjkirch left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry, I wasn't trying to block this, I forgot to come back after the SimpleLogger discovery. Just a couple suggestions above, and we need to update the version now that 1.0.2 is released.

@webern
Copy link
Contributor Author

webern commented Oct 20, 2020

webern force-pushed the webern:healthdog branch from c966f19 to 7404d48

  • rename schnauzer helper from join-array to join_array
  • change schnauzer join_arrayhelper to not add spaces
  • rebase
  • Change metrics URL to include the v1/metrics path, which the back-end expects and enforces.

Note: I wish I had done the rebase separately. I will remember to do rebases in their own push in the future to make diffing easier.

@webern
Copy link
Contributor Author

webern commented Oct 20, 2020

webern force-pushed the webern:healthdog branch from c966f19 to 7404d48

Rebase, fix conflict in Cargo.toml

Copy link
Contributor

@bcressey bcressey left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nicely done!

Comment on lines 11 to 12
# Don't fire at exactly the same second across machines started together.
RandomizedDelaySec=10
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We probably want more jitter than this, maybe as much as an hour.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added jitter in this push 11b0927

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oops set jitter to 3600 this time for one hour 92230f5

sources/metricdog/README.md Outdated Show resolved Hide resolved
sources/metricdog/src/metricdog.rs Outdated Show resolved Hide resolved
packages/os/metricdog.service Show resolved Hide resolved
* `version`: the Bottlerocket version.
* `variant`: the Bottlerocket variant.
* `arch`: the machine architecture, e.g.'x86_64' or 'arm'.
* `aws_region`: the AWS region the machine is running in.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does this need to be AWS specific? As far as I can tell, we could just leave this as "region" and be a bit more agnostic so other platforms can make use of this field as well

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If it were running on another platform that had the same region name, we'd need some way to tell the difference, and I don't think the other fields would help. Another option would be a "platform" field set to "aws", and a "region" field?

Copy link
Contributor Author

@webern webern Oct 21, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

TL;DR: I agree region is better than aws_region, and I would prefer not to add a field such as platform=aws.

I initially called this field region and I strongly agree with changing it back to region. My thinking is that region is general-enough of a concept to be potentially useful with any cloud provider, and even for on-prem deployments, since it really just means 'general location'. In fact, I dislike calling aws_region so much that the current version of the backend accepts either region or aws_region! I should have spoken up sooner.

My disinclination to include platform=aws is less strong, but it feels unnecessary since the variant value conveys that information, and especially because we have not yet decided how we plan to codify such 'attributes of variant' in the system itself.

I don't think the other fields would help

Adding platform=aws is currently no different than substr(variant, 3).

P.S. Another way of stating my objection for platform=aws is, where does the value come from at runtime or compile time? Basically if variant in (aws-ecs-1, aws-k8s.1.15, aws-k8s.1.16, aws-k8s.1.17, aws-k8s.1.18) then platform=aws? Doesn't seem any cleaner to code that logic into the build/runtime of Bottlerocket right now than to do so when interpreting the metrics.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think variant will convey all of the information we need. aws-dev can run anywhere, but may only be broken on one platform, for example.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Re: the region field - let's change it to be region; we all seem to agree on that.

Thinking about the platform field - I think it's a good idea, but it raises some questions like "How do we know what platform we're on?, etc"

I don't think we need to block on this, and if we choose to ship without the changes, let's open an issue to track fixing this for other platforms.

Copy link
Contributor

@zmrow zmrow left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🥼

@webern
Copy link
Contributor Author

webern commented Nov 3, 2020

webern force-pushed the webern:healthdog branch from a938b1f to 7c89fe2

Changes the name of the aws_region query parameter back to region (as discussed above).

@webern
Copy link
Contributor Author

webern commented Nov 3, 2020

webern force-pushed the webern:healthdog branch from 7c89fe2 to 518f4c6

Rebase.

@gregdek gregdek added the status/needs-triage Pending triage or re-evaluation label Jan 7, 2021
@tjkirch tjkirch removed the status/needs-triage Pending triage or re-evaluation label Jan 7, 2021
@webern
Copy link
Contributor Author

webern commented Jan 13, 2021

webern force-pushed the webern:healthdog branch from 518f4c6 to 520e55d

A substantial rebase.

@bcressey
Copy link
Contributor

metricdog should also respect settings.network.https-proxy and settings.network.no-proxy.

@webern
Copy link
Contributor Author

webern commented Jan 28, 2021

metricdog should also respect settings.network.https-proxy and settings.network.no-proxy.

Maybe as simple as adding .proxy(reqwest::Proxy::http("https://my.prox")?) to the reqwest client when settings.network.https-proxy is set? I'm not sure I understand what settings.network.no-proxy means at the moment.

@webern
Copy link
Contributor Author

webern commented Feb 4, 2021

webern force-pushed the webern:healthdog branch from 520e55d to b7a6dfe

rebase

@webern
Copy link
Contributor Author

webern commented Feb 9, 2021

webern force-pushed the webern:healthdog branch from b7a6dfe to 10a32f7

rebase

@webern
Copy link
Contributor Author

webern commented Feb 10, 2021

webern force-pushed the webern:healthdog branch from 10a32f7 to c6e68bf

Add migration to Release.toml

@webern
Copy link
Contributor Author

webern commented Feb 11, 2021

webern force-pushed the webern:healthdog branch from c6e68bf to 3db70e2

Fixed the migration so it does the right things. I tested it through upgrade and downgrade, will add that to the PR notes.

@webern webern requested a review from tjkirch February 12, 2021 16:14
@webern
Copy link
Contributor Author

webern commented Feb 12, 2021

webern force-pushed the webern:healthdog branch from 3db70e2 to 11b0927

Increase jitter to one hour.

@webern
Copy link
Contributor Author

webern commented Feb 12, 2021

webern force-pushed the webern:healthdog branch from 560f336 to 92230f5

  • Correct the Jitter to 3600 (one hour)
  • Add a TimeoutStartSec to the service

Metricdog is a new command-line program that sends anonymous information
about a Bottlerocket instance to a metrics endpoint. A report of boot
success is sent by the mark-boot-success service and at 2 minutes after
boot, and every 6 hours thereafter, a report of service health is sent.
@webern
Copy link
Contributor Author

webern commented Feb 12, 2021

webern force-pushed the webern:healthdog branch from 92230f5 to 312eff5

The last rebase?

@webern webern merged commit b51b3b2 into bottlerocket-os:develop Feb 12, 2021
@webern webern deleted the healthdog branch February 12, 2021 23:44
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Send anonymous health metrics
8 participants