
Add usage report into Loki. #5361

Merged: 18 commits merged into grafana:main on Feb 10, 2022

Conversation

@cyriltovena (Contributor) commented on Feb 10, 2022

What this PR does / why we need it:

This PR adds usage reporting to grafana.com into Loki.

It adds a new module that will never fail: when running, the module tries to reach a consensus on the cluster's unique ID and then sends a report from every running component every hour. The cluster ID is used on the server side to aggregate data from all components.
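In rough terms the reporting loop might look like the Go sketch below. This is a minimal illustration of the description above, not the PR's actual code; the runReportLoop name, reportInterval constant and send callback are assumptions.

package usagestats

import (
	"context"
	"log"
	"time"
)

// reportInterval is the hourly cadence stated in the description (assumed name).
const reportInterval = time.Hour

// runReportLoop sends one report per interval and only logs failures, so the
// reporting module itself can never take a Loki component down.
func runReportLoop(ctx context.Context, send func(ctx context.Context, at time.Time) error) {
	ticker := time.NewTicker(reportInterval)
	defer ticker.Stop()
	for {
		select {
		case <-ctx.Done():
			return
		case now := <-ticker.C:
			if err := send(ctx, now); err != nil {
				log.Printf("usage report failed, retrying at the next interval: %v", err)
			}
		}
	}
}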

How does the consensus work?

Ingesters are the leaders in the consensus, meaning they are the only components that can actually store the unique ID in the object store. They use the Loki kv store and the object store to persist the data across restarts.

Each ingester proceeds as follows:

  • Check if the cluster ID exists in the kv store.
    • If it does, verify that it also exists in the object store and reconcile if needed.
  • Check if the cluster ID exists in the object store and, if so, reconcile the kv store.
  • If neither store has one, the ingester tries to CAS the kv store to set a new cluster ID; if it wins, it stores the cluster ID in the object store.
  • Finally, it uses that cluster ID to send reports.

Other components (followers) simply retry indefinitely to fetch the cluster ID from the object store; once they have it, they start sending reports with that ID.

If there are repeated failures to unmarshal the cluster ID, any component can decide to nuke it.
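A rough Go sketch of the ingester-side election described above is shown below. The kvClient and objectClient interfaces, the key name and the JSON field are illustrative assumptions standing in for the real Loki kv store and object store clients, not the PR's implementation.

package usagestats

import (
	"context"

	"github.com/google/uuid"
)

// ClusterSeed is the persisted cluster identity (field name assumed).
type ClusterSeed struct {
	UID string `json:"UID"`
}

// Hypothetical minimal stores standing in for the Loki kv store and object store.
type kvClient interface {
	Get(ctx context.Context, key string) (*ClusterSeed, error) // nil when absent
	CAS(ctx context.Context, key string, update func(in *ClusterSeed) (*ClusterSeed, error)) error
}

type objectClient interface {
	GetSeed(ctx context.Context) (*ClusterSeed, error) // nil when absent
	PutSeed(ctx context.Context, seed ClusterSeed) error
}

func electSeed(ctx context.Context, kv kvClient, obj objectClient) (*ClusterSeed, error) {
	// 1. The kv store already has a seed: make sure the object store agrees.
	if seed, err := kv.Get(ctx, "cluster_seed"); err == nil && seed != nil {
		if stored, _ := obj.GetSeed(ctx); stored == nil {
			if err := obj.PutSeed(ctx, *seed); err != nil {
				return nil, err
			}
		}
		return seed, nil
	}
	// 2. The object store has a seed: reconcile the kv store from it.
	if stored, err := obj.GetSeed(ctx); err == nil && stored != nil {
		_ = kv.CAS(ctx, "cluster_seed", func(in *ClusterSeed) (*ClusterSeed, error) { return stored, nil })
		return stored, nil
	}
	// 3. Neither store has one: CAS a fresh ID into the kv store; the winner persists it.
	fresh := ClusterSeed{UID: uuid.NewString()}
	if err := kv.CAS(ctx, "cluster_seed", func(in *ClusterSeed) (*ClusterSeed, error) {
		if in != nil {
			fresh = *in // another ingester won the race, adopt its seed
			return in, nil
		}
		return &fresh, nil
	}); err != nil {
		return nil, err
	}
	if err := obj.PutSeed(ctx, fresh); err != nil {
		return nil, err
	}
	return &fresh, nil
}

Followers would simply poll objectClient.GetSeed until it returns a value and then start reporting with that ID.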

What happens if we change to a new object store?

Since we also store the cluster ID in the kv store, an ingester will notice that it is missing from the new object store and will try to reconcile.
This means that if you wipe the object store AND the kv store at the same time, you will end up with a new cluster ID, but we consider this case to be rare.

What stats are we sending?

Full disclaimer: we are not sending any confidential data, only information about:

  • What object store is being used?
  • What is the scale of the data being ingested?
  • How fast are we ingesting and flushing?
  • How fast are queries in that cluster?
  • Version, CPU count and memory size.

See the JSON report below.

This is a report from a single binary; if you are running multiple components, some stats may be missing from one component to another.

JSON report:
{
	"clusterID": "f06b33a4-be8a-45d5-a8f9-9667f003b700",
	"createdAt": "2022-02-09T08:32:10.26395+01:00",
	"interval": "2022-02-10T08:36:10.26395+01:00",
	"target": "all",
	"version": {
		"version": "",
		"revision": "",
		"branch": "",
		"buildUser": "",
		"buildDate": "",
		"goVersion": "go1.17.2"
	},
	"os": "darwin",
	"arch": "amd64",
	"edition": "oss",
	"metrics": {
		"ingester_flushed_chunks_age_seconds": {
			"stddev": 0,
			"stdvar": 0,
			"avg": 32857.973619,
			"count": 1,
			"min": 32857.973619,
			"max": 32857.973619
		},
		"num_cpu": 16,
		"distributor_replication_factor": 1,
		"ingester_streams_count": 1,
		"query_metric_bytes_per_second": {
			"avg": 86512.48688046652,
			"count": 1715,
			"min": 0,
			"max": 7001745,
			"stddev": 578305.5424162439,
			"stdvar": 334437300389.34607
		},
		"query_metric_lines_per_second": {
			"min": 0,
			"max": 308201,
			"stddev": 25884.49007341756,
			"stdvar": 670006826.3608522,
			"avg": 3873.5586005830855,
			"count": 1715
		},
		"ingester_active_tenants": 1,
		"ingester_target_size_bytes": 1572864,
		"memstats": {
			"sys": 70534152,
			"heap_alloc": 33771944,
			"num_gc": 101,
			"gc_cpu_fraction": 0.00025775059945585605,
			"alloc": 33771944,
			"total_alloc": 1515006248,
			"heap_inuse": 41517056,
			"stack_inuse": 3997696,
			"pause_total_ns": 19223528
		},
		"compactor_retention_enabled": "false",
		"distributor_bytes_received": {
			"total": 30968,
			"rate": 516.1260609192866
		},
		"ingester_flushed_chunks": {
			"total": 0,
			"rate": 0
		},
		"query_log_bytes_per_second": {
			"stddev": 663299.4385104065,
			"stdvar": 439966145128.22064,
			"avg": 101709.73578717193,
			"count": 2744,
			"min": 0,
			"max": 7778734
		},
		"store_object_type": "filesystem",
		"ingester_flushed_chunks_lines": {
			"avg": 594,
			"count": 1,
			"min": 594,
			"max": 594,
			"stddev": 0,
			"stdvar": 0
		},
		"ingester_wal": "enabled",
		"ingester_chunk_created": {
			"total": 0,
			"rate": 0
		},
		"ingester_compression": "gzip",
		"ingester_flushed_chunks_lifespan_seconds": {
			"stdvar": 0,
			"avg": 9.126944444444444,
			"count": 1,
			"min": 9.126944444444444,
			"max": 9.126944444444444,
			"stddev": 0
		},
		"ingester_flushed_chunks_utilization": {
			"avg": 0.0017712910970052083,
			"count": 1,
			"min": 0.0017712910970052083,
			"max": 0.0017712910970052083,
			"stddev": 0,
			"stdvar": 0
		},
		"num_goroutine": 258,
		"distributor_lines_received": {
			"total": 3871,
			"rate": 64.51575570097039
		},
		"compactor_default_retention": "31d",
		"store_schema": "v11",
		"query_log_lines_per_second": {
			"count": 2744,
			"min": 0,
			"max": 315413,
			"stddev": 27780.167284388925,
			"stdvar": 771737694.3486327,
			"avg": 4281.011297376088
		},
		"store_index_type": "boltdb-shipper",
		"ingester_flushed_chunks_bytes": {
			"min": 3008,
			"max": 3008,
			"stddev": 0,
			"stdvar": 0,
			"avg": 3008,
			"count": 1
		}
	}
}

Special notes for your reviewer:

Found a bug in dskit and had to re-vendor a fix; see grafana/dskit#132.

Fixes #5062

Checklist

  • Documentation added
  • Tests updated
  • Add an entry in the CHANGELOG.md about the changes.

Signed-off-by: Cyril Tovena <cyril.tovena@gmail.com>
@jeschkies (Contributor) left a comment:


Could you document the option to disable reports? I think we should be transparent on this.

// sendReport sends the report to the stats server
func sendReport(ctx context.Context, seed *ClusterSeed, interval time.Time) error {
	report := buildReport(seed, interval)
	out, err := jsoniter.MarshalIndent(report, "", " ")
Contributor:

I thought it was going to be Prometheus metrics. What's the reason for a custom API and store?

@cyriltovena (Contributor Author):

It's very hard to read a Prometheus metric, and I needed richer stats like counters, min/max, and strings!
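To make that point concrete, the sketch below shows the kind of aggregated value the reply describes: count, min, max and a running average carried together in one statistic, which a single Prometheus counter cannot do. It is a hypothetical illustration, not the actual pkg/usagestats code.

package usagestats

import (
	"math"
	"sync"
)

// Statistics aggregates observations into count/min/max/avg (names assumed).
type Statistics struct {
	mtx   sync.Mutex
	Count int64   `json:"count"`
	Min   float64 `json:"min"`
	Max   float64 `json:"max"`
	Avg   float64 `json:"avg"`
}

func (s *Statistics) Record(v float64) {
	s.mtx.Lock()
	defer s.mtx.Unlock()
	if s.Count == 0 {
		s.Min, s.Max = math.Inf(1), math.Inf(-1)
	}
	s.Count++
	s.Min = math.Min(s.Min, v)
	s.Max = math.Max(s.Max, v)
	s.Avg += (v - s.Avg) / float64(s.Count) // incremental mean
}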

Signed-off-by: Cyril Tovena <cyril.tovena@gmail.com>
Signed-off-by: Cyril Tovena <cyril.tovena@gmail.com>
@dannykopping (Contributor) left a comment:


This is pretty goddamn awesome @cyriltovena!
I'd love to add some usage stats around recording/alerting rules, but we can do this later.

Resolved review threads: pkg/storage/store.go (outdated), pkg/usagestats/stats.go, pkg/usagestats/reporter.go (outdated)
cyriltovena and others added 4 commits February 10, 2022 10:36
Co-authored-by: Danny Kopping <dannykopping@gmail.com>
Signed-off-by: Cyril Tovena <cyril.tovena@gmail.com>
Signed-off-by: Cyril Tovena <cyril.tovena@gmail.com>
@cyriltovena (Contributor Author):

The new dskit brought some linter issues with it.

Signed-off-by: Cyril Tovena <cyril.tovena@gmail.com>
@kavirajk (Contributor) left a comment:


Looks super cool 🎉

Co-authored-by: Danny Kopping <dannykopping@gmail.com>
@cyriltovena (Contributor Author):

I'll follow up with documentation on what we collect.

@dannykopping merged commit bbaef79 into grafana:main on Feb 10, 2022
dannykopping added a commit that referenced this pull request Feb 10, 2022
* Adds leader election process

Signed-off-by: Cyril Tovena <cyril.tovena@gmail.com>

* fluke

Signed-off-by: Cyril Tovena <cyril.tovena@gmail.com>

* fixes the kv typecheck

* wire up the http client

* Hooking into loki services, hit a bug

* Add stats variable.

* re-vendor dskit and improve to never fail service

* Intrument Loki with the package

* Add changelog entry

Signed-off-by: Cyril Tovena <cyril.tovena@gmail.com>

* Fixes compactor test

Signed-off-by: Cyril Tovena <cyril.tovena@gmail.com>

* Add configuration documentation

Signed-off-by: Cyril Tovena <cyril.tovena@gmail.com>

* Update pkg/usagestats/reporter.go

Co-authored-by: Danny Kopping <dannykopping@gmail.com>

* Add boundary check

Signed-off-by: Cyril Tovena <cyril.tovena@gmail.com>

* Add log for success report.

Signed-off-by: Cyril Tovena <cyril.tovena@gmail.com>

* lint

Signed-off-by: Cyril Tovena <cyril.tovena@gmail.com>

* Update pkg/usagestats/reporter.go

Co-authored-by: Danny Kopping <dannykopping@gmail.com>

Co-authored-by: Danny Kopping <dannykopping@gmail.com>
Successfully merging this pull request may close these issues.

Add usage reporting capability for Loki to (optionally) send usage stats to Grafana Labs