
Ingester: limit max chunks per flush #5849

Closed

Conversation

@afayngelerindbx (Contributor) commented Apr 10, 2022

What this PR does / why we need it:
It seems possible for Loki ingesters to build up streams that have too many unflushed chunks to flush successfully within a single flush op timeout. This diff caps the number of chunks that a single flush will attempt to commit to storage. Ideally, this lets ingesters reclaim memory incrementally during or after storage degradation.
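Roughly, the idea looks like the sketch below. The names (`chunkDesc`, `maxChunksPerFlush`, `collectChunksToFlush`) and the cap value are illustrative only, not the actual Loki identifiers or this PR's exact code:

```go
package ingester

// chunkDesc is an illustrative stand-in for the ingester's per-chunk bookkeeping.
type chunkDesc struct {
	flushed bool
	// ... chunk payload omitted
}

// maxChunksPerFlush is an illustrative cap, not a real Loki flag or default.
const maxChunksPerFlush = 100

// collectChunksToFlush returns at most maxChunksPerFlush unflushed chunks,
// leaving the remainder for later flush-loop iterations so that a single
// flush op stays within its timeout.
func collectChunksToFlush(chunks []*chunkDesc) []*chunkDesc {
	out := make([]*chunkDesc, 0, maxChunksPerFlush)
	for _, c := range chunks {
		if c.flushed {
			continue
		}
		out = append(out, c)
		if len(out) == maxChunksPerFlush {
			break
		}
	}
	return out
}
```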

Which issue(s) this PR fixes:
This may address some of the reports in #5267

Special notes for your reviewer:

Checklist

  • Tests updated
  • Metrics Added
  • Documentation added
  • Add an entry in the CHANGELOG.md about the changes.

@afayngelerindbx (Contributor, Author) commented

@slim-bean This is a very rough draft of what I was talking about in #5267

I'm fairly new to Loki, so I'd appreciate directional feedback here, e.g. whether I'm breaking some assumption, getting in the way of a planned enhancement, or misunderstanding something.

@slim-bean (Collaborator) commented

Thanks @afayngelerindbx! Let me regroup with the team a little. This is helpful in narrowing down where to look, but I'm also now wondering whether this relates to #5194: we should not only limit the number of chunks per flush, but also look at a way to make sure multiple flush workers aren't trying to flush the same chunks.
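As a rough illustration of that second idea, each flush worker could claim chunks under the stream's lock before flushing them, so chunks already in flight are skipped by other workers. The sketch below uses hypothetical names (`stream`, `chunkDesc`, `claimChunksForFlush`) and is not Loki's actual code:

```go
package ingester

import "sync"

// chunkDesc and stream are illustrative stand-ins, not Loki's actual types.
type chunkDesc struct {
	flushing bool
	flushed  bool
}

type stream struct {
	mtx    sync.Mutex
	chunks []*chunkDesc
}

// claimChunksForFlush marks up to max unflushed chunks as "flushing" under
// the stream's lock, so a chunk claimed by one flush worker is skipped by
// the others.
func (s *stream) claimChunksForFlush(max int) []*chunkDesc {
	s.mtx.Lock()
	defer s.mtx.Unlock()

	claimed := make([]*chunkDesc, 0, max)
	for _, c := range s.chunks {
		if c.flushed || c.flushing {
			continue
		}
		c.flushing = true
		claimed = append(claimed, c)
		if len(claimed) == max {
			break
		}
	}
	return claimed
}
```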

@afayngelerindbx (Contributor, Author) commented Apr 12, 2022

@slim-bean yeah, #5194 could definitely play a part here. It's what I was ranting about here: https://grafana.slack.com/archives/CTJ4EN6MU/p1649357960760409

I have gotten more evidence that the issue (at least my issue) is sequential flushes. I added

                "-store.chunks-cache.cache-stubs=true",
                "-store.chunks-cache.cache.enable-fifocache=true",

to the ingester config and noticed that the cache hit rate goes up dramatically before the OOM.

[Screenshot: left graph shows the number of cache hits per second; right graph shows used RAM in bytes]

Zooming in on the RHS graph, the cache hit rate does hit zero periodically, which seems to indicate that these cache hits come from individual flush loop iterations (i.e. not #5194).

[Screenshot: zoomed-in view showing the cache hit rate periodically dropping to zero]

Please let me know, once y'all decide, whether this PR is worth pursuing further. I'd love to get it merged and tested in our deployment.

@afayngelerindbx (Contributor, Author) commented

Closing in favor of #5894

@afayngelerindbx afayngelerindbx deleted the max-chunk-per-flush branch April 13, 2022 16:03