
Ingester: limit max chunks per flush #5849

Closed

Conversation

@afayngelerindbx (Contributor) commented Apr 10, 2022

What this PR does / why we need it:
It seems possible for Loki ingesters to build up streams that have too many unflushed chunks to flush successfully within a single flush op timeout. This diff caps the number of chunks that a single flush will attempt to commit to storage. Ideally, this lets ingesters reclaim memory incrementally during or after storage degradation.
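Roughly, the idea looks like the sketch below. The names (`chunkDesc`, `maxChunksPerFlush`, `collectChunksToFlush`) and the cap value are illustrative only, not the actual Loki identifiers or this PR's exact code:

```go
package ingester

// chunkDesc is an illustrative stand-in for the ingester's per-chunk bookkeeping.
type chunkDesc struct {
	flushed bool
	// ... chunk payload omitted
}

// maxChunksPerFlush is an illustrative cap, not a real Loki flag or default.
const maxChunksPerFlush = 100

// collectChunksToFlush returns at most maxChunksPerFlush unflushed chunks,
// leaving the remainder for later flush-loop iterations so that a single
// flush op stays within its timeout.
func collectChunksToFlush(chunks []*chunkDesc) []*chunkDesc {
	out := make([]*chunkDesc, 0, maxChunksPerFlush)
	for _, c := range chunks {
		if c.flushed {
			continue
		}
		out = append(out, c)
		if len(out) == maxChunksPerFlush {
			break
		}
	}
	return out
}
```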

Which issue(s) this PR fixes:
This may address some of the reports in #5267

Special notes for your reviewer:

Checklist

  • Tests updated
  • Metrics Added
  • Documentation added
  • Add an entry in the CHANGELOG.md about the changes.

@afayngelerindbx (Contributor, Author) commented

@slim-bean This is a very rough draft of what I was talking about in #5267

I'm fairly new to Loki, so I'd appreciate directional feedback here, e.g. whether I'm breaking some assumption, getting in the way of a planned enhancement, or misunderstanding something.

@slim-bean (Collaborator) commented

Thanks @afayngelerindbx! Let me regroup with the team a little. This is helpful in narrowing down where to look, but I'm also now wondering whether this relates to #5194: we should not only limit the number of chunks per flush, but also look at a way to make sure multiple flush workers aren't trying to flush the same chunks.
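As a rough illustration of that second idea, each flush worker could claim chunks under the stream's lock before flushing them, so chunks already in flight are skipped by other workers. The sketch below uses hypothetical names (`stream`, `chunkDesc`, `claimChunksForFlush`) and is not Loki's actual code:

```go
package ingester

import "sync"

// chunkDesc and stream are illustrative stand-ins, not Loki's actual types.
type chunkDesc struct {
	flushing bool
	flushed  bool
}

type stream struct {
	mtx    sync.Mutex
	chunks []*chunkDesc
}

// claimChunksForFlush marks up to max unflushed chunks as "flushing" under
// the stream's lock, so a chunk claimed by one flush worker is skipped by
// the others.
func (s *stream) claimChunksForFlush(max int) []*chunkDesc {
	s.mtx.Lock()
	defer s.mtx.Unlock()

	claimed := make([]*chunkDesc, 0, max)
	for _, c := range s.chunks {
		if c.flushed || c.flushing {
			continue
		}
		c.flushing = true
		claimed = append(claimed, c)
		if len(claimed) == max {
			break
		}
	}
	return claimed
}
```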

@afayngelerindbx (Contributor, Author) commented Apr 12, 2022

@slim-bean yeah, #5194 could definitely play a part here. It's what I was ranting about here: https://grafana.slack.com/archives/CTJ4EN6MU/p1649357960760409

I have gotten more evidence that the issue (at least my issue) is sequential flushes. I added

                "-store.chunks-cache.cache-stubs=true",
                "-store.chunks-cache.cache.enable-fifocache=true",

to the ingester config and noticed that the cache hit rate goes up dramatically before the OOM.

[Screenshot: left graph shows the number of cache hits per second; right graph shows used RAM in bytes]

Zooming in on the RHS graph, the cache hit rate does hit zero periodically, which seems to indicate that these cache hits come from individual flush loop iterations (i.e. not #5194).

[Screenshot: zoomed-in view showing the cache hit rate periodically dropping to zero]

Please let me know, once y'all decide, whether this PR is worth pursuing further. I'd love to get it merged and tested in our deployment.

@afayngelerindbx (Contributor, Author) commented

Closing in favor of #5894

@afayngelerindbx afayngelerindbx deleted the max-chunk-per-flush branch April 13, 2022 16:03