Ingester OOM #5267
Comments
Note: I even looked at issue #4713, but it doesn't seem to be the same context. I applied the changes that were reported there, but they were not effective, including disabling the FIFO cache and changing index_cache_validity.
Not being able to flush chunks can definitely lead to memory issues and hitting limits. Is there any more context around those failure logs? Could you check with debug logs? I'm not necessarily expecting to see something, but I am trying to trace down what timeout is expiring to throw that.

Are your histograms for GCS showing the 0.99 percentile? If so, it looks to me like something has a 5s timeout. Also, the fact that you don't see timeouts or errors in the status code makes me think this is a timeout Loki is putting on the request, cancelling it before it succeeds. It's also surprising that uploading a 1MB file takes more than 5s; are you running in Google Cloud, or are you pushing to Google Cloud from somewhere else?

I wonder if having 64 concurrent_flushes threads is making things worse too. I believe we discovered fairly recently that the GCS SDK uses HTTP/2 (sorry, working from memory, this may not be completely accurate), which means things get multiplexed over a single connection, so trying to send 64 1MB chunks at the same time means most of them get backed up and would hit whatever timeout is being hit. This may seem counterintuitive, but it may actually work better to lower this value; maybe try it at an extreme like 1 or 2 to see if it helps. I'll look around and try to find whatever timeout is being hit here too.
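For reference, the flush concurrency being discussed is an ingester setting; a minimal sketch of lowering it (the value 2 is only illustrative, taken from the suggestion above, not a general recommendation):

```yaml
ingester:
  # Number of chunks flushed to object storage in parallel. Lowering this can
  # help when uploads are multiplexed over a single HTTP/2 connection to GCS.
  concurrent_flushes: 2
```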
Hello @slim-bean, thanks for your comments and suggestions. For now I don't have more details about the event, but I will enable debug mode and come back with the information. Yes, the histogram refers to the 0.99 percentile. This screenshot was taken at one of the moments with the issue, where you can see that the duration fluctuates. Check the graphs in Grafana: I am currently running Loki in my on-premises datacenter and sending the data to a bucket in GCP. About the concurrent_flushes comment: this was a modification I made to try to mitigate/solve the problem.
@slim-bean, @kavirajk The 5XX errors seen in the chart below refer to the moment the ingester is killed due to the OOM problem.

Go Runtime Memory Profiling

Profiling with WAL

Once the OOM problem occurs in the instance, WAL replay is triggered, and this is the behavior of pprof during its execution. Notice that the checkpoints present in the WAL appear to be locked and cannot be released, due to the errors present in the message:
Thanks @marcusteixeira. I have been observing similar issues with 2.4.x. Eventually it gets to a point where there aren't enough healthy instances in the ring and queries fail with
In the logs I see:
If it helps, I haven't seen this issue in 2.3.0, and this is the primary reason I am shying away from the 2.4.x upgrade.
For a better troubleshooting experience in this case, I am attaching the pprof file with a flamegraph example. OOM issue flamegraph: https://flamegraph.com/share/f1682ecd-8a77-11ec-9068-d2b21641d001 PS: issue debug template #5360
Hey @marcusteixeira, thanks so much for the detailed info here, and sorry I didn't get back to this sooner. Really appreciate you trying out various things. I think we need to figure out what context is getting canceled; I'm not sure there is much you can do to find this other than maybe enabling debug logs? How repeatable is this for you?
Hi @marcusteixeira, I ran into the same or a very similar issue, where some of my ingesters were stuck trying to flush user data, preventing them from joining the ring. I increased the

@slim-bean FYI
We were talking today about logging what the WAL is doing while it replays. I'm suspicious that when there is a really big WAL to replay it can take a long time. There are metrics about WAL activity, but none that give progress, I don't believe.
I was running the ingesters with debug logging as well, which showed me these messages. Those numbers gave me the hint that I needed to increase the timeout even more; I had already tried to set it to 1m, which was not enough, but 10m was OK in my case.
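For anyone else trying the same workaround, the timeout in question appears to be the ingester's flush_op_timeout (mentioned explicitly later in this thread); a minimal sketch of raising it, with the 10m value being what worked in this report rather than a general recommendation:

```yaml
ingester:
  # Maximum time a single flush operation may run before its context is
  # cancelled; raising it can mask slow uploads to object storage.
  flush_op_timeout: 10m
```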
@slim-bean FYI
I activated it for a moment, but I didn't find any evidence to help me. I will enable it again and follow along.
I see this behavior whenever a certain throughput level is reached.
@rsteneteg When I was debugging the issue and looking at the tracing and profiling information, it seemed possible that the issue occurs when snappy chunk_encoding is used. I changed the encoding to gzip and it seems to have worked as a workaround. I noticed that this problem started to occur at ingest rates greater than 50 MB/s, something like 90-100 MB/s.
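For reference, the encoding change described above corresponds to the ingester's chunk_encoding option; a minimal sketch of the workaround (reverting to snappy just means changing this one line back):

```yaml
ingester:
  # Compression used for chunks before they are flushed. gzip compresses
  # better (less upload bandwidth) but is more CPU-expensive at query time
  # than snappy.
  chunk_encoding: gzip
```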
We use snappy everywhere; gzip compresses more than snappy but is much more CPU-expensive at query time, so we use snappy for the improved performance. How many write nodes are you running, @marcusteixeira, and is that 50 MB/s per node or over multiple nodes?
The 50 MB/s value is the sum over all distributors/ingesters. This is segmented into 6 write nodes, giving an average of 15 MB/s for each distributor/ingester.
Are you using a replication factor > 1? If so, this would multiply the amount of data each ingester processes by the replication_factor (for example, 50 MB/s of incoming data with replication_factor=3 means roughly 150 MB/s written across the ingesters). My gut feeling here is that you are sending just enough data to each ingester that it can't upload it to the remote storage fast enough, which results in increasing memory usage. I'm not sure what the bottleneck is though; is your upload connection to the cloud storage sufficiently large? If changing to gzip allowed it to work for you, I feel like you must be right on the edge of the amount of bandwidth that can be uploaded, and gzip having a better compression ratio is helping out here.
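For anyone checking this on their own cluster, in a 2.4-era config the replication factor sits on the ingester ring; a sketch of where to look (the value 3 is illustrative, not taken from this thread):

```yaml
ingester:
  lifecycler:
    ring:
      # Each stream is written to this many ingesters, so per-ingester write
      # load is roughly total ingest rate * replication_factor / write nodes.
      replication_factor: 3
```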
I hit this issue in Loki 2.3.0 today. One of the ingesters was terminated due to OOM and then failed to come back to the ready state with
Redis seems fine. The only way to get this ingester to the ready state is to delete the PVC and restart the pod.
@slim-bean it seems this specific ingester was using 10x more memory than the rest of them. We have replication > 1 and are using snappy. I am also not sure about the relationship between deleting the PVC and the ingester getting to the ready state. Does that mean I am losing logs?
I've run into a similar issue running a stress test on Loki. After digging around in the flusher code, I have some thoughts:
Hi everyone, I changed flush_op_timeout to 10m, changed snappy to gzip, and also tried removing the PVC and restarting the ingester. My config: `auth_enabled: false
I also found a few logs where we can see that the push API is taking more than 10m.
I think almost every instance in this ticket relates to having insufficient network bandwidth for flushing chunks to storage, or sending too much data into too few ingesters.

@afayngelerindbx has done some really helpful analysis; the code as written today doesn't seem to recover very gracefully once a high-volume stream gets backed up far enough. Appreciate this investigation @afayngelerindbx! But my best guess here as to how folks are getting into this situation is too much data into too few ingesters, or insufficient network bandwidth between your ingesters and your object storage. We tend to favor scaling more horizontally vs. vertically with our clusters. If I were to give a rough guideline, I would say 10MB/s per ingester is fairly typical for us, measured with the metric

Curious what folks here who are seeing issues are sending for MB/s into their ingesters? Also be aware of what the bandwidth on the outgoing side looks like with
This sounds like a very reasonable change; we'd love to see a PR so we can take a closer look!
@slim-bean here is what I see per second in the last minute for container_network_receive_bytes_total and container_network_transmit_bytes_total:
Hi @slim-bean, I am running the whole Loki system on GCP. container_network_receive_bytes_total:
X-posting here in case it helps folks: I have gotten extra evidence that the issue (at least my issue) is sequential flushes. I added
to the ingester config. This enables a 1GB in-memory cache that dedupes chunk storage writes. I noticed that the cache hit rates go up dramatically before the OOM (left is the number of cache hits per second, right is used RAM in bytes). Zooming in on the RHS graph, the cache hit rate does hit zero periodically, which seems to indicate that these cache hits are from individual flush loop iterations.
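The exact snippet is missing above; as a hypothetical sketch only, a 1GB in-memory chunk cache in a 2.4-era config could look roughly like this (option names from the cache_config section; the block the author actually used may differ):

```yaml
chunk_store_config:
  chunk_cache_config:
    # In-process FIFO cache written to on every flush; per the report above,
    # hits here can dedupe repeated chunk writes to object storage.
    enable_fifocache: true
    fifocache:
      max_size_bytes: 1GB
```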
OK, I have an update here. @DylanGuedes and I took a look at this and he is going to work on a change.

Summary: the current code doesn't take credit for chunks which flushed successfully during a flush operation when that operation fails for any reason, which means all chunks in a stream will be flushed again when the flush op is retried. Because a flush operation writes to the chunk cache, this would also explain the sudden massive increase of writes to the cache. Increasing the flush_op_timeout can reduce the likelihood of this happening, but it's more of a crude workaround than a fix. What we are thinking currently is to change the
Also, thanks so much @afayngelerindbx for digging into this; your sleuthing has been a big help, and all the graphs and due diligence from everyone else have been super helpful too. Sorry progress on this has been slow, but I really appreciate all the input from everyone. I feel like we are heading in the right direction now.
Hi, any updates?
stale
Could you share your configuration file here and your environment details? I initially reported this issue, but I am no longer affected by OOM cases.
In version 2.7.0 it no longer occurs for me.
in
How can I troubleshoot the problem? My configuration file:
What parameters can fix the ingester OOM?
I'm facing the same issue with ingesters; how did you resolve it?
2.9.6 still OOMs.
Hi, we also had a lot of problems with Loki and the ingester. We spent a lot of time optimizing Loki, but in the end the OOM problem occurred every day. Maybe an update will help to fix your problem. In our case, it was solved.
It is sad that the loki-distributed helm chart contains version 2.9.8.
We used the following chart, and it is up to date: https://github.com/grafana/loki/tree/main/production/helm/loki
Can confirm: switching off the loki-distributed helm chart and onto the standard loki chart linked above, which allowed us to upgrade to Loki 3.1, helped us too. We're no longer seeing ingester OOMs. Thanks for the info @dschunack!
@CharlieC3 Good to hear that.
Describe the bug
I'm running Loki version 2.4.1 with the SSD (simple scalable deployment) model. I've noticed that ingesters are being hit with OOM events, which kill nodes in the ring until most of them are in an unhealthy state. After that, WAL replay is triggered, but the node gets stuck in this process, and as a result it does not return to the ring.
It seems that the problem starts to occur when the flush queue stays greater than 0 over a period of time. At the same time, the logs record the message:
failed to flush user
To Reproduce
Steps to reproduce the behavior:
Expected behavior
The expectation was that Loki would be able to process all the information without causing OOM events in which the WAL gets stuck and the nodes do not return to the ring.
Environment:
My environment is currently deployed with 6 nodes running target=write, each with 8 vCPU / 16 GB RAM.
Screenshots, Promtail config, or terminal output
Metrics
Loki Resource Usage
Loki Chunks
Loki WAL
Go Runtime
Logs
Post "https://storage.googleapis.com/upload/storage/v1/b/loki-example-logs/o?alt=json&name=fake%2F488038805ca808ed%3A17e82e70caf%3A17e81e752a%3A38fd6636&prettyPrint=false&projection=full&uploadType=multipart": context deadline exceeded
level=error ts=2022-01-28T16:08:38.985959868Z caller=flush.go:221 org_id=fake msg="failed to flush user" err="Post \"https://storage.googleapis.com/upload/storage/v1/b/loki-example-logs/o?alt=json&name=fake%2F488038805ca808ed%3A17e82e70caf%3A17e81e752a%3A38fd6636&prettyPrint=false&projection=full&uploadType=multipart\": context deadline exceeded
Config