
Ingester OOM #5267

Open

marcusteixeira opened this issue Jan 28, 2022 · 43 comments
Labels
component/ingester need-investigation type/bug Something is not working as expected

Comments

@marcusteixeira
Contributor

marcusteixeira commented Jan 28, 2022

Describe the bug
I'm running Loki 2.4.1 with the simple scalable deployment (SSD) model. I've noticed that ingesters are being hit by OOM events, which kill all the nodes in the ring until most of them are in an unhealthy state. After that, WAL replay is triggered, but the node gets stuck in this process and therefore never returns to the ring.

It seems the problem starts when the flush queue stays greater than 0 over a period of time; at the same time the logs record the message: failed to flush user.

To Reproduce
Steps to reproduce the behavior:

  1. Started Loki (SHA or version): 2.4.1

Expected behavior
The expectation was that Loki would be able to process all the data without OOM events, rather than the WAL getting stuck and the nodes never returning to the ring.

Environment:

  • Infrastructure: bare-metal with virtual machines
  • Deployment tool: Ansible

My environment is currently deployed with 6 nodes with target=write, each with 8 vCPUs / 16 GB RAM.

Screenshots, Promtail config, or terminal output

Metrics

Loki Resource Usage
loki-resources

Loki Chunks
loki-chunks
loki-streams

Loki WAL
loki-wal

Go Runtime
go-resources
go-resources-pt2

Logs

Post "https://storage.googleapis.com/upload/storage/v1/b/loki-example-logs/o?alt=json&name=fake%2F488038805ca808ed%3A17e82e70caf%3A17e81e752a%3A38fd6636&prettyPrint=false&projection=full&uploadType=multipart": context deadline exceeded

level=error ts=2022-01-28T16:08:38.985959868Z caller=flush.go:221 org_id=fake msg="failed to flush user" err="Post \"https://storage.googleapis.com/upload/storage/v1/b/loki-example-logs/o?alt=json&name=fake%2F488038805ca808ed%3A17e82e70caf%3A17e81e752a%3A38fd6636&prettyPrint=false&projection=full&uploadType=multipart\": context deadline exceeded

Config

#
# Ansible managed
#

target: write

auth_enabled: false

server:
  grpc_listen_port: 9096
  grpc_server_max_concurrent_streams: 2000
  grpc_server_max_recv_msg_size: 209715200
  grpc_server_max_send_msg_size: 209715200
  http_listen_port: 3100
  http_server_read_timeout: 300s
  http_server_write_timeout: 300s
  log_level: info


common:
  ring:
    heartbeat_period: 5s
    heartbeat_timeout: 1m
    instance_addr: 10.220.16.131
    instance_interface_names:
    - ens192
    kvstore:
      store: memberlist
    zone_awareness_enabled: false
  storage:
    gcs:
      bucket_name: loki-logs-example
      request_timeout: 300s


querier:
  max_concurrent: 4
  query_ingesters_within: 2h
  query_timeout: 60s


query_range:
  align_queries_with_step: true
  cache_results: false
  max_retries: 5
  parallelise_shardable_queries: true
  results_cache:
    cache:
      enable_fifocache: false
  split_queries_by_interval: 30m

query_scheduler:
  grpc_client_config:
    grpc_compression: snappy
    max_recv_msg_size: 209715200
    max_send_msg_size: 209715200
  max_outstanding_requests_per_tenant: 1000


ingester:
  autoforget_unhealthy: true
  chunk_block_size: 262144
  chunk_encoding: snappy
  chunk_idle_period: 15m
  chunk_retain_period: 30s
  chunk_target_size: 1572864
  concurrent_flushes: 64
  flush_check_period: 15s
  flush_op_timeout: 10m
  lifecycler:
    address: ip-address
    final_sleep: 0s
    ring:
      kvstore:
        store: memberlist
      replication_factor: 2
  max_chunk_age: 6h
  max_transfer_retries: 0
  wal:
    dir: /var/lib/loki/wal
    enabled: true
    flush_on_shutdown: true
    replay_memory_ceiling: 4GB

ingester_client:
  grpc_client_config:
    grpc_compression: snappy
    max_recv_msg_size: 209715200
    max_send_msg_size: 209715200
  pool_config:
    health_check_ingesters: false
  remote_timeout: 60s


distributor:
  ring:
    kvstore:
      store: memberlist

memberlist:
  abort_if_cluster_join_fails: false
  bind_port: 7946
  dead_node_reclaim_time: 30s
  gossip_to_dead_nodes_time: 15s
  join_members:
  - [members]
  max_join_retries: 1000
  rejoin_interval: 30s

schema_config:
  configs:
  - chunks:
      period: 24h
      prefix: loki_chunks_
    from: 2021-12-01
    index:
      period: 24h
      prefix: loki_index_
    object_store: gcs
    row_shards: 16
    schema: v11
    store: boltdb-shipper

storage_config:
  boltdb_shipper:
    active_index_directory: /var/lib/loki/data/boltdb-shipper-active
    cache_location: /var/lib/loki/data/boltdb-shipper-cache
    cache_ttl: 168h
    shared_store: gcs
  filesystem:
    directory: /var/lib/loki/data/chunks
  index_cache_validity: 5m
  index_queries_cache_config:
    enable_fifocache: false
  max_chunk_batch_size: 500

compactor:
  shared_store: filesystem
  working_directory: /var/lib/loki/data/boltdb-shipper-compactor

limits_config:
  ingestion_burst_size_mb: 200
  ingestion_rate_mb: 300
  ingestion_rate_strategy: local
  max_query_parallelism: 8
  max_query_series: 5000
  per_stream_rate_limit: 500MB
  per_stream_rate_limit_burst: 1000MB
  reject_old_samples: true
  reject_old_samples_max_age: 168h

chunk_store_config:
  chunk_cache_config:
    enable_fifocache: false
  max_look_back_period: 0s

table_manager:
  retention_deletes_enabled: false
  retention_period: 0s

ruler:
  enable_api: true
  ring:
    kvstore:
      store: memberlist
  rule_path: /var/lib/loki/data/rules-temp
  storage:
    local:
      directory: /var/lib/loki/data/rules
    type: local

frontend:
  address: ip-address
  compress_responses: true
  grpc_client_config:
    grpc_compression: snappy
    max_recv_msg_size: 209715200
    max_send_msg_size: 209715200
  instance_interface_names:
  - ens192
  log_queries_longer_than: 5s
  max_outstanding_per_tenant: 2048
@marcusteixeira
Contributor Author

Note: I even looked at issue #4713, but it doesn't seem to be the same context. I applied the changes that were reported there, including disabling the FIFO cache and changing index_cache_validity, but they were not effective.

@slim-bean
Collaborator

Not being able to flush chunks can definitely lead to memory issues and hitting limits.

Is there any more context around those failure logs? Could you check with debug logs?

I'm not necessarily expecting to see something but am trying to trace down what timeout is expiring to throw that context deadline exceeded error.

Are your histograms for GCS showing 0.99? If so that looks to me like something has a 5s timeout.

Also the fact you don't see timeouts or errors in the status code makes me think this is a timeout Loki is putting on the request and cancelling it before it succeeds.

It's also surprising that uploading a 1MB file takes more than 5s, are you running in google cloud or are you pushing to google cloud from somewhere else?

I wonder if having 64 concurrent_flush threads is making things worse too, I believe we discovered fairly recently that the GCS SDK uses http2 (sorry working from memory, this may not be completely accurate), but that means that things get multiplexed over a single connection so trying to send 64 1mb chunks at the same time means most of them get backed up and would hit whatever timeout is being hit. This may seem counterintuitive but it may actually work better to lower this value, maybe try it at an extreme like 1 or 2 to see if it helps.

I'll look around and try to find whatever timeout is being hit here too.
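A minimal sketch of the experiment suggested above, expressed against the ingester block from the config earlier in this issue; the low value is only something to try, not an official recommendation:

# Experiment suggested above: fewer concurrent flush workers, so uploads are not
# all multiplexed over a single HTTP/2 connection to GCS at once. Values are
# illustrative only; the other keys are taken from the reporter's config.
ingester:
  concurrent_flushes: 2      # try an extreme low value such as 1 or 2
  flush_check_period: 15s
  chunk_target_size: 1572864 # ~1.5 MB target chunk size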

@marcusteixeira
Contributor Author

> (quoting @slim-bean's full comment above)

Hello @slim-bean, thanks for your comments and suggestions.

For now I don't have more details about the event, but I will enable debug mode and come back with the information.

Yes, the histogram refers to the 0.99 percentile. This screenshot was taken at one of the moments the issue was occurring, and you can see that the duration fluctuates. Grafana graphs:

loki-gcs-histogram

I am currently running Loki in my on-premises datacenter and sending the data to a bucket in GCP.

Regarding the concurrent_flushes comment: that was a modification I made to try to mitigate/solve the problem.
I'll apply the recommendations you made and get back to you with the details.
PS: the value of concurrent_flushes was increased from 16 to 32 and then to 64.

@marcusteixeira
Contributor Author

@slim-bean, @kavirajk
Here are more details of what I've been investigating about the Loki ingester.
I changed concurrent_flushes to 4 (the default value is 16), but I didn't notice any real improvement in the identified problem. All the evidence above is related to the moment we had the memory leak problem.

The 5XX errors seen in the chart below refer to the moment the ingester is killed due to the OOM problem.
loki-ingester
loki-ingestion-chunks

Go Runtime Memory

go-memory-pt1
go-memory-pt2

Profiling

pprof-real
go tool pprof http://localhost:3100/debug/pprof/heap

Profiling with WAL

loki-wal

Once the OOM problem occurs on the instance, WAL replay is triggered, and this is the behavior of pprof during its execution. It looks like the checkpoints present in the WAL are locked and cannot be released, due to the errors present in the message: failed to flush user.

loki-flush-msg

@gmavadiy

gmavadiy commented Feb 2, 2022

Thanks @marcusteixeira. I have been observing similar issues with 2.4.x. Eventually it gets to a point where there aren't enough healthy instances in the ring and queries fail with

Query error
too many unhealthy instances in the ring 

In the logs I see:

level=error ts=2021-12-20T19:46:11.981180089Z caller=flush.go:221 org_id=fake msg="failed to flush user" err="RequestCanceled: request context canceled\ncaused by: context deadline exceeded"
level=error ts=2021-12-20T19:46:11.981210316Z caller=flush.go:221 org_id=fake msg="failed to flush user" err="RequestCanceled: request context canceled\ncaused by: context deadline exceeded"

level=warn ts=2021-12-20T19:50:50.016717692Z caller=lifecycler.go:239 msg="found an existing instance(s) with a problem in the ring, this instance cannot become ready until this problem is resolved. The /ring http endpoint on the distributor (or single binary) provides visibility into the ring." ring=ingester err="instance 10.x.y.z:9095 past heartbeat timeout"

If it helps, I haven't seen this issue in 2.3.0, and this is the primary reason I am shying away from the 2.4.x upgrade.

@marcusteixeira
Contributor Author

For a better troubleshooting experience in this case, I am attaching the pprof file along with a flamegraph example.

OOM Issue flamegraph: https://flamegraph.com/share/f1682ecd-8a77-11ec-9068-d2b21641d001

OOM-Issue-flamegraph

PS: Issue debug template #5360

@slim-bean
Collaborator

Hey @marcusteixeira, thanks so much for the detailed info here; sorry I didn't get back to this sooner. Really appreciate you trying out various things.

I think we need to figure out what context is getting canceled; I'm not sure there is much you can do to find this other than maybe enabling debug logs?

How repeatable is this for you?

@rsteneteg
Contributor

Hi @marcusteixeira, I ran into the same or a very similar issue, where some of my ingesters were stuck trying to flush user data, preventing them from joining the ring.

I increased the flush_op_timeout ingester config from the default 10s to 10m, and that seems to have fixed it. It took a while to flush the data, but eventually it completed without hitting those timeouts while flushing.

@slim-bean FYI
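For anyone hitting the same symptom, a minimal sketch of the change described above; whether 10m is appropriate depends on how much data your ingesters hold and how fast they can upload it:

# Raise the per-flush timeout so a large backlog can complete instead of being
# cancelled with "context deadline exceeded". 10s is the default mentioned above.
ingester:
  flush_op_timeout: 10m   # 1m was reportedly still not enough in this case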

@slim-bean
Collaborator

We were talking today about logging what the WAL is doing while it replays. I'm suspicious that when there is a really big WAL to replay it can take a long time. There are metrics about WAL activity, but none that report progress, I believe.

@rsteneteg
Contributor

I was running the ingesters with debug logging as well, which gave me these messages:
level=debug ts=2022-02-24T23:00:04.590640241Z caller=replay_controller.go:85 msg="replay flusher pre-flush" bytes="13 GB"
level=debug ts=2022-02-24T23:03:09.412919372Z caller=replay_controller.go:85 msg="replay flusher pre-flush" bytes="15 GB"

Those numbers gave me the hint that I needed to increase the timeout even further. I had already tried setting it to 1m, which was not enough, but 10m was OK in my case.

@marcusteixeira
Contributor Author

marcusteixeira commented Feb 25, 2022

@slim-bean FYI

> I think we need to figure out what context is getting canceled, I'm not sure there is much you can do to find this other than maybe enable debug logs?

I activated it for a moment, but I didn't find any evidence to help me. I will enable it again and follow along.

> How repeatable is this for you?

I see this behavior whenever a certain throughput is reached, for example >50 MB/s across the distributors/ingesters. After that, memory consumption starts to increase until the node is killed.

@marcusteixeira
Contributor Author

@rsteneteg
Thanks for the flush_op_timeout hint. In my environment I had already noticed this detail, and it is currently set to 10m. Still, the problem remains :/

When I was debugging the issue and looking at the tracing and profiling information, I saw the possibility that the issue occurs when snappy chunk_encoding is used.

I modified the encoding to gzip and that seems to have worked as a workaround. I noticed that this problem started to occur with ingestion rates greater than 50 MB/s, in the region of 90-100 MB/s.
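For completeness, the one-line change being described, sketched against the ingester block posted earlier in this issue:

# Workaround reported above: gzip compresses better than snappy, so less data
# has to be uploaded, at the cost of more CPU at query time (see the next comment).
ingester:
  chunk_encoding: gzip   # was: snappy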

@slim-bean
Collaborator

We use snappy everywhere; gzip compresses more than snappy but is much more CPU expensive at query time, so we use snappy for the improved performance.

How many write nodes are you running @marcusteixeira, and is that 50 MB/s per node or across multiple nodes?

@marcusteixeira
Contributor Author

> We use snappy everywhere; gzip compresses more than snappy but is much more CPU expensive at query time, so we use snappy for the improved performance.
>
> How many write nodes are you running @marcusteixeira, and is that 50 MB/s per node or across multiple nodes?

@slim-bean

The 50 MB/s value is the sum over all distributors/ingesters. This is spread across 6 write nodes, giving an average of about 15 MB/s per distributor/ingester:
sum(rate(loki_distributor_bytes_received_total[1m]))
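A hedged sketch of a Prometheus alerting rule built around the same query, assuming the distributors are scraped by Prometheus; the ~50 MB/s threshold is only the level at which the problem reportedly starts in this environment, not a general limit:

# Illustrative Prometheus rule file (not part of the Loki configuration).
groups:
  - name: loki-ingestion
    rules:
      - alert: LokiIngestionRateHigh
        # 52428800 bytes/s = 50 MiB/s, the level at which OOMs reportedly began here.
        expr: sum(rate(loki_distributor_bytes_received_total[1m])) > 52428800
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "Total Loki ingestion rate above ~50 MB/s"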

@slim-bean
Collaborator

Are you using a replication factor > 1?

If so, this multiplies the amount of data each ingester processes by the replication_factor.

My gut feeling here is that you are sending just enough data to each ingester that it can't upload it to the remote storage fast enough, which results in growing memory usage. I'm not sure what the bottleneck is, though; is your upload connection to the cloud storage sufficiently large?

If changing to gzip allowed it to work for you, I feel like you must be right on the edge of the amount of bandwidth that can be uploaded; gzip's better compression ratio is helping out here.
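To make the arithmetic concrete, a rough sketch using the numbers reported in this issue (6 write nodes, replication_factor 2, ~50 MB/s total ingest); this is back-of-the-envelope reasoning, not a formula from the Loki docs:

# Effective per-ingester write volume ≈ total rate * replication_factor / write nodes
#   50 MB/s * 2 / 6 ≈ 17 MB/s per ingester,
# all of which must also be flushed to object storage fast enough, or memory grows.
ingester:
  lifecycler:
    ring:
      replication_factor: 2   # each stream is written to two ingesters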

@gmavadiy

I hit this issue in Loki 2.3.0 today. One of the ingesters was terminated due to OOM, and it then fails to come back to a ready state with:

level=error ts=2022-03-17T17:30:18.120715181Z caller=flush.go:220 org_id=fake msg="failed to flush user" err="RequestCanceled: request context canceled\ncaused by: context deadline exceeded"
level=error ts=2022-03-17T17:30:28.122254393Z caller=flush.go:220 org_id=fake msg="failed to flush user" err="RequestCanceled: request context canceled\ncaused by: context deadline exceeded"
level=error ts=2022-03-17T17:30:37.494175278Z caller=redis_cache.go:72 msg="failed to get from redis" name=chunksredis err="context deadline exceeded"
level=error ts=2022-03-17T17:30:38.130330306Z caller=redis_cache.go:72 msg="failed to get from redis" name=chunksredis err="context deadline exceeded"

Redis seems fine. The only way to get this ingester to a ready state is to delete the PVC and restart the pod.

@gmavadiy

gmavadiy commented Mar 17, 2022

@slim-bean it seems this specific ingester was using 10x more memory than the rest of them. We have replication > 1 and are using snappy. Also, I am not sure how deleting the PVC relates to getting it back to a ready state. Does that mean I am losing logs?

@afayngelerindbx
Contributor

afayngelerindbx commented Apr 8, 2022

I've run into a similar issue running a stress test on Loki. After digging around in the flusher code, I have some thoughts:

  1. It seems that flush_op_timeout limits the time it takes to flush all flushable chunks within a stream (code ref). This means that if you hit flush_op_timeout for a high-throughput stream, you're fairly likely never to be able to flush it, since the number of chunks only keeps growing. This seems like it would cause unbounded memory growth and eventually an OOM kill. It interacts with the WAL as well: if a stream grows large enough to be "unflushable" within flush_op_timeout, it likely won't be flushable during WAL recovery either.

  2. I was hesitant to set a timeout higher than the flush loop ticker period (-ingester.flush-check-period). I am worried that a longer flush_op_timeout could cause a single stream to be flushed multiple times concurrently, implying that we would hold in memory both a large stream and the wire-serialized chunks prepared for flushing. This is additionally worrisome because if enough of these piled up we would start running into storage throttling, exacerbating the problem.

  3. Looking at the code, it would be fairly simple to limit the number of chunks per flush to some configurable threshold. This, I believe, would make it less likely that a stream becomes "unflushable". If the Loki maintainers would entertain such a change, I'd be willing to cut a patch.

@Harshrai3112

Harshrai3112 commented Apr 8, 2022

Hi everyone,
I am still not able to solve the issue.

I changed flush_op_timeout to 10m, changed snappy to gzip, and also tried removing the PVC and restarting the ingester.

My config

auth_enabled: false

server:
  log_level: debug
  http_listen_port: 3100
  grpc_server_max_recv_msg_size: 46137344
  grpc_server_max_send_msg_size: 46137344

distributor:
  ring:
    kvstore:
      store: memberlist

memberlist:
  join_members:
    - {{ include "loki.fullname" . }}-memberlist

ingester:
  lifecycler:
    ring:
      kvstore:
        store: memberlist
      replication_factor: 3
  chunk_idle_period:  45m
  flush_op_timeout: 10m
  max_chunk_age: 1h
  concurrent_flushes: 32
  flush_check_period: 15s
  chunk_block_size: 262144 
  chunk_target_size: 1572864
  chunk_encoding: gzip
  chunk_retain_period: 1m
  wal:
    enabled: true
    dir: /var/loki-wal/wal
    replay_memory_ceiling: 4GB
    checkpoint_duration: 3m

limits_config:
  ingestion_rate_mb: 30
  ingestion_burst_size_mb: 40
  enforce_metric_name: false
  reject_old_samples: true
  reject_old_samples_max_age: 168h
  max_cache_freshness_per_query: 10m
  max_query_parallelism: 88
  unordered_writes: true
  retention_period: 744h
  retention_stream:
  - selector: '{cluster="prod-monitoring"}'
    priority: 1
    period: 168h
  - selector: '{cluster="release-app-cluster"}'
    priority: 2
    period: 168h
  - selector: '{cluster="external-vms"}'
    priority: 3
    period: 168h
  max_global_streams_per_user: 10000
  max_entries_limit_per_query: 8000

schema_config:
  configs:
    - from: 2020-09-07
      store: boltdb-shipper
      object_store: gcs
      schema: v11
      index:
        prefix: loki_index_
        period: 24h

storage_config:
  index_queries_cache_config:
    enable_fifocache: false
    redis:
        endpoint: ***
        expiration : 1h
        password: ***
  boltdb_shipper: 
    shared_store: gcs
    active_index_directory: /var/loki/index
    cache_location: /var/loki/cache
    cache_ttl: 24h
  gcs:
    bucket_name: monitoring-test-bucket-1

chunk_store_config:
  max_look_back_period: 0s
  chunk_cache_config:
    redis:
        endpoint: ***
        expiration : 1h
        password: ***
  write_dedupe_cache_config:
    redis:
        endpoint: ***
        expiration : 1h
        password: ***

table_manager:
  retention_deletes_enabled: false
  retention_period: 0s

query_range:
  align_queries_with_step: true
  max_retries: 5
  split_queries_by_interval: 1h
  parallelise_shardable_queries: true
  cache_results: true
  results_cache:
    cache:
      redis:
        endpoint: ***
        expiration : 1h
        password: ***
  

frontend_worker:
  frontend_address: {{ include "loki.queryFrontendFullname" . }}:9095
  parallelism: 100
  grpc_client_config:
    max_send_msg_size: 46137344 
    max_recv_msg_size: 46137344

frontend:
  log_queries_longer_than: 15s
  compress_responses: true
  max_outstanding_per_tenant: 2048
  tail_proxy_url: http://{{ include "loki.querierFullname" . }}:3100

ingester_client:
  grpc_client_config:
   max_send_msg_size: 46137344
   max_recv_msg_size: 46137344

querier:
  query_ingesters_within: 1h

compactor:
  working_directory: /var/loki/compactor
  shared_store: gcs
  compaction_interval: 10m
  retention_enabled: true
  retention_delete_delay: 2h

ruler:
  storage:
    type: local
    local:
      directory: /etc/loki/rules
  ring:
    kvstore:
      store: memberlist
  rule_path: /tmp/loki/scratch
  alertmanager_url: https://alertmanager.xx
  external_url: https://alertmanager.xx

Screenshot 2022-04-08 at 10 29 03 PM

Error Logs of Ingester
Screenshot 2022-04-08 at 10 54 27 PM

I also found a few logs showing that the push API is taking more than 10m:

Screenshot 2022-04-08 at 10 57 03 PM

@slim-bean
Collaborator

I think almost every instance in this ticket relates to having insufficient network bandwidth for flushing chunks to storage, or sending too much data into too few ingesters.

@afayngelerindbx has done some really helpful analysis; the code as written today doesn't seem to recover very gracefully when a high-volume stream gets backed up far enough. Appreciate this investigation @afayngelerindbx!

But my best guess as to how folks are getting into this situation is too much data into too few ingesters, or insufficient network bandwidth between the ingesters and the object storage. We tend to favor scaling more horizontally than vertically with our clusters. If I were to give a rough guideline, I would say 10 MB/s per ingester is fairly typical for us, measured with the metric container_network_receive_bytes_total.

Curious what folks here who are seeing issues are sending in MB/s into their ingesters? Also be aware of what the bandwidth on the outgoing side looks like with container_network_transmit_bytes_total, especially if you are trying to run your compute at any distance from your storage (i.e. if you are running your compute on-prem and trying to use a cloud object store, you're probably going to struggle).
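A sketch of how that guideline could be watched, assuming cAdvisor metrics and Kubernetes pods; the pod name regex is an assumption about your environment and will need adjusting:

# Illustrative Prometheus rules around the per-ingester bandwidth guideline above.
groups:
  - name: loki-ingester-bandwidth
    rules:
      - record: loki:ingester_network_receive_bytes:rate1m
        expr: 'sum by (pod) (rate(container_network_receive_bytes_total{pod=~".*ingester.*"}[1m]))'
      - alert: LokiIngesterReceiveAboveGuideline
        # ~10 MB/s per ingester is the rough figure given in this comment.
        expr: loki:ingester_network_receive_bytes:rate1m > 10485760
        for: 15m
        labels:
          severity: info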

@slim-bean
Collaborator

@afayngelerindbx

> Looking at the code, it would be fairly simple to limit the number of chunks per flush to some configurable threshold. This, I believe, would make it less likely that a stream becomes "unflushable". If the Loki maintainers would entertain such a change, I'd be willing to cut a patch.

This sounds like a very reasonable change, we'd love to see a PR to take a closer look!

@gmavadiy

gmavadiy commented Apr 8, 2022

@slim-bean here is what I see per second over the last minute:

container_network_receive_bytes_total
Of the 6 ingester pods, a couple of them are always high, fluctuating between 30-300 MB/s; the rest are below 8 MB/s.

container_network_transmit_bytes_total
Of the 6 ingester pods, all of them are below 10 MB/s.

@Harshrai3112

Hi @slim-bean,

I am running the whole Loki system on GCP.
I have actually been running Loki in production for 3-4 months, but since last week this OOM issue suddenly started happening.
Before the OOM issue I was running 3 ingesters with 30 GB of memory each; since then I have been running 5 ingesters with 40 GB of memory.

container_network_receive_bytes_total
Screenshot 2022-04-09 at 9 38 08 AM

container_network_transmit_bytes_total
Screenshot 2022-04-09 at 9 39 06 AM

Lines Received
Screenshot 2022-04-09 at 9 44 59 AM

Bytes Received
Screenshot 2022-04-09 at 9 45 21 AM

@afayngelerindbx
Contributor

X-posting here in case it helps folks:

I have gotten extra evidence that the issue (at least my issue) is sequential flushes. I added

                "-store.chunks-cache.cache-stubs=true",
                "-store.chunks-cache.cache.enable-fifocache=true",

to the ingester config. This enables a 1 GB in-memory cache that dedupes chunk storage writes. I noticed that the cache hit rate goes up dramatically before OOM.

image

(left is number of cache hits per second, right is used RAM in bytes)

Zooming in on the right-hand graph, the cache hit rate does hit zero periodically, which seems to indicate that these cache hits come from individual flush loop iterations.

image
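For anyone reproducing this observation, the fifocache flag above corresponds roughly to the YAML fragment below (the same key appears in configs earlier in this thread); the cache-stubs option is only shown here as a CLI flag, so it is left out of the sketch:

# Sketch: in-memory FIFO chunk cache used above to observe duplicate chunk writes.
chunk_store_config:
  chunk_cache_config:
    enable_fifocache: true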

@slim-bean
Collaborator

OK, I have an update here: @DylanGuedes and I took a look at this and he is going to work on a change.

Summary: the current code doesn't take credit for chunks which flushed successfully during a flush operation when that operation fails for any reason, which means all chunks in a stream will be flushed again when the flush op is retried.

Because a flush operation writes to the chunk cache this would also explain the sudden massive increase of writes to the cache.

Increasing the flush_op_timeout can reduce the likelihood of this happening, but it's more of a crude workaround than a fix.

What we are thinking currently is to change the flushChunks method to essentially operate one chunk at a time (serialize to wirechunk, write to store, mark as flushed), this way we make sure to take credit for any chunks that are successfully flushed and not flush them again. It should also reduce memory pressure a little as it will limit the memory used for serialized chunks to one at a time.

@slim-bean
Collaborator

Also, thanks so much @afayngelerindbx for digging into this; your sleuthing has been a big help, and all the graphs and due diligence by everyone else have been super helpful too. Sorry progress on this has been slow, but I really appreciate all the input. I feel like we are heading in the right direction now.

@Harshrai3112

Harshrai3112 commented May 12, 2022

Hi guys,
This issue is solved for me by changing the disk type used by the wal-data PVC.

Previously I had the pd-standard disk type in GCP and saw that I/O utilization was reaching 100%.
Screenshot 2022-05-12 at 4 53 20 PM

After updating the disk to pd-ssd it started working fine.

Screenshot 2022-05-12 at 4 54 29 PM
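For reference, a minimal sketch of a StorageClass that provisions pd-ssd volumes for the WAL PVC, assuming the in-tree GCE persistent disk provisioner; the name is hypothetical and the provisioner may differ on newer GKE clusters:

# Hypothetical StorageClass for an SSD-backed WAL volume, per the fix above.
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: loki-wal-ssd           # assumed name; reference it from the WAL PVC
provisioner: kubernetes.io/gce-pd
parameters:
  type: pd-ssd                 # instead of pd-standard
allowVolumeExpansion: true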

@LinTechSo
Contributor

Hi, any updates?

@eserden

eserden commented Oct 13, 2022

stale

@marcusteixeira
Contributor Author

> Hi, any updates?

Could you share your configuration file and your environment details here? I initially reported this issue, but I am no longer affected by OOM cases.

@ranryl

ranryl commented Dec 13, 2022

> Could you share your configuration file and your environment details here? I initially reported this issue, but I am no longer affected by OOM cases.

auth_enabled: true
    chunk_store_config:
      max_look_back_period: 0s
    compactor:
      shared_store: s3
      shared_store_key_prefix: index/
      working_directory: /var/loki/boltdb-shipper-compactor
    distributor:
      ring:
        kvstore:
          store: memberlist
    frontend:
      compress_responses: true
      downstream_url: http://loki-loki-distributed-querier.loki.svc.cluster.local:3100
      log_queries_longer_than: 40s
      tail_proxy_url: http://loki-loki-distributed-querier.loki.svc.cluster.local:3100
    ingester:
      chunk_block_size: 262144
      chunk_encoding: gzip
      chunk_retain_period: 10m
      chunk_target_size: 0
      flush_check_period: 2m
      lifecycler:
        ring:
          kvstore:
            store: memberlist
          replication_factor: 3
      max_chunk_age: 30m
      max_returned_stream_errors: 10
      max_transfer_retries: 0
      query_store_max_look_back_period: 0
      sync_min_utilization: 0
      sync_period: 0
      wal:
        dir: /var/loki/wal
    limits_config:
      ingestion_burst_size_mb: 180
      ingestion_rate_mb: 120
      max_cache_freshness_per_query: 10m
      max_global_streams_per_user: 100000
      per_stream_rate_limit: 30MB
      per_stream_rate_limit_burst: 40MB
      split_queries_by_interval: 30m
    memberlist:
      abort_if_cluster_join_fails: false
      bind_port: 7946
      join_members:
      - loki-loki-distributed-memberlist.loki.svc.cluster.local:7946
      max_join_backoff: 1m
      max_join_retries: 10
      min_join_backoff: 1s
    querier:
      max_concurrent: 20
      query_timeout: 2m
    query_range:
      align_queries_with_step: true
      cache_results: true
      max_retries: 3
      results_cache:
        cache:
          redis:
            endpoint: xxx
            expiration: 6h
            timeout: 4000ms
    schema_config:
      configs:
      - chunks:
          period: 24h
          prefix: prefix_chunks
        from: "2022-10-01"
        index:
          period: 24h
          prefix: loki_index_
        object_store: s3
        schema: v11
        store: boltdb-shipper
    server:
      http_listen_port: 3100
    storage_config:
      aws:
        access_key_id: xxx
        bucketnames: xxx
        endpoint: xxx
        region: xxx
        s3forcepathstyle: true
        secret_access_key: xxx
      boltdb_shipper:
        active_index_directory: /var/loki/index
        cache_location: /var/loki/cache
        cache_ttl: 24h
        shared_store: aws
      index_cache_validity: 5m
      index_queries_cache_config:
        redis:
          endpoint: xxx
          expiration: 6h
          timeout: 4000ms
    tracing:
      enabled: true

@johnnytardin

In version 2.7.0 this no longer occurs for me.

@ranryl

ranryl commented Feb 6, 2023

goroutine profile: total 479556
348621 @ 0x43ccd6 0x44dc7e 0x44dc55 0x46a285 0x1ba3293 0x1ba3269 0x1ba63cb 0x1ba1c0f 0x199234f 0x1ba1846 0x1ba9150 0xba72da 0x1ba90a5 0x133fea9 0x1340f5f 0x133f4ad 0x1379899 0x1378ac5 0x1378805 0x13786c9 0x46e481
#	0x46a284	sync.runtime_SemacquireMutex+0x24											/usr/local/go/src/runtime/sema.go:77
#	0x1ba3292	sync.(*RWMutex).RLock+0x72												/usr/local/go/src/sync/rwmutex.go:71
#	0x1ba3268	github.com/grafana/loki/pkg/storage/stores/shipper/index.(*Table).ForEach+0x48						/src/loki/pkg/storage/stores/shipper/index/table.go:176
#	0x1ba63ca	github.com/grafana/loki/pkg/storage/stores/shipper/index.(*TableManager).ForEach+0x4a					/src/loki/pkg/storage/stores/shipper/index/table_manager.go:109
#	0x1ba1c0e	github.com/grafana/loki/pkg/storage/stores/shipper/index.(*querier).QueryPages.func1+0x1ae				/src/loki/pkg/storage/stores/shipper/index/querier.go:49
#	0x199234e	github.com/grafana/loki/pkg/storage/stores/shipper/util.DoParallelQueries+0x16e						/src/loki/pkg/storage/stores/shipper/util/queries.go:39
#	0x1ba1845	github.com/grafana/loki/pkg/storage/stores/shipper/index.(*querier).QueryPages+0x305					/src/loki/pkg/storage/stores/shipper/index/querier.go:46
#	0x1ba914f	github.com/grafana/loki/pkg/storage/stores/shipper.(*indexClient).QueryPages.func1+0x4f					/src/loki/pkg/storage/stores/shipper/shipper_index_client.go:165
#	0xba72d9	github.com/weaveworks/common/instrument.CollectedRequest+0x279								/src/loki/vendor/github.com/weaveworks/common/instrument/instrument.go:167
#	0x1ba90a4	github.com/grafana/loki/pkg/storage/stores/shipper.(*indexClient).QueryPages+0x124					/src/loki/pkg/storage/stores/shipper/shipper_index_client.go:164
#	0x133fea8	github.com/grafana/loki/pkg/storage/stores/series/index.(*cachingIndexClient).queryPages+0x968				/src/loki/pkg/storage/stores/series/index/caching_index_client.go:176
#	0x1340f5e	github.com/grafana/loki/pkg/storage/stores/series/index.(*cachingIndexClient).doBroadQueries+0x7e			/src/loki/pkg/storage/stores/series/index/caching_index_client.go:234
#	0x133f4ac	github.com/grafana/loki/pkg/storage/stores/series/index.(*cachingIndexClient).QueryPages+0x8c				/src/loki/pkg/storage/stores/series/index/caching_index_client.go:103
#	0x1379898	github.com/grafana/loki/pkg/storage/stores/series.(*indexReaderWriter).lookupEntriesByQueries+0x178			/src/loki/pkg/storage/stores/series/series_index_store.go:568
#	0x1378ac4	github.com/grafana/loki/pkg/storage/stores/series.(*indexReaderWriter).lookupIdsByMetricNameMatcher+0x1c4		/src/loki/pkg/storage/stores/series/series_index_store.go:490
#	0x1378804	github.com/grafana/loki/pkg/storage/stores/series.(*indexReaderWriter).lookupSeriesByMetricNameMatcher+0x84		/src/loki/pkg/storage/stores/series/series_index_store.go:464
#	0x13786c8	github.com/grafana/loki/pkg/storage/stores/series.(*indexReaderWriter).lookupSeriesByMetricNameMatchers.func1+0x88	/src/loki/pkg/storage/stores/series/series_index_store.go:409

130485 @ 0x43ccd6 0x44cbbc 0x1377e29 0x13737e5 0x11e15cd 0xba72da 0x11e149a 0x136b6f2 0x136a4c8 0x136af86 0x136a37e 0x1c07fb0 0xf1ec78 0x1c72924 0xb1469a 0xbcc905 0x1d46ee7 0xb1469a 0xbcce22 0xb1469a 0xb17b42 0xb1469a 0xbce01b 0xb1469a 0xb1453e 0xf1eb38 0xacc54f 0xad0b8f 0xaca0b8 0x46e481
#	0x1377e28	github.com/grafana/loki/pkg/storage/stores/series.(*indexReaderWriter).lookupSeriesByMetricNameMatchers+0x728	/src/loki/pkg/storage/stores/series/series_index_store.go:426
#	0x13737e4	github.com/grafana/loki/pkg/storage/stores/series.(*indexReaderWriter).GetChunkRefs+0x4e4			/src/loki/pkg/storage/stores/series/series_index_store.go:160
#	0x11e15cc	github.com/grafana/loki/pkg/storage/stores/index.monitoredReaderWriter.GetChunkRefs.func1+0x6c			/src/loki/pkg/storage/stores/index/index.go:54
#	0xba72d9	github.com/weaveworks/common/instrument.CollectedRequest+0x279							/src/loki/vendor/github.com/weaveworks/common/instrument/instrument.go:167
#	0x11e1499	github.com/grafana/loki/pkg/storage/stores/index.monitoredReaderWriter.GetChunkRefs+0x1d9			/src/loki/pkg/storage/stores/index/index.go:52
#	0x136b6f1	github.com/grafana/loki/pkg/storage/stores.(*storeEntry).GetChunkRefs+0x631					/src/loki/pkg/storage/stores/composite_store_entry.go:67
#	0x136a4c7	github.com/grafana/loki/pkg/storage/stores.compositeStore.GetChunkRefs.func1+0xa7				/src/loki/pkg/storage/stores/composite_store.go:149
#	0x136af85	github.com/grafana/loki/pkg/storage/stores.compositeStore.forStores+0x265					/src/loki/pkg/storage/stores/composite_store.go:241
#	0x136a37d	github.com/grafana/loki/pkg/storage/stores.compositeStore.GetChunkRefs+0xfd					/src/loki/pkg/storage/stores/composite_store.go:148
#	0x1c07faf	github.com/grafana/loki/pkg/ingester.(*Ingester).GetChunkIDs+0x18f						/src/loki/pkg/ingester/ingester.go:800
#	0xf1ec77	github.com/grafana/loki/pkg/logproto._Querier_GetChunkIDs_Handler.func1+0x77					/src/loki/pkg/logproto/logproto.pb.go:5095
#	0x1c72923	github.com/grpc-ecosystem/go-grpc-middleware/recovery.UnaryServerInterceptor.func1+0xc3				/src/loki/vendor/github.com/grpc-ecosystem/go-grpc-middleware/recovery/interceptors.go:33
#	0xb14699	github.com/grpc-ecosystem/go-grpc-middleware.ChainUnaryServer.func1.1.1+0x39					/src/loki/vendor/github.com/grpc-ecosystem/go-grpc-middleware/chain.go:25
#	0xbcc904	github.com/weaveworks/common/middleware.ServerUserHeaderInterceptor+0x64					/src/loki/vendor/github.com/weaveworks/common/middleware/grpc_auth.go:38
#	0x1d46ee6	github.com/grafana/loki/pkg/util/fakeauth.SetupAuthMiddleware.func1+0xa6					/src/loki/pkg/util/fakeauth/fake_auth.go:27
#	0xb14699	github.com/grpc-ecosystem/go-grpc-middleware.ChainUnaryServer.func1.1.1+0x39					/src/loki/vendor/github.com/grpc-ecosystem/go-grpc-middleware/chain.go:25
#	0xbcce21	github.com/weaveworks/common/middleware.UnaryServerInstrumentInterceptor.func1+0xa1				/src/loki/vendor/github.com/weaveworks/common/middleware/grpc_instrumentation.go:35
#	0xb14699	github.com/grpc-ecosystem/go-grpc-middleware.ChainUnaryServer.func1.1.1+0x39					/src/loki/vendor/github.com/grpc-ecosystem/go-grpc-middleware/chain.go:25
#	0xb17b41	github.com/opentracing-contrib/go-grpc.OpenTracingServerInterceptor.func1+0x401					/src/loki/vendor/github.com/opentracing-contrib/go-grpc/server.go:57
#	0xb14699	github.com/grpc-ecosystem/go-grpc-middleware.ChainUnaryServer.func1.1.1+0x39					/src/loki/vendor/github.com/grpc-ecosystem/go-grpc-middleware/chain.go:25
#	0xbce01a	github.com/weaveworks/common/middleware.GRPCServerLog.UnaryServerInterceptor+0xba				/src/loki/vendor/github.com/weaveworks/common/middleware/grpc_logging.go:30
#	0xb14699	github.com/grpc-ecosystem/go-grpc-middleware.ChainUnaryServer.func1.1.1+0x39					/src/loki/vendor/github.com/grpc-ecosystem/go-grpc-middleware/chain.go:25
#	0xb1453d	github.com/grpc-ecosystem/go-grpc-middleware.ChainUnaryServer.func1+0xbd					/src/loki/vendor/github.com/grpc-ecosystem/go-grpc-middleware/chain.go:34
#	0xf1eb37	github.com/grafana/loki/pkg/logproto._Querier_GetChunkIDs_Handler+0x137						/src/loki/pkg/logproto/logproto.pb.go:5097
#	0xacc54e	google.golang.org/grpc.(*Server).processUnaryRPC+0xcce								/src/loki/vendor/google.golang.org/grpc/server.go:1282
#	0xad0b8e	google.golang.org/grpc.(*Server).handleStream+0xa2e								/src/loki/vendor/google.golang.org/grpc/server.go:1619
#	0xaca0b7	google.golang.org/grpc.(*Server).serveStreams.func1.2+0x97							/src/loki/vendor/google.golang.org/grpc/server.go:921

@chaudum chaudum added type/bug Something is not working as expected and removed kind/bug labels Jun 14, 2023
@stringang

stringang commented Jul 13, 2023

In loki-distributed v2.8.2:

  1. When the ingester instance starts, WAL replay takes a long time and the liveness probe on /ready returns 503, causing the pod to be killed by Kubernetes.
level=debug ts=2023-07-13T09:40:15.567663647Z caller=replay_controller.go:85 msg="replay flusher pre-flush" bytes="3.9 GB"
level=info ts=2023-07-13T18:49:54.185275394Z caller=ingester.go:432 msg="recovered WAL checkpoint recovery finished" elapsed=1h25m27.618187613s errors=false
  2. After the ingester has been running for a while, /ready returns 503 again, triggering the failureThreshold (a probe-tuning sketch follows the config below).
level=debug ts=2023-07-13T21:54:42.009853071Z caller=grpc_logging.go:46 method=/logproto.Pusher/Push duration=133.948µs msg="gRPC (success)"
level=debug ts=2023-07-13T21:54:41.970250053Z caller=grpc_logging.go:46 method=/logproto.Pusher/Push duration=97.306µs msg="gRPC (success)"
level=debug ts=2023-07-13T21:54:41.959086516Z caller=grpc_logging.go:46 method=/logproto.Pusher/Push duration=345.667µs msg="gRPC (success)"
level=debug ts=2023-07-13T21:54:41.883472557Z caller=grpc_logging.go:46 method=/logproto.Pusher/Push duration=390.464µs msg="gRPC (success)"
level=debug ts=2023-07-13T21:54:41.857637139Z caller=grpc_logging.go:46 method=/logproto.Pusher/Push duration=143.269µs msg="gRPC (success)"
level=debug ts=2023-07-13T21:54:41.850804813Z caller=grpc_logging.go:46 method=/logproto.Pusher/Push duration=2.114682ms msg="gRPC (success)"
level=debug ts=2023-07-13T21:54:41.834328932Z caller=grpc_logging.go:46 method=/logproto.Pusher/Push duration=154.931µs msg="gRPC (success)"
level=debug ts=2023-07-13T21:54:41.82012353Z caller=grpc_logging.go:46 method=/logproto.Pusher/Push duration=207.537µs msg="gRPC (success)"
level=debug ts=2023-07-13T21:54:41.813212013Z caller=grpc_logging.go:46 method=/logproto.Pusher/Push duration=92.601µs msg="gRPC (success)"
level=debug ts=2023-07-13T21:54:41.792460412Z caller=grpc_logging.go:46 method=/logproto.Pusher/Push duration=108.007µs msg="gRPC (success)"
level=debug ts=2023-07-13T21:54:41.791365533Z caller=grpc_logging.go:46 method=/logproto.Pusher/Push duration=1.758775ms msg="gRPC (success)"
level=debug ts=2023-07-13T21:54:41.785694761Z caller=grpc_logging.go:46 method=/logproto.Pusher/Push duration=128.378µs msg="gRPC (success)"
level=debug ts=2023-07-13T21:54:41.779550321Z caller=grpc_logging.go:46 method=/logproto.Pusher/Push duration=115.792µs msg="gRPC (success)"
level=debug ts=2023-07-13T21:54:41.744681031Z caller=grpc_logging.go:46 method=/logproto.Pusher/Push duration=134.317µs msg="gRPC (success)"
level=debug ts=2023-07-13T21:54:41.741187529Z caller=grpc_logging.go:46 method=/logproto.Pusher/Push duration=184.242µs msg="gRPC (success)"
level=debug ts=2023-07-13T21:54:41.700433508Z caller=logging.go:76 traceID=3692480357453467 orgID=fake msg="GET /ready (503) 226.89µs"
level=debug ts=2023-07-13T21:54:41.700315187Z caller=logging.go:76 traceID=013752d39cf5f756 orgID=fake msg="GET /ready (503) 168.439µs"
level=debug ts=2023-07-13T21:54:41.698996636Z caller=grpc_logging.go:46 method=/logproto.Pusher/Push duration=233.53µs msg="gRPC (success)"
level=debug ts=2023-07-13T21:54:41.630833375Z caller=grpc_logging.go:46 method=/logproto.Pusher/Push duration=505.952µs msg="gRPC (success)"
level=debug ts=2023-07-13T21:54:41.575828089Z caller=grpc_logging.go:46 method=/logproto.Pusher/Push duration=79.789µs msg="gRPC (success)"
level=debug ts=2023-07-13T21:54:41.548931278Z caller=grpc_logging.go:46 method=/logproto.Pusher/Push duration=162.128µs msg="gRPC (success)"
level=debug ts=2023-07-13T21:54:41.547607013Z caller=grpc_logging.go:46 method=/logproto.Pusher/Push duration=119.453µs msg="gRPC (success)"
level=debug ts=2023-07-13T21:54:41.544827245Z caller=grpc_logging.go:46 duration=634.215µs method=/logproto.Pusher/Push msg="gRPC (success)"
level=debug ts=2023-07-13T21:54:41.529306468Z caller=grpc_logging.go:46 method=/logproto.Pusher/Push duration=489.491µs msg="gRPC (success)"
level=debug ts=2023-07-13T21:54:41.523354118Z caller=grpc_logging.go:46 method=/logproto.Pusher/Push duration=2.375194ms msg="gRPC (success)"
level=debug ts=2023-07-13T21:54:41.512421167Z caller=grpc_logging.go:46 method=/logproto.Pusher/Push duration=135.41µs msg="gRPC (success)"
level=debug ts=2023-07-13T21:54:41.428774945Z caller=grpc_logging.go:46 method=/logproto.Pusher/Push duration=152.067µs msg="gRPC (success)"
level=debug ts=2023-07-13T21:54:41.420698054Z caller=grpc_logging.go:46 method=/logproto.Pusher/Push duration=240.11µs msg="gRPC (success)"
level=debug ts=2023-07-13T21:54:41.42015423Z caller=grpc_logging.go:46 method=/logproto.Pusher/Push duration=108.545µs msg="gRPC (success)"
level=debug ts=2023-07-13T21:54:41.41724774Z caller=grpc_logging.go:46 duration=561.023µs method=/logproto.Pusher/Push msg="gRPC (success)"
level=debug ts=2023-07-13T21:54:41.393577466Z caller=grpc_logging.go:46 method=/logproto.Pusher/Push duration=6.551817ms msg="gRPC (success)"
level=debug ts=2023-07-13T21:54:41.373680409Z caller=grpc_logging.go:46 duration=132.107µs method=/logproto.Pusher/Push msg="gRPC (success)"
level=debug ts=2023-07-13T21:54:41.353435658Z caller=grpc_logging.go:46 method=/logproto.Pusher/Push duration=137.495µs msg="gRPC (success)"
level=debug ts=2023-07-13T21:54:41.337244543Z caller=grpc_logging.go:46 method=/logproto.Pusher/Push duration=113.371µs msg="gRPC (success)"
level=debug ts=2023-07-13T21:54:41.304051499Z caller=grpc_logging.go:46 method=/logproto.Pusher/Push duration=159.196µs msg="gRPC (success)"
level=debug ts=2023-07-13T21:54:41.281351806Z caller=grpc_logging.go:46 method=/logproto.Pusher/Push duration=95.614µs msg="gRPC (success)"
level=debug ts=2023-07-13T21:54:41.237438257Z caller=grpc_logging.go:46 method=/logproto.Pusher/Push duration=368.574µs msg="gRPC (success)"
level=debug ts=2023-07-13T21:54:41.202303788Z caller=grpc_logging.go:46 method=/logproto.Pusher/Push duration=146.243µs msg="gRPC (success)"
level=debug ts=2023-07-13T21:54:41.202013848Z caller=grpc_logging.go:46 method=/logproto.Pusher/Push duration=400.918µs msg="gRPC (success)"
level=debug ts=2023-07-13T21:54:41.124335318Z caller=grpc_logging.go:46 method=/logproto.Pusher/Push duration=83.731µs msg="gRPC (success)"
level=debug ts=2023-07-13T21:54:41.120591359Z caller=grpc_logging.go:46 duration=2.802114ms method=/logproto.Pusher/Push msg="gRPC (success)"
level=debug ts=2023-07-13T21:54:41.112198628Z caller=grpc_logging.go:46 method=/logproto.Pusher/Push duration=375.067µs msg="gRPC (success)"
level=debug ts=2023-07-13T21:54:41.102364787Z caller=grpc_logging.go:46 method=/logproto.Pusher/Push duration=649.065µs msg="gRPC (success)"
level=debug ts=2023-07-13T21:54:41.086500042Z caller=grpc_logging.go:46 method=/logproto.Pusher/Push duration=353.689µs msg="gRPC (success)"
level=debug ts=2023-07-13T21:54:40.961800295Z caller=grpc_logging.go:46 method=/logproto.Pusher/Push duration=289.264µs msg="gRPC (success)"
level=debug ts=2023-07-13T21:54:40.898557168Z caller=grpc_logging.go:46 method=/logproto.Pusher/Push duration=123.691µs msg="gRPC (success)"
level=debug ts=2023-07-13T21:54:40.893143334Z caller=grpc_logging.go:46 method=/logproto.Pusher/Push duration=553.4µs msg="gRPC (success)"
level=debug ts=2023-07-13T21:54:40.868387579Z caller=grpc_logging.go:46 method=/logproto.Pusher/Push duration=214.881µs msg="gRPC (success)"
level=debug ts=2023-07-13T21:54:40.830839462Z caller=grpc_logging.go:46 method=/logproto.Pusher/Push duration=21.149108ms msg="gRPC (success)"
level=debug ts=2023-07-13T21:54:40.830719571Z caller=grpc_logging.go:46 method=/logproto.Pusher/Push duration=49.80596ms msg="gRPC (success)"
level=debug ts=2023-07-13T21:54:40.830659191Z caller=grpc_logging.go:46 method=/logproto.Pusher/Push duration=51.772022ms msg="gRPC (success)"
level=debug ts=2023-07-13T21:54:40.735416312Z caller=grpc_logging.go:46 method=/logproto.Pusher/Push duration=233.864µs msg="gRPC (success)"
level=debug ts=2023-07-13T21:54:40.707437724Z caller=grpc_logging.go:46 method=/logproto.Pusher/Push duration=68.091µs msg="gRPC (success)"
level=debug ts=2023-07-13T21:54:40.691527022Z caller=grpc_logging.go:46 method=/logproto.Pusher/Push duration=233.004µs msg="gRPC (success)"
level=debug ts=2023-07-13T21:54:40.689179748Z caller=grpc_logging.go:46 method=/logproto.Pusher/Push duration=234.66µs msg="gRPC (success)"
level=debug ts=2023-07-13T21:54:40.677146084Z caller=grpc_logging.go:46 method=/logproto.Pusher/Push duration=378.731µs msg="gRPC (success)"
level=debug ts=2023-07-13T21:54:40.656513077Z caller=grpc_logging.go:46 method=/grpc.health.v1.Health/Check duration=101.372µs msg="gRPC (success)"
level=debug ts=2023-07-13T21:54:40.613227574Z caller=grpc_logging.go:46 method=/logproto.Pusher/Push duration=122.61µs msg="gRPC (success)"
level=debug ts=2023-07-13T21:54:40.58582904Z caller=grpc_logging.go:46 method=/logproto.Pusher/Push duration=199.931µs msg="gRPC (success)"
level=debug ts=2023-07-13T21:54:40.583073436Z caller=grpc_logging.go:46 method=/logproto.Pusher/Push duration=336.158µs msg="gRPC (success)"

How can I troubleshoot this problem?

My configuration file:

auth_enabled: false
chunk_store_config:
  max_look_back_period: 0s
common:
  compactor_address: http://loki-distributed-compactor:3100
compactor:
  shared_store: s3
  working_directory: /var/loki/compactor
distributor:
  ring:
    kvstore:
      store: memberlist
frontend:
  compress_responses: true
  log_queries_longer_than: 5s
  tail_proxy_url: http://loki-distributed-querier:3100
frontend_worker:
  frontend_address: loki-distributed-query-frontend-headless:9095
ingester:
  chunk_block_size: 262144
  chunk_encoding: snappy
  chunk_idle_period: 1h
  chunk_retain_period: 1m
  chunk_target_size: 8388608
  lifecycler:
    ring:
      heartbeat_timeout: 10m
      kvstore:
        store: memberlist
      replication_factor: 3
  max_transfer_retries: 0
  wal:
    dir: /var/loki/wal
limits_config:
  enforce_metric_name: false
  ingestion_burst_size_mb: 1024
  ingestion_rate_mb: 1024
  max_cache_freshness_per_query: 10m
  max_global_streams_per_user: 0
  reject_old_samples: true
  reject_old_samples_max_age: 168h
  split_queries_by_interval: 15m
memberlist:
  join_members:
  - loki-distributed-memberlist
query_range:
  align_queries_with_step: true
  cache_results: true
  max_retries: 5
  results_cache:
    cache:
      embedded_cache:
        enabled: true
        ttl: 24h
ruler:
  alertmanager_url: http://am.xxxx.com/
  enable_alertmanager_v2: true
  enable_api: true
  enable_sharding: true
  ring:
    kvstore:
      store: memberlist
  rule_path: /tmp/loki/scratch
  storage:
    local:
      directory: /etc/loki/rules
    type: local
runtime_config:
  file: /var/loki-distributed-runtime/runtime.yaml
schema_config:
  configs:
  - from: "2023-06-15"
    index:
      period: 24h
      prefix: loki_index_
    object_store: s3
    schema: v11
    store: boltdb-shipper
server:
  grpc_server_max_recv_msg_size: 8388608
  http_listen_port: 3100
  log_level: debug
storage_config:
  aws:
    access_key_id: xxxxxxxx
    bucketnames: loki
    endpoint: https://s3.xxxx.com
    s3forcepathstyle: true
    secret_access_key: xxxxxxx
  boltdb_shipper:
    active_index_directory: /var/loki/index
    cache_location: /var/loki/cache
    cache_ttl: 24h
    index_gateway_client:
      server_address: dns:///loki-distributed-index-gateway:9095
    shared_store: s3
table_manager:
  retention_deletes_enabled: false
  retention_period: 336h
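Given the 1h25m WAL replay logged above, a hedged sketch of Kubernetes probe settings that avoid killing the pod while it replays; the use of a startupProbe and the exact numbers are assumptions, not an official recommendation:

# Probe sketch for the ingester container: tolerate long /ready failures at
# startup (WAL replay) without restarting the pod.
startupProbe:
  httpGet:
    path: /ready
    port: 3100
  periodSeconds: 30
  failureThreshold: 240        # allows roughly 2h of WAL replay
readinessProbe:
  httpGet:
    path: /ready
    port: 3100
  periodSeconds: 10
  failureThreshold: 3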

@kadhamecha-conga

Hi,

I'm facing the same issue with the ingesters; the replication factor is set to 1 and throughput stays above 300 MB for stretches of time.
What are the recommendations for preventing the PVC (Persistent Volume Claim) disk space from filling up?

image

thanks.
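Not an authoritative answer, but the WAL-related knobs discussed earlier in this thread bound how much data accumulates on the PVC; a sketch with example values:

# Sketch only: settings from earlier comments that influence WAL/PVC growth.
ingester:
  max_chunk_age: 1h            # flush chunks sooner so WAL segments can be truncated earlier
  chunk_idle_period: 45m       # idle streams get flushed instead of lingering
  wal:
    enabled: true
    dir: /var/loki-wal/wal
    checkpoint_duration: 3m    # more frequent checkpoints keep segment files bounded
    replay_memory_ceiling: 4GB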

@patsevanton
Contributor

What parameters can be tuned to fix the ingester OOM?
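Summarising the knobs people in this thread adjusted, as a hedged starting point rather than a guaranteed fix (each is discussed in the comments above):

# Parameters experimented with in this issue when chasing ingester OOMs.
ingester:
  flush_op_timeout: 10m        # give flush operations time to finish (default was 10s)
  concurrent_flushes: 4        # very high values reportedly made things worse
  chunk_encoding: gzip         # workaround: better compression, more query-time CPU
  chunk_target_size: 1572864   # ~1.5 MB chunks
  wal:
    replay_memory_ceiling: 4GB # cap memory used while replaying the WAL
  lifecycler:
    ring:
      replication_factor: 2    # remember this multiplies per-ingester volume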

@waney316

in loki-distributed: v2.8.2

  1. when the ingester instance starts, WAL replay takes a long time and the liveness probe /ready returns 503, causing the pod to be killed by k8s.
level=debug ts=2023-07-13T09:40:15.567663647Z caller=replay_controller.go:85 msg="replay flusher pre-flush" bytes="3.9 GB"
level=info ts=2023-07-13T18:49:54.185275394Z caller=ingester.go:432 msg="recovered WAL checkpoint recovery finished" elapsed=1h25m27.618187613s errors=false
  1. after ingester has been running for a while, /ready returns 503, triggering failureThreshold.
level=debug ts=2023-07-13T21:54:42.009853071Z caller=grpc_logging.go:46 method=/logproto.Pusher/Push duration=133.948µs msg="gRPC (success)"
level=debug ts=2023-07-13T21:54:41.970250053Z caller=grpc_logging.go:46 method=/logproto.Pusher/Push duration=97.306µs msg="gRPC (success)"
level=debug ts=2023-07-13T21:54:41.959086516Z caller=grpc_logging.go:46 method=/logproto.Pusher/Push duration=345.667µs msg="gRPC (success)"
level=debug ts=2023-07-13T21:54:41.883472557Z caller=grpc_logging.go:46 method=/logproto.Pusher/Push duration=390.464µs msg="gRPC (success)"
level=debug ts=2023-07-13T21:54:41.857637139Z caller=grpc_logging.go:46 method=/logproto.Pusher/Push duration=143.269µs msg="gRPC (success)"
level=debug ts=2023-07-13T21:54:41.850804813Z caller=grpc_logging.go:46 method=/logproto.Pusher/Push duration=2.114682ms msg="gRPC (success)"
level=debug ts=2023-07-13T21:54:41.834328932Z caller=grpc_logging.go:46 method=/logproto.Pusher/Push duration=154.931µs msg="gRPC (success)"
level=debug ts=2023-07-13T21:54:41.82012353Z caller=grpc_logging.go:46 method=/logproto.Pusher/Push duration=207.537µs msg="gRPC (success)"
level=debug ts=2023-07-13T21:54:41.813212013Z caller=grpc_logging.go:46 method=/logproto.Pusher/Push duration=92.601µs msg="gRPC (success)"
level=debug ts=2023-07-13T21:54:41.792460412Z caller=grpc_logging.go:46 method=/logproto.Pusher/Push duration=108.007µs msg="gRPC (success)"
level=debug ts=2023-07-13T21:54:41.791365533Z caller=grpc_logging.go:46 method=/logproto.Pusher/Push duration=1.758775ms msg="gRPC (success)"
level=debug ts=2023-07-13T21:54:41.785694761Z caller=grpc_logging.go:46 method=/logproto.Pusher/Push duration=128.378µs msg="gRPC (success)"
level=debug ts=2023-07-13T21:54:41.779550321Z caller=grpc_logging.go:46 method=/logproto.Pusher/Push duration=115.792µs msg="gRPC (success)"
level=debug ts=2023-07-13T21:54:41.744681031Z caller=grpc_logging.go:46 method=/logproto.Pusher/Push duration=134.317µs msg="gRPC (success)"
level=debug ts=2023-07-13T21:54:41.741187529Z caller=grpc_logging.go:46 method=/logproto.Pusher/Push duration=184.242µs msg="gRPC (success)"
level=debug ts=2023-07-13T21:54:41.700433508Z caller=logging.go:76 traceID=3692480357453467 orgID=fake msg="GET /ready (503) 226.89µs"
level=debug ts=2023-07-13T21:54:41.700315187Z caller=logging.go:76 traceID=013752d39cf5f756 orgID=fake msg="GET /ready (503) 168.439µs"
level=debug ts=2023-07-13T21:54:41.698996636Z caller=grpc_logging.go:46 method=/logproto.Pusher/Push duration=233.53µs msg="gRPC (success)"
level=debug ts=2023-07-13T21:54:41.630833375Z caller=grpc_logging.go:46 method=/logproto.Pusher/Push duration=505.952µs msg="gRPC (success)"
level=debug ts=2023-07-13T21:54:41.575828089Z caller=grpc_logging.go:46 method=/logproto.Pusher/Push duration=79.789µs msg="gRPC (success)"
level=debug ts=2023-07-13T21:54:41.548931278Z caller=grpc_logging.go:46 method=/logproto.Pusher/Push duration=162.128µs msg="gRPC (success)"
level=debug ts=2023-07-13T21:54:41.547607013Z caller=grpc_logging.go:46 method=/logproto.Pusher/Push duration=119.453µs msg="gRPC (success)"
level=debug ts=2023-07-13T21:54:41.544827245Z caller=grpc_logging.go:46 duration=634.215µs method=/logproto.Pusher/Push msg="gRPC (success)"
level=debug ts=2023-07-13T21:54:41.529306468Z caller=grpc_logging.go:46 method=/logproto.Pusher/Push duration=489.491µs msg="gRPC (success)"
level=debug ts=2023-07-13T21:54:41.523354118Z caller=grpc_logging.go:46 method=/logproto.Pusher/Push duration=2.375194ms msg="gRPC (success)"
level=debug ts=2023-07-13T21:54:41.512421167Z caller=grpc_logging.go:46 method=/logproto.Pusher/Push duration=135.41µs msg="gRPC (success)"
level=debug ts=2023-07-13T21:54:41.428774945Z caller=grpc_logging.go:46 method=/logproto.Pusher/Push duration=152.067µs msg="gRPC (success)"
level=debug ts=2023-07-13T21:54:41.420698054Z caller=grpc_logging.go:46 method=/logproto.Pusher/Push duration=240.11µs msg="gRPC (success)"
level=debug ts=2023-07-13T21:54:41.42015423Z caller=grpc_logging.go:46 method=/logproto.Pusher/Push duration=108.545µs msg="gRPC (success)"
level=debug ts=2023-07-13T21:54:41.41724774Z caller=grpc_logging.go:46 duration=561.023µs method=/logproto.Pusher/Push msg="gRPC (success)"
level=debug ts=2023-07-13T21:54:41.393577466Z caller=grpc_logging.go:46 method=/logproto.Pusher/Push duration=6.551817ms msg="gRPC (success)"
level=debug ts=2023-07-13T21:54:41.373680409Z caller=grpc_logging.go:46 duration=132.107µs method=/logproto.Pusher/Push msg="gRPC (success)"
level=debug ts=2023-07-13T21:54:41.353435658Z caller=grpc_logging.go:46 method=/logproto.Pusher/Push duration=137.495µs msg="gRPC (success)"
level=debug ts=2023-07-13T21:54:41.337244543Z caller=grpc_logging.go:46 method=/logproto.Pusher/Push duration=113.371µs msg="gRPC (success)"
level=debug ts=2023-07-13T21:54:41.304051499Z caller=grpc_logging.go:46 method=/logproto.Pusher/Push duration=159.196µs msg="gRPC (success)"
level=debug ts=2023-07-13T21:54:41.281351806Z caller=grpc_logging.go:46 method=/logproto.Pusher/Push duration=95.614µs msg="gRPC (success)"
level=debug ts=2023-07-13T21:54:41.237438257Z caller=grpc_logging.go:46 method=/logproto.Pusher/Push duration=368.574µs msg="gRPC (success)"
level=debug ts=2023-07-13T21:54:41.202303788Z caller=grpc_logging.go:46 method=/logproto.Pusher/Push duration=146.243µs msg="gRPC (success)"
level=debug ts=2023-07-13T21:54:41.202013848Z caller=grpc_logging.go:46 method=/logproto.Pusher/Push duration=400.918µs msg="gRPC (success)"
level=debug ts=2023-07-13T21:54:41.124335318Z caller=grpc_logging.go:46 method=/logproto.Pusher/Push duration=83.731µs msg="gRPC (success)"
level=debug ts=2023-07-13T21:54:41.120591359Z caller=grpc_logging.go:46 duration=2.802114ms method=/logproto.Pusher/Push msg="gRPC (success)"
level=debug ts=2023-07-13T21:54:41.112198628Z caller=grpc_logging.go:46 method=/logproto.Pusher/Push duration=375.067µs msg="gRPC (success)"
level=debug ts=2023-07-13T21:54:41.102364787Z caller=grpc_logging.go:46 method=/logproto.Pusher/Push duration=649.065µs msg="gRPC (success)"
level=debug ts=2023-07-13T21:54:41.086500042Z caller=grpc_logging.go:46 method=/logproto.Pusher/Push duration=353.689µs msg="gRPC (success)"
level=debug ts=2023-07-13T21:54:40.961800295Z caller=grpc_logging.go:46 method=/logproto.Pusher/Push duration=289.264µs msg="gRPC (success)"
level=debug ts=2023-07-13T21:54:40.898557168Z caller=grpc_logging.go:46 method=/logproto.Pusher/Push duration=123.691µs msg="gRPC (success)"
level=debug ts=2023-07-13T21:54:40.893143334Z caller=grpc_logging.go:46 method=/logproto.Pusher/Push duration=553.4µs msg="gRPC (success)"
level=debug ts=2023-07-13T21:54:40.868387579Z caller=grpc_logging.go:46 method=/logproto.Pusher/Push duration=214.881µs msg="gRPC (success)"
level=debug ts=2023-07-13T21:54:40.830839462Z caller=grpc_logging.go:46 method=/logproto.Pusher/Push duration=21.149108ms msg="gRPC (success)"
level=debug ts=2023-07-13T21:54:40.830719571Z caller=grpc_logging.go:46 method=/logproto.Pusher/Push duration=49.80596ms msg="gRPC (success)"
level=debug ts=2023-07-13T21:54:40.830659191Z caller=grpc_logging.go:46 method=/logproto.Pusher/Push duration=51.772022ms msg="gRPC (success)"
level=debug ts=2023-07-13T21:54:40.735416312Z caller=grpc_logging.go:46 method=/logproto.Pusher/Push duration=233.864µs msg="gRPC (success)"
level=debug ts=2023-07-13T21:54:40.707437724Z caller=grpc_logging.go:46 method=/logproto.Pusher/Push duration=68.091µs msg="gRPC (success)"
level=debug ts=2023-07-13T21:54:40.691527022Z caller=grpc_logging.go:46 method=/logproto.Pusher/Push duration=233.004µs msg="gRPC (success)"
level=debug ts=2023-07-13T21:54:40.689179748Z caller=grpc_logging.go:46 method=/logproto.Pusher/Push duration=234.66µs msg="gRPC (success)"
level=debug ts=2023-07-13T21:54:40.677146084Z caller=grpc_logging.go:46 method=/logproto.Pusher/Push duration=378.731µs msg="gRPC (success)"
level=debug ts=2023-07-13T21:54:40.656513077Z caller=grpc_logging.go:46 method=/grpc.health.v1.Health/Check duration=101.372µs msg="gRPC (success)"
level=debug ts=2023-07-13T21:54:40.613227574Z caller=grpc_logging.go:46 method=/logproto.Pusher/Push duration=122.61µs msg="gRPC (success)"
level=debug ts=2023-07-13T21:54:40.58582904Z caller=grpc_logging.go:46 method=/logproto.Pusher/Push duration=199.931µs msg="gRPC (success)"
level=debug ts=2023-07-13T21:54:40.583073436Z caller=grpc_logging.go:46 method=/logproto.Pusher/Push duration=336.158µs msg="gRPC (success)"

How can I troubleshoot this problem?

my configuration file:

auth_enabled: false
chunk_store_config:
  max_look_back_period: 0s
common:
  compactor_address: http://loki-distributed-compactor:3100
compactor:
  shared_store: s3
  working_directory: /var/loki/compactor
distributor:
  ring:
    kvstore:
      store: memberlist
frontend:
  compress_responses: true
  log_queries_longer_than: 5s
  tail_proxy_url: http://loki-distributed-querier:3100
frontend_worker:
  frontend_address: loki-distributed-query-frontend-headless:9095
ingester:
  chunk_block_size: 262144
  chunk_encoding: snappy
  chunk_idle_period: 1h
  chunk_retain_period: 1m
  chunk_target_size: 8388608
  lifecycler:
    ring:
      heartbeat_timeout: 10m
      kvstore:
        store: memberlist
      replication_factor: 3
  max_transfer_retries: 0
  wal:
    dir: /var/loki/wal
limits_config:
  enforce_metric_name: false
  ingestion_burst_size_mb: 1024
  ingestion_rate_mb: 1024
  max_cache_freshness_per_query: 10m
  max_global_streams_per_user: 0
  reject_old_samples: true
  reject_old_samples_max_age: 168h
  split_queries_by_interval: 15m
memberlist:
  join_members:
  - loki-distributed-memberlist
query_range:
  align_queries_with_step: true
  cache_results: true
  max_retries: 5
  results_cache:
    cache:
      embedded_cache:
        enabled: true
        ttl: 24h
ruler:
  alertmanager_url: http://am.xxxx.com/
  enable_alertmanager_v2: true
  enable_api: true
  enable_sharding: true
  ring:
    kvstore:
      store: memberlist
  rule_path: /tmp/loki/scratch
  storage:
    local:
      directory: /etc/loki/rules
    type: local
runtime_config:
  file: /var/loki-distributed-runtime/runtime.yaml
schema_config:
  configs:
  - from: "2023-06-15"
    index:
      period: 24h
      prefix: loki_index_
    object_store: s3
    schema: v11
    store: boltdb-shipper
server:
  grpc_server_max_recv_msg_size: 8388608
  http_listen_port: 3100
  log_level: debug
storage_config:
  aws:
    access_key_id: xxxxxxxx
    bucketnames: loki
    endpoint: https://s3.xxxx.com
    s3forcepathstyle: true
    secret_access_key: xxxxxxx
  boltdb_shipper:
    active_index_directory: /var/loki/index
    cache_location: /var/loki/cache
    cache_ttl: 24h
    index_gateway_client:
      server_address: dns:///loki-distributed-index-gateway:9095
    shared_store: s3
table_manager:
  retention_deletes_enabled: false
  retention_period: 336h
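
One thing worth flagging in the config above: max_global_streams_per_user: 0 turns the per-tenant stream limit off, and ingestion_rate_mb / ingestion_burst_size_mb at 1024 are effectively unbounded, so nothing pushes back when a client sends high-cardinality labels, and every new stream keeps buffered chunk data in ingester memory until it is flushed. Below is a minimal sketch of a bounded limits_config; the numbers are only illustrative starting points to tune, not recommendations.

# Sketch: illustrative values only, tune for your tenants and ingest volume.
limits_config:
  enforce_metric_name: false
  reject_old_samples: true
  reject_old_samples_max_age: 168h
  # Cap active streams per tenant instead of 0 (= unlimited); each in-memory
  # stream holds buffered chunk data in the ingester until it is flushed.
  max_global_streams_per_user: 50000
  # Bound per-tenant ingestion so a burst cannot outrun the flush loop.
  ingestion_rate_mb: 16
  ingestion_burst_size_mb: 32
  # Also bound individual streams.
  per_stream_rate_limit: 5MB
  per_stream_rate_limit_burst: 20MB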

I'm facing the same issue with ingesters; how did you resolve it?
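
Not a fix, but it usually helps to see where the memory is going before the OOM hits. Below is a small Prometheus alerting-rule sketch for the ingesters; the metric names (loki_ingester_memory_streams, go_memstats_heap_inuse_bytes) come from Loki's /metrics output, while the label selectors and thresholds are placeholders to adapt to your environment.

# Sketch: selectors and thresholds are placeholders; verify the metric names
# against the /metrics endpoint of your Loki version.
groups:
  - name: loki-ingester-memory
    rules:
      - alert: LokiIngesterStreamGrowth
        # Active in-memory streams per ingester; a sudden jump usually points
        # at high-cardinality labels from one tenant or application.
        expr: sum by (pod) (loki_ingester_memory_streams) > 100000
        for: 10m
      - alert: LokiIngesterHeapHigh
        # Go heap in use approaching the pod or VM memory limit.
        expr: go_memstats_heap_inuse_bytes{job=~".*ingester.*"} > 12e9
        for: 10m

If the numbers point at heap growth rather than stream count, Loki's HTTP port also exposes the standard Go pprof endpoints, so a heap profile from /debug/pprof/heap on an affected ingester can show which allocations dominate.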

@Negashev

Negashev commented May 7, 2024

2.9.6 still OOMs

@dschunack

Hi,

We also had a lot of problems with Loki and the ingester. We spent a lot of time optimizing Loki, but in the end the OOM problem still occurred every day.
The upgrade last week from 2.9.6 to 3.1 solved our issue, and the memory usage is now amazing.

Look at this.
[screenshot loki_update_3: memory usage after the upgrade to 3.1]

Maybe an update will help fix your problem. In our case, it was solved.

@patsevanton
Contributor

It is sad that the loki-distributed Helm chart still ships version 2.9.8.

@dschunack

We used the following chart, and it is up to date:

https://github.com/grafana/loki/tree/main/production/helm/loki

@CharlieC3

Can confirm: switching off the loki-distributed Helm chart and onto the standard loki chart linked above, which allowed us to upgrade to Loki 3.1, helped us too. We're no longer seeing ingester OOMs. Thanks for the info @dschunack!
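
For anyone making the same move, here is a rough sketch of a minimal values file for the standard grafana/loki chart in simple scalable mode; the key names are based on the 6.x chart layout and should be checked against the chart's values reference, and the schema date, bucket names, and replica counts are placeholders.

# Sketch: placeholder values; verify keys against the grafana/loki chart docs.
deploymentMode: SimpleScalable
loki:
  auth_enabled: false
  schemaConfig:
    configs:
      - from: "2024-04-01"       # start of the new schema period (placeholder)
        store: tsdb              # Loki 3.x is documented against tsdb + v13
        object_store: s3
        schema: v13
        index:
          prefix: loki_index_
          period: 24h
  storage:
    type: s3
    bucketNames:
      chunks: loki
      ruler: loki
      admin: loki
write:
  replicas: 3
read:
  replicas: 2
backend:
  replicas: 2

An existing boltdb-shipper period can stay as an earlier schemaConfig entry; only the new period, starting at a future date, needs to move to tsdb/v13.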

@dschunack

@CharlieC3 good to hear that.
