
Is there any efficient mechanism within Prometheus to save the 2 hours of data during restarts/crash recovery/upgrades? What measures/guidelines should we follow to minimize data loss with the least disruption? #2439

Closed
varun-krishna opened this issue Apr 15, 2020 · 7 comments


varun-krishna commented Apr 15, 2020

Hi Team,

Current setup:

I am running a Prometheus setup along with Thanos for extended storage, to accommodate 2 months of data for longer persistence:

2 instances of Prometheus (replica 0 / replica 1), both enabled with the Thanos sidecar
The Thanos sidecar subsequently writes data to a GCS bucket for extended long-term storage
The Thanos querier is connected to both instances and reads data via the store gateway
Challenge/Issue in current setup:

Despite having long-term storage, we would still lose up to the latest 2 hours of data (storage.tsdb.max-block-duration=2h) if all Prometheus instances go down within a single block window. How do we handle the cases below?

A backup/snapshot of the older instance during an upgrade would be an option (but it cannot be seamless, provides no fault tolerance, and only works post-facto)
2 HA instances of Prometheus handle DR scenarios, but if both instances fail, in the extreme case all 2 hours of data are lost
Question:

Is there any efficient mechanism within Prometheus to save the 2 hours of data during restarts/crash recovery/upgrades? What measures/guidelines should we follow to minimize data loss with the least disruption?

@bwplotka (Member)

Yes. It's called persistent disk. (:

Available for any cloud provider (e.g. https://cloud.google.com/persistent-disk or https://aws.amazon.com/ebs/) (:
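For context (an illustration, not part of the original comment): what makes a persistent disk sufficient is that Prometheus keeps a write-ahead log (WAL) for the in-memory head block and replays it on startup, so the not-yet-compacted 2 hours survive a restart as long as the data directory survives. A minimal sketch, assuming Docker and the placeholder host path /mnt/prometheus-data:

```shell
# Hypothetical sketch: keep the TSDB (blocks + WAL) on a persistent disk
# mounted into the container at /prometheus; on restart Prometheus replays
# the WAL and recovers the in-memory head block.
docker run -d --name prometheus \
  -v /mnt/prometheus-data:/prometheus \
  prom/prometheus \
  --storage.tsdb.path=/prometheus
```

On Kubernetes the equivalent is a StatefulSet with a volumeClaimTemplate whose volume is mounted at --storage.tsdb.path.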

Let me know if that answers your problem; if not, we will reopen!


varun-krishna commented Apr 15, 2020

I would like to know more about what the upgrade strategy should be with the mentioned setup, so that I do not lose any data. It should be noted that the loss of all Prometheus instances during a single block window will result in Thanos not being able to show any results for that period.

@varun-krishna (Author)

And is there a way to tell whether the data is coming from the buckets or from in-memory storage at any given instant?

@dinesh4747

@bwplotka @brian-brazil - it would be great if you both could shed some light here. @varun-krishna - I wanted to echo the same, as we had similar data-loss issues.

Currently it is quite unclear to us; we cannot find the right way to do a Prometheus upgrade without losing any in-memory data.

@dinesh4747

Just referring to issue #2447 - would Thanos Receiver come to the rescue here?


bwplotka commented Apr 16, 2020

Again, persistent storage is a must-have.

For any push method, including receive-based ones, you still need a persistent disk to avoid losing data during restarts of Prometheus / Grafana Agent / anything else that buffers metrics/logs/traces etc.

The only thing that changes by switching to Thanos Receiver is a reduced risk of larger data loss. With the receiver, in real-life scenarios you limit data loss to minutes, I guess, but the worst case is the same as with anything else. And this is not a Prometheus / Thanos disadvantage: any solution you use will have this problem WITHOUT "local" persistence such as Persistent Volumes backed by a disk. Hope that helps (:

@matthiasr

Unfortunately, persistent storage is not a viable answer for us – since we are (mostly) on-prem, we need a Prometheus server that monitors the persistent storage provider (rook-ceph in our case), and that cannot have a dependency on what it monitors. I saw a discussion (but cannot find it anymore) about using the Prometheus snapshot API to cause it to write out a block on shutdown; it seems like that would solve the hard part of the problem.
How can we ship that block to the long-term storage bucket? Would it make sense for Thanos to understand this situation?
This is on Kubernetes, using the Prometheus operator, so there are a few moving parts that we need to line up, but I think it would be possible to cover most of the gap in most cases.
This won't be watertight in case of unexpected node failure, but it would take away the pain of regular Prometheus deployment and cluster maintenance, where we can give the pods time to complete this process.
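For reference (not part of the original comment): the snapshot API mentioned above does exist. With the admin API enabled, Prometheus can be asked to write out a snapshot that includes the in-memory head block. A sketch of a hypothetical pre-stop step (the endpoint and flag are real; shipping the snapshot to the bucket is the open question and is only described in a comment):

```shell
# Requires Prometheus to be started with --web.enable-admin-api.
# Writes a snapshot of all current data, including the in-memory head
# block, under <storage.tsdb.path>/snapshots/.
curl -XPOST http://localhost:9090/api/v1/admin/tsdb/snapshot
# The returned snapshot directory could then be uploaded to the
# long-term storage bucket before the pod terminates.
```

Run from a Kubernetes preStop hook, this would give the pod a chance to flush the head block before a planned shutdown, which matches the "not watertight against node failure, but covers maintenance" framing above.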
