
Is there any efficient mechanism within Prometheus to save the 2 hours of data during restarts/crash recovery/upgrades? What measures/guidelines should we follow to minimize data loss with the least disruption? #2439

Closed
varun-krishna opened this issue Apr 15, 2020 · 7 comments


varun-krishna commented Apr 15, 2020

Hi Team,

Current setup:

I am running a Prometheus setup along with Thanos for extended storage, to accommodate 2 months of data for longer persistence:

2 instances of Prometheus (replica 0 / replica 1), both enabled with the Thanos sidecar
The Thanos sidecar subsequently writes data to a GCS bucket for extended long-term storage
The Thanos querier is connected to both instances and reads data via the store gateway
Challenge/Issue in current setup:

Despite having long-term storage, we would still lose up to the latest 2 hours of data (storage.tsdb.max-block-duration=2h) if all Prometheus instances go down within a single block window. How do we handle the cases below?

A backup/snapshot of the older instance during an upgrade would be an option (but it cannot be seamless, provides no fault tolerance, and only works post-facto)
2 HA instances of Prometheus handle DR scenarios, but if both instances fail, in the extreme case all 2 hours of data are lost
Question:

Is there any efficient mechanism within Prometheus to save the 2 hours of data during restarts/crash recovery/upgrades? What measures/guidelines should we follow to minimize data loss with the least disruption?

@bwplotka (Member)

Yes. It's called persistent disk. (:

Available for any cloud provider (e.g. https://cloud.google.com/persistent-disk or https://aws.amazon.com/ebs/) (:
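For context (an illustration, not part of the original comment): what makes a persistent disk sufficient is that Prometheus keeps a write-ahead log (WAL) for the in-memory head block and replays it on startup, so the not-yet-compacted 2 hours survive a restart as long as the data directory survives. A minimal sketch, assuming Docker and the placeholder host path /mnt/prometheus-data:

```shell
# Hypothetical sketch: keep the TSDB (blocks + WAL) on a persistent disk
# mounted into the container at /prometheus; on restart Prometheus replays
# the WAL and recovers the in-memory head block.
docker run -d --name prometheus \
  -v /mnt/prometheus-data:/prometheus \
  prom/prometheus \
  --storage.tsdb.path=/prometheus
```

On Kubernetes the equivalent is a StatefulSet with a volumeClaimTemplate whose volume is mounted at --storage.tsdb.path.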

Let me know if that answers your problem; if not, we will reopen!


varun-krishna commented Apr 15, 2020

I would like to know more about what the upgrade strategy should be with the mentioned setup, so that I do not lose any data. It should be noted that the loss of all Prometheus instances during a single block window will result in Thanos not being able to show any results for that period.

@varun-krishna (Author)

And is there a way to tell whether the data is coming from the buckets or from in-memory storage at any given instant?

@dinesh4747

@bwplotka @brian-brazil - it would be great if you both could shed some light here. @varun-krishna - I wanted to echo the same, as we had similar data-loss issues.

Currently it is quite unclear to us; we cannot find the right way to do a Prometheus upgrade without losing any in-memory data.

@dinesh4747

Just referring to issue #2447 - would Thanos Receiver come to the rescue here?


bwplotka commented Apr 16, 2020

Again, persistent storage is a must-have.

For any push method, including receive-based ones, you still need a persistent disk to avoid losing data during restarts of Prometheus / Grafana Agent / anything else that buffers metrics/logs/traces etc.

The only thing that changes by switching to Thanos Receiver is a reduced risk of larger data loss. With the receiver, in real-life scenarios you limit data loss to minutes, I guess, but the worst case is the same as with anything else. And this is not a Prometheus / Thanos disadvantage: any solution you use will have this problem WITHOUT "local" persistence such as Persistent Volumes backed by a disk. Hope that helps (:

@matthiasr

Unfortunately, persistent storage is not a viable answer for us – since we are (mostly) on-prem, we need a Prometheus server that monitors the persistent storage provider (rook-ceph in our case), and that cannot have a dependency on what it monitors. I saw a discussion (but cannot find it anymore) about using the Prometheus snapshot API to cause it to write out a block on shutdown; it seems like that would solve the hard part of the problem.
How can we ship that block to the long-term storage bucket? Would it make sense for Thanos to understand this situation?
This is on Kubernetes, using the Prometheus operator, so there are a few moving parts that we need to line up, but I think it would be possible to cover most of the gap in most cases.
This won't be watertight in case of unexpected node failure, but it would take away the pain of regular Prometheus deployment and cluster maintenance, where we can give the pods time to complete this process.
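For reference (not part of the original comment): the snapshot API mentioned above does exist. With the admin API enabled, Prometheus can be asked to write out a snapshot that includes the in-memory head block. A sketch of a hypothetical pre-stop step (the endpoint and flag are real; shipping the snapshot to the bucket is the open question and is only described in a comment):

```shell
# Requires Prometheus to be started with --web.enable-admin-api.
# Writes a snapshot of all current data, including the in-memory head
# block, under <storage.tsdb.path>/snapshots/.
curl -XPOST http://localhost:9090/api/v1/admin/tsdb/snapshot
# The returned snapshot directory could then be uploaded to the
# long-term storage bucket before the pod terminates.
```

Run from a Kubernetes preStop hook, this would give the pod a chance to flush the head block before a planned shutdown, which matches the "not watertight against node failure, but covers maintenance" framing above.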
