Spike - Add cpu, memory requests & limits to deployment #94
We should do this pre-1.0 and have some metrics for cluster size and impact on CPU/memory.
Just as a baseline comment: we are running 0.9.0 on OpenShift 3.9. We were initially running without any limits, until I realized the container had consumed 4GB of RAM. I set a limit of 1GB and now it's restarting with OOM issues and backups are failing. We have 100 scheduled backups (one for each namespace) and currently have 11,883 backups. Is there a recommended memory setting? Is it based on the number of schedules? Backups?
In case anyone is following the question from @twforeman, we've moved the discussion to #780
xref #1452 (comment), specifically around adding request/limit flags to …
As part of this issue, we should probably add request/limit flags to …
I've been monitoring Velero's CPU and memory usage in my GKE cluster using Stackdriver, and I've found that the server pod uses 42M of memory and 5m CPU at rest. I installed various applications in the cluster, including the nginx-example in this repo and 10 instances of the WordPress Helm chart, to experiment with backup/restore operations. A full cluster backup created the largest spike: 53M memory and 16m CPU (see results of other operations below). We could generously set requests to 128Mi memory and 100m CPU, and be even more generous with limits (e.g. 256Mi, 200m). What do you think? I need to do some further investigation for restic backups so we can define requests/limits for those pods as well.
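For reference, the values proposed above would look like this as a `resources` stanza on the Velero server Deployment. This is a sketch only; the surrounding manifest fields and container name are assumptions, not copied from the repo:

```yaml
# Illustrative resources block for the Velero server Deployment,
# using the request/limit values proposed in this comment.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: velero
  namespace: velero
spec:
  template:
    spec:
      containers:
        - name: velero
          resources:
            requests:
              cpu: 100m
              memory: 128Mi
            limits:
              cpu: 200m
              memory: 256Mi
```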
cool! I wonder if the number of backups/restores that Velero is tracking has an impact on CPU/memory? What about restic? I assume the daemonset consumes much more CPU/memory when doing filesystem backup & restore.
@prydonius this is great. Let's continue to collect some more info.
Yeah, I imagine this will have an impact - let's look at it. Things will definitely get fuzzier with restic. The daemonset pods are one thing (that's where restic backups and restores run), but also the …
+1, I'll create a scheduled backup to see how the cpu/memory use grows over time with that.
Ah, that's good to know. When is …
it gets run every 24h by default, but you can tweak the frequency - you can …
This seems like a reasonable place to start for non-restic deployments. I definitely want to add flags to … We'll also have to find a reasonable way to update the docs so that users (a) know what our baseline recommendations are, (b) understand that they may need to be changed depending on their scenario (particularly if they use restic), and (c) understand how to change them (using the …).
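If install-time flags along those lines were added, usage might look something like the command below. This is a hypothetical sketch - the flag names follow the `--velero-pod-*` pattern but should be verified against `velero install --help`, and the provider/bucket values are placeholders:

```sh
# Hypothetical: setting server pod resources at install time.
velero install \
  --provider gcp \
  --bucket my-velero-bucket \
  --velero-pod-cpu-request 100m \
  --velero-pod-mem-request 128Mi \
  --velero-pod-cpu-limit 200m \
  --velero-pod-mem-limit 256Mi
```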
For the full cluster backup, how many namespaces were included? I'm also curious about the number of volumes included. I expect even a large number of GCP, AWS, or Azure volumes to have minimal impact on these numbers, given we do that work asynchronously.
5 namespaces
21 volumes (2 per WordPress instance, and 1 for the nginx example). Is there an easy way of spinning up a large number of workloads/volumes to measure how velero server usage might grow as the number of workloads/volumes grows?
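One low-effort approach is to generate a batch of dummy namespaces, each with a workload and a PVC, and pipe the result to `kubectl apply -f -`. A minimal sketch using only the Python standard library; the names, image, and counts here are arbitrary choices for load testing, not anything Velero-specific:

```python
# Generate N dummy namespaces, each containing a pause Deployment and a
# PersistentVolumeClaim, as one multi-document YAML stream suitable for
# `python gen_load.py | kubectl apply -f -`.
def make_load_manifests(n_namespaces: int, replicas: int = 1) -> str:
    docs = []
    for i in range(n_namespaces):
        ns = f"load-test-{i}"
        docs.append(f"""\
apiVersion: v1
kind: Namespace
metadata:
  name: {ns}""")
        docs.append(f"""\
apiVersion: apps/v1
kind: Deployment
metadata:
  name: pause
  namespace: {ns}
spec:
  replicas: {replicas}
  selector:
    matchLabels: {{app: pause}}
  template:
    metadata:
      labels: {{app: pause}}
    spec:
      containers:
        - name: pause
          image: registry.k8s.io/pause:3.9""")
        docs.append(f"""\
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: data
  namespace: {ns}
spec:
  accessModes: [ReadWriteOnce]
  resources:
    requests:
      storage: 1Gi""")
    return "\n---\n".join(docs)

if __name__ == "__main__":
    print(make_load_manifests(50))
```

Scaling `n_namespaces` up and re-running backups at each step would give a rough curve of server resource use versus cluster size.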
Some restic results:
Seems like restic varies quite a lot depending on the size of the data being backed up. I'm also concerned about a 1000m (1 core) limit on the Pod - hitting it during the larger backup would have meant heavy CPU throttling.
There are slightly larger spikes in the Velero Pod when maintenance is run, but not by a huge amount with the current setup I have:
@prydonius I wonder how it's impacted if you have a restic backup on the 80GB volume running when the restic maintenance is scheduled. I may be remembering incorrectly, but I think it may not actually impact the CPU, as the maintenance will get run once the backup's done.
@nrb doesn't the restic backup and maintenance happen in different Pods (restic server and velero server respectively)? I'm not sure what the consequences are of running a prune whilst a backup is happening, though this may be prevented by restic locking?
Yep!
Yes - …
Thanks for sharing @Evesy, this is really useful. Looking at the RAM use, I think the default request/limit we settled on (128Mi and 256Mi respectively) should be fine. The CPU use during your full cluster backups is much higher than what I was seeing with far fewer resources. We settled on a 0.1 request and 0.2 limit, which clearly would not work in your case. I'm not sure how the CPU use scales, but if we were to use your usage as a baseline, would a 0.5 request and 1.0 limit be sufficient?
I may be reading the graph incorrectly, but looking at https://kubernetes.slack.com/archives/C6VCGP4MT/p1563893211023600?thread_ts=1563890835.023500&cid=C6VCGP4MT, it seems that the CPU usage is just under 4.0. That said, I don't think we're going to land on anything that works for everybody. It may be worthwhile to help users learn how to benchmark and tune it themselves.
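For that kind of self-benchmarking doc, the simplest starting point is to watch actual usage during a representative backup. A sketch, assuming metrics-server is installed and Velero runs in the `velero` namespace:

```sh
# Kick off a representative backup, then observe per-container usage.
# Requires metrics-server in the cluster.
velero backup create bench-test
kubectl top pod -n velero --containers
```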
Ah, I see, I was indeed reading it incorrectly. The values proposed seem reasonable to me.
Revised numbers seem OK to me. They'll never be right for everyone, but they seem sane and can be user-tuned as needed.
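For users who do need to tune, the defaults can be overridden on an existing install without editing manifests by hand, e.g. with `kubectl set resources` (values here are illustrative, not recommendations):

```sh
# Adjust the server Deployment's resources in place (illustrative values).
kubectl -n velero set resources deployment/velero \
  --requests=cpu=500m,memory=128Mi \
  --limits=cpu=1,memory=512Mi
```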
@prydonius what (if anything) is left on this issue once #1678 gets merged? I guess we don't have flags/defaults for the restic daemonset yet - anything else?
@skriss yes, just the flags/defaults for restic; I think we can close this out after that.
Fixed version tag for velero 1.6.0 (not rc1)
Measure ark's memory usage over time, set reasonable values