
Spike - Add cpu, memory requests & limits to deployment #94

Closed
ncdc opened this issue Sep 14, 2017 · 27 comments · Fixed by #1710
Labels: Enhancement/User (End-User Enhancement to Velero) · Good first issue (Looking to contribute to Velero? Issues with this label might be a great place to start!) · Help wanted
Milestone: v1.1

ncdc (Contributor) commented Sep 14, 2017

Measure Ark's memory usage over time, then set reasonable request/limit values.

@rosskukulinski rosskukulinski added the Good first issue Looking to contribute to Velero? Issues with this label might be a great place to start! label Jun 24, 2018
@rosskukulinski (Contributor)

We should do this pre-1.0 and have some metrics for cluster-size and impact on CPU/Memory.

@rosskukulinski rosskukulinski added this to the v1.0.0 milestone Jun 24, 2018
@rosskukulinski rosskukulinski added Enhancement/User End-User Enhancement to Velero and removed Enhancement labels Jun 25, 2018
@twforeman

Just as a baseline data point: we are running 0.9.0 on OpenShift 3.9. We were initially running without any limits when I realized the container had consumed 4GB of RAM.

I set a limit of 1GB and now it's restarting with OOM kills and backups are failing. We have 100 scheduled backups (one for each namespace) and currently have 11,883 backups.

Is there a recommended memory setting? Is it based on schedules? Backups?

ncdc (Contributor, issue author) commented Aug 24, 2018

In case anyone is following the question from @twforeman: we've moved that discussion to #780.

@shabx shabx added the ZD2064 label Apr 16, 2019
@skriss skriss modified the milestones: v1.0.0, v1.x May 20, 2019
@VMmore VMmore modified the milestones: v1.x, v1.1 Jun 3, 2019
@VMmore VMmore changed the title Add cpu, memory requests & limits to deployment Spike - Add cpu, memory requests & limits to deployment Jun 3, 2019
skriss (Member) commented Jun 10, 2019

xref #1452 (comment) (specifically around adding request/limit flags to velero install).

skriss (Member) commented Jun 25, 2019

As part of this issue, we should probably add request/limit flags to velero install to be able to specify these.

prydonius (Contributor) commented Jul 12, 2019

I've been monitoring Velero's CPU and memory usage in my GKE cluster using Stackdriver, and found that the server pod uses 42M of memory and 5m of CPU at rest. I installed various applications in the cluster, including the nginx-example in this repo and 10 instances of the WordPress Helm chart, to experiment with backup/restore operations. A full cluster backup created the largest spike: 53M memory and 16m CPU (see results of other operations below).

We could generously set requests to 128Mi memory and 100m CPU, and be even more generous with limits (e.g. 256Mi, 200m). What do you think?

I need to do some further investigation for restic backups so we can define requests/limits for those pods also.

| operation | memory (M) | CPU (m) |
| --- | --- | --- |
| backup of example nginx | 42 | 8.4 |
| backup of 1 instance of Helm WordPress | 42 | 8.4 |
| backup of 10 instances of Helm WordPress | 42 | 12 |
| full cluster backup | 53 | 16 |
| delete backup | 42 | 6.7 |
| delete multiple backups | 53 | 16 |
| multiple backup requests (4) | 42 | 14 |
| restore nginx-example | 53 | 8.6 |
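For reference, the requests/limits proposed above would translate to a `resources` stanza on the Velero Deployment. A minimal sketch, assuming a default install in the `velero` namespace with the server as the first container (names are assumptions, not from this thread):

```shell
# Sketch: apply the proposed requests/limits to the Velero server Deployment.
# Assumes the Deployment is named "velero" in the "velero" namespace.
kubectl -n velero patch deployment velero --type=json -p '[
  {"op": "add",
   "path": "/spec/template/spec/containers/0/resources",
   "value": {
     "requests": {"cpu": "100m", "memory": "128Mi"},
     "limits":   {"cpu": "200m", "memory": "256Mi"}}}
]'
```

The same values could equally be set by editing the Deployment YAML directly.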

@rosskukulinski (Contributor)

cool!

I wonder if the number of backups/restores that Velero is tracking has an impact on cpu/memory?

What about Restic? I assume the daemonset consumes much more CPU/memory when doing filesystem backup & restore.

skriss (Member) commented Jul 12, 2019

@prydonius this is great. Let's continue to collect some more info.

> I wonder if the number of backups/restores that Velero is tracking has an impact on cpu/memory?

Yeah, I imagine this will have an impact - let's look at it.

Things will definitely get fuzzier with restic. The daemonset pods are one thing (that's where restic backups and restores run), but also the restic prune operation probably consumes a decent amount of resources for larger repos, and that runs in the main velero pod.

@prydonius (Contributor)

> I wonder if the number of backups/restores that Velero is tracking has an impact on cpu/memory?

> Yeah, I imagine this will have an impact - let's look at it.

+1, I'll create a scheduled backup to see how the cpu/memory use grows over time with that.

> Things will definitely get fuzzier with restic. The daemonset pods are one thing (that's where restic backups and restores run), but also the restic prune operation probably consumes a decent amount of resources for larger repos, and that runs in the main velero pod.

Ah, that's good to know. When is restic prune run? I'll spend some more time testing restic backups.

skriss (Member) commented Jul 12, 2019

It gets run every 24h by default, but you can tweak the frequency: run kubectl -n velero edit resticrepository NAME and change spec.maintenanceFrequency to something more frequent.
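The same change can be scripted rather than made through the interactive editor. A sketch (the repository name is a placeholder and the 12h value is just an example, not a recommendation from this thread):

```shell
# List restic repositories to find the one to adjust.
kubectl -n velero get resticrepositories

# Sketch: shorten the maintenance (prune) interval from the 24h default.
kubectl -n velero patch resticrepository <REPO_NAME> --type=merge \
  -p '{"spec":{"maintenanceFrequency":"12h"}}'
```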

skriss (Member) commented Jul 15, 2019

> We could generously set requests to 128Mi memory and 100m CPU, and be even more generous with limits (e.g. 256Mi, 200m). What do you think?

This seems like a reasonable place to start for non-restic deployments.

I definitely want to add flags to velero install for setting these, so they're user-tuneable.

We'll also have to find a reasonable way to update the docs so that users (a) know what our baseline recommendations are, (b) understand that they may need to be changed depending on their scenario (particularly if they use restic), and (c) understand how to change them (via the velero install CLI flags or by editing the YAML).
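Such install flags might look like the following. This is a sketch: the flag names are assumptions modeled on what later shipped (verify against velero install --help), and the provider/bucket values are placeholders; other required install flags are omitted:

```shell
# Sketch: installing Velero with user-tuned server requests/limits.
velero install \
  --provider <PROVIDER> \
  --bucket <BUCKET> \
  --velero-pod-cpu-request 100m \
  --velero-pod-mem-request 128Mi \
  --velero-pod-cpu-limit 200m \
  --velero-pod-mem-limit 256Mi
```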

@prydonius (Contributor)

Just to add some more data, here is the trend for an hourly full-cluster backup and a half-hourly nginx-example backup:

[Screenshot, 2019-07-15: CPU/memory usage trend for the scheduled backups]

The CPU spikes are pretty consistent; the memory spikes are less so, but it's resting at around 25M (lower than what I was previously seeing).

nrb (Contributor) commented Jul 16, 2019

velero install flags with pre-defined defaults make a lot of sense to me. We may need two sets, one for the Velero Deployment and one for the restic DaemonSet, depending on the characteristics observed.

For the full cluster backup, how many namespaces were included?

I'm also curious about the number of included volumes. I expect even a large number of GCP, AWS, or Azure volumes to have minimal impact on these numbers, given we snapshot them asynchronously.

@prydonius (Contributor)

> For the full cluster backup, how many namespaces were included?

5 namespaces

> I'm also curious about included number of volumes. I expect even a large number of GCP, AWS, or Azure volumes to have minimal impact on these numbers, given we do it asynchronously.

21 volumes (2 per each WordPress instance, and 1 for the nginx example).

Is there an easy way of spinning up a large number of workloads/volumes to measure how velero server usage grows as the number of workloads/volumes grows?

prydonius (Contributor) commented Jul 16, 2019

Some restic results:

  • restic backup of 80GB volume
    • CPU: 1000m (max available on the node)
    • memory: 114M
  • restic backup of 5GB volume
    • CPU: 200m
    • memory: 10.2M

It seems restic's usage varies quite a lot depending on the size of the data being backed up. I'm also concerned that a limit of 1000m (1 core) on the Pod would have caused it to be restarted when performing the larger backup.
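Given those numbers, the restic DaemonSet would likely need a higher ceiling than the server Deployment. A hypothetical sketch (the daemonset name restic is assumed from a default install, and the values here are illustrative, not numbers agreed in this thread):

```shell
# Sketch: give the restic DaemonSet pods more headroom than the server,
# since a single large volume backup can saturate a full core.
kubectl -n velero patch daemonset restic --type=json -p '[
  {"op": "add",
   "path": "/spec/template/spec/containers/0/resources",
   "value": {
     "requests": {"cpu": "500m", "memory": "512Mi"},
     "limits":   {"cpu": "1000m", "memory": "1Gi"}}}
]'
```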

@prydonius (Contributor)

> it gets run every 24h by default, but you can tweak the frequency - you can kubectl -n velero edit resticrepository NAME, and change the spec.maintenanceFrequency to something more often.

There are slightly larger spikes in the Velero Pod when maintenance runs, but not by a huge amount with my current setup:

[Screenshot, 2019-07-17: Velero Pod usage during restic maintenance]

nrb (Contributor) commented Jul 17, 2019

@prydonius I wonder how it's impacted if you have a restic backup on the 80GB volume running when the restic maintenance is scheduled. I may be remembering incorrectly, but I think it may not actually impact the CPU as the maintenance will get run once the backup's done.

@prydonius (Contributor)

@nrb don't the restic backup and maintenance happen in different Pods (the restic server and the velero server, respectively)? I'm not sure what the consequences are of running a prune while a backup is happening, though this may be prevented by restic's locking?

skriss (Member) commented Jul 19, 2019

> doesn't the restic backup and maintenance happen in different Pods (restic server and velero server respectively)?

Yep!

> I'm not sure what the consequences are of running a prune whilst a backup is happening, though this may be prevented by restic locking?

Yes - prune requires an exclusive lock so would not run concurrently with a backup.

Evesy commented Jul 24, 2019

[Screenshot: CPU/memory usage during a full cluster backup]

Full cluster backup above, including GCE volume snapshots:

Namespaces: 317
Total resources: 15,199
Persistent volumes: 45

@prydonius (Contributor)

Thanks for sharing @Evesy, this is really useful. Looking at the RAM use, I think the default request/limit we settled on (128Mi and 256Mi, respectively) should be fine.

The CPU use during your full cluster backups is much higher than what I was seeing with far fewer resources. We settled on a 0.1 request and 0.2 limit, which clearly would not work in your case. I'm not sure how the CPU use scales, but if we were to use your usage as a baseline, a 0.5 request and 1.0 limit could be sufficient?

nrb (Contributor) commented Jul 24, 2019

I may be reading the graph incorrectly, but looking at https://kubernetes.slack.com/archives/C6VCGP4MT/p1563893211023600?thread_ts=1563890835.023500&cid=C6VCGP4MT, it seems that the CPU usage is just under 4.0 cores.

That said, I don't think we're going to land on anything that works for everybody. It may be worthwhile to help users learn how to benchmark and tune it themselves.

@prydonius (Contributor)

@nrb that graph shows CFS throttling; the actual throttled usage is the green line sitting just below the limit. @Evesy removed the limits on their Pod to get the above result.

nrb (Contributor) commented Jul 24, 2019

Ah, I see, I was indeed reading it incorrectly.

The values proposed seem reasonable to me.

skriss (Member) commented Jul 24, 2019

The revised numbers seem OK to me. They'll never be right for everyone, but they seem sane and can be user-tuned as needed.

skriss (Member) commented Jul 29, 2019

@prydonius what (if anything) is left on this issue once #1678 gets merged? I guess we don't have flags/defaults for the restic daemonset yet - anything else?

@prydonius (Contributor)

@skriss yes, just the flags/defaults for restic. I think we can close this out after that.
