
Spike - Add cpu, memory requests & limits to deployment #94

Closed
ncdc opened this issue Sep 14, 2017 · 27 comments · Fixed by #1710
Labels: Enhancement/User (End-User Enhancement to Velero) · Good first issue (Looking to contribute to Velero? Issues with this label might be a great place to start!) · Help wanted
Milestone: v1.1

ncdc (Contributor) commented Sep 14, 2017

Measure Ark's memory usage over time, then set reasonable request/limit values.

@rosskukulinski rosskukulinski added the Good first issue Looking to contribute to Velero? Issues with this label might be a great place to start! label Jun 24, 2018
@rosskukulinski (Contributor)

We should do this pre-1.0 and have some metrics for cluster-size and impact on CPU/Memory.

@rosskukulinski rosskukulinski added this to the v1.0.0 milestone Jun 24, 2018
@rosskukulinski rosskukulinski added Enhancement/User End-User Enhancement to Velero and removed Enhancement labels Jun 25, 2018
@twforeman

Just as a baseline data point: we are running 0.9.0 on OpenShift 3.9. We were initially running without any limits when I realized the container had consumed 4GB of RAM.

I set a limit of 1GB and now it's restarting with OOM kills and backups are failing. We have 100 scheduled backups (one for each namespace) and currently have 11,883 backups.

Is there a recommended memory setting? Is it based on schedules? Backups?

ncdc (Contributor, issue author) commented Aug 24, 2018

In case anyone is following the question from @twforeman: we've moved that discussion to #780.

@shabx shabx added the ZD2064 label Apr 16, 2019
@skriss skriss modified the milestones: v1.0.0, v1.x May 20, 2019
@VMmore VMmore modified the milestones: v1.x, v1.1 Jun 3, 2019
@VMmore VMmore changed the title Add cpu, memory requests & limits to deployment Spike - Add cpu, memory requests & limits to deployment Jun 3, 2019
skriss (Member) commented Jun 10, 2019

xref #1452 (comment) (specifically around adding request/limit flags to velero install).

skriss (Member) commented Jun 25, 2019

As part of this issue, we should probably add request/limit flags to velero install to be able to specify these.

prydonius (Contributor) commented Jul 12, 2019

I've been monitoring Velero's CPU and memory usage in my GKE cluster using Stackdriver, and found that the server pod uses 42M of memory and 5m of CPU at rest. I installed various applications in the cluster, including the nginx-example in this repo and 10 instances of the WordPress Helm chart, to experiment with backup/restore operations. A full cluster backup created the largest spike: 53M memory and 16m CPU (see results of other operations below).

We could generously set requests to 128Mi memory and 100m CPU, and be even more generous with limits (e.g. 256Mi, 200m). What do you think?

I need to do some further investigation for restic backups so we can define requests/limits for those pods also.

| operation | memory (M) | CPU (m) |
| --- | --- | --- |
| backup of example nginx | 42 | 8.4 |
| backup of 1 instance of Helm WordPress | 42 | 8.4 |
| backup of 10 instances of Helm WordPress | 42 | 12 |
| full cluster backup | 53 | 16 |
| delete backup | 42 | 6.7 |
| delete multiple backups | 53 | 16 |
| multiple backup requests (4) | 42 | 14 |
| restore nginx-example | 53 | 8.6 |
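For reference, the requests/limits proposed above would translate to a `resources` stanza on the Velero Deployment. A minimal sketch, assuming a default install in the `velero` namespace with the server as the first container (names are assumptions, not from this thread):

```shell
# Sketch: apply the proposed requests/limits to the Velero server Deployment.
# Assumes the Deployment is named "velero" in the "velero" namespace.
kubectl -n velero patch deployment velero --type=json -p '[
  {"op": "add",
   "path": "/spec/template/spec/containers/0/resources",
   "value": {
     "requests": {"cpu": "100m", "memory": "128Mi"},
     "limits":   {"cpu": "200m", "memory": "256Mi"}}}
]'
```

The same values could equally be set by editing the Deployment YAML directly.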

@rosskukulinski (Contributor)

cool!

I wonder if the number of backups/restores that Velero is tracking has an impact on cpu/memory?

What about Restic? I assume the daemonset consumes much more CPU/memory when doing filesystem backup & restore.

skriss (Member) commented Jul 12, 2019

@prydonius this is great. Let's continue to collect some more info.

> I wonder if the number of backups/restores that Velero is tracking has an impact on cpu/memory?

Yeah, I imagine this will have an impact - let's look at it.

Things will definitely get fuzzier with restic. The daemonset pods are one thing (that's where restic backups and restores run), but also the restic prune operation probably consumes a decent amount of resources for larger repos, and that runs in the main velero pod.

@prydonius (Contributor)

> I wonder if the number of backups/restores that Velero is tracking has an impact on cpu/memory?

> Yeah, I imagine this will have an impact - let's look at it.

+1, I'll create a scheduled backup to see how the cpu/memory use grows over time with that.

> Things will definitely get fuzzier with restic. The daemonset pods are one thing (that's where restic backups and restores run), but also the restic prune operation probably consumes a decent amount of resources for larger repos, and that runs in the main velero pod.

Ah, that's good to know. When is restic prune run? I'll spend some more time testing restic backups.

skriss (Member) commented Jul 12, 2019

It gets run every 24h by default, but you can tweak the frequency: run kubectl -n velero edit resticrepository NAME and change spec.maintenanceFrequency to something more frequent.
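The same change can be scripted rather than made through the interactive editor. A sketch (the repository name is a placeholder and the 12h value is just an example, not a recommendation from this thread):

```shell
# List restic repositories to find the one to adjust.
kubectl -n velero get resticrepositories

# Sketch: shorten the maintenance (prune) interval from the 24h default.
kubectl -n velero patch resticrepository <REPO_NAME> --type=merge \
  -p '{"spec":{"maintenanceFrequency":"12h"}}'
```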

skriss (Member) commented Jul 15, 2019

> We could generously set requests to 128Mi memory and 100m CPU, and be even more generous with limits (e.g. 256Mi, 200m). What do you think?

This seems like a reasonable place to start for non-restic deployments.

I definitely want to add flags to velero install for setting these, so they're user-tuneable.

We'll also have to find a reasonable way to update the docs so that users (a) know what our baseline recommendations are, (b) understand that they may need to be changed depending on their scenario (particularly if they use restic), and (c) understand how to change them (via the velero install CLI flags or by editing the YAML).
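Such install flags might look like the following. This is a sketch: the flag names are assumptions modeled on what later shipped (verify against velero install --help), and the provider/bucket values are placeholders; other required install flags are omitted:

```shell
# Sketch: installing Velero with user-tuned server requests/limits.
velero install \
  --provider <PROVIDER> \
  --bucket <BUCKET> \
  --velero-pod-cpu-request 100m \
  --velero-pod-mem-request 128Mi \
  --velero-pod-cpu-limit 200m \
  --velero-pod-mem-limit 256Mi
```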

@prydonius (Contributor)

Just to add some more data, here is the trend for an hourly full-cluster backup and a half-hourly nginx-example backup:

[Screenshot, 2019-07-15: CPU/memory usage trend for the scheduled backups]

The CPU spikes are pretty consistent; the memory spikes are less so, but it's resting at around 25M (lower than what I was previously seeing).

nrb (Contributor) commented Jul 16, 2019

velero install flags with pre-defined defaults make a lot of sense to me. We may need two sets, one for the Velero Deployment and one for the restic DaemonSet, depending on the characteristics observed.

For the full cluster backup, how many namespaces were included?

I'm also curious about the number of included volumes. I expect even a large number of GCP, AWS, or Azure volumes to have minimal impact on these numbers, given we snapshot them asynchronously.

@prydonius (Contributor)

> For the full cluster backup, how many namespaces were included?

5 namespaces

> I'm also curious about included number of volumes. I expect even a large number of GCP, AWS, or Azure volumes to have minimal impact on these numbers, given we do it asynchronously.

21 volumes (2 per each WordPress instance, and 1 for the nginx example).

Is there an easy way of spinning up a large number of workloads/volumes to measure how velero server usage grows as the number of workloads/volumes grows?

prydonius (Contributor) commented Jul 16, 2019

Some restic results:

  • restic backup of 80GB volume
    • CPU: 1000m (max available on the node)
    • memory: 114M
  • restic backup of 5GB volume
    • CPU: 200m
    • memory: 10.2M

It seems restic's usage varies quite a lot depending on the size of the data being backed up. I'm also concerned that a limit of 1000m (1 core) on the Pod would have caused it to be restarted when performing the larger backup.
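Given those numbers, the restic DaemonSet would likely need a higher ceiling than the server Deployment. A hypothetical sketch (the daemonset name restic is assumed from a default install, and the values here are illustrative, not numbers agreed in this thread):

```shell
# Sketch: give the restic DaemonSet pods more headroom than the server,
# since a single large volume backup can saturate a full core.
kubectl -n velero patch daemonset restic --type=json -p '[
  {"op": "add",
   "path": "/spec/template/spec/containers/0/resources",
   "value": {
     "requests": {"cpu": "500m", "memory": "512Mi"},
     "limits":   {"cpu": "1000m", "memory": "1Gi"}}}
]'
```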

@prydonius (Contributor)

> it gets run every 24h by default, but you can tweak the frequency - you can kubectl -n velero edit resticrepository NAME, and change the spec.maintenanceFrequency to something more often.

There are slightly larger spikes in the Velero Pod when maintenance runs, but not by a huge amount with my current setup:

[Screenshot, 2019-07-17: Velero Pod usage during restic maintenance]

nrb (Contributor) commented Jul 17, 2019

@prydonius I wonder how it's impacted if you have a restic backup on the 80GB volume running when the restic maintenance is scheduled. I may be remembering incorrectly, but I think it may not actually impact the CPU as the maintenance will get run once the backup's done.

@prydonius (Contributor)

@nrb don't the restic backup and maintenance happen in different Pods (the restic server and the velero server, respectively)? I'm not sure what the consequences are of running a prune while a backup is happening, though this may be prevented by restic's locking?

skriss (Member) commented Jul 19, 2019

> doesn't the restic backup and maintenance happen in different Pods (restic server and velero server respectively)?

Yep!

> I'm not sure what the consequences are of running a prune whilst a backup is happening, though this may be prevented by restic locking?

Yes - prune requires an exclusive lock so would not run concurrently with a backup.

Evesy commented Jul 24, 2019

[Screenshot: CPU/memory usage during a full cluster backup]

Full cluster backup above, including GCE volume snapshots:

Namespaces: 317
Total resources: 15,199
Persistent volumes: 45

@prydonius (Contributor)

Thanks for sharing @Evesy, this is really useful. Looking at the RAM use, I think the default request/limit we settled on (128Mi and 256Mi, respectively) should be fine.

The CPU use during your full cluster backups is much higher than what I was seeing with far fewer resources. We settled on a 0.1 request and 0.2 limit, which clearly would not work in your case. I'm not sure how the CPU use scales, but if we were to use your usage as a baseline, a 0.5 request and 1.0 limit could be sufficient?

nrb (Contributor) commented Jul 24, 2019

I may be reading the graph incorrectly, but looking at https://kubernetes.slack.com/archives/C6VCGP4MT/p1563893211023600?thread_ts=1563890835.023500&cid=C6VCGP4MT, it seems that the CPU usage is just under 4.0 cores.

That said, I don't think we're going to land on anything that works for everybody. It may be worthwhile to help users learn how to benchmark and tune it themselves.

@prydonius (Contributor)

@nrb that graph shows CFS throttling; the actual throttled usage is the green line sitting just below the limit. @Evesy removed the limits on their Pod to get the above result.

nrb (Contributor) commented Jul 24, 2019

Ah, I see, I was indeed reading it incorrectly.

The values proposed seem reasonable to me.

skriss (Member) commented Jul 24, 2019

The revised numbers seem OK to me. They'll never be right for everyone, but they seem sane and can be user-tuned as needed.

skriss (Member) commented Jul 29, 2019

@prydonius what (if anything) is left on this issue once #1678 gets merged? I guess we don't have flags/defaults for the restic daemonset yet - anything else?

@prydonius (Contributor)

@skriss yes, just the flags/defaults for restic. I think we can close this out after that.
