Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add resource limits to restic init container #1677

Merged
merged 12 commits into from
Aug 5, 2019
Merged

Conversation

nrb
Copy link
Contributor

@nrb nrb commented Jul 17, 2019

Fixes #1201

  • Add tests for the plugin's Execute method.

Signed-off-by: Nolan Brubaker brubakern@vmware.com

@@ -107,6 +108,12 @@ func (a *ResticRestoreAction) Execute(input *velero.RestoreItemActionExecuteInpu
},
},
},
Resources: corev1.ResourceRequirements{
Limits: corev1.ResourceList{
corev1.ResourceCPU: resource.MustParse("100m"),
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We should consider making these optionally configurable via the existing ConfigMap for this plugin.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I was thinking about that, but seeing how valid quantities in the wrong fields can break with #1678 gives me pause on the failure mode.

For example, if we allow the user to specify the limits and/or requests and they put cpu: 128Mi, it's a valid quantity string, but not for CPU. MustParse and ParseQuantity only seem to care about the string. So far I don't know that there's a way to find out about that until the pod fails to get scheduled, which I think would cause the Velero Restore to get stuck InProgress.

I'll do some more digging in the Kubernetes code to see if there's a helper function that we may have missed.

@nrb
Copy link
Contributor Author

nrb commented Jul 22, 2019

Added 2 test cases and the necessary helpers for creating initContainers. Let me know if this approach makes sense.

@skriss
Copy link
Member

skriss commented Jul 22, 2019

Approach LGTM.

Looking back at the original issue, looks like we need to add requests and limits (see the error message).

Is the 1Mi mem limit sufficient? Let's just make sure it's big enough that we won't run into issues.

@nrb
Copy link
Contributor Author

nrb commented Jul 22, 2019

Is the 1Mi mem limit sufficient? Let's just make sure it's big enough that we won't run into issues.

I need to actually run this against a cluster to be sure, was mostly making sure I got compilation working first.

Do you have any more thoughts about the failure mode when allowing users to configure the values?

@skriss
Copy link
Member

skriss commented Jul 22, 2019

Do you have any more thoughts about the failure mode when allowing users to configure the values?

Let me think about this a little more.

@skriss
Copy link
Member

skriss commented Jul 22, 2019

Per other PR, I think if we (a) validate that the request <= the limit for both CPU and mem, and (b) put some generous upper bounds on each (e.g. 1 CPU, 128Mi mem?), then we can allow user-configuration. If their values don't conform to these validations, we can log an error and proceed with our own defaults

@nrb
Copy link
Contributor Author

nrb commented Jul 23, 2019

#1577 and #1678 should be merged before this.

@carlisia
Copy link
Contributor

lgtm so far 👍

@prydonius
Copy link
Contributor

@nrb should I wait for this before adding flags to install for restic requests/limits (#94), since I think you wanted to move around the parseResourceRequests helper? https://github.com/heptio/velero/blob/master/pkg/cmd/cli/install/install.go#L262

@nrb
Copy link
Contributor Author

nrb commented Jul 30, 2019

@prydonius That'll work - I'll get this out of draft today.

nrb added 7 commits July 31, 2019 15:31
Fixes vmware-tanzu#1201

Signed-off-by: Nolan Brubaker <brubakern@vmware.com>
Signed-off-by: Nolan Brubaker <brubakern@vmware.com>
Signed-off-by: Nolan Brubaker <brubakern@vmware.com>
Signed-off-by: Nolan Brubaker <brubakern@vmware.com>
Signed-off-by: Nolan Brubaker <brubakern@vmware.com>
Signed-off-by: Nolan Brubaker <brubakern@vmware.com>
Signed-off-by: Nolan Brubaker <brubakern@vmware.com>
nrb added 2 commits July 31, 2019 18:09
Signed-off-by: Nolan Brubaker <brubakern@vmware.com>
Signed-off-by: Nolan Brubaker <brubakern@vmware.com>
@nrb nrb marked this pull request as ready for review July 31, 2019 22:14
Copy link
Member

@skriss skriss left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

one nit but otherwise LGTM. Have you been able to test manually?

pkg/restore/restic_restore_action.go Outdated Show resolved Hide resolved
@nrb
Copy link
Contributor Author

nrb commented Jul 31, 2019

@skriss So I'm trying to test it, but I think master might be in a broken state with restic right now.

We're not adding the snapshot annotations to the pod anymore, but the restore action code still aborts if it's not present. (https://github.com/heptio/velero/blob/master/pkg/restore/restic_restore_action.go#L71-L72). The pod comes back unaltered.

@skriss
Copy link
Member

skriss commented Aug 1, 2019

Ah, you're right, we should've continued to write the annotation in the first PR. @carlisia how's progress on #1691? Can we do anything to help?

@skriss
Copy link
Member

skriss commented Aug 1, 2019

We should probably document the additional ConfigMap fields here.

@nrb
Copy link
Contributor Author

nrb commented Aug 2, 2019

EDIT: sorry, my restic daemonset wasn't running. Running it fixes this.

I think this is currently still broken; the init container never finishes, as it doesn't find the Velero directories.

% k logs deployment/nginx-deployment -n nginx-example --container restic-wait
Not found: /restores/nginx-logs/.velero/1bdff6db-b553-11e9-9cae-42010a96001e
Not found: /restores/nginx-logs/.velero/1bdff6db-b553-11e9-9cae-42010a96001e
Not found: /restores/nginx-logs/.velero/1bdff6db-b553-11e9-9cae-42010a96001e
Not found: /restores/nginx-logs/.velero/1bdff6db-b553-11e9-9cae-42010a96001e
Not found: /restores/nginx-logs/.velero/1bdff6db-b553-11e9-9cae-42010a96001e
Not found: /restores/nginx-logs/.velero/1bdff6db-b553-11e9-9cae-42010a96001e
Not found: /restores/nginx-logs/.velero/1bdff6db-b553-11e9-9cae-42010a96001e
[....repeated...]
% k exec nginx-deployment-757ddf98bb-7xj4b  -n nginx-example --container restic-wait -it -- bash
nobody@nginx-deployment-757ddf98bb-7xj4b:/$ ls
bin  boot  dev  etc  home  lib  lib64  media  mnt  opt  proc  restores  root  run  sbin  srv  sys  tmp  usr  var  velero-restic-restore-helper
nobody@nginx-deployment-757ddf98bb-7xj4b:/$ cd restores
nobody@nginx-deployment-757ddf98bb-7xj4b:/restores$ ls
nginx-logs
nobody@nginx-deployment-757ddf98bb-7xj4b:/restores$ cd nginx-logs/
nobody@nginx-deployment-757ddf98bb-7xj4b:/restores/nginx-logs$ ls
lost+found
nobody@nginx-deployment-757ddf98bb-7xj4b:/restores/nginx-logs$ ls -al
total 24
drwxr-xr-x 3 root root  4096 Aug  2 18:27 .
drwxr-xr-x 3 root root  4096 Aug  2 18:27 ..
drwx------ 2 root root 16384 Aug  2 18:27 lost+found

I don't believe my changes caused this, but perhaps the change to the builders may have left out some info.

Signed-off-by: Nolan Brubaker <brubakern@vmware.com>
@nrb
Copy link
Contributor Author

nrb commented Aug 2, 2019

With the latest code, I'm able to restore into a namespace with a quota.

k describe quota -n nginx-example
Name:                       gke-resource-quotas
Namespace:                  nginx-example
Resource                    Used  Hard
--------                    ----  ----
count/ingresses.extensions  0     100
count/jobs.batch            0     5k
pods                        1     1500
services                    1     500


Name:          test
Namespace:     nginx-example
Resource       Used   Hard
--------       ----   ----
limits.cpu     100m   2
limits.memory  128Mi  2Gi

nrb added 2 commits August 2, 2019 17:28
Signed-off-by: Nolan Brubaker <brubakern@vmware.com>
Signed-off-by: Nolan Brubaker <brubakern@vmware.com>
@nrb
Copy link
Contributor Author

nrb commented Aug 2, 2019

@skriss Ok, all documentation should be updated now.

Copy link
Contributor

@prydonius prydonius left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lgtm

@prydonius
Copy link
Contributor

@skriss @nrb is this ok to merge for the beta?

@skriss
Copy link
Member

skriss commented Aug 5, 2019

yep - merging

@skriss skriss merged commit a4e7045 into vmware-tanzu:master Aug 5, 2019
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Restic initContainer has no limits, can't be restored into namespaces with resource quotas
4 participants