Wait for namespace to terminate before restoring #826
Conversation
pkg/restore/restore.go (Outdated)

```go
go func() {
	defer ctx.resourceWaitGroup.Done()

	if _, err := waitForNamespaceDelete(namespaceWatch.ResultChan(), mappedNsName, ctx.namespaceTimeout, ctx.logger); err != nil {
```
Is this pattern what I want to be using here? I see there's also the informer event handlers, as seen at https://github.com/heptio/ark/blob/master/pkg/restic/restorer.go#L64. Is that only for the additions, updates, and deletions from the cache?
@nrb it's better to use the informer. If you create your own watch, it can terminate for a variety of reasons, and you might miss an event. While you're not guaranteed to see 100% of events with the informers, they do handle watch terminations and automatically reestablish new watches for you, and you'll get the latest state of the object (or a "deleted final state unknown", meaning it was deleted but the informer missed the deletion event).
Ok, thanks - I'll rework this to use an informer DeleteFunc.
pkg/restore/restore.go (Outdated)

```diff
@@ -696,6 +797,8 @@ func (ctx *context) restoreResource(resource, namespace, resourcePath string) (a
 	go func() {
 		defer ctx.resourceWaitGroup.Done()
 
+		// TODO(nrb): This is currently conflicting with the namespace termination waits. If the namespace becomes
+		// ready to restore, everything in the NS seems to be restored but the PV never gets created and the PVC is "lost"
```
Thoughts on handling this? I'm going to keep testing and see if a new PV is made and then deleted during the termination wait, but that may be a dead end.
I don't really see why this would affect PVs at first glance - let me see if I can test it and figure out what's going on.
I believe it's because the PV timeout is 1 minute, where I was using 2-5 minute namespace timeouts. I believe the PV restore attempt would time out before the namespace termination did, and thus abandon doing a restore. I did try using the same `ctx.namespaceTimeout` value for both, but I didn't see any improvement.

I think upping the timeout on PV restoration would cause an issue with mapping the PVC, but I'm not positive.
OK, here's what I think is happening:
- namespace `foo` has a PVC claiming PV `bar`
- namespace `foo` is deleted and moves to `Terminating` state
- user executes `ark restore create`, which starts waiting on `foo` to be deleted
- `foo` finishes deleting, at which point the PVC claiming PV `bar` is gone, and PV `bar` is released
- Ark immediately restores resources in NS `foo`, then moves on to trying to restore the PV
- The dynamic provisioner that deletes released PVs hasn't yet deleted the existing PV
- Ark sees that it's trying to restore a resource that already exists, and can't, so it moves on
- The dynamic provisioner deletes the PV
- End result: Ark restores the PVC in namespace `foo`, but the PV never gets restored
We'd need to also wait on PVs in the `Released` state with a reclaim policy of `Delete` to disappear before trying to restore them if we want to handle this correctly.
So you're saying this would be a change to the `waitForReady` function? Such as how we determine the value of `isPVReady`?
I'm saying we'd need a `pvWaiter` that does basically the same thing as the `namespaceWaiter`, but for PVs - so when you get to restoring a PV, if it's in the released state, wait for it to be deleted before trying to restore it.
Thanks!
Can remove this comment now
In manual testing, this appears to work in that it waits for both persistent volumes and namespaces to be fully deleted. To simulate deletion and timeouts during testing, I added a fake finalizer to pods.

I'm going to continue working on the tests, since a few things did change that need to be added.
Tests are updated. I'm not explicitly testing the …
pkg/cmd/server/server.go (Outdated)

```go
	s.logger,
)
cmd.CheckError(err)

// Start the informat and wait for cache sync here so that the event handlers added in NewKubernetesRestorer are ready by the time
```
typo: informat
pkg/restore/restore.go (Outdated)

```go
	}, nil
}

namespaceInformer.Informer().AddEventHandler(
```
Upon some reflection, I'd like to have test coverage for these two event handlers, but the two solutions I came up with to do so seemed kind of heavyweight considering what they do.
- Mock the namespace and PV informers, such that the tests will add/remove/change objects so the handlers will be triggered. This is more like what our code currently does, but likely requires much more scaffolding.
- Instead of doing anonymous functions with closures over `kr`, make a factory function that will wrap `kr` and return a handler function. This means all we'd need to create to get handlers to test would be a `KubernetesRestorer`, and the functions can be called directly with dummy objects. This is a little further removed from the actual code, though.
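The second option above can be sketched roughly as follows. This is a minimal, hypothetical illustration of the factory pattern, not the PR's actual code: `waiters` and `fakeNamespace` stand in for the real `KubernetesRestorer` fields and Kubernetes objects.

```go
package main

import "fmt"

// waiters is a hypothetical stand-in for the state the real event
// handlers would close over on the KubernetesRestorer.
type waiters struct {
	deleted []string
}

// makeDeleteHandler wraps the state and returns the handler function,
// instead of registering an anonymous closure inline. Tests can then
// call the returned handler directly with dummy objects, with no
// informer machinery required.
func makeDeleteHandler(w *waiters) func(obj interface{}) {
	return func(obj interface{}) {
		type named interface{ GetName() string }
		if n, ok := obj.(named); ok {
			w.deleted = append(w.deleted, n.GetName())
		}
	}
}

// fakeNamespace is a dummy object satisfying the minimal interface the
// handler inspects.
type fakeNamespace struct{ name string }

func (f fakeNamespace) GetName() string { return f.name }

func main() {
	w := &waiters{}
	handler := makeDeleteHandler(w)

	// In a test, invoke the handler directly with a dummy object.
	handler(fakeNamespace{name: "foo"})
	fmt.Println(w.deleted) // records the deleted object's name
}
```

The tradeoff is as noted: the factory is one step removed from the code path that actually registers the handler on the informer.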
@nrb sorry, I didn't make any progress on this while you were out.

np - I'll get the tests more fleshed out this week. Would appreciate a review, too.
pkg/cmd/server/server.go (Outdated)

```go
restoreResourcePriorities                        []string
restoreOnly                                      bool
pluginDir, metricsAddress, defaultBackupLocation string
backupSyncPeriod, podVolumeOperationTimeout, namespaceTimeout time.Duration
```
since we're using this value for PV waiting as well, maybe we want to call it something more generic, like `terminatingResourceTimeout` or something along those lines
pkg/cmd/server/server.go (Outdated)

```diff
@@ -144,6 +146,7 @@ func NewCommand() *cobra.Command {
 	command.Flags().BoolVar(&config.restoreOnly, "restore-only", config.restoreOnly, "run in a mode where only restores are allowed; backups, schedules, and garbage-collection are all disabled")
 	command.Flags().StringSliceVar(&config.restoreResourcePriorities, "restore-resource-priorities", config.restoreResourcePriorities, "desired order of resource restores; any resource not in the list will be restored alphabetically after the prioritized resources")
 	command.Flags().StringVar(&config.defaultBackupLocation, "default-backup-storage-location", config.defaultBackupLocation, "name of the default backup storage location")
+	command.Flags().DurationVar(&config.namespaceTimeout, "namespace-timeout", config.namespaceTimeout, "duration to wait on namespace termination before failing a restore")
```
genericize the flag name and text, since it's being used for PVs as well
pkg/cmd/server/server.go (Outdated)

```go
s.logger.Debugln("Starting kube shared informer factory")
go s.kubeSharedInformerFactory.Start(ctx.Done())
s.logger.Debugln("Waiting for kubesharedinformer to sync")
cache.WaitForCacheSync(ctx.Done(), s.kubeSharedInformerFactory.Core().V1().Namespaces().Informer().HasSynced)
```
check return val here
I notice we don't check this elsewhere in the server - should I add that? Is it a total failure mode if it returns `false`?
pkg/restore/restore.go (Outdated)

```go
	return
}

updateChan, ok := kr.pvWaiter.getDeleteChan(pv.Name)
```
s/updateChan/deleteChan
pkg/restore/restore.go (Outdated)

```go
// pvWaiter encapsulates waiting for a PV with a reclaim policy of 'Delete' to exit the 'Released' state.
// This is to avoid the condition where a PV associated with a PV is released but not yet deleted not getting restored.
type pvWaiter struct {
```
I think we could probably eliminate the need for a second PV-specific waiter type. Within the waiter struct, we really only need the ObjectMeta; it doesn't really matter what specific type it is. I'd probably use the `metav1.Object` interface - so if you change the channel type to use that, then I think you can consolidate to a single `resourceWaiter` struct.
Ok, that makes sense.
pkg/restore/restore.go (Outdated)

```diff
@@ -696,6 +797,8 @@ func (ctx *context) restoreResource(resource, namespace, resourcePath string) (a
 	go func() {
 		defer ctx.resourceWaitGroup.Done()
 
+		// TODO(nrb): This is currently conflicting with the namespace termination waits. If the namespace becomes
+		// ready to restore, everything in the NS seems to be restored but the PV never gets created and the PVC is "lost"
```
Can remove this comment now
pkg/restore/restore.go (Outdated)

```go
logger.Infof("Namespace still terminating, waiting")

ctx.nsWaiter.addDeleteChan(mappedNsName)
_, err = ctx.nsWaiter.waitForDelete(mappedNsName, ctx.namespaceTimeout)
```
I think we probably need to check again here to see if it's been terminated, since it's possible it terminated between our first check and adding the delete channel for it. Only if it still hasn't been terminated do we then need to wait for delete.
pkg/restore/restore.go (Outdated)

```go
if isPVReleasedOrBound(fromCluster) {
	ctx.log.Infof("Persistent volume %q still in use, waiting for deletion", name)
	ctx.pvWaiter.addDeleteChan(name)
	_, err := ctx.pvWaiter.waitForDelete(name, ctx.namespaceTimeout)
```
probably need to do a similar double-check on whether the PV's been deleted here, since it could've been deleted between the first check and adding the channel
pkg/restore/restore.go (Outdated)

```diff
@@ -455,11 +619,36 @@ func (ctx *context) restoreFromDir(dir string) (api.RestoreResult, api.RestoreRe
 	if !existingNamespaces.Has(mappedNsName) {
 		logger := ctx.log.WithField("namespace", nsName)
 		ns := getNamespace(logger, filepath.Join(dir, api.ResourcesDir, "namespaces", api.ClusterScopedDir, nsName+".json"), mappedNsName)
-		if _, err := kube.EnsureNamespaceExists(ns, ctx.namespaceClient); err != nil {
+		nsIsReady, err := kube.EnsureNamespaceExistsAndIsReady(ns, ctx.namespaceClient)
```
I'm wondering if the logic here (pseudocode) should be:
1. Try to get the namespace
2. If it exists and is not terminating, add to `existingNamespaces` and move on to the next step
3. Otherwise:
   - If it doesn't exist, create it
   - If it does exist and is terminating, wait for it to no longer exist, up to a timeout. The easiest initial implementation would be to poll with a real client once per second. Given that we aren't doing restores in goroutines and instead are waiting synchronously, I think this logic would be an acceptable tradeoff vs the shared informer/events approach, which makes things much more confusing. WDYT?
I'll give that a try - I assume we'd have to do the same thing for PVs, then? I ran into a race condition when I only looked at the NS. (See #826 (comment))
Yes, we'll have to wait for anything that's in a terminal state to exit that state (which presumably means "be deleted") before we can proceed restoring that item.
@ncdc I pushed an update to this using clients and polling. Mind looking over it? I think I've covered the big use cases, though I'm not sure if we should fail the restore if we get some error besides not found when doing our checks each tick.
👀
@nrb I would still like to do what I suggested above in steps 1-3. In other words, let's not use `kube.EnsureNamespaceExistsAndIsReady`, or let's modify it to work with these steps. WDYT?
So I understood `EnsureNamespaceExistsAndIsReady` was doing 1-2 and 3i; I see I missed putting it into `existingNamespaces`. I'll move that logic out and do it all here so it's all present.
Nothing is doing step 1 yet
Force-pushed from 92a5ebb to 82c475c
pkg/cmd/server/server.go (Outdated)

```diff
@@ -152,6 +154,7 @@ func NewCommand() *cobra.Command {
 	command.Flags().StringSliceVar(&config.restoreResourcePriorities, "restore-resource-priorities", config.restoreResourcePriorities, "desired order of resource restores; any resource not in the list will be restored alphabetically after the prioritized resources")
 	command.Flags().StringVar(&config.defaultBackupLocation, "default-backup-storage-location", config.defaultBackupLocation, "name of the default backup storage location")
 	command.Flags().Var(&volumeSnapshotLocations, "default-volume-snapshot-locations", "list of unique volume providers and default volume snapshot location (provider1:location-01,provider2:location-02,...)")
+	command.Flags().DurationVar(&config.resourceTerminatingTimeout, "resource-timeout", config.resourceTerminatingTimeout, "duration to wait on resource termination before failing a restore")
```
I think `--resource-timeout` isn't 100% clear about what it's for - we should adjust
pkg/restore/restore.go (Outdated)

```diff
@@ -455,11 +619,36 @@ func (ctx *context) restoreFromDir(dir string) (api.RestoreResult, api.RestoreRe
 	if !existingNamespaces.Has(mappedNsName) {
 		logger := ctx.log.WithField("namespace", nsName)
 		ns := getNamespace(logger, filepath.Join(dir, api.ResourcesDir, "namespaces", api.ClusterScopedDir, nsName+".json"), mappedNsName)
-		if _, err := kube.EnsureNamespaceExists(ns, ctx.namespaceClient); err != nil {
+		nsIsReady, err := kube.EnsureNamespaceExistsAndIsReady(ns, ctx.namespaceClient)
```
@nrb I would still like to do what I suggested above in steps 1-3. In other words, let's not use `kube.EnsureNamespaceExistsAndIsReady`, or let's modify it to work with these steps. WDYT?
pkg/restore/restore.go (Outdated)

```go
	addArkError(&errs, err)
	continue
}

if !nsIsReady {
	timeout := time.After(ctx.resourceTerminatingTimeout)
	tick := time.Tick(1 * time.Second)
```
Need a deferred stop of `tick`, I believe
Looks like `tick` is just the channel, not a `Ticker`. I'll redo it so we have a `Ticker` that we can actually stop.
pkg/restore/restore.go (Outdated)

```go
for proceed != true {
	select {
	case <-timeout:
		err := errors.Errorf("Timed out waiting on NS %s to be ready", ns.Name)
```
NS->namespace
pkg/restore/restore.go (Outdated)

```go
		addArkError(&errs, err)
		return warnings, errs
	case <-tick:
		nsIsReady, err = kube.EnsureNamespaceExistsAndIsReady(ns, ctx.namespaceClient)
```
I don't want to re-call `kube.EnsureNamespaceExistsAndIsReady` here. Let's implement my suggestion above unless you have objections?
pkg/restore/restore.go (Outdated)

```go
ticker := time.NewTicker(time.Second)
defer ticker.Stop()
loop:
for {
```
@skriss PTAL. I'm not feeling super great about the for/select combo, but maybe it's my unfamiliarity with the Go channels mechanisms.

2 things I'm noticing with this and the namespace loop (L513):
1. Need some way to honor ctrl-c or other graceful shutdown. Should we hook it in to `signals.CancelOnShutdown` somehow?
2. Do we need to find some way to mark the restore as `Failed` for the NS/PV timeout cases? Or is completed-with-errors ok? For reference, this is what a restore that hits the timeout looks like currently:
```
x1c in /home/nrb/go/src/github.com/heptio/ark (git) fix-691 U
% ark restore describe voltest-20181109132929
Name:         voltest-20181109132929
Namespace:    heptio-ark
Labels:       <none>
Annotations:  <none>

Backup:  voltest

Namespaces:
  Included:  *
  Excluded:  <none>

Resources:
  Included:        *
  Excluded:        nodes, events, events.events.k8s.io, backups.ark.heptio.com, restores.ark.heptio.com
  Cluster-scoped:  auto

Namespace mappings:  <none>

Label selector:  <none>

Restore PVs:  auto

Phase:  Completed

Validation errors:  <none>

Warnings:
  Ark:        <none>
  Cluster:    <none>
  Namespaces:
    nginx-example:  not restored: persistentvolumeclaims "nginx-logs" already exists and is different from backed up version.
                    not restored: pods "nginx-deployment-f9d69f8fd-c6k6g" already exists and is different from backed up version.
                    not restored: deployments.apps "nginx-deployment" already exists and is different from backed up version.
                    not restored: endpoints "my-nginx" already exists and is different from backed up version.
                    not restored: replicasets.apps "nginx-deployment-f9d69f8fd" already exists and is different from backed up version.
                    not restored: services "my-nginx" already exists and is different from backed up version.

Errors:
  Ark:        Timed out waiting on PV pvc-98f7668e-e390-11e8-9753-42010a96003b to be deleted
  Cluster:    <none>
  Namespaces: <none>
```
For #1, I'm not sure that we actually want to do this - would like to think some more and maybe @ncdc will find time to provide input. We don't currently have any code that will stop a backup or restore in the middle if a shutdown signal is received - we try to finish the backup/restore, then the controllers/server exit. We'd want to be able to properly mark the restore's status as failed/interrupted, which would take more code that we don't currently have.
If we were going to implement it, it looks like the most straightforward way to do this is in `server.go`, around L590: pass `ctx` as a new arg to the `restore.NewKubernetesRestorer` func. Have that func set it as a field on the `kubernetesRestorer`. In the `Restore` method on the `kubernetesRestorer`, pass that `ctx` down to the `restoreCtx` that's created (probably as a field). Then, in your `select` statements that are doing the polling, add an additional case that receives on `ctx.Done()` and exits -- though we'd probably want to end up marking restores that were exited early like this as `Failed`.
Thanks. That makes sense. My use case was lowering the timeout for testing (default of 10m, but I wanted to drop it to 1 or 2), so doing a `kill -9` on it I guess isn't too terrible.

I can leave that part as is and address the remaining comments and update tests.
As long as we can do some work, I think we said the status is completed (or processed). Individual item warnings or errors cover specific issues. Right?
@nrb added some comments on the core restore code for namespaces, which should similarly apply for the PV code.
pkg/restore/restore.go (Outdated)

```go
if clusterNS.Status.Phase == v1.NamespaceActive {
	// keep track of namespaces that we know exist so we don't
	// have to try to create them multiple times
	existingNamespaces.Insert(mappedNsName)
```
I would move the insertion into `existingNamespaces` down to below your `for` loop; it won't need to be within an `if` statement there - we'll always execute it if we get there. See next comment as well.
pkg/restore/restore.go (Outdated)

```go
}

// Namespace exists but is terminating
if clusterNS.Status.Phase == v1.NamespaceTerminating {
```
this needs to be `if clusterNS != nil && ...`, since we'll get here if the cluster didn't have the namespace and we just created it.
pkg/restore/restore.go (Outdated)

```go
// Namespace exists but is terminating
if clusterNS.Status.Phase == v1.NamespaceTerminating {
	timeout := time.After(ctx.resourceTerminatingTimeout)
```
I'd use a `time.NewTimer` here (for the timeout) and `defer timeout.Stop()` just after
pkg/restore/restore.go (Outdated)

```go
timeout := time.After(ctx.resourceTerminatingTimeout)
ticker := time.NewTicker(time.Second)
defer ticker.Stop()
loop:
```
I'd probably move L510-534 into a helper, something like `func createNamespaceAfterTerminate(...) error`, and call it here. If it returns an error, `addArkError` and continue.
pkg/restore/restore.go (Outdated)

```go
for {
	select {
	// TODO: Need a way to break on process exit here; ctrl-c isn't working when we're in this loop
	case <-timeout:
```
`case <-timeout.C:`
pkg/restore/restore.go (Outdated)

```go
	select {
	// TODO: Need a way to break on process exit here; ctrl-c isn't working when we're in this loop
	case <-timeout:
		err := errors.Errorf("Timed out waiting on namespace %s to be ready", mappedNsName)
```
if you move this block to a helper, then you can just return this error
pkg/restore/restore.go (Outdated)

```go
if apierrors.IsAlreadyExists(err) && c.Status.Phase == v1.NamespaceTerminating {
	logger.Debugf("Namespace %s still terminating, waiting", mappedNsName)
} else if err != nil && !apierrors.IsAlreadyExists(err) {
	addArkError(&warnings, err)
```
if you move this block to a helper, then you can just return this error
It can be returned, but it's going to warnings; would you suggest all the errors returned by the helper go into `&errs`?
Hmm, I think it makes sense to record this as an error, so yeah
pkg/restore/restore.go (Outdated)

```go
		break loop
	} else {
		logger.Infof("Termination successful, namespace %s restored", mappedNsName)
		break loop
```
if you move this block to a helper, then you can return nil here
pkg/restore/restore.go
Outdated
if apierrors.IsNotFound(err) {
	pvLogger.Debugf("namespace %s for PV not found, waiting", namespace)
	// namespace not found but the PV still exists, so continue to wait
	return true, nil
return false?
pkg/restore/restore.go
Outdated
	return true, nil
})

end := time.Now()
Check the `err` that `wait.PollImmediate` returns instead of comparing start/end times.
I think I can just return the `err` that `wait.PollImmediate` uses, unless I want to add a debug line for timeout. It will be added to the result at L810 in this version of the PR.
pkg/restore/restore.go
Outdated
end := time.Now()

if !end.Before(timeout) {
remove
pkg/util/kube/utils.go
Outdated
clusterNS, err := client.Get(namespace.Name, metav1.GetOptions{})

if apierrors.IsNotFound(err) {
	// Namespace isn't in cluster, we're good to restore.
Suggested change:
- // Namespace isn't in cluster, we're good to restore.
+ // Namespace isn't in cluster, we're good to create.
pkg/util/kube/utils.go
Outdated
})

if err != nil {
	return false, errors.Wrapf(err, "err getting namespace %s", namespace.Name)
Suggested change:
- return false, errors.Wrapf(err, "err getting namespace %s", namespace.Name)
+ return false, errors.Wrapf(err, "error getting namespace %s", namespace.Name)
pkg/util/kube/utils.go
Outdated
	return false, errors.Wrapf(err, "err getting namespace %s", namespace.Name)
}

// shouldCreate is true *only* if we got a 404 during polling, and we should only be here if we didn't get an error, which means the timeout was hit
This is confusing - could you please reword?
pkg/util/kube/utils.go
Outdated
	return true, nil
}

// if we get an AlreadyExists here, something else created the NS after all of our polling...
Suggested change:
- // if we get an AlreadyExists here, something else created the NS after all of our polling...
+ // if we get any error here, something else created the NS after all of our polling...
Few comments but generally this looks good and like what was discussed last week on slack
pkg/util/kube/utils.go
Outdated
if ready {
	return true, nil
}

if _, err := client.Create(namespace); err == nil {
let's switch this to handle the non-nil error case first
pkg/cmd/server/server.go
Outdated
@@ -168,6 +171,7 @@ func NewCommand() *cobra.Command {
	command.Flags().Float32Var(&config.clientQPS, "client-qps", config.clientQPS, "maximum number of requests per second by the server to the Kubernetes API once the burst limit has been reached")
	command.Flags().IntVar(&config.clientBurst, "client-burst", config.clientBurst, "maximum number of requests by the server to the Kubernetes API in a short period of time")
	command.Flags().StringVar(&config.profilerAddress, "profiler-address", config.profilerAddress, "the address to expose the pprof profiler")
	command.Flags().DurationVar(&config.resourceTerminatingTimeout, "resource-terminate-timeout", config.resourceTerminatingTimeout, "how long to wait on persistent volumes and namespaces to terminate during a restore before timing out")
Let's maybe call the flag `terminating-resource-timeout`?
pkg/cmd/server/server.go
Outdated
pluginRegistry: pluginRegistry,
pluginManager: pluginManager,
config: config,
ctx: ctx,
I don't think these whitespace changes should be in here. Make sure you're using the latest gofmt/goimports?
pkg/restore/restore.go
Outdated
@@ -30,9 +30,11 @@ import (
	"sync"
	"time"

	"k8s.io/apimachinery/pkg/util/wait"
This can be grouped with the rest of the imports below.
pkg/util/kube/utils.go
Outdated
}

if _, err := client.Create(namespace); err != nil {
	// if we get any error here, something else created the NS after all of our polling
Do you want to check for an already-exists error?
So I think we can, but does that mean we should check the phase and deletion timestamp again? Do we care if something else made the namespace before we got to?
No, I don't think so. The reason I suggested it was to handle this sequence:
1. we decide we need to create the namespace (it doesn't exist, or it did and we waited for it to be deleted)
2. something else creates the namespace
3. we try to create the namespace, and it fails because it already exists
If this is the sequence, I don't think we should return an error.
If this is the sequence, and the namespace somehow is terminating, I think it's ok to return an error. I don't expect this to happen, and I don't want to get into an infinite loop 😄
Handful of minor style & doc comments from reading through the core logic. Moving on to look at units now.
pkg/util/kube/utils_test.go
Outdated
"k8s.io/apimachinery/pkg/runtime/schema"

arktest "github.com/heptio/ark/pkg/util/test"
"github.com/stretchr/testify/assert"
looks like this line needs to move up to the prev section
pkg/restore/restore.go
Outdated
var shouldRestore bool
err := wait.PollImmediate(time.Second, ctx.resourceTerminatingTimeout, func() (bool, error) {
	clusterPV, err := pvClient.Get(name, metav1.GetOptions{})
style nit: rm blank line
pkg/restore/restore.go
Outdated
	shouldRestore = true
	return true, nil
}
style nit: rm blank line
if err != nil {
	return false, errors.Wrapf(err, "could not retrieve in-cluster copy of PV %s", name)
style nit: rm blank line
	return false, nil
}

namespace, err := collections.GetString(clusterPV.UnstructuredContent(), "spec.claimRef.namespace")
pls add a comment explaining why we're getting the NS and querying the PVC at this point
	return false, nil
}

ns, err := ctx.namespaceClient.Get(namespace, metav1.GetOptions{})
probably worth another comment here explaining why we're now querying the NS
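The decision the reviewer wants documented — follow the PV's claimRef to its namespace, and only restore once nothing is still tearing the PV down — can be sketched as a simplified pure function. The struct, field names, and exact decision table are assumptions for illustration; the real code inspects live objects via client.Get inside wait.PollImmediate.

```go
package main

import "fmt"

// clusterState is a hypothetical snapshot of what the polling loop observes.
type clusterState struct {
	pvExists      bool // in-cluster copy of the PV still present
	nsTerminating bool // namespace from spec.claimRef is terminating
}

// shouldRestorePV reports (done, restore): done=false means keep polling,
// restore=true means it is safe to re-create the PV from the snapshot.
func shouldRestorePV(s clusterState) (done, restore bool) {
	if !s.pvExists {
		return true, true // PV is gone; safe to restore from snapshot
	}
	if s.nsTerminating {
		return false, false // still being torn down; continue waiting
	}
	// PV exists and nothing is deleting it; leave the in-cluster copy alone.
	return true, false
}

func main() {
	done, restore := shouldRestorePV(clusterState{pvExists: false})
	fmt.Println(done, restore)
}
```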
pkg/restore/restore.go
Outdated
	// PVC wasn't found, but the PV still exists, so continue to wait.
	return false, nil
}
style nit: rm blank line
var ready bool
err := wait.PollImmediate(time.Second, timeout, func() (bool, error) {
	clusterNS, err := client.Get(namespace.Name, metav1.GetOptions{})
style nit: rm blank line
	// Namespace isn't in cluster, we're good to create.
	return true, nil
}
style nit: rm blank line
@nrb tests look pretty good, I don't see any obvious gaps.
pkg/restore/restore.go
Outdated
@@ -696,10 +798,15 @@ func (ctx *context) restoreResource(resource, namespace, resourcePath string) (a

 	// Check if the PV exists in the cluster before attempting to create
 	// a volume from the snapshot, in order to avoid orphaned volumes (GH #609)
-	_, err := resourceClient.Get(name, metav1.GetOptions{})
+	shouldRestoreSnapshot, err := ctx.waitForPVDeletion(name, resourceClient)
I think it would make the logic more clear if method and variable were both named the same, whichever makes the most sense. Either `waitForPVDeletion` or `shouldRestoreSnapshot`.
I'll change the names to match and push another revision up.
@rbankston confirmed this; are there any other changes that need to be made?
LGTM
Function name change made, and rebased on the Velero rename commits. PTAL and if it's good, I'll squash.
lgtm
@skriss Are you ok with me squashing this for a merge?
If a PV already exists, wait for it, its associated PVC, and associated namespace to be deleted before attempting to restore it. If a namespace already exists, wait for it to be deleted before attempting to restore it. Signed-off-by: Nolan Brubaker <brubakern@vmware.com>
Fixes #691
Signed-off-by: Nolan Brubaker nolan@heptio.com