Wait for namespace to terminate before restoring #826
Conversation
pkg/restore/restore.go (Outdated)

```go
go func() {
	defer ctx.resourceWaitGroup.Done()

	if _, err := waitForNamespaceDelete(namespaceWatch.ResultChan(), mappedNsName, ctx.namespaceTimeout, ctx.logger); err != nil {
```
Is this pattern what I want to be using here? I see there's also the informer event handlers, as seen at https://github.com/heptio/ark/blob/master/pkg/restic/restorer.go#L64. Is that only for the additions, updates, and deletions from the cache?
@nrb it's better to use the informer. If you create your own watch, it can terminate for a variety of reasons, and you might miss an event. While you're not guaranteed to see 100% of events with the informers, they do handle watch terminations and automatically reestablish new watches for you, and you'll get the latest state of the object (or a "deleted final state unknown", meaning it was deleted but the informer missed the deletion event).
Ok, thanks - I'll rework this to use an informer DeleteFunc.
pkg/restore/restore.go (Outdated)

```diff
@@ -696,6 +797,8 @@ func (ctx *context) restoreResource(resource, namespace, resourcePath string) (a
 	go func() {
 		defer ctx.resourceWaitGroup.Done()
 
+		// TODO(nrb): This is currently conflicting with the namespace termination waits. If the namespace becomes
+		// ready to restore, everything in the NS seems to be restored but the PV never gets created and the PVC is "lost"
```
Thoughts on handling this? I'm going to keep testing and see if a new PV is made and then deleted during the termination wait, but that may be a dead end.
I don't really see why this would affect PVs at first glance - let me see if I can test it and figure out what's going on.
I believe it's because the PV timeout is 1 minute, where I was using 2-5 minute namespace timeouts. I believe the PV restore attempt would time out before the namespace termination did, and thus abandon doing a restore. I did try using the same `ctx.namespaceTimeout` value for both, but I didn't see any improvement.

I think upping the timeout on PV restoration would cause an issue with mapping the PVC, but I'm not positive.
OK, here's what I think is happening:
- namespace `foo` has a PVC claiming PV `bar`
- namespace `foo` is deleted and moves to `Terminating` state
- user executes `ark restore create`, which starts waiting on `foo` to be deleted
- `foo` finishes deleting, at which point the PVC claiming PV `bar` is gone, and PV `bar` is released
- Ark immediately restores resources in NS `foo`, then moves on to trying to restore the PV
- The dynamic provisioner that deletes released PVs hasn't yet deleted the existing PV
- Ark sees that it's trying to restore a resource that already exists, and can't, so it moves on
- The dynamic provisioner deletes the PV
- End result: Ark restores the PVC in namespace `foo`, but the PV never gets restored
We'd need to also wait on PVs in the `Released` state with a reclaim policy of `Delete` to disappear before trying to restore them if we want to handle this correctly.
So you're saying this would be a change to the `waitForReady` function? Such as how we determine the value of `isPVReady`?
I'm saying we'd need a `pvWaiter` that does basically the same thing as the `namespaceWaiter`, but for PVs - so when you get to restoring a PV, if it's in the released state, wait for it to be deleted before trying to restore it.
Thanks!
Can remove this comment now
In manual testing, this appears to work in that it waits for both persistent volumes and namespaces to be fully deleted. To simulate deletion and timeouts during testing, I added a fake finalizer to pods.

I'm going to continue working on the tests, since a few things did change that need to be added.
Tests are updated. I'm not explicitly testing the …
pkg/cmd/server/server.go (Outdated)

```go
	s.logger,
)
cmd.CheckError(err)

// Start the informat and wait for cache sync here so that the event handlers added in NewKubernetesRestorer are ready by the time
```
typo: informat
pkg/restore/restore.go (Outdated)

```go
	}, nil
}

namespaceInformer.Informer().AddEventHandler(
```
Upon some reflection, I'd like to have test coverage for these two event handlers, but the two solutions I came up with to do so seemed kind of heavyweight considering what they do.
- Mock the namespace and PV informers, such that the tests will add/remove/change objects so the handlers will be triggered. This is more like what our code currently does, but likely requires much more scaffolding.
- Instead of doing anonymous functions with closures over `kr`, make a factory function that will wrap `kr` and return a handler function. This means all we'd need to create to get handlers to test would be a `KubernetesRestorer`, and the functions can be called directly with dummy objects. This is a little further removed from the actual code, though.
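The second option above can be sketched roughly as follows. This is a minimal, hypothetical illustration of the factory pattern, not the PR's actual code: `waiters` and `fakeNamespace` stand in for the real `KubernetesRestorer` fields and Kubernetes objects.

```go
package main

import "fmt"

// waiters is a hypothetical stand-in for the state the real event
// handlers would close over on the KubernetesRestorer.
type waiters struct {
	deleted []string
}

// makeDeleteHandler wraps the state and returns the handler function,
// instead of registering an anonymous closure inline. Tests can then
// call the returned handler directly with dummy objects, with no
// informer machinery required.
func makeDeleteHandler(w *waiters) func(obj interface{}) {
	return func(obj interface{}) {
		type named interface{ GetName() string }
		if n, ok := obj.(named); ok {
			w.deleted = append(w.deleted, n.GetName())
		}
	}
}

// fakeNamespace is a dummy object satisfying the minimal interface the
// handler inspects.
type fakeNamespace struct{ name string }

func (f fakeNamespace) GetName() string { return f.name }

func main() {
	w := &waiters{}
	handler := makeDeleteHandler(w)

	// In a test, invoke the handler directly with a dummy object.
	handler(fakeNamespace{name: "foo"})
	fmt.Println(w.deleted) // records the deleted object's name
}
```

The tradeoff is as noted: the factory is one step removed from the code path that actually registers the handler on the informer.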
@nrb sorry, I didn't make any progress on this while you were out.

np - I'll get the tests more fleshed out this week. Would appreciate a review, too.
pkg/cmd/server/server.go (Outdated)

```go
restoreResourcePriorities                        []string
restoreOnly                                      bool
pluginDir, metricsAddress, defaultBackupLocation string
backupSyncPeriod, podVolumeOperationTimeout, namespaceTimeout time.Duration
```
since we're using this value for PV waiting as well, maybe we want to call it something more generic, like `terminatingResourceTimeout` or something along those lines
pkg/cmd/server/server.go (Outdated)

```diff
@@ -144,6 +146,7 @@ func NewCommand() *cobra.Command {
 	command.Flags().BoolVar(&config.restoreOnly, "restore-only", config.restoreOnly, "run in a mode where only restores are allowed; backups, schedules, and garbage-collection are all disabled")
 	command.Flags().StringSliceVar(&config.restoreResourcePriorities, "restore-resource-priorities", config.restoreResourcePriorities, "desired order of resource restores; any resource not in the list will be restored alphabetically after the prioritized resources")
 	command.Flags().StringVar(&config.defaultBackupLocation, "default-backup-storage-location", config.defaultBackupLocation, "name of the default backup storage location")
+	command.Flags().DurationVar(&config.namespaceTimeout, "namespace-timeout", config.namespaceTimeout, "duration to wait on namespace termination before failing a restore")
```
genericize the flag name and text, since it's being used for PVs as well
pkg/cmd/server/server.go (Outdated)

```go
s.logger.Debugln("Starting kube shared informer factory")
go s.kubeSharedInformerFactory.Start(ctx.Done())
s.logger.Debugln("Waiting for kubesharedinformer to sync")
cache.WaitForCacheSync(ctx.Done(), s.kubeSharedInformerFactory.Core().V1().Namespaces().Informer().HasSynced)
```
check return val here
I notice we don't check this elsewhere in the server - should I add that? Is it a total failure mode if it returns `false`?
pkg/restore/restore.go (Outdated)

```go
	return
}

updateChan, ok := kr.pvWaiter.getDeleteChan(pv.Name)
```
s/updateChan/deleteChan
pkg/restore/restore.go (Outdated)

```go
// pvWaiter encapsulates waiting for a PV with a reclaim policy of 'Delete' to exit the 'Released' state.
// This is to avoid the condition where a PV associated with a PV is released but not yet deleted not getting restored.
type pvWaiter struct {
```
I think we could probably eliminate the need for a second PV-specific waiter type. Within the waiter struct, we really only need the ObjectMeta; it doesn't really matter what specific type it is. I'd probably use the `metav1.Object` interface - so if you change the channel type to use that, then I think you can consolidate to a single `resourceWaiter` struct.
Ok, that makes sense.
pkg/restore/restore.go (Outdated)

```diff
@@ -696,6 +797,8 @@ func (ctx *context) restoreResource(resource, namespace, resourcePath string) (a
 	go func() {
 		defer ctx.resourceWaitGroup.Done()
 
+		// TODO(nrb): This is currently conflicting with the namespace termination waits. If the namespace becomes
+		// ready to restore, everything in the NS seems to be restored but the PV never gets created and the PVC is "lost"
```
Can remove this comment now
pkg/restore/restore.go (Outdated)

```go
logger.Infof("Namespace still terminating, waiting")

ctx.nsWaiter.addDeleteChan(mappedNsName)
_, err = ctx.nsWaiter.waitForDelete(mappedNsName, ctx.namespaceTimeout)
```
I think we probably need to check again here to see if it's been terminated, since it's possible it terminated between our first check and adding the delete channel for it. Only if it still hasn't been terminated do we then need to wait for delete.
pkg/restore/restore.go (Outdated)

```go
if isPVReleasedOrBound(fromCluster) {
	ctx.log.Infof("Persistent volume %q still in use, waiting for deletion", name)
	ctx.pvWaiter.addDeleteChan(name)
	_, err := ctx.pvWaiter.waitForDelete(name, ctx.namespaceTimeout)
```
probably need to do a similar double-check on whether the PV's been deleted here, since it could've been deleted between the first check and adding the channel
pkg/restore/restore.go (Outdated)

```diff
@@ -455,11 +619,36 @@ func (ctx *context) restoreFromDir(dir string) (api.RestoreResult, api.RestoreRe
 	if !existingNamespaces.Has(mappedNsName) {
 		logger := ctx.log.WithField("namespace", nsName)
 		ns := getNamespace(logger, filepath.Join(dir, api.ResourcesDir, "namespaces", api.ClusterScopedDir, nsName+".json"), mappedNsName)
-		if _, err := kube.EnsureNamespaceExists(ns, ctx.namespaceClient); err != nil {
+		nsIsReady, err := kube.EnsureNamespaceExistsAndIsReady(ns, ctx.namespaceClient)
```
I'm wondering if the logic here (pseudocode) should be:
1. Try to get the namespace
2. If it exists and is not terminating, add to `existingNamespaces` and move on to the next step
3. Otherwise:
   - If it doesn't exist, create it
   - If it does exist and is terminating, wait for it to no longer exist, up to a timeout. The easiest initial implementation would be to poll with a real client once per second. Given that we aren't doing restores in goroutines and instead are waiting synchronously, I think this logic would be an acceptable tradeoff vs the shared informer/events approach, which makes things much more confusing. WDYT?
I'll give that a try - I assume we'd have to do the same thing for PVs, then? I ran into a race condition when I only looked at the NS. (See #826 (comment))
Yes, we'll have to wait for anything that's in a terminal state to exit that state (which presumably means "be deleted") before we can proceed restoring that item.
@ncdc I pushed an update to this using clients and polling. Mind looking over it? I think I've covered the big use cases, though I'm not sure if we should fail the restore if we get some error besides not found when doing our checks each tick.
👀
@nrb I would still like to do what I suggested above in steps 1-3. In other words, let's not use `kube.EnsureNamespaceExistsAndIsReady`, or let's modify it to work with these steps. WDYT?
So I understood `EnsureNamespaceExistsAndIsReady` was doing 1-2 and 3i; I see I missed putting it into `existingNamespaces`. I'll move that logic out and do it all here so it's all present.
Nothing is doing step 1 yet
Force-pushed from 92a5ebb to 82c475c
pkg/cmd/server/server.go (Outdated)

```diff
@@ -152,6 +154,7 @@ func NewCommand() *cobra.Command {
 	command.Flags().StringSliceVar(&config.restoreResourcePriorities, "restore-resource-priorities", config.restoreResourcePriorities, "desired order of resource restores; any resource not in the list will be restored alphabetically after the prioritized resources")
 	command.Flags().StringVar(&config.defaultBackupLocation, "default-backup-storage-location", config.defaultBackupLocation, "name of the default backup storage location")
 	command.Flags().Var(&volumeSnapshotLocations, "default-volume-snapshot-locations", "list of unique volume providers and default volume snapshot location (provider1:location-01,provider2:location-02,...)")
+	command.Flags().DurationVar(&config.resourceTerminatingTimeout, "resource-timeout", config.resourceTerminatingTimeout, "duration to wait on resource termination before failing a restore")
```
I think `--resource-timeout` isn't 100% clear about what it's for - we should adjust
pkg/restore/restore.go (Outdated)

```diff
@@ -455,11 +619,36 @@ func (ctx *context) restoreFromDir(dir string) (api.RestoreResult, api.RestoreRe
 	if !existingNamespaces.Has(mappedNsName) {
 		logger := ctx.log.WithField("namespace", nsName)
 		ns := getNamespace(logger, filepath.Join(dir, api.ResourcesDir, "namespaces", api.ClusterScopedDir, nsName+".json"), mappedNsName)
-		if _, err := kube.EnsureNamespaceExists(ns, ctx.namespaceClient); err != nil {
+		nsIsReady, err := kube.EnsureNamespaceExistsAndIsReady(ns, ctx.namespaceClient)
```
@nrb I would still like to do what I suggested above in steps 1-3. In other words, let's not use `kube.EnsureNamespaceExistsAndIsReady`, or let's modify it to work with these steps. WDYT?
pkg/restore/restore.go (Outdated)

```go
	addArkError(&errs, err)
	continue
}

if !nsIsReady {
	timeout := time.After(ctx.resourceTerminatingTimeout)
	tick := time.Tick(1 * time.Second)
```
Need a deferred stop of `tick`, I believe
Looks like `tick` is just the channel, not a `Ticker`. I'll redo it so we have a `Ticker` that we can actually stop.
pkg/restore/restore.go (Outdated)

```go
for proceed != true {
	select {
	case <-timeout:
		err := errors.Errorf("Timed out waiting on NS %s to be ready", ns.Name)
```
NS->namespace
pkg/restore/restore.go (Outdated)

```go
		addArkError(&errs, err)
		return warnings, errs
	case <-tick:
		nsIsReady, err = kube.EnsureNamespaceExistsAndIsReady(ns, ctx.namespaceClient)
```
I don't want to re-call `kube.EnsureNamespaceExistsAndIsReady` here. Let's implement my suggestion above unless you have objections?
pkg/restore/restore.go (Outdated)

```go
ticker := time.NewTicker(time.Second)
defer ticker.Stop()
loop:
for {
```
@skriss PTAL. I'm not feeling super great about the for/select combo, but maybe it's my unfamiliarity with the Go channels mechanisms.

2 things I'm noticing with this and the namespace loop (L513):
1. Need some way to honor ctrl-c or other graceful shutdown. Should we hook it in to `signals.CancelOnShutdown` somehow?
2. Do we need to find some way to mark the restore as `Failed` for the NS/PV timeout cases? Or is completed-with-errors ok? For reference, this is what a restore that hits the timeout looks like currently:
```
x1c in /home/nrb/go/src/github.com/heptio/ark (git) fix-691 U
% ark restore describe voltest-20181109132929
Name:         voltest-20181109132929
Namespace:    heptio-ark
Labels:       <none>
Annotations:  <none>

Backup:  voltest

Namespaces:
  Included:  *
  Excluded:  <none>

Resources:
  Included:        *
  Excluded:        nodes, events, events.events.k8s.io, backups.ark.heptio.com, restores.ark.heptio.com
  Cluster-scoped:  auto

Namespace mappings:  <none>

Label selector:  <none>

Restore PVs:  auto

Phase:  Completed

Validation errors:  <none>

Warnings:
  Ark:        <none>
  Cluster:    <none>
  Namespaces:
    nginx-example:  not restored: persistentvolumeclaims "nginx-logs" already exists and is different from backed up version.
                    not restored: pods "nginx-deployment-f9d69f8fd-c6k6g" already exists and is different from backed up version.
                    not restored: deployments.apps "nginx-deployment" already exists and is different from backed up version.
                    not restored: endpoints "my-nginx" already exists and is different from backed up version.
                    not restored: replicasets.apps "nginx-deployment-f9d69f8fd" already exists and is different from backed up version.
                    not restored: services "my-nginx" already exists and is different from backed up version.

Errors:
  Ark:        Timed out waiting on PV pvc-98f7668e-e390-11e8-9753-42010a96003b to be deleted
  Cluster:    <none>
  Namespaces: <none>
```
For #1, I'm not sure that we actually want to do this - would like to think some more and maybe @ncdc will find time to provide input. We don't currently have any code that will stop a backup or restore in the middle if a shutdown signal is received - we try to finish the backup/restore, then the controllers/server exit. We'd want to be able to properly mark the restore's status as failed/interrupted, which would take more code that we don't currently have.
If we were going to implement it, it looks like the most straightforward way to do this is in `server.go`, around L590: pass `ctx` as a new arg to the `restore.NewKubernetesRestorer` func. Have that func set it as a field on the `kubernetesRestorer`. In the `Restore` method on the `kubernetesRestorer`, pass that `ctx` down to the `restoreCtx` that's created (probably as a field). Then, in your `select` statements that are doing the polling, add an additional case that receives on `ctx.Done()` and exits -- though we'd probably want to end up marking restores that were exited early like this as `Failed`.
Thanks. That makes sense. My use case was lowering the timeout for testing (default of 10m, but I wanted to drop it to 1 or 2), so doing a `kill -9` on it I guess isn't too terrible.

I can leave that part as is and address the remaining comments and update tests.
As long as we can do some work, I think we said the status is completed (or processed). Individual item warnings or errors cover specific issues. Right?
@nrb added some comments on the core restore code for namespaces, which should similarly apply for the PV code.
pkg/restore/restore.go (Outdated)

```go
if clusterNS.Status.Phase == v1.NamespaceActive {
	// keep track of namespaces that we know exist so we don't
	// have to try to create them multiple times
	existingNamespaces.Insert(mappedNsName)
```
I would move the insertion into `existingNamespaces` down to below your `for` loop; it won't need to be within an `if` statement there - we'll always execute it if we get there. See next comment as well.
pkg/restore/restore.go (Outdated)

```go
}

// Namespace exists but is terminating
if clusterNS.Status.Phase == v1.NamespaceTerminating {
```
this needs to be `if clusterNS != nil && ...`, since we'll get here if the cluster didn't have the namespace and we just created it.
pkg/restore/restore.go (Outdated)

```go
// Namespace exists but is terminating
if clusterNS.Status.Phase == v1.NamespaceTerminating {
	timeout := time.After(ctx.resourceTerminatingTimeout)
```
I'd use a `time.NewTimer` here (for the timeout) and `defer timeout.Stop()` just after
pkg/restore/restore.go (Outdated)

```go
timeout := time.After(ctx.resourceTerminatingTimeout)
ticker := time.NewTicker(time.Second)
defer ticker.Stop()
loop:
```
I'd probably move L510-534 into a helper, something like `func createNamespaceAfterTerminate(...) error`, and call it here. If it returns an error, `addArkError` and continue.
pkg/restore/restore.go (Outdated)

```go
for {
	select {
	// TODO: Need a way to break on process exit here; ctrl-c isn't working when we're in this loop
	case <-timeout:
```
`case <-timeout.C:`
pkg/restore/restore.go (Outdated)

```go
	select {
	// TODO: Need a way to break on process exit here; ctrl-c isn't working when we're in this loop
	case <-timeout:
		err := errors.Errorf("Timed out waiting on namespace %s to be ready", mappedNsName)
```
if you move this block to a helper, then you can just return this error
pkg/restore/restore.go (Outdated)

```go
if apierrors.IsAlreadyExists(err) && c.Status.Phase == v1.NamespaceTerminating {
	logger.Debugf("Namespace %s still terminating, waiting", mappedNsName)
} else if err != nil && !apierrors.IsAlreadyExists(err) {
	addArkError(&warnings, err)
```
if you move this block to a helper, then you can just return this error
It can be returned, but it's going to warnings; would you suggest all the errors returned by the helper go into `&errs`?
Hmm, I think it makes sense to record this as an error, so yeah
pkg/restore/restore.go (Outdated)

```go
		break loop
	} else {
		logger.Infof("Termination successful, namespace %s restored", mappedNsName)
		break loop
```
if you move this block to a helper, then you can return nil here
pkg/restore/restore.go
Outdated
if apierrors.IsNotFound(err) {
	pvLogger.Debugf("namespace %s for PV not found, waiting", namespace)
	// namespace not found but the PV still exists, so continue to wait
	return true, nil
return false?
pkg/restore/restore.go
Outdated
	return true, nil
})

end := time.Now()
Check the `err` that `wait.PollImmediate` returns instead of comparing start/end times.
I think I can just return the `err` that `wait.PollImmediate` uses, unless I want to add a debug line for timeout. It will be added to the result at L810 in this version of the PR.
pkg/restore/restore.go
Outdated
end := time.Now()

if !end.Before(timeout) {
remove
pkg/util/kube/utils.go
Outdated
clusterNS, err := client.Get(namespace.Name, metav1.GetOptions{})

if apierrors.IsNotFound(err) {
	// Namespace isn't in cluster, we're good to restore.
Suggested change:
- // Namespace isn't in cluster, we're good to restore.
+ // Namespace isn't in cluster, we're good to create.
pkg/util/kube/utils.go
Outdated
})

if err != nil {
	return false, errors.Wrapf(err, "err getting namespace %s", namespace.Name)
Suggested change:
- return false, errors.Wrapf(err, "err getting namespace %s", namespace.Name)
+ return false, errors.Wrapf(err, "error getting namespace %s", namespace.Name)
pkg/util/kube/utils.go
Outdated
	return false, errors.Wrapf(err, "err getting namespace %s", namespace.Name)
}

// shouldCreate is true *only* if we got a 404 during polling, and we should only be here if we didn't get an error, which means the timeout was hit
This is confusing - could you please reword?
pkg/util/kube/utils.go
Outdated
	return true, nil
}

// if we get an AlreadyExists here, something else created the NS after all of our polling...
Suggested change:
- // if we get an AlreadyExists here, something else created the NS after all of our polling...
+ // if we get any error here, something else created the NS after all of our polling...
Few comments but generally this looks good and like what was discussed last week on slack
pkg/util/kube/utils.go
Outdated
if ready {
	return true, nil
}

if _, err := client.Create(namespace); err == nil {
let's switch this to handle the non-nil error case first
pkg/cmd/server/server.go
Outdated
@@ -168,6 +171,7 @@ func NewCommand() *cobra.Command {
	command.Flags().Float32Var(&config.clientQPS, "client-qps", config.clientQPS, "maximum number of requests per second by the server to the Kubernetes API once the burst limit has been reached")
	command.Flags().IntVar(&config.clientBurst, "client-burst", config.clientBurst, "maximum number of requests by the server to the Kubernetes API in a short period of time")
	command.Flags().StringVar(&config.profilerAddress, "profiler-address", config.profilerAddress, "the address to expose the pprof profiler")
	command.Flags().DurationVar(&config.resourceTerminatingTimeout, "resource-terminate-timeout", config.resourceTerminatingTimeout, "how long to wait on persistent volumes and namespaces to terminate during a restore before timing out")
Let's maybe call the flag `terminating-resource-timeout`?
pkg/cmd/server/server.go
Outdated
pluginRegistry: pluginRegistry,
pluginManager: pluginManager,
config: config,
ctx: ctx,
I don't think these whitespace changes should be in here. Make sure you're using the latest gofmt/goimports?
pkg/restore/restore.go
Outdated
@@ -30,9 +30,11 @@ import (
	"sync"
	"time"

	"k8s.io/apimachinery/pkg/util/wait"
This can be grouped with the rest of the imports below.
pkg/util/kube/utils.go
Outdated
}

if _, err := client.Create(namespace); err != nil {
	// if we get any error here, something else created the NS after all of our polling
Do you want to check for an already-exists error?
So I think we can, but does that mean we should check the phase and deletion timestamp again? Do we care if something else made the namespace before we got to?
No, I don't think so. The reason I suggested it was to handle this sequence:
1. we decide we need to create the namespace (it doesn't exist, or it did and we waited for it to be deleted)
2. something else creates the namespace
3. we try to create the namespace, and it fails because it already exists
If this is the sequence, I don't think we should return an error.
If this is the sequence, and the namespace somehow is terminating, I think it's ok to return an error. I don't expect this to happen, and I don't want to get into an infinite loop 😄
Handful of minor style & doc comments from reading through the core logic. Moving on to look at units now.
pkg/util/kube/utils_test.go
Outdated
"k8s.io/apimachinery/pkg/runtime/schema"

arktest "github.com/heptio/ark/pkg/util/test"
"github.com/stretchr/testify/assert"
looks like this line needs to move up to the prev section
pkg/restore/restore.go
Outdated
var shouldRestore bool
err := wait.PollImmediate(time.Second, ctx.resourceTerminatingTimeout, func() (bool, error) {
	clusterPV, err := pvClient.Get(name, metav1.GetOptions{})
style nit: rm blank line
pkg/restore/restore.go
Outdated
	shouldRestore = true
	return true, nil
}
style nit: rm blank line
if err != nil {
	return false, errors.Wrapf(err, "could not retrieve in-cluster copy of PV %s", name)
style nit: rm blank line
	return false, nil
}

namespace, err := collections.GetString(clusterPV.UnstructuredContent(), "spec.claimRef.namespace")
pls add a comment explaining why we're getting the NS and querying the PVC at this point
	return false, nil
}

ns, err := ctx.namespaceClient.Get(namespace, metav1.GetOptions{})
probably worth another comment here explaining why we're now querying the NS
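The decision the reviewer wants documented — follow the PV's claimRef to its namespace, and only restore once nothing is still tearing the PV down — can be sketched as a simplified pure function. The struct, field names, and exact decision table are assumptions for illustration; the real code inspects live objects via client.Get inside wait.PollImmediate.

```go
package main

import "fmt"

// clusterState is a hypothetical snapshot of what the polling loop observes.
type clusterState struct {
	pvExists      bool // in-cluster copy of the PV still present
	nsTerminating bool // namespace from spec.claimRef is terminating
}

// shouldRestorePV reports (done, restore): done=false means keep polling,
// restore=true means it is safe to re-create the PV from the snapshot.
func shouldRestorePV(s clusterState) (done, restore bool) {
	if !s.pvExists {
		return true, true // PV is gone; safe to restore from snapshot
	}
	if s.nsTerminating {
		return false, false // still being torn down; continue waiting
	}
	// PV exists and nothing is deleting it; leave the in-cluster copy alone.
	return true, false
}

func main() {
	done, restore := shouldRestorePV(clusterState{pvExists: false})
	fmt.Println(done, restore)
}
```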
pkg/restore/restore.go
Outdated
	// PVC wasn't found, but the PV still exists, so continue to wait.
	return false, nil
}
style nit: rm blank line
var ready bool
err := wait.PollImmediate(time.Second, timeout, func() (bool, error) {
	clusterNS, err := client.Get(namespace.Name, metav1.GetOptions{})
style nit: rm blank line
	// Namespace isn't in cluster, we're good to create.
	return true, nil
}
style nit: rm blank line
@nrb tests look pretty good, I don't see any obvious gaps.
pkg/restore/restore.go
Outdated
@@ -696,10 +798,15 @@ func (ctx *context) restoreResource(resource, namespace, resourcePath string) (a

 	// Check if the PV exists in the cluster before attempting to create
 	// a volume from the snapshot, in order to avoid orphaned volumes (GH #609)
-	_, err := resourceClient.Get(name, metav1.GetOptions{})
+	shouldRestoreSnapshot, err := ctx.waitForPVDeletion(name, resourceClient)
I think it would make the logic more clear if method and variable were both named the same, whichever makes the most sense. Either `waitForPVDeletion` or `shouldRestoreSnapshot`.
I'll change the names to match and push another revision up.
@rbankston confirmed this; are there any other changes that need to be made?
LGTM
Function name change made, and rebased on the Velero rename commits. PTAL and if it's good, I'll squash.
lgtm
@skriss Are you ok with me squashing this for a merge?
If a PV already exists, wait for it, its associated PVC, and associated namespace to be deleted before attempting to restore it. If a namespace already exists, wait for it to be deleted before attempting to restore it. Signed-off-by: Nolan Brubaker <brubakern@vmware.com>
Fixes #691
Signed-off-by: Nolan Brubaker nolan@heptio.com