
Wait for namespace to terminate before restoring #826

Merged: 1 commit merged into vmware-tanzu:master from the fix-691 branch on Feb 7, 2019

Conversation

@nrb (Contributor) commented Sep 7, 2018

Fixes #691

Signed-off-by: Nolan Brubaker <nolan@heptio.com>

@nrb nrb added this to the v0.10.0 milestone Sep 7, 2018
@nrb nrb changed the title from "Wait for namespace to terminate before restoring" to "[WIP] Wait for namespace to terminate before restoring" on Sep 7, 2018
go func() {
defer ctx.resourceWaitGroup.Done()

if _, err := waitForNamespaceDelete(namespaceWatch.ResultChan(), mappedNsName, ctx.namespaceTimeout, ctx.logger); err != nil {
Contributor Author:

Is this pattern what I want to be using here? I see there are also the informer event handlers, as seen at https://github.com/heptio/ark/blob/master/pkg/restic/restorer.go#L64. Is that only for the additions, updates, and deletions from the cache?

Contributor:

@nrb it's better to use the informer. If you create your own watch, it can terminate for a variety of reasons, and you might miss an event. While you're not guaranteed to see 100% of events with the informers, they do handle watch terminations and automatically reestablish new watches for you, and you'll get the latest state of the object (or a "deleted final state unknown" meaning it was deleted but it missed the deletion event).
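
A minimal sketch of that pattern, assuming a namespace informer from a shared informer factory (k8s.io/client-go/tools/cache and k8s.io/api/core/v1); notifyNamespaceDeleted is a hypothetical callback for whoever is waiting on the deletion:

nsInformer := kubeSharedInformerFactory.Core().V1().Namespaces().Informer()
nsInformer.AddEventHandler(cache.ResourceEventHandlerFuncs{
    DeleteFunc: func(obj interface{}) {
        ns, ok := obj.(*corev1.Namespace)
        if !ok {
            // The informer missed the actual delete event and handed us a tombstone.
            tombstone, ok := obj.(cache.DeletedFinalStateUnknown)
            if !ok {
                return
            }
            if ns, ok = tombstone.Obj.(*corev1.Namespace); !ok {
                return
            }
        }
        // Signal whoever is waiting on this namespace's deletion (hypothetical helper).
        notifyNamespaceDeleted(ns.Name)
    },
})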

Contributor Author:

Ok, thanks - I'll rework this to use an informer DeleteFunc.

pkg/cmd/server/server.go (outdated conversation, resolved)
@@ -696,6 +797,8 @@ func (ctx *context) restoreResource(resource, namespace, resourcePath string) (a
go func() {
defer ctx.resourceWaitGroup.Done()

// TODO(nrb): This is currently conflicting with the namespace termination waits. If the namespace becomes
// ready to restore, everything in the NS seems to be restored but the PV never gets created and the PVC is "lost"
Contributor Author:

Thoughts on handling this? I'm going to keep testing and see if a new PV is made then deleted during the termination wait, but that may be a dead end.

Member:

I don't really see why this would affect PVs at first glance - let me see if I can test it and figure out what's going on

Contributor Author:

I believe it's because the PV timeout is 1 minute, whereas I was using 2-5 minute namespace timeouts. The PV restore attempt would time out before the namespace termination did, and thus abandon the restore. I did try using the same ctx.namespaceTimeout value for both, but I didn't see any improvement.

I think upping the timeout on PV restoration would cause an issue with mapping the PVC, but I'm not positive.

Member:

OK, here's what I think is happening:

  1. namespace foo has a PVC claiming PV bar
  2. namespace foo is deleted and moves to Terminating state
  3. user executes ark restore create which starts waiting on foo to be deleted
  4. foo finishes deleting, at which point the PVC claiming PV bar is gone, and PV bar is released
  5. Ark immediately restores resources in NS foo, then moves on to trying to restore the PV
  6. The dynamic provisioner that deletes released PVs hasn't yet deleted the existing PV
  7. Ark sees that it's trying to restore a resource that already exists, and can't, so it moves on
  8. The dynamic provisioner deletes the PV
  9. End result: Ark restores the PVC in namespace foo, but the PV never gets restored

Member:

We'd need to also wait on PVs in the Released state with a reclaim policy of Delete to disappear before trying to restore them if we want to handle this correctly
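
In terms of the core v1 fields, the check being described is roughly the following (a sketch; the actual restore code works with unstructured objects rather than typed ones):

// A PV that is Released and has a Delete reclaim policy will eventually be
// removed by the dynamic provisioner, so restoring it immediately would
// collide with the existing object; wait for it to disappear first.
func shouldWaitForPVDeletion(pv *corev1.PersistentVolume) bool {
    return pv.Status.Phase == corev1.VolumeReleased &&
        pv.Spec.PersistentVolumeReclaimPolicy == corev1.PersistentVolumeReclaimDelete
}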

Contributor Author:

So you're saying this would be a change to the waitForReady function? Such as how we determine the value of isPVReady?

Member (@skriss, Oct 1, 2018):

I'm saying we'd need a pvWaiter that does basically the same thing as the namespaceWaiter, but for pvs - so when you get to restoring a PV, if it's in the released state, wait for it to be deleted before trying to restore it

Contributor Author:

Thanks!

Member:

Can remove this comment now

@nrb (Contributor Author) commented Oct 2, 2018

In manual testing, this appears to work in that it waits for both persistent volumes and namespaces to be fully deleted.

To simulate deletion and timeouts during testing, I added a fake finalizer to pods (ark.heptio.com/fake), deleted the namespace, started a restore, then removed the finalizer via kubectl edit or kept it there to see if the timeout was reached.

I'm going to continue working on the tests since a few things did change that need to be added.

@nrb nrb changed the title from "[WIP] Wait for namespace to terminate before restoring" to "Wait for namespace to terminate before restoring" on Oct 4, 2018
@nrb (Contributor Author) commented Oct 4, 2018

Tests are updated. I'm not explicitly testing the DeleteFunc handlers on the informers. Do we want to do that? It looks like it'll be a lot more scaffolding to get them set up, somewhat similar to the backup_controller_test.go tests that interact directly with the Informer store.

@nrb nrb self-assigned this Oct 4, 2018
@nrb (Contributor Author) commented Oct 4, 2018

@wwitzel3 @skriss I think this is ready for a more thorough review.

s.logger,
)
cmd.CheckError(err)

// Start the informat and wait for cache sync here so that the event handlers added in NewKubernetesRestorer are ready by the time
Contributor:

typo: informat

}, nil
}

namespaceInformer.Informer().AddEventHandler(
Contributor Author:

Upon some reflection, I'd like to have test coverage for these two event handlers, but the two solutions I came up with to do so seemed kind of heavyweight considering what they do.

  • Mock the namespace and PV informers, such that the tests will add/remove/change objects so the handlers will be triggered. This is more like what our code currently does, but is likely much more scaffolding.
  • Instead of doing anonymous functions with closures over kr, make a factory function that will wrap kr and return a handler function. This means all we'd need to create to get handlers to test would be a KubernetesRestorer, and the functions can be called directly with dummy objects. This is a little further removed from the actual code, though.

@skriss (Member) commented Oct 16, 2018

@nrb sorry, I didn't make any progress on this while you were out.

@nrb (Contributor Author) commented Oct 16, 2018

np - I'll get the tests more fleshed out this week. Would appreciate a review, too.

restoreResourcePriorities []string
restoreOnly bool
pluginDir, metricsAddress, defaultBackupLocation string
backupSyncPeriod, podVolumeOperationTimeout, namespaceTimeout time.Duration
Member:

since we're using this value for PV waiting as well, maybe we want to call it something more generic, like terminatingResourceTimeout or something along those lines

@@ -144,6 +146,7 @@ func NewCommand() *cobra.Command {
command.Flags().BoolVar(&config.restoreOnly, "restore-only", config.restoreOnly, "run in a mode where only restores are allowed; backups, schedules, and garbage-collection are all disabled")
command.Flags().StringSliceVar(&config.restoreResourcePriorities, "restore-resource-priorities", config.restoreResourcePriorities, "desired order of resource restores; any resource not in the list will be restored alphabetically after the prioritized resources")
command.Flags().StringVar(&config.defaultBackupLocation, "default-backup-storage-location", config.defaultBackupLocation, "name of the default backup storage location")
command.Flags().DurationVar(&config.namespaceTimeout, "namespace-timeout", config.namespaceTimeout, "duration to wait on namespace termination before failing a restore")
Member:

genericize flag name and text since being used for PVs as well

s.logger.Debugln("Starting kube shared informer factory")
go s.kubeSharedInformerFactory.Start(ctx.Done())
s.logger.Debugln("Waiting for kubesharedinformer to sync")
cache.WaitForCacheSync(ctx.Done(), s.kubeSharedInformerFactory.Core().V1().Namespaces().Informer().HasSynced)
Member:

check return val here

Contributor Author:

I notice we don't check this elsewhere in the server - should I add that? Is it a total failure mode if it returns false?
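
For reference, acting on the return value would look roughly like this (a sketch; whether a failed sync should be fatal is the open question above):

if !cache.WaitForCacheSync(ctx.Done(), s.kubeSharedInformerFactory.Core().V1().Namespaces().Informer().HasSynced) {
    // The cache never synced before shutdown was signaled; the event
    // handlers registered on this informer can't be relied on.
    return errors.New("timed out waiting for namespace informer cache to sync")
}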

return
}

updateChan, ok := kr.pvWaiter.getDeleteChan(pv.Name)
Member:

s/updateChan/deleteChan


// pvWaiter encapsulates waiting for a PV with a reclaim policy of 'Delete' to exit the 'Released' state.
// This avoids the case where a PV that has been released but not yet deleted never gets restored.
type pvWaiter struct {
Member:

I think we could probably eliminate the need for a second PV-specific waiter type. Within the waiter struct, we really only need the ObjectMeta; it doesn't really matter what specific type it is. I'd probably use the metav1.Object interface - so if you change the channel type to use that, then I think you can consolidate to a single resourceWaiter struct.
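
A minimal sketch of that consolidation, with the channel carrying metav1.Object (method names mirror the existing addDeleteChan/waitForDelete; note this informer-plus-channel approach is later replaced by client polling further down in the review):

// resourceWaiter lets a restore wait for a named resource to be deleted,
// regardless of whether it's a namespace or a PV.
type resourceWaiter struct {
    mu          sync.Mutex
    deleteChans map[string]chan metav1.Object
}

func (rw *resourceWaiter) addDeleteChan(name string) {
    rw.mu.Lock()
    defer rw.mu.Unlock()
    if _, ok := rw.deleteChans[name]; !ok {
        rw.deleteChans[name] = make(chan metav1.Object, 1)
    }
}

func (rw *resourceWaiter) waitForDelete(name string, timeout time.Duration) (metav1.Object, error) {
    rw.mu.Lock()
    ch := rw.deleteChans[name]
    rw.mu.Unlock()

    select {
    case obj := <-ch:
        // The informer's DeleteFunc sent the deleted object (or tombstone contents).
        return obj, nil
    case <-time.After(timeout):
        return nil, errors.Errorf("timed out waiting for %s to be deleted", name)
    }
}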

Contributor Author:

Ok, that makes sense.

@@ -696,6 +797,8 @@ func (ctx *context) restoreResource(resource, namespace, resourcePath string) (a
go func() {
defer ctx.resourceWaitGroup.Done()

// TODO(nrb): This is currently conflicting with the namespace termination waits. If the namespace becomes
// ready to restore, everything in the NS seems to be restored but the PV never gets created and the PVC is "lost"
Member:

Can remove this comment now

logger.Infof("Namespace still terminating, waiting")

ctx.nsWaiter.addDeleteChan(mappedNsName)
_, err = ctx.nsWaiter.waitForDelete(mappedNsName, ctx.namespaceTimeout)
Member:

I think we probably need to check again here to see if it's been terminated, since it's possible it terminated between our first check and adding the delete channel for it. Only if it still hasn't been terminated do we then need to wait for the delete.

if isPVReleasedOrBound(fromCluster) {
ctx.log.Infof("Persistent volume %q still in use, waiting for deletion", name)
ctx.pvWaiter.addDeleteChan(name)
_, err := ctx.pvWaiter.waitForDelete(name, ctx.namespaceTimeout)
Member:

probably need to do a similar double-check on whether the PV's been deleted here, since it could've been deleted between the first check and adding the channel

@@ -455,11 +619,36 @@ func (ctx *context) restoreFromDir(dir string) (api.RestoreResult, api.RestoreRe
if !existingNamespaces.Has(mappedNsName) {
logger := ctx.log.WithField("namespace", nsName)
ns := getNamespace(logger, filepath.Join(dir, api.ResourcesDir, "namespaces", api.ClusterScopedDir, nsName+".json"), mappedNsName)
if _, err := kube.EnsureNamespaceExists(ns, ctx.namespaceClient); err != nil {
nsIsReady, err := kube.EnsureNamespaceExistsAndIsReady(ns, ctx.namespaceClient)
Contributor:

I'm wondering if the logic here (pseudocode) should be:

  1. Try to get the namespace
  2. If it exists and is not terminating, add to existingNamespaces and move on to the next step
  3. Otherwise
    1. If it doesn't exist, create it
    2. If it does exist and is terminating, wait for it to no longer exist, up to a timeout. The easiest initial implementation would be to poll with a real client once per second. Given that we aren't doing restores in goroutines and instead are waiting synchronously, I think this logic would be an acceptable tradeoff vs the shared informer/events approach, which makes things much more confusing. WDYT?
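
A sketch of step 3.ii using wait.PollImmediate from k8s.io/apimachinery/pkg/util/wait (variable names here are illustrative; the timeout would come from whatever flag the server exposes):

// Synchronously poll once per second, up to a timeout, for the terminating
// namespace to disappear before recreating it.
err := wait.PollImmediate(time.Second, resourceTerminatingTimeout, func() (bool, error) {
    _, err := namespaceClient.Get(nsName, metav1.GetOptions{})
    if apierrors.IsNotFound(err) {
        // Namespace is gone; safe to recreate it.
        return true, nil
    }
    if err != nil {
        return false, errors.Wrapf(err, "error getting namespace %s", nsName)
    }
    // Still terminating; check again on the next poll.
    return false, nil
})
// A non-nil err here is either wait.ErrWaitTimeout or the Get error above.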

Contributor Author:

I'll give that a try - I assume we'd have to do the same thing for PVs, then? I ran into a race condition when I only looked at the NS. (See #826 (comment))

Contributor:

Yes, we'll have to wait for anything that's in a terminal state to exit that state (which presumably means "be deleted") before we can proceed restoring that item.

Contributor Author:

@ncdc I pushed an update to this using clients and polling. Mind looking over it? I think I've covered the big use cases, though I'm not sure if we should fail the restore if we get some error besides not found when doing our checks each tick.

Contributor:

👀

Contributor:

@nrb I would still like to do what I suggested above in steps 1-3. In other words, let's not use kube.EnsureNamespaceExistsAndIsReady or let's modify it to work with these steps. WDYT?

Contributor Author:

So I understood EnsureNamespaceExistsAndIsReady was doing 1-2 and 3.i; I see I missed putting it into existingNamespaces. I'll move that logic out and do it all here so it's all present.

Contributor:

Nothing is doing step 1 yet.

@nrb nrb force-pushed the fix-691 branch 2 times, most recently from 92a5ebb to 82c475c Compare November 8, 2018 21:06
@@ -152,6 +154,7 @@ func NewCommand() *cobra.Command {
command.Flags().StringSliceVar(&config.restoreResourcePriorities, "restore-resource-priorities", config.restoreResourcePriorities, "desired order of resource restores; any resource not in the list will be restored alphabetically after the prioritized resources")
command.Flags().StringVar(&config.defaultBackupLocation, "default-backup-storage-location", config.defaultBackupLocation, "name of the default backup storage location")
command.Flags().Var(&volumeSnapshotLocations, "default-volume-snapshot-locations", "list of unique volume providers and default volume snapshot location (provider1:location-01,provider2:location-02,...)")
command.Flags().DurationVar(&config.resourceTerminatingTimeout, "resource-timeout", config.resourceTerminatingTimeout, "duration to wait on resource termination before failing a restore")
Contributor:

I think --resource-timeout isn't 100% clear what it's for - we should adjust

@@ -455,11 +619,36 @@ func (ctx *context) restoreFromDir(dir string) (api.RestoreResult, api.RestoreRe
if !existingNamespaces.Has(mappedNsName) {
logger := ctx.log.WithField("namespace", nsName)
ns := getNamespace(logger, filepath.Join(dir, api.ResourcesDir, "namespaces", api.ClusterScopedDir, nsName+".json"), mappedNsName)
if _, err := kube.EnsureNamespaceExists(ns, ctx.namespaceClient); err != nil {
nsIsReady, err := kube.EnsureNamespaceExistsAndIsReady(ns, ctx.namespaceClient)

addArkError(&errs, err)
continue
}

if !nsIsReady {
timeout := time.After(ctx.resourceTerminatingTimeout)
tick := time.Tick(1 * time.Second)
Contributor:

Need a deferred stop of tick I believe

Contributor Author:

Looks like tick is just the channel, not a Ticker. I'll redo it so we have a Ticker that we can actually stop.
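
Right - time.Tick returns only the channel, and the Ticker behind it can never be stopped, so it leaks for the life of the process. The swap is roughly:

// before: nothing can ever stop the ticker behind this channel
tick := time.Tick(1 * time.Second)

// after: a Ticker we can stop once the wait is over
ticker := time.NewTicker(1 * time.Second)
defer ticker.Stop()
// ...and select on ticker.C instead of tick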

for proceed != true {
select {
case <-timeout:
err := errors.Errorf("Timed out waiting on NS %s to be ready", ns.Name)
Contributor:

NS->namespace

addArkError(&errs, err)
return warnings, errs
case <-tick:
nsIsReady, err = kube.EnsureNamespaceExistsAndIsReady(ns, ctx.namespaceClient)
Contributor:

I don't want to re-call kube.EnsureNamespaceExistsAndIsReady here. Let's implement my suggestion above unless you have objections?

ticker := time.NewTicker(time.Second)
defer ticker.Stop()
loop:
for {
Contributor Author:

@skriss PTAL. I'm not feeling super great about the for/select combo, but maybe it's my unfamiliarity with the Go channels mechanisms.

2 things I'm noticing with this and the namespace loop (L513):

  1. Need some way to honor ctrl-c or other graceful shutdown. Should we hook it into signals.CancelOnShutdown somehow?
  2. Do we need to find some way to mark the restore as Failed for the NS/PV timeout cases? Or is completed with errors ok? For reference, this is what a restore that hits the timeout looks like currently:
x1c in /home/nrb/go/src/github.com/heptio/ark (git) fix-691 U
% ark restore describe voltest-20181109132929
Name:         voltest-20181109132929
Namespace:    heptio-ark
Labels:       <none>
Annotations:  <none>

Backup:  voltest

Namespaces:
  Included:  *
  Excluded:  <none>

Resources:
  Included:        *
  Excluded:        nodes, events, events.events.k8s.io, backups.ark.heptio.com, restores.ark.heptio.com
  Cluster-scoped:  auto

Namespace mappings:  <none>

Label selector:  <none>

Restore PVs:  auto

Phase:  Completed

Validation errors:  <none>

Warnings:
  Ark:        <none>
  Cluster:    <none>
  Namespaces:
    nginx-example:  not restored: persistentvolumeclaims "nginx-logs" already exists and is different from backed up version.
                    not restored: pods "nginx-deployment-f9d69f8fd-c6k6g" already exists and is different from backed up version.
                    not restored: deployments.apps "nginx-deployment" already exists and is different from backed up version.
                    not restored: endpoints "my-nginx" already exists and is different from backed up version.
                    not restored: replicasets.apps "nginx-deployment-f9d69f8fd" already exists and is different from backed up version.
                    not restored: services "my-nginx" already exists and is different from backed up version.

Errors:
  Ark:      Timed out waiting on PV pvc-98f7668e-e390-11e8-9753-42010a96003b to be deleted
  Cluster:    <none>
  Namespaces: <none>

Member:

For #2, completed with errors is what we'd expect.

Looking at #1.

Member:

For #1, I'm not sure that we actually want to do this - would like to think some more and maybe @ncdc will find time to provide input. We don't currently have any code that will stop a backup or restore in the middle if a shutdown signal is received - we try to finish the backup/restore, then the controllers/server exit. We'd want to be able to properly mark the restore's status as failed/interrupted, which would take more code that we don't currently have.

If we were going to implement it, the most straightforward way looks to be in server.go, around L590: pass ctx as a new arg to the restore.NewKubernetesRestorer func. Have that func set it as a field on the kubernetesRestorer. In the Restore method on the kubernetesRestorer, pass that ctx down to the restoreCtx that's created (probably as a field). Then, in your select statements that are doing the polling, add an additional case that receives on ctx.Done() and exits -- though we'd probably want to end up marking restores that were exited early like this as Failed.
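
The extra case would look roughly like this (a sketch; serverCtx stands in for the server's context plumbed down as described, named to avoid clashing with the restore's own ctx struct):

select {
case <-serverCtx.Done():
    // Server is shutting down: stop waiting so the restore can be marked
    // failed/interrupted instead of blocking until the timeout.
    return errors.New("restore interrupted by server shutdown")
case <-timeout.C:
    return errors.Errorf("timed out waiting on namespace %s to be ready", mappedNsName)
case <-ticker.C:
    // re-check whether the namespace has finished terminating
}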

Contributor Author (@nrb, Nov 9, 2018):

Thanks. That makes sense. My use case was lowering the timeout for testing (default of 10m, but wanted to drop to 1 or 2), so doing a kill -9 on it I guess isn't too terrible.

I can leave that part as is and address the remaining comments and update tests.

@ncdc (Contributor) commented Nov 9, 2018 via email

Member (@skriss) left a comment:

@nrb added some comments on the core restore code for namespaces, which should similarly apply for the PV code.

if clusterNS.Status.Phase == v1.NamespaceActive {
// keep track of namespaces that we know exist so we don't
// have to try to create them multiple times
existingNamespaces.Insert(mappedNsName)
Member:

I would move the insertion into existingNamespaces down below your for loop; it won't need to be within an if statement there - we'll always execute it if we get there. See next comment as well.

}

// Namespace exists but is terminating
if clusterNS.Status.Phase == v1.NamespaceTerminating {
Member:

this needs to be if clusterNS != nil && ... since we'll get here if the cluster didn't have the namespace, and we just created it.


// Namespace exists but is terminating
if clusterNS.Status.Phase == v1.NamespaceTerminating {
timeout := time.After(ctx.resourceTerminatingTimeout)
Member:

I'd use a time.NewTimer here (for the timeout) and defer timeout.Stop() just after

timeout := time.After(ctx.resourceTerminatingTimeout)
ticker := time.NewTicker(time.Second)
defer ticker.Stop()
loop:
Member:

I'd probably move L510-534 into a helper, something like func createNamespaceAfterTerminate(...) error, and call it here. If it returns an error, addArkError and continue.
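
A sketch of that helper's shape, folding in the NewTimer suggestion above (the exact signature is illustrative; the name matches the one suggested):

// createNamespaceAfterTerminate waits for a terminating namespace to
// disappear, then recreates it. It returns an error on timeout or on any
// unexpected API error, which the caller records with addArkError.
func createNamespaceAfterTerminate(client corev1client.NamespaceInterface, ns *v1.Namespace, timeout time.Duration) error {
    timer := time.NewTimer(timeout)
    defer timer.Stop()
    ticker := time.NewTicker(time.Second)
    defer ticker.Stop()

    for {
        select {
        case <-timer.C:
            return errors.Errorf("timed out waiting on namespace %s to be ready", ns.Name)
        case <-ticker.C:
            _, err := client.Get(ns.Name, metav1.GetOptions{})
            if apierrors.IsNotFound(err) {
                // Old namespace is finally gone; recreate it.
                _, err = client.Create(ns)
                return err
            }
            if err != nil {
                return errors.Wrapf(err, "error getting namespace %s", ns.Name)
            }
            // Still terminating; wait for the next tick.
        }
    }
}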

for {
select {
// TODO: Need a way to break on process exit here; ctrl-c isn't working when we're in this loop
case <-timeout:
Member:

case <-timeout.C:

select {
// TODO: Need a way to break on process exit here; ctrl-c isn't working when we're in this loop
case <-timeout:
err := errors.Errorf("Timed out waiting on namespace %s to be ready", mappedNsName)
Member:

if you move this block to a helper, then you can just return this error

if apierrors.IsAlreadyExists(err) && c.Status.Phase == v1.NamespaceTerminating {
logger.Debugf("Namespace %s still terminating, waiting", mappedNsName)
} else if err != nil && !apierrors.IsAlreadyExists(err) {
addArkError(&warnings, err)
Member:

if you move this block to a helper, then you can just return this error

Contributor Author:

It can be returned, but it's going to warnings; would you suggest all the errors returned by the helper go into &errs?

Member:

Hmm, I think it makes sense to record this as an error, so yeah.

break loop
} else {
logger.Infof("Termination successful, namespace %s restored", mappedNsName)
break loop
Member:

if you move this block to a helper, then you can return nil here

if apierrors.IsNotFound(err) {
pvLogger.Debugf("namespace %s for PV not found, waiting", namespace)
// namespace not found but the PV still exists, so continue to wait
return true, nil
Contributor:

return false?

return true, nil
})

end := time.Now()
Contributor:

Check the err that wait.PollImmediate returns instead of comparing start/end times.
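
i.e., roughly (a sketch, where checkPV stands in for the existing condition func):

err := wait.PollImmediate(time.Second, ctx.resourceTerminatingTimeout, checkPV)
if err == wait.ErrWaitTimeout {
    // The PV (or its PVC/namespace) never went away in time.
    return false, errors.Errorf("timed out waiting on PV %s to be deleted", name)
}
if err != nil {
    return false, err
}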

Contributor Author:

I think I can just return the err that wait.PollImmediate uses, unless I want to add a debug line for timeout. It will be added to the result at L810 in this version of the PR.


end := time.Now()

if !end.Before(timeout) {
Contributor:

remove

clusterNS, err := client.Get(namespace.Name, metav1.GetOptions{})

if apierrors.IsNotFound(err) {
// Namespace isn't in cluster, we're good to restore.
Contributor:

Suggested change
// Namespace isn't in cluster, we're good to restore.
// Namespace isn't in cluster, we're good to create.

})

if err != nil {
return false, errors.Wrapf(err, "err getting namespace %s", namespace.Name)
Contributor:

Suggested change
return false, errors.Wrapf(err, "err getting namespace %s", namespace.Name)
return false, errors.Wrapf(err, "error getting namespace %s", namespace.Name)

return false, errors.Wrapf(err, "err getting namespace %s", namespace.Name)
}

// shouldCreate is true *only* if we got a 404 during polling, and we should only be here if we didn't get an error, which means the timeout was hit
Contributor:

This is confusing - could you please reword?

return true, nil
}

// if we get an AlreadyExists here, something else created the NS after all of our polling...
Contributor:

Suggested change
// if we get an AlreadyExists here, something else created the NS after all of our polling...
// if we get any error here, something else created the NS after all of our polling...

Member (@skriss) left a comment:

A few comments, but generally this looks good and matches what was discussed last week on Slack.

pkg/cmd/server/server.go (outdated conversation, resolved)
if ready {
return true, nil
}

if _, err := client.Create(namespace); err == nil {
Member:

let's switch this to handle the non-nil error case first

pkg/restore/restore.go (two outdated conversations, resolved)
@skriss skriss assigned nrb and unassigned skriss Dec 11, 2018
@@ -168,6 +171,7 @@ func NewCommand() *cobra.Command {
command.Flags().Float32Var(&config.clientQPS, "client-qps", config.clientQPS, "maximum number of requests per second by the server to the Kubernetes API once the burst limit has been reached")
command.Flags().IntVar(&config.clientBurst, "client-burst", config.clientBurst, "maximum number of requests by the server to the Kubernetes API in a short period of time")
command.Flags().StringVar(&config.profilerAddress, "profiler-address", config.profilerAddress, "the address to expose the pprof profiler")
command.Flags().DurationVar(&config.resourceTerminatingTimeout, "resource-terminate-timeout", config.resourceTerminatingTimeout, "how long to wait on persistent volumes and namespaces to terminate during a restore before timing out")
Contributor:

Let's maybe call the flag terminating-resource-timeout?

pluginRegistry: pluginRegistry,
pluginManager: pluginManager,
config: config,
ctx: ctx,
Contributor:

I don't think these whitespace changes should be in here. Make sure you're using the latest gofmt/goimports?

pkg/restore/restore.go (outdated conversation, resolved)
@@ -30,9 +30,11 @@ import (
"sync"
"time"

"k8s.io/apimachinery/pkg/util/wait"
Contributor:

This can be grouped with the rest of the imports below.

pkg/restore/restore.go (two outdated conversations, resolved)
}

if _, err := client.Create(namespace); err != nil {
// if we get any error here, something else created the NS after all of our polling
Contributor:

Do you want to check for an already-exists error?

Contributor Author:

So I think we can, but does that mean we should check the phase and deletion timestamp again? Do we care if something else made the namespace before we got to it?

Contributor:

No, I don't think so. The reason I suggested it was to handle

  1. we decide we need to create the namespace (it doesn't exist, or it did and we waited for it to be deleted)
  2. something else creates the namespace
  3. we try to create the namespace, and it fails because it already exists

If this is the sequence, I don't think we should return an error.

If this is the sequence, and the namespace somehow is terminating, I think it's ok to return an error. I don't expect this to happen, and I don't want to get into an infinite loop 😄
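
In code, the tolerant create would be roughly (a sketch, matching the surrounding return-bool-plus-error shape):

if _, err := client.Create(namespace); err != nil {
    if apierrors.IsAlreadyExists(err) {
        // Something else created the namespace between our polling and this
        // create; treat that as success rather than failing the restore.
        return true, nil
    }
    return false, errors.Wrapf(err, "error creating namespace %s", namespace.Name)
}
return true, nil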

pkg/restore/restore.go (outdated conversation, resolved)
Member (@skriss) left a comment:

A handful of minor style & doc comments from reading through the core logic. Moving on to look at the unit tests now.

"k8s.io/apimachinery/pkg/runtime/schema"

arktest "github.com/heptio/ark/pkg/util/test"
"github.com/stretchr/testify/assert"
Member:

looks like this line needs to move up to the prev section

var shouldRestore bool
err := wait.PollImmediate(time.Second, ctx.resourceTerminatingTimeout, func() (bool, error) {
clusterPV, err := pvClient.Get(name, metav1.GetOptions{})

Member:

style nit: rm blank line

shouldRestore = true
return true, nil
}

Member:

style nit: rm blank line


if err != nil {
return false, errors.Wrapf(err, "could not retrieve in-cluster copy of PV %s", name)

Member:

style nit: rm blank line

return false, nil
}

namespace, err := collections.GetString(clusterPV.UnstructuredContent(), "spec.claimRef.namespace")
Member:

pls add a comment explaining why we're getting the NS and querying the PVC at this point

return false, nil
}

ns, err := ctx.namespaceClient.Get(namespace, metav1.GetOptions{})
Member:

probably worth another comment here explaining why we're now querying the NS

// PVC wasn't found, but the PV still exists, so continue to wait.
return false, nil
}

Member:

style nit: rm blank line

var ready bool
err := wait.PollImmediate(time.Second, timeout, func() (bool, error) {
clusterNS, err := client.Get(namespace.Name, metav1.GetOptions{})

Member:

style nit: rm blank line

// Namespace isn't in cluster, we're good to create.
return true, nil
}

Member:

style nit: rm blank line

@skriss (Member) commented Jan 11, 2019

@nrb tests look pretty good, I don't see any obvious gaps.

@@ -696,10 +798,15 @@ func (ctx *context) restoreResource(resource, namespace, resourcePath string) (a

// Check if the PV exists in the cluster before attempting to create
// a volume from the snapshot, in order to avoid orphaned volumes (GH #609)
_, err := resourceClient.Get(name, metav1.GetOptions{})
shouldRestoreSnapshot, err := ctx.waitForPVDeletion(name, resourceClient)
Contributor:

I think it would make the logic more clear if method and variable were both named the same, whichever makes the most sense. Either waitForPVDeletion or shouldRestoreSnapshot.

Contributor Author:

I'm not sure about naming the method and variable the same name; @skriss or @ncdc thoughts?

Contributor Author:

I'll change the names to match and push another revision up.

@nrb (Contributor Author) commented Jan 24, 2019

@rbankston confirmed this; are there any other changes that need to be made?

Member (@skriss) left a comment:

LGTM

@nrb (Contributor Author) commented Feb 5, 2019

Function name change made, and rebased on the Velero rename commits. PTAL and if it's good, I'll squash.

Contributor (@carlisia) left a comment:

lgtm

@nrb (Contributor Author) commented Feb 6, 2019

@skriss Are you ok with me squashing this for a merge?

If a PV already exists, wait for it, its associated PVC, and its associated
namespace to be deleted before attempting to restore it.

If a namespace already exists, wait for it to be deleted before
attempting to restore it.

Signed-off-by: Nolan Brubaker <brubakern@vmware.com>
@skriss skriss merged commit 4583aa7 into vmware-tanzu:master Feb 7, 2019