volume leak during deployment rollout #733

Closed
pohly opened this issue Sep 14, 2020 · 7 comments
Labels: 0.9, bug

Comments

@pohly
Contributor

pohly commented Sep 14, 2020

The controller is designed such that it collects information about volumes from nodes as the nodes register themselves. This implies that the controller cannot know about existing volumes for nodes that haven't registered (yet).

This leads to the following problem:

  • DeleteVolume is called for an existing volume that the controller doesn't know about at the moment.
  • The controller cannot distinguish between "volume already deleted" (idempotency!) and "need to wait for some node with that volume".
  • It assumes that the volume is already gone and returns success without doing anything, after emitting a misleading log message: "Volume pvc-bd-adc62b1395a868c243a74ee138e313a19c72211c5fbc0d5f2706e486 not created by this controller".
  • external-provisioner then removes the PV.

=> volume leak

This problem was triggered by the new version-skew tests, which restart the driver while volumes exist and then perform some operations (including deletion) on them right after the driver deployment comes up again.
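
To make the ambiguity concrete, here is a minimal Go sketch (not the actual PMEM-CSI code) of a controller that only learns about volumes when their node registers; deleteOnNode is an assumed helper:

```go
// Minimal sketch (not the actual PMEM-CSI code) of the problematic flow:
// the controller only knows volumes on nodes that have already registered.
package controller

import (
	"context"
	"log"
	"sync"
)

type Controller struct {
	mu      sync.Mutex
	volumes map[string]string // volume ID -> node name, filled during node registration
}

func (c *Controller) DeleteVolume(ctx context.Context, volumeID string) error {
	c.mu.Lock()
	node, known := c.volumes[volumeID]
	c.mu.Unlock()

	if !known {
		// Ambiguous: either the volume was already deleted (returning
		// success is the correct, idempotent behavior) or its node just
		// hasn't registered yet after a driver restart. Returning success
		// here makes external-provisioner remove the PV while the volume
		// still exists on the node => leak.
		log.Printf("Volume %s not created by this controller", volumeID)
		return nil
	}

	return c.deleteOnNode(ctx, node, volumeID)
}

// deleteOnNode is an assumed helper that forwards the request to the node driver.
func (c *Controller) deleteOnNode(ctx context.Context, node, volumeID string) error {
	log.Printf("deleting volume %s on node %s", volumeID, node)
	return nil
}
```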

@pohly
Contributor Author

pohly commented Sep 14, 2020

The whole registration process is also problematic because of #729. It may also pose a scalability problem (there is a single instance of the controller), although we don't have any evidence of that because we haven't done scale testing.

I'm currently leaning towards solving this problem by deploying the external-provisioner alongside each node driver instance and removing the central controller entirely. This was originally proposed in kubernetes-csi/external-provisioner#367 but wasn't finished.

What was missing in that PR was support for immediate binding. With immediate binding, all external-provisioner instances need to collaboratively figure out which one should create the volume. I proposed leader election for that (kubernetes-csi/external-provisioner#367 (comment)), but that will need further thought.
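
One possible shape for that, sketched here with client-go's leaderelection package (the Lease name, namespace, and NODE_NAME identity are illustrative, and this is not what external-provisioner actually implements):

```go
// Sketch: per-node external-provisioner instances compete for a Lease;
// only the current leader handles volumes with immediate binding.
package main

import (
	"context"
	"log"
	"os"
	"time"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/rest"
	"k8s.io/client-go/tools/leaderelection"
	"k8s.io/client-go/tools/leaderelection/resourcelock"
)

func main() {
	cfg, err := rest.InClusterConfig()
	if err != nil {
		log.Fatal(err)
	}
	client := kubernetes.NewForConfigOrDie(cfg)

	lock := &resourcelock.LeaseLock{
		LeaseMeta: metav1.ObjectMeta{
			Name:      "pmem-csi-immediate-binding", // illustrative name
			Namespace: "pmem-csi",                   // illustrative namespace
		},
		Client: client.CoordinationV1(),
		LockConfig: resourcelock.ResourceLockConfig{
			Identity: os.Getenv("NODE_NAME"), // assumed to be set via the downward API
		},
	}

	leaderelection.RunOrDie(context.Background(), leaderelection.LeaderElectionConfig{
		Lock:          lock,
		LeaseDuration: 15 * time.Second,
		RenewDeadline: 10 * time.Second,
		RetryPeriod:   2 * time.Second,
		Callbacks: leaderelection.LeaderCallbacks{
			OnStartedLeading: func(ctx context.Context) {
				// Start provisioning immediate-binding volumes here.
				log.Print("acting as the provisioner for immediate binding")
				<-ctx.Done()
			},
			OnStoppedLeading: func() {
				log.Print("lost leadership, stop provisioning")
			},
		},
	})
}
```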

@avalluri
Contributor

I think the other, quicker alternative is to persist the controller state (the known volume details) in a config map.

@pohly
Contributor Author

pohly commented Sep 14, 2020

> I think the other, quicker alternative is to persist the controller state (the known volume details) in a config map.

The controller currently doesn't have access to a config map. Adding one would introduce a dependency on Kubernetes into the CO-agnostic part of PMEM-CSI.

Even with a config map, getting this right in all cases will be tricky. Consider the naive approach:

  • controller gets CreateVolume request
  • sends CreateVolume to node
  • records new volume in config map
  • returns information to external-provisioner

If the controller dies at any point during this flow, the volume may leak. A lot of effort went into external-provisioner to prevent such leaks; we would have to duplicate all of that.
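
For illustration, a minimal Go sketch of that naive flow with the crash windows marked; createOnNode and recordInConfigMap are assumed stub helpers, and this is not PMEM-CSI code:

```go
// Sketch of the naive "persist state in a config map" flow, only to show
// where a crash leaks the volume.
package controller

import "context"

type Controller struct{}

// createOnNode is an assumed helper that asks the node driver to create the volume.
func (c *Controller) createOnNode(ctx context.Context, name string, size int64) (string, error) {
	return "pvc-" + name, nil
}

// recordInConfigMap is an assumed helper that persists the volume -> node mapping.
func (c *Controller) recordInConfigMap(ctx context.Context, volumeID, name string) error {
	return nil
}

func (c *Controller) CreateVolume(ctx context.Context, name string, size int64) (string, error) {
	// 1. Create the volume on the selected node.
	volumeID, err := c.createOnNode(ctx, name, size)
	if err != nil {
		return "", err
	}

	// CRASH WINDOW: the volume now exists on the node but is not yet
	// recorded anywhere persistent. If the controller dies here, the
	// restarted controller cannot map the volume to its node => leak.

	// 2. Record the new volume in the config map.
	if err := c.recordInConfigMap(ctx, volumeID, name); err != nil {
		// Failing here has the same effect: the volume exists but is untracked.
		return "", err
	}

	// 3. Return to external-provisioner, which then creates the PV object.
	return volumeID, nil
}
```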

@pohly
Contributor Author

pohly commented Sep 14, 2020

This is a valid alternative solution, but I wouldn't call it quick.

@pohly pohly added the 0.9 label Sep 22, 2020
@pohly pohly mentioned this issue Sep 25, 2020
@pohly
Contributor Author

pohly commented Sep 30, 2020

The upstream work on enabling decentralized deployment of external-provisioner is tracked in kubernetes-csi/external-provisioner#487.

@pohly
Contributor Author

pohly commented Nov 24, 2020

> If the controller dies at any point during this flow, the volume may leak. A lot of effort went into external-provisioner to prevent such leaks; we would have to duplicate all of that.

Not only when it dies: error handling currently also isn't sufficient to prevent volume leaks, see issue #823.

pohly added a commit to pohly/pmem-CSI that referenced this issue Dec 18, 2020
By putting external-provisioner onto each node and letting it
provision volumes directly on the node, we can remove the
controller/node communication part in PMEM-CSI. This solves various
issues in that part (race conditions that led to volume leaks) and
simplifies the deployment (no need for two-way TLS certificates
anymore).

The webhooks check for capacity by discovering the PMEM-CSI node pods
and retrieving metrics data from them via the normal metrics support.

The combination of node drivers from 0.8 with a controller from 0.9 is
harmless (no volume leaked) but can no longer create new
volumes. Existing volumes on the nodes are still usable.

Combining a controller from 0.8 with node drivers from 0.9 is more
problematic because the old controller will cause volume leaks when
volumes are deleted (intel#733).
If this is a problem, then the old StatefulSet can be deleted manually
before upgrading.
pohly added a commit to pohly/pmem-CSI that referenced this issue Jan 14, 2021
By putting external-provisioner onto each node and letting it
provision volumes directly on the node, we can remove the
controller/node communication part in PMEM-CSI. This solves various
issues in that part (race conditions that led to volume leaks) and
simplifies the deployment (no need for two-way TLS certificates
anymore).

The webhooks check for capacity by discovering the PMEM-CSI node pods
and retrieving metrics data from them via the normal metrics support.

The combination of node drivers from 0.8 with a controller from 0.9 is
harmless (no volume leaked) but can no longer create new
volumes. Existing volumes on the nodes are still usable.

Combining a controller from 0.8 with node drivers from 0.9 is more
problematic because the old controller will cause volume leaks when
volumes are deleted (intel#733).
If this is a problem, then the old StatefulSet can be deleted manually
before upgrading.

The operator and tests will be updated in separate commits.
@pohly
Contributor Author

pohly commented Jan 20, 2021

Fixed by PR #838

@pohly pohly closed this as completed Jan 20, 2021