volume leak during deployment rollout #733

Closed
pohly opened this issue Sep 14, 2020 · 7 comments
Labels: 0.9, bug

Comments

@pohly
Contributor

pohly commented Sep 14, 2020

The controller is designed such that it collects information about volumes from nodes as the nodes register themselves. This implies that the controller cannot know about existing volumes for nodes that haven't registered (yet).

This leads to the following problem:

  • DeleteVolume is called for an existing volume that the controller doesn't know about at the moment.
  • The controller cannot distinguish between "volume already deleted" (idempotency!) and "need to wait for some node with that volume".
  • It assumes that the volume is already gone and returns success without doing anything, after emitting a misleading log message: "Volume pvc-bd-adc62b1395a868c243a74ee138e313a19c72211c5fbc0d5f2706e486 not created by this controller".
  • external-provisioner then removes the PV.

=> volume leak

This problem was triggered by the new version-skew tests, which restart the driver while volumes exist and then perform some operations (including deletion) on them right after the driver deployment comes up again.
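
To make the ambiguity concrete, here is a minimal Go sketch (not the actual PMEM-CSI code) of a controller that only learns about volumes when their node registers; deleteOnNode is an assumed helper:

```go
// Minimal sketch (not the actual PMEM-CSI code) of the problematic flow:
// the controller only knows volumes on nodes that have already registered.
package controller

import (
	"context"
	"log"
	"sync"
)

type Controller struct {
	mu      sync.Mutex
	volumes map[string]string // volume ID -> node name, filled during node registration
}

func (c *Controller) DeleteVolume(ctx context.Context, volumeID string) error {
	c.mu.Lock()
	node, known := c.volumes[volumeID]
	c.mu.Unlock()

	if !known {
		// Ambiguous: either the volume was already deleted (returning
		// success is the correct, idempotent behavior) or its node just
		// hasn't registered yet after a driver restart. Returning success
		// here makes external-provisioner remove the PV while the volume
		// still exists on the node => leak.
		log.Printf("Volume %s not created by this controller", volumeID)
		return nil
	}

	return c.deleteOnNode(ctx, node, volumeID)
}

// deleteOnNode is an assumed helper that forwards the request to the node driver.
func (c *Controller) deleteOnNode(ctx context.Context, node, volumeID string) error {
	log.Printf("deleting volume %s on node %s", volumeID, node)
	return nil
}
```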

@pohly
Contributor Author

pohly commented Sep 14, 2020

The whole registration process is also problematic because of #729. It may also pose a scalability problem (there is a single instance of the controller), although we don't have any evidence of that because we haven't done scale testing.

I'm currently leaning towards solving this problem by deploying the external-provisioner alongside each node driver instance and removing the central controller entirely. This was originally proposed in kubernetes-csi/external-provisioner#367 but wasn't finished.

What was missing in that PR was support for immediate binding. With immediate binding, all external-provisioner instances need to collaboratively figure out which one should create the volume. I proposed leader election for that (kubernetes-csi/external-provisioner#367 (comment)), but that will need further thought.
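
One possible shape for that, sketched here with client-go's leaderelection package (the Lease name, namespace, and NODE_NAME identity are illustrative, and this is not what external-provisioner actually implements):

```go
// Sketch: per-node external-provisioner instances compete for a Lease;
// only the current leader handles volumes with immediate binding.
package main

import (
	"context"
	"log"
	"os"
	"time"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/rest"
	"k8s.io/client-go/tools/leaderelection"
	"k8s.io/client-go/tools/leaderelection/resourcelock"
)

func main() {
	cfg, err := rest.InClusterConfig()
	if err != nil {
		log.Fatal(err)
	}
	client := kubernetes.NewForConfigOrDie(cfg)

	lock := &resourcelock.LeaseLock{
		LeaseMeta: metav1.ObjectMeta{
			Name:      "pmem-csi-immediate-binding", // illustrative name
			Namespace: "pmem-csi",                   // illustrative namespace
		},
		Client: client.CoordinationV1(),
		LockConfig: resourcelock.ResourceLockConfig{
			Identity: os.Getenv("NODE_NAME"), // assumed to be set via the downward API
		},
	}

	leaderelection.RunOrDie(context.Background(), leaderelection.LeaderElectionConfig{
		Lock:          lock,
		LeaseDuration: 15 * time.Second,
		RenewDeadline: 10 * time.Second,
		RetryPeriod:   2 * time.Second,
		Callbacks: leaderelection.LeaderCallbacks{
			OnStartedLeading: func(ctx context.Context) {
				// Start provisioning immediate-binding volumes here.
				log.Print("acting as the provisioner for immediate binding")
				<-ctx.Done()
			},
			OnStoppedLeading: func() {
				log.Print("lost leadership, stop provisioning")
			},
		},
	})
}
```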

@avalluri
Contributor

I think the other, quicker alternative is to persist the controller state (the known volume details) in a config map.

@pohly
Contributor Author

pohly commented Sep 14, 2020

> I think the other, quicker alternative is to persist the controller state (the known volume details) in a config map.

The controller currently doesn't have access to a config map. Adding one would introduce a dependency on Kubernetes into the CO-agnostic part of PMEM-CSI.

Even with a config map, getting this right in all cases will be tricky. Consider the naive approach:

  • controller gets CreateVolume request
  • sends CreateVolume to node
  • records new volume in config map
  • returns information to external-provisioner

If the controller dies at any point during this flow, the volume may leak. A lot of effort went into external-provisioner to prevent such leaks; we would have to duplicate all of that.
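
For illustration, a minimal Go sketch of that naive flow with the crash windows marked; createOnNode and recordInConfigMap are assumed stub helpers, and this is not PMEM-CSI code:

```go
// Sketch of the naive "persist state in a config map" flow, only to show
// where a crash leaks the volume.
package controller

import "context"

type Controller struct{}

// createOnNode is an assumed helper that asks the node driver to create the volume.
func (c *Controller) createOnNode(ctx context.Context, name string, size int64) (string, error) {
	return "pvc-" + name, nil
}

// recordInConfigMap is an assumed helper that persists the volume -> node mapping.
func (c *Controller) recordInConfigMap(ctx context.Context, volumeID, name string) error {
	return nil
}

func (c *Controller) CreateVolume(ctx context.Context, name string, size int64) (string, error) {
	// 1. Create the volume on the selected node.
	volumeID, err := c.createOnNode(ctx, name, size)
	if err != nil {
		return "", err
	}

	// CRASH WINDOW: the volume now exists on the node but is not yet
	// recorded anywhere persistent. If the controller dies here, the
	// restarted controller cannot map the volume to its node => leak.

	// 2. Record the new volume in the config map.
	if err := c.recordInConfigMap(ctx, volumeID, name); err != nil {
		// Failing here has the same effect: the volume exists but is untracked.
		return "", err
	}

	// 3. Return to external-provisioner, which then creates the PV object.
	return volumeID, nil
}
```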

@pohly
Contributor Author

pohly commented Sep 14, 2020

This is a valid alternative solution, but I wouldn't call it quick.

@pohly pohly added the 0.9 label Sep 22, 2020
@pohly pohly mentioned this issue Sep 25, 2020
@pohly
Contributor Author

pohly commented Sep 30, 2020

The upstream work on enabling decentralized deployment of external-provisioner is tracked in kubernetes-csi/external-provisioner#487.

@pohly
Contributor Author

pohly commented Nov 24, 2020

> If the controller dies at any point during this flow, the volume may leak. A lot of effort went into external-provisioner to prevent such leaks; we would have to duplicate all of that.

Not only when it dies: error handling currently also isn't sufficient to prevent volume leaks, see issue #823.

pohly added a commit to pohly/pmem-CSI that referenced this issue Dec 18, 2020
By putting external-provisioner onto each node and letting it
provision volumes directly on the node, we can remove the
controller/node communication part in PMEM-CSI. This solves various
issues in that part (race conditions that led to volume leaks) and
simplifies the deployment (no need for two-way TLS certificates
anymore).

The webhooks check for capacity by discovering the PMEM-CSI node pods
and retrieving metrics data from them via the normal metrics support.

The combination of node drivers from 0.8 with a controller from 0.9 is
harmless (no volume leaked) but can no longer create new
volumes. Existing volumes on the nodes are still usable.

Combining a controller from 0.8 with node drivers from 0.9 is more
problematic because the old controller will cause volume leaks when
volumes are deleted (intel#733).
If this is a problem, then the old StatefulSet can be deleted manually
before upgrading.
pohly added a commit to pohly/pmem-CSI that referenced this issue Jan 14, 2021
By putting external-provisioner onto each node and letting it
provision volumes directly on the node, we can remove the
controller/node communication part in PMEM-CSI. This solves various
issues in that part (race conditions that led to volume leaks) and
simplifies the deployment (no need for two-way TLS certificates
anymore).

The webhooks check for capacity by discovering the PMEM-CSI node pods
and retrieving metrics data from them via the normal metrics support.

The combination of node drivers from 0.8 with a controller from 0.9 is
harmless (no volume leaked) but can no longer create new
volumes. Existing volumes on the nodes are still usable.

Combining a controller from 0.8 with node drivers from 0.9 is more
problematic because the old controller will cause volume leaks when
volumes are deleted (intel#733).
If this is a problem, then the old StatefulSet can be deleted manually
before upgrading.

The operator and tests will be updated in separate commits.
@pohly
Contributor Author

pohly commented Jan 20, 2021

Fixed by PR #838

@pohly pohly closed this as completed Jan 20, 2021