Put nodes into maintenance mode in preStop hook #378
Conversation
This commit causes nodes to go into maintenance mode before the controller runtime terminates the rabbitmq-server process on pod stop. This allows for a more controlled shutdown of pods in a cluster during a rolling restart or upgrade. Using github.com/Vanlightly/RabbitTestTool, I was able to show that with these changes, a rolling restart caused no data loss and no serious periods of cluster unavailability. However, the same was true without these changes: I attempted to find a demonstrable case where this change improves rolling-restart reliability, but was unable to improve on the current behaviour. I still put this PR forward for review, as the core team recommends that nodes are put into maintenance mode before being stopped, to allow a more predictable transfer of responsibility to other nodes. The `rabbitmq-upgrade drain` command can also be expanded if further improvements become possible in the future. This closes #320.
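The mechanism described above can be sketched as a pod-template fragment. This is illustrative only: the container name, the exact command list, and the timeout value are assumptions for this sketch, not the operator's exact output.

```yaml
# Illustrative sketch: drain the node into maintenance mode before
# rabbitmq-server receives SIGTERM. Container name, command, and the
# one-week timeout (604800 s) are assumptions, not the operator's output.
containers:
  - name: rabbitmq
    lifecycle:
      preStop:
        exec:
          command:
            - /bin/bash
            - -c
            - rabbitmq-upgrade drain -t 604800
```

Kubernetes runs the preStop hook and waits for it to finish before sending SIGTERM to the container's main process, which is what gives the node time to hand off responsibility.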
Maintenance mode is not available in previous versions of RabbitMQ
Great stuff. My only comment, and I'm not saying we need to do anything about it, is that we started with a week-long termination timeout but now we pass that value to three different commands so we basically have a 3 week timeout now. I don't think that matters really though - a week basically means infinity and 3*infinity is still infinity ;)
Changes look good and tests are passing on my machine ✅
The container will receive a SIGKILL if the grace period expires, even if it's still executing the preStop hook; the longest we will wait is one week (effectively infinity).
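As that comment notes, the pod's termination grace period bounds the entire preStop hook. A sketch of the relevant spec field, assuming the one-week value discussed above (604800 seconds):

```yaml
# Illustrative fragment: the grace period caps preStop + SIGTERM handling;
# after it expires the kubelet sends SIGKILL regardless. 604800 s = one week.
spec:
  template:
    spec:
      terminationGracePeriodSeconds: 604800
```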
Note to reviewers: remember to look at the commits in this PR and consider if they can be squashed
Summary Of Changes
Additional Context
All tests pass. Also ran some manual testing with `kubectl rollout restart statefulset <>`.
Local Testing
Please ensure you run the unit, integration and system tests before approving the PR.
To run the unit and integration tests:
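The exact commands are not shown above; assuming the repository's Makefile exposes the usual targets (a hedged sketch, target names may differ):

```shell
# Hedged sketch: run the unit and integration test suites via make.
# Target names are assumptions and may differ in the repository's Makefile.
make unit-tests
make integration-tests
```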
You will need to target a k8s cluster and have the operator deployed for running the system tests.
For example, for a Kubernetes context named `dev-bunny`:
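The steps above can be sketched as follows; the make targets are assumptions and may differ in the repository:

```shell
# Hedged sketch: target the dev-bunny cluster, deploy the operator,
# then run the system tests. Target names are assumptions.
kubectl config use-context dev-bunny
make deploy          # assumed target that deploys the operator
make system-tests
```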