Put nodes into maintenance mode in preStop hook #378
Conversation
This commit causes nodes to go into maintenance mode before the controller runtime terminates the rabbitmq-server process on pod stop. This allows for a more controlled shutdown of pods in a cluster during a rolling restart or upgrade. Using github.com/Vanlightly/RabbitTestTool, I was able to show that with these changes, a rolling restart caused no data loss and no serious periods of cluster unavailability. However, the same was true without these changes: I attempted to find a demonstrable case where this change improves rolling-restart reliability, but was unable to improve on the current behaviour. I still put this PR forward for review, as the core team recommends that nodes are put into maintenance mode before being stopped, to allow a more predictable transfer of responsibility to other nodes. The `rabbitmq-upgrade drain` command can also be expanded if further improvements become possible in the future. This closes #320.
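The mechanism described above can be sketched as a pod-template fragment. This is illustrative only: the container name, the exact command list, and the timeout value are assumptions for this sketch, not the operator's exact output.

```yaml
# Illustrative sketch: drain the node into maintenance mode before
# rabbitmq-server receives SIGTERM. Container name, command, and the
# one-week timeout (604800 s) are assumptions, not the operator's output.
containers:
  - name: rabbitmq
    lifecycle:
      preStop:
        exec:
          command:
            - /bin/bash
            - -c
            - rabbitmq-upgrade drain -t 604800
```

Kubernetes runs the preStop hook and waits for it to finish before sending SIGTERM to the container's main process, which is what gives the node time to hand off responsibility.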
Maintenance mode is not available in previous versions of RabbitMQ
Great stuff. My only comment, and I'm not saying we need to do anything about it, is that we started with a week-long termination timeout but now we pass that value to three different commands so we basically have a 3 week timeout now. I don't think that matters really though - a week basically means infinity and 3*infinity is still infinity ;)
Changes look good and tests are passing on my machine ✅
The container will receive a SIGKILL if the grace period expires, even if it's still executing the preStop hook; the longest we will wait is one week (effectively infinity).
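As that comment notes, the pod's termination grace period bounds the entire preStop hook. A sketch of the relevant spec field, assuming the one-week value discussed above (604800 seconds):

```yaml
# Illustrative fragment: the grace period caps preStop + SIGTERM handling;
# after it expires the kubelet sends SIGKILL regardless. 604800 s = one week.
spec:
  template:
    spec:
      terminationGracePeriodSeconds: 604800
```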
Note to reviewers: remember to look at the commits in this PR and consider if they can be squashed
Summary Of Changes
Additional Context
All tests pass. Also ran some manual testing with `kubectl rollout restart statefulset <>`.
Local Testing
Please ensure you run the unit, integration and system tests before approving the PR.
To run the unit and integration tests:
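The exact commands are not shown above; assuming the repository's Makefile exposes the usual targets (a hedged sketch, target names may differ):

```shell
# Hedged sketch: run the unit and integration test suites via make.
# Target names are assumptions and may differ in the repository's Makefile.
make unit-tests
make integration-tests
```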
You will need to target a k8s cluster and have the operator deployed for running the system tests.
For example, for a Kubernetes context named `dev-bunny`:
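The steps above can be sketched as follows; the make targets are assumptions and may differ in the repository:

```shell
# Hedged sketch: target the dev-bunny cluster, deploy the operator,
# then run the system tests. Target names are assumptions.
kubectl config use-context dev-bunny
make deploy          # assumed target that deploys the operator
make system-tests
```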