CRAYSAT-1861: Unmount Ceph and s3fs filesystems on ncn-m001 during shutdown #227

Merged


haasken-hpe

@haasken-hpe haasken-hpe commented Jun 20, 2024

Summary and Scope

Unmount filesystems of type ceph and fuse.s3fs on ncn-m001 when shutting down. Also do the same for any filesystems mounted from RBD devices.

Add an activity check that uses lsof to determine whether each filesystem can be unmounted or whether active processes are still using it. Report any such processes to the admin so they can stop them as appropriate.
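
As a rough illustration of this kind of activity check (not the code in this PR), here is a minimal sketch that runs `lsof` against a mount point and collects the processes holding files open on it. The function names `parse_lsof_field_output` and `processes_using` are hypothetical, and the sketch runs `lsof` locally rather than on ncn-m001 over SSH.

```python
import subprocess
from typing import Dict


def parse_lsof_field_output(output: str) -> Dict[int, str]:
    """Map PID to command name from `lsof -F pc` output.

    In field output mode each line starts with a one-character field
    identifier: 'p' for the process ID and 'c' for the command name.
    Lines with any other identifier are ignored.
    """
    procs: Dict[int, str] = {}
    current_pid = None
    for line in output.splitlines():
        if line.startswith("p"):
            current_pid = int(line[1:])
            procs[current_pid] = ""
        elif line.startswith("c") and current_pid is not None:
            procs[current_pid] = line[1:]
    return procs


def processes_using(mount_point: str) -> Dict[int, str]:
    """Return processes with open files on the filesystem at mount_point.

    Hypothetical helper: assumes lsof is installed and run locally.
    """
    # lsof exits non-zero when it finds no open files, so check=True is
    # deliberately not used here.
    result = subprocess.run(
        ["lsof", "-F", "pc", mount_point],
        stdout=subprocess.PIPE,
        stderr=subprocess.DEVNULL,
        text=True,
    )
    return parse_lsof_field_output(result.stdout)
```

An empty result would suggest the filesystem can be unmounted. A non-empty one gives the admin the PIDs and command names to stop first.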

Disable the cron job that automatically mounts Ceph and s3fs filesystems on ncn-m001, so that they won't get remounted before the node is shut down. Re-enable it at the appropriate time during the power on.

Issues and Related PRs

Testing

Tested on:

  • fanta

Test description:

Minimal testing so far. Just tested the functions on their own to limit system disruption.

Still need to test full-system power down.

Risks and Mitigations

This adds a significant amount of code and output to `sat bootsys`, so we will want to update the docs. Overall, it should improve the robustness and completeness of the system shutdown automation.

Pull Request Checklist

  • Version number(s) incremented, if applicable
  • Copyrights updated
  • License file intact
  • Target branch correct
  • CHANGELOG.md updated
  • Testing is appropriate and complete, if applicable
  • HPC Product Announcement prepared, if applicable

@haasken-hpe

@shivaprasad-metimath, @annapoorna-s-alt, there is still further cleanup and testing I would like to do on this, but since it is a large PR, I wanted to give you both plenty of time to review it.

Here is the testing performed so far on fanta: https://gist.github.com/haasken-hpe/bfc0d8aed96c19cf48cba680e17485c0

@haasken-hpe haasken-hpe force-pushed the CRAYSAT-1861-umount-ceph-s3fs-ncn-m001 branch from 2fa9867 to 117893f Compare June 28, 2024 23:30
@haasken-hpe haasken-hpe force-pushed the CRAYSAT-1861-umount-ceph-s3fs-ncn-m001 branch from f7abb55 to 9d7c69b Compare July 12, 2024 02:14
@haasken-hpe haasken-hpe marked this pull request as ready for review July 12, 2024 02:18
@haasken-hpe

Marking this ready for review. I have not yet tested it by actually performing the power off/on of a real Shasta system, but I have thoroughly unit tested it.

Please review, @shivaprasad-metimath and @annapoorna-s-alt.

I will test this tomorrow, or during our meeting on Monday.

@haasken-hpe

Testing was completed on rocket. I ran the full system shutdown procedure with the build of the cray-sat image from this branch. It is far too much testing output to attach to this PR or a gist, so I'll attach the typescripts to the Jira. There were no issues identified with these code changes.

Modify the `sat bootsys shutdown --stage ncn-power` command to
do the following on ncn-m001:

* Find all mounts of filesystems of type `ceph` or `fuse.s3fs`
* Find all mounts of filesystems on RBD devices
* For each mount point identified, check whether there is any activity
  on that mount point that would prevent it from being unmounted. Prompt
  the admin to confirm they have stopped the processes before continuing.
* Disable the systemd cron job that ensures Ceph and s3fs mounts remain
  mounted.
* Unmount each mount point
* Unmap all RBD devices
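
The first two steps in the list above can be sketched as follows. This is a minimal illustration, not the code added in this commit. It assumes the mount table is available as text in `/proc/mounts` format and that RBD-backed mounts can be recognized by a device path starting with `/dev/rbd`. The names `Mount`, `parse_mounts`, and `find_ceph_related_mounts` are hypothetical.

```python
from typing import List, NamedTuple

# Filesystem types named in the commit message above.
CEPH_FS_TYPES = {"ceph", "fuse.s3fs"}

# Assumption: mapped RBD devices appear under this device path prefix.
RBD_DEVICE_PREFIX = "/dev/rbd"


class Mount(NamedTuple):
    device: str
    mount_point: str
    fs_type: str


def parse_mounts(mounts_text: str) -> List[Mount]:
    """Parse text in /proc/mounts format into Mount tuples.

    Each line has whitespace-separated fields: device, mount point,
    filesystem type, options, and two numeric fields. Only the first
    three are used here.
    """
    mounts = []
    for line in mounts_text.splitlines():
        fields = line.split()
        if len(fields) < 3:
            continue  # skip blank or malformed lines
        mounts.append(Mount(fields[0], fields[1], fields[2]))
    return mounts


def find_ceph_related_mounts(mounts: List[Mount]) -> List[Mount]:
    """Select ceph and fuse.s3fs mounts, plus mounts on RBD devices."""
    return [
        m for m in mounts
        if m.fs_type in CEPH_FS_TYPES
        or m.device.startswith(RBD_DEVICE_PREFIX)
    ]
```

On a Linux node, `find_ceph_related_mounts(parse_mounts(open("/proc/mounts").read()))` would give the mount points to check for activity and then unmount.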

Modify the `sat bootsys boot --stage ncn-power` command to re-enable the
systemd cron job that ensures Ceph/s3fs mounts are mounted on ncn-m001.

Prior to this change, only the unmapping of RBD devices was performed.
We want to ensure that all usages of Ceph on ncn-m001 are gracefully
stopped before shutting down the storage nodes so that ncn-m001 will
shut down gracefully.

Test Description:
Added many unit tests.

Testing is needed on a real system as part of a full-system shutdown.

The `FilteredHostKeys` class was created to work around an issue with
Paramiko's loading of SSH host keys that was causing a serious
performance problem when the known_hosts file was large. However, this
class was not being used to filter host keys in all locations where a
Paramiko `SSHClient` was being created. Update the locations which were
not filtering host keys to do so.
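
To illustrate the general idea of filtering host keys before they are loaded, here is a minimal sketch. It is not the `FilteredHostKeys` implementation and does not use Paramiko. It assumes that filtering means keeping only the `known_hosts` entries for the hosts being connected to, and the function name `filter_known_hosts_lines` is hypothetical.

```python
from typing import Iterable, Iterator, Set


def filter_known_hosts_lines(
    lines: Iterable[str], hostnames: Set[str]
) -> Iterator[str]:
    """Yield only known_hosts lines that name one of the given hosts.

    Limitations of this sketch: hashed entries (host field starting
    with '|1|') and wildcard patterns are not matched and are skipped.
    """
    for line in lines:
        stripped = line.strip()
        if not stripped or stripped.startswith("#"):
            continue  # skip blank lines and comments
        fields = stripped.split()
        # A line may begin with a marker such as @cert-authority or
        # @revoked, in which case the host field is the second field.
        if fields[0].startswith("@") and len(fields) > 1:
            host_field = fields[1]
        else:
            host_field = fields[0]
        # The host field is a comma-separated list of names/addresses.
        if hostnames.intersection(host_field.split(",")):
            yield line
```

With a large `known_hosts` file, parsing only the few matching lines keeps the expensive key-decoding work proportional to the number of hosts of interest rather than to the size of the file.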

Test Description:
Needs to be tested as part of a full-system power off/on on a real
system.
@haasken-hpe haasken-hpe force-pushed the CRAYSAT-1861-umount-ceph-s3fs-ncn-m001 branch from 5f06612 to c740815 Compare July 15, 2024 22:13
@haasken-hpe haasken-hpe merged commit 26008ce into feature/CRAYSAT-1740 Jul 15, 2024
3 checks passed
@haasken-hpe haasken-hpe deleted the CRAYSAT-1861-umount-ceph-s3fs-ncn-m001 branch July 15, 2024 23:07