CRAYSAT-1817: Automate procedure of setting next boot device to disk #240

Merged


annapoorna-s-alt
Contributor

@annapoorna-s-alt annapoorna-s-alt commented Jul 15, 2024

IM:CRAYSAT-1817
Reviewer: Ryan

Summary and Scope

Automate the procedure of setting the next boot device to disk before the management nodes are powered off as part of the full-system shutdown.
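As a rough illustration of what this automation involves, here is a minimal sketch, not the code in this PR. It assumes the boot entries are listed in `efibootmgr` output and that disk entries carry the "CRAY UEFI OS" label mentioned later in this conversation; the names `DISK_LABEL_PATTERN`, `find_disk_boot_entries`, and `next_boot_command` are hypothetical.

```python
import re

# Assumed label format: the conversation only mentions "CRAY UEFI OS *".
# Matches lines such as "Boot0000* CRAY UEFI OS 0" and captures the boot number.
DISK_LABEL_PATTERN = re.compile(
    r'^Boot([0-9A-Fa-f]{4})\*?\s+(CRAY UEFI OS \d+)', re.MULTILINE
)


def find_disk_boot_entries(efibootmgr_output, hostname):
    """Return boot numbers of entries whose label looks like a disk entry."""
    boot_nums = [m.group(1) for m in DISK_LABEL_PATTERN.finditer(efibootmgr_output)]
    if not boot_nums:
        raise ValueError(f'No disk boot entries found for {hostname}')
    return boot_nums


def next_boot_command(boot_num):
    """Build the efibootmgr command that sets BootNext to the given entry."""
    return ['efibootmgr', '--bootnext', boot_num]
```

The command list would then be run on each management node before it is powered off; how that is done (for example, over SSH) is left out of the sketch.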

Issues and Related PRs

CRAYSAT-1817

Testing

Tested on:

Rocket

Test description:

A full-system shutdown and boot will be performed to observe the resulting boot order.

Risks and Mitigations

Minimal

Pull Request Checklist

  • Version number(s) incremented, if applicable
  • Copyrights updated
  • License file intact
  • Target branch correct
  • CHANGELOG.md updated
  • Testing is appropriate and complete, if applicable
  • HPC Product Announcement prepared, if applicable

@annapoorna-s-alt force-pushed the CRAYSAT-1817-automate-next-boot-device branch 4 times, most recently from 9ffb1e6 to af71f4a, on July 16, 2024 06:57
@annapoorna-s-alt
Contributor Author

Tested on Rocket, where disk boot entries are not named in the format CRAY UEFI OS *, so no disk boot entries are found:

ncn-m001:/mnt/developer/sann # sat bootsys shutdown --stage ncn-power --ncn-shutdown-timeout 900
Proceed with shutdown of other management NCNs? [yes,no] yes
Proceeding with shutdown of other management NCNs.
IPMI username: root
IPMI password:
The following Non-compute Nodes (NCNs) will be included in this operation:
managers:
- ncn-m002
- ncn-m003
storage:
- ncn-s001
- ncn-s002
- ncn-s003
workers:
- ncn-w001
- ncn-w002
- ncn-w003
- ncn-w004

The following Non-compute Nodes (NCNs) will be excluded from this operation:
managers:
- ncn-m001
storage: []
workers: []

Are the above NCN groupings and exclusions correct? [yes,no] yes
ERROR: Value Error: No disk boot entries found for ncn-w001
ERROR: Value Error: No disk boot entries found for ncn-w002
ERROR: Value Error: No disk boot entries found for ncn-w003
ERROR: Value Error: No disk boot entries found for ncn-w004
INFO: Starting console logging on ncn-w001,ncn-w002,ncn-w003,ncn-w004.
INFO: Shutting down worker NCNs: ncn-w001, ncn-w002, ncn-w003, ncn-w004
INFO: Executing command on host "ncn-w001": `shutdown -h now`
INFO: Executing command on host "ncn-w002": `shutdown -h now`
INFO: Executing command on host "ncn-w003": `shutdown -h now`
INFO: Executing command on host "ncn-w004": `shutdown -h now`
INFO: Waiting up to 900 seconds for worker NCNs to shut down...
INFO: Stopping console logging on ncn-w001,ncn-w002,ncn-w003,ncn-w004.
ERROR: Value Error: No disk boot entries found for ncn-m002
ERROR: Value Error: No disk boot entries found for ncn-m003
INFO: Starting console logging on ncn-m002,ncn-m003.
INFO: Shutting down manager NCNs: ncn-m002, ncn-m003
INFO: Executing command on host "ncn-m002": `shutdown -h now`
INFO: Executing command on host "ncn-m003": `shutdown -h now`
INFO: Waiting up to 900 seconds for manager NCNs to shut down...
INFO: Stopping console logging on ncn-m002,ncn-m003.
WARNING: /sat/venv/lib/python3.9/site-packages/paramiko/client.py:889: UserWarning: Unknown ssh-ed25519 host key for ncn-m001: b'fdf95ea8a67c3dffba59ab56001f7077'
WARNING:   warnings.warn(
INFO: Finding mounted RBD devices on ncn-m001
INFO: Checking for mounts of RBD device /dev/rbd0 on ncn-m001
INFO: Found mount of RBD device /dev/rbd0 at /mnt/admin on ncn-m001
INFO: Checking for mounts of RBD device /dev/rbd1 on ncn-m001
INFO: Found mount of RBD device /dev/rbd1 at /mnt/developer on ncn-m001
INFO: Found 2 mounted RBD devices on ncn-m001
INFO: Finding mounted Ceph or s3fs filesystems on ncn-m001
INFO: Found 3 mounted Ceph or s3fs filesystems on ncn-m001
INFO: Checking whether mounts are in use on ncn-m001
INFO: Checking whether mount point /etc/cray/upgrade/csm is in use on ncn-m001
INFO: Mount point /etc/cray/upgrade/csm is not in use on ncn-m001
INFO: Checking whether mount point /var/opt/cray/sdu/collection-mount is in use on ncn-m001
INFO: Mount point /var/opt/cray/sdu/collection-mount is not in use on ncn-m001
INFO: Checking whether mount point /var/opt/cray/config-data is in use on ncn-m001
INFO: Mount point /var/opt/cray/config-data is not in use on ncn-m001
INFO: Checking whether mount point /mnt/admin is in use on ncn-m001
INFO: Mount point /mnt/admin is in use by the following processes on ncn-m001:
INFO: COMMAND    PID USER   FD   TYPE DEVICE SIZE/OFF NODE NAME
INFO: bash    164242 root  cwd    DIR  252,0     4096    2 /mnt/admin
INFO: Checking whether mount point /mnt/developer is in use on ncn-m001
INFO: Mount point /mnt/developer is in use by the following processes on ncn-m001:
INFO: COMMAND     PID USER   FD   TYPE DEVICE SIZE/OFF    NODE NAME
INFO: bash    1229709 root  cwd    DIR 252,16     4096 1179649 /mnt/developer/sann
INFO: bash    1315626 root  cwd    DIR 252,16     4096 1179649 /mnt/developer/sann
INFO: podman  1315633 root  cwd    DIR 252,16     4096 1179649 /mnt/developer/sann
INFO: conmon  1315734 root  cwd    DIR 252,16     4096 1179649 /mnt/developer/sann
INFO: sat     1315746 root  cwd    DIR 252,16     4096 1179649 /sat/share
Some filesystems to be unmounted remain in use. Please address this before continuing.
Proceed with unmount of filesystems? [yes,no] no
Will not proceed with unmount of filesystems. Exiting.
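The output above shows the boot-device step logging an error for each node and the shutdown continuing anyway. A minimal sketch of that best-effort pattern follows. It is an assumption about the control flow, not the PR's code; `set_next_boot_to_disk` and the `set_boot_device` callable are hypothetical names.

```python
import logging

LOGGER = logging.getLogger(__name__)


def set_next_boot_to_disk(hostnames, set_boot_device):
    """Try to set the next boot device on each host; log failures and keep going.

    `set_boot_device` is a hypothetical callable that takes a hostname and
    raises ValueError when no disk boot entry exists on that host.
    Returns the list of hostnames for which the step failed.
    """
    failed = []
    for hostname in hostnames:
        try:
            set_boot_device(hostname)
        except ValueError as err:
            # Mirror the message prefix seen in the console output above.
            LOGGER.error('Value Error: %s', err)
            failed.append(hostname)
    return failed
```

Returning the failed hosts rather than raising lets the caller decide whether a missing disk entry should block the power-off; in the run above it does not.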

@annapoorna-s-alt force-pushed the CRAYSAT-1817-automate-next-boot-device branch 2 times, most recently from dcedddd to 4f868bb, on July 17, 2024 10:03
@annapoorna-s-alt
Contributor Author

@annapoorna-s-alt force-pushed the CRAYSAT-1817-automate-next-boot-device branch from 4f868bb to e5562e0 on July 18, 2024 10:48
Contributor

@haasken-hpe haasken-hpe left a comment

My comments are minor. Overall, this looks good.

sat/cli/bootsys/mgmt_power.py: 4 review threads (all resolved, all outdated)
@annapoorna-s-alt force-pushed the CRAYSAT-1817-automate-next-boot-device branch from e5562e0 to bd06427 on July 22, 2024 05:49
@annapoorna-s-alt
Contributor Author

Latest output is here

Contributor

@haasken-hpe haasken-hpe left a comment

Looks good. Minor unit test improvement suggestions.

tests/cli/bootsys/test_mgmt_power.py: 10 review threads (all resolved, 8 outdated)
@annapoorna-s-alt force-pushed the CRAYSAT-1817-automate-next-boot-device branch 3 times, most recently from c1aa441 to d0b81ac, on July 24, 2024 12:01
Contributor

@haasken-hpe haasken-hpe left a comment

There are two unit tests of the same name, and one actually has a NameError, but this is masked since only one of them (the one without the error) is executed.

My comments and suggestions should address this. Otherwise, this still looks good. Feel free to merge once you've addressed that.
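
To illustrate the masking described above: when a Python class body defines two methods with the same name, the later definition replaces the earlier one, so the test loader only ever sees one test. The class and test names below are hypothetical and are not taken from this PR's test module.

```python
import unittest


class TestSetNextBootDevice(unittest.TestCase):
    """Hypothetical test case showing how a duplicate name hides a broken test."""

    def test_no_disk_entries(self):
        # Would raise NameError if it ever ran: `undefined_helper` does not exist.
        self.assertTrue(undefined_helper())

    def test_no_disk_entries(self):  # same name: replaces the definition above
        self.assertEqual(1 + 1, 2)


suite = unittest.defaultTestLoader.loadTestsFromTestCase(TestSetNextBootDevice)
result = unittest.TestResult()
suite.run(result)
# Only one test is collected, and it passes, so the NameError never surfaces.
print(suite.countTestCases(), result.wasSuccessful())  # prints "1 True"
```

Giving each test a unique name makes both run, at which point the NameError shows up as an error in the test results.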

tests/cli/bootsys/test_mgmt_power.py: 10 review threads (all resolved, 7 outdated)
@annapoorna-s-alt force-pushed the CRAYSAT-1817-automate-next-boot-device branch from d0b81ac to 022369f on July 25, 2024 04:46
Contributor

@shivaprasad-metimath shivaprasad-metimath left a comment

Looks good.

@annapoorna-s-alt annapoorna-s-alt merged commit ec8154e into feature/CRAYSAT-1740 Jul 25, 2024
3 checks passed
@annapoorna-s-alt annapoorna-s-alt deleted the CRAYSAT-1817-automate-next-boot-device branch July 25, 2024 05:57