Smart Switch reboot high level design #1699

Open
wants to merge 12 commits into base: master


@vvolam vvolam commented May 16, 2024

This PR is for smart switch reboot high-level design

| Repo | Pull Request | Status |
|---|---|---|
| sonic-gnmi | sonic-net/sonic-gnmi#286 | Open |
| sonic-host-services | sonic-net/sonic-host-services#164 | Open |
| sonic-platform-common | sonic-net/sonic-platform-common#501 | Merged |
| sonic-utilities | sonic-net/sonic-utilities#3566 | Draft |
| sonic-platform-daemons | sonic-net/sonic-platform-daemons#546 | Draft |

@vvolam vvolam marked this pull request as ready for review May 16, 2024 23:15
@isabelmsft isabelmsft self-requested a review May 20, 2024 23:31
ganglyu previously approved these changes Jun 17, 2024
isabelmsft previously approved these changes Jun 18, 2024
@vvolam vvolam dismissed stale reviews from isabelmsft and ganglyu via 1934915 June 26, 2024 19:00

@oleksandrivantsiv oleksandrivantsiv left a comment

As commented

@vvolam vvolam force-pushed the reboot-hld branch 2 times, most recently from 1c9a020 to 7d67e25 Compare July 30, 2024 01:16

DPUs are internally connected to the NPU via PCI-E bridge. Below is the reboot sequence for rebooting a specific DPU:

* Upon receiving a reboot CLI command to restart a particular DPU, the NPU transmits a gNOI Reboot API signal with reboot method set to ‘HALT’, instructing

@rameshraghupathy rameshraghupathy Aug 6, 2024

Can you specify which reboot CLI this refers to?

@vvolam (Author)

This is the regular Linux `reboot` command.

@KrisNey-MSFT

Discussed in the DASH Community call on 9/18/2024:
If the DPU is unresponsive and we are trying to recover it, is there a way to hard power-cycle a DPU without having to power-cycle the switch?
Via PCIe lanes, CPLD, or some other mechanism?
Can we force-shut or force-reboot the card (without forcing the entire switch), and will this be standardized or supplier-specific?
@prgeor

prgeor commented Sep 18, 2024

@vvolam

> Discussed in the DASH Community call on 9/18/2024: If the DPU is unresponsive and we are trying to recover it, is there a way to hard power-cycle a DPU without having to power-cycle the switch? Via PCIe lanes, CPLD, or some other mechanism? Can we force-shut or force-reboot the card (without forcing the entire switch), and will this be standardized or supplier-specific? @prgeor

@vvolam FYI

prgeor commented Sep 18, 2024

@vvolam please add all the code PRs to this HLD PR description

The test scenarios above ensure that both the NPU and all DPUs are fully operational following any type of reboot. Furthermore, the tests verify the
functionality of PCI communication between NPU and DPUs.

## References

@vvolam please add a reference to the PMON HLD for smart switch

## Test plan ##

Presented below is the test plan within the ```sonic-mgmt``` framework for the smart switch reboot.

@vvolam please elaborate on what is considered a graceful reboot and what is considered an ungraceful one

| Planned cold reboot of DPU | - | Graceful reboot |
| Planned power-cycle of Smart Switch | Graceful reboot | Graceful reboot |
| Planned power-cycle of DPU | - | Graceful reboot |
| Unplanned DPU power failure | - | Ungraceful reboot |

@vvolam how are we planning to induce this failure in the sonic-mgmt test?

{
.
.
"dpu_halt_services_timeout" : "TBD"

@vvolam please update TBD


1. The NPU host runs a gNOI client to communicate with the DPU.
2. The DPU host runs a gNOI server that listens for gNOI client requests.
3. Each DPU is assigned an IP address for communication with the NPU.

@vvolam SONiC host services on both the DPU and the NPU should be gracefully shut down as part of the reboot

* Subsequently, the NPU detaches the DPU PCI device using a vendor-defined API. If a vendor-specific API is not defined, detachment is done via sysfs
(`echo 1 > /sys/bus/pci/devices/XXXX:XX:XX.X/remove`).

* Next, the NPU triggers a platform vendor reboot API to initiate the reboot process for the DPU. If the DPU is stuck or unresponsive, the DPU reboot platform API should

@vvolam please specify the platform API as well.
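
For illustration only, the sysfs fallback quoted in the hunk above could be sketched in Python as follows. The function name `detach_pci_device` and the `sysfs_root` parameter are hypothetical (the parameter exists only so the sketch can be exercised against a temporary directory); the HLD itself specifies only the `echo 1 > .../remove` write, and the vendor-defined API it prefers is not shown here.

```python
from pathlib import Path

# Real sysfs location from the HLD text; overridable for testing.
SYSFS_PCI_DEVICES = Path("/sys/bus/pci/devices")


def detach_pci_device(bdf: str, sysfs_root: Path = SYSFS_PCI_DEVICES) -> bool:
    """Write "1" to <sysfs_root>/<bdf>/remove (hypothetical helper).

    Returns False when the device node does not exist, for example
    because the device has already been detached.
    """
    remove_node = Path(sysfs_root) / bdf / "remove"
    if not remove_node.exists():
        return False
    remove_node.write_text("1")
    return True
```

On a real system the write needs root privileges, and the `XXXX:XX:XX.X` placeholder must be replaced with the DPU's actual PCI address.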

* Upon receiving a reboot CLI command to restart a particular DPU, the NPU transmits a gNOI Reboot API signal with reboot method set to ‘HALT’, instructing
the DPU to terminate all services.

* Upon dispatching the Reboot API, the NPU issues the RebootStatus API to monitor whether the DPU has terminated all services except gNOI and database

@vvolam please specify the gNOI Reboot API used
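
As a rough sketch of the HALT-then-poll flow in the hunk above, the loop below sends the halt request once and then polls for completion until a timeout expires. It deliberately does not call any real gNOI client: `send_halt` and `services_halted` are hypothetical callables standing in for the Reboot and RebootStatus requests, and `timeout_s` is assumed to come from a setting such as `dpu_halt_services_timeout`, whose value is still TBD in the HLD.

```python
import time
from typing import Callable


def halt_and_wait(
    send_halt: Callable[[], None],
    services_halted: Callable[[], bool],
    timeout_s: float,
    poll_interval_s: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
    clock: Callable[[], float] = time.monotonic,
) -> bool:
    """Send the halt request, then poll until services stop or time runs out.

    All callables are placeholders (assumptions), not gNOI library calls.
    Returns True if the DPU reported its services halted before the deadline.
    """
    send_halt()
    deadline = clock() + timeout_s
    while True:
        if services_halted():
            return True
        if clock() >= deadline:
            return False
        sleep(poll_interval_s)
```

A caller would treat a `False` result as the "stuck or unresponsive" case and proceed to the platform reboot API regardless.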


* DPUs will send an acknowledgment to the NPU and then undergo a reboot. After receiving the acknowledgment from the DPUs, the NPU will proceed to reboot itself to complete the overall reboot procedure. The vendor-specific reboot API should include an error handling mechanism to manage DPU reboot failures and should log all failures. If the reboot succeeds, the DPUs will be in the DPU_READY state.

* Upon successful reboot, the NPU resumes operation. As part of the post-reboot process, the NPU may choose to rescan the PCI devices. This rescan operation,

@vvolam In which context or service does this rescan happen?


* With the DPUs prepared for reboot, the NPU triggers a platform vendor API to initiate the reboot process for the DPUs. The vendor API reboots a single DPU, but the NPU spawns multiple threads to reboot DPUs in parallel. If any of the DPUs is stuck or unresponsive, the DPU reboot platform API should attempt a cold boot or power cycle to recover it.

* DPUs will send an acknowledgment to the NPU and then undergo a reboot. After receiving the acknowledgment from the DPUs, the NPU will proceed to reboot itself to complete the overall reboot procedure. The vendor-specific reboot API should include an error handling mechanism to manage DPU reboot failures and should log all failures. If the reboot succeeds, the DPUs will be in the DPU_READY state.

@vvolam can you elaborate on this ack: "DPUs will send an acknowledgment to the NPU and then undergo a reboot"? How is this implemented?
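
To make the "one thread per DPU" step in the hunk above concrete, here is a minimal sketch that runs a single-DPU reboot call in parallel and logs failures. `reboot_dpu` is a hypothetical stand-in for the vendor platform API (which, per the HLD, is also where any cold-boot or power-cycle recovery would live), and the acknowledgment mechanism asked about in this thread is not modeled.

```python
import logging
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Dict, Iterable

logger = logging.getLogger("dpu_reboot_sketch")


def reboot_dpus_in_parallel(
    dpu_names: Iterable[str],
    reboot_dpu: Callable[[str], None],
) -> Dict[str, bool]:
    """Call reboot_dpu(name) for every DPU on its own worker thread.

    reboot_dpu is a placeholder (assumption) for the single-DPU vendor API.
    Returns a mapping of DPU name to True (call returned) or False (call raised).
    """
    names = list(dpu_names)
    if not names:
        return {}

    def reboot_one(name: str) -> bool:
        try:
            reboot_dpu(name)
            return True
        except Exception:
            # The HLD asks for all failures to be logged.
            logger.exception("Reboot of %s failed", name)
            return False

    # One worker per DPU so a slow DPU does not delay the others.
    with ThreadPoolExecutor(max_workers=len(names)) as pool:
        return dict(zip(names, pool.map(reboot_one, names)))
```

The NPU would inspect the returned mapping before rebooting itself; how a failed DPU should affect that decision is left open here, as it is in the quoted text.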

| Unplanned Smart Switch System Crash | Ungraceful reboot | Ungraceful reboot |
| Unplanned DPU System Crash | - | Ungraceful reboot |

The test scenarios above ensure that both the NPU and all DPUs are fully operational following any type of reboot. Furthermore, the tests verify the

@vvolam a successful reboot test should verify the following:

  1. DPUs that were UP before the reboot MUST come up successfully.
  2. DPUs that were admin down MUST remain admin down after the reboot.
  3. The reboot cause of the DPUs and the NPU host should indicate that the reboot was initiated by the USER.
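
One possible way to express these three checks in a test helper is sketched below. Everything here is hypothetical: the `DpuState` record, the `check_reboot_result` function, and the choice to match the reboot cause by a case-insensitive "user" substring are assumptions made for illustration, not part of sonic-mgmt. The cause check is applied only to devices that were admin up, on the assumption that admin-down DPUs were not rebooted; the NPU host can be included as one more entry in the dictionaries.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class DpuState:
    """Hypothetical snapshot of one device taken before or after a reboot."""
    admin_up: bool
    oper_up: bool
    reboot_cause: str = ""


def check_reboot_result(
    before: Dict[str, DpuState],
    after: Dict[str, DpuState],
    expected_cause: str = "user",
) -> List[str]:
    """Return a list of violations of the three criteria (empty means pass)."""
    problems: List[str] = []
    for name, pre in before.items():
        post = after.get(name)
        if post is None:
            problems.append(f"{name}: missing after reboot")
            continue
        # 1. Devices that were up must come back up.
        if pre.admin_up and pre.oper_up and not post.oper_up:
            problems.append(f"{name}: was up before reboot but did not come up")
        # 2. Admin-down devices must stay admin down.
        if not pre.admin_up and post.admin_up:
            problems.append(f"{name}: was admin down but is admin up after reboot")
        # 3. Reboot cause must point at a user-initiated reboot.
        if pre.admin_up and expected_cause.lower() not in post.reboot_cause.lower():
            problems.append(f"{name}: unexpected reboot cause {post.reboot_cause!r}")
    return problems
```

A test would capture `before` prior to issuing the reboot, capture `after` once the system settles, and assert that the returned list is empty.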

7 participants