Smart Switch reboot high level design #1699
base: master
As commented
1c9a020 to 7d67e25
DPUs are internally connected to the NPU via PCI-E bridge. Below is the reboot sequence for rebooting a specific DPU:

* Upon receiving a reboot CLI command to restart a particular DPU, the NPU transmits a gNOI Reboot API signal with reboot method set to ‘HALT’, instructing
Can you specify which reboot CLI this refers to?
This is the regular Linux "reboot" command.
Discussed in DASH Community call 9/18/2024
@vvolam FYI
@vvolam please add all the code PRs to this HLD PR description
This is the initial draft
The test scenarios above ensure that both the NPU and all DPUs are fully operational following any type of reboot. Furthermore, the tests verify the
functionality of PCI communication between NPU and DPUs.

## References
@vvolam please add a reference to the PMON HLD for smart switch
## Test plan ##

Presented below is the test plan within the ```sonic-mgmt``` framework for the smart switch reboot.
@vvolam please elaborate what is considered graceful and what is ungraceful
| Planned cold reboot of DPU | - | Graceful reboot |
| Planned power-cycle of Smart Switch | Graceful reboot | Graceful reboot |
| Planned power-cycle of DPU | - | Graceful reboot |
| Unplanned DPU power failure | - | Ungraceful reboot |
@vvolam how are we planning to induce this failure in sonic-mgmt test?
{
.
.
"dpu_halt_services_timeout" : "TBD"
@vvolam please update TBD
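Until the value is settled, a consumer of this config needs to tolerate the placeholder. The following is a minimal Python sketch, not the HLD's implementation, assuming the config is plain JSON and that the key holds a number of seconds once it is defined. The function name `read_halt_timeout` is hypothetical.

```python
import json
from typing import Optional


def read_halt_timeout(config_text: str) -> Optional[float]:
    """Return dpu_halt_services_timeout in seconds, or None if it is unset.

    Assumption: the final value is numeric (or a numeric string). A
    placeholder such as "TBD" or a missing key is reported as None so the
    caller can decide on a fallback explicitly.
    """
    value = json.loads(config_text).get("dpu_halt_services_timeout")
    try:
        return float(value)
    except (TypeError, ValueError):
        # TypeError: key missing (None); ValueError: non-numeric text like "TBD".
        return None
```

Returning `None` instead of a silent default keeps the unresolved value visible to the caller.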
1. NPU host is running gNOI client to communicate with DPU.
2. DPU host is running gNOI server to listen to gNOI client requests.
3. Each DPU is assigned an IP address to communicate from NPU.
@vvolam SONiC host services on both the DPU and NPU should be gracefully shut down as part of reboot
* Subsequently, the NPU detaches the DPU PCI with a vendor defined API. If a vendor specific API is not defined, detachment is done via sysfs
(echo 1 > /sys/bus/pci/devices/XXXX:XX:XX.X/remove).

* Next, the NPU triggers a platform vendor reboot API to initiate the reboot process for the DPU. If the DPU is stuck or unresponsive, the DPU reboot platform API should
@vvolam please specify the platform API as well.
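For the sysfs fallback quoted in the hunk above, a minimal Python sketch of the detach step could look like the following. It only mirrors the `echo 1 > .../remove` command. The function name `detach_pci_device` and the `sysfs_root` parameter (added so the logic can be exercised outside a real `/sys`) are assumptions for illustration and not part of any platform API.

```python
from pathlib import Path


def detach_pci_device(bdf: str, sysfs_root: str = "/sys") -> bool:
    """Ask the kernel to remove one PCI device, identified by its bus address.

    Equivalent to: echo 1 > /sys/bus/pci/devices/<bdf>/remove
    Returns False if the device node is absent (for example, already detached).
    """
    remove_node = Path(sysfs_root) / "bus" / "pci" / "devices" / bdf / "remove"
    if not remove_node.exists():
        return False
    # Writing "1" triggers the removal; on a real system this requires root.
    remove_node.write_text("1")
    return True
```

Treating a missing device as a non-error lets the detach step be retried safely if an earlier attempt already removed the device.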
* Upon receiving a reboot CLI command to restart a particular DPU, the NPU transmits a gNOI Reboot API signal with reboot method set to ‘HALT’, instructing
the DPU to terminate all services.

* Upon dispatching the Reboot API, the NPU issues the RebootStatus API to monitor whether the DPU has terminated all services except gNOI and database
@vvolam please specify the gNOI Reboot API used
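To make the halt-then-poll flow in the two bullets above concrete, here is a minimal Python sketch under stated assumptions. It does not bind to generated gNOI stubs: `send_halt` and `services_halted` are hypothetical callables standing in for a Reboot request with method HALT and a RebootStatus query, and how "services halted" is derived from the status response is left to the caller.

```python
import time
from typing import Callable


def halt_and_wait(
    send_halt: Callable[[], None],
    services_halted: Callable[[], bool],
    timeout_s: float,
    poll_interval_s: float = 1.0,
    clock: Callable[[], float] = time.monotonic,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Send the HALT request, then poll until services stop or time runs out.

    Returns True if the DPU reported its services halted before the deadline,
    False on timeout. `clock` and `sleep` are injectable for testing.
    """
    send_halt()
    deadline = clock() + timeout_s
    while True:
        if services_halted():
            return True
        if clock() >= deadline:
            return False
        sleep(poll_interval_s)
```

A `False` return is the point at which the NPU would move on to the platform reboot path regardless, as the timeout in the config snippet suggests.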
* DPUs will send an acknowledgment to the NPU and then undergo a reboot. After receiving the acknowledgment from the DPUs, the NPU will proceed to reboot itself to complete the overall reboot procedure. The vendor-specific reboot API should include an error handling mechanism to manage DPU reboot failures. Additionally log all the failures. DPUs will be in DPU_READY state, if the reboot happened successfully.

* Upon successful reboot, the NPU resumes operation. As part of the post-reboot process, the NPU may choose to rescan the PCI devices. This rescan operation,
@vvolam In which context or service does this rescan happens?
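Whichever service ends up owning it, the rescan itself can be sketched in a few lines of Python. This is an illustration only, assuming the generic Linux sysfs interface (`/sys/bus/pci/rescan`) is used rather than a vendor API. The helper names and the `sysfs_root` parameter are hypothetical.

```python
from pathlib import Path


def rescan_pci_bus(sysfs_root: str = "/sys") -> None:
    """Trigger a PCI bus rescan; equivalent to: echo 1 > /sys/bus/pci/rescan."""
    (Path(sysfs_root) / "bus" / "pci" / "rescan").write_text("1")


def pci_device_present(bdf: str, sysfs_root: str = "/sys") -> bool:
    """Report whether the kernel currently exposes the device at this address."""
    return (Path(sysfs_root) / "bus" / "pci" / "devices" / bdf).is_dir()
```

Checking `pci_device_present` for each DPU after the rescan gives a simple signal that the device was re-enumerated.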
* With the DPUs prepared for reboot, the NPU triggers a platform vendor API to initiate the reboot process for the DPUs. Vendor API reboots a single DPU, but the NPU spawns multiple threads to reboot DPUs in parallel. If any of the DPUs is stuck or unresponsive, the DPU reboot platform API should attempt a cold boot or power cycle to recover it.

* DPUs will send an acknowledgment to the NPU and then undergo a reboot. After receiving the acknowledgment from the DPUs, the NPU will proceed to reboot itself to complete the overall reboot procedure. The vendor-specific reboot API should include an error handling mechanism to manage DPU reboot failures. Additionally log all the failures. DPUs will be in DPU_READY state, if the reboot happened successfully.
@vvolam can you elaborate on this ack: "DPUs will send an acknowledgment to the NPU and then undergo a reboot"? Like, how is this implemented?
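As a reference point for the discussion, the "one vendor call per DPU, fanned out across threads, with failures logged" behaviour described in the hunk could be sketched in Python as below. This is a sketch under assumptions, not the HLD's design: `reboot_one` is a hypothetical stand-in for the single-DPU vendor reboot API, and its boolean return value plays the role of the acknowledgment.

```python
import logging
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Dict, Iterable

log = logging.getLogger("dpu_reboot")


def reboot_dpus_in_parallel(
    dpu_names: Iterable[str],
    reboot_one: Callable[[str], bool],
) -> Dict[str, bool]:
    """Reboot every DPU on its own worker thread and collect per-DPU results."""
    names = list(dpu_names)
    if not names:
        return {}
    results: Dict[str, bool] = {}
    with ThreadPoolExecutor(max_workers=len(names)) as pool:
        futures = {name: pool.submit(reboot_one, name) for name in names}
        for name, future in futures.items():
            try:
                results[name] = bool(future.result())
            except Exception as exc:
                # A failure on one DPU must not block the others; log and go on.
                log.error("reboot of %s failed: %s", name, exc)
                results[name] = False
    return results
```

The NPU would reboot itself only after this call returns, using the result map to decide whether any DPU needs the cold boot or power-cycle recovery path.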
| Unplanned Smart Switch System Crash | Ungraceful reboot | Ungraceful reboot |
| Unplanned DPU System Crash | - | Ungraceful reboot |

The test scenarios above ensure that both the NPU and all DPUs are fully operational following any type of reboot. Furthermore, the tests verify the
@vvolam success of reboot should verify the following:
- DPUs that were UP before reboot MUST come up successfully.
- DPUs that were admin down MUST remain admin down after reboot.
- Reboot cause of the DPUs and the NPU host should indicate that the reboot was initiated by the USER.
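These three criteria could be expressed as a single check in a test helper. The Python sketch below is illustrative only. The state strings `"up"` and `"admin_down"`, the function name, and matching the reboot cause by a case-insensitive `"user"` substring are all assumptions, not the sonic-mgmt API.

```python
from typing import Dict, List


def check_reboot_outcome(
    before: Dict[str, str],
    after: Dict[str, str],
    reboot_causes: Dict[str, str],
) -> List[str]:
    """Return human-readable failures; an empty list means all checks passed.

    before/after map DPU name -> state ("up" or "admin_down", assumed labels).
    reboot_causes maps device name (NPU or DPU) -> reported reboot cause text.
    """
    failures: List[str] = []
    for dpu, state_before in before.items():
        state_after = after.get(dpu, "missing")
        if state_before == "up" and state_after != "up":
            failures.append(f"{dpu}: was up before reboot, now {state_after}")
        if state_before == "admin_down" and state_after != "admin_down":
            failures.append(f"{dpu}: was admin down before reboot, now {state_after}")
    for device, cause in reboot_causes.items():
        # Assumption: a user-initiated reboot mentions "user" in its cause text.
        if "user" not in cause.lower():
            failures.append(f"{device}: unexpected reboot cause {cause!r}")
    return failures
```

Returning every failure at once, rather than stopping at the first, makes a single test run report the full state of all DPUs.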
This PR is for smart switch reboot high-level design