Smart Switch reboot high level design #1699

vvolam · 2024-05-16T19:45:34Z

This PR adds the high-level design (HLD) for Smart Switch reboot.

| Repo | Pull Request | Status |
| --- | --- | --- |
| sonic-gnmi | sonic-net/sonic-gnmi#286 | Open |
| sonic-host-services | sonic-net/sonic-host-services#164 | Open |
| sonic-platform-common | sonic-net/sonic-platform-common#501 | Merged |
| sonic-utilities | sonic-net/sonic-utilities#3566 | Draft |
| sonic-platform-daemons | sonic-net/sonic-platform-daemons#546 | Draft |

doc/smart-switch/reboot/reboot-hld.md

oleksandrivantsiv

As commented

doc/smart-switch/reboot/reboot-hld.md

rameshraghupathy · 2024-08-06T17:38:20Z

doc/smart-switch/reboot/reboot-hld.md

+
+DPUs are internally connected to the NPU via PCI-E bridge. Below is the reboot sequence for rebooting a specific DPU:
+
+* Upon receiving a reboot CLI command to restart a particular DPU, the NPU transmits a gNOI Reboot API signal with reboot method set to ‘HALT’, instructing


Can you specify which reboot CLI this refers to?

This is the regular Linux `reboot` command.

KrisNey-MSFT · 2024-09-18T16:22:09Z

Discussed in DASH Community call 9/18/2024
If the DPU is unresponsive and we are trying to recover it, is there a way to hard power cycle a DPU w/o having to power cycle the switch?
Via PCIE express lanes, CPLD, or other?
Force-shut or force-reboot the card (w/o forcing the entire switch), and will it be standardized or supplier-specific?
@prgeor

prgeor · 2024-09-18T22:41:12Z

@vvolam

> Discussed in DASH Community call 9/18/2024
> If the DPU is unresponsive and we are trying to recover it, is there a way to hard power cycle a DPU w/o having to power cycle the switch?
> Via PCIE express lanes, CPLD, or other?
> Force-shut or force-reboot the card (w/o forcing the entire switch), and will it be standardized or supplier-specific?
> @prgeor

@vvolam FYI

prgeor · 2024-09-18T22:41:36Z

@vvolam please add all the code PRs to this HLD PR description


prgeor · 2024-10-06T15:00:11Z

doc/smart-switch/reboot/reboot-hld.md

+The test scenarios above ensure that both the NPU and all DPUs are fully operational following any type of reboot. Furthermore, the tests verify the
+functionality of PCI communication between NPU and DPUs.
+
+## References


@vvolam please add a reference to the PMON HLD for Smart Switch.

prgeor · 2024-10-06T15:00:41Z

doc/smart-switch/reboot/reboot-hld.md

+## Test plan ##
+
+Presented below is the test plan within the ```sonic-mgmt``` framework for the smart switch reboot.
+


@vvolam please elaborate what is considered graceful and what is ungraceful

prgeor · 2024-10-06T15:01:19Z

doc/smart-switch/reboot/reboot-hld.md

+| Planned cold reboot of DPU                | -                   | Graceful reboot     |
+| Planned power-cycle of Smart Switch       | Graceful reboot     | Graceful reboot     |
+| Planned power-cycle of DPU                | -                   | Graceful reboot     |
+| Unplanned DPU power failure               | -                   | Ungraceful reboot   |


@vvolam how are we planning to induce this failure in sonic-mgmt test?

prgeor · 2024-10-06T15:02:24Z

doc/smart-switch/reboot/reboot-hld.md

+{
+    .
+    .
+    "dpu_halt_services_timeout" : "TBD"


@vvolam please update TBD
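Until the value is settled, any consumer of this setting has to cope with the `TBD` placeholder. A minimal sketch, assuming the key is read from a JSON document and that a caller-supplied default applies when the value is missing or not numeric. The function name and the 60-second default are illustrative assumptions, not values from the HLD:

```python
import json


def load_halt_timeout(json_text, default_seconds=60):
    """Return dpu_halt_services_timeout in seconds, or the default.

    default_seconds is a placeholder assumption; the HLD still lists TBD.
    """
    config = json.loads(json_text)
    raw = config.get("dpu_halt_services_timeout")
    try:
        value = int(raw)
    except (TypeError, ValueError):
        # Missing key, or a non-numeric placeholder such as "TBD".
        return default_seconds
    # Reject zero or negative values rather than polling with no budget.
    return value if value > 0 else default_seconds
```

Falling back to a default keeps the reboot path usable while the final number is still under discussion.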

prgeor · 2024-10-06T15:10:35Z

doc/smart-switch/reboot/reboot-hld.md

+
+1. NPU host is running gNOI client to communicate with DPU.
+2. DPU host is running gNOI server to listen to gNOI client requests.
+3. Each DPU is assigned an IP address to communicate from NPU.


@vvolam SONiC host services on both the DPU and the NPU should be gracefully shut down as part of the reboot.

prgeor · 2024-10-06T15:11:55Z

doc/smart-switch/reboot/reboot-hld.md

+* Subsequently, the NPU detaches the DPU PCI with a vendor defined API. If a vendor specific API is not defined, detachment is done via sysfs
+(echo 1 > /sys/bus/pci/devices/XXXX:XX:XX.X/remove).
+
+* Next, the NPU triggers a platform vendor reboot API to initiate the reboot process for the DPU. If the DPU is stuck or unresponsive, the DPU reboot platform API should


@vvolam please specify the platform API as well.
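For reference, the sysfs fallback quoted in the hunk above can be sketched in Python as follows. This covers only the generic Linux path (writing `1` to the device's `remove` node); the vendor-defined API is not shown because its name is the open question here. The function name and the `sysfs_root` parameter are assumptions added so the helper can be exercised against a temporary directory:

```python
from pathlib import Path


def detach_pci_device(bdf, sysfs_root="/sys"):
    """Remove a PCI device through sysfs.

    bdf is the device address in XXXX:XX:XX.X form. Returns False when the
    device node is not present (for example, already detached).
    """
    remove_node = Path(sysfs_root) / "bus" / "pci" / "devices" / bdf / "remove"
    if not remove_node.exists():
        return False
    # Equivalent to: echo 1 > /sys/bus/pci/devices/<bdf>/remove
    remove_node.write_text("1")
    return True
```

On a real system this requires root privileges, and the device directory disappears once the kernel processes the write.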

prgeor · 2024-10-06T15:12:21Z

doc/smart-switch/reboot/reboot-hld.md

+* Upon receiving a reboot CLI command to restart a particular DPU, the NPU transmits a gNOI Reboot API signal with reboot method set to ‘HALT’, instructing
+the DPU to terminate all services.
+
+* Upon dispatching the Reboot API, the NPU issues the RebootStatus API to monitor whether the DPU has terminated all services except gNOI and database


@vvolam please specify the gNOI Reboot API used
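To make the request-then-poll flow in the hunk above concrete, here is a rough sketch. It does not use a real gNOI client: `send_reboot` and `get_reboot_status` are hypothetical callables standing in for the Reboot and RebootStatus calls, and the convention that the status callable returns `True` while the halt is still in progress is an assumption for illustration only:

```python
import time


def halt_dpu_and_wait(send_reboot, get_reboot_status, timeout_s,
                      poll_interval_s=1.0, sleep=time.sleep,
                      clock=time.monotonic):
    """Send a HALT reboot request, then poll until done or timed out.

    send_reboot and get_reboot_status are hypothetical stand-ins for the
    gNOI client calls; get_reboot_status returns True while the DPU is
    still terminating services.
    """
    send_reboot("HALT")
    deadline = clock() + timeout_s
    while clock() < deadline:
        if not get_reboot_status():
            return True  # DPU reports that services have terminated
        sleep(poll_interval_s)
    return False  # timed out; the caller decides how to recover
```

Passing `sleep` and `clock` as parameters keeps the loop testable without real delays. The timeout would presumably come from `dpu_halt_services_timeout` once that value is defined.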

prgeor · 2024-10-07T02:12:53Z

doc/smart-switch/reboot/reboot-hld.md

+
+* DPUs will send an acknowledgment to the NPU and then undergo a reboot. After receiving the acknowledgment from the DPUs, the NPU will proceed to reboot itself to complete the overall reboot procedure. The vendor-specific reboot API should include an error handling mechanism to manage DPU reboot failures. Additionally log all the failures. DPUs will be in DPU_READY state, if the reboot happened successfully.
+
+* Upon successful reboot, the NPU resumes operation. As part of the post-reboot process, the NPU may choose to rescan the PCI devices. This rescan operation,


@vvolam In which context or service does this rescan happen?

prgeor · 2024-10-07T02:17:56Z

doc/smart-switch/reboot/reboot-hld.md

+
+* With the DPUs prepared for reboot, the NPU triggers a platform vendor API to initiate the reboot process for the DPUs. Vendor API reboots a single DPU, but the NPU spawns multiple threads to reboot DPUs in parallel. If any of the DPUs is stuck or unresponsive, the DPU reboot platform API should attempt a cold boot or power cycle to recover it.
+
+* DPUs will send an acknowledgment to the NPU and then undergo a reboot. After receiving the acknowledgment from the DPUs, the NPU will proceed to reboot itself to complete the overall reboot procedure. The vendor-specific reboot API should include an error handling mechanism to manage DPU reboot failures. Additionally log all the failures. DPUs will be in DPU_READY state, if the reboot happened successfully.


@vvolam can you elaborate on this ack: "DPUs will send an acknowledgment to the NPU and then undergo a reboot"? How is this implemented?
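The "one vendor call per DPU, several threads on the NPU" part of the hunk could look roughly like the sketch below. `reboot_one` is a hypothetical callable wrapping the vendor platform API for a single DPU and is assumed to raise on failure; the acknowledgment mechanism asked about above is not modeled:

```python
import logging
from concurrent.futures import ThreadPoolExecutor

logger = logging.getLogger("dpu_reboot")


def reboot_dpus_in_parallel(dpu_names, reboot_one):
    """Reboot each DPU on its own worker thread and return the failures.

    reboot_one is a hypothetical wrapper around the vendor platform API;
    it reboots a single DPU and raises an exception on failure.
    """
    failures = {}
    if not dpu_names:
        return failures
    with ThreadPoolExecutor(max_workers=len(dpu_names)) as pool:
        futures = {name: pool.submit(reboot_one, name) for name in dpu_names}
        for name, future in futures.items():
            try:
                future.result()
            except Exception as exc:
                # Log every failure, as the HLD text asks.
                logger.error("Reboot of %s failed: %s", name, exc)
                failures[name] = exc
    return failures
```

Returning the failures instead of raising lets the NPU decide whether to continue with its own reboot or attempt recovery first.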

prgeor · 2024-10-07T02:26:16Z

doc/smart-switch/reboot/reboot-hld.md

+| Unplanned Smart Switch System Crash       | Ungraceful reboot   | Ungraceful reboot   |
+| Unplanned DPU System Crash                | -                   | Ungraceful reboot   |
+
+The test scenarios above ensure that both the NPU and all DPUs are fully operational following any type of reboot. Furthermore, the tests verify the


@vvolam a successful reboot should verify the following:

* DPUs that were UP before the reboot MUST come up successfully.
* DPUs that were admin down MUST remain admin down after the reboot.
* The reboot cause of the DPUs and the NPU host should indicate that the reboot was initiated by the USER.
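These checks could be expressed as a small helper in a test, sketched below. The state strings (`"up"`, `"admin_down"`), the dictionary layout, and the substring match on the reboot cause are all illustrative assumptions, not part of the test plan:

```python
def check_reboot_outcome(before, after, reboot_causes, expected_cause="USER"):
    """Return a list of violations; an empty list means the reboot passed.

    before/after map DPU name -> "up" or "admin_down". reboot_causes maps
    each device name (DPUs and the NPU host) -> reported cause string.
    """
    problems = []
    for dpu, state in before.items():
        # UP DPUs must come back up; admin-down DPUs must stay admin down.
        if after.get(dpu) != state:
            problems.append(f"{dpu}: was {state}, now {after.get(dpu)}")
    for device, cause in reboot_causes.items():
        if expected_cause not in cause.upper():
            problems.append(f"{device}: unexpected reboot cause {cause!r}")
    return problems
```

Collecting all violations before failing gives a single report per reboot rather than stopping at the first mismatch.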

vvolam marked this pull request as ready for review May 16, 2024 23:15

vvolam requested review from oleksandrivantsiv, rameshraghupathy, prgeor, r12f and dgsudharsan May 16, 2024 23:17

isabelmsft self-requested a review May 20, 2024 23:31

isabelmsft reviewed May 21, 2024

vvolam force-pushed the reboot-hld branch from 47fa43b to 829b510 Compare May 29, 2024 22:54

vvolam requested review from qiluo-msft and ganglyu May 30, 2024 15:15

rameshraghupathy reviewed May 30, 2024

vvolam force-pushed the reboot-hld branch from c45142a to 075d745 Compare June 10, 2024 23:48

isabelmsft reviewed Jun 11, 2024

oleksandrivantsiv mentioned this pull request Jun 14, 2024

Smartswitch Platform Test Plan Document sonic-net/sonic-mgmt#12701

Merged

ganglyu previously approved these changes Jun 17, 2024

isabelmsft previously approved these changes Jun 18, 2024

oleksandrivantsiv reviewed Jun 24, 2024

oleksandrivantsiv approved these changes Jun 24, 2024

oleksandrivantsiv mentioned this pull request Jun 25, 2024

PMON Test Plan sonic-net/sonic-mgmt#13200

Open

oleksandrivantsiv reviewed Jun 26, 2024

vvolam dismissed stale reviews from isabelmsft and ganglyu via 1934915 June 26, 2024 19:00

oleksandrivantsiv suggested changes Jun 28, 2024

oleksandrivantsiv reviewed Jul 2, 2024

vvolam force-pushed the reboot-hld branch 2 times, most recently from 1c9a020 to 7d67e25 Compare July 30, 2024 01:16

rameshraghupathy reviewed Aug 5, 2024

rameshraghupathy reviewed Aug 6, 2024

vvolam force-pushed the reboot-hld branch from 7d67e25 to 8b70aab Compare August 6, 2024 19:26

oleksandrivantsiv approved these changes Aug 9, 2024

vvolam mentioned this pull request Sep 11, 2024

Added schema for health_info, reboot_cause on chassisStateDB and added the link to pmon-test-plan #1709

Open

vvolam mentioned this pull request Sep 24, 2024

Added new Platform APIs and modified APIs for supporting reboot on a SmartSwitch sonic-net/sonic-platform-common#501

Merged

vvolam added 11 commits September 24, 2024 21:53

* 46d8e0a Smart Switch reboot high level design (This is initial draft)
* 2af6880 Update HLD with modified APIs and images
* 7539f7a Minor update to test plan
* 24c47fb Minor changes based on discussion with the community
* 94dec18 Address review comments
* c050f48 Minor correction to pci rescan information
* a0c9412 Update reboot mechanism of the DPU and pcie daemon changes
* 6b165b2 Minor changes
* a37115c Minor changes
* 442e8a7 Made a minor change to dup_id based on get_dpu_id() update in sonic-net/sonic-platform-common#454
* f7ca496 Add some enhancements

vvolam force-pushed the reboot-hld branch from 26f3f4e to f7ca496 Compare September 24, 2024 22:44

Minor change to new APIs

605c3a5

vvolam mentioned this pull request Oct 5, 2024

Enhance PCIe device check to skip the warning log, if device is in detaching mode sonic-net/sonic-platform-daemons#546

Draft

prgeor reviewed Oct 6, 2024

prgeor reviewed Oct 6, 2024

prgeor reviewed Oct 7, 2024
