-
Notifications
You must be signed in to change notification settings - Fork 1.1k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Added schema for health_info, reboot_cause on chassisStateDB and added the link to pmon-test-plan #1709
base: master
Are you sure you want to change the base?
Added schema for health_info, reboot_cause on chassisStateDB and added the link to pmon-test-plan #1709
Changes from 13 commits
bd36cff
98e787b
b8b02c1
7918edf
1ae2222
a14fd88
f325667
0fdd737
aa10015
b7b341a
65f9f33
064ce29
7ae06d6
82f01e1
0a47f51
7d5e14e
962a5e1
5926a44
58130a7
c6b62cc
e734c2f
fa6af05
File filter
Filter by extension
Conversations
Jump to
Diff view
Diff view
There are no files selected for viewing
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -5,6 +5,7 @@ | |
| 0.1 | 12/02/2023 | Ramesh Raghupathy | Initial version| | ||
| 0.2 | 01/08/2024 | Ramesh Raghupathy | Updated API, CPI sections and addressed review comments | | ||
| 0.3 | 02/26/2024 | Ramesh Raghupathy | Addressed review comments | | ||
| 0.4 | 06/06/2024 | Ramesh Raghupathy | Added schema for DPU health-info and called out phase:1 and phase:2 activities for DPU health-info. Added key suffix to module reboot-cause to avoid key conflicts | | ||
|
||
## Definitions / Abbreviations | ||
|
||
|
@@ -195,7 +196,9 @@ Key: "CHASSIS_MODULE|DPU0" | |
* Health | ||
* SmartSwitch DPUs should store their health data locally and also provide it to the host for a consolidated view of the CLIs | ||
* DPUs should support a CLI to display the health data “show system-health ...” (See CLIs section) | ||
* The host pmon should use this data to support the host side CLIs. Though accessing this data from DPUs and storing them on the switch is implementation specific it is recommended to use redis call and store them on the switch chassisStateDB for faster access. use "UserDefinedChecker" class to provide this data to the CLIs. | ||
* The host pmon should use this data to support the host side CLIs. Though accessing this data from DPUs and storing them on the switch is implementation specific it is recommended to use redis call and store them on the switch chassisStateDB for faster access. | ||
* This is done in two phases. Please refer to section:3.1.5.1 for the HEALTH_INFO schema | ||
* use "UserDefinedChecker" class to provide this data to the CLIs. | ||
* Vendor specific data such as interrupt events can also be placed in user defined fields under this DB | ||
* This table already exists in modular chassis design and the DPUs will use this just like a line card. | ||
* Alarm and Syslog | ||
|
@@ -219,7 +222,7 @@ SmartSwitch PMON block diagram | |
|
||
### 3.1. Platform monitoring and management | ||
* SmartSwitch design Extends the existing chassis_base class and module_base class as described below. | ||
* Extend MODULE_TYPE in ModuleBase class with MODULE_TYPE_DPU and MODULE_TYPE_SWITCH to support SmartSwitch | ||
* Extend MODULE_TYPE in ModuleBase class with MODULE_TYPE_DPU to support SmartSwitch | ||
|
||
#### 3.1.1 ChassisBase class API enhancements | ||
is_modular_chassis(self): | ||
|
@@ -280,7 +283,7 @@ get_dpu_id(self, name): | |
Retrieves the DPU ID for the given dpu-module name. Returns None for non-smartswitch chassis. | ||
|
||
Returns: | ||
An integer, indicating the DPU ID Ex: name:DPU0 return value 1, name:DPU1 return value 2, name:DPUX return value X+1 | ||
An integer, indicating the DPU ID Ex: name:DPU0 return value 0, name:DPU1 return value 1, name:DPUX return value X | ||
``` | ||
|
||
is_smartswitch(self): | ||
|
@@ -306,27 +309,19 @@ get_module_dpu_data_port(self, index): | |
#### 3.1.3 NPU to DPU data port mapping | ||
platform.json of NPU/switch will show the NPU to DPU data port mapping. This will be used by services early in the system boot. | ||
``` | ||
{ | ||
"DPUs" : [ | ||
{ | ||
"dpu0": { | ||
"interface": {"Ethernet224": "Ethernet0"} | ||
} | ||
}, | ||
"DPUS": [ | ||
{ | ||
"dpu1": { | ||
"interface": {"Ethernet232": "Ethernet0"} | ||
"dpu0": { | ||
"interface": {"Ethernet224": "Ethernet0"} | ||
}, | ||
"dpu1": { | ||
"interface": {"Ethernet232": "Ethernet0"} | ||
}, | ||
"dpux": { | ||
"interface": {"Ethernet2xx": "Ethernet0"} | ||
}, | ||
}, | ||
. | ||
. | ||
{ | ||
"dpuX": { | ||
"interface": {"EthernetX": "EthernetY"} | ||
} | ||
} | ||
] | ||
} | ||
``` | ||
#### 3.1.4 ModuleBase class API enhancements | ||
rameshraghupathy marked this conversation as resolved.
Show resolved
Hide resolved
|
||
get_base_mac(self): | ||
|
@@ -367,7 +362,7 @@ get_type(self): | |
|
||
Returns: | ||
A string, the module-type from one of the predefined types: | ||
MODULE_TYPE_SWITCH, MODULE_TYPE_DPU | ||
MODULE_TYPE_DPU | ||
``` | ||
|
||
get_oper_status(self): | ||
|
@@ -436,34 +431,51 @@ is_midplane_reachable(self): | |
#### 3.1.5 ModuleBase class new APIs | ||
|
||
##### 3.1.5.1 Need for consistent storage and access of DPU reboot cause, state and health | ||
#### Reboot Cause | ||
1. The smartswitch needs to know the reboot cause for DPUs. Please refer to the CLI section for the various options and their effects when executed on the switch and DPUs. | ||
|
||
* Each DPU will update its reboot cause history in the Switch ChasissStateDB upon boot up. The recent reboot-cause can be derived from that list of reboot-causes. | ||
* The get_reboot_cause will return the current reboot-cause of the module. | ||
* For persistent storage of the DPU reboot-cause and reboot-caue-history files use the existing host storage path and mechanism. | ||
|
||
#### Schema for REBOOT_CAUSE - switch stateDB | ||
#### Schema for REBOOT_CAUSE of SWITCH on switch stateDB | ||
``` | ||
Key: "REBOOT_CAUSE|2023_06_18_14_56_12" | ||
|
||
"REBOOT_CAUSE|2023_06_18_14_56_12": { | ||
"value": { | ||
"cause": "REBOOT_CAUSE_HOST_RESET_DPU", | ||
"cause": "Unknown", | ||
"comment": "N/A", | ||
"device": "DPU5", | ||
"time": "2023_06_18_14_56_12", | ||
"user": "N/A" | ||
} | ||
}, | ||
} | ||
|
||
``` | ||
#### Schema for REBOOT_CAUSE of DPUs on switch ChassisStateDB | ||
``` | ||
Key: "REBOOT_CAUSE|DPU0|2024_06_06_09_31_18" | ||
|
||
"REBOOT_CAUSE|DPU0|2024_06_06_09_31_18": { | ||
"value": { | ||
"cause": "Software causes (Reboot)", | ||
"comment": "User issued 'reboot' command [User: admin, Time: Thu Jun 6 09:46:43 AM UTC 2024]", | ||
"device": "DPU0", | ||
"time": "N/A", | ||
"user": "N/A" | ||
} | ||
} | ||
|
||
``` | ||
#### DPU State | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. @rameshraghupathy can you please specify that this update is done by Chassisd inside PMON. We don't need DPU specific agent to fetch these ? There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more.
Updated: |
||
2. Though the get_oper_status(self) can get the operational status of the DPU Modules, the current implementation only has limited capabilities. | ||
* Can only state MODULE_STATUS_FAULT and can't show exactly where in the state progression the DPU failed. This is critical in fault isolation, DPU switchover decision, resiliency and recovery | ||
* Though this is platform implementation specific, in a multi vendor use case, there has to be a consistent way of storing and accessing the information. | ||
* Store the state progression (dpu_midplane_link_state, dpu_control_plane_state, dpu_data_plane_state) on the host ChassisStateDB. | ||
* get_state_info(self) will return an object with the ChassisStateDB data | ||
* Potential consumers: HA, LB, Switch CLIs, Utils (install/repair images), Life Cycle Manager | ||
* Use cases: HA, Debuggability, error recovery (reset, power cycle) and fault management, consolidated view of Switch and DPU state | ||
|
||
#### DPU_STATE definition | ||
dpu_midplane_link_state: up refers to the pcie link between the NPU and DPU is operational. This will be updated by the switch pcied. | ||
|
||
|
@@ -490,13 +502,40 @@ dpu_data_plane_state: up refers to configuration downloaded, the pipeline stage | |
”dpu_data_plane_time": ”timestamp", | ||
”dpu_data_plane_reason": ”Pipeline failure", | ||
``` | ||
|
||
3. Each DPU has to store the health data in its local DB and should provide it to the switch. | ||
* When the "show system-health ..." CLI is executed on the switch, the "UserDefinedChecker" class will collect the data and feed it to the CLI. It is up to the platform on how this is done. However, for faster access store it in the switch ChassisStateDB. | ||
* The DPU is a complex hardware, to facilitate debug, a consistent way of storing and accessing the health record of the DPUs is critical in a multi vendor scenario even though it is a platform specific implementation. | ||
* Both switch and the DPUs will follow to the [SONiC system health monitor HLD](https://github.com/sonic-net/SONiC/blob/ce313db92a694e007a6c5332ce3267ac158290f6/doc/system_health_monitoring/system-health-HLD.md) | ||
#### DPU Health | ||
3. This feature is implemented in two phases. | ||
#### Phase:1 | ||
* Each DPU has to store the health info locally and should be available on the DPU when the "show system-health ..." CLI is executed on the DPU just like the switch. | ||
#### Phase:2 | ||
* Each DPU besides storing the health info locally, should also store the DPU health info in the switch ChassisStateDB. The schema for each DPU health info is the same as the switch and also is shown below. | ||
* When the "show system-health <all/DPUx/SWITCH>" CLI is executed on the switch a consolidated view of the entire system health will be provided. | ||
* The DPU is a complex hardware. To facilitate debug, a consistent way of storing and accessing the health record of the DPUs is critical in a multi vendor scenario even though it is a platform specific implementation. | ||
* Both switch and the DPUs will follow the [SONiC system health monitor HLD](https://github.com/sonic-net/SONiC/blob/ce313db92a694e007a6c5332ce3267ac158290f6/doc/system_health_monitoring/system-health-HLD.md) | ||
* Refer to section 3.4.5 for "show system-health .." CLIs | ||
|
||
#### Schema for HEALTH_INFO of DPUs on switch ChassisStateDB | ||
rameshraghupathy marked this conversation as resolved.
Show resolved
Hide resolved
|
||
``` | ||
; Defines information for a system health | ||
key = SYSTEM_HEALTH_INFO|DPUx ; health information for DPUx | ||
; field = value | ||
summary = STRING ; summary status for the DPU | ||
<item_name> = STRING ; an entry for a service or device | ||
|
||
``` | ||
We store items to db only if it is abnormal. Here is an example: | ||
``` | ||
admin@sonic:~$ redis-cli -n 13 hgetall SYSTEM_HEALTH_INFO | ||
1) "lldp:lldpmgrd" | ||
2) "Process 'lldpmgrd' in container 'lldp' is not running" | ||
3) "summary" | ||
4) "Not OK" | ||
``` | ||
If the system status is good, the data in redis is like: | ||
``` | ||
admin@sonic:~$ redis-cli -n 13 hgetall SYSTEM_HEALTH_INFO | ||
1) "summary" | ||
2) "OK" | ||
``` | ||
##### 3.1.5.2 ModuleBase class new APIs | ||
The DPU ID is used only for indexing purpose. | ||
|
||
|
@@ -505,8 +544,7 @@ get_dpu_id(self): | |
Retrieves the DPU ID. Returns None for non-smartswitch chassis. | ||
|
||
Returns: | ||
An integer, indicating the DPU ID. DPU0 returns 1, DPUX returns X+1 | ||
Returns '0' on switch module | ||
An integer, indicating the DPU ID. DPU0 returns 0, DPUX returns X | ||
``` | ||
#### Get DPU reboot cause | ||
def get_reboot_cause(self): | ||
|
@@ -530,43 +568,7 @@ get_state_info(self): | |
``` | ||
|
||
#### DPU_HEALTH Use Case | ||
* The major consumer of this data could be CLIs, fault management, debug, error recovery | ||
|
||
get_health_info(self): | ||
``` | ||
Retrieves the dpu health object having the detailed dpu health Fetched from the DPUs | ||
|
||
Returns: | ||
An object instance of the dpu health. | ||
Returns None when the module is SWITCH | ||
|
||
Example: | ||
{ | ||
"led_status": "green", | ||
|
||
"monitoredlists": { | ||
"Program": [ | ||
{"Name": "routeCheck", "Status": "Not OK", "Type": "Program"}, | ||
// Add more program items here | ||
], | ||
"Service": [ | ||
{"Name": "mgmt-framework", "Status": "Not OK", "Type": "Service"}, | ||
// Add more service items here | ||
], | ||
"Fan": [ | ||
{"Name": "Fan", "Status": "Not OK", "Type": "Fan"} | ||
], | ||
"UserDefined": [ | ||
// Add user-defined items here | ||
] | ||
}, | ||
|
||
"ignore_list": [ | ||
{"Name": "example1", "Status": "OK", "Type": "Type1"}, | ||
// Add more items to ignore list | ||
] | ||
} | ||
``` | ||
* The major consumer of this data could be CLIs, fault management, debug, error recovery. There is no platform API for this. | ||
|
||
### 3.2 Thermal management | ||
* Platform initializes all sensors | ||
|
@@ -606,6 +608,8 @@ A typical modular chassis includes a midplane-interface to interconnect the Supe | |
* By default smartswitch midplane IP address assignment will be done using internal DHCP. | ||
* Please refer to the [ip-address-assignment document](https://github.com/sonic-net/SONiC/blob/master/doc/smart-switch/ip-address-assigment/smart-switch-ip-address-assignment.md) for IP address assignment between the switch host and the DPUs. | ||
* The second option is the static IP address assignment. | ||
* A DPU state change handler will be implemented to monitor PCIe link state change events, DPU control-plane and data-plane state transitions mainly for HA. | ||
* There will be a separate hld for the DPU state change handler. | ||
|
||
### 3.4 Debug & RMA | ||
CLI Extensions and Additions | ||
|
@@ -676,26 +680,10 @@ fantray0 N/A fantray0.fan 55% intake Present OK 20230 | |
fantray1 N/A fantray1.fan 56% intake Present OK 20230728 06:41:17 | ||
``` | ||
|
||
#### 3.4.1 Reboot Cause | ||
#### 3.4.1 Reboot Cause CLIs | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. @rameshraghupathy Please specify which entity or service will update the chassisStateDB There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. The PMON on the DPU side will responsible to update the switch side chassisStateDB on DPU boot up, using the push model specified in section: 3.2.4 of SONiC Chassis Platform Management & Monitoring HLD There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. @rameshraghupathy understood its PMON. Which agent/daemon inside pmon? There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. @prgeor "Though how DPU pmon updates this is vendor dependent, it is recommended to use the sonic telemetry agent to align with the existing SONiC implementation." |
||
* There are two CLIs "show reboot-cause" and "show reboot-cause history" which are applicable to both DPUs and the Switch. However, when executed on the Switch the CLIs provide a consolidated view of reboot cause as shown below. | ||
* Each DPU will update its reboot cause history in the Switch ChasissStateDB upon boot up. The recent reboot-cause can be derived from that list of reboot-causes. | ||
* The switch side PMON will copy this into the stateDB so that the existing workflow will not be affected. | ||
* The get_reboot_cause API will return the current reboot-cause of the module. | ||
* Each DPU will update its reboot cause history in the Switch ChasissStateDB upon boot up. The DPUs will limit the number of history entries to a maximum of ten. The recent reboot-cause can be derived from that list of reboot-causes. Platforms which are not capable of populating the ChasissStateDB can use the "get_reboot_cause" API to fetch the data from the DPUs. The trigger to activate the API will eventually come from the DPU state change handler. | ||
|
||
#### REBOOT_CAUSE DB schema | ||
``` | ||
Key: "REBOOT_CAUSE|2023_06_18_14_56_12" | ||
|
||
"REBOOT_CAUSE|2023_06_18_14_56_12": { | ||
"value": { | ||
"cause": "REBOOT_CAUSE_HOST_RESET_DPU", | ||
"comment": "N/A", | ||
"device": "DPU5", | ||
"time": "2023_06_18_14_56_12", | ||
"user": "N/A" | ||
} | ||
} | ||
``` | ||
#### 3.4.2 Reboot Cause CLIs on the DPUs <font>**`Executed on the DPU`**</font> | ||
* The "show reboot-cause" shows the most recent reboot-cause | ||
* The "show reboot-cause history" shows the reboot-cause history | ||
|
@@ -718,17 +706,32 @@ Name Cause Time | |
* The "show reboot-cause history" CLI on the switch shows the history of the Switch and all DPUs | ||
* The "show reboot-cause history module-name" CLI on the switch shows the history of the specified module | ||
|
||
"show reboot-cause history" <font>**`Executed on the switch`**</font> | ||
``` | ||
root@sonic:~#show reboot-cause | ||
|
||
Name Cause Time User Comment | ||
|
||
2023_10_20_18_52_28 Watchdog:1 expired; Wed 20 Oct 2023 06:52:28 PM UTC N/A N/A | ||
|
||
|
||
root@sonic:~#show reboot-cause history | ||
|
||
Name Cause Time User Comment | ||
|
||
2023_10_20_18_52_28 Watchdog:1 expired; Wed 20 Oct 2023 06:52:28 PM UTC N/A N/A | ||
2023_10_05_18_23_46 reboot Wed 05 Oct 2023 06:23:46 PM UTC user N/A | ||
|
||
|
||
root@sonic:~#show reboot-cause all | ||
|
||
Device Name Cause Time User Comment | ||
|
||
switch 2023_10_20_18_52_28 Watchdog:1 expired; Wed 20 Oct 2023 06:52:28 PM UTC N/A N/A | ||
DPU3 2023_10_03_18_23_46 Watchdog: stage 1 expired; Mon 03 Oct 2023 06:23:46 PM UTC N/A N/A | ||
DPU2 2023_10_02_17_20_46 reboot Sun 02 Oct 2023 05:20:46 PM UTC admin User issued 'reboot' | ||
|
||
root@sonic:~#show reboot-cause history | ||
|
||
root@sonic:~#show reboot-cause history all | ||
|
||
Device Name Cause Time User Comment | ||
|
||
|
@@ -761,7 +764,12 @@ DPU1 SS-DPU1 2 Online up SN20 | |
SWITCH Chassis 0 Online N/A FLM27000ER | ||
``` | ||
#### 3.4.5 System health details | ||
* The system health summary on NPU should include the DPU health. Extend the existing infrastructure. | ||
#### Phase:1 | ||
* The system health summary on switch will display only the NPU health | ||
* The system health summary on DPU will display the DPU health | ||
|
||
#### Phase:2 | ||
* The system health summary on switch should include the NPU and DPU health. Extend the existing CLI infrastructure. | ||
|
||
show system-health summary \<module-name\> <font>**`Executed on the switch or DPU - module-name is ignored on the DPUs`**</font> | ||
``` | ||
|
@@ -798,7 +806,7 @@ Online : All states are up | |
Offline: dpu_midplane_link_state is down | ||
Partial Online: dpu_midplane_link_state is up and dpu_control_plane_state or dpu_data_plane_state is down | ||
|
||
There are two parts to the state detail. 1. The midplane state 2. the dpu states (booted, control plane state, data plane state). The midplane state has to be updated by the switch side pcied. The dpu states will be updated by the DPU (redis client update) on the switch ChassisStateDB. The get_state_info() API in the moduleBase class will fetch the contents from the DB. The show CLI reads the redis table and displays the data. | ||
There are two parts to the state detail. 1. The midplane state 2. the dpu states (control plane state, data plane state). The midplane state has to be updated by the switch side pcied. The dpu states will be updated by the DPU (redis client update) on the switch ChassisStateDB. The get_state_info() API in the moduleBase class will fetch the contents from the DB. The show CLI reads the redis table and displays the data. | ||
root@sonic:~#show system-health DPU all | ||
|
||
Name ID Oper-Status State-Detail State-Value Time Reason | ||
|
@@ -1022,4 +1030,4 @@ Note: | |
``` | ||
|
||
## 4. Test Plan | ||
In Progress | ||
[Test Plan](https://github.com/nissampa/sonic-mgmt_dpu_test/blob/dpu_test_plan_draft_pr/docs/testplan/Smartswitch-test-plan.md) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@rameshraghupathy In section 3.2 can you specify if the thermal management is in NPU or DPU?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@prgeor Updated. It runs on the NPU.