Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Added schema for health_info, reboot_cause on chassisStateDB and added the link to pmon-test-plan #1709

Open
wants to merge 22 commits into
base: master
Choose a base branch
from
Open
Changes from 13 commits
Commits
Show all changes
22 commits
Select commit Hold shift + click to select a range
bd36cff
Added schema for reboot-cause and health-info chassisStateDB
rameshraghupathy Jun 6, 2024
98e787b
Fixed formatting
rameshraghupathy Jun 6, 2024
b8b02c1
Fixed formatting
rameshraghupathy Jun 6, 2024
7918edf
Added TestPlan link
rameshraghupathy Jun 6, 2024
1ae2222
Did some cleanup
rameshraghupathy Jun 10, 2024
a14fd88
Updated reboot-cause schema, system-health-info schema and removed the
rameshraghupathy Jun 12, 2024
f325667
updated the system-health info schema
rameshraghupathy Jun 12, 2024
0fdd737
Removed the get_health_info API and addressed a couple of minor comments
rameshraghupathy Jul 8, 2024
aa10015
Fixed reboot-cause reboot-cause history and the all option outputs
rameshraghupathy Jul 13, 2024
b7b341a
Updated rev 0.4
rameshraghupathy Jul 28, 2024
65f9f33
cleaned NPU to DPU data port mapping
rameshraghupathy Jul 29, 2024
064ce29
dpu_id now ranges from 0
rameshraghupathy Aug 4, 2024
7ae06d6
Changed the test-plan link
rameshraghupathy Aug 5, 2024
82f01e1
Added a section for DPU dark mode
rameshraghupathy Aug 6, 2024
0a47f51
Did a minor cleanup in 2.1.1
rameshraghupathy Aug 6, 2024
7d5e14e
replaced platform.json with hwsku.json file to represent the NPU-DPU
rameshraghupathy Aug 9, 2024
962a5e1
platform.json is the right file for NPU-DPU port association
rameshraghupathy Aug 10, 2024
5926a44
Removed get_module_dpu_data_port API. Showed hwo the role in interfaces
rameshraghupathy Aug 20, 2024
58130a7
Addressed review comments
rameshraghupathy Aug 31, 2024
c6b62cc
Added reboot hld link and updated the dpu shutdown sequence as per that
rameshraghupathy Sep 11, 2024
e734c2f
Addressed review comments
rameshraghupathy Oct 3, 2024
fa6af05
Addressed a couple of review comments
rameshraghupathy Oct 3, 2024
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
192 changes: 100 additions & 92 deletions doc/smart-switch/pmon/smartswitch-pmon.md
Copy link
Contributor

@prgeor prgeor Oct 3, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@rameshraghupathy In section 3.2 can you specify if the thermal management is in NPU or DPU?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@prgeor Updated. It runs on the NPU.

Original file line number Diff line number Diff line change
Expand Up @@ -5,6 +5,7 @@
| 0.1 | 12/02/2023 | Ramesh Raghupathy | Initial version|
| 0.2 | 01/08/2024 | Ramesh Raghupathy | Updated API, CPI sections and addressed review comments |
| 0.3 | 02/26/2024 | Ramesh Raghupathy | Addressed review comments |
| 0.4 | 06/06/2024 | Ramesh Raghupathy | Added schema for DPU health-info and called out phase:1 and phase:2 activities for DPU health-info. Added key suffix to module reboot-cause to avoid key conflicts |

## Definitions / Abbreviations

Expand Down Expand Up @@ -195,7 +196,9 @@ Key: "CHASSIS_MODULE|DPU0"
* Health
* SmartSwitch DPUs should store their health data locally and also provide it to the host for a consolidated view of the CLIs
* DPUs should support a CLI to display the health data “show system-health ...” (See CLIs section)
* The host pmon should use this data to support the host side CLIs. Though accessing this data from DPUs and storing them on the switch is implementation specific it is recommended to use redis call and store them on the switch chassisStateDB for faster access. use "UserDefinedChecker" class to provide this data to the CLIs.
* The host pmon should use this data to support the host side CLIs. Though accessing this data from DPUs and storing them on the switch is implementation specific it is recommended to use redis call and store them on the switch chassisStateDB for faster access.
* This is done in two phases. Please refer to section:3.1.5.1 for the HEALTH_INFO schema
* use "UserDefinedChecker" class to provide this data to the CLIs.
* Vendor specific data such as interrupt events can also be placed in user defined fields under this DB
* This table already exists in modular chassis design and the DPUs will use this just like a line card.
* Alarm and Syslog
Expand All @@ -219,7 +222,7 @@ SmartSwitch PMON block diagram

### 3.1. Platform monitoring and management
* SmartSwitch design Extends the existing chassis_base class and module_base class as described below.
* Extend MODULE_TYPE in ModuleBase class with MODULE_TYPE_DPU and MODULE_TYPE_SWITCH to support SmartSwitch
* Extend MODULE_TYPE in ModuleBase class with MODULE_TYPE_DPU to support SmartSwitch

#### 3.1.1 ChassisBase class API enhancements
is_modular_chassis(self):
Expand Down Expand Up @@ -280,7 +283,7 @@ get_dpu_id(self, name):
Retrieves the DPU ID for the given dpu-module name. Returns None for non-smartswitch chassis.

Returns:
An integer, indicating the DPU ID Ex: name:DPU0 return value 1, name:DPU1 return value 2, name:DPUX return value X+1
An integer, indicating the DPU ID Ex: name:DPU0 return value 0, name:DPU1 return value 1, name:DPUX return value X
```

is_smartswitch(self):
Expand All @@ -306,27 +309,19 @@ get_module_dpu_data_port(self, index):
#### 3.1.3 NPU to DPU data port mapping
platform.json of NPU/switch will show the NPU to DPU data port mapping. This will be used by services early in the system boot.
```
{
"DPUs" : [
{
"dpu0": {
"interface": {"Ethernet224": "Ethernet0"}
}
},
"DPUS": [
{
"dpu1": {
"interface": {"Ethernet232": "Ethernet0"}
"dpu0": {
"interface": {"Ethernet224": "Ethernet0"}
},
"dpu1": {
"interface": {"Ethernet232": "Ethernet0"}
},
"dpux": {
"interface": {"Ethernet2xx": "Ethernet0"}
},
},
.
.
{
"dpuX": {
"interface": {"EthernetX": "EthernetY"}
}
}
]
}
```
#### 3.1.4 ModuleBase class API enhancements
rameshraghupathy marked this conversation as resolved.
Show resolved Hide resolved
get_base_mac(self):
Expand Down Expand Up @@ -367,7 +362,7 @@ get_type(self):

Returns:
A string, the module-type from one of the predefined types:
MODULE_TYPE_SWITCH, MODULE_TYPE_DPU
MODULE_TYPE_DPU
```

get_oper_status(self):
Expand Down Expand Up @@ -436,34 +431,51 @@ is_midplane_reachable(self):
#### 3.1.5 ModuleBase class new APIs

##### 3.1.5.1 Need for consistent storage and access of DPU reboot cause, state and health
#### Reboot Cause
1. The smartswitch needs to know the reboot cause for DPUs. Please refer to the CLI section for the various options and their effects when executed on the switch and DPUs.

* Each DPU will update its reboot cause history in the Switch ChasissStateDB upon boot up. The recent reboot-cause can be derived from that list of reboot-causes.
* The get_reboot_cause will return the current reboot-cause of the module.
* For persistent storage of the DPU reboot-cause and reboot-caue-history files use the existing host storage path and mechanism.

#### Schema for REBOOT_CAUSE - switch stateDB
#### Schema for REBOOT_CAUSE of SWITCH on switch stateDB
```
Key: "REBOOT_CAUSE|2023_06_18_14_56_12"

"REBOOT_CAUSE|2023_06_18_14_56_12": {
"value": {
"cause": "REBOOT_CAUSE_HOST_RESET_DPU",
"cause": "Unknown",
"comment": "N/A",
"device": "DPU5",
"time": "2023_06_18_14_56_12",
"user": "N/A"
}
},
}

```
#### Schema for REBOOT_CAUSE of DPUs on switch ChassisStateDB
```
Key: "REBOOT_CAUSE|DPU0|2024_06_06_09_31_18"

"REBOOT_CAUSE|DPU0|2024_06_06_09_31_18": {
"value": {
"cause": "Software causes (Reboot)",
"comment": "User issued 'reboot' command [User: admin, Time: Thu Jun 6 09:46:43 AM UTC 2024]",
"device": "DPU0",
"time": "N/A",
"user": "N/A"
}
}

```
#### DPU State
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@rameshraghupathy can you please specify that this update is done by Chassisd inside PMON. We don't need DPU specific agent to fetch these ?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@rameshraghupathy can you please specify that this update is done by Chassisd inside PMON. We don't need DPU specific agent to fetch these ?

Updated:
Store the state progression (dpu_midplane_link_state, dpu_control_plane_state, dpu_data_plane_state) on the host ChassisStateDB using the push model specified in section: 3.2.4 of SONiC Chassis Platform Management & Monitoring HLD
@prgeor

2. Though the get_oper_status(self) can get the operational status of the DPU Modules, the current implementation only has limited capabilities.
* Can only state MODULE_STATUS_FAULT and can't show exactly where in the state progression the DPU failed. This is critical in fault isolation, DPU switchover decision, resiliency and recovery
* Though this is platform implementation specific, in a multi vendor use case, there has to be a consistent way of storing and accessing the information.
* Store the state progression (dpu_midplane_link_state, dpu_control_plane_state, dpu_data_plane_state) on the host ChassisStateDB.
* get_state_info(self) will return an object with the ChassisStateDB data
* Potential consumers: HA, LB, Switch CLIs, Utils (install/repair images), Life Cycle Manager
* Use cases: HA, Debuggability, error recovery (reset, power cycle) and fault management, consolidated view of Switch and DPU state

#### DPU_STATE definition
dpu_midplane_link_state: up refers to the pcie link between the NPU and DPU is operational. This will be updated by the switch pcied.

Expand All @@ -490,13 +502,40 @@ dpu_data_plane_state: up refers to configuration downloaded, the pipeline stage
”dpu_data_plane_time": ”timestamp",
”dpu_data_plane_reason": ”Pipeline failure",
```

3. Each DPU has to store the health data in its local DB and should provide it to the switch.
* When the "show system-health ..." CLI is executed on the switch, the "UserDefinedChecker" class will collect the data and feed it to the CLI. It is up to the platform on how this is done. However, for faster access store it in the switch ChassisStateDB.
* The DPU is a complex hardware, to facilitate debug, a consistent way of storing and accessing the health record of the DPUs is critical in a multi vendor scenario even though it is a platform specific implementation.
* Both switch and the DPUs will follow to the [SONiC system health monitor HLD](https://github.com/sonic-net/SONiC/blob/ce313db92a694e007a6c5332ce3267ac158290f6/doc/system_health_monitoring/system-health-HLD.md)
#### DPU Health
3. This feature is implemented in two phases.
#### Phase:1
* Each DPU has to store the health info locally and should be available on the DPU when the "show system-health ..." CLI is executed on the DPU just like the switch.
#### Phase:2
* Each DPU besides storing the health info locally, should also store the DPU health info in the switch ChassisStateDB. The schema for each DPU health info is the same as the switch and also is shown below.
* When the "show system-health <all/DPUx/SWITCH>" CLI is executed on the switch a consolidated view of the entire system health will be provided.
* The DPU is a complex hardware. To facilitate debug, a consistent way of storing and accessing the health record of the DPUs is critical in a multi vendor scenario even though it is a platform specific implementation.
* Both switch and the DPUs will follow the [SONiC system health monitor HLD](https://github.com/sonic-net/SONiC/blob/ce313db92a694e007a6c5332ce3267ac158290f6/doc/system_health_monitoring/system-health-HLD.md)
* Refer to section 3.4.5 for "show system-health .." CLIs

#### Schema for HEALTH_INFO of DPUs on switch ChassisStateDB
rameshraghupathy marked this conversation as resolved.
Show resolved Hide resolved
```
; Defines information for a system health
key = SYSTEM_HEALTH_INFO|DPUx ; health information for DPUx
; field = value
summary = STRING ; summary status for the DPU
<item_name> = STRING ; an entry for a service or device

```
We store items to db only if it is abnormal. Here is an example:
```
admin@sonic:~$ redis-cli -n 13 hgetall SYSTEM_HEALTH_INFO
1) "lldp:lldpmgrd"
2) "Process 'lldpmgrd' in container 'lldp' is not running"
3) "summary"
4) "Not OK"
```
If the system status is good, the data in redis is like:
```
admin@sonic:~$ redis-cli -n 13 hgetall SYSTEM_HEALTH_INFO
1) "summary"
2) "OK"
```
##### 3.1.5.2 ModuleBase class new APIs
The DPU ID is used only for indexing purpose.

Expand All @@ -505,8 +544,7 @@ get_dpu_id(self):
Retrieves the DPU ID. Returns None for non-smartswitch chassis.

Returns:
An integer, indicating the DPU ID. DPU0 returns 1, DPUX returns X+1
Returns '0' on switch module
An integer, indicating the DPU ID. DPU0 returns 0, DPUX returns X
```
#### Get DPU reboot cause
def get_reboot_cause(self):
Expand All @@ -530,43 +568,7 @@ get_state_info(self):
```

#### DPU_HEALTH Use Case
* The major consumer of this data could be CLIs, fault management, debug, error recovery

get_health_info(self):
```
Retrieves the dpu health object having the detailed dpu health Fetched from the DPUs

Returns:
An object instance of the dpu health.
Returns None when the module is SWITCH

Example:
{
  "led_status": "green",

  "monitoredlists": {
    "Program": [
      {"Name": "routeCheck", "Status": "Not OK", "Type": "Program"},
      // Add more program items here
    ],
    "Service": [
      {"Name": "mgmt-framework", "Status": "Not OK", "Type": "Service"},
      // Add more service items here
    ],
    "Fan": [
      {"Name": "Fan", "Status": "Not OK", "Type": "Fan"}
    ],
    "UserDefined": [
      // Add user-defined items here
    ]
  },

  "ignore_list": [
    {"Name": "example1", "Status": "OK", "Type": "Type1"},
    // Add more items to ignore list
  ]
}
```
* The major consumer of this data could be CLIs, fault management, debug, error recovery. There is no platform API for this.

### 3.2 Thermal management
* Platform initializes all sensors
Expand Down Expand Up @@ -606,6 +608,8 @@ A typical modular chassis includes a midplane-interface to interconnect the Supe
* By default smartswitch midplane IP address assignment will be done using internal DHCP.
* Please refer to the [ip-address-assignment document](https://github.com/sonic-net/SONiC/blob/master/doc/smart-switch/ip-address-assigment/smart-switch-ip-address-assignment.md) for IP address assignment between the switch host and the DPUs.
* The second option is the static IP address assignment.
* A DPU state change handler will be implemented to monitor PCIe link state change events, DPU control-plane and data-plane state transitions mainly for HA.
* There will be a separate hld for the DPU state change handler.

### 3.4 Debug & RMA
CLI Extensions and Additions
Expand Down Expand Up @@ -676,26 +680,10 @@ fantray0 N/A fantray0.fan 55% intake Present OK 20230
fantray1 N/A fantray1.fan 56% intake Present OK 20230728 06:41:17
```

#### 3.4.1 Reboot Cause
#### 3.4.1 Reboot Cause CLIs
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@rameshraghupathy Please specify which entity or service will update the chassisStateDB

Copy link
Contributor Author

@rameshraghupathy rameshraghupathy Oct 3, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The PMON on the DPU side will responsible to update the switch side chassisStateDB on DPU boot up, using the push model specified in section: 3.2.4 of SONiC Chassis Platform Management & Monitoring HLD
@prgeor

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@rameshraghupathy understood its PMON. Which agent/daemon inside pmon?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@prgeor "Though how DPU pmon updates this is vendor dependent, it is recommended to use the sonic telemetry agent to align with the existing SONiC implementation."

* There are two CLIs "show reboot-cause" and "show reboot-cause history" which are applicable to both DPUs and the Switch. However, when executed on the Switch the CLIs provide a consolidated view of reboot cause as shown below.
* Each DPU will update its reboot cause history in the Switch ChasissStateDB upon boot up. The recent reboot-cause can be derived from that list of reboot-causes.
* The switch side PMON will copy this into the stateDB so that the existing workflow will not be affected.
* The get_reboot_cause API will return the current reboot-cause of the module.
* Each DPU will update its reboot cause history in the Switch ChasissStateDB upon boot up. The DPUs will limit the number of history entries to a maximum of ten. The recent reboot-cause can be derived from that list of reboot-causes. Platforms which are not capable of populating the ChasissStateDB can use the "get_reboot_cause" API to fetch the data from the DPUs. The trigger to activate the API will eventually come from the DPU state change handler.

#### REBOOT_CAUSE DB schema
```
Key: "REBOOT_CAUSE|2023_06_18_14_56_12"

"REBOOT_CAUSE|2023_06_18_14_56_12": {
"value": {
"cause": "REBOOT_CAUSE_HOST_RESET_DPU",
"comment": "N/A",
"device": "DPU5",
"time": "2023_06_18_14_56_12",
"user": "N/A"
}
}
```
#### 3.4.2 Reboot Cause CLIs on the DPUs <font>**`Executed on the DPU`**</font>
* The "show reboot-cause" shows the most recent reboot-cause
* The "show reboot-cause history" shows the reboot-cause history
Expand All @@ -718,17 +706,32 @@ Name Cause Time
* The "show reboot-cause history" CLI on the switch shows the history of the Switch and all DPUs
* The "show reboot-cause history module-name" CLI on the switch shows the history of the specified module

"show reboot-cause history" <font>**`Executed on the switch`**</font>
```
root@sonic:~#show reboot-cause

Name Cause Time User Comment

2023_10_20_18_52_28 Watchdog:1 expired; Wed 20 Oct 2023 06:52:28 PM UTC N/A N/A


root@sonic:~#show reboot-cause history

Name Cause Time User Comment

2023_10_20_18_52_28 Watchdog:1 expired; Wed 20 Oct 2023 06:52:28 PM UTC N/A N/A
2023_10_05_18_23_46 reboot Wed 05 Oct 2023 06:23:46 PM UTC user N/A


root@sonic:~#show reboot-cause all

Device Name Cause Time User Comment

switch 2023_10_20_18_52_28 Watchdog:1 expired; Wed 20 Oct 2023 06:52:28 PM UTC N/A N/A
DPU3 2023_10_03_18_23_46 Watchdog: stage 1 expired; Mon 03 Oct 2023 06:23:46 PM UTC N/A N/A
DPU2 2023_10_02_17_20_46 reboot Sun 02 Oct 2023 05:20:46 PM UTC admin User issued 'reboot'

root@sonic:~#show reboot-cause history

root@sonic:~#show reboot-cause history all

Device Name Cause Time User Comment

Expand Down Expand Up @@ -761,7 +764,12 @@ DPU1 SS-DPU1 2 Online up SN20
SWITCH Chassis 0 Online N/A FLM27000ER
```
#### 3.4.5 System health details
* The system health summary on NPU should include the DPU health. Extend the existing infrastructure.
#### Phase:1
* The system health summary on switch will display only the NPU health
* The system health summary on DPU will display the DPU health

#### Phase:2
* The system health summary on switch should include the NPU and DPU health. Extend the existing CLI infrastructure.

show system-health summary \<module-name\> <font>**`Executed on the switch or DPU - module-name is ignored on the DPUs`**</font>
```
Expand Down Expand Up @@ -798,7 +806,7 @@ Online : All states are up
Offline: dpu_midplane_link_state is down
Partial Online: dpu_midplane_link_state is up and dpu_control_plane_state or dpu_data_plane_state is down

There are two parts to the state detail. 1. The midplane state 2. the dpu states (booted, control plane state, data plane state). The midplane state has to be updated by the switch side pcied. The dpu states will be updated by the DPU (redis client update) on the switch ChassisStateDB. The get_state_info() API in the moduleBase class will fetch the contents from the DB. The show CLI reads the redis table and displays the data.
There are two parts to the state detail. 1. The midplane state 2. the dpu states (control plane state, data plane state). The midplane state has to be updated by the switch side pcied. The dpu states will be updated by the DPU (redis client update) on the switch ChassisStateDB. The get_state_info() API in the moduleBase class will fetch the contents from the DB. The show CLI reads the redis table and displays the data.
root@sonic:~#show system-health DPU all  

Name ID Oper-Status State-Detail State-Value Time Reason
Expand Down Expand Up @@ -1022,4 +1030,4 @@ Note:
```

## 4. Test Plan
In Progress
[Test Plan](https://github.com/nissampa/sonic-mgmt_dpu_test/blob/dpu_test_plan_draft_pr/docs/testplan/Smartswitch-test-plan.md)