Added schema for health_info, reboot_cause on chassisStateDB and added the link to pmon-test-plan #1709

Open
wants to merge 22 commits into
base: master

Conversation

rameshraghupathy
Contributor

Added schema for health_info, reboot_cause on chassisStateDB and added the link to pmon-test-plan

Contributor

@prgeor prgeor left a comment


@rameshraghupathy Can you add a section for DPU dark mode support? In this case,
the NPU's PMON should honor the user configuration to power OFF the DPU via the platform API.


@vvolam vvolam left a comment


Minor query, otherwise LGTM.

doc/smart-switch/pmon/smartswitch-pmon.md (outdated)
vvolam previously approved these changes Aug 21, 2024

@vvolam vvolam left a comment


LGTM.

Contributor

@prgeor prgeor left a comment


@rameshraghupathy Please update section 3.5 to describe how the console utility will be implemented.

@rameshraghupathy
Contributor Author

@rameshraghupathy Can you add a section for DPU dark mode support? In this case, the NPU's PMON should honor the user configuration to power OFF the DPU via the platform API.

Added section "2.1.1 DPUs in dark mode"

### Configuring startup and shutdown
* The DPUs can be powered down by configuring the admin_status as shown.
* The corresponding switch configDB table is also shown
#### 2.1.1 DPUs in dark mode
Contributor


@rameshraghupathy can you define what DARK mode is?
Also mention that DARK mode is enabled by default.

Contributor Author


@rameshraghupathy can you define what DARK mode is? Also mention that DARK mode is enabled by default.

Done
@prgeor

* The user can use the “config chassis modules startup DPUx” to power ON a DPU Example: “config chassis modules startup DPU0”
* The “config chassis modules shutdown DPUx” is used to power OFF a DPU Example: “config chassis modules shutdown DPU0”
* The DPUs are powered down by configuring the admin_status as shown in the schema
* The config change event handler listens to the config change and sets the corresponding switch configDB table and also triggers the module set_admin_state() API
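
A minimal sketch of the handler flow described in the bullets above, assuming a module object that exposes `set_admin_state(up)` taking a boolean. `StubModule` and `handle_config_change` are hypothetical names used only for illustration; the DB subscription and the process that hosts the handler are not shown here.

```python
class StubModule:
    """Hypothetical stand-in for a platform module object (not a real SONiC class)."""

    def __init__(self, name):
        self.name = name
        self.admin_up = True

    def set_admin_state(self, up):
        # Assumption: True powers the DPU on, False powers it off.
        self.admin_up = up
        return True


def handle_config_change(modules, key, fields):
    """Apply an admin_status change for a key such as 'CHASSIS_MODULE|DPU0'."""
    table, _, name = key.partition("|")
    if table != "CHASSIS_MODULE" or name not in modules:
        return False
    status = fields.get("admin_status")
    if status not in ("up", "down"):
        return False
    return modules[name].set_admin_state(status == "up")


modules = {"DPU0": StubModule("DPU0")}
# Roughly what "config chassis modules shutdown DPU0" would result in.
handle_config_change(modules, "CHASSIS_MODULE|DPU0", {"admin_status": "down"})
```

Unknown modules and unexpected `admin_status` values are rejected rather than guessed at, so a malformed config entry never toggles power.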
Contributor


@rameshraghupathy please specify where this event handler is running

Contributor Author


@rameshraghupathy please specify where this event handler is running

Done
@prgeor

@@ -128,9 +138,10 @@ Key: "CHASSIS_MODULE|DPU0"
#### DPU shutdown sequence
* There could be two possible sources for DPU shutdown. 1. A configuration change to DPU "admin_status: down" 2. The GNOI logic can trigger it.
* The GNOI server runs on the DPU even after the DPU is shutdown.
Contributor


@rameshraghupathy if the DPU is shut down, how can the GNOI server run?

Contributor Author


@rameshraghupathy if the DPU is shut down, how can the GNOI server run?

Meant to say pre-shutdown.
The GNOI server runs on the DPU even after the DPU is pre-shutdown and listens until the graceful shutdown finishes.
Fixed
@prgeor

}

```
#### DPU State
Contributor


@rameshraghupathy can you please specify that this update is done by Chassisd inside PMON? We don't need a DPU-specific agent to fetch these?

Contributor Author


@rameshraghupathy can you please specify that this update is done by Chassisd inside PMON? We don't need a DPU-specific agent to fetch these?

Updated:
Store the state progression (dpu_midplane_link_state, dpu_control_plane_state, dpu_data_plane_state) in the host ChassisStateDB using the push model specified in section 3.2.4 of the SONiC Chassis Platform Management & Monitoring HLD
@prgeor

@@ -676,26 +678,10 @@ fantray0 N/A fantray0.fan 55% intake Present OK 20230
fantray1 N/A fantray1.fan 56% intake Present OK 20230728 06:41:17
```

#### 3.4.1 Reboot Cause
#### 3.4.1 Reboot Cause CLIs
Contributor


@rameshraghupathy Please specify which entity or service will update the chassisStateDB

Contributor Author

@rameshraghupathy rameshraghupathy Oct 3, 2024


The PMON on the DPU side will be responsible for updating the switch-side chassisStateDB on DPU boot-up, using the push model specified in section 3.2.4 of the SONiC Chassis Platform Management & Monitoring HLD
@prgeor

Contributor


@rameshraghupathy understood, it's PMON. Which agent/daemon inside PMON?

Contributor Author


@prgeor "Though how DPU pmon updates this is vendor dependent, it is recommended to use the sonic telemetry agent to align with the existing SONiC implementation."

@rameshraghupathy
Contributor Author

@rameshraghupathy Please update section 3.5 to describe how the console utility will be implemented.

Done
@prgeor

Contributor

@prgeor prgeor Oct 3, 2024


@rameshraghupathy In section 3.2 can you specify if the thermal management is in NPU or DPU?

Contributor Author


@prgeor Updated. It runs on the NPU.

#### REBOOT_CAUSE DB schema
```
Key: "REBOOT_CAUSE|2023_06_18_14_56_12"
* Each DPU will update its reboot cause history in the Switch ChassisStateDB upon boot up.
Contributor


@rameshraghupathy How? Which daemon/service?

Contributor Author


The dpu_db_util/system_health service will update the ChassisStateDB table.

Comment on lines +688 to +689
* Though how DPU pmon updates this is vendor dependent, it is recommended to use the sonic telemetry agent to align with the existing SONiC implementation.
* The DPUs will limit the number of history entries to a maximum of ten.
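
The ten-entry cap on reboot-cause history could be enforced along these lines. This is a sketch only: it assumes history keys follow the `REBOOT_CAUSE|2023_06_18_14_56_12` pattern shown in the schema, and `entry_time` and `split_history` are hypothetical helper names. Reading keys from and deleting them in ChassisStateDB is left out.

```python
from datetime import datetime

PREFIX = "REBOOT_CAUSE|"
STAMP_FORMAT = "%Y_%m_%d_%H_%M_%S"  # matches keys like REBOOT_CAUSE|2023_06_18_14_56_12
MAX_ENTRIES = 10


def entry_time(key):
    """Parse the timestamp embedded in a reboot-cause key."""
    return datetime.strptime(key[len(PREFIX):], STAMP_FORMAT)


def split_history(keys, limit=MAX_ENTRIES):
    """Return (kept, stale): the newest `limit` keys and the older ones to delete."""
    ordered = sorted(keys, key=entry_time, reverse=True)
    return ordered[:limit], ordered[limit:]


# Twelve synthetic entries, one second apart.
keys = ["REBOOT_CAUSE|2023_06_18_14_56_%02d" % s for s in range(12)]
kept, stale = split_history(keys)
```

With twelve entries, the two oldest keys land in `stale` and would be the ones removed.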
Contributor


@rameshraghupathy Why do DPU pmon updates need to be vendor dependent?

Contributor Author


There is no guarantee that the SONiC running on the DPUs will necessarily be running Telemetry.


4 participants