[chassis][midplane] Modify the chassisd to log expected/unexpected midplane connectivity messages #480
@deepak-singhal0408 @judyjoseph This PR addresses an issue with logging lost midplane connectivity. There are 3 PRs in total. Please review them. Thanks.
Can you provide details (schema) on the Chassis Module Reboot Info table that is introduced here?
It is not clear why the Chassis Module Reboot Info entry needs to be removed from platform-specific code. Isn't this handled entirely in SONiC common code?
d38ebe6 to 386748a
On the Nokia platform, one of the unexpected reboots (the missing-heartbeat reboot) calls "sudo reboot". Since "sudo reboot" creates the CHASSIS_MODULE_REBOOT_INFO_TABLE entry that marks a reboot as expected, we need to remove the entry in this case. This is platform-specific behavior.
The CHASSIS_MODULE_REBOOT_INFO_TABLE is defined as below: Example:
@mlok-nokia, could you please also add a UT case?
…dplane connectivity messages Signed-off-by: mlok <marty.lok@nokia.com>
386748a to 918461f
Add a mechanism to get the linecard_reboot_timeout value from the platform_env.conf file. This allows different platforms to have different timeout values.
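As an illustration of the mechanism described in this commit, here is a minimal sketch of reading linecard_reboot_timeout from text in the style of platform_env.conf. It assumes the file holds simple key=value lines and that a 180-second default applies when the key is missing or malformed; the function name, the default handling, and the parsing rules are hypothetical and are not taken from the PR's code.

```python
# Hypothetical default: 180 s matches the 3-minute window in the PR description.
DEFAULT_LINECARD_REBOOT_TIMEOUT = 180


def parse_linecard_reboot_timeout(conf_text, default=DEFAULT_LINECARD_REBOOT_TIMEOUT):
    """Return linecard_reboot_timeout from key=value text, or the default.

    Assumes one key=value pair per line; blank lines and lines starting
    with '#' are skipped. This is an illustrative parser, not chassisd's.
    """
    for raw_line in conf_text.splitlines():
        line = raw_line.strip()
        if not line or line.startswith("#"):
            continue
        key, sep, value = line.partition("=")
        if sep and key.strip() == "linecard_reboot_timeout":
            try:
                return int(value.strip())
            except ValueError:
                # Malformed value: fall back rather than fail the daemon.
                return default
    return default


if __name__ == "__main__":
    sample = "other_key=1\nlinecard_reboot_timeout=240\n"
    print(parse_linecard_reboot_timeout(sample))
```

Falling back to a default keeps platforms that do not define the key on the same behavior as before, which is one plausible design choice here.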
LGTM.
UT has been added.
LGTM. Do we need to define this new table here: https://github.com/sonic-net/sonic-swss-common/blob/master/common/schema.h#L440
@kenneth-arista could you review as well?
MSFT ADO: 28164958
… for Nokia-IXR7250E platform (#18862) This PR adds the platform-specific linecard_reboot_timeout value to platform_env.conf. It works with PR sonic-net/sonic-platform-daemons#480 and sonic-net/sonic-utilities#3292 to address issue #18540. Signed-off-by: mlok <marty.lok@nokia.com>
Modified the SUP chassisd check_midplane_reachability() function to use the CHASSIS_MODULE_REBOOT_INFO_TABLE data (which is set by the linecard "sudo reboot" command) to log whether a module's loss of midplane connectivity is expected or unexpected. This addresses issue sonic-net/sonic-buildimage#18540
Description
Add a new method is_module_reboot_expected() to check whether a CHASSIS_MODULE_REBOOT_INFO_TABLE|LINECARD# entry exists in CHASSIS_STATE_DB when a linecard is not reachable from the SUP. If the entry exists, the reboot is expected and check_midplane_reachability() logs "pmon#chassisd: Expected: Module LINE-CARD1 lost midplane connectivity". If the entry doesn't exist, it logs "pmon#chassisd: Unexpected: Module LINE-CARD1 lost midplane connectivity". The CHASSIS_MODULE_REBOOT_INFO_TABLE|LINECARD# entry is created and inserted by the linecard "sudo reboot" command (added by a separate PR). In other words, when a user issues a linecard reboot, "lost midplane connectivity" is expected. Any other cause, such as a linecard crash or a missing-heartbeat reboot, is unexpected.
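The expected/unexpected decision described above can be sketched as follows. This is a simplified illustration, not the PR's code: a plain dict stands in for CHASSIS_STATE_DB, and the function signatures, the helper midplane_loss_message(), and the LINECARD1 key are assumptions made for the example.

```python
REBOOT_INFO_TABLE = "CHASSIS_MODULE_REBOOT_INFO_TABLE"


def is_module_reboot_expected(state_db, module_key):
    """Return True when a reboot-info entry exists for the module.

    state_db is a plain dict standing in for CHASSIS_STATE_DB (assumption);
    the real daemon reads the entry through the SONiC database layer.
    """
    return f"{REBOOT_INFO_TABLE}|{module_key}" in state_db


def midplane_loss_message(state_db, module_key, module_name):
    """Build the log text for a module that lost midplane connectivity."""
    if is_module_reboot_expected(state_db, module_key):
        prefix = "Expected"
    else:
        prefix = "Unexpected"
    return f"{prefix}: Module {module_name} lost midplane connectivity"


if __name__ == "__main__":
    # Entry written by the linecard "sudo reboot" path; field contents omitted.
    db = {"CHASSIS_MODULE_REBOOT_INFO_TABLE|LINECARD1": {}}
    print(midplane_loss_message(db, "LINECARD1", "LINE-CARD1"))
    print(midplane_loss_message({}, "LINECARD1", "LINE-CARD1"))
```

The first call models a user-issued reboot (entry present); the second models a crash or missing-heartbeat reboot (no entry).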
Add new methods module_reboot_set_time() and is_module_reboot_system_up_expired() to check whether a linecard that went through an expected reboot fails to come back up and be detected by the SUP within 3 minutes. In that case, check_midplane_reachability() logs "pmon#chassisd: Unexpected: Module LINE-CARD1 lost midplane connectivity". This gives the monitoring tool a log message on which to base further action.
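A minimal sketch of the timeout check, assuming the reboot start time is recorded once when the expected reboot is first noticed and compared against a monotonic clock. The method names follow the description above, but the RebootTracker class, the signatures, and the injectable clock are hypothetical and only serve the illustration.

```python
import time

# 3 minutes, per the description; the real value may come from platform config.
LINECARD_REBOOT_TIMEOUT = 180


class RebootTracker:
    """Tracks how long each module has been down after an expected reboot."""

    def __init__(self, timeout=LINECARD_REBOOT_TIMEOUT, clock=time.monotonic):
        self._timeout = timeout
        self._clock = clock      # injectable so the logic can be tested
        self._start = {}         # module name -> time the reboot was first seen

    def module_reboot_set_time(self, module):
        # Record the start only once, so repeated polls do not reset the timer.
        self._start.setdefault(module, self._clock())

    def is_module_reboot_system_up_expired(self, module):
        start = self._start.get(module)
        if start is None:
            return False
        return self._clock() - start > self._timeout

    def module_is_up(self, module):
        # Clear the timer once the module is reachable again.
        self._start.pop(module, None)
```

With this shape, a poll loop would call module_reboot_set_time() while the module is unreachable with an expected-reboot entry, and switch to the "Unexpected" message once is_module_reboot_system_up_expired() returns True.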
This PR depends on and is associated with the following PRs:
sonic-net/sonic-buildimage#18805
sonic-net/sonic-utilities#3292
#480
sonic-net/sonic-buildimage#18862
Motivation and Context
This provides a proper log message indicating whether a module's "lost midplane connectivity" event is expected or not, giving the monitoring tool the information it needs to take further action. Fixes sonic-net/sonic-buildimage#18540
How Has This Been Tested?
This PR requires PR https://github.com/sonic-net/sonic-utilities/pull/3292 and to work with
Additional Information (Optional)
This PR needs to be backported to branches:
[x] 202205