Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[warm-reboot] remove ISSU bank check #2958

Merged
merged 1 commit into from
Oct 2, 2023
Merged
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
42 changes: 3 additions & 39 deletions scripts/fast-reboot
Original file line number Diff line number Diff line change
Expand Up @@ -193,7 +193,7 @@ function request_pre_shutdown()
{
if [ -x ${DEVPATH}/${PLATFORM}/${PLATFORM_REBOOT_PRE_CHECK} ]; then
debug "Requesting platform reboot pre-check ..."
${DEVPATH}/${PLATFORM}/${PLATFORM_REBOOT_PRE_CHECK} ${REBOOT_TYPE}
${DEVPATH}/${PLATFORM}/${PLATFORM_REBOOT_PRE_CHECK} ${REBOOT_TYPE}
fi
debug "Requesting pre-shutdown ..."
STATE=$(timeout 5s docker exec syncd /usr/bin/syncd_request_shutdown --pre &> /dev/null; if [[ $? == 124 ]]; then echo "timed out"; fi)
Expand All @@ -202,34 +202,6 @@ function request_pre_shutdown()
fi
}

function recover_issu_bank_file()
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I understand that the paths in the current check are not relevant as the issu bank file format has changed.

However, I am not clear on why we do not change file names to use new format, rather than remove the check altogether.
From our discussion - this issue is not seen so far on newer branches, but is that a guaranee that the issue will never be seen again? If yes, then how is that guaranteed?

What is the downside of keeping this change w/ new file format?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If you are concerned about the auto-fix ultimately breaking something, then do you rather want to detect a missing/broken bank file, and abort warmboot (do the check before set +e)?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

However, I am not clear on why we do not change file names to use new format, rather than remove the check altogether. From our discussion - this issue is not seen so far on newer branches, but is that a guaranee that the issue will never be seen again? If yes, then how is that guaranteed?

As far as I remember the issu_bank.txt got corrupted due to unknown reasons. Even with this check I don't see a guarantee issu_bank.txt won't get corrupted after the check and recovery.

See the original PR - #1466, the way to reproduce it is modify warm-reboot script to artificially corrupt it.

What is the downside of keeping this change w/ new file format?

We, in SONiC, have to track changes SDK makes to it's files. And there is not just a single file, there are more. Not all of them can be restored from SONiC. It could also happen that SDK changes the format of the file and SONiC overwrites it to an old format leading to a failed warm boot.

If you are concerned about the auto-fix ultimately breaking something, then do you rather want to detect a missing/broken bank file, and abort warmboot (do the check before set +e)?

I would then ask SDK to do the check. But then, the logic they have to implement is - write the content and read it to make sure it is written. Seems redundant. Again, does not guarantee files corruption afterwards.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would then ask SDK to do the check. But then, the logic they have to implement is - write the content and read it to make sure it is written. Seems redundant. Again, does not guarantee files corruption afterwards.

I did not follow this part completely. IMO, this is not redundant if it works in some situations and detect corrupted files to avoid warmboot. This has happened before in older images, and can potentially happen now too. Aborting warm reboot will prevent any risk of dataplane downtime.

{
debug "Recovering the (${ISSU_BANK_FILE}) file"
docker exec -i syncd sx_api_dbg_generate_dump.py
issu_bank_value=`docker exec -i syncd cat /tmp/sdkdump | grep 'ISSU Bank' | grep -o -E '[0-9]+'`
printf $issu_bank_value > /host/warmboot/issu_bank.txt
}

function check_issu_bank_file()
{
ISSU_BANK_FILE=/host/warmboot/issu_bank.txt

if [[ ! -s "$ISSU_BANK_FILE" ]]; then
error "(${ISSU_BANK_FILE}) does NOT exist or empty ..."
recover_issu_bank_file
return
fi

issu_file_chars_count=`stat -c %s ${ISSU_BANK_FILE}`;
issu_file_content=`awk '{print $0}' ${ISSU_BANK_FILE}`

if [[ $issu_file_chars_count != 1 ]] ||
[[ "$issu_file_content" != "0" && "$issu_file_content" != "1" ]]; then
error "(${ISSU_BANK_FILE}) is broken ..."
recover_issu_bank_file
fi
}

function wait_for_pre_shutdown_complete_or_fail()
{
debug "Waiting for pre-shutdown ..."
Expand Down Expand Up @@ -464,7 +436,7 @@ function invoke_kexec() {

function load_kernel() {
# Load kernel into the memory
invoke_kexec -a
invoke_kexec -a
}

function load_kernel_secure() {
Expand Down Expand Up @@ -630,7 +602,7 @@ fi
if is_secureboot && grep -q aboot_machine= /host/machine.conf; then
load_aboot_secureboot_kernel
else
# check if secure boot is enable in UEFI
# check if secure boot is enable in UEFI
CHECK_SECURE_UPGRADE_ENABLED=0
SECURE_UPGRADE_ENABLED=$(bootctl status 2>/dev/null | grep -c "Secure Boot: enabled") || CHECK_SECURE_UPGRADE_ENABLED=$?
if [[ CHECK_SECURE_UPGRADE_ENABLED -ne 0 ]]; then
Expand Down Expand Up @@ -773,17 +745,9 @@ for service in ${SERVICES_TO_STOP}; do
# Pre-shutdown syncd
initialize_pre_shutdown

if [[ "x$sonic_asic_type" == x"mellanox" ]]; then
check_issu_bank_file
fi

request_pre_shutdown

wait_for_pre_shutdown_complete_or_fail

if [[ "x$sonic_asic_type" == x"mellanox" ]]; then
check_issu_bank_file
fi
fi

if [[ "$REBOOT_TYPE" = "fastfast-reboot" || "$REBOOT_TYPE" = "fast-reboot" ]]; then
Expand Down
Loading