Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[warm-reboot] remove ISSU bank check #2958

Merged
merged 1 commit into from
Oct 2, 2023

Conversation

stepanblyschak
Copy link
Contributor

@stepanblyschak stepanblyschak commented Aug 29, 2023

What I did

I removed issu bank check:

  1. The file in which SDK holds ISSU bank information has changed and is no longer issu_bank.txt
  2. We shouldn't check the SDK dump content from SONiC as that information and its format is constantly changing. Overwriting it might be risky when SONiC and SDK are out of sync regarding the format of the file
  3. ISSU bank information is written by pre_shutdown request anyway. Verifying whether the information was indeed written right after that seems redundant
  4. The check is made after set +e, so at the moment we "detect" a problem it is already too late

How I did it

Removed the relevant code.

How to verify it

Run warm-reboot on device.

Previous command output (if the output of a command-line utility has changed)

New command output (if the output of a command-line utility has changed)

1. The file in which SDK holds ISSU bank information has changed and is
no longer issu_bank.txt
2. We shouldn't check the SDK dump content from SONiC as that
information and its format is constantly changing
3. ISSU bank information is written by pre_shutdown request anyway.
Verifying whether the information was indeed written right after that
seems redundant.

Signed-off-by: Stepan Blyschak <stepanb@nvidia.com>
@@ -202,34 +202,6 @@ function request_pre_shutdown()
fi
}

function recover_issu_bank_file()
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I understand that the paths in the current check are not relevant as the issu bank file format has changed.

However, I am not clear on why we do not change file names to use new format, rather than remove the check altogether.
From our discussion - this issue is not seen so far on newer branches, but is that a guaranee that the issue will never be seen again? If yes, then how is that guaranteed?

What is the downside of keeping this change w/ new file format?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If you are concerned about the auto-fix ultimately breaking something, then do you rather want to detect a missing/broken bank file, and abort warmboot (do the check before set +e)?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

However, I am not clear on why we do not change file names to use new format, rather than remove the check altogether. From our discussion - this issue is not seen so far on newer branches, but is that a guaranee that the issue will never be seen again? If yes, then how is that guaranteed?

As far as I remember the issu_bank.txt got corrupted due to unknown reasons. Even with this check I don't see a guarantee issu_bank.txt won't get corrupted after the check and recovery.

See the original PR - #1466, the way to reproduce it is modify warm-reboot script to artificially corrupt it.

What is the downside of keeping this change w/ new file format?

We, in SONiC, have to track changes SDK makes to it's files. And there is not just a single file, there are more. Not all of them can be restored from SONiC. It could also happen that SDK changes the format of the file and SONiC overwrites it to an old format leading to a failed warm boot.

If you are concerned about the auto-fix ultimately breaking something, then do you rather want to detect a missing/broken bank file, and abort warmboot (do the check before set +e)?

I would then ask SDK to do the check. But then, the logic they have to implement is - write the content and read it to make sure it is written. Seems redundant. Again, does not guarantee files corruption afterwards.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would then ask SDK to do the check. But then, the logic they have to implement is - write the content and read it to make sure it is written. Seems redundant. Again, does not guarantee files corruption afterwards.

I did not follow this part completely. IMO, this is not redundant if it works in some situations and detect corrupted files to avoid warmboot. This has happened before in older images, and can potentially happen now too. Aborting warm reboot will prevent any risk of dataplane downtime.

Copy link
Contributor

@vaibhavhd vaibhavhd left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

PR to be kept on master, and not to be backported to older branches.

@qiluo-msft qiluo-msft merged commit ba179fb into sonic-net:master Oct 2, 2023
5 checks passed
JunhongMao pushed a commit to JunhongMao/sonic-utilities that referenced this pull request Oct 4, 2023
### What I did

I removed issu bank check:

1. The file in which SDK holds ISSU bank information has changed and is no longer issu_bank.txt
2. We shouldn't check the SDK dump content from SONiC as that information and its format is constantly changing. Overwriting it might be risky when SONiC and SDK are out of sync regarding the format of the file
3. ISSU bank information is written by pre_shutdown request anyway. Verifying whether the information was indeed written right after that seems redundant
4. The check is made after ```set +e```, so at the moment we "detect" a problem it is already too late

#### How I did it

Removed the relevant code.

#### How to verify it

Run warm-reboot on device.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants