Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Express Reboot_HLD #1570

Open
wants to merge 13 commits into
base: master
Choose a base branch
from
Open

Express Reboot_HLD #1570

wants to merge 13 commits into from

Conversation

jhli-cisco
Copy link

No description provided.

@zhangyanzhao
Copy link
Collaborator

Is this proposal Cisco 8000 specific? Or can apply to other platforms?

@jhli-cisco
Copy link
Author

jhli-cisco commented Jan 17, 2024 via email

@jhli-cisco jhli-cisco changed the title Cisco 8000 Express Reboot_HLD Express Reboot_HLD Jan 29, 2024
@paulmenzel
Copy link

(Please remove the citation part, if you reply via email.)

Have you already implemented this as a prototype, and can you share the timings?

@zhangyanzhao
Copy link
Collaborator

Community reviewed on 1/30/2024.

3. During express boot pre-shutdown, syncd sets SAI_SWITCH_ATTR_FAST_API_ENABLE = true in setPreShutdownOnAllSwitches() to notify SDK for proper cleanup.

```
sai_status_t yncd::setPreShutdownOnAllSwitches()
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Typo. Syncd

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks!. will correct it.

@zhangyanzhao
Copy link
Collaborator

We should define a clear reboot time threshold for express reboot, also please keep in mind that this feature may impact other platforms and the threshold should apply to other platforms.

* A special type of boot used by Cisco platforms to start in 'express'
* boot mode
*/
SAI_START_TYPE_EXPRESS_BOOT = 4,
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

  1. We need to describe what the SAI implementation needs to do for Express Boot?
  2. How is Express Boot different from FastFast boot?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Will update SAI to add express boot type. The difference is that express boot needs its own pre-shutdown notification to differentiate from warm boot, but fastfast boot uses the same pre-shutdown notification as warm boot.

* A special type of boot used by Cisco platforms to start in 'express'
* boot mode
*/
SAI_START_TYPE_EXPRESS_BOOT = 4,
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

  1. We need to describe what the SAI implementation needs to do for Express Boot?
  2. How is Express Boot different from FastFast boot?

*/
SAI_START_TYPE_EXPRESS_BOOT = 4,

/**
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Will do. Thanks!

*/
SAI_START_TYPE_EXPRESS_BOOT = 4,

/**
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

```
sai_status_t yncd::setPreShutdownOnAllSwitches()
{
if (shutdownType == SYNCD_RESTART_TYPE_EXPRESS)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If I understand correctly, this step is the only difference between Express boot vs FastFast boot.

Why is this step needed? During Express shut, the SAI implementation must not reset the ASIC.

This can be achieved by setting SAI_SWITCH_ATTR_UNINIT_DATA_PLANE_ON_REMOVAL to false.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It needs a pre-shutdown notification so that platform hardware knows it and does necessary things pre- and during shutdown. SAI_SWITCH_ATTR_FAST_API_ENABLE is used after boot upon receving SAI_REDIS_NOTIFY_SYNCD_APPLY_VIEW. SO we used the same attr for pre-shutdown notification as well.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We are overloading SAI_SWITCH_ATTR_FAST_API_ENABLE attribute. We should clearly define what the underlying SAI implementation must do for Express boot/shut.

SAI_SWITCH_ATTR_FAST_API_ENABLEExpress Boot/shutMeaning
TrueBoot

Valid only for FastFast and Express??

Signals the SAI adapter to begin the bulk programming

 

FalseBootShould not be used?
TrueShut??
FalseShut??

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi, this is how the SAI_SWITCH_ATTR_FAST_API_ENABLE being used for express/fastfast boot. Will add this info to the doc.

SAI_SWITCH_ATTR_FAST_API_ENABLE Express boot fastfast boot Meaning
TRUE Shut Not used Pre-shutdown notification
FALSE Boot Boot Signals the SAI adapter to begin the bulk programming post reboot

```
sai_status_t yncd::setPreShutdownOnAllSwitches()
{
if (shutdownType == SYNCD_RESTART_TYPE_EXPRESS)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If I understand correctly, this step is the only difference between Express boot vs FastFast boot.

Why is this step needed? During Express shut, the SAI implementation must not reset the ASIC.

This can be achieved by setting SAI_SWITCH_ATTR_UNINIT_DATA_PLANE_ON_REMOVAL to false.

@zhangyanzhao
Copy link
Collaborator

@jhli-cisco
Copy link
Author

(Please remove the citation part, if you reply via email.)

Have you already implemented this as a prototype, and can you share the timings?

It is sub-second traffic interruption.

@eddieruan-alibaba
Copy link
Contributor

Can you explain how you could get "SDK needs a notification once all configurations are restored after express boot."? Would it be configuration only, or it would include BGP routes redownloading?


## Overview

The goal of Sonic express reboot is to be able to restart and upgrade SONiC software with sub-second data plane interruption. This is achieved by not initializing external phys, and only initializing NPU after receiving all configuration updates. This document covers updates needed in Sonic and SAI interface.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you create a section to describe the timelines on control plane and data plane version changes during this process, how the control traffic will be handled especially when control plane and data plane are not in the same version of images?

"all configuration updates", do you include operation data such as BGP routes learnt from remote side to be updated as well?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@eddieruan-alibaba add a drawing that illustrates the interactions between v1/v2 sonic/sdk/npu during image upgrade. Please let me know if it is what you were looking for.

The existing fast-reboot script ((https://github.com/sonic-net/sonic-utilities/blob/master/scripts/fast-reboot) that is used to trigger fast-reboot/warm-boot will be enhanced to support express-reboot. Major changes are:

1. Enforce express-reboot is only applicable to Cisco 8000 platforms at the moment.
2. During express-reboot shutdown path, syncd_request_shutdown will be called with a new option “-pxe” to inform syncd pre-shutdown for express boot mode. The actual calls will be “docker exec syncd /usr/bin/syncd_request_shutdown –pxe”.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What does -pxe stand for? Why not use -exp as it is the prefix of express reboot?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

HI Ying, it stands for pre-expressboot. This is the notification before reboot.

@jhli-cisco
Copy link
Author

Can you explain how you could get "SDK needs a notification once all configurations are restored after express boot."? Would it be configuration only, or it would include BGP routes redownloading?

Hi Eddie, this is the same as in warm-reboot case, not only configuration, I believe including BGP routes.

@jhli-cisco
Copy link
Author

@yxieca @eddieruan-alibaba @rck-innovium Please let me know if you have any more comments.

@eddieruan-alibaba
Copy link
Contributor

When do you enable the punt path from NPU to SONiC? between t1 and t2, or between t3 and t4?
How does routes and neighbor / ARP be reinstalled in V2's kernel?

@jhli-cisco
Copy link
Author

When do you enable the punt path from NPU to SONiC? between t1 and t2, or between t3 and t4? How does routes and neighbor / ARP be reinstalled in V2's kernel?

@eddieruan-alibaba the punt path is enabled for both periods. Between t1 and t2 the npu is v1 version and between t3 and t4 is in v2 version. Routes/ARP are replayed by v2 swss between t1 and t2 from saved db.

@eddieruan-alibaba
Copy link
Contributor

When do you enable the punt path from NPU to SONiC? between t1 and t2, or between t3 and t4? How does routes and neighbor / ARP be reinstalled in V2's kernel?

@eddieruan-alibaba the punt path is enabled for both periods. Between t1 and t2 the npu is v1 version and between t3 and t4 is in v2 version. Routes/ARP are replayed by v2 swss between t1 and t2 from saved db.

Can you add some detail information about this punt path handling in your HLD? Between t1 and t2, your v1 hardware will send traffic to v2 software. Since you want to be able upgrade from any version to any version, this version mismatch punt would be a very strong requirement on punt header backward compatibility.

Also, can you add some detail information on how to replay Routes / ARP between t1 and t2, and what benefits you want to achieve with replay. From when BGP would relearn the routes, and how BGP learnt routes get resync with your replay routes. For ARP entries, how do you tell kernel to install your replay entry and how kernel should handle it with learnt ARP packets between t1 and t2. I don't have a clear picture in this area.

@jhli-cisco
Copy link
Author

When do you enable the punt path from NPU to SONiC? between t1 and t2, or between t3 and t4? How does routes and neighbor / ARP be reinstalled in V2's kernel?

@eddieruan-alibaba the punt path is enabled for both periods. Between t1 and t2 the npu is v1 version and between t3 and t4 is in v2 version. Routes/ARP are replayed by v2 swss between t1 and t2 from saved db.

Can you add some detail information about this punt path handling in your HLD? Between t1 and t2, your v1 hardware will send traffic to v2 software. Since you want to be able upgrade from any version to any version, this version mismatch punt would be a very strong requirement on punt header backward compatibility.

Also, can you add some detail information on how to replay Routes / ARP between t1 and t2, and what benefits you want to achieve with replay. From when BGP would relearn the routes, and how BGP learnt routes get resync with your replay routes. For ARP entries, how do you tell kernel to install your replay entry and how kernel should handle it with learnt ARP packets between t1 and t2. I don't have a clear picture in this area.

Hi @eddieruan-alibaba, there isn't special handling in punt path. Express boot isn't targeted for upgrade from any version to any version, and in reality, it can't. The upgrade in control plane/sonic still follows warm boot requirements.

Route/ARP replay after reboot also follows warm-reboot logic which is documented in https://github.com/sonic-net/SONiC/blob/master/doc/warm-reboot/swss_warm_restart.md. The only difference from warm-reboot is in syncd the reconciliation is skiped for express boot.

Hope this clarrify things a bit.

@eddieruan-alibaba
Copy link
Contributor

Express boot isn't targeted for upgrade from any version to any version, and in reality, it can't. The upgrade in control plane/sonic still follows warm boot requirements.

Can you create a table to compare your approach with existing approach from boot requirements to implementations. Since you don't hold "any version to any version upgrade", I don't have a clear picture on what your approach is different from existing one. It would be great if you could articulate it clearly.

@jhli-cisco
Copy link
Author

Hi @eddieruan-alibaba, added a table to compare major steps in warm and express boot. Please let me know if you need more details. I can schedule a meeting with you if needed.

@eddieruan-alibaba
Copy link
Contributor

Hi @eddieruan-alibaba, added a table to compare major steps in warm and express boot. Please let me know if you need more details. I can schedule a meeting with you if needed.

Some comments on your table

  1. Use case, "with change in data plane" This statement is very confusing. You have a limitation that there is no PUNT/Inject header changes and you don't support any version to any version update. Can you put these two constrain in this table. For current one, it only allows N to N+1 without data plane change completely. But in your case, you allow normal data plane change, but no punt path change.

  2. SONiC action. In your case, why do you need SWSS replay? You will relearn all routes and arp/mac between t1 and t2. What's the need for SWSS replay? I assume you only need configuration replay only, not routes/ARP etc in SWSS.

  3. SDK action. You will reset NPU between t2 and t3. Can you mention it clearly or describe the behavior with different time window?

  4. "table configuration " is a little bit confusing term here. I would rather call them as "NPU data". You will DMA "NPU data" to NPU.

Sure. We could have a chat.

@jhli-cisco
Copy link
Author

Hi @eddieruan-alibaba , what time is good for you? Webex or zoom?

@zhangyanzhao
Copy link
Collaborator

code PR is not ready, move to backlog for future release

@jhli-cisco
Copy link
Author

@eddieruan-alibaba , updated to include punt/inject handling.

@eddieruan-alibaba
Copy link
Contributor

Thanks. The punt/inject picture looks good.

Can you comment on that in your local testing, what is the gap between t1 and t2 and under what kind of scale.

This window would grow as the scale goes high. The window between t1 and t2 would be your black out window. We need to make sure the neighbor devices would not time out.

@jhli-cisco
Copy link
Author

Thanks. The punt/inject picture looks good.

Can you comment on that in your local testing, what is the gap between t1 and t2 and under what kind of scale.

This window would grow as the scale goes high. The window between t1 and t2 would be your black out window. We need to make sure the neighbor devices would not time out.

@eddieruan-alibaba do you mean t2 to t3 which is the traffic disruption window?

@eddieruan-alibaba
Copy link
Contributor

Thanks. The punt/inject picture looks good.
Can you comment on that in your local testing, what is the gap between t1 and t2 and under what kind of scale.
This window would grow as the scale goes high. The window between t1 and t2 would be your black out window. We need to make sure the neighbor devices would not time out.

@eddieruan-alibaba do you mean t2 to t3 which is the traffic disruption window?

No, I mean the window between t1 and t2. The window between t2 and t3 is your DMA window, which would be very small, in msec level. But since you don't punt traffic between t1 and t2, I want to understand this window's length in your testing with your scale.

@jhli-cisco
Copy link
Author

jhli-cisco commented May 9, 2024

Thanks. The punt/inject picture looks good.
Can you comment on that in your local testing, what is the gap between t1 and t2 and under what kind of scale.
This window would grow as the scale goes high. The window between t1 and t2 would be your black out window. We need to make sure the neighbor devices would not time out.

@eddieruan-alibaba do you mean t2 to t3 which is the traffic disruption window?

No, I mean the window between t1 and t2. The window between t2 and t3 is your DMA window, which would be very small, in msec level. But since you don't punt traffic between t1 and t2, I want to understand this window's length in your testing with your scale.

@eddieruan-alibaba Ok, I dont have the exact number. Using standard T0 sonic-management topology, we make sure there is no bgp neighbor and port-channel flapping as part of the sonic-management suite that we add, same as warm-reboot testing. I can get measurement later, but hope this is not an issue for this document approval.

@jhli-cisco
Copy link
Author

@eddieruan-alibaba , updated the punt and inject change handling after further internal discussion. Please take a look.

@jhli-cisco
Copy link
Author

Hi @yxieca @rck-innovium @eddieruan-alibaba, updated document per review comments. Please review again and let me know if you have any further comments.

@yxieca yxieca requested a review from vaibhavhd July 15, 2024 17:00
@eddieruan-alibaba
Copy link
Contributor

"It can be seen from Figure 1, it is possible that punt-header-v1 reachs SONIC-v2 or inject-header-v2 reachs NPU-v1 during t1 to t2 window. The punt and inject header changes are rare and not commom. Currently punt and inject header data structure differences between V1 and V2 are handled case by case basis in S1 SDK internally."

Should it be a requirement to ask punt/inject headers should be backward compatible instead of assuming the change is rare? I would have concerns on this statement in routing domain. As feature set grows, punt/inject headers are changing release by release. :) For example, IPM may break your assumption unless you restrictedly follow backward compatible.

I would think this could be more easier to get supported if

  1. In DCN only
  2. In bugfix patch release

By the way, NV folks presented a warm reboot approach similar to this one in routing WG, except they cut hardware in half and write V2 in hardware and swap hardware from V1 to V2 with some flag, a.k.a similar to NCS 6K's approach but without control plane redundancy. Your V2 context is cached in software and run DMA to swap from V1 to V2. The meeting minutes could be found at https://lists.sonicfoundation.dev/g/sonic-wg-routing/wiki/37114.

@jhli-cisco
Copy link
Author

@eddieruan-alibaba We don’t want to explicitly put a restriction on punt/inject headers should be backward compatible, though it may be in most cases. But we would like to give a choice to vendor implementation whether they want to
provide backward compatibility for punt headers or drop the version mismatched punt packets depending on hardware capability.

@eddieruan-alibaba
Copy link
Contributor

@eddieruan-alibaba We don’t want to explicitly put a restriction on punt/inject headers should be backward compatible, though it may be in most cases. But we would like to give a choice to vendor implementation whether they want to provide backward compatibility for punt headers or drop the version mismatched punt packets depending on hardware capability.

The success of your approach relies on the assumption that the punt/inject header will remain unchanged or be backward compatible. Also, the current SONiC warm reboot does not rely on this assumption. I recommend explicitly stating this constraint to avoid any confusion or overestimation of this approach.

How this is achieved will depend on the choices of vendors and operators. Some may restrict their deployment to data center networks (DCN), while others may limit the approach to minor releases only.

call out punt-inject header compatibility
@jhli-cisco
Copy link
Author

@eddieruan-alibaba , updated it to call out backward compatibility explicitly. Please review.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
Status: MovedToBacklog
Development

Successfully merging this pull request may close these issues.

7 participants