diff --git a/.markdownlint.json b/.markdownlint.json new file mode 100644 index 00000000000..7068a1dccc9 --- /dev/null +++ b/.markdownlint.json @@ -0,0 +1,5 @@ +{ + "line-length": false, + "no-inline-html": false, + "fenced-code-language": false +} \ No newline at end of file diff --git a/MoM.html b/MoM.html index 778032dc38c..4709dd57af7 100755 --- a/MoM.html +++ b/MoM.html @@ -104,6 +104,56 @@

SONiC community meeting minutes

Links To Meeting Agenda Links To Minutes Of The meeting + +   Jan 09 2024    + SONiC long reset button press + MoM + + +   Jan 02 2024    + No Meeting + MoM + + +   Dec 26 2023    + No Meeting + MoM + + +   Dec 19 2023    + No Meeting + MoM + + +   Dec 12 2023    + Handle ASIC/SDK Health event + MoM + + +   Dec 05 2023    + 202405 Release planning + MoM + + +   Nov 28 2023    + No Meeting + MoM + + +   Nov 21 2023    + No Meeting + MoM + + +   Nov 14 2023    + No Meeting + MoM + + +   Nov 07 2023    + Wake-on-VLAN HLD + MoM +   Oct 31 2023    DHCPv4 - Specify Gateway explicit diff --git a/Supported-Devices-and-Platforms.html b/Supported-Devices-and-Platforms.html index 2c6893adf5d..1b0a8cfa0d6 100644 --- a/Supported-Devices-and-Platforms.html +++ b/Supported-Devices-and-Platforms.html @@ -1100,6 +1100,14 @@

Nvidia +Spectrum 3 +48x100 + 8x400 + + + +Nvidia SN4600C Nvidia Spectrum 3 @@ -1108,6 +1116,14 @@

Nvidia +Spectrum 3 +64x200G + + + +Nvidia SN4700 Nvidia Spectrum 3 @@ -1115,6 +1131,14 @@

Nvidia +Spectrum 4 +64x800G + + + Pegatron Porsche Nephos @@ -1203,6 +1227,14 @@

Broadcom +Trident 3 +48x25G+8x100G + + + Tencent TCS8400-24CC8CD Broadcom diff --git a/doc/BGP/BGP-supress-fib-pending.md b/doc/BGP/BGP-supress-fib-pending.md index b4ddf3dd236..007ceb724e9 100644 --- a/doc/BGP/BGP-supress-fib-pending.md +++ b/doc/BGP/BGP-supress-fib-pending.md @@ -111,7 +111,7 @@ High level requirements: Restrictions/Limitations: - MPLS, VNET routes are out of scope of this document -- Directly connected and static routes are announced by BGP regardless of their offload status +- Directly connected and kernel routes are announced by BGP regardless of their offload status ### 5. Architecture Design @@ -722,7 +722,7 @@ Due to additional response publishing in orchagent there might be a slight delay ### 10. Restrictions/Limitations - MPLS, VNET routes are out of scope of this document -- Directly connected and static routes are announced by BGP regardless of their offload status +- Directly connected and kernel routes are announced by BGP regardless of their offload status ### 11. Testing Requirements/Design diff --git a/doc/Container Hardening/SONiC_container_hardening_HLD.md b/doc/Container Hardening/SONiC_container_hardening_HLD.md new file mode 100644 index 00000000000..72568528860 --- /dev/null +++ b/doc/Container Hardening/SONiC_container_hardening_HLD.md @@ -0,0 +1,378 @@ +# SONiC Container Hardening # + +## Table of Content +- [SONiC Container Hardening](#sonic-container-hardening) + - [Table of Content](#table-of-content) + - [List of Tables](#list-of-tables) + - [Revision](#revision) + - [Scope](#scope) + - [Definitions/Abbreviations](#definitionsabbreviations) + - [1. Overview](#1-overview) + - [2. Requirements](#2-requirements) + - [3. Architecture Design](#3-architecture-design) + - [3.1 Root privileges](#31-root-privileges) + - [3.2 net=host](#32-nethost) + - [4. High-Level Design](#4-high-level-design) + - [4.1 Root privileges removal](#41-root-privileges-removal) + - [Docker privileges](#docker-privileges) + - [4.2 net=host optimization](#42-nethost-optimization) + - [How to check?](#how-to-check) + - [5. SAI API](#5-sai-api) + - [6. Configuration and management](#6-configuration-and-management) + - [6.1. Manifest (if the feature is an Application Extension)](#61-manifest-if-the-feature-is-an-application-extension) + - [6.2. CLI/YANG model Enhancements](#62-cliyang-model-enhancements) + - [6.3. Config DB Enhancements](#63-config-db-enhancements) + - [7. Warmboot and Fastboot Design Impact](#7-warmboot-and-fastboot-design-impact) + - [8. Restrictions/Limitations](#8-restrictionslimitations) + - [9. Testing Requirements/Design](#9-testing-requirementsdesign) + - [9.1 Unit Test cases](#91-unit-test-cases) + - [9.2 System Test cases](#92-system-test-cases) + - [10. 
Open/Action items - if any](#10-openaction-items---if-any) + - [Appendix A: Further reading](#appendix-a-further-reading) + - [Appendix B: Linux Capabilities](#appendix-b-linux-capabilities) + - [Appendix C: Container List](#appendix-c-container-list) + +## List of Tables +* [Table 1: Revision](#table-1-revision) +* [Table 2: Abbreviations](#table-2-abbreviations) +* [Table 3: Default Linux capabilities](#table-3-default-linux-capabilities) +* [Table 4: Extended Linux capabilities](#table-4-extended-linux-capabilities) + +## Revision +###### Table 1: Revision +| Rev | Date | Author | Change Description | +|:---:|:-----------:|:------------------:|-----------------------------------| +| 0.1 | | | Initial version | + +## Scope + +This section describes the requirements, goals, and recommendations of the container hardening item for SONiC. + +## Definitions/Abbreviations +###### Table 2: Abbreviations +| Definitions/Abbreviation | Description | +|--------------------------|--------------------------------------------| +| OS | Operating System | +| API | Application Programming Interface | +| SAI | Switch Abstraction Interface | + +## 1. Overview + +A container is a method of virtualizing and abstracting an OS for a subset of processes/services on top of a single host, with the purpose of giving them an environment in which to run and execute their tasks without affecting, or being affected by, neighboring containers/processes. + +In SONiC, we are deploying containers with the same full visibility and capabilities as the Linux host. + +This poses a security risk and vulnerability, as a single breached container means that the whole system is breached. + +To address this issue, we have composed this container hardening document, describing the security hardening requirements and definitions for all containers on top of SONiC. + +## 2. Requirements + +Our goal is to increase the security of SONiC so that an attack on a specific container will not compromise the whole system. + +To do so, we'll tackle the following areas: +1. Privileges +2. Network +3. Capabilities +4. Mount namespace +5. Cgroups +6. Etc. + +For now, we will focus on #1 & #2. + +Further guidelines and requirements will be added in the future on demand. + +## 3. Architecture Design + +### 3.1 Root privileges + +To remove root privileges from a specific container, we are required to remove the `--privileged` flag and add back the missing Linux capabilities the container actually requires, or alternatively adjust the container so that it does not require root privileges to perform any action. + +### 3.2 net=host + +Removing `net=host` is required to prevent the container from accessing the full network scope of the host and system. +Once it is removed, services that require external access and packet transfers between the container, the host, and the interfaces will start failing. +To overcome this obstacle, we have a few options: +- using `--net=bridge` and port forwarding + +## 4.
High-Level Design + +### 4.1 Root privileges removal +Removing the `--privileged` flag is done by editing the docker_image_ctl.j2 file: + +docker_image_ctl.j2 file + + docker create {{docker_image_run_opt}} \ # *Need to modify this parameter "docker_image_run_opt" to not contain the --privileged flag* + {%- if docker_container_name != "database" %} + --net=$NET \ + --uts=host \{# W/A: this should be set per-docker, for those dockers which really need host's UTS namespace #} + {%- endif %} + {%- if docker_container_name == "database" %} + -p 6379:6379 \ + {%- endif %} + -e RUNTIME_OWNER=local \ + {%- if install_debug_image == "y" %} + -v /src:/src:ro -v /debug:/debug:rw \ + {%- endif %} + {%- if '--log-driver=json-file' in docker_image_run_opt or '--log-driver' not in docker_image_run_opt %} + --log-opt max-size=2M --log-opt max-file=5 \ + {%- endif %} + +This will cause the generated container startup script to be altered in the following manner: + +**database.sh file** + + docker create --privileged -t -v /etc/sonic:/etc/sonic:ro \ # *Need to remove the --privileged flag* + -p 6379:6379 \ + -e RUNTIME_OWNER=local \ + --log-opt max-size=2M --log-opt max-file=5 \ + --tmpfs /tmp \ + $DB_OPT \ + $REDIS_MNT \ + -v /usr/share/sonic/device/$PLATFORM:/usr/share/sonic/platform:ro \ + --tmpfs /var/tmp \ + --env "NAMESPACE_ID"="$DEV" \ + --env "NAMESPACE_PREFIX"="$NAMESPACE_PREFIX" \ + --env "NAMESPACE_COUNT"=$NUM_ASIC \ + --name=$DOCKERNAME \ + docker-database:latest \ + || { + echo "Failed to docker run" >&1 + exit 4 + } + +#### Docker privileges +Removing the root privileges from the docker container removes some Linux capabilities that are inherited from the root-level permissions. + +Running the capabilities list command on a privileged container shows all capabilities captured in both [Table 3: Default Linux capabilities](#table-3-default-linux-capabilities) and [Table 4: Extended Linux capabilities](#table-4-extended-linux-capabilities): + + root@ce2c36a0b20c:/# capsh --print + Current: = cap_chown,cap_dac_override,cap_dac_read_search,cap_fowner,cap_fsetid,cap_kill,cap_setgid,cap_setuid,cap_setpcap,cap_linux_immutable,cap_net_bind_service,cap_net_broadcast,cap_net_admin,cap_net_raw,cap_ipc_lock,cap_ipc_owner,cap_sys_module,cap_sys_rawio,cap_sys_chroot,cap_sys_ptrace,cap_sys_pacct,cap_sys_admin,cap_sys_boot,cap_sys_nice,cap_sys_resource,cap_sys_time,cap_sys_tty_config,cap_mknod,cap_lease,cap_audit_write,cap_audit_control,cap_setfcap,cap_mac_override,cap_mac_admin,cap_syslog,cap_wake_alarm,cap_block_suspend,cap_audit_read+eip + +Running the capabilities list command on an unprivileged container shows only the capabilities captured in [Table 3: Default Linux capabilities](#table-3-default-linux-capabilities): + + root@ce2c36a0b20c:/# capsh --print + Current: cap_chown,cap_dac_override,cap_fowner,cap_fsetid,cap_kill,cap_setgid,cap_setuid,cap_setpcap,cap_net_bind_service,cap_net_raw,cap_sys_chroot,cap_mknod,cap_audit_write,cap_setfcap=eip + +If, for some reason, a docker must retain a specific capability (which is lost after removing the `--privileged` flag), we can do that with the following: + +In the docker-database.mk file, adjust this line: + + $(DOCKER_DATABASE)_RUN_OPT += -t --cap-add NET_ADMIN # Changed by removing the --privileged flag and adding the --cap-add flag + +### 4.2 net=host optimization + +Here we will provide a detailed example of how to switch from the `--net=host` configuration (host network) to the `--net=bridge` configuration paired with port
forwarding in a specific container. We are using the database container as an example for this item. + +The original docker creation looks like the example below: + +docker with host sharing: + + docker create --privileged -t -v /etc/sonic:/etc/sonic:ro \ + --net=$NET \ + -e RUNTIME_OWNER=local \ + --uts=host \ + --log-opt max-size=2M --log-opt max-file=5 \ + --tmpfs /tmp \ + $DB_OPT \ + $REDIS_MNT \ + -v /usr/share/sonic/device/$PLATFORM:/usr/share/sonic/platform:ro \ + --tmpfs /var/tmp \ + --env "NAMESPACE_ID"="$DEV" \ + --env "NAMESPACE_PREFIX"="$NAMESPACE_PREFIX" \ + --env "NAMESPACE_COUNT"=$NUM_ASIC \ + --name=database_no_net \ + --cap-drop=NET_ADMIN \ + docker-database:latest + +To disable the sharing of the networking stack between the host and a container we need to remove the flag: `--net=host`. Because we have not specified any `--network` flag, the containers connect to the default bridge network `--net=bridge`. +To support port forwarding we are required to add the flag: `-p <host_port>:<container_port>` + +The "new" docker creation file database.sh can be seen in the code block below: + +Docker with port forwarding and default bridge network + + docker create --privileged -t -v /etc/sonic:/etc/sonic:ro \ + **-p 6379:6379** \ + -e RUNTIME_OWNER=local \ + --uts=host \ + --log-opt max-size=2M --log-opt max-file=5 \ + --tmpfs /tmp \ + $DB_OPT \ + $REDIS_MNT \ + -v /usr/share/sonic/device/$PLATFORM:/usr/share/sonic/platform:ro \ + --tmpfs /var/tmp \ + --env "NAMESPACE_ID"="$DEV" \ + --env "NAMESPACE_PREFIX"="$NAMESPACE_PREFIX" \ + --env "NAMESPACE_COUNT"=$NUM_ASIC \ + --name=$DOCKERNAME \ + docker-database:latest \ + +**How did we do it?** + +To create a docker with the flags above, it is required to set the "new" flag in the file docker_image_ctl.j2. Find the call `docker create {{docker_image_run_opt}} \` and replace the `--net=$NET` flag generation as follows: +docker flag generation + + {%- if docker_container_name != "database" %} + --net=$NET \ + {%- endif %} + {%- if docker_container_name == "database" %} + -p 6379:6379 \ + {%- endif %} + +#### How to check? + +Go into the docker - `docker exec -it <container_name> bash` +Run `ifconfig`. + +On a docker with host network - you'll be able to view all physical interfaces. +On a docker without host network - you'll see only eth0 and lo. + +Note - we are not committing to user-defined bridges at this stage. +Once we manage to stabilize the system without host network and without root privileges on top of the containers, we can move to the next step of user-defined bridges. +This will either be an expansion of this HLD or an HLD of its own. + +## 5. SAI API + +N/A + +## 6. Configuration and management + +N/A - no configuration management/changes are required. + +### 6.1. Manifest (if the feature is an Application Extension) + +N/A + +### 6.2. CLI/YANG model Enhancements + +N/A +We are not adding CLI commands or management capabilities to the system with this item. + +### 6.3. Config DB Enhancements + +N/A - DB should remain the same + +## 7. Warmboot and Fastboot Design Impact + +No impact on any boot sequence, as this item should be seamlessly integrated into the system and achieve the same functionality level as before. + +## 8. Restrictions/Limitations + +## 9. Testing Requirements/Design + +To consider this item complete, we are required to run the full CI and check that nothing has been broken by the changes proposed in this HLD. +In addition, we should test that the mitigations are applicable for the relevant containers; a minimal verification sketch is shown below.
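+ +The following is a minimal verification sketch, assuming a hardened database container; the container name `database` and the expected values are illustrative assumptions, not part of the design: + + # Hypothetical check: the hardened container should not run privileged + docker inspect --format '{{.HostConfig.Privileged}}' database # expected: false + # Hypothetical check: the hardened container should not share the host network namespace + docker inspect --format '{{.HostConfig.NetworkMode}}' database # expected: bridge, not host + # Inside the container, the effective capability set should match Table 3 + # plus any capabilities explicitly added via --cap-add + docker exec database capsh --print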
+ +### 9.1 Unit Test cases + +N/A, this feature will be checked on a system level. + +### 9.2 System Test cases + +For general functionality flows, run the same test cases that we currently have on top of our system and verify that nothing broke. + +For additional security test cases, we should check that privileges and network capabilities have been removed. +net=host removal test: +1. Log in to a container with removed network capabilities +2. Run `ls /dev/` +3. Check that we do not have visibility of all host devices (no tty8/tty9, no sda, etc.) + +Privilege removal test: +1. Log in to a container without the `--privileged` flag +2. Check that you cannot access /etc/shadow +3. Check that you cannot edit the /boot folder, or any file in it, with vim + +## 10. Open/Action items - if any + +Currently, Nvidia and MSFT have scoped commitment for specific containers. +Redis and SNMP already have these adjustments. +What remains is to perform this container hardening for all other containers in the system so that the whole ecosystem will comply with these security hardening requirements. + +## Appendix A: Further reading + +[Linux Capabilities 101](https://linux-audit.com/linux-capabilities-101/) + +[Understanding Linux Capabilities](https://tbhaxor.com/understanding-linux-capabilities/) + +[Linux Namespaces Wiki](https://en.wikipedia.org/wiki/Linux_namespaces) + +## Appendix B: Linux Capabilities + +The following table lists the Linux capability options which are allowed by default and can be dropped. +###### Table 3: Default Linux capabilities +| Capability Key | Capability Description | +| ----------- | ----------- | +| AUDIT_WRITE | Write records to kernel auditing log | +| CHOWN | Make arbitrary changes to file UIDs and GIDs (see chown(2)). | +| DAC_OVERRIDE | Bypass file read, write, and execute permission checks. | +| FOWNER | Bypass permission checks on operations that normally require the file system UID of the process to match the UID of the file. | +| FSETID | Don’t clear set-user-ID and set-group-ID permission bits when a file is modified. | +| KILL | Bypass permission checks for sending signals | +| MKNOD | Create special files using mknod(2). | +| NET_BIND_SERVICE | Bind a socket to internet domain privileged ports (port numbers less than 1024). | +| NET_RAW | Use RAW and PACKET sockets | +| SETFCAP | Set file capabilities | +| SETGID | Make arbitrary manipulations of process GIDs and supplementary GID list. | +| SETPCAP | Modify process capabilities | +| SETUID | Make arbitrary manipulations of process UIDs. | +| SYS_CHROOT | Use chroot(2), change root directory. | + +The next table shows the capabilities which are not granted by default and may be added. +###### Table 4: Extended Linux capabilities +| Capability Key | Capability Description | +| ----------- | ----------- | +| AUDIT_CONTROL | Enable and disable kernel auditing; change auditing filter rules; retrieve auditing status and filtering rules. | +| AUDIT_READ | Allow reading the audit log via multicast netlink socket | +| BLOCK_SUSPEND | Allow preventing system suspends. | +| BPF | Allow creating BPF maps, loading BPF Type Format (BTF) data, retrieve JITed code of BPF programs, and more. | +| CHECKPOINT_RESTORE | Allow checkpoint/restore related operations. Introduced in kernel 5.9. | +| DAC_READ_SEARCH | Bypass file read permission checks and directory read and execute permission checks. | +| IPC_LOCK | Lock memory (mlock(2), mlockall(2), mmap(2), shmctl(2)).
| +| IPC_OWNER | Bypass permission checks for operations on System V IPC objects. | +| LEASE | Establish leases on arbitrary files (see fcntl(2)). | +| LINUX_IMMUTABLE | Set the FS_APPEND_FL and FS_IMMUTABLE_FL i-node flags. | +| MAC_ADMIN | Allow MAC configuration or state changes. Implemented for the Smack LSM. | +| MAC_OVERRIDE | Override Mandatory Access Control (MAC). Implemented for the Smack Linux Security Module (LSM). | +| NET_ADMIN | Perform various network-related operations. | +| NET_BROADCAST | Make socket broadcasts, and listen to multicasts. | +| PERFMON | Allow system performance and observability privileged operations using perf_events, i915_perf and other kernel subsystems | +| SYS_ADMIN | Perform a range of system administration operations. | +| SYS_BOOT | Use reboot(2) and kexec_load(2), reboot and load a new kernel for later execution. | +| SYS_MODULE | Load and unload kernel modules. | +| SYS_NICE | Raise process nice value (nice(2), setpriority(2)) and change the nice value for arbitrary processes. | +| SYS_PACCT | Use acct(2), switch process accounting on or off. | +| SYS_PTRACE | Trace arbitrary processes using ptrace(2). | +| SYS_RAWIO | Perform I/O port operations (iopl(2) and ioperm(2)). | +| SYS_RESOURCE | Override resource Limits | +| SYS_TIME | Set system clock (settimeofday(2), stime(2), adjtimex(2)); set real-time (hardware) clock. | +| SYS_TTY_CONFIG | Use vhangup(2); employ various privileged ioctl(2) operations on virtual terminals. | +| SYSLOG | Perform privileged syslog(2) operations. | +| WAKE_ALARM | Trigger something that will wake up the system | + +## Appendix C: Container List +| Container | Host Network Recommendation | Privilege Recommendation | Comments | +| ----------- | ----------- |----------- |-----------| +| Database | Remove host network |Remove container root privilege| Port forward| +| SNMP | Remove host network |Remove container root privilege| Port forward| +| Teamd | Remove host network |Remove container root privilege| Retain net_cap_admin| +| FRR | Retain |Remove container root privilege| Retain net_cap_admin| +| LLDP | Retain |Remove container root privilege| Retain net_cap_admin| +| DHCPrelay | Remove host network |Remove container root privilege| Retain net_cap_admin| +| Mux | Remove host network |Remove container root privilege| Retain net_cap_admin| +| Telemetry | Remove host network |Remove container root privilege| Port forward for gnmi | +| Radv | Remove host network |Remove container root privilege| Might need additional capabilities for L2 data| +| RestAPI | Remove host network |Remove container root privilege| Planned for deprecation | +| Eventd | Remove host network |Remove container root privilege| | +| iccpd | Remove host network |Remove container root privilege| | +| macsec | Remove host network |Remove container root privilege| | +| NAT | Remove host network |Remove container root privilege| Retain net_cap_admin | +| SWSS | Retain |Retain root privilege| | +| syncd | Retain |Retain root privilege| | +| PMON | Remove host network |Remove container root privilege| Check file descriptor privileges | +| sFlow | Remove host network |Remove container root privilege| | +| Management Framework | TBD |TBD| | +| P4rt | TBD |TBD| | diff --git a/doc/SONiC_202205_Release_Notes.md b/doc/SONiC_202205_Release_Notes.md index f631abec473..5e173089954 100644 --- a/doc/SONiC_202205_Release_Notes.md +++ b/doc/SONiC_202205_Release_Notes.md @@ -10,6 +10,7 @@ This document captures the new features added and enhancements done on existing 
* [Dependency Version](#dependency-version) * [Security Updates](#security-updates) * [Feature List](#feature-list) + * [Known Issues](#known-issues) * [SAI APIs](#sai-apis) * [Contributors](#contributors) @@ -328,6 +329,11 @@
**Pull Requests** : [10047](https://github.com/sonic-net/sonic-buildimage/pull/10047) +# Known Issues +On the 202205 release image, a difference of 0.2 - 0.3 sec is observed (for slower CPUs) when running show CLIs. This is reflected in most of the show CLIs, since many of them import device_info, which is still using swsssdk in the 202205 release. This is a known observation on this 202205 image. + +This known issue has been fixed in the 202211 release through [PR#10099](https://github.com/sonic-net/sonic-buildimage/pull/10099). As mentioned in [PR#16595](https://github.com/sonic-net/sonic-buildimage/issues/16595), the fix is not backported to the 202205 branch and hence the issue will continue to exist in the 202205 image. + # SAI APIs Please find the list of API's classified along the newly added SAI features. For further details on SAI API please refer [SAI_1.10.2 Release Notes](https://github.com/opencomputeproject/SAI/blob/master/doc/SAI_1.10.2_ReleaseNotes.md) diff --git a/doc/SONiC_202311_Release_Notes.md b/doc/SONiC_202311_Release_Notes.md new file mode 100644 index 00000000000..f1d5717bffb --- /dev/null +++ b/doc/SONiC_202311_Release_Notes.md @@ -0,0 +1,104 @@ +# SONiC 202311 Release Notes + +This document captures the new features added and enhancements done on existing features/sub-features for the SONiC [202311](https://github.com/orgs/sonic-net/projects/14/views/1) release. + + + +# Table of Contents + + * [Branch and Image Location](#branch-and-image-location) + * [Dependency Version](#dependency-version) + * [Security Updates](#security-updates) + * [Feature List](#feature-list) + * [SAI APIs](#sai-apis) + * [Contributors](#contributors) + + +# Branch and Image Location + +Branch : https://github.com/Azure/sonic-buildimage/tree/202311
+Image : https://sonic-build.azurewebsites.net/ui/sonic/pipelines (Example - Image for Broadcom based platforms is [here](https://sonic-build.azurewebsites.net/ui/sonic/pipelines/138/builds/51255/artifacts/98637?branchName=master&artifactName=sonic-buildimage.broadcom)) + +# Dependency Version + +|Feature | Version | +| ------------------------- | --------------- | +| Linux kernel version | linux_5.10.0-23-2-$(5.10.179) | +| SAI version | SAI v1.13.3 | +| FRR | 8.5.1 | +| LLDPD | 1.0.16-1+deb12u1 | +| TeamD | 1.30-1 | +| SNMPD | 5.9+dfsg-4+deb11u1 | +| Python | 3.9.2-1 | +| syncd | 1.0.0 | +| swss | 1.0.0 | +| radvd | 2.18-3 | +| isc-dhcp | 4.4.1-2.3+deb11u2 | +| sonic-telemetry | 1.1 | +| redis-server/ redis-tools | 5.0.3-3~bpo9+2 | +| Debian version | Continues to use Bullseye (Debian version 11) | + +Note : The kernel version is migrated to the version mentioned in the first row of the above 'Dependency Version' table. + + +# Security Updates + +1. Kernel upgraded from 5.10.103-1 to 5.10.136-1 for SONiC release.
+ Change log: https://cdn.kernel.org/pub/linux/kernel/v5.x/ChangeLog-5.10.136 + +2. Docker upgraded from 24.0.2-debian-stretch to 24.0.7-debian-stretch
+ Change log: https://docs.docker.com/engine/release-notes/24.0/#2407 + + +# Feature List + +| Feature| Feature Description | HLD PR / PR tracking | Quality | +| ------ | ------- | -----|-----| +| ***[DASH] ACL tags HLD*** | In a DASH SONiC, a service tag represents a group of IP address prefixes from a given service. The controller manages the address prefixes encompassed by the service tag and automatically updates the service tag as addresses change, minimizing the complexity of frequent updates to network security rules. Mapping a prefix to a tag can reduce the repetition of prefixes across different ACL rules and optimize memory usage. | [1427](https://github.com/sonic-net/SONiC/pull/1427) | [Alpha](https://github.com/sonic-net/SONiC/blob/master/doc/SONiC%20feature%20quality%20definition.md) | +| ***AMD-Pensando ELBA SOC support*** | This patchset adds support for AMD-Pensando ELBA SOC. Elba provides a secure, controlled portal to network services, storage, and the data center control plane. This SOC is used in AMD-Pensando PCI Distributed Services Card (DSC).| [322](https://github.com/sonic-net/sonic-linux-kernel/pull/322) | [Alpha](https://github.com/sonic-net/SONiC/blob/master/doc/SONiC%20feature%20quality%20definition.md) | +| ***Auto FEC*** | This feature delivers a deterministic approach when FEC and autoneg are configured together which is currently left to vendor implementation. | [1416](https://github.com/sonic-net/SONiC/pull/1416) | [Alpha](https://github.com/sonic-net/SONiC/blob/master/doc/SONiC%20feature%20quality%20definition.md) | +| ***Banner HLD*** | This feature covers the definition, design and implementation of SONiC Banner feature and Banner CLI. |[1361](https://github.com/sonic-net/SONiC/pull/1361)| [Alpha](https://github.com/sonic-net/SONiC/blob/master/doc/SONiC%20feature%20quality%20definition.md) | +| ***FRR version 8.5.1 Upgrade*** | This feature is achieved with the implementation of new FRR 8.5.1 integration | [15965](https://github.com/sonic-net/sonic-buildimage/pull/15965) | [Alpha](https://github.com/sonic-net/SONiC/blob/master/doc/SONiC%20feature%20quality%20definition.md) | +| ***Build improvements changes*** | This feature adds optimization for the SONiC image build by splitting the final build step into two stages. It allows running the first stage in parallel, improving build time. | [1413](https://github.com/sonic-net/SONiC/issues/1413) & [15924](https://github.com/sonic-net/sonic-buildimage/pull/15924) | [Alpha](https://github.com/sonic-net/SONiC/blob/master/doc/SONiC%20feature%20quality%20definition.md) | +| ***CMIS host management - Port signal integrity per speed*** | This feature provides general information about configuring port signal integrity per speed in SONiC. | [1376](https://github.com/sonic-net/SONiC/issues/1376) & [1455](https://github.com/sonic-net/SONiC/pull/1455) | [Alpha](https://github.com/sonic-net/SONiC/blob/master/doc/SONiC%20feature%20quality%20definition.md) | +| ***CMIS Module Management Enhancement HLD *** | This feature is to enhance host_tx_ready set process to State DB, to have full synchronization between asic and module configuration. 
| [1453](https://github.com/sonic-net/SONiC/pull/1453) | [Alpha](https://github.com/sonic-net/SONiC/blob/master/doc/SONiC%20feature%20quality%20definition.md) | +| ***Container Hardening*** | This feature implements the container hardening, containing the security hardening requirements and definitions for all containers on top of SONiC | [1364](https://github.com/sonic-net/SONiC/pull/1364) | [Alpha](https://github.com/sonic-net/SONiC/blob/master/doc/SONiC%20feature%20quality%20definition.md) | +| ***Create CMIS-custom-SI-settings.md*** | This feature is to apply host defined SI parameters to CMIS supported modules. | [1334](https://github.com/sonic-net/SONiC/pull/1334) | [Alpha](https://github.com/sonic-net/SONiC/blob/master/doc/SONiC%20feature%20quality%20definition.md) | +| ***Egress Sflow Enhancement.*** | This feature updates the existing sFlow HLD for egress Sflow support. | [1268](https://github.com/sonic-net/SONiC/pull/1268) | [Alpha](https://github.com/sonic-net/SONiC/blob/master/doc/SONiC%20feature%20quality%20definition.md) | +| ***Factory reset*** | This feature implements the support for reset factory feature in Sonic OS. | [1231](https://github.com/sonic-net/SONiC/pull/1231) | [Alpha](https://github.com/sonic-net/SONiC/blob/master/doc/SONiC%20feature%20quality%20definition.md) | +| ***Fix containers deployments dependencies on boot/config_reload affecting user experience*** | Currently hostcfgd controls the services based on the feature table. The feature table has a specific field 'has_timer' for the non essential services which needs to be delayed during the reboot flow. This field will be now replaced by new field called "delayed". These services will controlled by hostcfgd. | [1203](https://github.com/sonic-net/SONiC/pull/1203) & [1379](https://github.com/sonic-net/SONiC/issues/1379) | [Alpha](https://github.com/sonic-net/SONiC/blob/master/doc/SONiC%20feature%20quality%20definition.md) | +| ***gNMI Master Arbitration*** | For high availability, a system may run multiple replicas of a gNMI client. Among the replicas, only one client should be elected as master and do gNMI operations that mutate any state on the target. However, in the event of a network partition, there can be two or more replicas thinking themselves as master. But if they both call the `Set` RPC, the target may be incorrectly configured by the stale master. Therefore, "Master Arbitration" is needed when multiple clients exist. | [1285](https://github.com/sonic-net/SONiC/pull/1285) & [1240](https://github.com/sonic-net/SONiC/issues/1240) | [Alpha](https://github.com/sonic-net/SONiC/blob/master/doc/SONiC%20feature%20quality%20definition.md) | +| ***High-level design for Wake-on-LAN feature in SONiC*** | This feature implements the Wake-on-LAN feature design in SONiC. Wake-on-LAN (WoL or WOL) is an Ethernet or Token Ring computer networking standard that allows a computer to be turned on or awakened from sleep mode by a network message. | [1508](https://github.com/sonic-net/SONiC/pull/1508) | [Alpha](https://github.com/sonic-net/SONiC/blob/master/doc/SONiC%20feature%20quality%20definition.md) | +| ***Libvs Port Counter Support*** | In sonic-vs 'show interface counters' is not supported (port counters set to zero). The counter support would be useful for debugging and automation. As part of this feature the basic port counters are fetched from corresponding host interface net stat. 
| [1398](https://github.com/sonic-net/SONiC/issues/1398) & [1275](https://github.com/sonic-net/sonic-sairedis/pull/1275) | [Alpha](https://github.com/sonic-net/SONiC/blob/master/doc/SONiC%20feature%20quality%20definition.md) | +| ***NAT Bookworm Upgrade*** | This feature updates the fullcone NAT patch in sonic-linux-kernel needs to be updated for Linux 6.1. | [1519](https://github.com/sonic-net/SONiC/issues/1519), [16867](https://github.com/sonic-net/sonic-buildimage/issues/16867) & [357](https://github.com/sonic-net/sonic-linux-kernel/pull/357) | [Alpha](https://github.com/sonic-net/SONiC/blob/master/doc/SONiC%20feature%20quality%20definition.md) | +| ***NTP: Additional NTP configuration knobs + NTP server provisioning*** | This SONiC Network Time Protocol feature covers Configuring NTP global parameters, Adding/removing new NTP servers, Change the configuration for NTP servers, Show NTP status & Show NTP configuration | [1296](https://github.com/sonic-net/SONiC/pull/1296) & [1254](https://github.com/sonic-net/SONiC/issues/1254) | [Alpha](https://github.com/sonic-net/SONiC/blob/master/doc/SONiC%20feature%20quality%20definition.md) | +| ***PDDF System Fan Enhancement*** | Current PDDF design supports only 12 individual fans (if 2 fans per tray then total of 6 fantrays). However, some platform have more fans. To support those platforms via PDDF, we added support for more fans in common fan PDDF drivers. | [15956](https://github.com/sonic-net/sonic-buildimage/pull/15956) & [1440](https://github.com/sonic-net/SONiC/issues/1440) | [Alpha](https://github.com/sonic-net/SONiC/blob/master/doc/SONiC%20feature%20quality%20definition.md) | +| ***PDDF support for Ufispace platforms and GPIO extension*** | This feature adds the PDDF support on Ufispace platforms with Broadcom ASIC for S9110-32X, S8901-54XC, S7801-54XS, S6301-56ST | [16017](https://github.com/sonic-net/sonic-buildimage/pull/16017) & [1441](https://github.com/sonic-net/SONiC/issues/1441)| [Alpha](https://github.com/sonic-net/SONiC/blob/master/doc/SONiC%20feature%20quality%20definition.md) | +| ***Persistent DNS address across reboots*** | With the current implementation dynamic DNS configuration can be received from the DHCP server or static configuration can be set manually by the user. However, SONiC doesn't provide any protection for the static configuration. The configuration that is set by the user can be overwritten with the dynamic configuration at any time. The proposed solution is to add support for static DNS configuration into Config DB. To be able to choose between dynamic and static DNS configurations resolvconf package. | [1380](https://github.com/sonic-net/SONiC/issues/1380), [13834](https://github.com/sonic-net/sonic-buildimage/pull/13834), [14549](https://github.com/sonic-net/sonic-buildimage/pull/14549), [2737](https://github.com/sonic-net/sonic-utilities/pull/2737), [49](https://github.com/sonic-net/sonic-host-services/pull/49), [1322](https://github.com/sonic-net/SONiC/pull/1322), [8436](https://github.com/sonic-net/sonic-mgmt/pull/8436) & [8712](https://github.com/sonic-net/sonic-mgmt/pull/8712) | [Alpha](https://github.com/sonic-net/SONiC/blob/master/doc/SONiC%20feature%20quality%20definition.md) | +| ***RADIUS NSS Vulnerability*** | The nss library uses popen to execute useradd and usermod commands. Popen executes using a shell (/bin/sh) which is passed the command string with "-c". 
This means that if untrusted user input is supplied, unexpected shell escapes can occur. To overcome this, we have suggested to use execle instead of popen to avoid shell escape exploits. | [1399](https://github.com/sonic-net/SONiC/issues/1399) | [Alpha](https://github.com/sonic-net/SONiC/blob/master/doc/SONiC%20feature%20quality%20definition.md) | +| ***[SNMP]: SONiC SNMP Changes to support IPv6*** | The feature captures the changes required to support SNMP over IPv6 for single asic platforms. | [1457](https://github.com/sonic-net/SONiC/pull/1457) | [Alpha](https://github.com/sonic-net/SONiC/blob/master/doc/SONiC%20feature%20quality%20definition.md) | +| ***SSH global config*** | This feature introduces a procedure to configure ssh server global settings. This feature will include 3 configurations in the first phase, but can be extended easily to include additional configurations. | [1169](https://github.com/sonic-net/SONiC/issues/1169), [1075](https://github.com/sonic-net/SONiC/pull/1075) | [Alpha](https://github.com/sonic-net/SONiC/blob/master/doc/SONiC%20feature%20quality%20definition.md) | +| ***Sflow 800G Support*** | This feature enhances the current sFlow in sonic, with additional speed due to new ASICs support for 800G. | [1383](https://github.com/sonic-net/SONiC/issues/1383), [2799](https://github.com/sonic-net/sonic-swss/pull/2799) & [2805](https://github.com/sonic-net/sonic-swss/pull/2805) | [Alpha](https://github.com/sonic-net/SONiC/blob/master/doc/SONiC%20feature%20quality%20definition.md) | +| ***TACACS NSS Vulnerability*** | The nss library uses popen to execute useradd and usermod commands. Popen executes using a shell (/bin/sh) which is passed the command string with "-c". This means that if untrusted user input is supplied, unexpected shell escapes can occur. To overcome this, we have suggested to use execle instead of popen to avoid shell escape exploits. | [1464](https://github.com/sonic-net/SONiC/issues/1464) | [Alpha](https://github.com/sonic-net/SONiC/blob/master/doc/SONiC%20feature%20quality%20definition.md) | +| ***UMF: Additional Optimizations for Transformer Infrastructure*** | This feature offers additional optimizational enhancements & bug-fixes for transformer infrastructure. | [1463](https://github.com/sonic-net/SONiC/issues/1463) | [Alpha](https://github.com/sonic-net/SONiC/blob/master/doc/SONiC%20feature%20quality%20definition.md) | +| ***UMF Infra Enhancement for SONIC-YANG*** | This implements the option to import specific sonic yangs from buildimage sonic-yang-models directory into UMF & CVL enhancement to handle handle singleton tables modeled as a container instead of the usual _LIST syntax | [1397](https://github.com/sonic-net/SONiC/issues/1397) | [Alpha](https://github.com/sonic-net/SONiC/blob/master/doc/SONiC%20feature%20quality%20definition.md) | +| ***UMF Subscription Infra Phase 2*** | This feature implements the SONiC Telemetry service and Translib infrastructure changes to support gNMI subscriptions and wildcard paths for YANG defined paths. 
| [1287](https://github.com/sonic-net/SONiC/pull/1287) | [Alpha](https://github.com/sonic-net/SONiC/blob/master/doc/SONiC%20feature%20quality%20definition.md) | +| ***Upgrade hsflowd and remove dropmon build flags*** | TBD | [1378](https://github.com/sonic-net/SONiC/issues/1378) | TBD | +| ***Virtual SONiC Network Helper*** | This feature implements the vsnet tool to create a network of virtual SONiC instances | [8459](https://github.com/sonic-net/sonic-mgmt/pull/8459) | [Alpha](https://github.com/sonic-net/SONiC/blob/master/doc/SONiC%20feature%20quality%20definition.md) | + + +Note : The HLD PRs have been updated in the "HLD PR / PR tracking" column. The code PRs that are part of the features are mentioned within the HLD PRs. The code PRs not mentioned in HLD PRs are listed in the "HLD PR / PR tracking" column along with the HLD PRs. + +# SAI APIs + +Please find the list of APIs classified under the newly added SAI features. For further details on the SAI API, please refer to the [SAI_1.13.3 Release Notes](https://github.com/opencomputeproject/SAI/blob/master/doc/SAI_1.13.3_ReleaseNotes.md) + + +# Contributors + +The SONiC community would like to thank all the contributors from various companies and the individuals who have contributed to the release. Special thanks to the major contributors - AMD, Aviz Networks, Broadcom, Capgemini, Centec, Cisco, Dell, eBay, Edge core, Google, InMon, Inspur, Marvell, Micas Networks, Microsoft, NTT, Nvidia, Orange, Ufispace, xFlow Research Inc.
+ + + diff --git a/doc/TWAMP/SONiC-TWAMP-Ligth-HLD.md b/doc/TWAMP/SONiC-TWAMP-Ligth-HLD.md new file mode 100644 index 00000000000..ba27e81aa47 --- /dev/null +++ b/doc/TWAMP/SONiC-TWAMP-Ligth-HLD.md @@ -0,0 +1,1021 @@ +# TWAMP Light HLD # + +## Table of Contents + + - [Definitions/Abbreviation](#definitionsabbreviation) + - [Revision](#Revision) + - [Scope](#Scope) + - [Definition/Abbreviation](#Definition_Abbreviation) + - [1. Overview](#1-Overview) + - [2. Requirements](#2-Requirements) + - [2.1 Functional requirements](#2_1-Functional_requirements) + - [2.2 Configuration and Management Requirements](#2_2-Configuration_and_Management_Requirements) + - [2.3 Scalability and Default Values](#2_3-Scalability_and_Default_Values) + - [2.3.1 Scalability](#2_3_1-Scalability) + - [2.3.2 Default Values](#2_3_2-Default_Values) + - [2.4 Warm Restart requirements](#2_4-Warm_Restart_requirements) + - [3. Architecture Design](#3-Architecture_Design) + - [4. High-Level Design](#4-High_Level_Design) + - [4.1 DB Changes](#4_1-DB_Changes) + - [4.1.1 CONFIG DB](#4_2_1-CONFIG_DB) + - [4.1.1.1 CFG_TWAMP_SESSION_TABLE](#4_1_1_1-CFG_TWAMP_SESSION_TABLE) + - [4.1.2 STATE DB](#4_2_2-STATE_DB) + - [4.1.2.1 STATE_TWAMP_SESSION_TABLE](#4_1_2_1-STATE_TWAMP_SESSION_TABLE) + - [4.1.2.2 STATE_SWITCH_CAPABILITY_TABLE](#4_1_2_2-STATE_SWITCH_CAPABILITY_TABLE) + - [4.1.3 COUNTER DB](#4_2_3-COUNTER_DB) + - [4.1.3.1 TWAMP Light counter table](#4_1_3_1-TWAMP_Light_counter_table) + - [4.2 Flow Diagrams](#4_2-Flow_Diagrams) + - [4.2.1 Query ASIC capability](#4_2_1-Query_ASIC_capability) + - [4.2.2 Create session](#4_2_2-Create_session) + - [4.2.3 Remove session](#4_2_3-Remove_session) + - [4.2.4 Start/Stop session](#4_2_4-Restart_Stop_session) + - [4.2.5 Notify Session-Sender event](#4_2_5-Notify_Session_Sender_event) + - [5. SAI API](#5-SAI_API) + - [6. Configuration and management](#6-Configuration_and_management) + - [6.1 Manifest](#6_1-Manifest) + - [6.2 CLI](#6_2-CLI) + - [6.2.1 Configuration commands](#6_2_1-Configuration_commands) + - [6.2.1.1 Session-Sender with packet-count mode](#6_2_1_1-Session_Sender_with_packet_count_mode) + - [6.2.1.2 Session-Sender with continuous mode](#6_2_1_2-Session_Sender_with_continuous_mode) + - [6.2.1.3 Start TWAMP Light Session-Sender](#6_2_1_3-Start_TWAMP_Light_Session_Sender) + - [6.2.1.4 Stop TWAMP Light Session-Sender](#6_2_1_4-Stop_TWAMP_Light_Session_Sender) + - [6.2.1.5 Session-Reflector](#6_2_1_5-Session_Reflector) + - [6.2.1.6 Remove TWAMP Light session](#6_2_1_6-Remove_TWAMP_Light_session) + - [6.2.2 Show commands](#6_2_1-Show_commands) + - [6.2.2.1 Show TWAMP Light session status](#6_2_2_1-Show_TWAMP_Light_session_status) + - [6.2.2.2 Show TWAMP Light latency and jitter](#6_2_2_2-Show_TWAMP_Ligth_latency_and_jitter) + - [6.2.2.3 Show TWAMP Light packet loss](#6_2_2_3-Show_TWAMP_Light_packet_loss) + - [6.3 YANG](#6_2-YANG) + - [7. Warmboot and Fastboot Design Impact](#7-Warmboot_and_Fastboot_Design_Impact) + - [8. Restrictions/Limitations](#8-Restrictions_Limitations) + - [9.
Testing Requirements/Design](#9-Testing_Requirements_Design) + - [9.1 Unit Test cases](#9_1-Unit_Test_cases) + - [9.2.1 CLI](#9_2_1-CLI) + - [9.2.2 Functionality](#9_2_2-Functionality) + - [9.2.3 Scaling](#9_2_3-Scaling) + - [9.2.4 SAI](#9_2_4-SAI) + - [References](#-References) + +## Revision + +| Rev | Date | Author | Change Description | +| :--: | :--------: | :-----------------------------------: | ------------------ | +| 0.1 | 06/12/2023 | Xiaodong Hu, Shuhua You, Xianghong Gu | Initial version | + +## Scope + +This document describes the requirements, architecture, and configuration details of the TWAMP Light feature in SONiC. + +## Definition/Abbreviation + +| **Definitions/Abbreviation** | Description | +| ---------------------------- | ----------------------------------------- | +| TWAMP | Two-way Active Measurement Protocol | +| TWAMP Light | A light version of TWAMP | +| AR | Augmented Reality | +| IPPM | IP Performance Measurement | +| OAM | Operation, Administration and Maintenance | +| SAI | Switch Abstraction Interface | +| RPM | Real-Time Performance Monitor | +| RTT | Round-Trip Time | +| VRF | Virtual Routing and Forwarding | + +## 1 Overview + +With the rapid development of network application technologies, latency- and packet-loss-sensitive services such as audio, video, and AR impose higher requirements on network performance. In this scenario, service providers need a mechanism that provides real-time information about the quality and latency of network paths to help them monitor and maintain the quality of their networks. + +Some vendors use proprietary features such as RPM to measure IP network performance, but such features are not universally supported by devices. In a multi-vendor environment, service providers need an effective, simple, and universal OAM mechanism for performance measurement of the network. + +There are some standard measurement tools such as Ping and Traceroute. Ping is a tool used to test connectivity and measure RTT, and Traceroute is a tool used to identify the path that packets take between two hosts, but TWAMP Light serves different purposes from them. TWAMP Light measures the quality and latency of the network path in both directions, providing a more accurate view of the network performance. + +TWAMP Light, defined by the IP Performance Measurement (IPPM) working group, is a standard performance measurement protocol applied to IP networks, as described in RFC5357. It provides a unified measurement model and unified test packet format for interoperability among devices of different vendors. TWAMP Light uses the client-server model. It generates and maintains the performance measurement data only on the client. As an IP link detection technology, it can monitor network quality, including latency, jitter, and packet loss, so it is easy to deploy and use. + +![TWAMP Light role](./images/TWAMP_Light_role.png) + +TWAMP Light does not require a control connection to set up and manage the measurement sessions. Instead, it uses a simple message exchange process between two endpoints to initiate and terminate the measurements. Hence, the roles of Control-Client, Server and Session-Sender are implemented in one host referred to as the controller, and the role of Session-Reflector is implemented in another host referred to as the responder, as described in RFC5357.
TWAMP Light is a two-way measurement, which is common in IP networks, primarily because synchronization between local and remote clocks is unnecessary for round-trip delay, and measurement support at the remote end may be limited to a simple echo function. + +To calculate the latency, jitter, and packet loss rate of a network, the following steps are typically followed: + +- The Session-Sender, as the active endpoint of TWAMP Light, sends a Test-request packet with a tx timestamp marked as t0. +- The Session-Reflector receives the Test-request packet and captures an rx timestamp as t1. +- Prior to the transmission of the Test-response packet, the Session-Reflector captures a tx timestamp as t2. +- The Session-Reflector encodes t1 and t2 into the Test-response packet, then sends it out. +- The Session-Sender receives the Test-response packet and captures an rx timestamp as t3. + +The algorithm is: + +- Latency = (t3-t0) - (t2-t1) +- Jitter = | Latency1 - Latency0 | +- Packet loss rate = (txPkt - rxPkt) / txPkt + +## 2 Requirements + +### 2.1 Functional requirements + +At a high level the following should be supported: + +Phase #1 + +- RFC5357 TWAMP Light should be followed +- Should be able to perform the role of TWAMP Light Session-Sender: + + - Support on-demand measurement based on a specific number of packets + - Support proactive measurement + - Support starting and stopping the sending of TWAMP-Test packets +- Should be able to perform the role of TWAMP Light Control-Client: + + - Maintain the performance measurement data including latency, jitter and packet loss +- Should be able to perform the role of TWAMP Light Session-Reflector: + + - Support reflecting TWAMP-Test packets to the sender +- Should be able to configure TWAMP-Test packet fields: Source IP, L4 Source Port, Destination IP, L4 Destination Port, DSCP, TTL, timestamp format, padding + +Later phases: + + - Support the authenticated and encrypted modes + - Support overlay network measurement + - Support an approach of measurement based on pure software + +### 2.2 Configuration and Management Requirements + +- Support CLI configuration commands +- Support show commands + +### 2.3 Scalability and Default Values + +#### 2.3.1 Scalability + +In hardware mode, the maximum number of sessions is determined by the hardware capability. + +In software mode, the maximum number of sessions is not limited. + +#### 2.3.2 Default Values + +The table below shows the default values for TWAMP Test-packets. + +| Attribute | Value | +| -------------------------------------- | ----- | +| TWAMP_SESSION_DEFAULT_DSCP | 0 | +| TWAMP_SESSION_DEFAULT_TTL | 255 | +| TWAMP_SESSION_DEFAULT_TIMESTAMP_FORMAT | NTP | + +### 2.4 Warm Restart requirements + +At this time, warm restart capability is not factored into the design. This shall be revisited as a later phase enhancement. + +## 3 Architecture Design + +We introduce two approaches to implementing TWAMP Light measurement. One is based on the ASIC and the other is based on pure software. In the orchagent initialization phase, the TWAMP Light capability can be queried from the ASIC to choose which approach to deploy. + +![TWAMP Light architecture](./images/TWAMP_Light_architecture.png) + +The architecture considers both software and hardware solutions, with the main difference being whether the generation, transmission, and calculation of test packets is done in hardware or software. If the chip supports the TWAMP Light feature, the hardware solution provides more precise measurement results; if not, the software solution can be chosen for TWAMP Light.
+ +* TWAMP Light architecture using the hardware solution. A process, twamporch, is newly added in the hardware solution. Twamporch subscribes to TWAMP_LIGHT_TABLE and creates a session that includes properties such as IP and UDP port, and offloads it to the ASIC. The ASIC generates the test packets, independently calculates the measurement data, and reports it to the upper-layer system. + +* TWAMP Light architecture using the software solution. A TWAMP container is newly added. The container includes twampd, which subscribes to TWAMP_LIGHT_TABLE and creates a session that includes properties such as IP and UDP port. Twampd runs two threads: one generates TWAMP-Test packets and sends them out via a Linux socket, and the other receives TWAMP-Test packets from the Linux socket. For the Session-Sender, twampd is required to calculate the latency, jitter and packet loss rate based on measurement data such as timestamps and transmitted/received packet counts. For the Session-Reflector, twampd reflects the TWAMP-Test packet with timestamps. For TWAMP-Test packets, an ACL entry with packet fields such as ip and udp_port will be installed to the ASIC when creating a TWAMP Light session. Also, a new host interface trap will be added, and this trap is configurable in COPP rules. + +## 4 High-Level Design + +### 4.1 DB Changes + +#### 4.1.1 CONFIG DB + +##### 4.1.1.1 CFG_TWAMP_SESSION_TABLE + +Producer: Configuration +Consumer: TWAMP Light Orch Agent +Description: New table to store TWAMP Light session configuration. Applications can use it to get the TWAMP Light session configuration. +Schema: + +``` +;Stores TWAMP Light sender or reflector session configuration +;Status: work in progress +key = TWAMP_SESSION|session_name ; session_name is + ; unique session identifier + +;field = value +mode = "LIGHT" ; TWAMP Light mode +role = SENDER/REFLECTOR ; TWAMP Light role is sender or reflector +vrf = ; vrf name +src_ip = ; sender ip address +dst_ip = ; reflector ip address +udp_src_port = ; sender udp port (862, 863, 1025-65535) +udp_dst_port = ; reflector udp port (862, 863, 1025-65535) +packet_count = ; TEST-Request packet count (100 to 30000, DEF:100) +monitor_time = ; continuous in secs (0 indicates forever) +tx_interval = ; transmit TEST-Request in msecs ([10,100,1000], DEF:100) +timeout = ; timeout in secs (1 to 10, DEF:5) +statistics_interval = ; calculate in msecs (2000 to 3600000) +test_session_enable = ENABLE/DISABLE ; session is enabled or disabled +dscp = ; TEST-Request packet DSCP (0 to 63, DEF:0) +ttl = ; TEST-Request packet TTL (DEF:255) +timestamp_format = NTP/PTP ; TEST-Request packet timestamp format +packet_padding = ; TEST-Request packet padding, e.g., 00,55,aa,ff +packet_padding_size = ; TEST-Request packet padding length (32 to 1454, DEF:108) +``` + +#### 4.1.2 STATE DB + +##### 4.1.2.1 STATE_TWAMP_SESSION_TABLE + +Producer: syncd +Consumer: Applications over TWAMP Light +Description: New table to provide TWAMP Light session states to applications running over TWAMP Light +Schema: + +``` +;Stores TWAMP Light state table +;Status: work in progress +key = TWAMP_SESSION_TABLE|session_name ; session_name is + ; unique session identifier + +;field = value +status = ACTIVE/INACTIVE ; session test status +``` + +##### 4.1.2.2 STATE_SWITCH_CAPABILITY_TABLE + +Producer: TwampOrch +Consumer: TWAMP Light Orch Agent +Description: New attribute to store TWAMP Light ASIC capability +Schema: + +``` +;Add one new attribute to the existing SWITCH_CAPABILITY_TABLE to control whether the hardware or software solution should be chosen.
+ +;field = value +MAX_TWAMP_SESSION_COUNT = +``` + +#### 4.1.3 COUNTER DB + +##### 4.1.3.1 TWAMP Light counter table + +Producer: syncd +Consumer: Applications over TWAMP Light +Description: New table to store TWAMP Light performance measurement data per session +Schema: + +``` +COUNTERS_TWAMP_SESSION_NAME_MAP + session_name : + +COUNTERS:oid:session_name_oid:index + SAI_TWAMP_SESSION_STAT_RX_PACKETS : + SAI_TWAMP_SESSION_STAT_RX_BYTE : + SAI_TWAMP_SESSION_STAT_TX_PACKETS : + SAI_TWAMP_SESSION_STAT_TX_BYTE : + SAI_TWAMP_SESSION_STAT_DROP_PACKETS : + SAI_TWAMP_SESSION_STAT_MAX_LATENCY : + SAI_TWAMP_SESSION_STAT_MIN_LATENCY : + SAI_TWAMP_SESSION_STAT_AVG_LATENCY : + SAI_TWAMP_SESSION_STAT_MAX_JITTER : + SAI_TWAMP_SESSION_STAT_MIN_JITTER : + SAI_TWAMP_SESSION_STAT_AVG_JITTER : + SAI_TWAMP_SESSION_STAT_FIRST_TS : + SAI_TWAMP_SESSION_STAT_LAST_TS : + SAI_TWAMP_SESSION_STAT_DURATION_TS : +``` + +### 4.2 Flow Diagrams + +Below diagram shows the flow for hardware solution. + +#### 4.2.1 Query ASIC capability + +Below diagram shows the flow for twamporch queries ASIC capability. + +![query hw capability flow diagram](./images/TWAMP_Light_query_hw_capability.png) + +#### 4.2.2 Create session + +Below diagram shows the flow of creating Session-Sender or Session-Reflector. + +![create session flow diagram](./images/TWAMP_Light_create_session.png) + +#### 4.2.3 Remove session + +Below diagram shows the flow of removing Session-Sender or Session-Reflector. + +![remove session flow diagram](./images/TWAMP_Light_remove_session.png) + +#### 4.2.4 Start/Stop session + +Below diagram shows the flow of starting or stopping Session-Sender. + +![set session state flow diagram](./images/TWAMP_Light_set_session_state.png) + +#### 4.2.5 Notify Session-Sender event + +Below diagram shows the flow of notifying the session running state and measurement data. + +![nofiy session event flow diagram](./images/TWAMP_Light_nofiy_session_event.png) + +## 5 SAI API + +Following SAI API changes are proposed to program the modes to get the ASIC capability: + +**File: saitswitch.h** + + /** + * @brief Attribute Id in sai_set_switch_attribute() and + * sai_get_switch_attribute() calls. + */ + typedef enum _sai_switch_attr_t + { + . + . + /** + * @brief Set Switch TWAMP session state change event notification callback function passed to the adapter. + * + * Use sai_twamp_session_state_change_notification_fn as notification function. + * + * @type sai_pointer_t sai_twamp_session_state_change_notification_fn + * @flags CREATE_AND_SET + * @default NULL + */ + SAI_SWITCH_ATTR_TWAMP_SESSION_STATE_CHANGE_NOTIFY, + + /** + * @brief Max number of Two-Way Active Measurement Protocol session supports + * + * @type sai_uint32_t + * @flags READ_ONLY + */ + SAI_SWITCH_ATTR_MAX_TWAMP_SESSION, + . + . + } sai_switch_attr_t; +Max number of TWAMP Light sessions(SAI_SWITCH_ATTR_MAX_TWAMP_SESSION) are defined in below SAI spec - + +https://github.com/opencomputeproject/SAI/pull/1786 + +**File: saitwamp.h** + + /** + * @brief SAI attributes for Two-Way Active Measurement Protocol session + */ + typedef enum _sai_twamp_session_attr_t + { + . + . + /** + * @brief Two-Way Active Measurement Protocol session role of sender or receiver. 
+ * + * @type sai_twamp_session_role_t + * @flags MANDATORY_ON_CREATE | CREATE_ONLY + */ + SAI_TWAMP_SESSION_ATTR_SESSION_ROLE, + + /** + * @brief UDP Source port + * + * @type sai_uint32_t + * @flags MANDATORY_ON_CREATE | CREATE_ONLY + */ + SAI_TWAMP_SESSION_ATTR_UDP_SRC_PORT, + + /** + * @brief UDP Destination port + * + * @type sai_uint32_t + * @flags MANDATORY_ON_CREATE | CREATE_ONLY + */ + SAI_TWAMP_SESSION_ATTR_UDP_DST_PORT, + + /** + * @brief Local source IP address + * + * @type sai_ip_address_t + * @flags MANDATORY_ON_CREATE | CREATE_ONLY + */ + SAI_TWAMP_SESSION_ATTR_SRC_IP, + + /** + * @brief Remote Destination IP address + * + * @type sai_ip_address_t + * @flags MANDATORY_ON_CREATE | CREATE_ONLY + */ + SAI_TWAMP_SESSION_ATTR_DST_IP, + + /** + * @brief DSCP of Traffic Class + * + * @type sai_uint8_t + * @flags CREATE_ONLY + * @default 0 + */ + SAI_TWAMP_SESSION_ATTR_DSCP, + + /** + * @brief IP header TTL + * + * @type sai_uint8_t + * @flags CREATE_ONLY + * @default 255 + * @validonly SAI_TWAMP_SESSION_ATTR_SESSION_ROLE == SAI_TWAMP_SESSION_ROLE_SENDER + */ + SAI_TWAMP_SESSION_ATTR_TTL, + + /** + * @brief Virtual Private Network virtual router + * + * @type sai_object_id_t + * @flags CREATE_ONLY + * @objects SAI_OBJECT_TYPE_VIRTUAL_ROUTER + * @allownull true + * @default SAI_NULL_OBJECT_ID + * @validonly SAI_TWAMP_SESSION_ATTR_HW_LOOKUP_VALID == true + */ + SAI_TWAMP_SESSION_ATTR_VPN_VIRTUAL_ROUTER, + + /** + * @brief Encapsulation type + * + * @type sai_twamp_encapsulation_type_t + * @flags CREATE_ONLY + */ + SAI_TWAMP_SESSION_ATTR_TWAMP_ENCAPSULATION_TYPE, + + /** + * @brief To enable Two-Way Active Measurement Protocol session transmit packet + * + * @type bool + * @flags CREATE_AND_SET + * @default false + * @validonly SAI_TWAMP_SESSION_ATTR_SESSION_ROLE == SAI_TWAMP_SESSION_ROLE_SENDER + */ + SAI_TWAMP_SESSION_ATTR_SESSION_ENABLE_TRANSMIT, + + /** + * @brief Hardware lookup valid + * + * @type bool + * @flags CREATE_ONLY + * @default true + */ + SAI_TWAMP_SESSION_ATTR_HW_LOOKUP_VALID, + + /** + * @brief Two-Way Active Measurement Protocol test packet tx interval + * + * @type sai_uint32_t + * @flags CREATE_ONLY + * @condition SAI_TWAMP_SESSION_ATTR_SESSION_ROLE == SAI_TWAMP_SESSION_ROLE_SENDER + */ + SAI_TWAMP_SESSION_ATTR_TX_INTERVAL, + + /** + * @brief Two-Way Active Measurement Protocol packet tx mode: CONTINUOUS, PACKET_COUNT + * + * Valid when SAI_TWAMP_SESSION_ATTR_SESSION_ROLE == SAI_TWAMP_SESSION_ROLE_SENDER + * + * @type sai_twamp_pkt_tx_mode_t + * @flags MANDATORY_ON_CREATE | CREATE_ONLY + */ + SAI_TWAMP_SESSION_ATTR_TWAMP_PKT_TX_MODE, + + /** + * @brief Two-Way Active Measurement Protocol test packet tx count, configuring by Two-Way Active Measurement Protocol send packet count of Tx + * + * @type sai_uint32_t + * @flags MANDATORY_ON_CREATE | CREATE_ONLY + * @condition SAI_TWAMP_SESSION_ATTR_SESSION_ROLE == SAI_TWAMP_SESSION_ROLE_SENDER and SAI_TWAMP_SESSION_ATTR_TWAMP_PKT_TX_MODE == SAI_TWAMP_PKT_TX_MODE_PACKET_NUM + */ + SAI_TWAMP_SESSION_ATTR_TX_PKT_CNT, + + /** + * @brief Two-Way Active Measurement Protocol test packet tx period + * if tx period equal 0, sender will continue to generate packet and send them. 
+     *
+     * @type sai_uint32_t
+     * @flags MANDATORY_ON_CREATE | CREATE_ONLY
+     * @condition SAI_TWAMP_SESSION_ATTR_SESSION_ROLE == SAI_TWAMP_SESSION_ROLE_SENDER and SAI_TWAMP_SESSION_ATTR_TWAMP_PKT_TX_MODE == SAI_TWAMP_PKT_TX_MODE_PERIOD
+     */
+    SAI_TWAMP_SESSION_ATTR_TX_PKT_PERIOD,
+
+    /**
+     * @brief Two-Way Active Measurement Protocol mode: light mode and full mode
+     *
+     * @type sai_twamp_mode_t
+     * @flags MANDATORY_ON_CREATE | CREATE_ONLY
+     */
+    SAI_TWAMP_SESSION_ATTR_TWAMP_MODE,
+
+    /**
+     * @brief The format of timestamp in test packet.
+     *
+     * @type sai_twamp_timestamp_format_t
+     * @flags CREATE_ONLY
+     * @default sai_twamp_timestamp_format_t
+     */
+    SAI_TWAMP_SESSION_ATTR_TIMESTAMP_FORMAT,
+    .
+    .
+    } sai_twamp_session_attr_t;
+
+    /**
+     * @brief Two-Way Active Measurement Protocol Session counter IDs in sai_get_twamp_session_stats() call
+     */
+    typedef enum _sai_twamp_session_stats_t
+    {
+        /** Rx packet stat count */
+        SAI_TWAMP_SESSION_STATS_RX_PACKETS,
+
+        /** Rx byte stat count */
+        SAI_TWAMP_SESSION_STATS_RX_BYTE,
+
+        /** Tx packet stat count */
+        SAI_TWAMP_SESSION_STATS_TX_PACKETS,
+
+        /** Tx byte stat count */
+        SAI_TWAMP_SESSION_STATS_TX_BYTE,
+
+        /** Packet drop stat count */
+        SAI_TWAMP_SESSION_STATS_DROP_PACKETS,
+
+        /** Packet max latency */
+        SAI_TWAMP_SESSION_STATS_MAX_LATENCY,
+
+        /** Packet min latency */
+        SAI_TWAMP_SESSION_STATS_MIN_LATENCY,
+
+        /** Packet avg latency */
+        SAI_TWAMP_SESSION_STATS_AVG_LATENCY,
+
+        /** Packet max jitter */
+        SAI_TWAMP_SESSION_STATS_MAX_JITTER,
+
+        /** Packet min jitter */
+        SAI_TWAMP_SESSION_STATS_MIN_JITTER,
+
+        /** Packet avg jitter */
+        SAI_TWAMP_SESSION_STATS_AVG_JITTER,
+
+        /** Session first timestamp */
+        SAI_TWAMP_SESSION_STATS_FIRST_TS,
+
+        /** Session last timestamp */
+        SAI_TWAMP_SESSION_STATS_LAST_TS,
+    } sai_twamp_session_stats_t;
+
+    /**
+     * @brief Two-Way Active Measurement Protocol method table retrieved with sai_api_query()
+     */
+    typedef struct _sai_twamp_api_t
+    {
+        sai_create_twamp_session_fn            create_twamp_session;
+        sai_remove_twamp_session_fn            remove_twamp_session;
+        sai_set_twamp_session_attribute_fn     set_twamp_session_attribute;
+        sai_get_twamp_session_attribute_fn     get_twamp_session_attribute;
+        sai_get_twamp_session_stats_fn         get_twamp_session_stats;
+        sai_get_twamp_session_stats_ext_fn     get_twamp_session_stats_ext;
+        sai_clear_twamp_session_stats_fn       clear_twamp_session_stats;
+
+    } sai_twamp_api_t;
+
+
+The TWAMP Light SAI interface APIs are already defined and available at the location below:
+
+https://github.com/opencomputeproject/SAI/pull/1786
+
+## 6 Configuration and management
+
+The following diagrams introduce the CLI parameters for the TWAMP Light Session-Sender configuration.
+
+![Packet-Count](./images/TWAMP_Light_packet_count.png)
+
+![Continuous with monitor](./images/TWAMP_Light_continuous_monitor.png)
+
+![Continuous](./images/TWAMP_Light_continuous.png)
+
+### 6.1 Manifest (if the feature is an Application Extension)
+
+N/A
+
+### 6.2 CLI
+
+#### 6.2.1 Configuration commands
+
+New sets of configuration commands are introduced to configure TWAMP Light.
+
+##### 6.2.1.1 Session-Sender with packet-count mode
+
+This command is used to create a Session-Sender in packet-count mode.
+
+```
+Format:
+  config twamp-light session-sender add packet-count <session_name> [vrf <vrf_name>] <sender_ip_port> <reflector_ip_port> <packet_count> <tx_interval> <timeout> <statistics_interval>
+
+Arguments:
+  session_name: sender session name. e.g: s1, test_ip1_ip2
+  vrf_name: session vrf name. e.g: vrf1
+  sender_ip_port: sender ip and udp port. e.g: 10.1.1.2:20000
+  reflector_ip_port: reflector ip and udp port. e.g: 10.1.1.2:20001
+  packet_count: number of Test-request packets the sender transmits. e.g: 100
+  tx_interval: interval at which the sender transmits Test-request packets, in msecs. e.g: 10
+  timeout: timeout for the sender to receive a Test-response packet, in secs. e.g: 5
+  statistics_interval: interval at which the sender calculates measurement statistics, in msecs. e.g: 6000
+
+Example:
+  config twamp-light session-sender add packet-count s1 10.1.1.2:20000 20.2.2.2:20001 100 10 5 6000
+```
+
+##### 6.2.1.2 Session-Sender with continuous mode
+
+This command is used to create a Session-Sender in continuous mode.
+
+```
+Format:
+  config twamp-light session-sender add continuous <session_name> [vrf <vrf_name>] <sender_ip_port> <reflector_ip_port> <monitor_time> <tx_interval> <timeout> <statistics_interval>
+
+Arguments:
+  session_name: sender session name. e.g: s1, test_ip1_ip2
+  vrf_name: session vrf name. e.g: vrf1
+  sender_ip_port: sender ip and udp port. e.g: 10.1.1.2:20000
+  reflector_ip_port: reflector ip and udp port. e.g: 10.1.1.2:20001
+  monitor_time: time for which the sender transmits Test-request packets, in secs. e.g: 10
+  tx_interval: interval at which the sender transmits Test-request packets, in msecs. e.g: 100
+  timeout: timeout for the sender to receive a Test-response packet, in secs. e.g: 5
+  statistics_interval: interval at which the sender calculates measurement statistics, in msecs. e.g: 15000
+
+Example:
+  config twamp-light session-sender add continuous s1 10.1.1.2:20000 192.168.3.2:20001 10 100 5 15000
+```
+
+##### 6.2.1.3 Start TWAMP Light Session-Sender
+
+This command is used to start the Session-Sender.
+
+```
+Format:
+  config twamp-light session-sender start <session_name>|all
+
+Arguments:
+  session_name: session name. e.g: s1
+  all: all sessions
+
+Example:
+  config twamp-light session-sender start s1
+  config twamp-light session-sender start all
+```
+
+##### 6.2.1.4 Stop TWAMP Light Session-Sender
+
+This command is used to stop the Session-Sender.
+
+```
+Format:
+  config twamp-light session-sender stop <session_name>|all
+
+Arguments:
+  session_name: session name. e.g: s1
+  all: all sessions
+
+Example:
+  config twamp-light session-sender stop s1
+  config twamp-light session-sender stop all
+```
+
+##### 6.2.1.5 Session-Reflector
+
+This command is used to create the Session-Reflector.
+
+```
+Format:
+  config twamp-light session-reflector add <session_name> [vrf <vrf_name>] <sender_ip_port> <reflector_ip_port>
+
+Arguments:
+  session_name: reflector session name. e.g: r1, test_ip1_ip2
+  vrf_name: session vrf name. e.g: vrf1
+  sender_ip_port: sender ip and udp port. e.g: 10.1.1.2:20000
+  reflector_ip_port: reflector ip and udp port. e.g: 10.1.1.2:20001
+
+Example:
+  config twamp-light session-reflector add r1 10.1.1.2:20000 20.2.2.2:20001
+```
+
+##### 6.2.1.6 Remove TWAMP Light session
+
+This command is used to remove the session.
+
+```
+Format:
+  config twamp-light remove <session_name>|all
+
+Arguments:
+  session_name: session name. e.g: s1
+  all: all sessions
+
+Example:
+  config twamp-light remove s1
+  config twamp-light remove all
+```
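+
+To tie the configuration commands together, the following is a hypothetical end-to-end walkthrough. It assumes two directly connected devices and reuses the illustrative addresses from the examples above; the show commands used at the end are described in the next section.
+
+```
+# On the reflector device: create and leave the reflector running
+config twamp-light session-reflector add r1 10.1.1.2:20000 20.2.2.2:20001
+
+# On the sender device: 100 Test-request packets, 10 msec apart,
+# 5 sec response timeout, statistics calculated every 6000 msec
+config twamp-light session-sender add packet-count s1 10.1.1.2:20000 20.2.2.2:20001 100 10 5 6000
+config twamp-light session-sender start s1
+
+# Inspect status and results, then clean up
+show twamp-light session
+show twamp-light statistics twoway-latency
+show twamp-light statistics twoway-loss
+config twamp-light remove s1
+```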
+#### 6.2.2 Show commands
+
+New sets of show commands are introduced to display the measurement results.
+
+##### 6.2.2.1 Show TWAMP Light session status
+
+This command is used to display the Session-Sender and Session-Reflector status.
+
+```
+show twamp-light session
+TWAMP Light Sender Sessions
+Name    Status    Sender IP:PORT    Reflector IP:PORT    Packet Count    Monitor Time    Tx Interval    Stats Interval    Timeout    Last Start Time      Last Stop Time
+------  --------  ----------------  -------------------  --------------  --------------  -------------  ----------------  ---------  -------------------  -------------------
+sdp34   inactive  30.3.3.2:20000    40.4.4.2:20001       100             -               100            6000              6          2023-02-23 15:23:09  2023-02-23 15:23:10
+
+TWAMP Light Reflector Sessions
+Name    Status    Sender IP:PORT    Reflector IP:PORT
+------  --------  ----------------  -------------------
+
+```
+
+##### 6.2.2.2 Show TWAMP Light latency and jitter
+
+This command is used to display the latency and jitter.
+
+```
+show twamp-light statistics twoway-latency
+Latest two-way latency statistics(nsec):
+Name     Index   Latency(AVG)   Jitter(AVG)
+------  -------  -------------  -------------
+sdp34      1     20906          134217
+
+Total two-way latency statistics(nsec):
+Name    Latency(AVG)    Latency(MIN)    Latency(MAX)    Jitter(AVG)    Jitter(MIN)    Jitter(MAX)
+------  --------------  --------------  --------------  -------------  -------------  -------------
+sdp34   20906           5               6               134217         3489660        3489660
+```
+
+##### 6.2.2.3 Show TWAMP Light packet loss
+
+This command is used to display the packet loss.
+
+```
+show twamp-light statistics twoway-loss
+Latest two-way loss statistics:
+Name     Index   Loss Count   Loss Ratio
+------  -------  -----------  -----------
+sdp34      1     0            0
+
+Total two-way loss statistics:
+Name    Loss Count(AVG)    Loss Count(MIN)    Loss Count(MAX)    Loss Ratio(AVG)    Loss Ratio(MIN)    Loss Ratio(MAX)
+------  -----------------  -----------------  -----------------  -----------------  -----------------  -----------------
+sdp34   0                  0                  0                  0                  0                  0
+```
+
+### 6.3 YANG
+
+    module sonic-twamp {
+
+        yang-version 1.1;
+
+        namespace "http://github.com/Azure/sonic-twamp";
+        prefix stwamp;
+
+        import ietf-inet-types {
+            prefix inet;
+        }
+
+        import sonic-vrf {
+            prefix vrf;
+        }
+
+        description
+            "SONiC twamp yang model";
+
+        revision 2023-06-11 {
+            description
+                "Initial revision.";
+        }
+
+        typedef timestamp_format {
+            type enumeration {
+                enum ntp {
+                    description "NTP 64 bit format of a timestamp";
+                }
+                enum ptp {
+                    description "PTPv2 truncated format of a timestamp";
+                }
+            }
+            description "timestamp format used by Session-Sender or Session-Reflector.";
+        }
+
+        feature session_sender {
+            description "This feature relates to the device functions as the TWAMP Session-Sender";
+        }
+
+        feature session_reflector {
+            description "This feature relates to the device functions as the TWAMP Session-Reflector";
+        }
+
+        grouping session_parameters {
+            description "TWAMP session parameters";
+            leaf sender_ip {
+                type inet:ip-address;
+                mandatory true;
+                description "Sender IP address";
+            }
+            leaf sender_udp_port {
+                type inet:port-number {
+                    range "862 | 863 | 1025..65535";
+                }
+                default 862;
+                description "Sender UDP port number";
+            }
+            leaf reflector_ip {
+                type inet:ip-address;
+                mandatory true;
+                description "Reflector IP address";
+            }
+            leaf reflector_udp_port {
+                type inet:port-number {
+                    range "862 | 863 | 1025..65535";
+                }
+                default 863;
+                description "Reflector UDP port number";
+            }
+            leaf vrf_name {
+                type union {
+                    type string {
+                        pattern "default";
+                    }
+                    type leafref {
+                        path "/vrf:sonic-vrf/vrf:VRF/vrf:VRF_LIST/vrf:name";
+                    }
+                }
+                description "VRF name";
+            }
+        }
+
+        container sonic-twamp {
+            description "Top level container for TWAMP configuration";
+
+            container TWAMP_SESSION_SENDER {
+                if-feature session_sender;
+                description "TWAMP Session-Sender container";
+
+                list test_session {
+                    key "name";
+                    unique "sender_ip sender_udp_port reflector_ip reflector_udp_port vrf_name";
+                    description
+                        "This structure is a container of test session managed objects";
+                    leaf name {
+                        type string;
+                        description "A unique name for this TWAMP-Test session to be used
+                                     for identifying this test session by the
+                                     Session-Sender logical entity.";
+                    }
+                    uses session_parameters;
+                    leaf test_session_enable {
+                        type boolean;
+                        default "true";
+                        description "Whether this TWAMP Test session is enabled";
+                    }
+                    leaf dscp {
+                        type inet:dscp;
+                        default 0;
+                        description
+                            "DSCP value to be set in the test packet.";
+                    }
+                    leaf ttl {
+                        type inet:ttl;
+                        default 0;
+                        description
+                            "TTL value to be set in the test packet.";
+                    }
+                    leaf packet_timestamp_format {
+                        type timestamp_format;
+                        default ntp;
+                        description "Sender Timestamp format";
+                    }
+                    leaf packet_padding_size {
+                        type uint16;
+                        default 108;
+                        description
+                            "Size of the Packet Padding. Suggested to run Path MTU
+                             Discovery to avoid packet fragmentation in IPv4 and packet
+                             blackholing in IPv6";
+                    }
+                    leaf packet_count {
+                        type uint32;
+                        default 100;
+                        description
+                            "This value determines if the TWAMP-Test session is
+                             bound by number of test packets or not.";
+                    }
+                    leaf tx_interval {
+                        type union {
+                            type uint32 {
+                                range "10 | 100 | 1000";
+                            }
+                        }
+                        units milliseconds;
+                        default 100;
+                    }
+                    leaf statistics_interval {
+                        type uint32;
+                        units milliseconds;
+                        description
+                            "Interval to calculate performance measurement data";
+                    }
+                    leaf monitor_time {
+                        type uint32;
+                        units seconds;
+                        default 0;
+                        description
+                            "The value 0 indicates that the test session SHALL run *forever*";
+                    }
+                    leaf timeout {
+                        type uint32 {
+                            range "1..10";
+                        }
+                        units "seconds";
+                        default 5;
+                        description
+                            "The timeout value for the Session-Sender to
+                             collect outstanding reflected packets.";
+                    }
+                }
+            }
+
+            container TWAMP_SESSION_REFLECTOR {
+                if-feature session_reflector;
+                description "TWAMP Session-Reflector container";
+                list test_session {
+                    key "name";
+                    unique "sender_ip sender_udp_port reflector_ip reflector_udp_port vrf_name";
+                    description
+                        "This structure is a container of test session
+                         managed objects";
+                    leaf name {
+                        type string;
+                        description "A unique name for this TWAMP-Test session to be used
+                                     for identifying this test session by the
+                                     Session-Reflector logical entity.";
+                    }
+                    uses session_parameters;
+                }
+            }
+        }
+    }
+
+## 7 Warmboot and Fastboot Design Impact
+
+TBD
+
+## 8 Restrictions/Limitations
+
+N/A
+
+## 9 Testing Requirements/Design
+
+### 9.1 Unit Test cases
+
+#### 9.1.1 CLI
+
+1) Verify CLI to create TWAMP Light Session-Sender for packet-count measurement
+2) Verify CLI to create TWAMP Light Session-Sender for continuous measurement
+3) Verify CLI to create TWAMP Light Session-Reflector
+4) Verify CLI to start TWAMP Light Session-Sender
+5) Verify CLI to stop TWAMP Light Session-Sender
+6) Verify CLI to show TWAMP Light session status
+7) Verify CLI to show TWAMP Light Session-Sender statistics
+
+#### 9.1.2 Functionality
+
+1) Verify the TWAMP-Test packet format is correct
+2) Verify TWAMP Light sender performs packet-count measurement
+3) Verify TWAMP Light sender performs continuous measurement
+4) Verify TWAMP Light sender collects and saves measurement statistics
+5) Verify TWAMP Light reflector receives test packets and replies to the sender
+
+#### 9.1.3 Scaling
+
+1) Verify running the maximum number of TWAMP Light sessions
+
+#### 9.1.4 SAI
+
+1) Verify creating TWAMP Light session in SAI
+2) 
Verify setting TWAMP Light session state in SAI +3) Verify removing TWAMP Light session in SAI +4) Verify getting TWAMP Light session in SAI +5) Verify getting TWAMP Light max num session in SAI + +## References + +Reference for proposed algorithm: + + [RFC5357](https://www.rfc-editor.org/info/rfc5357) + + + diff --git a/doc/TWAMP/images/TWAMP_Light_architecture.png b/doc/TWAMP/images/TWAMP_Light_architecture.png new file mode 100644 index 00000000000..d472a50db75 Binary files /dev/null and b/doc/TWAMP/images/TWAMP_Light_architecture.png differ diff --git a/doc/TWAMP/images/TWAMP_Light_continuous.png b/doc/TWAMP/images/TWAMP_Light_continuous.png new file mode 100644 index 00000000000..5dfda0833d5 Binary files /dev/null and b/doc/TWAMP/images/TWAMP_Light_continuous.png differ diff --git a/doc/TWAMP/images/TWAMP_Light_continuous_monitor.png b/doc/TWAMP/images/TWAMP_Light_continuous_monitor.png new file mode 100644 index 00000000000..37e2e146e91 Binary files /dev/null and b/doc/TWAMP/images/TWAMP_Light_continuous_monitor.png differ diff --git a/doc/TWAMP/images/TWAMP_Light_create_session.png b/doc/TWAMP/images/TWAMP_Light_create_session.png new file mode 100644 index 00000000000..1f53720d4d8 Binary files /dev/null and b/doc/TWAMP/images/TWAMP_Light_create_session.png differ diff --git a/doc/TWAMP/images/TWAMP_Light_nofiy_session_event.png b/doc/TWAMP/images/TWAMP_Light_nofiy_session_event.png new file mode 100644 index 00000000000..a1b351a15a0 Binary files /dev/null and b/doc/TWAMP/images/TWAMP_Light_nofiy_session_event.png differ diff --git a/doc/TWAMP/images/TWAMP_Light_packet_count.png b/doc/TWAMP/images/TWAMP_Light_packet_count.png new file mode 100644 index 00000000000..8f0ff725937 Binary files /dev/null and b/doc/TWAMP/images/TWAMP_Light_packet_count.png differ diff --git a/doc/TWAMP/images/TWAMP_Light_query_hw_capability.png b/doc/TWAMP/images/TWAMP_Light_query_hw_capability.png new file mode 100644 index 00000000000..03fde3a1d4e Binary files /dev/null and b/doc/TWAMP/images/TWAMP_Light_query_hw_capability.png differ diff --git a/doc/TWAMP/images/TWAMP_Light_remove_session.png b/doc/TWAMP/images/TWAMP_Light_remove_session.png new file mode 100644 index 00000000000..52dba2759e7 Binary files /dev/null and b/doc/TWAMP/images/TWAMP_Light_remove_session.png differ diff --git a/doc/TWAMP/images/TWAMP_Light_role.png b/doc/TWAMP/images/TWAMP_Light_role.png new file mode 100644 index 00000000000..8a0f9e729e7 Binary files /dev/null and b/doc/TWAMP/images/TWAMP_Light_role.png differ diff --git a/doc/TWAMP/images/TWAMP_Light_set_session_state.png b/doc/TWAMP/images/TWAMP_Light_set_session_state.png new file mode 100644 index 00000000000..c16d427f40d Binary files /dev/null and b/doc/TWAMP/images/TWAMP_Light_set_session_state.png differ diff --git a/doc/acl/ACL-Table-Type-HLD.md b/doc/acl/ACL-Table-Type-HLD.md index 09f74509504..a4153a4cb64 100644 --- a/doc/acl/ACL-Table-Type-HLD.md +++ b/doc/acl/ACL-Table-Type-HLD.md @@ -99,7 +99,7 @@ match = 1*64VCHAR match-list = [1-max-matches]*match action = 1*64VCHAR action-list = [1-max-actions]*action -bind-point = port/lag +bind-point = port/portchannel bind-points-list = [1-max-bind-points]*bind-point ``` @@ -119,7 +119,7 @@ Example: ], "BIND_POINTS": [ "PORT", - "LAG" + "PORTCHANNEL" ] } }, @@ -168,7 +168,7 @@ container ACL_TABLE_TYPE { mandatory true; type enumeration { enum PORT; - enum LAG; + enum PORTCHANNEL; } } } diff --git a/doc/banner/banner_hld.md b/doc/banner/banner_hld.md new file mode 100644 index 00000000000..82163d948ef --- /dev/null +++ 
b/doc/banner/banner_hld.md
@@ -0,0 +1,278 @@
+# Banner messages HLD #
+
+## Table of contents
+- [Revision](#revision)
+- [About this manual](#about-this-manual)
+- [Scope](#scope)
+- [1 Introduction](#1-introduction)
+  - [1.1 Feature overview](#11-feature-overview)
+  - [1.2 Requirements](#12-requirements)
+- [2 Design](#2-design)
+  - [2.1 Overview](#21-overview)
+  - [2.2 Flows](#22-flows)
+  - [2.3 CLI](#23-cli)
+    - [2.3.1 Command structure](#231-command-structure)
+    - [2.3.2 Config command group](#232-config-command-group)
+    - [2.3.3 Show command](#233-show-command)
+  - [2.4 YANG model](#24-yang-model)
+- [3 Test plan](#3-test-plan)
+  - [3.1 Unit tests](#31-unit-tests)
+
+
+# Revision
+| Rev | Date       | Author              | Description     |
+|:---:|:----------:|:-------------------:|:----------------|
+| 0.1 | 01/02/2023 | Sviatoslav Boichuk  | Initial version |
+
+
+# About this manual
+This document provides high-level information for the SONiC Banner feature and Banner CLI. It describes the high-level behavior, internal definition, design of the commands, syntax, and output definition.
+
+# Scope
+The scope of this document is to cover the definition, design and implementation of the SONiC Banner feature and Banner CLI.
+The document covers the following CLI:
+1. Commands to configure banner messages
+2. Command to display banner settings
+
+
+## Abbreviations
+| Term  | Meaning                                    |
+|:------|:-------------------------------------------|
+| SONiC | Software for Open Networking in the Cloud  |
+| MOTD  | Message of the day                         |
+| DB    | Database                                   |
+| CLI   | Command-line Interface                     |
+| YANG  | Yet Another Next Generation                |
+
+
+## List of figures
+- [Figure 1: Banner system chart diagram](#figure-1-banner-system-chart-diagram)
+- [Figure 2: Banner init flow](#figure-2-banner-init-flow)
+- [Figure 3: Banner config flow](#figure-3-banner-config-flow)
+- [Figure 4: Banner show configuration](#figure-4-banner-show-configuration)
+
+# 1 Introduction
+
+## 1.1 Feature overview
+SONiC maintains several messages used for communication with users. These messages are associated with the login and logout processes.
+
+There are a few banner message types:
+| Type   | Description                                                                          |
+| :----: | :----------------------------------------------------------------------------------: |
+| login  | Display a banner to the users connected locally or remotely before the login prompt  |
+| motd   | Display a banner to the users after the login prompt                                 |
+| logout | Display a logout banner to the users connected locally or remotely                   |
+
+
+## 1.2 Requirements
+
+**This feature will support the following functionality:**
+1. Show Banner configuration
+2. Configure Banner
+   1. Feature state
+   2. Login message
+   3. Logout message
+   4. MOTD
+
+# 2 Design
+
+## 2.1 Overview
+Here is the representation of a SONiC platform using the Banner feature:
+
+![Banner system chart diagram](images/banner_system_chart_diagram.png "Figure 1: Banner system chart diagram")
+
+###### Figure 1: Banner system chart diagram
+
+This feature requires access to the SONiC DB. All messages (MOTD, login and logout) are saved in the SONiC config database. Hostcfgd listens for configuration changes in the corresponding tables and restarts the banner-config service. The banner-config service is a simple systemd service which runs before an SSH connection can be established. It reads the configured messages from the database and applies them to Linux.
+
+**The following Linux files are used:**
+1. /etc/issue.net and /etc/issue - Login message
+2. /etc/motd - Message of the day
+3. /etc/logout_message - Logout message
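+
+As a rough illustration of what the banner-config service does with these files, consider the minimal sketch below. The CONFIG_DB table and field names follow the YANG model in section 2.4; the script itself is hypothetical, not the actual service implementation.
+
+```bash
+# Read the configured MOTD from CONFIG_DB (database 4) and apply it to Linux
+motd=$(redis-cli -n 4 HGET "BANNER_MESSAGE|MESSAGE" motd)
+[ -n "$motd" ] && printf '%b\n' "$motd" > /etc/motd
+```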
+
+## 2.2 Flows
+
+### 2.2.1 Banner init flow
+![Banner init flow](images/banner_init_diagram.png "Figure 2: Banner init flow")
+
+###### Figure 2: Banner init flow
+
+
+### 2.2.2 Banner config flow
+![Banner config flow](images/banner_config_diagram.png "Figure 3: Banner config flow")
+
+###### Figure 3: Banner config flow
+
+
+### 2.2.3 Banner show configuration
+![Banner show configuration](images/banner_show_diagram.png "Figure 4: Banner show configuration")
+
+###### Figure 4: Banner show configuration
+
+The default Banner feature state is disabled, which means the current (default) SONiC OS banner messages are not changed. While the feature state is disabled, the user can still use the provided CLI to configure banner messages; the changes are only applied to the Config DB table.
+Only when the feature state is enabled are the configured banner messages from Config DB applied to Linux.
+
+## 2.3 CLI
+
+
+### 2.3.1 Command structure
+
+**User interface**:
+```
+config
+\-- banner
+    |-- state
+    |-- login
+    |-- logout
+    |-- motd
+
+show
+\-- banner
+```
+
+**Options:**
+
+General:
+1. `<message>` - Message to be configured: `string`
+
+**Multiline string support:**
+The Banner feature supports multiline string messages.
+Example: config banner login "Hello!\nWelcome to SONiC CLI!"
+Banner output: "
+Hello!
+Welcome to SONiC CLI!
+"
+
+
+#### 2.3.2 Config command group
+**The following command sets the banner feature state:**
+```bash
+config banner state <enabled|disabled>
+```
+
+**The following command sets the login banner message:**
+```bash
+config banner login <message>
+```
+
+**The following command sets the logout banner message:**
+```bash
+config banner logout <message>
+```
+
+**The following command sets the message of the day (MOTD):**
+```bash
+config banner motd <message>
+```
+
+#### 2.3.3 Show command
+**The following command displays the banner configuration:**
+```bash
+root@sonic:/home/admin# show banner
+state    login    motd                                              logout
+-------  -------  ------------------------------------------------  --------
+enabled  Login    You are on
+         Message   ____   ___  _   _ _  ____
+                  / ___| / _ \| \ | (_)/ ___|
+                  \___ \| | | |  \| | | |
+                   ___) | |_| | |\  | | |___
+                  |____/ \___/|_| \_|_|\____|
+
+                  -- Software for Open Networking in the Cloud --
+
+                  Unauthorized access and/or use are prohibited.
+                  All access and/or use are subject to monitoring.
+
+                  Help: https://sonic-net.github.io/SONiC/
+```
+
+## 2.4 YANG model
+A new YANG model `sonic-banner.yang` will be added to provide support for configuring Banner messages.
+
+**Skeleton code:**
+```
+module sonic-banner {
+
+    yang-version 1.1;
+
+    namespace "http://github.com/sonic-net/sonic-banner";
+    prefix banner_message;
+
+    description "BANNER_MESSAGE YANG Module for SONiC-based OS";
+
+    revision 2023-05-18 {
+        description "First Revision";
+    }
+
+    container sonic-banner {
+
+        container BANNER_MESSAGE {
+
+            description "BANNER_MESSAGE part of config_db.json";
+
+            container MESSAGE {
+                leaf state {
+                    type string {
+                        pattern "enabled|disabled";
+                    }
+                    description "Banner feature state";
+                    default disabled;
+                }
+
+                leaf login {
+                    type string;
+                    description "Banner message displayed to user before login prompt";
+                    default "Debian GNU/Linux 11";
+                }
+
+                leaf motd {
+                    type string;
+                    description "Banner message displayed to user after login prompt";
+                    default "You are on
+                       ____   ___  _   _ _  ____
+                      / ___| / _ \| \ | (_)/ ___|
+                      \___ \| | | |  \| | | |
+                       ___) | |_| | |\  | | |___
+                      |____/ \___/|_| \_|_|\____|
+
+                      -- Software for Open Networking in the Cloud --
+
+                      Unauthorized access and/or use are prohibited.
+                      All access and/or use are subject to monitoring.
+
+                      Help: https://sonic-net.github.io/SONiC/
+                      ";
+                }
+
+                leaf logout {
+                    type string;
+                    description "Banner message displayed to the users on logout";
+                    default "";
+                }
+            } /* end of container MESSAGE */
+        }
+        /* end of container BANNER_MESSAGE */
+    }
+    /* end of top level container */
+}
+/* end of module sonic-banner */
+```
+
+# 3 Test plan
+
+## 3.1 Unit tests
+
+1. Configure login banner message
+   1. Logout from the system. Login again - expected to see the configured message before the login prompt.
+   2. Do not save the configuration and reboot the device, then login to the system - expected to see the default message before the login prompt.
+   3. Save the configuration and reboot the device - expected to see the configured message before the login prompt.
+2. Configure message of the day
+   1. Logout from the system. Login again - expected to see the configured message after the login prompt.
+   2. Do not save the configuration and reboot the device, then login to the system - expected to see the default message after the login prompt.
+   3. Save the configuration and reboot the device - expected to see the configured message after the login prompt.
+3. Configure logout banner message
+   1. Logout from the system - expected to see the configured logout message.
+   2. Do not save the configuration and reboot the device. Logout from the system after the reboot - expected to see the default logout message.
+   3. Save the configuration and reboot the device. Logout from the system after the reboot - expected to see the configured logout message.
diff --git a/doc/banner/images/banner_config_diagram.png b/doc/banner/images/banner_config_diagram.png
new file mode 100644
index 00000000000..2b877e3d776
Binary files /dev/null and b/doc/banner/images/banner_config_diagram.png differ
diff --git a/doc/banner/images/banner_init_diagram.png b/doc/banner/images/banner_init_diagram.png
new file mode 100644
index 00000000000..9ad79640eb0
Binary files /dev/null and b/doc/banner/images/banner_init_diagram.png differ
diff --git a/doc/banner/images/banner_show_diagram.png b/doc/banner/images/banner_show_diagram.png
new file mode 100644
index 00000000000..98232d6de0a
Binary files /dev/null and b/doc/banner/images/banner_show_diagram.png differ
diff --git a/doc/banner/images/banner_system_chart_diagram.png b/doc/banner/images/banner_system_chart_diagram.png
new file mode 100644
index 00000000000..bef903846ea
Binary files /dev/null and b/doc/banner/images/banner_system_chart_diagram.png differ
diff --git a/doc/cmis-module-enhancement/cmis-module-enhancement.md b/doc/cmis-module-enhancement/cmis-module-enhancement.md
new file mode 100644
index 00000000000..f7005513b09
--- /dev/null
+++ b/doc/cmis-module-enhancement/cmis-module-enhancement.md
@@ -0,0 +1,224 @@
+# Enhancement of CMIS module management
+
+## Table of Content
+
+## 1. Revision
+
+
+ | Rev |    Date     |      Author        |          Change Description       |
+ |:---:|:-----------:|:------------------:|-----------------------------------|
+ | 0.1 |  July 2023  |      Noa Or        | Initial version                   |
+
+## 2. Scope
+
+This section describes an enhancement of the synchronization between ASIC port and module configuration.
+
+## 3. Definitions/Abbreviations
+
+N/A
+
+## 4. Overview
+
+Configuration of the ASIC side (port SERDES) is handled by the SWSS docker (Ports Orch Agent) interacting with the vendor SAI via SAI calls, whereas the modules supporting the CMIS protocol are configured by the PMON docker (Xcvrd daemon).
+
+These initialization processes should be synchronized: per CMIS 5.2, the configuration of a CMIS module should start only after the ASIC is initialized and has started sending the high-speed signal toward the module.
+
+Currently, SONiC uses the "host_tx_ready" flag in the PORT table in STATE DB for synchronization. This flag is set by Ports OA right after the SAI API for setting the Admin status to UP returns with OK/Success status. PMON registers for changes of this flag in Redis DB and starts the CMIS initialization for a particular module when this flag is set.
+
+The current design has some synchronization gaps between ASIC and module configuration.
+The purpose of this document is to introduce an enhancement that addresses these gaps in a backward-compatible way.
+
+
+## 5. Requirements
+
+* SONiC shall have backward compatibility for platforms that don't support the proposed enhancement.
+
+* Vendor SDK/FW shall start sending the high-speed signal to a module only when the admin status is up and it has an indication that a module was plugged in.
+
+* Vendor SDK/FW shall support an asynchronous notification of start/stop of sending the high-speed signal from ASIC to module.
+
+* PortsOrch shall set host_tx_ready in STATE DB only when it has received a notification that the high-speed signal is being sent.
+
+## 6. High-Level Design
+
+Current Flow:
+

+Figure 1. Current Flow +

+
+### 6.1. Problem Statement
+With the move to SW-based management of CMIS modules, where the ASIC side is configured by the vendor SDK and the module side by SONiC, some problems have been identified with the current approach.
+
+#### 6.1.1. Host TX Ready signal
+As mentioned earlier, SONiC currently assumes that as soon as the Admin Status is set to UP and the corresponding SAI call returns with OK/SUCCESS status, the "host_tx_ready" flag can be set in the STATE DB to trigger the CMIS State Machine for the specific port.
+But that is not always true. The ASIC port initialization process takes some time, and this time can increase with the move to new transceiver technologies (e.g. BiDi).
+So, in some cases the module initialization can be triggered too early, before the high-speed signal has started to be transmitted by the ASIC to a module.
+
+#### 6.1.2. Unterminated Transmission
+With the move to SW-based module management, the module presence is handled by SONiC, and FW might be unaware of the module presence status.
+In this case, when the Admin status of a port is set to UP, FW can start transmitting the high-speed signal even without a plugged-in module. Such unterminated transmission can cause cross-talk to adjacent ports, high EMI and high BER, and can eventually shorten the transceiver lifetime, so it is recommended that the ASIC does not start sending the high-speed signal before a module is plugged in.
+
+### 6.2. New Approach
+
+To respond to the described problem statements, SONiC should do the following:
+
+1. Control transmitting of the high-speed signal based on module presence (allow this signal only when a module is plugged in).
+2. Trigger the module initialization (using the CMIS state machine) only when the high-speed signal is already transmitted by the ASIC towards a module.
+
+SWSS shall allow transmitting of the high-speed signal on receiving the INSERTION indication from PMON.
+Then, on setting the Admin Status to UP, the vendor SDK/FW shall start transmitting this signal to a module and shall report about that to SONiC.
+
+ASIC FW shall start transmitting the high-speed TX signal only when **both** conditions are met:
+
+1. It is allowed (by SWSS) to transmit this signal.
+2. The port mapped to the module is set to Admin UP.
+
+High Level Flow:
+

+Figure 2. High Level Flow +

+
+#### 6.2.1. host_tx_signal
+This flow shall be used only on supporting platforms.
+Hence, as part of PortsOrch initialization, SONiC will query the SAI capabilities regarding the support of the allowance flag for sending the high-speed signal to a module. This is done by checking whether SAI_PORT_ATTR_HOST_TX_SIGNAL_ENABLE is supported.
+In case SAI supports it, PortsOrch will start listening to TRANSCEIVER_INFO in STATE DB to learn about module plug events.
+
+A module's INSERTION/REMOVAL events shall trigger the calling of the SAI API on a Port object with SAI_PORT_ATTR_HOST_TX_SIGNAL_ENABLE to enable or disable the data signal from ASIC to module.
+
+Host Tx Signal Enable Flow:
+

+Figure 3. Host Tx Signal Enable Flow +

+
+NOTE:
+Setting SAI_PORT_ATTR_HOST_TX_SIGNAL_ENABLE to TRUE (when a module is plugged in) is not sufficient to start the transmission of the high-speed signal towards a module.
+Vendor SDK/FW should wait until the Admin status of the port mapped to this module is set to UP to start transmitting the high-speed signal.
+
+Once a module plug event occurs, Xcvrd in PMON will update the TRANSCEIVER_INFO table in STATE DB.
+
+In order to learn about module plug events, Ports OA will listen to changes in the TRANSCEIVER_INFO table.
+According to the information in the TRANSCEIVER_INFO table, Ports OA will send SAI_PORT_ATTR_HOST_TX_SIGNAL_ENABLE enable/disable to SAI, indicating whether sending the high-speed signal to the module is allowed or not.
+
+#### 6.2.2. host_tx_ready
+When the ASIC starts transmitting the high-speed signal toward a plugged module, the vendor SAI should notify SONiC (SWSS) of that via a new notification - SAI_SWITCH_ATTR_PORT_HOST_TX_READY_NOTIFY.
+
+SWSS shall use this notification to set the "host_tx_ready" flag in STATE DB, which will trigger the CMIS initialization of the module.
+This ensures that the module initialization doesn't start before the high-speed signal is transmitted by the ASIC to a module.
+
+The notification shall be expected and consumed only on platforms supporting it.
+Hence, the platform capabilities for the support of this feature should be queried on Ports OA init.
+
+On platforms not supporting this functionality, the “host_tx_ready” flag shall be set in STATE DB upon return of Port Admin status UP with a SUCCESS return code (backward-compatible behavior).
+

+Figure 4. Host Tx Ready Flow
+

+
+For supporting platforms, admin status UP is not sufficient for configuring the module.
+Ports OA will set host_tx_ready in STATE DB only after it knows a module was plugged in and it has received the notification from SAI about the high-speed signal.
+When the high-speed signal is sent, a notification from SAI named "SAI_SWITCH_ATTR_PORT_HOST_TX_READY_NOTIFY" will arrive at swss.
+PortsOrch will update STATE DB with host_tx_ready after the notification has arrived.
+
+The CMIS Manager task behavior will stay the same as today - it starts configuring the module only after the admin state is UP and host_tx_ready=true.
+
+### 6.3. Implementation Flow
+
+#### 6.3.1. Initialization Flow
+
+The proposed enhancement will be supported on some vendor platforms, so there is a need to learn whether the specific platform supports it or not.
+To do that, Ports OA shall use the query capability for the following SAI attributes:
+
+1. SAI_PORT_ATTR_HOST_TX_SIGNAL_ENABLE – to know whether the platform supports the control of enabling/disabling the high-speed signal from ASIC to module from SONiC.
+2. SAI_SWITCH_ATTR_PORT_HOST_TX_READY_NOTIFY – to know whether the platform supports the asynchronous notification from SDK/SAI to SWSS about start/stop of transmission of high-speed signal data from ASIC to module.
+
+Ports OA shall use the status of support of these capabilities for its flow.
+For example, the notification consumer for a new SAI_SWITCH_ATTR_PORT_HOST_TX_READY_NOTIFY notification will be initialized only if the platform supports the asynchronous notification from SAI/SDK to SWSS.
+
+#### 6.3.2. Enabling/disabling “host_tx_signal”
+As described earlier, for platforms supporting this capability, the high-speed signal shall be explicitly enabled/disabled by Ports OA on getting the Insertion/Removal event for a specific module.
+
+When a module is plugged in/out, PMON adds/deletes a per-port entry in the TRANSCEIVER_INFO table in STATE DB.
+This event shall be used by Ports OA as a trigger for enabling or disabling the host_tx_signal (by setting SAI_PORT_ATTR_HOST_TX_SIGNAL_ENABLE to true/false on the Port object).
+
+To detect this insertion/removal event, the Ports Orch Agent shall register for changes in the TRANSCEIVER_INFO table in STATE DB and will do the following:
+* INSERTION event – set SAI_PORT_ATTR_HOST_TX_SIGNAL_ENABLE to TRUE
+* REMOVAL event - set SAI_PORT_ATTR_HOST_TX_SIGNAL_ENABLE to FALSE
+
+Please note that on platforms supporting this functionality, SDK/FW shall start the transmission of the high-speed signal from ASIC to a module only if both criteria below are met:
+* TX signal is enabled (via setting SAI_PORT_ATTR_HOST_TX_SIGNAL_ENABLE to TRUE)
+* Admin Status is set to UP
+
+Whenever one of them is set to false (e.g. the TX signal is disabled due to a removal event, or the Admin status is set to DOWN), the SDK/FW stops transmission of this signal.
+
+This approach makes it possible to avoid the issue of unterminated transmission, where the ASIC starts the transmission of the high-speed signal even when no module is plugged in, causing cross-talk to adjacent ports, high EMI/BER, and a shortened transceiver lifespan.
+
+On port creation for platforms supporting the proposed enhancements, SAI_PORT_ATTR_HOST_TX_SIGNAL_ENABLE shall be explicitly set by Ports OA to FALSE, since its default SAI value is defined as TRUE for backward-compatibility reasons.
+
+#### 6.3.3. Handling of HW-based “host_tx_ready” event
+Per the description above, on platforms supporting the new enhancements, the SDK/FW shall send an asynchronous “host_tx_ready” notification to SWSS when the ASIC starts/stops sending the high-speed signal to a module.
+
+The vendor SDK shall internally support this indication from the ASIC (via the trap mechanism) and shall call the notification callback registered by Ports OA on initialization.
+The new design preserves backward compatibility. Therefore, the handling shall be as follows:
+
+| Platform Type | Ports OA Behavior |
+|------------------------------------------|------------------------------------------------------------------------------------------------------------------------------------------------|
+| Platform supporting new enhancements | On return from the SAI call setting the port's Admin status, nothing is done until the "host_tx_ready" notification event is received and consumed by Ports OA. |
+| Platform not supporting new enhancements | Ports OA will set "host_tx_ready" in STATE DB right after the Admin status SAI call returns (UP -> "true", DOWN -> "false"). |
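+
+For reference, the resulting flag can be inspected from the shell. The following is an illustrative spot-check only; the table and field names come from this design, and the database index and key format follow the usual SONiC conventions:
+
+```
+# Illustrative: read the per-port flag PortsOrch writes to STATE_DB (database 6)
+redis-cli -n 6 HGET "PORT_TABLE|Ethernet0" host_tx_ready
+```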
+
+##### 6.3.3.1. Warm-boot handling
+After warm-boot, SONiC will not receive a new “host_tx_ready” notification from SAI (because nothing has changed in the ASIC/module).
+To have an up-to-date status in STATE DB, the “host_tx_ready” status needs to be refreshed, as is done today for the port’s operational status and speed.
+To do that, new logic related to host_tx_ready will be added in PortsOrch::initializePort().
+For platforms supporting the proposed enhancements, the SAI_PORT_ATTR_HOST_TX_READY_STATUS SAI attribute shall be queried, and then the “host_tx_ready” flag in STATE DB shall be updated accordingly.
+
+```
+bool PortsOrch::initializePort(Port &port)
+{
+    ...
+    ...
+    if (saiTxReadyNotifySupported && saiHwTxSignalEnabled)
+    {
+        sai_attribute_t attr;
+        attr.id = SAI_PORT_ATTR_HOST_TX_READY_STATUS;
+
+        sai_status_t status = sai_port_api->get_port_attribute(port.m_port_id, 1, &attr);
+        if (status == SAI_STATUS_SUCCESS)
+        {
+            /* Mirror the queried HW state into STATE_DB */
+            string host_tx_ready = (attr.value.s32 == SAI_PORT_HOST_TX_READY_STATUS_READY) ? "true" : "false";
+
+            vector<FieldValueTuple> tuples;
+            tuples.emplace_back("host_tx_ready", host_tx_ready);
+            m_portTable->set(port.m_alias, tuples);
+        }
+    }
+    ...
+    ...
+}
+```
+
+#### 6.3.4. End-to-end Flow
+The flow below illustrates all enhancements proposed by the current document:
+

+Figure 5. E2E Flow
+

+ + +### 6.4. Unit Test cases +1. sonic-sairedis/unittest/lib/TestSwitch.cpp will be ajdusted to the new notification. +2. sonic-sairedis/unittest/meta/TestNotificationFactory.cpp will be adjusted to the new notification. +3. A new test names sonic-sairedis/unittest/meta/TestNotificationHostTxReadyEvent.cpp will be added to cover the new notification handler. +4. New functional tests will be added in sonic-swss/tests/mock_tests/portsorch_ut.cpp to cover all new changes in sonic-swss. +5. sonic-swss/tests/test_warm_reboot.py will be ajdusted to check the host_tx_ready for each port. + +### 6.5. Open/Action items - if any +N/A \ No newline at end of file diff --git a/doc/cmis-module-enhancement/img/E2E-Flow.svg b/doc/cmis-module-enhancement/img/E2E-Flow.svg new file mode 100644 index 00000000000..f981f786fdc --- /dev/null +++ b/doc/cmis-module-enhancement/img/E2E-Flow.svg @@ -0,0 +1 @@ +title%20New%20Flow%0A%0A%0Aparticipant%20%22Xcvrd%20(Cmis%20thread)%22%20as%20cmis%20%23PowderBlue%0Aparticipant%20%22Xcvrd%20(Sfp%20thread)%22%20as%20sfp%20%23PowderBlue%0Adatabase%20%22State%20DB%22%20as%20statedb%20%23PowderBlue%0Aparticipant%20%22PortsOrch%22%20as%20portsorch%20%23PowderBlue%0Aparticipant%20%22SAI%22%20as%20sai%20%23PowderBlue%0Aparticipant%20%22SDK%22%20as%20sdk%20%23PowderBlue%0Aparticipant%20%22FW%22%20as%20fw%20%23PowderBlue%0Aparticipant%20%22Module%22%20as%20module%20%23PowderBlue%0A%0A%0A%23%20initialization%0A%0A%0Aportsorch-%3Esai%3Asai_query_attribute_capabilities%5Cn(SAI_PORT_ATTR_HOST_TX_SIGNAL_ENABLE)%0Asai-%3Eportsorch%3A%20enable%2Fdisable%0A%0Anote%20over%20portsorch%3Asave%20to%20**saiHwTxSignalEnabled**%0A%0Aportsorch-%3Esai%3Asai_query_attribute_capabilities%5Cn(SAI_SWITCH_ATTR_PORT_HOST_TX_READY_NOTIFY)%0Asai-%3Eportsorch%3A%20enable%2Fdisable%0A%0Anote%20over%20portsorch%3A%20save%20to%20**saiTxReadyNotifySupported**%0A%0Aabox%20over%20portsorch%3A%20if%20saiHwTxSignalEnabled%3A%0Arbox%20right%20of%20portsorch%3A%20Register%20to%20TRANSCEIVER_INFO_TABLE%20changes%0A%0Aabox%20over%20portsorch%3A%20if%20saiTxReadyNotifySupported%3A%0Arbox%20right%20of%20portsorch%3A%20Initialize%20SAI%20notification%20consumer%0A%0A%3D%3D%3D%3D%0A%0A%23%20admin%20up%0Anote%20over%20portsorch%3A%20Set%20port%20admin%20status%20-%3E%20up%0Aportsorch-%3Esai%3Astatus%20%3D%20sai_port_api-%3E%5Cnset_port_attribute(port.m_port_id%2C%20%26attr)%3B%0Asai-%3Esdk%3A%0Asdk-%3Efw%3A%0A%0A%3D%3D%3D%3D%0A%0A%23%20plug%20event%20%26%20signal%0Anote%20over%20sfp%3A%20Module%20plug%20event%20occur%0Asfp-%3Estatedb%3A%20TRANSCIEVER_INFO%20update%20with%20module%0Astatedb-%3Eportsorch%3A%20change%20in%20TRANSCIEVER_INFO%20table%0Aportsorch-%3Esai%3A%20SAI_PORT_ATTR_HOST_TX_SIGNAL_ENABLE%5Cnenable%2Fdisable%20(for%20specific%20port)%0Asai-%3Esdk%3A%0Asdk-%3Efw%3A%0Arbox%20over%20fw%3A%20admin%20up%20%26%20high-speed%20signal%20allowed%0Afw-%3Emodule%3A%20send%20signal%0Arbox%20over%20module%3A%20signal%20OK%0Amodule--%3Efw%3A%0Afw--%3Esdk%3A%0Asdk--%3Esai%3A%0Asai--%3Eportsorch%3A%20%20SAI_SWITCH_ATTR_PORT_HOST_TX_READY_NOTIFY%0Aportsorch-%3Estatedb%3A%20set%20host_tx_ready%20%3D%20true%2Ffalse%0Astatedb-%3Ecmis%3Ahost_tx_ready%20changed%0Arbox%20over%20cmis%3A%20if%20true%20-%3E%20configure%20the%20module%0A%0ANew FlowXcvrd (Cmis thread)Xcvrd (Sfp thread)State DBPortsOrchSAISDKFWModulesai_query_attribute_capabilities(SAI_PORT_ATTR_HOST_TX_SIGNAL_ENABLE)enable/disablesave to saiHwTxSignalEnabledsai_query_attribute_capabilities(SAI_SWITCH_ATTR_PORT_HOST_TX_READY_NOTIFY)enable/disablesave to 
saiTxReadyNotifySupportedif saiHwTxSignalEnabled:Register to TRANSCEIVER_INFO_TABLE changesif saiTxReadyNotifySupported:Initialize SAI notification consumerSet port admin status -> upstatus = sai_port_api->set_port_attribute(port.m_port_id, &attr);Module plug event occurTRANSCIEVER_INFO update with modulechange in TRANSCIEVER_INFO tableSAI_PORT_ATTR_HOST_TX_SIGNAL_ENABLEenable/disable (for specific port)admin up & high-speed signal allowedsend signalsignal OK SAI_SWITCH_ATTR_PORT_HOST_TX_READY_NOTIFYset host_tx_ready = true/falsehost_tx_ready changedif true -> configure the module \ No newline at end of file diff --git a/doc/cmis-module-enhancement/img/Host-Tx-Ready-Flow.png b/doc/cmis-module-enhancement/img/Host-Tx-Ready-Flow.png new file mode 100644 index 00000000000..2beea01c126 Binary files /dev/null and b/doc/cmis-module-enhancement/img/Host-Tx-Ready-Flow.png differ diff --git a/doc/cmis-module-enhancement/img/Host-Tx-Signal-Enable-Flow.png b/doc/cmis-module-enhancement/img/Host-Tx-Signal-Enable-Flow.png new file mode 100644 index 00000000000..04d2fbc1566 Binary files /dev/null and b/doc/cmis-module-enhancement/img/Host-Tx-Signal-Enable-Flow.png differ diff --git a/doc/cmis-module-enhancement/img/The-Current-Flow.png b/doc/cmis-module-enhancement/img/The-Current-Flow.png new file mode 100644 index 00000000000..2939aed67ff Binary files /dev/null and b/doc/cmis-module-enhancement/img/The-Current-Flow.png differ diff --git a/doc/cmis-module-enhancement/img/The-High-Level-Flow.png b/doc/cmis-module-enhancement/img/The-High-Level-Flow.png new file mode 100644 index 00000000000..2421bcdf8a3 Binary files /dev/null and b/doc/cmis-module-enhancement/img/The-High-Level-Flow.png differ diff --git a/doc/dash/dash-sonic-hld.md b/doc/dash/dash-sonic-hld.md index 6eb6ef34eae..c7e10ff0fb8 100644 --- a/doc/dash/dash-sonic-hld.md +++ b/doc/dash/dash-sonic-hld.md @@ -1242,7 +1242,7 @@ For the example configuration above, the following is a brief explanation of loo For the inbound direction, after Route/ACL lookup, pipeline shall use the "underlay_ip" as specified in the ENI table to VXLAN encapsulate the packet and VNI shall be the ```vm_vni``` specified in the APPLIANCE table - 5. Inbound packet destined to 10.1.2.5 with source PA 101.1.2.3 and VNI 45654 + 5. Inbound packet destined to 10.1.1.1 with source PA 101.1.2.3 and VNI 45654 a. After setting direction to inbound, the Route Rule table is looked up based on priority b. First Inbound rule gets hit as PR prefix and VNI key match c. PA validation is set to true and Vnet is given as Vnet1. 
diff --git a/doc/debian_upgrade/SONiC_Debian_Upgrade_Cadence.md b/doc/debian_upgrade/SONiC_Debian_Upgrade_Cadence.md
new file mode 100644
index 00000000000..45c39c72025
--- /dev/null
+++ b/doc/debian_upgrade/SONiC_Debian_Upgrade_Cadence.md
@@ -0,0 +1,220 @@
+# SONiC Debian Upgrade Cadence
+## Table of Contents
+  - [Revision history](#revision-history)
+  - [Scope](#scope)
+  - [SONiC Base image Debian Upgrade](#sonic-base-image-debian-upgrade)
+    - [Debian Release Schedule](#debian-release-schedule)
+    - [Timeline](#timeline)
+    - [Procedure](#procedure)
+  - [SONiC Container Debian Upgrade](#sonic-container-debian-upgrade)
+    - [Guidelines](#guidelines)
+    - [Procedure](#procedure-1)
+    - [Timeline](#timeline-1)
+  - [SONiC Debian support](#sonic-debian-support)
+    - [Guidelines](#guidelines-1)
+    - [Procedure](#procedure-2)
+    - [Timeline](#timeline-2)
+  - [SONiC Debian Deprecation](#sonic-debian-deprecation)
+
+## Revision history
+
+| Rev | Date       | Authors                           | Change Description     |
+|-----|------------|-----------------------------------|------------------------|
+| 0.1 | 01/01/2024 | Pavan Naregundi, Saikrishna Arcot | Initial version        |
+
+## Scope
+SONiC is a free and open-source network operating system based on Debian Linux. In order to keep SONiC updated with new features, bug fixes, and security updates, we want to make sure that both the SONiC base image and all of the containers are based on the most recent version of Debian. The scope of this document is to describe the process and cadence for the following:
+
+* SONiC Base image Debian Upgrade
+* SONiC Container Debian Upgrade
+* SONiC Debian Support
+* SONiC Debian Deprecation
+
+## SONiC Base image Debian Upgrade
+This section describes the SONiC base image Debian upgrade process and cadence.
+
+### Debian Release Schedule
+Debian does not officially have a fixed release schedule. The Debian community will release a new version when it’s ready. Unofficially, based on the [releases](https://www.debian.org/releases/) since Debian Stretch (in 2017), new Debian versions have come out every 2 years, around June-August, with Bookworm released in June 2023. Based on this schedule, it’s reasonable to assume that Debian Trixie may get released around June-August 2025.
+
+
+### Timeline
+Given the Debian release info, and the fact that the current SONiC release trend is to have a release in May and November, the goal should be to target the base image upgrade to the new Debian version for the November release. If, however, the release schedule changes such that there are less than 3 months between the Debian release and the SONiC branch cutoff, it would be recommended to push back the base image upgrade to the next SONiC release.
+
+If needed, some work (such as creating the slave container or the new kernel migration) can be done at the full freeze of the Debian release (which will likely be about 2-3 weeks prior to the release). This gives a little bit of extra room in the schedule. However, do note that there is a chance of issues being present in the new version during this time, so please expect potential changes to packages.
+
+
+### Procedure
+To accomplish this, there are some changes that should be present in sonic-net/sonic-buildimage, either in the master branch or in a development branch. Specifically, this is the slave container for the new Debian version. This is required for sonic-net/sonic-linux-kernel to be able to build the new kernel version.
+Hence, the following tasks need to be done first, and merged into the master branch or in a development branch in sonic-net/sonic-buildimage: + +1. Create the slave container for the new Debian version. + a. This doesn’t need to be the final/official slave container that will be used. For now, it needs to be able to build the new kernel, which should be fairly easy to accomplish. +2. Make changes to Makefile and Makefile.work to be able to build the new slave container. + +After that is done, a new Azure pipeline will need to be created to build the slave container and publish it to the container registry, so that the sonic-net/sonic-linux-kernel build can use it. + +Then, work on both upgrading the kernel to the new version (and disabling patches/configs as necessary to get the build to succeed) as well as building a VS/KVM image, but with the kernel build disabled (if needed), can begin. This part can proceed in parallel, since for the userspace applications in the VS build, there shouldn’t be any hard dependency on the specific kernel that we build. Note that getting the kernel build done will likely take less time than getting the userspace build done, depending on the changes that are in the new version of Debian. + +On the userspace side, depending on the number of changes needed and depending on what is being built, it may be easier to disable the build of that application/package for now to get an image built. In some cases, a version upgrade may end up being all that is needed. In other cases, patches or actual code may need to be updated. + +Note also that there may be changes needed in submodules. In most cases, changes done in submodules must not break the build on the current Debian version. This is because they may be installed and used in both the base image and in some container. There are two submodules (that I’m currently aware of) that only get installed on the base image. In these cases, it may be reasonable to have a separate development branch in those submodules to make any needed changes (including breaking changes, if needed) and use that until the final merge into master branch. These submodules are: + +* src/sonic-linux-kernel +* platform/broadcom/saibcm-modules-dnx + +The src/sonic-host-services submodule is used in the docker-sonic-vs container, meaning it needs to (largely) stay compatible with the current Debian version and the new Debian version. Because of this, there may need to be breaking changes in this repo, and this repo may need a new debian specific branch as well. + +For all other submodules, any changes needed there should be done in a way that doesn’t break anything on the current Debian version. They should be merged into the master branch of that submodule, which will eventually get picked up in a submodule update to the master branch of sonic-net/sonic-buildimage. + +Once the VS image is built, and it boots up, at this point, it should be possible to build images for the individual platforms. Note that kernel modules that are built as part of that platform would need to be updated or disabled. In addition, applications/packages that were disabled earlier can now be fixed up and built into the image. During this time, regular code syncs from the master branch should be happening, so as to find any breaking changes in master branch and fix them sooner rather than later. I recommend doing a git rebase of the development branch on top of the master branch, so as to keep the git history for the development branch cleaner. 
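+
+As a minimal sketch of the sync workflow recommended above (the branch name is illustrative, not prescribed by this document):
+
+```
+# Periodically re-sync the Debian development branch onto master,
+# keeping its history linear, then realign submodule pointers.
+git fetch origin
+git checkout bookworm-dev
+git rebase origin/master
+git submodule update --init --recursive
+```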
+
+An estimate of how much time is needed for each task is given below:
+
+| Task | Time Estimate |
+|----------------------------------|---------------|
+| Create slave container | 1 week |
+| Update Makefile and Makefile.work to be able to build the new slave container | 2 days |
+| Merge into master (or dev) branch of sonic-net/sonic-buildimage | 2 days |
+| Define new Azure pipeline to build the new slave container | 2 days |
+| Update kernel to build the new version | 1.5 weeks |
+| Update slave.mk to build images for the new Debian version | 1 day |
+| Make changes in sonic-buildimage to build VS image (with modules disabled as needed) | 2.5 weeks |
+| Update platform modules and python scripts for different platforms for the new Debian release | 6 weeks |
+| Fix up issues/TODOs added when building the VS image | In parallel with previous task. Should be within 2-3 weeks, depending on the scope of issues. |
+| Ensure that the kernel is stable, and that all images are building and functional | 1 week |
+
+This gives a total estimate of about 10.5-11 weeks (assuming that tasks that can be done in parallel are done in parallel).
+
+## SONiC Container Debian Upgrade
+This section describes the SONiC container Debian upgrade process and cadence.
+
+As explained in the previous section, the SONiC base image will be upgraded to the latest Debian release first, and the first SONiC release of a Debian upgrade cycle will target only this.
+
+The guidelines and procedure for the container Debian upgrade are as follows:
+
+
+### Guidelines
+
+* The container upgrade will be targeted from the next release after the base SONiC Debian upgrade. Let us call this release the ‘second’ release of the Debian upgrade cycle.
+  * Ex: with the Bookworm base Debian upgrade in 202311, the container upgrade will be targeted from release 202405.
+* Following is the list of containers which need an upgrade. The list is further broken into Phase 1 and Phase 2. All Phase 1 containers are enabled by default in ‘rules/config’ or built by default. Also, some of the Phase 2 containers are used for specific use cases.
+  #### Phase 1
+  * database
+  * swss and orchagent
+  * teamd
+  * pmon
+  * lldp
+  * snmp
+  * syncd/saiserver/syncd-rpc
+  * frr
+  * radvd
+  * nat
+  * eventd
+  * dhcp-relay
+  * telemetry
+  * macsec
+  * sflow
+  * mux
+
+  #### Phase 2
+  * p4rt
+  * gbsyncd
+  * iccpd
+  * restapi
+  * dhcp-server
+  * sonic-sdk
+  * ptf
+  * mgmt-framework
+  * PDE
+
+* Phase 1 container upgrades should be covered in the ‘second’ release. It is also recommended to upgrade the Phase 2 containers in the 'second' release, but this will be best effort only.
+  * Ex: the Phase 1 containers should be targeted for upgrade to Bookworm in the 202405 release. Containers from the Phase 2 list can be included in the 202405 release if the PR is raised in time.
+
+
+### Procedure
+
+* Create docker-base-\<codename\> and docker-config-engine-\<codename\>.
+* Create docker-swss-layer-\<codename\>.
+  * orchagent, teamd, frr, nat and sflow are using this swss-layer.
+* Upgrade the Dockerfile.j2 of each container to point to the latest Debian (a quick verification sketch follows this list).
+  * Sometimes this may need packaging updates in some containers.
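+
+A quick, illustrative way to verify a migrated container after a rebuild (the container name and expected codename below are examples only):
+
+```
+# Confirm the rebuilt container is actually based on the new Debian release
+docker exec snmp grep VERSION_CODENAME /etc/os-release   # expect: VERSION_CODENAME=bookworm
+```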
+
+
+### Timeline
+
+| Task                                         | Time estimate  | Description                        |
+| ---------------------------------------------| ---------------|------------------------------------|
+| Create base docker and config-engine docker  | 1 week         |                                    |
+| Create swss-layer docker                     | 1 week         |                                    |
+| Dockerfile migration for each container      | -              | Task for the respective owners to upgrade the containers. |
+
+## SONiC Debian support
+This section describes the SONiC Debian support process used to update SONiC with the latest fixes/CVEs from the Debian community.
+
+The Debian community has three active stable releases, named stable, oldstable and oldoldstable ([Debian Releases](https://www.debian.org/releases/)). SONiC releases mapping to these Debian releases need active support.
+
+This document targets packages which need manual updates in sonic-buildimage. Below are the guidelines and procedure.
+
+
+### Guidelines
+
+* The Debian sources which need active support are currently limited to the list below. The minor versions of these software packages will be updated to the latest available from the Debian community.
+  * Linux Kernel
+* Target timelines for Linux Kernel minor version upgrades on different SONiC branches:
+  * Branches based on Debian 'Stable' - 6 months.
+  * Branches based on Debian 'OldStable' - 6 months.
+  * Branches based on Debian 'OldOldStable' - Based on requirement.
+  * Ex: As per the current status, below is the mapping of SONiC branches to Debian releases.
+    * OldOldStable (Buster) - 202006, 202012, 202106
+    * OldStable (Bullseye) - 202111, 202205, 202211, 202305, 202311
+    * Stable (Bookworm) - master, 202405 (planned)
+* The other third-party packages listed below will be updated based on requirement:
+  * bash
+  * isc-dhcp-relay
+  * iproute2
+  * iptables
+  * kdump-tools
+  * ntp
+  * protobuf
+  * snmpd
+  * socat
+  * lm-sensors
+  * redis
+  * FRR
+  * ifupdown2
+  * libnl3
+  * libteam
+  * monit
+  * openssh
+  * ptf
+  * libyang
+  * ldpd
+  * thrift
+  * dockerd
+
+
+### Procedure
+
+* Linux Kernel minor version update:
+  * Update the minor version in sonic-linux-kernel.
+    * Vendors may need to update or remove patches.
+  * Update sonic-buildimage:
+    * Update the installer file with the latest minor version.
+    * Update the makefiles related to kernel modules from different vendors.
+      * Vendors may need to update drivers.
+* Changes need to be pushed to all target branches.
+
+
+### Timeline
+
+| Task                                 | Time estimate       |
+| ------------------------------------| --------------------|
+| sonic-linux-kernel changes           | -                   |
+| sonic-buildimage changes             | -                   |
+| Backport changes to target branches  | -                   |
+
+## SONiC Debian Deprecation
+
+Deprecation defines when to stop Debian-based support for a SONiC release/branch.
+
+Deprecation of an older SONiC branch for Debian support will happen after the EOL of the LTS Debian version. After deprecation of a SONiC branch, the Debian source list will point to the last stable LTS snapshot archive for continued builds.
diff --git a/doc/event-alarm-framework/event-alarm-framework.md b/doc/event-alarm-framework/event-alarm-framework.md
index 00764de6637..9238b32544e 100644
--- a/doc/event-alarm-framework/event-alarm-framework.md
+++ b/doc/event-alarm-framework/event-alarm-framework.md
@@ -311,16 +311,16 @@ The following additional parameters to be given with this api:
 For e.g call for port down event.
 current call:
-    event_params_t params = {{"ifname",port.m_alias},{"status",isUp ? 
"up" : "down"/}/}; event_publish(g_events_handle, "if-state", ¶ms); new call: - event_params_t params = {{"ifname",port.m_alias},{"status",isUp ? "up" : "down"}, {"resource", port.m_alias}, {"event-id", "INTERFACE_OPER_STATUS_CHANGE"}, {"text", isUp? "status:UP" : "status:DOWN"}}; + event_params_t params = /{/{"ifname",port.m_alias},{"status",isUp ? "up" : "down"}, {"resource", port.m_alias}, {"event-id", "INTERFACE_OPER_STATUS_CHANGE"}, {"text", isUp? "status:UP" : "status:DOWN"/}/}; event_publish(g_events_handle, "if-state", ¶ms); e.g., Sensor temperature critical high - event_params_t params = {{"event-id", "SENSOR_TEMP_CRTICAL_HIGH"}, {"text", "Current temperature {}C, critical high threshold {}C", {"action":"RAISE_ALARM"}, {"resource":"sensor_name"}}} ; + event_params_t params = /{/{"event-id", "SENSOR_TEMP_CRTICAL_HIGH"}, {"text", "Current temperature {}C, critical high threshold {}C", {"action":"RAISE_ALARM"}, {"resource":"sensor_name"/}/}/} ; event_publish(g_events_handle, "sensor_temp_critical_high", ¶ms); ### 3.1.2 Event Consumer The event consumer is a class in EventDB service that processes the incoming events. diff --git a/doc/handle-ASIC-SDK-health-event/handle-ASIC-SDK-health-event-images/archchart.png b/doc/handle-ASIC-SDK-health-event/handle-ASIC-SDK-health-event-images/archchart.png new file mode 100644 index 00000000000..3af8406c8c0 Binary files /dev/null and b/doc/handle-ASIC-SDK-health-event/handle-ASIC-SDK-health-event-images/archchart.png differ diff --git a/doc/handle-ASIC-SDK-health-event/handle-ASIC-SDK-health-event-images/initflow.png b/doc/handle-ASIC-SDK-health-event/handle-ASIC-SDK-health-event-images/initflow.png new file mode 100644 index 00000000000..6726361961d Binary files /dev/null and b/doc/handle-ASIC-SDK-health-event/handle-ASIC-SDK-health-event-images/initflow.png differ diff --git a/doc/handle-ASIC-SDK-health-event/handle-ASIC-SDK-health-event-images/normalflow.png b/doc/handle-ASIC-SDK-health-event/handle-ASIC-SDK-health-event-images/normalflow.png new file mode 100644 index 00000000000..0d27210bea1 Binary files /dev/null and b/doc/handle-ASIC-SDK-health-event/handle-ASIC-SDK-health-event-images/normalflow.png differ diff --git a/doc/handle-ASIC-SDK-health-event/handle-ASIC-SDK-health-event-images/suppressflow.png b/doc/handle-ASIC-SDK-health-event/handle-ASIC-SDK-health-event-images/suppressflow.png new file mode 100644 index 00000000000..f11a24a08c2 Binary files /dev/null and b/doc/handle-ASIC-SDK-health-event/handle-ASIC-SDK-health-event-images/suppressflow.png differ diff --git a/doc/handle-ASIC-SDK-health-event/handle-ASIC-SDK-health-event.md b/doc/handle-ASIC-SDK-health-event/handle-ASIC-SDK-health-event.md new file mode 100644 index 00000000000..663b21317e5 --- /dev/null +++ b/doc/handle-ASIC-SDK-health-event/handle-ASIC-SDK-health-event.md @@ -0,0 +1,714 @@ +# Handle ASIC/SDK health event # + +## Table of Content + +### Revision + +| Rev | Date | Author | Change Description | +|:---:|:-----------:|:------------------:|-----------------------------------| +| 0.1 | Oct 23, 2023 | Stephen Sun | Initial version | +| 0.2 | Nov 17, 2023 | Stephen Sun | Fix internal review comments | +| 0.3 | Dec 11, 2023 | Stephen Sun | Adjust for multi ASIC platform according to the common pratice in the community | +| 0.4 | Jan 05, 2023 | Stephen Sun | Address community review comments | +| 0.5 | Jan 11, 2023 | Stephen Sun | Minor adjustments in CLI | + +### Scope + +This document describes the high level design of handle ASIC/SDK health event framework in 
+
+### Definitions/Abbreviations
+
+| Name | Meaning |
+|:----:|:-------:|
+| ASIC/SDK health event | A health event is a way for SAI to inform the NOS about HW/SW health issues. Usually they are not directly caused by a SAI API call. |
+|| An ASIC/SDK health event is described using `severity`, `category`, `timestamp`, `description`. |
+|| For a multi ASIC system it also includes the `asic name`. |
+| severity of an ASIC/SDK health event | one of `fatal`, `warning`, and `notice`, which represents how severe the event is |
+| category of an ASIC/SDK health event | one of `software`, `firmware`, `cpu_hw`, `asic_hw`, which usually represents the component from which the event is detected |
+
+### Overview
+
+This document introduces a way for syncd to notify orchagent of an ASIC/SDK health event before asking orchagent to shut down.
+
+For most Ethernet switches, the switch ASIC is the core component of the system. It is very important to identify that a switch ASIC is in a failure state and to report such an event to the NOS.
+
+Currently, such failures are detected by the SDK/FW on most platforms. A vendor SAI notifies orchagent to shut down using the `switch_shutdown_request` notification when it detects an ASIC/SDK internal error. Usually, the vendor SAI prints a log message before calling the shutdown API.
+
+Orchagent can also abort itself if a SAI API call fails, usually due to a bad argument, and cannot recover. From a customer's perspective, this case can be distinguished from an ASIC/SDK health event only by analyzing the log messages.
+
+The current implementation has the following limitations:
+
+- It is difficult for a customer to understand what occurred in SAI and below, or to distinguish an SDK/FW internal error from a failing SAI API call. Even if a customer can analyze the issue using the log messages, it is not intuitive.
+- It cannot notify an ASIC/FW/SDK event if the event is not serious enough to ask for a shutdown.
+- The telemetry agent is unable to collect such information.
+
+In this design, we will introduce a new way to address these limitations.
+
+### Requirements
+
+This section lists all the requirements covered by this HLD, and the exemptions (not supported), if any, for this design.
+
+1. Capabilities
+   1. A vendor SAI should expose the corresponding SAI switch attributes if it supports ASIC/SDK health events so that orchagent can fetch them using `sai_query_attribute_capability`
+   2. Orchagent shall not set any SAI switch attribute that is not supported by the vendor SAI.
+2. Any vendor SAI that supports the feature shall send a `switch_asic_sdk_health_event` notification when it detects a HW/SW health issue.
+   1. If the issue is serious enough to shut down the switch, the vendor SAI shall notify `switch_asic_sdk_health_event` before `switch_shutdown_request`
+   2. Otherwise, the vendor SAI will not notify `switch_shutdown_request`.
+3. On receiving an ASIC/SDK health event, the orchagent shall
+   1. Extract data from the event (severity, timestamp, description) and push the data to the STATE_DB table using the timestamp and date as a key
+   2. Report the event to the gNMI server using the event collection mechanism
+4. CLI commands shall be provided to display or clear all the ASIC/SDK health events in the STATE_DB
+5. A CLI command shall be provided for a customer
+   1. to suppress a certain type of ASIC/SDK health event on a certain severity.
+   2. to eliminate old ASIC/SDK health events in the database in order to avoid consuming too many resources.
+6. ASIC/SDK health events should be collected in `show techsupport` as an independent file in `dump`.
+
+### Architecture Design
+
+The current architecture is not changed in this design.
+
+### High-Level Design
+
+![Flow](handle-ASIC-SDK-health-event-images/archchart.png "Figure: An overall block chart - handle ASIC/SDK health event")
+
+The mechanism to handle SDK/FW internal events is enhanced in the following way in this design.
+
+- Orchagent registers a notification handler for `switch_asic_sdk_health_event` with SAI during system initialization for all severities.
+  - Capabilities will be fetched ahead of registering the event and exposed to `STATE_DB`.
+  - A user can suppress the events that he/she is not interested in by severity and category using configuration.
+- A vendor SAI notifies orchagent of an ASIC/SDK event using the `switch_asic_sdk_health_event` notification with the corresponding arguments when it detects a HW/SW issue.
+  - The orchagent stores the information of the ASIC/SDK health event in the database and pushes it to the gNMI server using the event collection mechanism.
+- The vendor SAI notifies orchagent to shut down using the `switch_shutdown_request` notification if the event is serious enough.
+  Orchagent will abort on receiving the notification.
+  - Otherwise, `switch_shutdown_request` will not be sent and the system continues to run.
+- The ASIC/SDK health events stored in `STATE_DB.ASIC_SDK_HEALTH_EVENT_TABLE` can be displayed or cleared using CLI commands.
+
+The `timestamp`, `severity`, and `category` are represented differently in the various components.
+
+The `timestamp` is converted to the format "%Y-%m-%d %H:%M:%S", which is a wall time based on the timezone in the `swss` docker container.
+
+The `severity` is mapped according to the next table:
+
+| Representation in SONiC | Enumeration in SAI headers | SAI attribute to register corresponding events |
+|:---:|:---:|:---:|
+| fatal | SAI_SWITCH_ASIC_SDK_HEALTH_SEVERITY_FATAL | SAI_SWITCH_ATTR_REG_FATAL_SWITCH_ASIC_SDK_HEALTH_CATEGORY |
+| warning | SAI_SWITCH_ASIC_SDK_HEALTH_SEVERITY_WARNING | SAI_SWITCH_ATTR_REG_WARNING_SWITCH_ASIC_SDK_HEALTH_CATEGORY |
+| notice | SAI_SWITCH_ASIC_SDK_HEALTH_SEVERITY_NOTICE | SAI_SWITCH_ATTR_REG_NOTICE_SWITCH_ASIC_SDK_HEALTH_CATEGORY |
+
+The `category` is mapped according to the next table:
+
+| Representation in SONiC | Enumeration in SAI headers |
+|:---:|:---:|
+| software | SAI_SWITCH_ASIC_SDK_HEALTH_CATEGORY_SW |
+| firmware | SAI_SWITCH_ASIC_SDK_HEALTH_CATEGORY_FW |
+| cpu_hw | SAI_SWITCH_ASIC_SDK_HEALTH_CATEGORY_CPU_HW |
+| asic_hw | SAI_SWITCH_ASIC_SDK_HEALTH_CATEGORY_ASIC_HW |
+
+This is a built-in SONiC feature implemented in the following sub-modules:
+
+- sonic-swss, which handles the SAI notification, storing it in the database and pushing it to the gNMI server
+- sonic-sairedis, which transmits the ASIC/SDK health events reported by the vendor SAI to orchagent
+- sonic-utilities, in which the CLI to display and clear ASIC/SDK health events and to configure suppressing ASIC/SDK health events is implemented
+- sonic-buildimage, in which the new YANG models for the new events are defined
+
+#### DB changes
+
+##### STATE_DB change
+
+###### Table ASIC_SDK_HEALTH_EVENT_TABLE
+
+Table `ASIC_SDK_HEALTH_EVENT_TABLE` contains the ASIC/SDK health event information.
+
+```text
+key = ASIC_SDK_HEALTH_EVENT_TABLE:timestamp_string ; "%Y-%m-%d %H:%M:%S", full-date and partial-time separated by white space.
+ ; Example: 2022-09-12 09:39:19 +severity = "fatal" | "warning" | "notice" +category = "software" | "firmware" | "cpu_hw" | "asic_hw" +description = 1*255VCHAR ; ASIC/SDK health event's description text +``` + +###### Table SWITCH_CAPABILITY + +Table `SWITCH_CAPABILITY` is not a new table. It has been designed to represent various switch object capabilities supported on the platform. + +The following fields will be introduced in this design. + +```text +ASIC_SDK_HEALTH_EVENT = "true" | "false" ; whether SAI attribute SAI_SWITCH_ATTR_SWITCH_ASIC_SDK_HEALTH_EVENT_NOTIFY is supported +REG_FATAL_ASIC_SDK_HEALTH_CATEGORY = "true" | "false" ; whether SAI attribute SAI_SWITCH_ATTR_REG_FATAL_SWITCH_ASIC_SDK_HEALTH_CATEGORY is supported +REG_WARNING_ASIC_SDK_HEALTH_CATEGORY = "true" | "false" ; whether SAI attribute SAI_SWITCH_ATTR_REG_WARNING_SWITCH_ASIC_SDK_HEALTH_CATEGORY is supported +REG_NOTICE_ASIC_SDK_HEALTH_CATEGORY = "true" | "false" ; whether SAI attribute SAI_SWITCH_ATTR_REG_NOTICE_SWITCH_ASIC_SDK_HEALTH_CATEGORY is supported +``` + +### SAI API + +There is no new SAI API introduced nor changed. + +The following SAI attributes of switch object defined in `SAI/inc/saiswitch.h` are used in this document. + +```C + /** + * @brief Health notification callback function passed to the adapter. + * + * Use sai_switch_asic_sdk_health_event_notification_fn as notification function. + * + * @type sai_pointer_t sai_switch_asic_sdk_health_event_notification_fn + * @flags CREATE_AND_SET + * @default NULL + */ + SAI_SWITCH_ATTR_SWITCH_ASIC_SDK_HEALTH_EVENT_NOTIFY, + + /** + * @brief Registration for health fatal categories. + * + * For specifying categories of causes for severity fatal events + * + * @type sai_s32_list_t sai_switch_asic_sdk_health_category_t + * @flags CREATE_AND_SET + * @default empty + */ + SAI_SWITCH_ATTR_REG_FATAL_SWITCH_ASIC_SDK_HEALTH_CATEGORY, + + /** + * @brief Registration for health warning categories. + * + * For specifying categories of causes for severity warning events + * + * @type sai_s32_list_t sai_switch_asic_sdk_health_category_t + * @flags CREATE_AND_SET + * @default empty + */ + SAI_SWITCH_ATTR_REG_WARNING_SWITCH_ASIC_SDK_HEALTH_CATEGORY, + + /** + * @brief Registration for health notice categories. + * + * For specifying categories of causes for severity notice events + * + * @type sai_s32_list_t sai_switch_asic_sdk_health_category_t + * @flags CREATE_AND_SET + * @default empty + */ + SAI_SWITCH_ATTR_REG_NOTICE_SWITCH_ASIC_SDK_HEALTH_CATEGORY, +``` + +The following type definitions for the SAI attributes defined in `SAI/inc/saiswitch.h` are used in this document. 
+
+```C
+/**
+ * @brief Switch health event severity
+ */
+typedef enum _sai_switch_asic_sdk_health_severity_t
+{
+    /** Switch event severity fatal */
+    SAI_SWITCH_ASIC_SDK_HEALTH_SEVERITY_FATAL,
+
+    /** Switch event severity warning */
+    SAI_SWITCH_ASIC_SDK_HEALTH_SEVERITY_WARNING,
+
+    /** Switch event severity notice */
+    SAI_SWITCH_ASIC_SDK_HEALTH_SEVERITY_NOTICE
+
+} sai_switch_asic_sdk_health_severity_t;
+
+/**
+ * @brief Switch health categories
+ */
+typedef enum _sai_switch_asic_sdk_health_category_t
+{
+    /** Switch health software category */
+    SAI_SWITCH_ASIC_SDK_HEALTH_CATEGORY_SW,
+
+    /** Switch health firmware category */
+    SAI_SWITCH_ASIC_SDK_HEALTH_CATEGORY_FW,
+
+    /** Switch health cpu hardware category */
+    SAI_SWITCH_ASIC_SDK_HEALTH_CATEGORY_CPU_HW,
+
+    /** Switch health ASIC hardware category */
+    SAI_SWITCH_ASIC_SDK_HEALTH_CATEGORY_ASIC_HW
+
+} sai_switch_asic_sdk_health_category_t;
+
+/**
+ * @brief Switch health event callback
+ *
+ * @objects switch_id SAI_OBJECT_TYPE_SWITCH
+ *
+ * @param[in] switch_id Switch Id
+ * @param[in] severity Health event severity
+ * @param[in] timestamp Time and date of receiving the SDK Health event
+ * @param[in] category Category of cause
+ * @param[in] data Data of switch health
+ * @param[in] description JSON-encoded description string with information delivered from SDK event/trap
+ * Example of a possible description:
+ * {
+ *    "switch_id": "0x00000000000000AB",
+ *    "severity": "2",
+ *    "timestamp": {
+ *        "tv_sec": "22429",
+ *        "tv_nsec": "3428724"
+ *    },
+ *    "category": "3",
+ *    "data": {
+ *        data_type: "0"
+ *    },
+ *    "additional_data": "Some additional information"
+ * }
+ */
+
+typedef void (*sai_switch_asic_sdk_health_event_notification_fn)(
+        _In_ sai_object_id_t switch_id,
+        _In_ sai_switch_asic_sdk_health_severity_t severity,
+        _In_ sai_timespec_t timestamp,
+        _In_ sai_switch_asic_sdk_health_category_t category,
+        _In_ sai_switch_health_data_t data,
+        _In_ const sai_u8_list_t description);
+
+```
+
+The following type definitions for the SAI attributes defined in `SAI/inc/saitypes.h` are used in this document.
+
+```C
+typedef enum _sai_health_data_type_t
+{
+    /** General health data type */
+    SAI_HEALTH_DATA_TYPE_GENERAL
+} sai_health_data_type_t;
+
+typedef struct _sai_switch_health_data_t
+{
+    /** Type of switch health data */
+    sai_health_data_type_t data_type;
+
+} sai_switch_health_data_t;
+```
+
+### Configuration and management
+
+#### Manifest (if the feature is an Application Extension)
+
+N/A.
+
+#### CLI Enhancements
+
+##### Configure suppress ASIC/SDK health events by severity and category
+
+Command `config asic-sdk-health-event suppress <severity> [--category-list <category-list>|none|all] [--max-events <max-events>] [--namespace|-n <namespace>]` is introduced for a customer to configure:
+
+- the categories that he/she wants to suppress for a certain severity.
+- the maximum number of ASIC/SDK health events to be stored in `STATE_DB.ASIC_SDK_HEALTH_EVENT_TABLE`.
+
+The severity can be one of `fatal`, `warning`, and `notice`.
+
+The category-list is a list whose elements are one of `software`, `firmware`, `cpu_hw`, `asic_hw` separated by a comma. The order does not matter.
+
+- If the category-list is `none`, no category is suppressed and all categories will be notified for `severity`, and the field `categories` will be removed.
+- If the category-list is `all`, all categories are suppressed and no category will be notified for `severity`, and the field `categories` is set to the list of all categories.
+
+The max-events is a number, which represents the maximum number of events the customer wants to store in the database.
+
+- If the max-events is `0`, all events of that severity will be stored in the database and the field `max_events` will be removed.
+
+If neither `category-list` nor `max-events` exists, the entry will be removed from `CONFIG_DB.SUPPRESS_ASIC_SDK_HEALTH_EVENT`.
+
+The namespace is an option for multi ASIC platforms only.
+
+If a non-zero `max-events` is configured, the system will remove the oldest events of each severity every hour.
+
+If a `category-list` is configured, the ASIC/SDK health events whose `category` is in the `category-list` and whose severity is `severity` will not be reported by the vendor SAI once the corresponding SAI attributes are set.
+However, events that are reported after the command is executed but before the SAI attributes are set will still be handled by orchagent and pushed into `STATE_DB.ASIC_SDK_HEALTH_EVENT_TABLE` as usual.
+
+E.g. 1: the following command will suppress `notice` events with categories `asic_hw` and `cpu_hw`:
+
+`config asic-sdk-health-event suppress notice --category-list asic_hw,cpu_hw`
+
+After that, the ASIC/SDK health events whose `category` is one of `asic_hw` and `cpu_hw` and whose `severity` is `notice` will not be reported.
+
+E.g. 2: the following command will configure the maximum number of `notice` events to `10240`:
+
+`config asic-sdk-health-event suppress notice --max-events 10240`
+
+After that, only the most-recently-received 10240 ASIC/SDK health events of severity `notice` will be stored in the `STATE_DB.ASIC_SDK_HEALTH_EVENT_TABLE`. All the older entries will be removed.
+
+The following error message will be shown if a customer configures it on a platform that does not support it.
+
+`ASIC/SDK health event is not supported on the platform`
+
+The following error message will be shown if a customer suppresses a severity which is not supported on the platform.
+
+`Suppressing ASIC/SDK health {severity} event is not supported on the platform`
+
+##### Display the ASIC/SDK health events
+
+Command `show asic-sdk-health-event received [--namespace|-n <namespace>]` is introduced to display the ASIC/SDK health events as a table.
+The namespace is an option for multi ASIC platforms only.
+
+An example of the output is as below:
+
+```
+admin@sonic:~$ show asic-sdk-health-event received
+Time                 Severity    Category   Description
+-------------------  ----------  ---------  -----------------
+2023-10-20 05:07:34  fatal       firmware   Command timeout
+2023-10-20 03:06:25  fatal       software   SDK daemon keep alive failed
+2023-10-20 05:07:34  fatal       asic_hw    Uncorrectable ECC error
+2023-10-20 01:58:43  notice      asic_hw    Correctable ECC error
+```
+
+An example of the output on a multi ASIC system:
+
+```
+admin@sonic:~$ show asic-sdk-health-event received
+asic0:
+Time                 Severity    Category   Description
+-------------------  ----------  ---------  -----------------
+2023-10-20 05:07:34  fatal       firmware   Command timeout
+2023-10-20 03:06:25  fatal       software   SDK daemon keep alive failed
+asic1:
+Time                 Severity    Category   Description
+-------------------  ----------  ---------  -----------------
+2023-10-20 05:07:34  fatal       asic_hw    Uncorrectable ECC error
+2023-10-20 01:58:43  notice      asic_hw    Correctable ECC error
+```
+
+The following error message will be shown if a customer executes the command on a platform that does not support it.
+
+`ASIC/SDK health event is not supported on the platform`
+
+##### Display the ASIC/SDK health suppress configuration
+
+Command `show asic-sdk-health-event suppress-configuration [--namespace|-n <namespace>]` is introduced to display the suppressed categories of each severity of ASIC/SDK health events and the maximum number of events to store in the database.
+
+Only severities that have been configured will be displayed.
+
+- if only the category-list is configured, the maximum events will be displayed as `unlimited`
+- if only the maximum events is configured, the category-list will be displayed as `none`
+- if neither of the above is configured, the severity will not be displayed
+
+The namespace is an option for multi ASIC platforms only.
+
+An example of the output is as below:
+
+```
+admin@sonic:~$ show asic-sdk-health-event suppress-configuration
+Severity    Suppressed category-list    Max events
+----------  --------------------------  ------------
+fatal       software                    unlimited
+notice      none                        1024
+warning     firmware,asic_hw            10240
+```
+
+An example of the output on a multi ASIC system:
+
+```
+admin@sonic:~$ show asic-sdk-health-event suppress-configuration
+asic0:
+Severity    Suppressed category-list    Max events
+----------  --------------------------  ------------
+warning     firmware,asic_hw            10240
+asic1:
+Severity    Suppressed category-list    Max events
+----------  --------------------------  ------------
+notice      none                        1024
+```
+
+The following error message will be shown if a customer executes the command on a platform that does not support it.
+
+`ASIC/SDK health event is not supported on the platform`
+
+##### Clear the ASIC/SDK health events
+
+Command `sonic-clear asic-sdk-health-events [--namespace|-n <namespace>]` is introduced to clear the ASIC/SDK health events stored in `STATE_DB.ASIC_SDK_HEALTH_EVENT_TABLE`.
+
+The namespace is an option for multi ASIC platforms only.
+
+After the command is executed, all items in `STATE_DB.ASIC_SDK_HEALTH_EVENT_TABLE` will be cleared.
+
+#### YANG model Enhancements
+
+##### YANG model of the suppress ASIC/SDK health event configuration
+
+The following YANG model is introduced for the suppress ASIC/SDK health event configuration.
+
+```text
+    container sonic-suppress-asic-sdk-health-event {
+        container SUPPRESS_ASIC_SDK_HEALTH_EVENT {
+            list SUPPRESS_ASIC_SDK_HEALTH_EVENT_LIST {
+                key "severity";
+
+                leaf severity {
+                    type enumeration {
+                        enum fatal;
+                        enum warning;
+                        enum notice;
+                    }
+                    description "Severity of the ASIC/SDK health event";
+                }
+
+                leaf max_events {
+                    type uint32;
+                }
+
+                leaf-list categories {
+                    type enumeration {
+                        enum software;
+                        enum firmware;
+                        enum cpu_hw;
+                        enum asic_hw;
+                    }
+                    description "Category of the ASIC/SDK health event";
+                }
+            }
+        }
+    }
+
+```
+
+##### YANG model of the ASIC/SDK health event
+
+The following YANG model is introduced for the ASIC/SDK health event.
+
+A `sai_timestamp` field is provided in addition to the `timestamp` provided by the event collection mechanism, since the two can differ.
+
+```text
+    container sonic-events-swss {
+        container asic-sdk-health-event {
+            evtcmn:ALARM_SEVERITY_MAJOR;
+            description "Declares an event for ASIC/SDK health event.";
+            leaf sai_timestamp {
+                type string {
+                    pattern '[0-9]{4}-[0-9]{2}-[0-9]{2} [0-9]{2}:[0-9]{2}:[0-9]{2}';
+                }
+            }
+            leaf asic_name {
+                type string {
+                    pattern 'asic[0-9]{1,2}';
+                }
+            }
+            leaf severity {
+                type enumeration {
+                    enum fatal;
+                    enum warning;
+                    enum notice;
+                }
+            }
+            leaf category {
+                type enumeration {
+                    enum software;
+                    enum firmware;
+                    enum cpu_hw;
+                    enum asic_hw;
+                }
+            }
+            leaf description {
+                type string;
+            }
+        }
+    }
+```
+
+#### Config DB Enhancements
+
+Table `SUPPRESS_ASIC_SDK_HEALTH_EVENT` contains
+
+1. the list of categories of ASIC/SDK health events that a user wants to suppress for a certain severity.
+2. the number of events of each severity a user wants to keep
+
+```text
+key = SUPPRESS_ASIC_SDK_HEALTH_EVENT:severity ; severity can be one of fatal, warning or notice
+categories = category{,category} ; a list whose elements can be one of software, firmware, cpu_hw, asic_hw separated by a comma
+max_events = 1*10DIGIT ; the number of events for a severity a user wants to keep.
+                       ; If there are more events than max_events in the database, the older ones will be removed.
+```
+
+#### Flows
+
+##### Register ASIC/SDK health event handler during system initialization
+
+We leverage the existing framework to register the ASIC/SDK health event handler.
+
+Various events can occur when a switch system is running, and they require handling by orchagent, an upper-layer application, or a protocol. Currently, this has been done by using event handlers. There is a dedicated event handler, defined as an attribute of the switch object, for each event that needs to be handled.
+
+Currently, the following event handlers are defined.
+
+| Attribute name | Event |
+|:---:|:---:|
+| SAI_SWITCH_ATTR_SWITCH_STATE_CHANGE_NOTIFY | Switch state change |
+| SAI_SWITCH_ATTR_SHUTDOWN_REQUEST_NOTIFY | Shutdown a switch |
+| SAI_SWITCH_ATTR_FDB_EVENT_NOTIFY | FDB event |
+| SAI_SWITCH_ATTR_NAT_EVENT_NOTIFY | NAT entry event |
+| SAI_SWITCH_ATTR_PORT_STATE_CHANGE_NOTIFY | Port state change |
+| SAI_SWITCH_ATTR_QUEUE_PFC_DEADLOCK_NOTIFY | PFC watchdog |
+| SAI_SWITCH_ATTR_BFD_SESSION_STATE_CHANGE_NOTIFY | BFD session state change |
+
+These events can be handled in different ways. E.g., the `Shutdown a switch` event is handled directly in the event handler. For other events, the event handler is empty and the real logic is handled in the orchagent main thread using `NotificationConsumer`.
+
+To handle the ASIC/SDK health event, a new event handler should be implemented and registered as below.
+The ASIC/SDK health event will be handled directly in the event handler, in the same way as `Shutdown a switch`. This is because we need to guarantee that the ASIC/SDK health event is always handled before `Shutdown a switch`; otherwise, the information can be lost.
+
+| Attribute name | Event | Callback prototype |
+|:---:|:---:|:---:|
+| SAI_SWITCH_ATTR_SWITCH_ASIC_SDK_HEALTH_EVENT_NOTIFY | ASIC/SDK health event handler | sai_switch_asic_sdk_health_event_notification_fn |
+
+The following SAI attributes of the switch object should also be specified, indicating that ASIC/SDK health events of all categories and severities will be notified.
+
+| Attribute name | Meaning | Value |
+|:---:|:---:|:---:|
+| SAI_SWITCH_ATTR_REG_FATAL_SWITCH_ASIC_SDK_HEALTH_CATEGORY | The categories of fatal severity | firmware, software, cpu_hw, asic_hw |
+| SAI_SWITCH_ATTR_REG_WARNING_SWITCH_ASIC_SDK_HEALTH_CATEGORY | The categories of warning severity | firmware, software, cpu_hw, asic_hw |
+| SAI_SWITCH_ATTR_REG_NOTICE_SWITCH_ASIC_SDK_HEALTH_CATEGORY | The categories of notice severity | firmware, software, cpu_hw, asic_hw |
+
+The initialization flow is:
+
+1. Fetch the capability of `SAI_SWITCH_ATTR_SWITCH_ASIC_SDK_HEALTH_EVENT_NOTIFY` using `sai_query_attribute_capability`
+2. If it is supported, set the attribute using `sai_switch_api->set_switch_attribute` with the corresponding callback.
+3. If it is not supported, or setting it failed, expose the following fields to the `STATE_DB.SWITCH_CAPABILITY` table as `false`, and the flow finishes.
+   - ASIC_SDK_HEALTH_EVENT
+   - REG_FATAL_ASIC_SDK_HEALTH_CATEGORY
+   - REG_WARNING_ASIC_SDK_HEALTH_CATEGORY
+   - REG_NOTICE_ASIC_SDK_HEALTH_CATEGORY
+
+4. For each severity in {FATAL, WARNING, NOTICE}, fetch the capability of the SAI switch attribute `REG_{severity}_ASIC_SDK_HEALTH_CATEGORY` using `sai_query_attribute_capability`
+   1. If it is supported, set the attribute using `sai_switch_api->set_switch_attribute` with all categories (firmware, software, cpu_hw, asic_hw).
+   2. If it is supported and setting it succeeded, expose the corresponding field `REG_{severity}_ASIC_SDK_HEALTH_CATEGORY` as `true`. Otherwise, expose it as `false`
+
+![Flow](handle-ASIC-SDK-health-event-images/initflow.png "Figure: Initialize flow")
+
+##### Handle ASIC/SDK health event
+
+The flow to handle an ASIC/SDK health event is as below. Steps 1-3 are introduced in this HLD; the remaining steps already exist.
+
+1. A vendor SAI calls the stored callback function `sai_switch_asic_sdk_health_event_notification_fn` with arguments `timestamp`, `severity`, `category`, and `description` when it detects a HW/SW health issue.
+2. SAI redis handles the ASIC/SDK health event, extracts the information, serializes it, and then notifies orchagent using `switch_asic_sdk_health_event`.
+3. Orchagent handles the SAI redis notification
+   1. the arguments `timestamp`, `severity`, and `category` are translated to the corresponding representations in SONiC.
+   2. pushes the information to `STATE_DB.ASIC_SDK_HEALTH_EVENT_TABLE` with the timestamp being the key of the table
+   3. publishes the information to gNMI using the event collection mechanism
+   4. prints a syslog message: `[<severity>] ASIC/SDK health event occurred at <timestamp>, [asic <asic name>, ]category <category>: <description>`
+      - the severity of the message is `NOTICE`
+      - `<severity>`, `<timestamp>`, `<category>` and `<description>` are translated from the event
+      - `asic <asic name>` is printed only for a multi ASIC system. The `asic name` is `CONFIG_DB|DEVICE_METADATA|localhost.asic_name`.
+4. The flow finishes if the vendor SAI decides not to ask orchagent to shut down.
+
+   Usually, a vendor SAI does not need to ask orchagent to shut down the switch for an ASIC/SDK health event with `NOTICE` severity.
+5. The vendor SAI calls the stored callback function `sai_switch_shutdown_request_notification_fn`
+6. SAI redis notifies orchagent using `switch_shutdown_request`
+7. Orchagent calls `abort` on receiving `switch_shutdown_request`
+8. The core dump of orchagent is generated on receiving SIGABRT.
+
+   The tech support dump is collected automatically as a result of the core dump if auto techsupport is enabled both globally and for swss.
+9. The swss and syncd services are stopped and then restarted as a result of the orchagent abort.
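+
+As an illustration of step 3, a minimal C sketch of how the notification arguments could be translated into the SONiC representations and the STATE_DB key format is shown below. This is not the actual orchagent implementation; the helper names are hypothetical, and only the SAI types quoted earlier are assumed.
+
+```C
+#include <time.h>
+#include <sai.h>    /* assumed to provide the SAI types quoted above */
+
+/* Map the SAI severity enumeration to the SONiC representation. */
+static const char *severity_str(sai_switch_asic_sdk_health_severity_t severity)
+{
+    switch (severity) {
+    case SAI_SWITCH_ASIC_SDK_HEALTH_SEVERITY_FATAL:   return "fatal";
+    case SAI_SWITCH_ASIC_SDK_HEALTH_SEVERITY_WARNING: return "warning";
+    default:                                          return "notice";
+    }
+}
+
+/* Map the SAI category enumeration to the SONiC representation. */
+static const char *category_str(sai_switch_asic_sdk_health_category_t category)
+{
+    switch (category) {
+    case SAI_SWITCH_ASIC_SDK_HEALTH_CATEGORY_SW:     return "software";
+    case SAI_SWITCH_ASIC_SDK_HEALTH_CATEGORY_FW:     return "firmware";
+    case SAI_SWITCH_ASIC_SDK_HEALTH_CATEGORY_CPU_HW: return "cpu_hw";
+    default:                                         return "asic_hw";
+    }
+}
+
+/* Format the SAI timestamp as the "%Y-%m-%d %H:%M:%S" wall time used as the
+ * key of STATE_DB.ASIC_SDK_HEALTH_EVENT_TABLE. */
+static void timestamp_key(const sai_timespec_t *ts, char *buf, size_t len)
+{
+    time_t seconds = (time_t)ts->tv_sec;
+    struct tm tm_info;
+
+    localtime_r(&seconds, &tm_info);
+    strftime(buf, len, "%Y-%m-%d %H:%M:%S", &tm_info);
+}
+```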
+
+![Flow](handle-ASIC-SDK-health-event-images/normalflow.png "Figure: handle ASIC/SDK health event")
+
+##### Handle suppress ASIC/SDK health event configuration
+
+A user can suppress ASIC/SDK health events matching a certain severity and category using configuration.
+
+Once a user configures the categories and severity to suppress, orchagent will deregister them from SAI using the corresponding SAI attribute.
+Events that were notified by SAI before the SAI attributes are updated will be handled by orchagent and pushed into `STATE_DB.ASIC_SDK_HEALTH_EVENT_TABLE` as usual.
+
+The flow to handle the suppress ASIC/SDK health event configuration is as below:
+
+1. CLI parses and validates the user input
+2. If the corresponding attribute is not supported according to the `STATE_DB.SWITCH_CAPABILITY` table, print an error and the flow finishes.
+3. Push the new value into `CONFIG_DB.SUPPRESS_ASIC_SDK_HEALTH_EVENT`
+4. Switch orchagent receives the notification of the table update, and then translates the severity and category list into the corresponding SAI attribute and enumerations
+   - severity mapping:
+     - fatal: SAI_SWITCH_ATTR_REG_FATAL_SWITCH_ASIC_SDK_HEALTH_CATEGORY
+     - warning: SAI_SWITCH_ATTR_REG_WARNING_SWITCH_ASIC_SDK_HEALTH_CATEGORY
+     - notice: SAI_SWITCH_ATTR_REG_NOTICE_SWITCH_ASIC_SDK_HEALTH_CATEGORY
+   - category mapping:
+     - software: SAI_SWITCH_ASIC_SDK_HEALTH_CATEGORY_SW
+     - firmware: SAI_SWITCH_ASIC_SDK_HEALTH_CATEGORY_FW
+     - cpu_hw: SAI_SWITCH_ASIC_SDK_HEALTH_CATEGORY_CPU_HW
+     - asic_hw: SAI_SWITCH_ASIC_SDK_HEALTH_CATEGORY_ASIC_HW
+5. The categories to register for ASIC/SDK health events of the `severity` are the complement of the categories to be suppressed, with respect to the universal set containing all defined categories.
+6. Switch orchagent calls the SAI API `sai_switch_api->set_switch_attribute` with the corresponding `severity` and `categories to register` as arguments.
+7. SAI redis receives the call, validates the arguments, and then calls the vendor SAI API.
+
+![Flow](handle-ASIC-SDK-health-event-images/suppressflow.png "Figure: handle suppress ASIC/SDK health event configuration")
+
+##### Eliminate oldest events from the database
+
+A user can configure the maximum number of events of each severity. The system will check `STATE_DB.ASIC_SDK_HEALTH_EVENT_TABLE` every hour, and remove the oldest entries of a severity if the number of entries exceeds the threshold.
+
+As this requires frequent communication with the redis server, a Lua plugin will be introduced to do this job.
+The Lua plugin will be loaded during system initialization; every hour it checks the number of entries in `STATE_DB.ASIC_SDK_HEALTH_EVENT_TABLE` and removes the old entries.
+
+The flow is as below:
+
+1. Check whether `max_events` is configured and exit the flow if it is not configured for any severity.
+2. Check the events in `STATE_DB.ASIC_SDK_HEALTH_EVENT_TABLE` and exit the flow if the number of events does not exceed the threshold.
+3. Sort the events by time and remove the oldest events.
+
+##### Add ASIC/SDK health events to show techsupport
+
+The command `show asic-sdk-health-event received` will be invoked while collecting the techsupport dump.
+
+A file `asic-sdk-health-event` will contain all the ASIC/SDK health events and be saved in the `dump` folder of the techsupport dump.
+
+### Warmboot and Fastboot Design Impact
+
+This feature impacts neither warm boot nor fast boot.
+
+### Memory Consumption
+
+This sub-section covers the memory consumption analysis for the new feature: no memory consumption is expected when the feature is disabled via compilation, and no growing memory consumption is expected while the feature is disabled by configuration.
+
+### Restrictions/Limitations
+
+### Testing Requirements/Design
+
+Unit test cases for the affected sub-modules are listed below; system test cases are TBD. The existing warmboot/fastboot requirements are not affected by this feature.
+
+#### Unit Test cases
+
+##### sonic-swss
+
+1. Configure suppressing all categories for a severity, and then check whether an empty list has been set on the SAI attribute.
+2. Configure suppressing no categories for a severity, and then check whether all categories have been set on the SAI attribute.
+3. Configure suppressing part of the categories (e.g. software, cpu_hw), and then check whether the corresponding categories have been set on the SAI attribute.
+4. Check whether the capabilities have been exposed to `STATE_DB.SWITCH_CAPABILITY|switch` correctly.
+5. Check whether a mocked event has been correctly handled.
+
+##### sonic-sairedis
+
+1. Check whether the ASIC/SDK health event handler is correctly registered.
+2. Check whether an instance of the ASIC/SDK health event notification handler class is correctly created based on the notification string.
+3. Check whether an ASIC/SDK health event is correctly serialized and then deserialized.
+
+##### sonic-utilities
+
+1. Check whether the `CONFIG_DB.SUPPRESS_ASIC_SDK_HEALTH_EVENT` table is correctly updated based on the CLI input.
+2. Check whether `show asic-sdk-health-event received` correctly displays the information based on the `STATE_DB.ASIC_SDK_HEALTH_EVENT_TABLE`.
+3. Check whether `show asic-sdk-health-event suppress-configuration` correctly displays the configuration based on the `CONFIG_DB.SUPPRESS_ASIC_SDK_HEALTH_EVENT`.
+4. Check whether the information in `STATE_DB.ASIC_SDK_HEALTH_EVENT_TABLE` has been cleared by executing `sonic-clear asic-sdk-health-events`.
+
+#### System Test cases
+
+TBD
+
+### Open/Action items - if any
+
+### Appendix
+
+#### Why ASIC/SDK health events are handled in notifications
+
+The `main thread` in orchagent handles table updates.
+There is a `dedicated thread` handling NOTIFICATIONS in the orchagent daemon. The entrypoint is `RedisChannel::notificationThreadFunction`. All the notification callbacks defined in https://github.com/sonic-net/sonic-swss/blob/master/orchagent/notifications.cpp are called from that thread.
+
+Nowadays almost all callbacks in that file are NOPs, except `on_switch_shutdown_request` which calls `exit`, terminating the daemon.
+The remaining callbacks are handled using `NotificationConsumer` in the `main thread` of orchagent.
+
+If the callback `on_switch_asic_sdk_health_event` were handled in the `main thread` using `NotificationConsumer`, it would be handled in a different thread than `on_switch_shutdown_request`.
+In this case, even if the vendor SAI always sends the ASIC/SDK health event ahead of the shutdown request, it is possible that the `dedicated thread` is scheduled to run ahead of the `main thread`.
+As a result, `on_switch_shutdown_request` can be called before `on_switch_asic_sdk_health_event`, and OA will shut down without the ASIC/SDK health event being handled and saved.
+
+Handling the shutdown request in the main thread can result in the same situation, e.g. if something is wrong in the ASIC/SDK which makes it unable to handle any SAI API calls:
+
+1. It notifies the ASIC/SDK health event and then the shutdown request
+2. At the same time, there are a large number of table updates, say routing entry updates, to be programmed to SAI.
+3. Both the shutdown request and the routing entry updates are handled in the main thread.
+4. If a routing entry update is handled first, SAI can return failure because of the ASIC issue. OA will abort immediately without handling the ASIC/SDK health event and shutdown request.
+5. The ASIC/SDK health event is lost.
diff --git a/doc/hash/hash-design.md b/doc/hash/hash-design.md
index 0b3ac0803c9..b644ca131a7 100644
--- a/doc/hash/hash-design.md
+++ b/doc/hash/hash-design.md
@@ -28,7 +28,6 @@
 - [2.4.2.1 Switch hash capabilities](#2421-switch-hash-capabilities)
 - [2.4.3 Data sample](#243-data-sample)
 - [2.4.4 Configuration sample](#244-configuration-sample)
-- [2.4.5 Initial configuration](#245-initial-configuration)
 - [2.5 Flows](#25-flows)
 - [2.5.1 Config section](#251-config-section)
 - [2.5.1.1 GH update](#2511-gh-update)
@@ -48,10 +47,11 @@
 
 ## Revision
 
-| Rev | Date       | Author         | Description                                      |
-|:---:|:----------:|:--------------:|:------------------------------------------------|
-| 0.1 | 12/09/2022 | Nazarii Hnydyn | Initial version                                  |
-| 0.2 | 05/12/2022 | Nazarii Hnydyn | Capabilities validation                          |
+| Rev | Date       | Author         | Description                     |
+|:---:|:----------:|:--------------:|:--------------------------------|
+| 0.1 | 12/09/2022 | Nazarii Hnydyn | Initial version                 |
+| 0.2 | 05/12/2022 | Nazarii Hnydyn | Capabilities validation         |
+| 0.3 | 25/09/2023 | Nazarii Hnydyn | Hashing algorithm configuration |
 
 ## About this manual
 
@@ -63,10 +63,11 @@ This document describes the high level design of GH feature in SONiC
 
 **In scope:**
 1. ECMP/LAG switch hash configuration
+2. ECMP/LAG switch hash algorithm configuration
 
 **Out of scope:**
 1. ECMP/LAG switch hash seed configuration
-2. ECMP/LAG switch hash algorithm configuration
+2. ECMP/LAG switch hash offset configuration
 
 ## Abbreviations
 
@@ -112,7 +113,7 @@ For ECMP, the hashing algorithm determines how incoming traffic is forwarded to
 
 For LAG, the hashing algorithm determines how traffic is placed onto the LAG member links to manage bandwidth by evenly load-balancing traffic across the outgoing links.
 
-GH is a feature which allows user to configure which hash fields suppose to be used by hashing algorithm.
+GH is a feature which allows user to configure various aspects of hashing algorithm.
 GH provides global switch hash configuration for ECMP and LAG.
 
 ## 1.2 Requirements
@@ -189,51 +190,69 @@ GH provides global switch hash configuration for ECMP and LAG.
 
 ###### Figure 1: GH design
 
-GH will use SAI Hash API to configure user-defined list of hash fields to ASIC.
+GH will use SAI Hash API to configure various aspects of hashing algorithm to ASIC.
 Hashing policy can be set independently for ECMP and LAG.
 
 **GH important notes:**
 1. According to the SAI Behavioral Model, the hash is calculated on ingress to pipeline
-2. SAI configuration of hash fields is applicable to an original packet before any DECAP/ENCAP,
+2. SAI configuration of hash fields is applicable to original packet before any DECAP/ENCAP, i.e.
configuration is tunnel-agnostic -3. If some configured field is not present in an incoming packet, then zero is assumed for hash calculation +3. If some configured hash field is not present in an incoming packet, then zero is assumed for hash calculation ## 2.2 SAI API **SAI native hash fields which shall be used for GH:** -| Field | Comment | -|:----------------------------------------|:----------------------------------------| -| SAI_NATIVE_HASH_FIELD_IN_PORT | SWITCH_HASH\|GLOBAL\|ecmp_hash/lag_hash | -| SAI_NATIVE_HASH_FIELD_DST_MAC | | -| SAI_NATIVE_HASH_FIELD_SRC_MAC | | -| SAI_NATIVE_HASH_FIELD_ETHERTYPE | | -| SAI_NATIVE_HASH_FIELD_VLAN_ID | | -| SAI_NATIVE_HASH_FIELD_IP_PROTOCOL | | -| SAI_NATIVE_HASH_FIELD_DST_IP | | -| SAI_NATIVE_HASH_FIELD_SRC_IP | | -| SAI_NATIVE_HASH_FIELD_L4_DST_PORT | | -| SAI_NATIVE_HASH_FIELD_L4_SRC_PORT | | -| SAI_NATIVE_HASH_FIELD_INNER_DST_MAC | | -| SAI_NATIVE_HASH_FIELD_INNER_SRC_MAC | | -| SAI_NATIVE_HASH_FIELD_INNER_ETHERTYPE | | -| SAI_NATIVE_HASH_FIELD_INNER_IP_PROTOCOL | | -| SAI_NATIVE_HASH_FIELD_INNER_DST_IP | | -| SAI_NATIVE_HASH_FIELD_INNER_SRC_IP | | -| SAI_NATIVE_HASH_FIELD_INNER_L4_DST_PORT | | -| SAI_NATIVE_HASH_FIELD_INNER_L4_SRC_PORT | | +| Field | Comment | +|:----------------------------------------|:-------------------------------| +| SAI_NATIVE_HASH_FIELD_IN_PORT | SWITCH_HASH\|GLOBAL\|ecmp_hash | +| SAI_NATIVE_HASH_FIELD_DST_MAC | SWITCH_HASH\|GLOBAL\|lag_hash | +| SAI_NATIVE_HASH_FIELD_SRC_MAC | | +| SAI_NATIVE_HASH_FIELD_ETHERTYPE | | +| SAI_NATIVE_HASH_FIELD_VLAN_ID | | +| SAI_NATIVE_HASH_FIELD_IP_PROTOCOL | | +| SAI_NATIVE_HASH_FIELD_DST_IP | | +| SAI_NATIVE_HASH_FIELD_SRC_IP | | +| SAI_NATIVE_HASH_FIELD_L4_DST_PORT | | +| SAI_NATIVE_HASH_FIELD_L4_SRC_PORT | | +| SAI_NATIVE_HASH_FIELD_INNER_DST_MAC | | +| SAI_NATIVE_HASH_FIELD_INNER_SRC_MAC | | +| SAI_NATIVE_HASH_FIELD_INNER_ETHERTYPE | | +| SAI_NATIVE_HASH_FIELD_INNER_IP_PROTOCOL | | +| SAI_NATIVE_HASH_FIELD_INNER_DST_IP | | +| SAI_NATIVE_HASH_FIELD_INNER_SRC_IP | | +| SAI_NATIVE_HASH_FIELD_INNER_L4_DST_PORT | | +| SAI_NATIVE_HASH_FIELD_INNER_L4_SRC_PORT | | + +**SAI hash algorithms which shall be used for GH:** + +| Algorithm | Comment | +|:-----------------------------|:-----------------------------------------| +| SAI_HASH_ALGORITHM_CRC | SWITCH_HASH\|GLOBAL\|ecmp_hash_algorithm | +| SAI_HASH_ALGORITHM_XOR | SWITCH_HASH\|GLOBAL\|lag_hash_algorithm | +| SAI_HASH_ALGORITHM_RANDOM | | +| SAI_HASH_ALGORITHM_CRC_32LO | | +| SAI_HASH_ALGORITHM_CRC_32HI | | +| SAI_HASH_ALGORITHM_CRC_CCITT | | +| SAI_HASH_ALGORITHM_CRC_XOR | | **SAI attributes which shall be used for GH:** -| API | Function | Attribute | -|:-------|:-------------------------------------------|:-------------------------------------| -| OBJECT | sai_query_attribute_capability | SAI_SWITCH_ATTR_ECMP_HASH | -| | | SAI_SWITCH_ATTR_LAG_HASH | -| | | SAI_HASH_ATTR_NATIVE_HASH_FIELD_LIST | -| | sai_query_attribute_enum_values_capability | SAI_HASH_ATTR_NATIVE_HASH_FIELD_LIST | -| SWITCH | get_switch_attribute | SAI_SWITCH_ATTR_ECMP_HASH | -| | | SAI_SWITCH_ATTR_LAG_HASH | -| HASH | set_hash_attribute | SAI_HASH_ATTR_NATIVE_HASH_FIELD_LIST | +| API | Function | Attribute | +|:-------|:-------------------------------------------|:--------------------------------------------| +| OBJECT | sai_query_attribute_capability | SAI_SWITCH_ATTR_ECMP_HASH | +| | | SAI_SWITCH_ATTR_LAG_HASH | +| | | SAI_HASH_ATTR_NATIVE_HASH_FIELD_LIST | +| | | SAI_SWITCH_ATTR_ECMP_DEFAULT_HASH_ALGORITHM | +| | | 
SAI_SWITCH_ATTR_LAG_DEFAULT_HASH_ALGORITHM | +| | sai_query_attribute_enum_values_capability | SAI_HASH_ATTR_NATIVE_HASH_FIELD_LIST | +| | | SAI_SWITCH_ATTR_ECMP_DEFAULT_HASH_ALGORITHM | +| | | SAI_SWITCH_ATTR_LAG_DEFAULT_HASH_ALGORITHM | +| SWITCH | get_switch_attribute | SAI_SWITCH_ATTR_ECMP_HASH | +| | | SAI_SWITCH_ATTR_LAG_HASH | +| | set_switch_attribute | SAI_SWITCH_ATTR_ECMP_DEFAULT_HASH_ALGORITHM | +| | | SAI_SWITCH_ATTR_LAG_DEFAULT_HASH_ALGORITHM | +| HASH | set_hash_attribute | SAI_HASH_ATTR_NATIVE_HASH_FIELD_LIST | ## 2.3 Orchestration agent @@ -316,9 +335,11 @@ private: ; defines schema for switch hash configuration attributes key = SWITCH_HASH|GLOBAL ; switch hash global. Must be unique -; field = value -ecmp_hash = hash-field-list ; hash fields for hashing packets going through ECMP -lag_hash = hash-field-list ; hash fields for hashing packets going through LAG +; field = value +ecmp_hash = hash-field-list ; hash fields for hashing packets going through ECMP +lag_hash = hash-field-list ; hash fields for hashing packets going through LAG +ecmp_hash_algorithm = hash-algorithm ; hash algorithm for hashing packets going through ECMP +lag_hash_algorithm = hash-algorithm ; hash algorithm for hashing packets going through LAG ; value annotations hash-field = "IN_PORT" @@ -340,6 +361,13 @@ hash-field = "IN_PORT" / "INNER_L4_DST_PORT" / "INNER_L4_SRC_PORT" hash-field-list = hash-field [ 1*( "," hash-field ) ] +hash-algorithm = "CRC" + / "XOR" + / "RANDOM" + / "CRC_32LO" + / "CRC_32HI" + / "CRC_CCITT" + / "CRC_XOR" ``` ### 2.4.2 State DB @@ -353,10 +381,15 @@ key = SWITCH_CAPABILITY|switch ; must be unique ECMP_HASH_CAPABLE = capability-knob ; specifies whether switch is ECMP hash capable LAG_HASH_CAPABLE = capability-knob ; specifies whether switch is LAG hash capable HASH|NATIVE_HASH_FIELD_LIST = hash-field-list ; hash field capabilities for hashing packets going through switch +ECMP_HASH_ALGORITHM_CAPABLE = capability-knob ; specifies whether switch is ECMP hash algorithm capable +LAG_HASH_ALGORITHM_CAPABLE = capability-knob ; specifies whether switch is LAG hash algorithm capable +ECMP_HASH_ALGORITHM = hash-algorithm ; hash algorithm capabilities for hashing packets going through ECMP +LAG_HASH_ALGORITHM = hash-algorithm ; hash algorithm capabilities for hashing packets going through LAG ; value annotations capability-knob = "true" / "false" hash-field = "" + / "N/A" / "IN_PORT" / "DST_MAC" / "SRC_MAC" @@ -376,6 +409,15 @@ hash-field = "" / "INNER_L4_DST_PORT" / "INNER_L4_SRC_PORT" hash-field-list = hash-field [ 1*( "," hash-field ) ] +hash-algorithm = "" + / "N/A" + / "CRC" + / "XOR" + / "RANDOM" + / "CRC_32LO" + / "CRC_32HI" + / "CRC_CCITT" + / "CRC_XOR" ``` ### 2.4.3 Data sample @@ -389,6 +431,10 @@ INNER_DST_MAC,INNER_SRC_MAC,INNER_ETHERTYPE,INNER_IP_PROTOCOL,INNER_DST_IP,INNER 3) "lag_hash@" 4) "DST_MAC,SRC_MAC,ETHERTYPE,IP_PROTOCOL,DST_IP,SRC_IP,L4_DST_PORT,L4_SRC_PORT, \ INNER_DST_MAC,INNER_SRC_MAC,INNER_ETHERTYPE,INNER_IP_PROTOCOL,INNER_DST_IP,INNER_SRC_IP,INNER_L4_DST_PORT,INNER_L4_SRC_PORT" +5) "ecmp_hash_algorithm" +6) "CRC" +7) "lag_hash_algorithm" +8) "CRC" ``` **State DB:** @@ -402,6 +448,14 @@ redis-cli -n 6 HGETALL 'SWITCH_CAPABILITY|switch' 6) "IN_PORT,DST_MAC,SRC_MAC,ETHERTYPE,VLAN_ID,IP_PROTOCOL,DST_IP,SRC_IP,L4_DST_PORT,L4_SRC_PORT, \ INNER_DST_MAC,INNER_SRC_MAC,INNER_ETHERTYPE,INNER_IP_PROTOCOL,INNER_DST_IP,INNER_SRC_IP, \ INNER_L4_DST_PORT,INNER_L4_SRC_PORT" + 7) "ECMP_HASH_ALGORITHM_CAPABLE" + 8) "true" + 9) "LAG_HASH_ALGORITHM_CAPABLE" +10) "true" +11) 
"ECMP_HASH_ALGORITHM" +12) "CRC,XOR,RANDOM,CRC_32LO,CRC_32HI,CRC_CCITT,CRC_XOR" +13) "LAG_HASH_ALGORITHM" +14) "CRC,XOR,RANDOM,CRC_32LO,CRC_32HI,CRC_CCITT,CRC_XOR" ``` ### 2.4.4 Configuration sample @@ -446,48 +500,11 @@ INNER_L4_DST_PORT,INNER_L4_SRC_PORT" "INNER_SRC_IP", "INNER_L4_DST_PORT", "INNER_L4_SRC_PORT" - ] - } - } -} -``` - -### 2.4.5 Initial configuration - -GH initial configuration will be updated at `sonic-buildimage/files/build_templates/init_cfg.json.j2` -in order to match vendor specific requirements. - -**Skeleton code:** -```jinja -{ - ... - -{%- if sonic_asic_platform == "mellanox" %} - "SWITCH_HASH": { - "GLOBAL": { - "ecmp_hash": [ - "DST_IP", - "SRC_IP", - "IP_PROTOCOL", - "L4_DST_PORT", - "L4_SRC_PORT", - "INNER_DST_IP", - "INNER_SRC_IP" ], - "lag_hash": [ - "DST_IP", - "SRC_IP", - "IP_PROTOCOL", - "L4_DST_PORT", - "L4_SRC_PORT", - "INNER_DST_IP", - "INNER_SRC_IP" - ] + "ecmp_hash_algorithm": "CRC", + "lag_hash_algorithm": "CRC" } - }, -{%- endif %} - - ... + } } ``` @@ -532,6 +549,8 @@ config |--- global |--- ecmp-hash ARGS |--- lag-hash ARGS + |--- ecmp-hash-algorithm ARG + |--- lag-hash-algorithm ARG show |--- switch-hash @@ -581,54 +600,116 @@ config switch-hash global lag-hash \ 'INNER_L4_SRC_PORT' ``` +**The following command updates switch hash algorithm global:** +```bash +config switch-hash global ecmp-hash-algorithm 'CRC' +config switch-hash global lag-hash-algorithm 'CRC' +``` + #### 2.6.2.2 Show command group **The following command shows switch hash global configuration:** ```bash root@sonic:/home/admin# show switch-hash global -ECMP HASH LAG HASH ------------------ ----------------- -DST_MAC DST_MAC -SRC_MAC SRC_MAC -ETHERTYPE ETHERTYPE -IP_PROTOCOL IP_PROTOCOL -DST_IP DST_IP -SRC_IP SRC_IP -L4_DST_PORT L4_DST_PORT -L4_SRC_PORT L4_SRC_PORT -INNER_DST_MAC INNER_DST_MAC -INNER_SRC_MAC INNER_SRC_MAC -INNER_ETHERTYPE INNER_ETHERTYPE -INNER_IP_PROTOCOL INNER_IP_PROTOCOL -INNER_DST_IP INNER_DST_IP -INNER_SRC_IP INNER_SRC_IP -INNER_L4_DST_PORT INNER_L4_DST_PORT -INNER_L4_SRC_PORT INNER_L4_SRC_PORT ++--------+-------------------------------------+ +| Hash | Configuration | ++========+=====================================+ +| ECMP | +-------------------+-------------+ | +| | | Hash Field | Algorithm | | +| | |-------------------+-------------| | +| | | DST_MAC | CRC | | +| | | SRC_MAC | | | +| | | ETHERTYPE | | | +| | | IP_PROTOCOL | | | +| | | DST_IP | | | +| | | SRC_IP | | | +| | | L4_DST_PORT | | | +| | | L4_SRC_PORT | | | +| | | INNER_DST_MAC | | | +| | | INNER_SRC_MAC | | | +| | | INNER_ETHERTYPE | | | +| | | INNER_IP_PROTOCOL | | | +| | | INNER_DST_IP | | | +| | | INNER_SRC_IP | | | +| | | INNER_L4_DST_PORT | | | +| | | INNER_L4_SRC_PORT | | | +| | +-------------------+-------------+ | ++--------+-------------------------------------+ +| LAG | +-------------------+-------------+ | +| | | Hash Field | Algorithm | | +| | |-------------------+-------------| | +| | | DST_MAC | CRC | | +| | | SRC_MAC | | | +| | | ETHERTYPE | | | +| | | IP_PROTOCOL | | | +| | | DST_IP | | | +| | | SRC_IP | | | +| | | L4_DST_PORT | | | +| | | L4_SRC_PORT | | | +| | | INNER_DST_MAC | | | +| | | INNER_SRC_MAC | | | +| | | INNER_ETHERTYPE | | | +| | | INNER_IP_PROTOCOL | | | +| | | INNER_DST_IP | | | +| | | INNER_SRC_IP | | | +| | | INNER_L4_DST_PORT | | | +| | | INNER_L4_SRC_PORT | | | +| | +-------------------+-------------+ | ++--------+-------------------------------------+ ``` **The following command shows switch hash capabilities:** ```bash root@sonic:/home/admin# show 
switch-hash capabilities -ECMP HASH LAG HASH ------------------ ----------------- -IN_PORT IN_PORT -DST_MAC DST_MAC -SRC_MAC SRC_MAC -ETHERTYPE ETHERTYPE -VLAN_ID VLAN_ID -IP_PROTOCOL IP_PROTOCOL -DST_IP DST_IP -SRC_IP SRC_IP -L4_DST_PORT L4_DST_PORT -L4_SRC_PORT L4_SRC_PORT -INNER_DST_MAC INNER_DST_MAC -INNER_SRC_MAC INNER_SRC_MAC -INNER_ETHERTYPE INNER_ETHERTYPE -INNER_IP_PROTOCOL INNER_IP_PROTOCOL -INNER_DST_IP INNER_DST_IP -INNER_SRC_IP INNER_SRC_IP -INNER_L4_DST_PORT INNER_L4_DST_PORT -INNER_L4_SRC_PORT INNER_L4_SRC_PORT ++--------+-------------------------------------+ +| Hash | Capabilities | ++========+=====================================+ +| ECMP | +-------------------+-------------+ | +| | | Hash Field | Algorithm | | +| | |-------------------+-------------| | +| | | IN_PORT | CRC | | +| | | DST_MAC | XOR | | +| | | SRC_MAC | RANDOM | | +| | | ETHERTYPE | CRC_32LO | | +| | | VLAN_ID | CRC_32HI | | +| | | IP_PROTOCOL | CRC_CCITT | | +| | | DST_IP | CRC_XOR | | +| | | SRC_IP | | | +| | | L4_DST_PORT | | | +| | | L4_SRC_PORT | | | +| | | INNER_DST_MAC | | | +| | | INNER_SRC_MAC | | | +| | | INNER_ETHERTYPE | | | +| | | INNER_IP_PROTOCOL | | | +| | | INNER_DST_IP | | | +| | | INNER_SRC_IP | | | +| | | INNER_L4_DST_PORT | | | +| | | INNER_L4_SRC_PORT | | | +| | +-------------------+-------------+ | ++--------+-------------------------------------+ +| LAG | +-------------------+-------------+ | +| | | Hash Field | Algorithm | | +| | |-------------------+-------------| | +| | | IN_PORT | CRC | | +| | | DST_MAC | XOR | | +| | | SRC_MAC | RANDOM | | +| | | ETHERTYPE | CRC_32LO | | +| | | VLAN_ID | CRC_32HI | | +| | | IP_PROTOCOL | CRC_CCITT | | +| | | DST_IP | CRC_XOR | | +| | | SRC_IP | | | +| | | L4_DST_PORT | | | +| | | L4_SRC_PORT | | | +| | | INNER_DST_MAC | | | +| | | INNER_SRC_MAC | | | +| | | INNER_ETHERTYPE | | | +| | | INNER_IP_PROTOCOL | | | +| | | INNER_DST_IP | | | +| | | INNER_SRC_IP | | | +| | | INNER_L4_DST_PORT | | | +| | | INNER_L4_SRC_PORT | | | +| | +-------------------+-------------+ | ++--------+-------------------------------------+ ``` ## 2.7 YANG model @@ -665,6 +746,19 @@ will be extended with a new common type. 
enum INNER_L4_SRC_PORT; } } + + typedef hash-algorithm { + description "Represents hash algorithm"; + type enumeration { + enum CRC; + enum XOR; + enum RANDOM; + enum CRC_32LO; + enum CRC_32HI; + enum CRC_CCITT; + enum CRC_XOR; + } + } ``` New YANG model `sonic-hash.yang` will be added to `sonic-buildimage/src/sonic-yang-models/yang-models` @@ -676,7 +770,7 @@ module sonic-hash { yang-version 1.1; - namespace "http://github.com/Azure/sonic-hash"; + namespace "http://github.com/sonic-net/sonic-hash"; prefix hash; import sonic-types { @@ -689,6 +783,30 @@ module sonic-hash { description "First Revision"; } + typedef hash-field { + description "Represents native hash field"; + type stypes:hash-field { + enum IN_PORT; + enum DST_MAC; + enum SRC_MAC; + enum ETHERTYPE; + enum VLAN_ID; + enum IP_PROTOCOL; + enum DST_IP; + enum SRC_IP; + enum L4_DST_PORT; + enum L4_SRC_PORT; + enum INNER_DST_MAC; + enum INNER_SRC_MAC; + enum INNER_ETHERTYPE; + enum INNER_IP_PROTOCOL; + enum INNER_DST_IP; + enum INNER_SRC_IP; + enum INNER_L4_DST_PORT; + enum INNER_L4_SRC_PORT; + } + } + container sonic-hash { container SWITCH_HASH { @@ -699,12 +817,22 @@ module sonic-hash { leaf-list ecmp_hash { description "Hash fields for hashing packets going through ECMP"; - type stypes:hash-field; + type hash:hash-field; } leaf-list lag_hash { description "Hash fields for hashing packets going through LAG"; - type stypes:hash-field; + type hash:hash-field; + } + + leaf ecmp_hash_algorithm { + description "Hash algorithm for hashing packets going through ECMP"; + type stypes:hash-algorithm; + } + + leaf lag_hash_algorithm { + description "Hash algorithm for hashing packets going through LAG"; + type stypes:hash-algorithm; } } @@ -728,7 +856,9 @@ No special handling is required GH basic configuration test: 1. Verify ASIC DB object state after switch ECMP hash update 2. Verify ASIC DB object state after switch LAG hash update +3. Verify ASIC DB object state after switch ECMP hash algorithm update +4. Verify ASIC DB object state after switch LAG hash algorithm update ## 3.2 Data plane tests via PTF -TBD +1. 
[Generic Hash Test Plan](https://github.com/sonic-net/sonic-mgmt/pull/7524 "Test Plan") diff --git a/doc/hash/images/gh_update_flow.svg b/doc/hash/images/gh_update_flow.svg index b5eae557e0c..750dbf212e2 100644 --- a/doc/hash/images/gh_update_flow.svg +++ b/doc/hash/images/gh_update_flow.svg @@ -1,8 +1,8 @@ - + gh update flow @@ -17,9 +17,9 @@ .st7 {fill:#ffffff;stroke:none;stroke-linecap:butt;stroke-width:7.2} .st8 {fill:#803a00;font-family:Segoe UI;font-size:1.00001em;font-weight:bold} .st9 {fill:#70ad47;stroke:#70ad47;stroke-linecap:round;stroke-linejoin:round;stroke-width:0.5} - .st10 {marker-end:url(#mrkr3-88);stroke:#a54d00;stroke-dasharray:7,5;stroke-linecap:round;stroke-linejoin:round;stroke-width:1} - .st11 {marker-end:url(#mrkr3-88);stroke:#a54d00;stroke-linecap:round;stroke-linejoin:round;stroke-width:1} - .st12 {fill:#ffffff;stroke:none;stroke-linecap:butt} + .st10 {fill:#ffffff;stroke:none;stroke-linecap:butt} + .st11 {marker-end:url(#mrkr3-88);stroke:#a54d00;stroke-dasharray:7,5;stroke-linecap:round;stroke-linejoin:round;stroke-width:1} + .st12 {marker-end:url(#mrkr3-88);stroke:#a54d00;stroke-linecap:round;stroke-linejoin:round;stroke-width:1} .st13 {fill:none;stroke:#b43500;stroke-linecap:round;stroke-linejoin:round;stroke-width:1} .st14 {fill:#b43500;font-family:Segoe UI;font-size:1.00001em;font-weight:bold} .st15 {fill:#ac770d;font-family:Segoe UI;font-size:1.00001em;font-weight:bold} @@ -44,355 +44,387 @@ - Page-12 - + Page-1 + Actor lifeline.1198 ConfigDB Sheet.1001 - + Sheet.1002 - + - + Sheet.1003 - + Sheet.1004 - + - - ConfigDB + + ConfigDB - + Object lifeline.1203 SwitchOrch Sheet.1006 - + Sheet.1007 - + - + Sheet.1008 - + Sheet.1009 - + - - SwitchOrch + + SwitchOrch - + Object lifeline.1208 SAI Sheet.1011 - + Sheet.1012 - + - + Sheet.1013 - + Sheet.1014 - + - - SAI + + SAI - + Actor lifeline.1213 CLI Sheet.1016 - + Sheet.1017 - - + Sheet.1018 - + Sheet.1019 - + - - CLI + + CLI - + Activation.1098 - + - + Self Message.1099 process data - - - process data - + + + process data + Activation.1102 - + - + Self Message.1103 set switch hash - - - set switch hash - + + + set switch hash + Activation.1104 - + - + Message.1105 HMSET SWITCH_HASH|GLOBAL - - - HMSET SWITCH_HASH|GLOBAL - + + + HMSET SWITCH_HASH|GLOBAL + Activation.1106 - + - + Return Message.1107 - + - + Activation.1188 - + - + Self Message.1189 set switch ecmp/lag hash field list - - - set switch ecmp/lag hash field list - + + + set switch ecmp/lag hash field list + Activation.1190 - + - + Asynchronous Message.1240 SWITCH_HASH|GLOBAL - - - SWITCH_HASH|GLOBAL - + + + SWITCH_HASH|GLOBAL + Activation.1241 - + - + Self Message.1242 process data - - - process data - + + + process data + Message.1244 set_hash_attribute - - - set_hash_attribute - + + + set_hash_attribute + Activation.1245 - + - + Return Message.1246 return <status> - - - return <status> - + + + return <status> + Activation.1247 - + - + Self Message.1248 get switch ecmp/lag hash oid - - - get switch ecmp/lag hash oid - + + + get switch ecmp/lag hash oid + Activation.1249 - + - + Message.1250 get_switch_attribute - - - get_switch_attribute - + + + get_switch_attribute + Activation.1251 - + - + Return Message.1252 return <status> - - - return <status> - + + + return <status> + Activation.1044 - + - + Self Message.1045 get switch ecmp/lag hash capabilities - - - get switch ecmp/lag hash capabilities - + + + get switch ecmp/lag hash capabilities + Activation.1046 - + - + Message.1047 sai_query_attribute_capability - - - sai_query_attribute_capability - + 
+ + sai_query_attribute_capability + Activation.1048 - + - + Return Message.1049 return <status> - - - return <status> - + + + return <status> + Message.1050 sai_query_attribute_enum_values_capability - - - sai_query_attribute_enum_values_capability - + + + sai_query_attribute_enum_values_capability + Activation.1051 - + - + Return Message.1052 return <status> - - - return <status> - + + + return <status> + Optional fragment.1059 - - + + - - Sheet.1053 - opt - - opt - + Sheet.1054 + opt + + opt + + Sheet.1055 capability is validated - capability is validated + capability is validated - + Loop fragment.1063 - - + + - - Sheet.1056 + + Sheet.1057 query - - query + + query - + Loop fragment.1066 - - + + - - Sheet.1059 + + Sheet.1060 query - - query + + query - + Actor lifeline.1069 StateDB - - Sheet.1062 - - - + Sheet.1063 - + - + Sheet.1064 + - + Sheet.1065 - - - - StateDB + + Sheet.1066 + + + + + StateDB - - Activation.1066 - + + Activation.1074 + - + Self Message.1075 push switch ecmp/lag hash capabilities to DB - - - push switch ecmp/lag hash capabilities to DB - + + + push switch ecmp/lag hash capabilities to DB + Activation.1076 - + - + Message.1077 HMSET SWITCH_CAPABILITY|switch - - - HMSET SWITCH_CAPABILITY|switch - + + + HMSET SWITCH_CAPABILITY|switch + Activation.1078 - + - + Return Message.1079 - + - - Sheet.1074 + + Sheet.1073 SAI_SWITCH_ATTR_ECMP_HASH SAI_SWITCH_ATTR_LAG_HASH SAI_HASH_A... - - SAI_SWITCH_ATTR_ECMP_HASH SAI_SWITCH_ATTR_LAG_HASH SAI_HASH_ATTR_NATIVE_HASH_FIELD_LIST - - Sheet.1076 - SAI_HASH_ATTR_NATIVE_HASH_FIELD_LIST - - SAI_HASH_ATTR_NATIVE_HASH_FIELD_LIST + + SAI_SWITCH_ATTR_ECMP_HASH SAI_SWITCH_ATTR_LAG_HASH SAI_HASH_ATTR_NATIVE_HASH_FIELD_LIST SAI_SWITCH_ATTR_ECMP_DEFAULT_HASH_ALGORITHM SAI_SWITCH_ATTR_LAG_DEFAULT_HASH_ALGORITHM + + Sheet.1074 + SAI_HASH_ATTR_NATIVE_HASH_FIELD_LIST SAI_SWITCH_ATTR_ECMP_DEF... + + SAI_HASH_ATTR_NATIVE_HASH_FIELD_LIST SAI_SWITCH_ATTR_ECMP_DEFAULT_HASH_ALGORITHM SAI_SWITCH_ATTR_LAG_DEFAULT_HASH_ALGORITHM + + Activation.1084 + + + + Self Message.1085 + set switch ecmp/lag hash algorithm + + + set switch ecmp/lag hash algorithm + + Activation.1086 + + + + Message.1087 + set_switch_attribute + + + set_switch_attribute + + Activation.1088 + + + + Return Message.1089 + return <status> + + + return <status> diff --git a/doc/mgmt/SONiC_YANG_Model_Guidelines.md b/doc/mgmt/SONiC_YANG_Model_Guidelines.md index a2d7127dad6..3f6bca366fc 100644 --- a/doc/mgmt/SONiC_YANG_Model_Guidelines.md +++ b/doc/mgmt/SONiC_YANG_Model_Guidelines.md @@ -8,6 +8,7 @@ |:---:|:-----------:|:------------------:|-----------------------------------| | 1.0 | 22 Aug 2019 | Praveen Chaudhary | Initial version | | 1.0 | 11 Sep 2019 | Partha Dutta | Adding additional steps for SONiC YANG | + | 1.1 | 15 Dec 2023 | Jingwen Xie | Added rules for List Keys | ## References | References | Date/Version | Link | @@ -44,7 +45,7 @@ sonic-vlan.yang Example : #### YANG -``` +```yang module sonic-acl { container sonic-acl { ..... @@ -57,7 +58,7 @@ module sonic-acl { Example : #### YANG -``` +```yang module sonic-acl { namespace "http://github.com/Azure/sonic-acl"; ..... @@ -70,7 +71,7 @@ module sonic-acl { Example : #### YANG -``` +```yang module sonic-acl { revision 2019-09-02 { description @@ -92,7 +93,7 @@ Example: Table VLAN will translate to container VLAN. 
#### ABNF -``` +```yang "VLAN": { "Vlan100": { "vlanid": "100" @@ -103,7 +104,7 @@ will translate to: #### YANG -- -``` +```yang container VLAN { //"VLAN" mapped to a container list VLAN_LIST { key name; @@ -132,7 +133,7 @@ Leaf names are same PACKET_ACTION, IP_TYPE and PRIORITY, which are defined in AB ``` #### YANG -``` +```yang leaf PACKET_ACTION { ..... } @@ -148,7 +149,7 @@ Leaf names are same PACKET_ACTION, IP_TYPE and PRIORITY, which are defined in AB Example: #### YANG -``` +```yang leaf SRC_IP { type inet:ipv4-prefix; <<<< } @@ -182,7 +183,7 @@ For Example: ``` #### YANG In YANG, "Family" of VLAN_INTERFACE and "IP_TYPE" of ACL_RULE is at same level. -``` +```yang container VLAN_INTERFACE { description "VLAN_INTERFACE part of config_db.json"; list VLAN_INTERFACE_LIST { @@ -222,7 +223,7 @@ Example: VLAN_MEMBER dictionary in ABNF.json has both vlan-id and ifname part of key = VLAN_MEMBER_TABLE:"Vlan"vlanid:ifname ; #### YANG -``` +```yang container VLAN_MEMBER { description "VLAN_MEMBER part of config_db.json"; list ..... { @@ -242,7 +243,7 @@ key: ACL_RULE_TABLE:table_name:rule_name ..... ``` #### YANG -``` +```yang ..... container ACL_TABLE { list ACL_TABLE_LIST { @@ -289,7 +290,7 @@ queue = 1*DIGIT; queue index ``` #### YANG -``` +```yang container TC_TO_QUEUE_MAP { list TC_TO_QUEUE_MAP_LIST { key "name"; @@ -336,7 +337,7 @@ wred_profile = ref_hash_key_reference; reference to wred profile key ``` #### YANG -``` +```yang container sonic-queue { container QUEUE { list QUEUE_LIST { @@ -361,7 +362,7 @@ container sonic-queue { Example: #### YANG -``` +```yang must "(/sonic-ext:operation/sonic-ext:operation != 'DELETE') or " + "count(../../ACL_TABLE[aclname=current()]/ports) = 0" { error-message "Ports are already bound to this rule."; @@ -373,7 +374,7 @@ Example: Example: #### YANG -``` +```yang module sonic-vlan { .... .... @@ -422,7 +423,7 @@ Example of When Statement: Orchagent of SONiC will have unknown behavior if belo ``` #### YANG: -``` +```yang choice ip_prefix { case ip4_prefix { when "boolean(IP_TYPE[.='ANY' or .='IP' or .='IPV4' or .='IPV4ANY' or .='ARP'])"; @@ -456,7 +457,7 @@ leaf L4_DST_PORT_RANGE { Example: #### YANG -``` +```yang leaf family { /* family leaf needed for backward compatibility Both ip4 and ip6 address are string in IETF RFC 6020, @@ -485,7 +486,7 @@ For Example: Below entries in PORTCHANNEL_INTERFACE Table must be part of List O ``` #### YANG -``` +```yang container PORTCHANNEL_INTERFACE { description "PORTCHANNEL_INTERFACE part of config_db.json"; @@ -497,9 +498,9 @@ container PORTCHANNEL_INTERFACE { } ``` -### 18. In some cases it may be required to split an ABNF table into multiple YANG lists based on the data stored in the ABNF table. +### 18. In some cases it may be required to split an ABNF table into multiple YANG lists based on the data stored in the ABNF table. In this case it is crucial to ensure that the List keys are non-overlapping, unique, and unambiguous. -Example : "INTERFACE" table stores VRF names to which an interface belongs, also it stores IP address of each interface. Hence it is needed to split them into two different YANG lists. +**Strategies for Ensuring Unique and Unambiguous Keys**: Utilize composite keys that have a different number of key elements to distinguish lists. Need to mention that different key names do not count as unambiguous model. 
#### ABNF ``` @@ -511,51 +512,206 @@ Example : "INTERFACE" table stores VRF names to which an interface belongs, also } } ``` -#### YANG -``` +#### Example 1: Key with different number of elements(composite keys - Allowed case) + +`INTERFACE` table stores VRF names to which an interface belongs, also it stores IP address of each interface. Hence it is needed to split them into two different YANG lists. + +```yang ...... -container sonic-interface { - container INTERFACE { - list INTERFACE_LIST { // 1st list - key ifname; - - leaf ifname { - type leafref { - ...... - } +container INTERFACE { + list INTERFACE_LIST { // 1st list + key ifname; + + leaf ifname { + type leafref { + ...... } - leaf vrf-name { - type leafref { - ...... - } + } + leaf vrf-name { + type leafref { + ...... } - ...... } + ...... + } - list INTERFACE_IPADDR_LIST { //2nd list - key ifname, ip_addr; - - leaf ifname { - type leafref { - ...... - } - } - leaf ip_addr { - type inet:ipv4-prefix; - } + list INTERFACE_IPADDR_LIST { //2nd list + key "ifname ip_addr" + + leaf ifname { + type leafref { ...... + } + } + leaf ip_addr { + type inet:ipv4-prefix; + } + ...... + } +} +...... +``` +In the example above if the config DB contains an INTERFACE table with single key element then it will be associted with the INTERFACE_LIST and if contains 2 key elements then it will be associated with INTERFACE_IPADDR_LIST + +#### Example 2: Keys with same number of elements of same type (NOT Allowed case 1) + +```yang +...... +container NOT_SUPPORTED_INTERFACE { + list NOT_SUPPORTED_INTERFACE_LIST { // 1st list + key ifname; + leaf ifname { + type string; + } + // ... + } + + list NOT_SUPPORTED_INTERFACE_ANOTHER_LIST { // Negative case + key ifname; + leaf ifname { + type string; } - } + // ... + } } ...... ``` +In the example above if the config DB contains an NOT_SUPPORTED_INTERFACE table with key Ethernet1 then it would match with both the list, this is an overlapping scenario + +#### Example 3: Keys with same number of elements of same type (NOT Allowed case 2) + +```yang +...... +container NOT_SUPPORTED_TELEMETRY_CLIENT { + list NOT_SUPPORTED_TELEMETRY_CLIENT_DS_LIST { // 1st list + key "prefix name"; + + leaf prefix { + type string { + pattern "DestinationGroup_" + ".*"; + } + } + + leaf name { + type string; + } + + leaf dst_addr { + type ipv4-port; + } + } + + list NOT_SUPPORTED_TELEMETRY_CLIENT_SUB_LIST { // Negative case + key "prefix name"; + + leaf prefix { + type string { + pattern "Subscription_" + ".*"; + } + } + + leaf name { + type string; + } + + leaf dst_group { + must "(contains(../../TELEMETRY_CLIENT_DS_LIST/prefix, current()))"; + type string; + } + } +} +...... +``` +In the example above if the config DB contains an NOT_SUPPORTED_TELEMETRY_CLIENT table with key "DestinationGroup|HS", then it would correspond to the NOT_SUPPORTED_TELEMETRY_CLIENT_DS_LIST and NOT_SUPPORTED_TELEMETRY_CLIENT_SUB_LIST, this is an overlapping scenario + +#### Example 4: keys with same number of elements and different type(NOT Allowed case 3) + +In the given example, if the configuration database has an NOT_SUPPORTED_TELEMETRY_CLIENT table with the key "1234", it would correspond to the NOT_SUPPORTED_TELEMETRY_CLIENT_DS_LIST and NOT_SUPPORTED_TELEMETRY_CLIENT_SUB_LIST, this is an overlapping scenario + +```yang +...... 
+container NOT_SUPPORTED_TELEMETRY_CLIENT { + list NOT_SUPPORTED_TELEMETRY_CLIENT_DS_LIST { // 1st list + key "prefix"; + + leaf prefix { + type string { + pattern ".*"; + } + } + + leaf dst_addr { + type ipv4-port; + } + } + + list NOT_SUPPORTED_TELEMETRY_CLIENT_SUB_LIST { // Negative case + key "id"; + + leaf id { + type int32; + } + + leaf dst_group { + must "(contains(../../TELEMETRY_CLIENT_DS_LIST/prefix, current()))"; + type int32; + } + } +} +...... +``` + +#### Example 5: keys with same number of elements and different type(NOT Allowed case 4) + +In the given example, if the configuration database has an NOT_SUPPORTED_TELEMETRY_CLIENT table with the key "1234|1234", it would correspond to the NOT_SUPPORTED_TELEMETRY_CLIENT_DS_LIST and NOT_SUPPORTED_TELEMETRY_CLIENT_SUB_LIST, this is an overlapping scenario + +```yang +...... +container NOT_SUPPORTED_TELEMETRY_CLIENT { + list NOT_SUPPORTED_TELEMETRY_CLIENT_DS_LIST { // 1st list + key "prefix name"; + + leaf prefix { + type string; + } + + leaf name { + type string; + } + + leaf dst_addr { + type ipv4-port; + } + } + + list NOT_SUPPORTED_TELEMETRY_CLIENT_SUB_LIST { // Negative case + key "id name"; + + leaf id { + type int32; + } + + leaf name { + type int32; + } + + leaf dst_group { + must "(contains(../../TELEMETRY_CLIENT_DS_LIST/prefix, current()))"; + type int32; + } + } +} +...... +``` + + ### 19. Add read-only nodes for state data using 'config false' statement. Define a separate top level container for state data. If state data is defined in other DB than CONFIG_DB, use extension 'sonic-ext:db-name' for defining the table present in other Redis DB. The default separator used in table key is "|", if it is different, use 'sonic-ext:key-delim {separator};' YANG extension. This step applies when SONiC YANG is used as Northbound YANG. Example: #### YANG -``` +```yang container ACL_RULE { list ACL_RULE_LIST { .... @@ -584,7 +740,7 @@ container ACL_RULE { Example: #### YANG -``` +```yang container sonic-acl { .... .... @@ -607,7 +763,7 @@ container sonic-acl { Example: #### YANG -``` +```yang module sonic-port { .... .... 
@@ -621,17 +777,11 @@ module sonic-port { } ``` - - - - - - ## APPENDIX ### Sample SONiC ACL YANG -``` +```yang module sonic-acl { namespace "http://github.com/Azure/sonic-acl"; prefix sacl; diff --git a/doc/pac/Port Access Control.md b/doc/pac/Port Access Control.md new file mode 100644 index 00000000000..27614ddb66c --- /dev/null +++ b/doc/pac/Port Access Control.md @@ -0,0 +1,852 @@ + +# Port Access Control in SONiC + +# Table of Contents +- **[List of Tables](#list-of-tables)** +- **[Revision](#revision)** +- **[About this Manual](#about-this-manual)** +- **[Definitions and Abbreviations](#definitions-and-abbreviations)** +- **[1 Feature Overview](#1-feature-overview)** + - [1.1 Port Access Control](#11-port-access-control) + - [1.2 Requirements](#12-requirements) + - [1.2.1 Functional Requirements](#121-functional-requirements) + - [1.2.2 Configuration and Management Requirements](#122-configuration-and-management-requirements) + - [1.2.3 Scalability Requirements](#123-scalability-requirements) + - [1.2.4 Warm Boot Requirements](#124-warm-boot-requirements) + - [1.3 Design Overview](#13-design-overview) + - [1.3.1 Container](#131-container) + - [1.3.2 SAI Support](#132-sai-support) +- **[2 Functionality](#2-functionality)** + - [2.1 Target Deployment Use Cases](#21-target-deployment-use-cases) + - [2.2 Functional Description](#22-functional-description) + - [2.2.1 802.1x](#221-802.1x) + - [2.2.2 MAC Authentication Bypass](#222-mac-authentication-bypass) + - [2.2.3 RADIUS](#223-radius) + - [2.2.4 PAC Interface Host Modes](#224-pac-interface-host-modes) + - [2.2.5 VLAN](#225-vlan) + - [2.2.6 MAC move](#226-mac-move) + - [2.2.7 Warmboot](#227-warmboot) +- **[3 Design](#3-design)** + - [3.1 Overview](#31-overview) + - [3.1.1 Configuration flow](#311-configuration-flow) + - [3.1.2 EAPoL Receive flow](#312-eapol-receive-flow) + - [3.1.3 MAB Packet receive flow](#313-mab-packet-receive-flow) + - [3.1.4 RADIUS](#314-radius) + - [3.2 DB Changes](#32-db-changes) + - [3.2.1 Config DB](#321-config-db) + - [3.2.2 App DB](#322-app-db) + - [3.2.3 ASIC DB](#323-asic-db) + - [3.2.4 Counter DB](#324-counter-db) + - [3.2.5 State DB](#325-state-db) + - [3.3 Switch State Service Design](#33-switch-state-service-design) + - [3.3.1 Orchestration Agent](#331-orchestration-agent) + - [3.4 PAC Modules](#34-pac-modules) + - [3.4.1 Authentication Manager](#341-authentication-manager) + - [3.4.2 mabd](#342-mabd) + - [3.4.3 hostapd](#343-hostapd) + - [3.4.4 hostapdmgrd](#344-hostapdmgrd) + - [3.4.5 Interaction between modules](#345-interaction-between-modules) + - [3.5 SyncD](#35-syncd) + - [3.6 SAI](#36-sai) + - [3.6.1 Host Interface Traps](#361-host-interface-traps) + - [3.6.2 Bridge port learning modes](#362-bridge-port-learning-modes) + - [3.6.3 FDB](#363-fdb) + - [3.6.4 VLAN](#363-vlan) + - [3.7 Manageability](#37-manageability) + - [3.7.1 Yang Model](#371-yang-model) + - [3.7.2 Configuration Commands](#372-configuration-commands) + - [3.7.3 Show Commands](#373-show-commands) + - [3.7.4 Clear Commands](#373-clear-commands) +- **[4 Scalability](#4-scalability)** +- **[5 Appendix: Sample configuration](#5-appendix-sample-configuration)** +- **[6 Future Enhancements](#6-future-enhancements)** + +# List of Tables +[Table 1 Abbreviations](#table-1-abbreviations) + +# Revision +| Rev | Date | Author | Change Description | +| ---- | ---------- | ---------------------------------------- | ------------------ | +| 0.1 | 04/05/2023 | Amitabha Sen, Vijaya Abbaraju, Shirisha Dasari, Anil Kumar Pandey | Initial version | + 
# About this Manual

This document describes the design details of the Port Access Control (PAC) feature in SONiC.


# Definitions and Abbreviations

| **Term** | **Meaning** |
| ------------- | ---------------------------------------- |
| Authenticator | An entity that enforces authentication on a port before allowing access to services available on that port |
| CoPP | Control Plane Policing |
| 802.1x | IEEE 802.1x standard |
| EAPoL | Extensible Authentication Protocol over LAN |
| MAB | MAC Authentication Bypass |
| PAC | Port Access Control |
| PAE | Port Access Entity |
| RADIUS | Remote Authentication Dial In User Service |
| Supplicant | A client that attempts to access services offered by the Authenticator |
| AAA | Authentication, Authorization, Accounting |

# 1 Feature Overview

## 1.1 Port Access Control
Port Access Control (PAC) provides a means of preventing unauthorized access by users to the services offered by a network.

An entity (Port Access Entity) can adopt one of two distinct roles within an access control interaction:

1. Authenticator: An entity that enforces authentication on a port before allowing access to services available on that port.
2. Supplicant: A client that attempts to access services offered by the Authenticator.

Additionally, there exists a third role:

3. Authentication Server: Performs the authentication function necessary to check the credentials of the Supplicant on behalf of the Authenticator.

Port access control is achieved by enforcing authentication of Supplicants that are attached to an Authenticator's controlled ports. The result of the authentication process determines whether the Supplicant is authorized to access services on that controlled port.

All three roles are required to complete an authentication exchange. A switch needs to support the Authenticator role, which is what PAC provides. The Authenticator PAE is responsible for communicating with the Supplicant and for submitting the information received from the Supplicant to the Authentication Server so that the credentials can be checked. The Authenticator PAE controls the authorized/unauthorized state of the clients on the controlled port depending on the outcome of the authentication process.



## 1.2 Requirements

### 1.2.1 Functional Requirements

***PAC***
The following are the requirements for the Port Access Control feature:
1. PAC should be supported on physical interfaces only.

2. PAC should enforce access control for clients on switch ports using the following authentication mechanisms:
   - 802.1x
   - MAB (MAC Authentication Bypass)

3. It should be possible to enable both 802.1x and MAB on a port together. Their relative order and priority should be configurable.

4. The following host modes should be supported:

   - Multiple Host mode: only one client can be authenticated on a port; after that, access is granted to all clients connected to the port.
   - Single-Host mode: one client can be authenticated on a port and is granted access to the port at a given time.
   - Multiple Authentication mode: multiple clients can be authenticated on a port and these clients are then granted access. All clients are authorized on the same VLAN.

5. The following PAC port modes should be supported:
   - Auto: Authentication is enforced on the port. Traffic is only allowed for authenticated clients.
   - Force Authorized: All traffic is allowed.
   - Force Unauthorized: All traffic is blocked.

6. 
Reauthentication of clients is supported.



***802.1x***

PAC should support 802.1x Authenticator functionality.

***MAB***

PAC should support MAB for authentication, primarily to support clients that do not support 802.1x.

***RADIUS***
1. PAC should support RADIUS client functionality to be able to authenticate clients using RADIUS.
2. PAC 802.1x should support multiple EAP authentication methods like EAP-MD5, EAP-PEAP, EAP-TLS, etc.
3. PAC MAB should support the EAP authentication methods EAP-MD5, EAP-PAP and EAP-CHAP.
4. The following authorization attributes from RADIUS should be supported:
   - VLAN
   - Session-Timeout
   - Session-Termination-Action
5. RADIUS authentication should be tested/qualified with the following RADIUS servers:
   - FreeRADIUS
   - ClearPass
   - Cisco ISE


### 1.2.2 Configuration and Management Requirements
PAC should support configuration using CLI and JSON based input.

The list of configurations shall include the following:
- configuring the port control mode of an interface.
- configuring the host mode of an interface.
- configuring the PAE role of an interface.
- enabling the 802.1x authentication support on the switch.
- enabling MAC Authentication Bypass (MAB) on an interface.
- configuring the authentication method used by MAC Authentication Bypass (MAB) on an interface.
- configuring the maximum number of clients supported on an interface when the multi-authentication host mode is enabled on the port.
- enabling periodic reauthentication of the supplicant on an interface.
- configuring the periodic reauthentication timer of the supplicant on an interface.
- configuring the order of authentication methods used on a port.
- configuring the priority for the authentication methods used on a port.

### 1.2.3 Scalability Requirements
16 authenticated clients per port, with a maximum of 128 authenticated clients per switch, should be supported.

## 1.3 Design Overview

### 1.3.1 Container
The existing "macsec" docker holds all the port security applications. Code changes are also made to the SWSS docker.

### 1.3.2 SAI Support
No changes to the SAI spec are needed to support PAC.

# 2 Functionality

## 2.1 Target Deployment Use Cases

The following figure illustrates how clients like PCs and printers are authenticated and authorized for accessing the network.

![pac-deployment](images/PAC_Deployment.JPG)


**Figure 1: PAC target deployment use cases**

## 2.2 Functional Description

PAC uses the authentication methods 802.1x and MAB for client authentication. These methods in turn use RADIUS for client credential verification and receive the authorization attributes, like VLANs, for the authenticated clients.

### 2.2.1 802.1x

PAC leverages the IEEE 802.1X-2004 standard for 802.1x as available in the "hostapd" implementation in the "macsec" docker. It is an IEEE standard for Port Access Control that provides an authentication mechanism to devices wishing to attach to a LAN. The standard defines Extensible Authentication Protocol over LAN (EAPoL), which is an encapsulation technique to carry EAP packets between the Supplicant and the Authenticator. The standard describes an architectural framework within which authentication and consequent actions take place. It also establishes the requirements for a protocol between the Authenticator and the Supplicant, as well as between the Authenticator and the Authentication Server.
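For illustration, the layout of an EAPOL-Start frame as exchanged between Supplicant and Authenticator is sketched below. The field values are standard 802.1X constants and are shown here for reference only; they are not defined by this document.

```
EAPOL-Start frame (illustrative):
  Destination MAC : 01:80:C2:00:00:03   (PAE group address)
  Source MAC      : <supplicant MAC>
  EtherType       : 0x888E              (EAPoL)
  Version         : 2                   (802.1X-2004)
  Packet Type     : 1                   (EAPOL-Start)
```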
### 2.2.2 MAC Authentication Bypass

PAC makes use of the MAC Authentication Bypass (MAB) feature to authenticate devices like cameras or printers which do not support 802.1x. MAB makes use of the device MAC address to authenticate the client.

### 2.2.3 RADIUS
***Authentication***

PAC (Authenticator) uses an external RADIUS server for client authentication. It determines the authorization status of the clients based on RADIUS Access-Accept or Access-Reject frames as per RADIUS RFC 2865.

PAC as a PAE Authenticator for 802.1x is essentially a passthrough for client authentication exchange messages. Hence different EAP authentication methods like EAP-MD5, EAP-PEAP, EAP-TLS, etc. are supported; these are essentially 802.1x Supplicant and RADIUS server functionalities.

PAC as a PAE Authenticator for MAB mimics the Supplicant role for MAB clients. The authentication methods EAP-MD5, EAP-PAP and EAP-CHAP are supported.

***Authorization***

Once a client is authenticated, RADIUS can send authorization parameters for the client. The Authenticator switch processes these RADIUS attributes and applies them to the client session. The following attributes are supported.

- *VLAN Id*: This is the VLAN ID sent by a RADIUS server for the authenticated client. This VLAN should be a pre-created VLAN on the switch.
- *Session Timeout*: This is the timeout attribute of the authenticated client session.
- *Session Termination Action*: Upon session timeout, the Session Termination Action determines the action on the client session. The following actions are defined:
  - *Default*: The client session is torn down and authentication needs to be restarted for the client.
  - *RADIUS*: Re-authentication is initiated for the client.

### 2.2.4 PAC Interface Host Modes

PAC works with port learning modes and FDB entries to block or allow traffic for authenticated clients as needed.

- **Multiple Host mode**: A single client can be authenticated on the port. With no client authenticated on the port, the learning mode is set to DROP or CPU_TRAP (if MAB is enabled on the port). Once a client is authenticated on a port, the learning mode is set to HW. All clients connected to the port are allowed access and FDB entries are populated dynamically.

- **Single Host and Multiple Authentication modes**: All clients on the port need to authenticate. The learning mode of the port is always set to CPU_TRAP. Once a client starts the authentication process, the client is no longer unknown to PAC. PAC installs a static FDB entry to mark the client as known so that its incoming traffic does not flood the CPU. The entry is installed with the discard bits set to prevent client traffic from being forwarded. In effect, the packets are neither flooded to the CPU nor forwarded to other ports during the authentication process. When the client is authenticated, the discard bits of the installed FDB entry are reset to allow client traffic. A sketch of these transitions follows.
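The following sketch summarizes how the port learning mode and the client FDB entry change for a multi-auth port. It is illustrative only; the authoritative state is carried in the STATE_OPER_PORT and STATE_OPER_FDB tables described in section 3.2.5.

```
Port Ethernet1, host mode multi-auth (illustrative):

  PAC enabled, no client present     -> learn_mode = cpu_trap  (unknown SMACs trapped to CPU)
  Client 00:00:00:11:02:33 detected  -> static FDB entry installed with action DROP
  Client authenticated/authorized    -> FDB entry action changed from DROP to FORWARD
  PAC disabled on the port           -> learn_mode = hw         (normal hardware learning)
```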
### 2.2.5 VLAN
1. PAC associates authenticated clients to a VLAN on the port.
2. If RADIUS assigns a VLAN to a client, the port's configured untagged VLAN membership is reverted and the RADIUS assigned VLAN is used to authorize the client. The RADIUS assigned VLAN is operationally configured as the untagged VLAN of the port. All incoming untagged client traffic is assigned to this VLAN. Any incoming tagged client traffic is allowed or dropped based on whether it matches the port's configured untagged VLAN or not.
3. If RADIUS does not assign a VLAN to a client, the port's configured untagged VLAN is used to authorize the client. The port's untagged VLAN configuration is retained and all incoming untagged client traffic is assigned to this VLAN. Any incoming tagged client traffic is allowed or dropped based on whether it matches the port's configured untagged VLAN or not.
4. All clients on a port are always associated with a single VLAN.
5. The RADIUS assigned VLAN is reverted to the port's configured untagged VLAN once the last authenticated client on the port logs off.
6. When PAC is disabled on the port, the operationally added untagged VLAN, if present, is removed from the port and the user configured untagged VLAN is assigned back to the port.
7. If clients are authorized on the port's configured untagged VLAN and the VLAN configuration is modified, all the authenticated clients on the port are removed.
8. If clients are authorized on a RADIUS assigned VLAN, any updates to the port's configured untagged VLAN do not affect the clients. The configuration is updated in the CONFIG_DB but not propagated to the port.


### 2.2.6 MAC move

If a client that is authorized on one port moves to another port controlled by PAC, the existing client session is torn down and authentication is attempted again on the new port.

### 2.2.7 Warmboot

After a warm boot, the authenticated client sessions are torn down and the clients need to authenticate again.

# 3 Design

## 3.1 Overview

[Figure 2](#configuration-flow) shows the high level design overview of PAC services in SONiC. The existing "macsec" docker is leveraged.

PAC is composed of multiple sub-modules.

1. pacd: The PAC daemon is the main module that controls client authentication. It is the central repository of PAC clients. It makes use of the hostapd and mabd daemons to authenticate clients via 802.1x and MAB respectively.

2. hostapd: This 802.1x module is an open-source Linux application that is available in the SONiC "macsec" docker. It uses hostapd.conf as its config file.

3. mabd: This is the MAB authentication module.

4. hostapdmgrd: This is the hostapd manager module. It listens to 802.1x specific configurations from CONFIG_DB and translates them to respective hostapd.conf file config entries and commands to hostapd.


### 3.1.1 Configuration flow

![pac-config-flow](images/PAC_Config_Flow.JPG)

**Figure 2: PAC service daemon and configuration flow**

1. Management interfaces like the CLI write the user provided configuration to CONFIG_DB (see the example write below).
2. pacd, mabd and hostapdmgrd get notified about their respective configurations.
3. hostapd, being a standard Linux application, gets its configuration from a hostapd.conf file. hostapdmgrd generates the hostapd.conf file based on the relevant CONFIG_DB tables. hostapdmgrd informs hostapd about the list of ports it needs to run on. This port list is dynamic as it depends on port link/admin state, port configuration, etc. hostapdmgrd keeps hostapd updated about these changes.
4. These modules communicate amongst themselves via socket messages.
5. hostapd listens to EAPoL PDUs on the provided interface list. When it receives a PDU, it consults pacd and proceeds to authenticate the client. pacd also listens for "unknown src MAC" events and triggers MAB, if configured on the port, to authenticate the client.
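As an illustration of step 1, the hypothetical redis-cli transcript below shows a PAC_PORT_CONFIG entry being written directly into CONFIG_DB (database 4 in the standard SONiC layout). The field names follow the schema in section 3.2.1; the interface and values are examples only.

```
redis-cli -n 4 HSET "PAC_PORT_CONFIG|Ethernet1" \
    port_pae_role authenticator \
    port_control_mode auto \
    host_control_mode multi_auth
```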
### 3.1.2 EAPoL receive flow

![pac-eapol-rx-flow](images/PAC_EAPoL_Rx_Flow.JPG)


**Figure 3: EAPoL receive flow**

1. An EAPoL packet is received by hardware on a front panel interface and trapped to the CPU by the CoPP rules for EAP. The packet gets delivered to the hostapd socket listening on EtherType 0x888E.
2. In a multi-step process, hostapd runs the 802.1x state machine to authenticate the client via RADIUS.
3. On successful authentication of a client, hostapd sends a "Client Authenticated" message to pacd with all the authorization parameters like VLAN, Session-Timeout, etc.
4. pacd proceeds to authorize the client. RADIUS authorization parameters, like client VLAN membership, are communicated to the relevant modules (VLAN, FDB) by writing to their tables in STATE_DB. Authenticated clients are updated in the PAC_AUTHENTICATED_CLIENT_OPER table in STATE_DB.
5. VLAN and FDB further process these STATE_DB updates from PAC and write into their STATE_DB and APPL_DB tables.
6. Orchagent in the SWSS docker gets notified about changes in APPL_DB and responds by translating the APPL_DB changes to respective sairedis calls.
7. Sairedis APIs write into ASIC_DB.
8. Syncd gets notified of changes to ASIC_DB and in turn makes the respective SAI calls. The SAI calls translate to respective SDK calls to program the hardware.
9. An EAP Success message (EAPoL PDU) is sent to the client.


### 3.1.3 MAB packet receive flow

![pac-mab-rx-flow](images/PAC_MAB_Rx_Flow.JPG)


**Figure 4: MAB PDU receive flow**

1. Packets with unknown source MACs are received by hardware on a front panel interface and trapped to the CPU. The packets get delivered to a pacd socket.
2. pacd sends a "Client Authenticate" message along with the received packet MAC to mabd.
3. mabd interacts with the RADIUS server to authenticate the given client based on the MAC.
4. On successful authentication of a client, mabd sends a "Client Authenticated" message to pacd with all the authorization parameters like VLAN, Session-Timeout, etc.
5. pacd proceeds to authorize the client. RADIUS authorization parameters, like client VLAN membership, are communicated to the relevant modules (VLAN, FDB) by writing to their tables in STATE_DB. Authenticated clients are updated in the PAC_AUTHENTICATED_CLIENT_OPER table in STATE_DB.
6. VLAN and FDB further process these STATE_DB updates from PAC and write into their STATE_DB and APPL_DB tables.
7. Orchagent in the SWSS docker gets notified about changes in APPL_DB and responds by translating the APPL_DB changes to respective sairedis calls.
8. Sairedis APIs write into ASIC_DB.
9. Syncd gets notified of changes to ASIC_DB and in turn makes the respective SAI calls. The SAI calls translate to respective SDK calls to program the hardware.
10. An EAP Success message (EAPoL PDU) is sent to the client.


### 3.1.4 RADIUS

PAC uses the RADIUS client from hostapd.

PAC supports only one RADIUS server. The highest priority server is picked for authentication.
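For illustration, a hostapd.conf fragment as hostapdmgrd might generate it is sketched below. The option names are standard hostapd options; the interface name, addresses and shared secret are placeholders, and the exact set of options emitted by hostapdmgrd may differ.

```
# Illustrative hostapd.conf fragment for a wired authenticator port
interface=Ethernet1
driver=wired
ieee8021x=1
use_pae_group_addr=1

# RADIUS client settings (only one server is supported)
own_ip_addr=10.1.1.1
auth_server_addr=10.10.10.10
auth_server_port=1812
auth_server_shared_secret=placeholder_secret
```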


## 3.2 DB Changes

### 3.2.1 Config DB

**PAC_PORT_CONFIG**
```
"PAC_PORT_CONFIG": {
    "Ethernet1": {
        "method_list": [
            "dot1x",
            "mab"
        ],
        "priority_list": [
            "dot1x",
            "mab"
        ],
        "port_pae_role": "authenticator",
        "port_control_mode": "auto",
        "host_control_mode": "multi_auth",
        "reauth_period": 60,
        "reauth_enable": "true",
        "max_users_per_port": 16
    }
}


key = PAC_PORT_CONFIG:port ;Physical port

;field = value

method_list = "dot1x"/"mab" ;List of methods to be used for authentication

priority_list = "dot1x"/"mab" ;Relative priority of methods to be used for authentication

port_pae_role = "none"/"authenticator" ;"none": PAC is disabled on the port
                                        "authenticator": PAC is enabled on the port

port_control_mode = "auto"/"force_authorized"/ ;"auto": authentication is enforced on the port
                    "force_unauthorized"        "force_authorized": authentication is not enforced on the port
                                                "force_unauthorized": authentication is not enforced on the port, but the port is blocked for all traffic

host_control_mode = "multi_host"/              ;"multi_host": One data client can be authenticated on the port. The rest of the
                    "multi_auth"/"single_auth"  clients tailgate once the first client is authenticated.
                                                "multi_auth": Multiple data clients and one voice client can be authenticated on the port.
                                                "single_auth": One data client or one voice client can be authenticated on the port.

reauth_period = 1*10DIGIT ;The initial value of the timer that defines the period after which the Authenticator
                           will reauthenticate the Supplicant. Range is 1 - 65535 seconds.

reauth_enable = "true"/"false" ;Indicates whether reauthentication is enabled on the port.

max_users_per_port = 1*2DIGIT ;Maximum number of clients that can be authenticated on the port. This is applicable
                               only for the "multi_auth" host mode. Range is 1 - 16 clients.

```

**HOSTAPD_GLOBAL_CONFIG**
```
"HOSTAPD_GLOBAL_CONFIG": {
    "global": {
        "dot1x_system_auth_control": "enable"
    }
}


;field = value
dot1x_system_auth_control = "enable"/"disable" ;Indicates whether 802.1x is enabled in the system.
```

**MAB_PORT_CONFIG**
```
"MAB_PORT_CONFIG": {
    "Ethernet1": {
        "mab": "enable",
        "mab_auth_type": "eap-md5"
    }
}


key = MAB_PORT_CONFIG:port ;Physical port

;field = value

mab = "enable"/"disable" ;Indicates whether MAB is enabled on the port.

mab_auth_type = "eap-md5"/"pap"/"chap" ;MAB authentication type


```

### 3.2.2 App DB

```
"VLAN_MEMBER_TABLE": {
    "Vlan10:Ethernet1": {
        "dynamic": "yes",
        "tagging_mode": "untagged"
    }
}

key = VLAN_MEMBER_TABLE:Vlan:Port ;VLAN and Physical port

;field = value

dynamic = "yes"/"no" ;"yes" = assigned by RADIUS, "no" = configured
tagging_mode = "untagged"/"tagged" ;VLAN tagging mode

```

```
"PORT_TABLE": {
    "Ethernet1": {
        "learn_mode": "drop",
        "pvid": "10"
    }
}

```
### 3.2.3 ASIC DB

None

### 3.2.4 Counter DB

None


### 3.2.5 State DB

**PAC_PORT_OPER**

```
"PAC_PORT_OPER": {
    "Ethernet1": {
        "enabled_method_list": [
            "dot1x",
            "mab"
        ],
        "enabled_priority_list": [
            "dot1x",
            "mab"
        ]
    }
}


key = PAC_PORT_OPER:port ;Physical port

;field = value

enabled_method_list = "dot1x"/"mab" ;List of methods to be used for authentication
enabled_priority_list = "dot1x"/"mab" ;Relative priority of methods to be used for authentication

```


**PAC_AUTHENTICATED_CLIENT_OPER**
```

"PAC_AUTHENTICATED_CLIENT_OPER": {
    "Ethernet1": [
        {
            "00:00:00:11:02:33": {
                "authenticated_method": "dot1x",
                "session_timeout": 60,
                "user_name": "sonic_user",
                "termination_action": 0,
                "vlan_id": 194,
                "session_time": 511
            }
        },
        {
            "00:00:00:21:00:30": {
                "authenticated_method": "dot1x",
                "session_timeout": 60,
                "user_name": "sonic_user1",
                "termination_action": 0,
                "vlan_id": 194,
                "session_time": 51
            }
        }
    ]
}


key = PAC_AUTHENTICATED_CLIENT_OPER:mac ; Client MAC address
;field = value ;
authenticated_method = "dot1x"/"mab" ; Method used to authenticate the client
session_timeout = 1*10DIGIT ; Client session timeout
user_name = 1*255VCHARS ; Client user name
termination_action = 1DIGIT ; Client action on session timeout:
                            ;0: Terminate the client
                            ;1: Reauthenticate the client
vlan_id = 1*4DIGIT ; VLAN associated with the authorized client
session_time = 1*10DIGIT ; Client session time

```



**PAC_GLOBAL_OPER**
```
"PAC_GLOBAL_OPER": {
    "global": {
        "num_clients_authenticated": 10
    }
}
;field = value

num_clients_authenticated = 1*10DIGIT ;Number of clients authenticated
```

**STATE_OPER_PORT**
```
"STATE_OPER_PORT": {
    "Ethernet0": {
        "learn_mode": "cpu_trap",
        "acquired": "true"
    }
}

;field = value
learn_mode = 1*255VCHARS ; learn mode
acquired = 1*255VCHARS ; whether the port is acquired by PAC

```

**STATE_OPER_VLAN_MEMBER**
```
"STATE_OPER_VLAN_MEMBER": {
    "Vlan10": [
        {
            "Ethernet0": {
                "tagging_mode": "untagged"
            }
        }
    ]
}

;field = value
tagging_mode = 1*255VCHARS ; tagging mode

```

**STATE_OPER_FDB**
```
"STATE_OPER_FDB": {
    "Vlan10": [
        {
            "00:00:00:00:00:01": {
                "port": "Ethernet0",
                "type": "static"
            }
        }
    ]
}

;field = value
port = 1*255VCHARS ; port
type = 1*255VCHARS ; FDB entry type

```


## 3.3 Switch State Service Design

### 3.3.1 VLAN Manager

The VLAN Manager processes updates from pacd (received through STATE_DB updates) and propagates the port learning mode, port PVID and VLAN member updates to APPL_DB for further processing by OA.

### 3.3.2 Orchestration Agent

OA processes the updates from APPL_DB for setting the port learning mode, VLAN membership and PVID, and passes them down to the SAI Redis library for updating the ASIC_DB.
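For illustration, the hypothetical redis-cli transcript below inspects the operational state written by PAC after a client authenticates. The database numbers follow the standard SONiC layout (APPL_DB=0, CONFIG_DB=4, STATE_DB=6); the exact key layout of the PAC tables may differ from this sketch.

```
# Client state written by pacd (STATE_DB)
redis-cli -n 6 HGETALL "PAC_AUTHENTICATED_CLIENT_OPER|Ethernet1|00:00:00:11:02:33"

# Resulting port and VLAN attributes pushed to APPL_DB for orchagent
redis-cli -n 0 HGETALL "PORT_TABLE:Ethernet1"
redis-cli -n 0 HGETALL "VLAN_MEMBER_TABLE:Vlan10:Ethernet1"
```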
## 3.4 PAC Modules
### **3.4.1 Authentication Manager**

Authentication Manager is the central component of the pacd process.

It allows configuring the various port modes and authentication host modes. These modes determine the number of clients and the type of clients that can be authenticated and authorized on the ports.

Authentication Manager also enables configuring the authentication methods to be used for authenticating clients on a port. By default, the configured authentication methods are tried in order for that port. The following authentication methods can be configured for each port:
- 802.1X
- MAB

If a port is configured for 802.1X and MAB in this sequence, the port will first attempt to authenticate the user through 802.1X. If 802.1X authentication times out, the switch will attempt MAB. The automatic sequencing of authentication methods allows the network administrator to apply the same configuration to every access port without having to know in advance what kind of device (employee or guest, printer or PC, IEEE 802.1X capable or not, etc.) will be attached to it.

Authentication Manager allows configuring a priority for each authentication method on the port. If a client is already authenticated using MAB and 802.1X has a higher priority than MAB, then receiving an 802.1X frame causes the existing client to be authenticated again with 802.1X. However, if 802.1X is configured at a lower priority than the method the client authenticated with, the 802.1X frames are ignored.

After successful authentication, the authentication method returns the authorization parameters for the client. Authentication Manager uses these parameters for configuring the switch to allow traffic for authenticated clients. If Authentication Manager cannot apply any of the authorization attributes for a client, client authentication fails.

Client reauthentication is also managed by this module.

If RADIUS sends a Session-Timeout attribute with Termination-Action RADIUS (reauthenticate) or Default (clear the client session), this module manages the client session timers for reauthentication or client cleanup.


### 3.4.2 mabd
mabd provides the MAC Authentication Bypass (MAB) functionality. MAB is intended to provide 802.1x-unaware clients controlled access to the network using the device's MAC address as an identifier. This requires that the known and allowable MAC addresses and corresponding access rights be pre-populated in the authentication server.

The PAC supported authentication methods for MAB are:
- CHAP
- EAP-MD5
- PAP

### 3.4.3 hostapd
hostapd is an open source implementation of the 802.1x standard; the Linux application is supplied with the wpa_supplicant package. The wired driver module of hostapd is adapted to communicate with pacd via a socket interface. hostapd gets its configuration from the hostapd.conf file generated by hostapdmgrd.

### 3.4.4 hostapdmgrd
hostapdmgrd reads the hostapd specific configuration from the SONiC DBs and populates hostapd.conf. It then notifies hostapd to re-read the configuration file.

### 3.4.5 Interaction between modules

*hostapd (802.1X)*

hostapd comes to know of an 802.1x client attempting authentication via an EAP exchange. It informs pacd of this client by conveying the client MAC. If the authentication method selected by pacd is 802.1X, pacd sends an event to hostapd for authenticating the user. If the authentication method selected is MAB, the client is authenticated via MAB instead.

hostapd informs pacd about the result of the authentication.
hostapd also passes all the authorization parameters it receives from the RADIUS server to pacd. These are used for configuring the switch to allow authenticated client traffic.

*mabd (MAB)*

When a user or client tries to authenticate and the method selected is MAB, pacd sends an event to mabd for authenticating the user. The client's MAC address is sent to mabd for this purpose.

pacd learns the client's MAC address through a hardware rule that traps packets from unknown source MAC addresses to the CPU.

mabd informs pacd about the result of the authentication. mabd also passes all the authorization parameters it receives from the RADIUS server to pacd. These are used for configuring the switch to allow authenticated client traffic.


## 3.5 SyncD

No specific changes are needed in syncd for PAC.

## 3.6 SAI

Existing SAI attributes are used for this implementation and there is no new SAI attribute requirement.

### 3.6.1 Host interface traps

**SAI_HOSTIF_TRAP_TYPE_EAPOL** is leveraged to trap EAPoL packets (EtherType 0x888E) to the CPU.


### 3.6.2 Bridge port learning modes
PAC uses the following bridge port learning modes to drop/trap all unknown source MAC packets:
- SAI_BRIDGE_PORT_FDB_LEARNING_MODE_DROP
- SAI_BRIDGE_PORT_FDB_LEARNING_MODE_HW
- SAI_BRIDGE_PORT_FDB_LEARNING_MODE_CPU_TRAP



### 3.6.3 FDB
PAC uses **SAI_FDB_ENTRY_ATTR_PACKET_ACTION** with **SAI_PACKET_ACTION_DROP** to put the static FDB entry in discard state.
**SAI_PACKET_ACTION_FORWARD** is used to put the static FDB entry into forwarding state after successful client authentication.




## 3.7 Manageability

### 3.7.1 Yang Model
YANG models are available for managing the PAC, hostapd and MAB modules. SONiC YANG is added as part of this contribution. An illustrative fragment is sketched below.
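The following is a minimal sketch of what a port-level PAC model could look like, derived from the PAC_PORT_CONFIG schema in section 3.2.1. The module name, namespace and leaf set are illustrative; they are not the actual contributed model.

```yang
module sonic-pac {
    namespace "http://github.com/sonic-net/sonic-pac";  // illustrative namespace
    prefix spac;

    container sonic-pac {
        container PAC_PORT_CONFIG {
            list PAC_PORT_CONFIG_LIST {
                key "port";

                leaf port {
                    type string;  // physical port, e.g. Ethernet1
                }
                leaf port_pae_role {
                    type enumeration {
                        enum none;           // PAC disabled on the port
                        enum authenticator;  // PAC enabled on the port
                    }
                }
                leaf port_control_mode {
                    type enumeration {
                        enum auto;
                        enum force_authorized;
                        enum force_unauthorized;
                    }
                }
                leaf reauth_period {
                    type uint16 {
                        range "1..65535";  // seconds
                    }
                }
            }
        }
    }
}
```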
### 3.7.2 Configuration Commands

The following commands are used to configure PAC.

| CLI Command | Description |
| :--------------------------------------- | :--------------------------------------- |
| config authentication port-control interface | This command configures the authentication mode to use on the specified interface. Default is force-authorized. |
| config dot1x pae interface | This command sets the PAC role on the port. Default is none. The authenticator role enables PAC on the port. |
| config authentication host-mode interface | This command configures the host mode on the specified interface. Default is multi-host. |
| config dot1x system-auth-control | This command configures 802.1x globally. Default is disabled. |
| config authentication max-users interface | This command configures the max users on the specified interface. The count is applicable only in the multiple authentication host mode. Default is 16. |
| config mab interface \[ auth-type \] | This command configures MAB on the specified interface with the specified MAB authentication type. MAB is disabled by default. Default auth-type is eap-md5. |
| config authentication periodic interface | This command enables periodic reauthentication of the supplicants on the specified interface. Default is disabled. |
| config authentication timer reauthenticate interface | This command configures the reauthentication period of supplicants on the specified interface. The 'server' option is used to fetch this period from the RADIUS server. The 'seconds' option is used to configure the period locally. Default is 'server'. |
| config authentication order interface | This command is used to set the order of authentication methods used on a port. Default order is 802.1x,mab. |
| config authentication priority interface | This command is used to set the priority of authentication methods used on a port. Default priority is 802.1x,mab. |



### 3.7.3 Show Commands

**show authentication interface**

This command displays the authentication manager information for the interface.

| Field | Description |
| -------------------------- | ---------------------------------------- |
| Interface | The interface for which authentication configuration information is being displayed. |
| Port Control Mode | The configured control mode for this port. Possible values are auto, force-authorized and force-unauthorized. |
| Host Mode | The authentication host mode configured on the interface. |
| Configured method order | The configured order of authentication methods used on the interface. |
| Enabled method order | The operationally enabled order of authentication methods used on the interface. |
| Configured method priority | The configured priority for the authentication methods used on the interface. |
| Enabled method priority | The operationally enabled priority for the authentication methods used on the interface. |
| Reauthentication Period | The period after which all clients on the interface will be reauthenticated. |
| Reauthentication Enabled | Indicates whether reauthentication is enabled on the interface. |
| Maximum Users | The maximum number of clients that can be authenticated on the interface if the interface is configured in multi-auth host mode. |
| PAE role | Indicates the configured PAE role as authenticator or none. |



**show authentication**

This command displays the number of authenticated clients.

| Field | Description |
| ------------------------------- | ---------------------------------------- |
| Number of Authenticated clients | The total number of clients authenticated on the switch |



**show authentication clients \>**

This command displays the details of the authenticated clients; illustrative output is shown below the field table.


| Field | Description |
| ---------------------------------------- | ---------------------------------------- |
| Interface | The interface for which authentication configuration information is being displayed. |
| Mac Address | The MAC address of the client. |
| User Name | The user name associated with the client. |
| VLAN | The VLAN associated with the client. |
| Host Mode | The authentication host mode configured on the interface. The possible values are multi-auth, multi-host and single-host. |
| Method | The method used to authenticate the client on the interface. The possible values are dot1x or MAB. |
| Session Time | The amount of time the client session has been active. |
| Session Timeout | This value indicates the time for which the given session is valid. The time period in seconds is returned by the RADIUS server on authentication of the port. |
| Time left for Session Termination Action | This value indicates the time left for the session termination action to occur. This field is valid only when "authentication periodic" is configured. |
| Session Termination Action | This value indicates the action to be taken once the session timeout expires. Possible values are Default and Radius-Request. If the value is Default, the session is terminated and client details are cleared. If the value is Radius-Request, then a reauthentication of the client is performed. |
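A hypothetical example of the output is shown below. The column values are taken from the sample client in section 3.2.5; the exact formatting of the real command may differ.

```
admin@sonic:~$ show authentication clients all

Interface  MAC Address        User Name   VLAN  Host Mode   Method  Session Time
---------  -----------------  ----------  ----  ----------  ------  ------------
Ethernet1  00:00:00:11:02:33  sonic_user  194   multi-auth  dot1x   511
```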
**show mab**

This command is used to show a summary of the global MAB configuration and summary information of the MAB configuration for all ports. This command also provides the detailed MAB configuration for a specified port.

| Field | Description |
| ------------- | ---------------------------------------- |
| Interface | Given interface |
| Admin Mode | MAB admin mode on the given interface |
| MAB auth type | MAB authentication type (EAP-MD5, PAP, CHAP) |



**show dot1x**

This command is used to show a summary of the global 802.1x configuration.

| Field | Description |
| ------------------- | ---------------------------------------- |
| Administrative Mode | Indicates whether 802.1x is enabled or disabled. |




**show dot1x detail**

This command is used to show the details of the 802.1x configuration on an interface.

| Field | Description |
| ---------------- | ---------------------------------------- |
| Interface | Given interface |
| PAE Capabilities | The Port Access Entity (PAE) functionality of this port. Possible values are Authenticator or None |



### 3.7.4 Clear Commands

**sonic-clear authentication sessions \>**

This command clears information for all Auth Manager sessions. All the authenticated clients are re-initialized and forced to authenticate again.



# 4 Scalability

The following scale is supported:

| Configuration / Resource | Scale |
| ---------------------------------------- | ----- |
| Total number of authenticated clients on a port configured in Multiple Authentication host mode | 16 |
| Total number of authenticated clients supported by the switch | 128 |



# 5 Appendix: Sample configuration

```
config vlan add 100
config authentication port-control interface auto Ethernet10
config dot1x pae interface authenticator Ethernet10
config authentication host-mode interface multi-auth Ethernet10
config authentication max-users interface 10 Ethernet10
config mab interface enable pap Ethernet10
config dot1x system-auth-control enable
config authentication periodic interface Ethernet10
config authentication timer reauthenticate interface 600 Ethernet10
```



# 6 Future Enhancements

1. Add configurability support for 802.1x and MAB timers.
2. Add support for fallback VLANs such as Guest and Unauth VLANs. These are used to authorize clients if they fail authentication under various circumstances.
3. Add support for RADIUS authorization attributes like ACLs.
4. Add support for multiple RADIUS servers.
diff --git a/doc/pac/images/PAC_Config_Flow.JPG b/doc/pac/images/PAC_Config_Flow.JPG
new file mode 100644
index 00000000000..87657c79949
Binary files /dev/null and b/doc/pac/images/PAC_Config_Flow.JPG differ
diff --git a/doc/pac/images/PAC_Deployment.JPG b/doc/pac/images/PAC_Deployment.JPG
new file mode 100644
index 00000000000..b81bb9c40c6
Binary files /dev/null and b/doc/pac/images/PAC_Deployment.JPG differ
diff --git a/doc/pac/images/PAC_EAPoL_Rx_Flow.JPG b/doc/pac/images/PAC_EAPoL_Rx_Flow.JPG
new file mode 100644
index 00000000000..43c05018369
Binary files /dev/null and b/doc/pac/images/PAC_EAPoL_Rx_Flow.JPG differ
diff --git a/doc/pac/images/PAC_MAB_Rx_Flow.JPG b/doc/pac/images/PAC_MAB_Rx_Flow.JPG
new file mode 100644
index 00000000000..f10e694314a
Binary files /dev/null and b/doc/pac/images/PAC_MAB_Rx_Flow.JPG differ
diff --git a/doc/pic/bgp_pic_arch_doc.md b/doc/pic/bgp_pic_arch_doc.md
new file mode 100644
index 00000000000..0d1c3a421d8
--- /dev/null
+++ b/doc/pic/bgp_pic_arch_doc.md
@@ -0,0 +1,329 @@
# BGP Prefix Independent Convergence Architecture Document

#### Rev 1.0

# Table of Contents

- [Revision](#revision)
- [1 Background](#1-background)
- [2 PIC Core](#2-pic-core)
- [3 PIC Edge](#3-pic-edge)
- [4 PIC Local aka Fast Reroute](#4-pic-local-aka-fast-reroute)
- [5 PIC Requirements](#5-pic-requirements)
- [6 Concept](#6-concept)
- [7 SONIC PIC Core Call Flow](#7-sonic-pic-core-call-flow)
  - [7.1 Core Local Failure](#71-core-local-failure)
  - [7.2 Core Remote Failure](#72-core-remote-failure)
- [8 SONIC PIC Edge Call Flow](#8-sonic-pic-edge-call-flow)
- [9 References](#9-references)

# Revision

| Rev | Date | Author | Change Description |
|:--:|:--------:|:-----------------:|:------------------------------------------------------------:|
| 1.0 | Jan 2024 | Patrice Brissette (Cisco), Eddie Ruan (Alibaba), Donald Sharp (Nvidia), Syed Hasan Raza Naqvi (Broadcom) | Initial version |

# 1 Background

A link and/or node failure in a network causes packet loss. The loss may cause black-holes and/or micro-loops until all relevant nodes in the path have converged. Overlay (BGP) and underlay (IGP) behave quite differently with respect to convergence, and each has its own challenges. The underlay deals with hundreds of routes, while the overlay is designed to carry millions of routes. Upon a network failure, the amount of packet loss may be enormous if each single affected route is reprogrammed along the affected path.

Prefix Independent Convergence (PIC) is a solution allowing network elements to address a network failure independently of the number of affected objects. Usually, PIC makes use of a multi-path capability, i.e., ECMP or primary / backup reachability paths.

The PIC solution is well described and covered at the IETF by .

The picture below illustrates a typical network. Three PEs are connected via two P routers. CE devices are directly connected: CE1 is multi-homed to PE1 and PE2, and CE2 is single-homed to PE3. PE3 is connected to P1 via interface A and to P2 via interface B. PE1 is connected to P1 via interface C and to P2 via interface D. CEs can be devices, hosts, applications, etc. They are represented by r1, r2, … rn. PE3 has reachability to PE1 via interfaces A and B. PE3 has reachability to PE2 via B only.

![PIC Topology](images/background.png)

__Figure 1: PIC Topology__

In this document, explanations are based on traffic flowing in the North to South direction.
# 2 PIC Core

PIC Core is about prefix independent convergence when the underlay / IGP is affected.

The overlay service runs between the PE routers, where it gets terminated on PE1, PE2 and PE3. PIC Core matters when a failure happens within the core (e.g., P1 is impacted).

Initially, at steady state, PE3 has full routing knowledge about CE1 in its table. Upon failure and detection, the underlay / IGP reconverges. Interface A is removed from the routing table without having to update each single route (r1, r2, … rn). Basically, the PIC functionality allows PE3 to avoid looping across all prefixes for hardware reprogramming.

![PIC Core](images/pic_core.png)

__Figure 2: PIC Core__
|              | @PE3 |
| ------------ | ---- |
| Steady State | (r1, r2, … rn) -> PE1 (sC1) + PE2 (sC2)<br>PE1 -> Intf-A (uC1) + Intf-B (uC2)<br>PE2 -> Intf-B (uC4) |
| Failure      | P1 failure:<br>(r1, r2, … rn) -> PE1 (sC1) + PE2 (sC2)<br>PE1 -> ~~Intf-A (uC1)~~ + Intf-B (uC2)<br>PE2 -> Intf-B (uC4) |
| Result       | Only the affected underlay interface on PE3 has been removed from forwarding. The impact is limited to underlay routes. |
Where PEx = IP address of specific remote nexthop, sCx = Overlay Service Context (VRF, label, VNI, etc.), Intf-x = Interface, uCx = Underlay Context (label, nexthop, etc.).

# 3 PIC Edge

PIC Edge is about prefix independent convergence when the service overlay is affected … as well as the underlay / IGP. However, different devices on the network perform different PIC functions. As seen before, PIC Core applies to P1 and P2, whereas PE3 triggers PIC Edge. In the following example, the overlay service runs between the PEs.

![PIC Edge](images/pic_edge.png)

__Figure 3: PIC Edge__
|              | @PE3 |
| ------------ | ---- |
| Steady State | (r1, r2, … rn) -> PE1 (sC1) + PE2 (sC2)<br>PE1 -> Intf-A (uC1) + Intf-B (uC2)<br>PE2 -> Intf-B (uC4) |
| Failure      | PE1 failure:<br>(r1, r2, … rn) -> ~~PE1 (sC1)~~ + PE2 (sC2)<br>~~PE1 -> Intf-A (uC1) + Intf-B (uC2)~~<br>PE2 -> Intf-B (uC4) |
| Result       | PE1 is removed from underlay / IGP reachability. The PE1 nexthop is completely removed from the PE3 routing table. |
Where PEx = IP address of a specific remote nexthop, sCx = Overlay Service Context (VRF, label, VNI, etc.), Intf-x = Interface, uCx = Underlay Context (label, nexthop, etc.).

In datacenters today, many providers build their networks with BGP as both the overlay and the underlay protocol. BGP peers between devices at different levels for overlay and underlay. BGP sessions run between loopback addresses for the overlay service, while separate BGP sessions run between physical interfaces for the underlay.

![BGP Peering](images/bgp_peering.png)

__Figure 4: BGP Peering__

In the case of the PE1 node failure shown above for PIC Edge, the following sequence of events happens:

1. P1 detects PE1 down; the link towards PE1 is down
2. P1 withdraws the PE1 reachability route/info from BGP towards PE3 (underlay)
3. PE3 locally adjusts the underlay path list accordingly upon receiving the withdraw message
4. PE3 locally adjusts the overlay path list based on the underlay adjustment

# 4 PIC Local aka Fast Reroute

PIC Local, aka Fast Reroute (FRR), is yet another tool to perform prefix independent convergence, but for locally connected interfaces. It acts mainly for a short period of time; enough time to allow the network to re-converge (helping PIC Edge). PIC Edge provides quick convergence at the ingress side; FRR provides quick convergence at the egress side until the ingress gets notified/signaled about the egress loss. PE3 performs PIC Edge and Core behaviors while P1 and P2 perform PIC Core behavior. FRR is a local behavior using the concept of a primary / backup path. FRR may be performed on either P or PE routers. However, when people refer to FRR, it is mainly on edge devices; in the core, terms like LFA, etc. are used.

The idea behind FRR is to reroute a specific flow to a peering PE where another path is available to reach a specific destination. In this case, a PE-to-CE protocol runs between the CE and PE devices.

![PIC Local](images/pic_local.png)

__Figure 5: PIC Local__

@PE1
Steady State(r1, r2, … rn) -> P: Local intf, B: PE2 (sC2)
+ PE2 -> Intf-C (uC3) + Intf-D (uC4)
FailureLink failure between PE1 and CE1:
+ (r1, r2, … rn) -> P: Local intf, B: PE2 (sC2)
+ PE2 -> Intf-C (uC3), Intf-D (uC4)
ResultThe local interface is removed from the forwarding table. PE1 promotes the backup path (via PE2) as the primary path to reach CE1.
Where PEx = IP address of a specific remote nexthop, sCx = Overlay Service Context (VRF, label, VNI, etc.), Intf-x = Interface, uCx = Underlay Context (label, nexthop, etc.).

# 5 PIC Requirements

Efficient architecture and design of PIC is very important to achieve high-performance convergence. Many variables may affect the overall result.

- Although its use was popularized with BGP, PIC is a protocol-independent routing and forwarding feature
- Work independently from the overlay, e.g., overlay protocols such as BGP, controllers, etc.
- Work independently from the underlay (IGP)
- PIC Edge requires multi-path (ECMP, UCMP, pre-computed backup)
- PIC Core may make use of multi-path or primary / backup to pre-compute alternative paths
- The level of indirection must be shareable and designed in a hierarchical manner
- The level of indirection must be shared across many prefixes. The shared path-list count is targeted to be much smaller than the per-prefix count.
- The forwarding plane must be capable of accessing multiple levels of indirection
- Drivers in software (SAI / SDK) must be capable of programming that level of indirection in the specific ASIC accordingly

Requirements are slightly different between PIC Core and PIC Edge, but they share common goals. The difference comes from how the router detects (or learns) about a failure. Generally speaking, detection happens as follows:

- PIC Core: local interface failure on any node of the network, or BFD equivalent
- PIC Edge: nexthop tracking. Failure provided by IGP/BGP.
- PIC Local: local interface failure on any node of the network, or BFD equivalent

Regardless of the detection mechanism, the requirements are:

- Provide hierarchical forwarding chains which can be shared across multiple routes/objects
- Hardware must be preprogrammed using these hierarchical forwarding chains.
  - E.g., objects -> level of indirection -> [multi-path list]. The list can be for ECMP or active/backup.
  - In the FRR/SONIC context, the level of indirection is equal to the NHG object (nexthop group)
- For PIC Core and PIC Local, upon local failure, low-level software or hardware must:
  - Prune the multi-path list of the affected path, translating that into a single hardware update. The content of the NHG is updated with fewer paths.
  - Tell the upper layers of software about the issue.
  - The control plane then updates the hardware accordingly, at its own pace, once protocols converge.
- For PIC Edge, upon remote failure, the control plane must update only the level of indirection first (NHG). It may re-apply new reachability afterwards, once the network has fully converged.
- Any transition programmed by the control plane must be hitless from the datapath point of view (zero packet loss)
  - Applicable when going from single path to multi-path (or vice versa)
  - Going from a level of indirection to a new level of indirection (e.g., from NHG-A to NHG-B)
  - Going from pure single-path info to a level of indirection and vice versa.
- ECMP hardware resources are very limited. For scalability purposes, the usage of NHG (level of indirection) may be limited to only where it is applicable. For instance, an NHG may NOT be used for a single path only.

# 6 Concept

Conceptually speaking, PIC works by using levels of indirection where shareable entries are used by many routes / objects.

The following picture shows the various levels of indirection required to perform efficient PIC functionality.
![Concept](images/nhg_concept.png)

__Figure 6: Concept__

The logic is as follows:

- Prefixes / routes point to a set of remote nexthops, where each element represents the IP address of the remote overlay service nexthop (PEx loopback address).
- Each element points to a list of outgoing underlay nexthops, where each element represents an interface, NH, etc., to reach the next network device along the path to the destination.

In FRR / SONIC, the level of indirection is represented by the nexthop group (NHG) object. An NHG is made of a path list. For example, NHG-A = [nexthop 1, nexthop 2], where multiple routes point to that NHG.

The following diagram shows the high-level concept of the interaction between FRR (bgpd and zebra) and SONIC fpmsyncd. The first column shows the initial state. There are 2 routes in BGP pointing to the same remote nexthops (NH1, NH2). Zebra creates an NHG object with that list of paths, to which both IP routes (IP1, IP2) point. The NHG and routes are provided to fpmsyncd.

The second column shows the addition of a new remote nexthop for the IP1 entry. Zebra creates a 2nd NHG which is now used by the IP1 route. Fpmsyncd receives the new update. On the SONIC side, the hardware must be programmed first with that new NHG object; then the IP1 route can be reprogrammed using it. This is to avoid packet loss during the transition from NHG-A to NHG-B.

The third column shows the remote nexthop removal. Zebra simply removes the affected path from the NHG object path list. The IP route does not get updated. Later on, zebra may decide to delete NHG-B and move IP1 to NHG-A. That transition should be hitless. When the ECMP path has 2 members, upon removal of one, the end result is a single path. Zebra may decide to delete that NHG object holding a single NH and move the affected routes to use a nexthop ID directly, without a nexthop group. Again, the transition from the usage of an NHG object to a plain nexthop ID should be hitless.

![High Level Flows](images/nhg_flow.png)

__Figure 7: High Level Flows__

It is worth noting that BGP PIC works with the concept of FAST DOWNLOAD and SLOW DOWNLOAD updates. The FAST DOWNLOAD update must be done "as fast as possible" to achieve proper BGP PIC convergence results. It consists mainly of triggering a single update in hardware, e.g. an NHG object update. The SLOW DOWNLOAD is mainly about applying proper control plane convergence results. It consists of updating and validating the FAST DOWNLOAD hardware update at its own pace.
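To make the sharing concrete, below is a minimal, hypothetical Python sketch (an illustration only, not SONIC code) of the level of indirection described above; the NHG-A / NH1 / NH2 labels mirror the figures:

```python
# Level of indirection: many routes share one NHG entry (labels as in the
# figures above; plain dicts stand in for the hardware tables).
nhg_table = {"NHG-A": ["NH1", "NH2"]}
route_table = {f"IP{i}": "NHG-A" for i in range(1, 1001)}  # 1000 prefixes

def fast_download_prune(nhg_id: str, failed_path: str) -> int:
    """FAST DOWNLOAD: drop the failed path from the shared path list.
    Returns the number of table writes performed."""
    nhg_table[nhg_id] = [p for p in nhg_table[nhg_id] if p != failed_path]
    return 1  # one NHG update, regardless of how many routes point to it

writes = fast_download_prune("NHG-A", "NH1")
assert writes == 1
assert all(nhg_table[nhg] == ["NH2"] for nhg in route_table.values())
print(f"{len(route_table)} routes repaired with {writes} hardware update")
```

The point of the sketch is the return value: convergence cost is constant in the number of prefixes sharing the path list.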
# 7 SONIC PIC Core Call Flow

## 7.1 Core Local Failure

The link between PE3 and P1 goes down. That translates to interface A going down on PE3.

![PIC Core Local Failure](images/pic_core.png)

__Figure 8: PIC Core Local Failure__

The following diagram describes the interaction happening in FRR and SONIC for such a failure:

![Core Local Failure Call Flow](images/core_failure_local_flow.png)

__Figure 9: Core Local Failure Call Flow__

FAST DOWNLOAD update:

0. Loss detection is done by the ASIC; a notification is sent to syncd
1. syncd sends a port-down event to ASIC_DB
2. orchagent collects the new state
3. orchagent (nhgorch) updates the NHG in ASIC_DB by removing the affected path from the list
4. syncd receives the update through ASIC_DB and invokes SAI
5. a single NHG object update is done in hardware via the SDK

Kernel update

SLOW DOWNLOAD update:

6. syncd (via SAI/SDK) updates the kernel with the interface oper down status
7. a netlink message is received in portsyncd (silently discarded)
8. a netlink message is received in zebra with the interface oper down status for the affected interface
9. BGP and/or IGP sends new paths and zebra recomputes the NHG
   - if the reachability change does not impact BGP, this step is skipped and the solution relies on IGP only
10. zebra redownloads the updated NHG object by sending a netlink message
11. The kernel and fpmsyncd get updated
12. fpmsyncd processes the netlink message and pushes the update to APP_DB
13. orchagent (nhgorch) gets notified, processes the received information and updates ASIC_DB
14. syncd gets notified and updates SAI/SDK accordingly
15. Hardware is finally updated

Alternatively, to align the design between PIC Edge and Core, local events may go straight to FRR, letting zebra perform the FAST DOWNLOAD. This solution is not desirable since it does not provide full convergence performance.

## 7.2 Core Remote Failure

The link between PE2 and P2 goes down. That translates to an IGP update on PE3.

![PIC Core Remote Failure](images/core_failure_remote.png)

__Figure 10: PIC Core Remote Failure__

The following diagram describes the interaction happening in FRR and SONIC for such a failure:

![Core Remote Failure Call Flow](images/core_failure_remote_flow.png)

__Figure 11: Core Remote Failure Call Flow__

In this scenario, only the FAST DOWNLOAD is considered:

0. zebra identifies the dependent NHGs and triggers the PIC NHG update first, before any other updates
1. zebra redownloads the updated NHG object by sending a netlink message. It may redownload some affected routes when the NHG ID changes, or fall back to an NH-ID only.
2. The kernel and fpmsyncd get updated
3. fpmsyncd processes the netlink message and pushes the update to APP_DB
4. orchagent gets notified, processes the received information and updates ASIC_DB
5. syncd gets notified and updates SAI/SDK accordingly
6. Hardware is finally updated

# 8 SONIC PIC Edge Call Flow

The PE1 router goes down. That translates to IGP and BGP updates towards PE3.

![PIC Edge Failure](images/pic_edge.png)

__Figure 12: PIC Edge Failure__

The following diagram describes the interaction happening in FRR and SONIC for such a failure (the flow is the same as for a core remote failure):

![Core Remote Failure Call Flow](images/core_failure_remote_flow.png)

__Figure 13: Core Remote Failure Call Flow__

In this scenario, only the FAST DOWNLOAD is considered:

0. zebra identifies the dependent NHGs and triggers the PIC NHG update first, before any other updates
1. zebra redownloads the updated NHG object by sending a netlink message. It may redownload some affected routes when the NHG ID changes, or fall back to an NH-ID only.
2. The kernel and fpmsyncd get updated
3. fpmsyncd processes the netlink message and pushes the update to APP_DB
4. orchagent gets notified, processes the received information and updates ASIC_DB
5. syncd gets notified and updates SAI/SDK accordingly
6. Hardware is finally updated
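Step 1 in both fast paths notes that a route may be moved to a different NHG ID; the requirements in section 5 demand that such a move be hitless. A small, hypothetical sketch (plain Python, not SONIC code) of the make-before-break ordering that satisfies this:

```python
# Make-before-break: program the new NHG before repointing any route to it,
# so the route stays resolvable at every step (labels follow section 6).
nhg_table = {"NHG-A": ["NH1", "NH2"]}
route_table = {"IP1": "NHG-A"}

def resolvable(prefix: str) -> bool:
    """A route resolves only if its NHG exists and has at least one path."""
    return bool(nhg_table.get(route_table[prefix]))

def move_route_hitless(prefix: str, new_nhg: str, paths: list) -> None:
    nhg_table[new_nhg] = paths        # 1) program the new NHG object first
    assert resolvable(prefix)         #    the old path list is still in use
    route_table[prefix] = new_nhg     # 2) repoint the route (single write)
    assert resolvable(prefix)         #    the new path list is now in use
    # 3) the old NHG may be deleted later, once nothing references it

move_route_hitless("IP1", "NHG-B", ["NH1", "NH2", "NH3"])
print(route_table, nhg_table)
```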
# 9 References

- [IETF BGP PIC draft](https://datatracker.ietf.org/doc/draft-ietf-rtgwg-bgp-pic/)
diff --git a/doc/pic/hld_fpmsyncd.md b/doc/pic/hld_fpmsyncd.md new file mode 100644 index 00000000000..2e28dfca711 --- /dev/null +++ b/doc/pic/hld_fpmsyncd.md @@ -0,0 +1,689 @@

# `fpmsyncd` NextHop Group Enhancement High Level Design Document

## Table of Content
- [Revision](#revision)
- [Scope](#scope)
- [Overview](#overview)
- [Requirements](#requirements)
- [Architecture Design](#architecture-design)
- [High-Level Design](#high-level-design)
  - [Current fpmsyncd processing flow (for reference)](#current-fpmsyncd-processing-flow-for-reference)
  - [Proposed fpmsyncd processing flow using NextHop Group](#proposed-fpmsyncd-processing-flow-using-nexthop-group)
  - [Value SET/DEL to APPL\_DB](#value-setdel-to-appl_db)
  - [Example of entries in APPL\_DB](#example-of-entries-in-appl_db)
  - [Example of entries in ASIC\_DB](#example-of-entries-in-asic_db)
- [SAI API](#sai-api)
- [Configuration and management](#configuration-and-management)
  - [Configuration data flow](#configuration-data-flow)
  - [CLI/YANG model Enhancements](#cliyang-model-enhancements)
  - [Config DB Enhancements](#config-db-enhancements)
- [Warmboot and Fastboot Design Impact](#warmboot-and-fastboot-design-impact)
- [Testing Requirements/Design](#testing-requirementsdesign)
  - [Unit Test cases](#unit-test-cases)
  - [Config test cases (feature enable/disable)](#config-test-cases-feature-enabledisable)
  - [System Test cases](#system-test-cases)
- [Open/Action items - if any](#openaction-items---if-any)
  - [libnl compatibility with upstream](#libnl-compatibility-with-upstream)
  - [Further performance improvements](#further-performance-improvements)
  - [Backward compatibility with current NHG creation logic (Fine-grain NHG, Ordered NHG/ECMP)](#backward-compatibility-with-current-nhg-creation-logic-fine-grain-nhg-ordered-nhgecmp)
  - [nexthop\_compat\_mode Kernel option](#nexthop_compat_mode-kernel-option)
  - [Warmboot/Fastboot support](#warmbootfastboot-support)
  - [No support for setting config enable/disable on runtime](#no-support-for-setting-config-enabledisable-on-runtime)
  - [Source of APPL\_DB entry related to NHG](#source-of-appl_db-entry-related-to-nhg)

### Revision

| Rev | Date | Author | Change Description |
| :---: | :----------: | :------------------------------------------------: | ------------------ |
| 0.1 | Jul 14, 2023 | Kanji Nakano, Kentaro Ebisawa, Hitoshi Irino (NTT) | Initial version |
| 0.2 | Jul 30, 2023 | Kentaro Ebisawa (NTT) | Remove description about VRF which is not necessary for NHG. Add High Level Architecture diagram. Add note related to libnl, Routing WG. Fix typo and improve explanations. |
| 0.3 | Sep 18, 2023 | Kentaro Ebisawa (NTT) | Update based on discussion at Routing WG on Sep 14th (Scope, Warmboot/Fastboot, CONFIG_DB) |
| 0.4 | Sep 24, 2023 | Kentaro Ebisawa (NTT) | Add feature enable/disable design and CLI. Update test plan. |
| 0.5 | Nov 10, 2023 | Kanji Nakano (NTT) | Update feature enable/disable design and CLI. Update test plan. |

### Scope

This document details the design and implementation of the "fpmsyncd extension" related to NextHop Group behavior in SONiC.
The goal of this "fpmsyncd extension" is to integrate NextHop Group (NHG) functionality into SONiC by writing NextHop Group entries from `fpmsyncd` to `APPL_DB` for NextHop Group operation in SONiC.

- The scope of this change is to extend `fpmsyncd` to handle `RTM_NEWNEXTHOP` and `RTM_DELNEXTHOP` messages from FPM.
- There will be no change to SWSS/Orchagent.
- This change is backward compatible. Upgrading from a SONiC version that does not support this feature does not change the user's expected behavior, as this feature is disabled by default.

### Overview

The SONiC system has support for programming routes using the NextHop Group feature through the NextHop Group table in the `APPL_DB` database.
The idea is to have a more efficient system that manages the NextHop Groups in use by the route table separately, and simply has the route table specify a reference to which NextHop Group to use.
Since at scale many routes will use the same NextHop Groups, this requires much smaller occupancy per route, and so more efficient building, transmission and parsing of per-route information.

The current version of `fpmsyncd` has no support for handling the NextHop Group netlink messages sent by the zebra process via `FPM` using the `dplane_fpm_nl` module.
This implementation modifies the `fpmsyncd` code to handle `RTM_NEWNEXTHOP` and `RTM_DELNEXTHOP` events and write them to the database.
Also, `fpmsyncd` was modified to use the NextHop Group ID (`nexthop_group`) when programming the route to the `ROUTE_TABLE` if `RTA_NH_ID` was included in the `RTM_NEWROUTE` message from zebra via `FPM`.

NHG IDs and members are managed by `FRR`.
`fpmsyncd` will use the NHG ID provided in the FPM message from `FRR(zebra)`.
Thus, the logic of whether to update NHG members or create an NHG with a new ID during a topology change is managed by `FRR`.

Example use cases of this feature are BGP PIC and recursive route handling.
Design discussion of BGP PIC has started in the SONiC Routing WG; recursive route support will be discussed afterwards.
See [09072023 Routing WG Meeting minutes](https://lists.sonicfoundation.dev/g/sonic-wg-routing/wiki/34786) for further information about the BGP PIC discussion.

### Requirements

The `fpmsyncd extension` requires:
- `fpmsyncd` to handle `RTM_NEWNEXTHOP` and `RTM_DELNEXTHOP` events from zebra via `dplane_fpm_nl`
- `fpmsyncd` to SET/DEL routes to `APPL_DB: ROUTE_TABLE` using `nexthop_group`
- `fpmsyncd` to SET/DEL NextHop Group entries to `APPL_DB: NEXTHOP_GROUP_TABLE`

This feature must be disabled by default.
- When this feature is disabled, behavior will be the same as before introducing this feature.
  - i.e. `NEXTHOP_GROUP_TABLE` entries will not be created and `nexthop_group` will not be used in `ROUTE_TABLE` entries in `APPL_DB`.
- See section [Configuration and management](#configuration-and-management) for details on how this feature is disabled/enabled.

### Architecture Design

This design modifies `fpmsyncd` to use the new `APPL_DB` tables.

The current `fpmsyncd` handles just `RTM_NEWROUTE` and `RTM_DELROUTE`, writing all route information for each route prefix to the `ROUTE_TABLE` on the Redis DB (`redis-server`).
When the zebra process is initialized using the old fpm module, `RTM_NEWROUTE` is sent with at least the destination address, gateway, and interface ID attributes.
For a multipath route, `RTM_NEWROUTE` is sent with a list of gateways and interface IDs.
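For contrast, the sketch below (plain Python dictionaries; the values are copied from the worked examples later in this document) shows how the same ECMP route is represented in `APPL_DB` without and with the NextHop Group extension:

```python
# Legacy format: every route carries its full nexthop list.
legacy = {
    "ROUTE_TABLE:10.1.1.4/32": {
        "nexthop": "10.0.0.1,10.0.0.3",
        "ifname": "Ethernet0,Ethernet4",
        "weight": "1,1",
    },
}

# NHG format: the nexthop list is written once and shared by reference.
with_nhg = {
    "NEXTHOP_GROUP_TABLE:ID127": {
        "nexthop": "10.0.0.1,10.0.0.3",
        "ifname": "Ethernet0,Ethernet4",
        "weight": "1,1",
    },
    # Each route sharing these nexthops now carries only the group reference.
    "ROUTE_TABLE:10.1.1.4": {"nexthop_group": "ID127", "protocol": "bgp"},
}
```

At scale, only the single `NEXTHOP_GROUP_TABLE` entry needs to change when the shared path set changes.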
This `Fpmsyncd extension` will modify `fpmsyncd` to handle `RTM_NEWNEXTHOP` and `RTM_DELNEXTHOP` as below.

##### Figure: Fpmsyncd NHG High Level Architecture
![fig: fpmsyncd nhg architecture](images_fpmsyncd/fpmsyncd-nhg-architecture.png)

- FRR configuration
  - (1) config zebra to use `dplane_fpm_nl` instead of the `fpm` module (this is the default since the 202305 release)
  - (2) set the `fpm use-next-hop-groups` option (this is disabled by default and enabled via `CONFIG_DB`)
- fpmsyncd enhancement
  - (3) Handle `RTM_NEWNEXTHOP` fpm messages from zebra
  - (4) and create `NEXTHOP_GROUP_TABLE` entries

### High-Level Design

#### Current fpmsyncd processing flow (for reference)

For example, if one configures the following routes:

```
B>*10.1.1.4/32 [20/0] via 10.0.0.1, Ethernet0, 00:00:08
  *                   via 10.0.0.3, Ethernet4, 00:00:08
```

it will generate the following `APPL_DB` entries:

```
admin@sonic:~$ sonic-db-cli APPL_DB hgetall "ROUTE_TABLE:10.1.1.4/32"
{'nexthop': '10.0.0.1,10.0.0.3', 'ifname': 'Ethernet0,Ethernet4', 'weight': '1,1'}
```

The flow below shows how `zebra`, `fpmsyncd` and `redis-server` interact when using the `fpm plugin` without NextHop Group:

##### Figure: Flow diagram without NextHop Group
![fig1](images_fpmsyncd/fig1.svg)

#### Proposed fpmsyncd processing flow using NextHop Group

To support the nexthop group, `fpmsyncd` was modified to handle the new events `RTM_NEWNEXTHOP` and `RTM_DELNEXTHOP`.
`fpmsyncd` now has new logic to associate routes to NextHop Groups.

The flow for the new NextHop Group feature is shown below:

##### Figure: Flow diagram new nexthop group feature
![fig2](images_fpmsyncd/fig2.svg)

#### Value SET/DEL to APPL_DB

After enabling `use-next-hop-groups` in the `dplane_fpm_nl` plugin, zebra will send `RTM_NEWNEXTHOP` to `fpmsyncd` when a new route is added.

`RTM_NEWNEXTHOP` is sent with 2 different attribute groups as shown in the table below:
EventAttributesDescription
RTM_NEWNEXTHOPNHA_IDNextHop Group ID
NHA_GATEWAYgateway address
NHA_OIFThe interface ID
RTM_NEWNEXTHOPNHA_IDNextHop Group ID
NHA_GROUPA list of nexthop IDs with their respective weights.
+ +After sending the `RTM_NEWNEXTHOP` events, zebra sends the `RTM_NEWROUTE` to `fpmsyncd` with NextHop Group ID as shown in the table below: + + + + + +
EventAttributesDescription
RTM_NEWROUTERTA_DSTroute prefix address
RTA_NH_IDNextHop Group ID
#### Example of entries in APPL_DB

For example, the following route configuration will generate the events shown in the table below:

```
admin@sonic:~$ show ip route
B>*10.1.1.4/32 [20/0] via 10.0.0.1, Ethernet0, 00:00:08
  *                   via 10.0.0.3, Ethernet4, 00:00:08
```

```
admin@sonic:~$ sonic-db-cli APPL_DB keys \* | grep NEXT
NEXTHOP_GROUP_TABLE:ID127

admin@sonic:~$ sonic-db-cli APPL_DB HGETALL NEXTHOP_GROUP_TABLE:ID127
{'nexthop': '10.0.0.1,10.0.0.3', 'ifname': 'Ethernet0,Ethernet4', 'weight': '1,1'}

admin@sonic:~$ sonic-db-cli APPL_DB keys \* | grep ROUTE
ROUTE_TABLE:10.1.1.4

admin@sonic:~$ sonic-db-cli APPL_DB HGETALL ROUTE_TABLE:10.1.1.4
{'nexthop_group': 'ID127', 'protocol': 'bgp'}
```
SeqEventAttributesValue
1RTM_NEWNEXTHOPNHA_IDID125
NHA_GATEWAY10.0.0.1
NHA_OIF22
2RTM_NEWNEXTHOPNHA_IDID126
NHA_GATEWAY10.0.0.3
NHA_OIF23
3RTM_NEWNEXTHOPNHA_IDID127
NHA_GROUP[{125,1},{126,1}]
4RTM_NEWROUTERTA_DST10.1.1.4
RTA_NH_IDID127
A short description of the `fpmsyncd` logic flow:

- When receiving the `RTM_NEWNEXTHOP` events in sequences 1, 2 and 3, `fpmsyncd` saves the information in an internal list to be used when necessary.
- When `fpmsyncd` receives the `RTM_NEWROUTE` in sequence 4, the process writes the NextHop Group with ID127 to the `NEXTHOP_GROUP_TABLE`, using the gateway and interface information from the NextHop events with IDs 125 and 126.
- Then `fpmsyncd` creates a new route entry in `ROUTE_TABLE` with a `nexthop_group` field with value `ID127`.
- When `fpmsyncd` receives a subsequent `RTM_NEWROUTE` referencing the same NextHop Group ID, the process creates a new route entry (but no new NextHop Group entry) in `ROUTE_TABLE` with the `nexthop_group` field set to `ID127`. (Note: the NextHop Group entry was already created when `fpmsyncd` received the event in sequence 4.)
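The bookkeeping above can be sketched as follows (a hypothetical Python model for illustration; the real fpmsyncd is C++ and the handler names here are assumptions), replaying the event sequence from the table:

```python
# Hypothetical model of the fpmsyncd bookkeeping described above.
nh_cache = {}  # NHA_ID -> {"gateway","ifname"} for singletons, or {"group": [...]}
appl_db = {}   # stands in for APPL_DB hashes

def on_newnexthop(nha_id, gateway=None, ifname=None, group=None):
    """Cache RTM_NEWNEXTHOP content until a route references it."""
    nh_cache[nha_id] = {"group": group} if group else {"gateway": gateway, "ifname": ifname}

def on_newroute(prefix, nh_id, protocol="bgp"):
    """Handle RTM_NEWROUTE carrying RTA_NH_ID."""
    entry = nh_cache[nh_id]
    if "group" in entry:  # multipath: write the shared NHG, then reference it
        members = [nh_cache[i] for i, _ in entry["group"]]
        appl_db[f"NEXTHOP_GROUP_TABLE:{nh_id}"] = {
            "nexthop": ",".join(m["gateway"] for m in members),
            "ifname": ",".join(m["ifname"] for m in members),
            "weight": ",".join(str(w) for _, w in entry["group"]),
        }
        appl_db[f"ROUTE_TABLE:{prefix}"] = {"nexthop_group": nh_id, "protocol": protocol}
    else:  # single path: keep the pre-NHG flat format (no NHG entry is created)
        appl_db[f"ROUTE_TABLE:{prefix}"] = {"nexthop": entry["gateway"],
                                            "ifname": entry["ifname"],
                                            "protocol": protocol}

# Sequences 1-4 from the table (interface IDs shown as names for brevity):
on_newnexthop("ID125", gateway="10.0.0.1", ifname="Ethernet0")
on_newnexthop("ID126", gateway="10.0.0.3", ifname="Ethernet4")
on_newnexthop("ID127", group=[("ID125", 1), ("ID126", 1)])
on_newroute("10.1.1.4", "ID127")
print(appl_db)
```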
#### Example of entries in ASIC_DB

The `ASIC_DB` entries are not changed by this enhancement.
Therefore, even after this enhancement, table entries will be created for `ROUTE_ENTRY`, `NEXT_HOP_GROUP`, `NEXT_HOP_GROUP_MEMBER`, and `NEXT_HOP` respectively, as shown in the example below.

##### Figure: Example of ASIC_DB entry
![fig3](images_fpmsyncd/fig3.svg)

### SAI API

No changes are being made in SAI.
The end result of what gets programmed via SAI will be the same as the current implementation when manually adding `NEXTHOP_GROUP_TABLE` entries to `APPL_DB`.

### Configuration and management

This NextHop Group feature is enabled/disabled by a config option of zebra (BGP container): `[no] fpm use-next-hop-groups`

- To disable this feature (default): configure `no fpm use-next-hop-groups`
- To enable this feature: configure `fpm use-next-hop-groups`

On FRR, one can configure this zebra option via vtysh (zebra CLI) or `zebra.conf` (zebra startup config).

In SONiC, we will use CONFIG_DB data to enable/disable this option, to be consistent with other SONiC features.
We will also use `config_db.json` to preserve the config across system reboots.

Users (SONiC admins) are expected to use only the SONiC CLI or edit the `config_db.json` file to enable/disable this feature, and should not edit `zebra.conf` directly.

This configuration is backward compatible. Upgrading from a SONiC version that does not support this feature does not change the user's expected behavior, as this flag is disabled by default. (i.e. it is disabled if the `FEATURE|nexthop_group` entry does not exist in CONFIG_DB)

This setting can NOT be enabled or disabled at runtime.
A system reboot is required after enabling/disabling this feature to make sure route entries using and not using this NHG feature do not co-exist in the `APPL_DB`.

#### Configuration data flow

The diagram below shows how `zebra.conf` is generated from CONFIG_DB data.

##### Figure: Configuration data flow
![fig4](images_fpmsyncd/fig4-config.svg)

- The CONFIG_DB entry is created via CLI or from data stored in the `config_db.json` file
- `sonic-cfggen` generates `zebra.conf` based on the template file named `zebra.conf.j2`
- FRR uses `zebra.conf` during startup to apply the config stored in the file

This flow is an existing framework and is not specific to this feature.

The modification made for this feature is in `zebra.conf.j2`, to generate config with `[no] fpm use-next-hop-groups` based on the `DEVICE_METADATA|localhost` entry in CONFIG_DB.

As shown in the diff code below, the template generates config following this logic:

- If `DEVICE_METADATA|localhost` is present in CONFIG_DB but there is no "nexthop_group" attribute => disabled
- If `DEVICE_METADATA|localhost` is present in CONFIG_DB and the "nexthop_group" attribute is "enabled" => enabled

```
> zebra.conf.j2

 {% endblock banner %}
 !
 {% block fpm %}
+{% if ( ('localhost' in DEVICE_METADATA) and ('nexthop_group' in DEVICE_METADATA['localhost']) and
+        (DEVICE_METADATA['localhost']['nexthop_group'] == 'enabled') ) %}
+fpm use-next-hop-groups
+{% else %}
 ! Uses the old known FPM behavior of including next hop information in the route (e.g. RTM_NEWROUTE) messages
 no fpm use-next-hop-groups
+{% endif %}
 !
 fpm address 127.0.0.1
 {% endblock fpm %}
```

#### CLI/YANG model Enhancements

The output of 'show ip route' and 'show ipv6 route' will remain unchanged - the CLI code will resolve the NextHop Group ID referenced in the `ROUTE_TABLE` to display the next hops for the routes.

To enable/disable this feature, two new CLI (Klish) commands will be introduced.

- Enable: `feature next-hop-group enable`
- Disable: `no feature next-hop-group`

The CONFIG_DB entry will be created (enable) or removed (disable) by entering the above CLI commands.

This setting is read at boot time during FRR startup, so it requires a reboot once it is changed and saved to the startup configuration.
After the config is changed by CLI (Klish via RESTCONF), the user must run `sudo config save -y` in order for the configuration to be saved in `config_db.json` and take effect after the system restart.

Below is an example of using these CLI commands to enable/disable the feature.

Enable

```sh
admin@sonic:~$ redis-cli -n 4 hget "DEVICE_METADATA|localhost" nexthop_group
(nil)

admin@sonic:~$ sonic-cli

sonic# configure terminal

sonic(config)#
  end        Exit to EXEC mode
  exit       Exit from current mode
  feature    Configure additional feature
  interface  Select an interface
  ip         Global IP configuration subcommands
  mclag      domain
  no         To delete / disable commands in config mode

sonic(config)# feature
  next-hop-group  Next-hop Groups feature

sonic(config)# feature next-hop-group
  enable  Enable Next-hop Groups feature

sonic(config)# feature next-hop-group enable

admin@sonic:~$ redis-cli -n 4 hget "DEVICE_METADATA|localhost" nexthop_group
"enabled"
```

Disable

```sh
sonic(config)# no
  feature  Disable additional feature
  ip       Global IP configuration subcommands
  mclag    domain

sonic(config)# no feature
  next-hop-group  Disable Next-hop Groups feature

sonic(config)# no feature next-hop-group

admin@sonic:~$ redis-cli -n 4 hget "DEVICE_METADATA|localhost" nexthop_group
(nil)
```

Implementation:

- A new CLI actioner `sonic-cli-feature.py` will be added for this CLI command.
- The CLI command will be defined in a new cli-xml file: `/CLI/clitree/cli-xml/sonic-feature.xml`

When the actioner `sonic-cli-feature.py` is called from the Klish framework, it will call RESTCONF to create / remove the CONFIG_DB entry.
- enable: `$SONIC_CLI_ROOT/sonic-cli-feature.py configure_sonic_nexthop_groups 1`
- disable: `$SONIC_CLI_ROOT/sonic-cli-feature.py configure_sonic_nexthop_groups 0`
- RESTCONF URI called from `sonic-cli-feature.py`: `/restconf/data/sonic-feature:sonic-feature`

The model is not newly introduced; it uses the pre-existing `sonic-device_metadata.yang` model present in the source code at https://github.com/sonic-net/sonicbuildimage/blob/master/src/sonic-yang-models/yang-models/sonic-device_metadata.yang

```
module: sonic-device_metadata
  +--rw sonic-device_metadata
     +--rw DEVICE_METADATA
        +--rw localhost
           +--rw hwsku?                           stypes:hwsku
           +--rw default_bgp_status?              enumeration
           +--rw docker_routing_config_mode?      string
           +--rw hostname?                        stypes:hostname
           +--rw platform?                        string
           +--rw mac?                             yang:mac-address
           +--rw default_pfcwd_status?            enumeration
           +--rw bgp_asn?                         inet:as-number
           +--rw deployment_id?                   uint32
           +--rw type?                            string
           +--rw buffer_model?                    string
           +--rw frr_mgmt_framework_config?       boolean
           +--rw synchronous_mode?                enumeration
           +--rw yang_config_validation?          stypes:mode-status
           +--rw cloudtype?                       string
           +--rw region?                          string
           +--rw sub_role?                        string
           +--rw downstream_subrole?              string
           +--rw resource_type?                   string
           +--rw cluster?                         string
           +--rw subtype?                         string
           +--rw peer_switch?                     stypes:hostname
           +--rw storage_device?                  boolean
           +--rw asic_name?                       string
           +--rw switch_id?                       uint16
           +--rw switch_type?                     string
           +--rw max_cores?                       uint8
           +--rw dhcp_server?                     stypes:admin_mode
           +--rw bgp_adv_lo_prefix_as_128?        boolean
           +--rw suppress-fib-pending?            enumeration
           +--rw rack_mgmt_map?                   string
           +--rw timezone?                        stypes:timezone-name-type
           +--rw create_only_config_db_buffers?   boolean
           +--rw nexthop_group?                   enumeration
```

#### Config DB Enhancements

This feature should be disabled/enabled using the existing CONFIG_DB `DEVICE_METADATA` table.
The key name will be `DEVICE_METADATA|localhost` with a `nexthop_group` attribute.

Configuration schema in ABNF format:

```
; DEVICE_METADATA table
key           = DEVICE_METADATA|localhost  ; DEVICE_METADATA configuration table
nexthop_group = "enabled" or "disabled"    ; Globally enable/disable the next-hop group feature,
                                           ; by default this flag is disabled
```

A sample CONFIG_DB snippet is given below:

```
    "DEVICE_METADATA": {
        "localhost": {
            "bgp_asn": "65100",
            "buffer_model": "traditional",
            "default_bgp_status": "up",
            "default_pfcwd_status": "disable",
            "hostname": "sonic",
            "hwsku": "Force10-S6000",
            "mac": "50:00:00:0f:00:00",
            "nexthop_group": "enabled",
            "platform": "x86_64-kvm_x86_64-r0",
            "timezone": "UTC",
            "type": "LeafRouter"
        }
    },
```

### Warmboot and Fastboot Design Impact

- When the feature is disabled, there should be no impact to Warmboot and Fastboot.
- When the feature is enabled, there will be no warmboot nor fastboot support.

When the feature is enabled, the NHG ID is managed by FRR, and it will change after an FRR-related process or BGP container restart.
We need a way to either let FRR preserve the ID, or a way to correlate the NHGs, IDs and their members before and after the restart.

We will continue the discussion on how we could support Warmboot/Fastboot in future enhancements.

### Testing Requirements/Design

One can use the `redis-cli` command to check entries in CONFIG_DB.
```sh
admin@sonic:/etc/sonic$ redis-cli -n 4 hget "DEVICE_METADATA|localhost" nexthop_group
"enabled"
```

#### Unit Test cases

#### Config test cases (feature enable/disable)

Confirm the feature is disabled by default.

1. Boot SONiC with the default config (clean install)
2. Check there is no `DEVICE_METADATA|localhost` entry in CONFIG_DB
3. Log into the BGP container. Check `/etc/sonic/frr/zebra.conf` has the config `no fpm use-next-hop-groups`

CONFIG_DB entry add/del via Klish CLI

1. From the CLI, enter `feature next-hop-group enable`
2. Confirm a `DEVICE_METADATA|localhost` entry with attr `nexthop_group=enabled` is created in CONFIG_DB
3. From the CLI, enter `no feature next-hop-group`
4. Confirm the `DEVICE_METADATA|localhost` entry does not exist in CONFIG_DB

`zebra.conf` option based on CONFIG_DB entry (disable)

1. Confirm the `DEVICE_METADATA|localhost` entry does not exist in CONFIG_DB
2. Reboot the system
3. Confirm `/etc/sonic/frr/zebra.conf` has the config `no fpm use-next-hop-groups`

`zebra.conf` option based on CONFIG_DB entry (enable)

1. Confirm a `DEVICE_METADATA|localhost` entry with attr `nexthop_group=enabled` exists in CONFIG_DB
2. Reboot the system
3. Confirm `/etc/sonic/frr/zebra.conf` has the config `fpm use-next-hop-groups`

#### System Test cases
##### Multiple NextHops
For multiple nexthops, ensure a next hop group is created.

Add route

1. Create a static or BGP route with 2 or more ECMP paths (which causes zebra to send `RTM_NEWNEXTHOP`)
2. Confirm `APPL_DB` entries are created as expected

Sample APPL_DB output when adding a route:
```
admin@sonic:~$ show ip route
B>*10.1.1.2/32 [20/0] via 10.0.0.1, Ethernet0, 00:00:14
  *                   via 10.0.0.3, Ethernet4, 00:00:14

admin@sonic:~$ sonic-db-cli APPL_DB keys \* | grep NEXT
NEXTHOP_GROUP_TABLE:ID94

admin@sonic:~$ sonic-db-cli APPL_DB HGETALL NEXTHOP_GROUP_TABLE:ID94
{'nexthop': '10.0.0.1,10.0.0.3', 'ifname': 'Ethernet0,Ethernet4', 'weight': '1,1'}

admin@sonic:~$ sonic-db-cli APPL_DB keys \* | grep ROUTE
ROUTE_TABLE:10.1.1.2

admin@sonic:~$ sonic-db-cli APPL_DB HGETALL ROUTE_TABLE:10.1.1.2
{'nexthop_group': 'ID94', 'protocol': 'bgp'}
```

Del route

1. Delete all nexthops except one.
2. Confirm `APPL_DB` entries are deleted as expected

Sample APPL_DB output when deleting a route:
```
admin@sonic:~$ show ip route
B>*10.1.1.2/32 [20/0] via 10.0.0.3, Ethernet4, 00:00:14

admin@sonic:~$ sonic-db-cli APPL_DB keys \* | grep NEXT

admin@sonic:~$ sonic-db-cli APPL_DB keys \* | grep ROUTE
ROUTE_TABLE:10.1.1.2

admin@sonic:~$ sonic-db-cli APPL_DB HGETALL ROUTE_TABLE:10.1.1.2
{'nexthop': '10.0.0.3', 'ifname': 'Ethernet4', 'protocol': 'bgp'}
```

##### Single NextHops
For a single nexthop, ensure no next hop group is created.

Add route

1. Create a static or BGP route with a single nexthop
2. Confirm `APPL_DB` entries are created as expected

Sample APPL_DB output when adding a route:
```
admin@sonic:~$ show ip route
B>*10.1.1.3/32 [20/0] via 10.0.0.1, Ethernet0, 00:00:34
B>*10.1.1.4/32 [20/0] via 10.0.0.3, Ethernet4, 00:00:34

admin@sonic:~$ sonic-db-cli APPL_DB keys \* | grep NEXT

admin@sonic:~$ sonic-db-cli APPL_DB keys \* | grep ROUTE
ROUTE_TABLE:10.1.1.3
ROUTE_TABLE:10.1.1.4

admin@sonic:~$ sonic-db-cli APPL_DB HGETALL ROUTE_TABLE:10.1.1.3
{'nexthop': '10.0.0.1', 'ifname': 'Ethernet0', 'protocol': 'bgp'}

admin@sonic:~$ sonic-db-cli APPL_DB HGETALL ROUTE_TABLE:10.1.1.4
{'nexthop': '10.0.0.3', 'ifname': 'Ethernet4', 'protocol': 'bgp'}
```

Del route

1. Delete the static or BGP route created in the previous test
2. Confirm `APPL_DB` entries are deleted as expected

Sample APPL_DB output when deleting a route:
```
admin@sonic:~$ show ip route
B>*10.1.1.4/32 [20/0] via 10.0.0.3, Ethernet4, 00:09:30

admin@sonic:~$ sonic-db-cli APPL_DB keys \* | grep NEXT

admin@sonic:~$ sonic-db-cli APPL_DB keys \* | grep ROUTE
ROUTE_TABLE:10.1.1.4

admin@sonic:~$ sonic-db-cli APPL_DB HGETALL ROUTE_TABLE:10.1.1.4
{'nexthop': '10.0.0.3', 'ifname': 'Ethernet4', 'protocol': 'bgp'}
```

### Open/Action items - if any

#### libnl compatibility with upstream

To add this feature, we have extended `libnl` to support NextHop Group (i.e. `nh_id`, `RTM_NEWNEXTHOP` etc.).

However, there is a proposal [libnl: PR#332](https://github.com/thom311/libnl/pull/332/) to support NextHop Group in upstream `libnl`.
We should review this PR (and any other related patches if found) so that the difference from the upstream code is minimal.

#### Further performance improvements

The extension to fpmsyncd described in this HLD only changes how `fpmsyncd` handles `RTM_NEWNEXTHOP` and `RTM_DELNEXTHOP`.

Further study is required for more fundamental improvements, e.g. how zebra handles NextHop Groups at scale, the communication channel between zebra and fpmsyncd, improvements in FRR like BGP PIC support, etc.

Refer to the meeting minutes of the [SONiC Routing Working Group](https://lists.sonicfoundation.dev/g/sonic-wg-routing/wiki) for discussions related to future improvements.
For the discussion specific to this HLD, check the [07132023 Meeting Minutes](https://lists.sonicfoundation.dev/g/sonic-wg-routing/wiki/34321)

#### Backward compatibility with current NHG creation logic (Fine-grain NHG, Ordered NHG/ECMP)

This feature is disabled by default and is thus backward compatible: it does not impact the current NHG creation logic in SWSS/Orchagent.

When enabled, NHG ID and member management is handled by FRR, and the current NHG creation logic in SWSS/Orchagent is not used.
i.e. the behavior will be the same as the current behavior of manually adding entries to `APPL_DB: NEXTHOP_GROUP_TABLE`.

#### nexthop_compat_mode Kernel option

In regards to NextHop Group, the Linux kernel runs in a compatibility mode which sends netlink messages using both the old route format without `RTA_NH_ID` and the new format using `RTA_NH_ID`.

There is a `sysctl` option `net.ipv4.nexthop_compat_mode` which is on by default but provides the ability to turn off compatibility mode, allowing the system to only send route updates with the new format, which could potentially improve performance.

This option is not changed as part of this HLD, to avoid unexpected impact to the existing behavior.

One should carefully study the impact of this change before changing this option.

#### Warmboot/Fastboot support

Currently, this feature does not work with Warmboot/Fastboot.
We will continue the discussion on how we could support Warmboot/Fastboot in future enhancements.

#### No support for setting config enable/disable on runtime

This feature can NOT be enabled or disabled at runtime.
A reboot is required after enabling/disabling this feature to make sure route entries using and not using this NHG feature do not co-exist in the `APPL_DB`.

#### Source of APPL_DB entry related to NHG

The expectation today is that there is only one source, FRR or some other routing container, modifying NHG-related entries in `APPL_DB`.

If there is any use case for more than one source, then the design of the `APPL_DB` schema and related logic needs to be studied.
For example, we might need an additional attr/entity to distinguish the source of the NHG/NH entry.

Note that this is not specific to the NHG feature but is a typical limitation whenever more than one entity modifies the same `APPL_DB` entry.
diff --git a/doc/pic/images/background.png b/doc/pic/images/background.png new file mode 100644 index 00000000000..89c94a8ab21 Binary files /dev/null and b/doc/pic/images/background.png differ
diff --git a/doc/pic/images/bgp_peering.png b/doc/pic/images/bgp_peering.png new file mode 100644 index 00000000000..2ebe7ea199a Binary files /dev/null and b/doc/pic/images/bgp_peering.png differ
diff --git a/doc/pic/images/core_failure_local_flow.png b/doc/pic/images/core_failure_local_flow.png new file mode 100644 index 00000000000..92595e907b3 Binary files /dev/null and b/doc/pic/images/core_failure_local_flow.png differ
diff --git a/doc/pic/images/core_failure_remote.png b/doc/pic/images/core_failure_remote.png new file mode 100644 index 00000000000..d1e99d3004c Binary files /dev/null and b/doc/pic/images/core_failure_remote.png differ
diff --git a/doc/pic/images/core_failure_remote_flow.png b/doc/pic/images/core_failure_remote_flow.png new file mode 100644 index 00000000000..22b9ab3f2cb Binary files /dev/null and b/doc/pic/images/core_failure_remote_flow.png differ
diff --git a/doc/pic/images/nhg_concept.png b/doc/pic/images/nhg_concept.png new file mode 100644 index 00000000000..bb7bdcae465 Binary files /dev/null and b/doc/pic/images/nhg_concept.png differ
diff --git a/doc/pic/images/nhg_flow.png b/doc/pic/images/nhg_flow.png new file mode 100644 index 00000000000..0170768e6a8 Binary files /dev/null and b/doc/pic/images/nhg_flow.png differ
diff --git a/doc/pic/images/pic_core.png b/doc/pic/images/pic_core.png new file mode 100644 index 00000000000..f1e7ffed245 Binary files /dev/null and b/doc/pic/images/pic_core.png differ
diff --git a/doc/pic/images/pic_edge.png b/doc/pic/images/pic_edge.png new file mode 100644 index 00000000000..652509f049e Binary files /dev/null and b/doc/pic/images/pic_edge.png differ
diff --git a/doc/pic/images/pic_local.png b/doc/pic/images/pic_local.png new file mode 100644 index 00000000000..f1a99186b14 Binary files /dev/null and b/doc/pic/images/pic_local.png differ
diff --git a/doc/pic/images_fpmsyncd/fig1.svg b/doc/pic/images_fpmsyncd/fig1.svg new file mode 100644 index 00000000000..19074d2f940 --- /dev/null +++ b/doc/pic/images_fpmsyncd/fig1.svg @@ -0,0 +1 @@
[fig1.svg: sequence diagram (Zebra -> fpmsyncd -> redis-server): add route: RTM_NEWROUTE -> HSET ROUTE_TABLE:PREFIX; del route: RTM_DELROUTE -> HDEL ROUTE_TABLE:PREFIX]
\ No newline at end of file diff --git a/doc/pic/images_fpmsyncd/fig2.svg b/doc/pic/images_fpmsyncd/fig2.svg new file mode 100644 index 00000000000..e25ce00e93f --- /dev/null +++ b/doc/pic/images_fpmsyncd/fig2.svg @@ -0,0 +1 @@ +
[fig2.svg: sequence diagram (Zebra -> fpmsyncd -> redis-server): add route: RTM_NEWNEXTHOP events -> HSET NEXT_HOP_GROUP_TABLE:PREFIX, then -> HSET ROUTE_TABLE:PREFIX]
\ No newline at end of file diff --git a/doc/pic/images_fpmsyncd/fig3.svg b/doc/pic/images_fpmsyncd/fig3.svg new file mode 100644 index 00000000000..e40664e4a4b --- /dev/null +++ b/doc/pic/images_fpmsyncd/fig3.svg @@ -0,0 +1,4 @@ + + + +
[fig3.svg: ASIC_DB objects: ROUTE_ENTRY {dest: 8.8.8.0/24} -> NEXT_HOP_GROUP (oid:5); NEXT_HOP_GROUP_MEMBER oid:6 and oid:7 reference NEXT_HOP oid:3 and oid:4 (NEXT_HOP_ATTR_IP: 10.0.1.5, NEXT_HOP_ATTR_ROUTER_INTERFACE_ID: 1)]
\ No newline at end of file diff --git a/doc/pic/images_fpmsyncd/fig4-config.svg b/doc/pic/images_fpmsyncd/fig4-config.svg new file mode 100644 index 00000000000..67d33e705a4 --- /dev/null +++ b/doc/pic/images_fpmsyncd/fig4-config.svg @@ -0,0 +1,4 @@ + + + +
[fig4-config.svg: configuration data flow: Management Framework (Klish CLI) and config_db.json -> CONFIG_DB -> sonic-cfggen (using zebra.conf.j2) -> zebra.conf -> FRR (zebra) in the BGP container]
\ No newline at end of file
diff --git a/doc/pic/images_fpmsyncd/fpmsyncd-nhg-architecture.png b/doc/pic/images_fpmsyncd/fpmsyncd-nhg-architecture.png new file mode 100644 index 00000000000..90b24d70dd7 Binary files /dev/null and b/doc/pic/images_fpmsyncd/fpmsyncd-nhg-architecture.png differ
diff --git a/doc/pmon/pmon-sensormon.md b/doc/pmon/pmon-sensormon.md new file mode 100644 index 00000000000..31ec5c8be50 --- /dev/null +++ b/doc/pmon/pmon-sensormon.md @@ -0,0 +1,260 @@

# SONiC PMON Sensor Monitoring Enhancement #

### Revision 1.0

## Table of Content
1. [Scope](#Scope)
2. [Definitions](#Definitions/Abbreviations)
3. [Overview](#Overview)
4. [Requirements](#Requirements)
5. [High Level Design](#High-Level-Design)
6. [CLI](#CLI-Enhancements)
7. [Test](#Testing-Considerations)

### Scope

This document covers the support for monitoring voltage and current sensor devices in SONiC.

### Definitions/Abbreviations

PMON - Platform Monitor container in SONiC.

PSU - Power Supply Unit

Voltage Sensor - Sensor device which can report a voltage measurement in the system.

Current Sensor - Sensor device which can report a current measurement in the system.

Altitude Sensor - Sensor device which can report the altitude of the system.

### Overview

Modern hardware systems have many different types of sensors and control devices. Voltage sensor devices can measure, and in some cases control, the voltages on the boards. Current sensor devices can measure current. It is also possible to have other types of sensors, such as altitude sensors. These devices can report measurements from different parts of the system which are useful for monitoring system health. For example, voltage controller devices distribute power across different parts of the system, such as the motherboard, daughterboards, etc., and can report voltage measurements from there. Often these devices can report under-voltage/over-voltage faults which should be monitored to alert the operator about any power-related failures in the system. This document provides an overview of monitoring voltage and current sensors in SONiC. The solution proposed in this document can be enhanced for other types of sensors as well.

Note that temperature sensor devices are managed via the SONiC ThermalCtlD daemon today. At this point there is no change proposed for ThermalCtlD. This proposed design can be used for voltage, current and other types of sensors.

Linux does provide some support for voltage and current sensor monitoring using the lm-sensors/hwmon infrastructure. However, there are a few limitations with that:

- Devices not supported by hwmon are not covered
- Simple devices which do not have inbuilt monitoring functions do not generate any alarms
- Platform-specific thresholds for monitoring are not available

The solution proposed in this document tries to address these limitations by extending the coverage to a larger set of devices and providing platform-specific thresholds for sensor monitoring.

### Requirements

This HLD covers:

* Discovery of voltage and current sensor devices in the system
* Monitoring the sensor devices periodically and updating that data in the Redis DB
* Raising alarms if the sensor devices report unexpected readings, and clearing them when the devices return to a good state
* Reporting sensor alarm conditions in system health
* Enabling such sensors in the Entity MIB
* A framework for adding new sensor types in the future

This HLD does not cover:

* An automated recovery action that the system might take as a result of a fault reported by the voltage sensor device. A network management system may process the alarms and take recovery action as it sees fit.

### High Level Design

The proposal for monitoring sensor devices is to create a new SensorMon daemon. SensorMon will use APIs provided by the platform to discover the sensor devices. It will periodically poll the devices to refresh the sensor information and update the readings in StateDB.

If the sensor device readings cross the minor/major/critical thresholds, syslogs will be generated to inform the operator about the alarm condition. If the sensor reports normal data in a subsequent poll, another syslog will be generated to indicate that the fault condition is cleared.

CLIs are provided to display the sensor devices, their measurements, their threshold values and whether they are reporting an alarm.

Platform APIs will provide:

* A list of sensor devices of a specific type in the system
* A way to read the sensor information

The following SONiC repositories will have changes.

#### sonic-platform-daemons

SensorMon will be a new daemon that will run in the PMON container. It will retrieve a list of sensors of the different sensor types from the platform during initialization. Subsequently, it will poll the sensor devices on a periodic basis and update their measurements in StateDB. SensorMon will also raise syslogs on alarm conditions.

Following is the DB schema for voltage and current sensor data.

##### Voltage Sensor StateDb Schema

    ; Defines information for a voltage sensor
    key                     = VOLTAGE_INFO|sensor_name ; Voltage sensor name
    ; field                 = value
    voltage                 = float   ; Voltage measurement
    unit                    = string  ; Unit for the measurement
    high_threshold          = float   ; High threshold
    low_threshold           = float   ; Low threshold
    critical_high_threshold = float   ; Critical high threshold
    critical_low_threshold  = float   ; Critical low threshold
    warning_status          = boolean ; Sensor value in range
    timestamp               = string  ; Last update time
    maximum_voltage         = float   ; Maximum recorded measurement
    minimum_voltage         = float   ; Minimum recorded measurement

##### Current Sensor StateDb Schema

    ; Defines information for a current sensor
    key                     = CURRENT_INFO|sensor_name ; Current sensor name
    ; field                 = value
    current                 = float   ; Current measurement
    unit                    = string  ; Unit for the measurement
    high_threshold          = float   ; High threshold
    low_threshold           = float   ; Low threshold
    critical_high_threshold = float   ; Critical high threshold
    critical_low_threshold  = float   ; Critical low threshold
    warning_status          = boolean ; Sensor value in range
    timestamp               = string  ; Last update time
    maximum_current         = float   ; Maximum recorded measurement
    minimum_current         = float   ; Minimum recorded measurement

#### sonic-platform-common

The Chassis base class will be enhanced with prototype methods for retrieving the number of sensors and the sensor objects of a specific type.

The Module base class will also be enhanced with similar methods for retrieving sensors present on the modules.

New base classes will be introduced for the new sensor types:

VsensorBase is introduced for voltage sensor objects.
IsensorBase is introduced for current sensor objects.
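A rough sketch of the shape such a base class could take is shown below (plain Python; the method names are illustrative assumptions, not the final platform API):

```python
from abc import ABC, abstractmethod

class VsensorBase(ABC):
    """Hypothetical base for voltage sensor objects handled by SensorMon."""

    @abstractmethod
    def get_name(self) -> str:
        """Sensor name, e.g. 'VP0P75_CORE_NPU0'."""

    @abstractmethod
    def get_value(self) -> float:
        """Current voltage measurement (e.g. in mV)."""

    @abstractmethod
    def get_high_threshold(self) -> float: ...

    @abstractmethod
    def get_low_threshold(self) -> float: ...

    @abstractmethod
    def get_minimum_recorded(self) -> float: ...

    @abstractmethod
    def get_maximum_recorded(self) -> float: ...

    def is_in_range(self) -> bool:
        # Used by the daemon to derive warning_status in StateDB.
        return self.get_low_threshold() <= self.get_value() <= self.get_high_threshold()
```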
These classes will have methods to retrieve the threshold information, the sensor value and the min/max recorded values from the sensor.

#### sonic-utilities

CLIs are introduced to retrieve and display sensor data from the State DB for the different sensor types. The CLIs are described in the next section.

#### sonic-buildimage

The CLI "show system-health" should report sensor fault conditions. The hardware health check script will need an enhancement to retrieve sensor data from StateDB.

#### sonic-snmpagent

Voltage and current sensors should be available in the Entity MIB. The Entity MIB implementation will need an enhancement to retrieve voltage and current sensors from the State DB.

### CLI Enhancements

The following CLIs are introduced to display the voltage and current sensor devices.

    root@sonic:/home/cisco# show platform voltage
    Sensor            Voltage        High TH    Low TH    Crit High TH    Crit Low TH    Warning    Timestamp
    ----------------  -------------  ---------  --------  --------------  -------------  ---------  -----------------
    VP0P75_CORE_NPU0  750 mV         852        684       872             664            False      20230204 11:35:21
    VP0P75_CORE_NPU1  750 mV         852        684       872             664            False      20230204 11:35:21
    VP0P75_CORE_NPU2  750 mV         852        684       872             664            False      20230204 11:35:22

    ...

    root@sonic:/home/cisco# show platform current
    Sensor          Current        High TH    Low TH    Crit High TH    Crit Low TH    Warning    Timestamp
    --------------  -------------  ---------  --------  --------------  -------------  ---------  -----------------
    POL_CORE_N0_I0  25000 mA       30000      18000     28000           15000          False      20230212 11:18:28
    POL_CORE_N0_I1  21562 mA       30000      18000     28000           15000          False      20230212 11:18:28
    POL_CORE_N0_I2  22250 mA       30000      18000     28000           15000          False      20230212 11:18:28

#### Configuration and Management

At this point, there is no configuration requirement for this feature.

If the SensorMon daemon is not desired to run in the system, an entry can be added to pmon_daemon_control.json to exclude it from running in the system.

It is advised to monitor the sensor alarms and use them to debug and identify any issues in the system. In the event a sensor crosses a high or low threshold, a syslog will be raised indicating the alarm.

    Jul 27 08:26:32.561330 sonic WARNING pmon#sensormond: High voltage warning: VP0P75_CORE_NPU2 current voltage 880mV, high threshold 856mV

The alarm condition will be visible in the CLI outputs for sensor data and system health.

    e.g.
    root@sonic:/home/cisco# show platform voltage
    Sensor            Voltage        High TH    Low TH    Crit High TH    Crit Low TH    Warning    Timestamp
    ----------------  -------------  ---------  --------  --------------  -------------  ---------  -----------------
    VP0P75_CORE_NPU2  880 mV         720        684       720             664            True       20230204 11:35:22

    root@sonic:/home/cisco# show system-health detail
    System status summary
    ...
      Hardware:
        Status: Not OK
        Reasons: Voltage sensor VP0P75_CORE_NPU2 measurement 880 mV out of range (679,856)
    ...
    VP0P75_CORE_NPU2     Not OK    voltage
    ...

##### Platform Sensors Configuration

Sensormond will use the platform APIs for retrieving platform sensor information. However, for platforms with only file-system/sysfs based drivers, a simple implementation is provided wherein the platform can specify the sensor information for the board and any submodules (such as fabric cards) in a data file, and Sensormond can use that for finding sensors and monitoring them.

The file-system/sysfs based platform sensor information can be provided using a yaml file. The yaml file shall have the following format.
    sensors.yaml

    voltage_sensors:
      - name: <sensor name>
        sensor: <sysfs path of the sensor value>
        high_thresholds: [ <minor>, <major>, <critical> ]
        low_thresholds: [ <minor>, <major>, <critical> ]
      ...

    current_sensors:
      - name: <sensor name>
        sensor: <sysfs path of the sensor value>
        high_thresholds: [ <minor>, <major>, <critical> ]
        low_thresholds: [ <minor>, <major>, <critical> ]
      ...

    <module name>:
      voltage_sensors:
        - name: <sensor name>
          sensor: <sysfs path of the sensor value>
          high_thresholds: [ <minor>, <major>, <critical> ]
          low_thresholds: [ <minor>, <major>, <critical> ]
        ...

      current_sensors:
        - name: <sensor name>
          sensor: <sysfs path of the sensor value>
          high_thresholds: [ <minor>, <major>, <critical> ]
          low_thresholds: [ <minor>, <major>, <critical> ]
        ...

##### PDDF Support

SONiC PDDF provides a data-driven framework to access platform HW devices. PDDF allows sensor access information to be read from platform-specific json files. PDDF support can be added for voltage and current sensors, which can then be retrieved by Sensormon.

#### Warmboot and Fastboot Design Impact

Warmboot and Fastboot should not be impacted by this feature. On a PMON container restart, the sensor monitoring should restart the same way as on boot.

### Testing Considerations

Unit test cases cover the CLI and sensor monitoring aspects. All SONiC common repos will have unit tests and meet the code coverage requirements for the respective repos. In addition, SONiC management tests will cover the feature on the target.

### Feature availability

The core implementation for the daemon process is available at this time along with the HLD. This includes changes in the following repos:

* sonic-platform-daemons
* sonic-platform-common
* sonic-utilities
* sonic-buildimage
* sonic-snmpagent

SONiC management test cases and PDDF support will be available in the next phase of development.
diff --git a/doc/port-si/Notify-Media-Setting-Process.drawio.svg b/doc/port-si/Notify-Media-Setting-Process.drawio.svg new file mode 100755 index 00000000000..12801b6dfda --- /dev/null +++ b/doc/port-si/Notify-Media-Setting-Process.drawio.svg @@ -0,0 +1,556 @@
[Notify-Media-Setting-Process.drawio.svg: sequence diagram across XCVRD/SFP, PORT_OA, SAI, SDK and FW: module insertion / xcvrd initialization triggers notify_media_settings(); get_media_settings_key() (extended format) and get_media_settings_value() look up media_settings.json; PORT_TABLE in APP_DB is updated; parsePortConfig() and getPortSerdesAttribute()/setPortSerdesAttribute() invoke sai_port_api.remove_port_serdes()/create_port_serdes(); if port.admin_state_up: ADMIN_DOWN, then ADMIN_UP; STATE_DB TRANSCEIVER_INFO / TRANSCEIVER_DOM_SENSOR / TRANSCEIVER_DOM_THRESHOLD / TRANSCEIVER_STATUS / TRANSCEIVER_PM tables are updated]
\ No newline at end of file diff --git a/doc/port-si/Port_SI_Per_Speed.md b/doc/port-si/Port_SI_Per_Speed.md new file mode 100755 index 00000000000..1787722ff0b --- /dev/null +++ b/doc/port-si/Port_SI_Per_Speed.md @@ -0,0 +1,230 @@ +# Feature Name +Port Signal Integrity Per Speed Enhancements

+# High Level Design Document +#### Rev 0.2 + +
+ +# Table of Contents + * [General Information](#general-information) + * [Revision](#revision) + * [About This Manual](#about-this-manual) + * [Definitions/Abbreviations](#definitionsabbreviations) + * [Reference](#reference) + * [Feature Motivation](#feature-motivation) + * [Design](#design) + * [New SERDES parameters](#new-serdes-parameters) + * [Application DB Enhancements](#application-db-enhancements) + * [Json Format For SI Parameters](#json-format-for-si-parameters) + * [How are we going to use this json?](#how-are-we-going-to-use-this-json) + * [Port SI configuration (flows)](#port-si-configuration-flows) + * [Changes to support the enhancements](#changes-to-support-the-enhancements) + * [Unit Test](#unit-test) + +

+ +# General Information + +## Revision +| Rev | Date | Author | Change Description | +|:---:|:-----------:|:------------------:|----------------------------------- | +| 0.1 | 08/28/2023 | Tomer Shalvi | Base version | + +## About this Manual +This document is the high level design of the Port SI Per Speed Enhancements feature. + +## Definitions/Abbreviations +| Term | Description | +|:--------:|:-------------------------------------------------:| +| SAI | Switch Abstraction Interface | +| SONiC | Software for Open Networking in the Cloud | +| xcvrd | Transceiver Daemon | + +## Reference +| Document | Description | +|:----------------------------------------------------------------------------------------------------------------------------------------------:|:----------------------------------------------------------------:| +| [cmis-init.md](https://github.com/sonic-net/SONiC/blob/master/doc/sfp-cmis/cmis-init.md) | CMIS initialization HLD. | +| [Media based port settings in SONiC](https://github.com/sonic-net/SONiC/blob/master/doc/media-settings/Media-based-Port-settings.md) | Media based port settings HLD. | +

+ +# Feature Motivation +Today SONiC has the capability to apply port SI configuration. To achieve optimal configuration of their SerDes, some vendors wish to have the option to configure SerDes SI based on lane speeds.
+This design will focus on enhancing the current ASIC configuration to provide support for configurations that take into account lane speed. +

+ + + +# Design + +The functionality of configuring the SerDes with the NOS already exists; however, the configuration is currently limited to the plugged-in module and lacks the capability to differentiate between the various lane speeds of a module.
+The suggested enhancements will introduce the following additions: + + 1. The ability to support different port SI configurations for different lane speeds. + 2. Expansion of the existing set of supported SI parameters. + +Important note: Backward compatibility will be maintained. Vendors who wish to continue configuring port SI irrespective of the lane speed will be able to do so without making any code changes. +

+ + +# New SERDES parameters

## Application DB Enhancements

Six new fields: **ob_m2lp**, **ob_alev_out**, **obplev**, **obnlev**, **regn_bfm1p**, **regn_bfm1n**, will be added to **PORT_TABLE**:

```
ob_m2lp     = 1*8HEXDIG *( "," 1*8HEXDIG) ; list of hex values, one per lane ; ratio between the central eye to the upper and lower eyes (for PAM4 only)
ob_alev_out = 1*8HEXDIG *( "," 1*8HEXDIG) ; list of hex values, one per lane ; output common mode
obplev      = 1*8HEXDIG *( "," 1*8HEXDIG) ; list of hex values, one per lane ; output buffers input to common mode PMOS side
obnlev      = 1*8HEXDIG *( "," 1*8HEXDIG) ; list of hex values, one per lane ; output buffers input to common mode NMOS side
regn_bfm1p  = 1*8HEXDIG *( "," 1*8HEXDIG) ; list of hex values, one per lane ; voltage regulator to pre output buffer PMOS side
regn_bfm1n  = 1*8HEXDIG *( "," 1*8HEXDIG) ; list of hex values, one per lane ; voltage regulator to pre output buffer NMOS side
```
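For illustration, a hypothetical APP_DB PORT_TABLE entry carrying these fields for a 4-lane port could look as follows (the port name and hex values are examples only, not recommended settings):

```
PORT_TABLE:Ethernet0
    "ob_m2lp"     : "0x1f,0x1f,0x1f,0x1f"
    "ob_alev_out" : "0x2a,0x2a,0x2a,0x2a"
    "obplev"      : "0x3c,0x3c,0x3c,0x3c"
    "obnlev"      : "0x3c,0x3c,0x3c,0x3c"
    "regn_bfm1p"  : "0x12,0x12,0x12,0x12"
    "regn_bfm1n"  : "0x12,0x12,0x12,0x12"
```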
+Here is the table for mapping the new SI fields and SAI attributes:

| Parameter   | sai_port_attr_t                           |
|:-----------:|:-----------------------------------------:|
| ob_m2lp     | SAI_PORT_SERDES_ATTR_TX_PAM4_RATIO        |
| ob_alev_out | SAI_PORT_SERDES_ATTR_TX_OUT_COMMON_MODE   |
| obplev      | SAI_PORT_SERDES_ATTR_TX_PMOS_COMMON_MODE  |
| obnlev      | SAI_PORT_SERDES_ATTR_TX_NMOS_COMMON_MODE  |
| regn_bfm1p  | SAI_PORT_SERDES_ATTR_TX_PMOS_VLTG_REG     |
| regn_bfm1n  | SAI_PORT_SERDES_ATTR_TX_NMOS_VLTG_REG     |

These new SAI attributes were code reviewed by the SAI community and are now merged, available in the latest version of the saiport.h file: https://github.com/opencomputeproject/SAI/blob/master/inc/saiport.h#L3653C29-L3653C29

+ + +## Json format for SI parameters + +To ensure that these SI values are transmitted properly to the SDK, a JSON file, called _media_settings.json_, is used.
+The format of this JSON is going to be modified to support the dependency on lane speed and the new SI parameters added to APP_DB.
+The updated format is the following:

![media_settings.json template](media_settings_template.png)

Within the "PORT_MEDIA_SETTINGS" section (or "GLOBAL_MEDIA_SETTINGS" if dealing with port ranges instead of individual ports), the SI values for each port are organized into four levels of hierarchy:

* The first level is the vendor_key/media_key level.
* The second level is the lane_speed_key level, which specifies the port speed and lane count. Entries at this hierarchy level always begin with the prefix 'speed:'.
* The third level contains the names of the SI fields.
* Finally, at the last hierarchy level, the corresponding values for these fields are presented.

The lane_speed_key is the only new hierarchy level in this JSON. All other hierarchy levels already exist in the current format.
The parser for media_settings.json has been updated to be compatible with JSONs of both the updated format and the current format. This ensures that vendors whose JSON does not include the lane_speed_key hierarchy level can continue working with their existing JSONs without changing anything.
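For illustration, a hypothetical fragment of the updated format might look as follows (the vendor key, speed keys, lane layout and values are examples only):

```
{
    "PORT_MEDIA_SETTINGS": {
        "7": {
            "AMPHENOL-1234": {
                "speed:400GAUI-8": {
                    "ob_m2lp":     { "lane0": "0x0000001f", "lane1": "0x0000001f" },
                    "ob_alev_out": { "lane0": "0x0000002a", "lane1": "0x0000002a" }
                },
                "speed:100GAUI-2": {
                    "ob_m2lp":     { "lane0": "0x0000001d", "lane1": "0x0000001d" }
                }
            }
        }
    }
}
```

A JSON in the current format simply omits the "speed:..." level and places the SI field names directly under the vendor_key/media_key.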

+ + +## How are we going to use this json? + +The flow of using this json will be referred to as the **_Notify-Media-Settings-Process_**: + +![Notify Media Setting Process](Notify-Media-Setting-Process.drawio.svg) +
_The red blocks in this diagram represent required changes in the code._ +
+ + +## Port SI configuration (flows) + +Currently, the Notify-Media-Settings-Process is carried out only in the initialization phase of xcvrd and whenever a module is plugged in. After applying the port SI per speed enhancements, it will also be carried out upon port speed change events: whenever a port speed change is detected by listening to CONFIG_DB, the Notify-Media-Settings-Process will be called to send the most applicable SI values in the JSON to SAI. +Port speed changes require invoking the Notify-Media-Settings-Process because after such a change, the lane_speed_key used for lookup in the JSON changes accordingly, and the previously configured SI values in the ASIC are no longer relevant. +

+ + + +# Changes to support the enhancements + + +1. Changes in SfpStateUpdateTask thread:

+ With the port SI per speed enhancements applied, we rely on the lane speed when we look up in _media_settings.json_. Hence, this flow has to be triggered not only by insertion/removal of modules, but by speed configuration changes as well.
+ In order to establish the dependency of port SI configurations on lane speed, we need to be able to monitor speed configuration changes. Therefore, we will add to the SfpStateUpdateTask a listener to detect such changes: a new member will be added to SfpStateUpdateTask to listen to changes in CONFIG_DB.PORT_TABLE, and once such a change is detected, _notify_media_settings()_ will be triggered. Additionally, the SfpStateUpdateTask thread will have a new dictionary that will store the speed and number of lanes for each port. A minimal sketch of such a listener is shown below. +
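   The following is a minimal sketch of such a listener, assuming the swsscommon Python bindings and a hypothetical on_port_speed_change() callback (the names are illustrative; the actual change lives inside SfpStateUpdateTask):

   ```python
   from swsscommon import swsscommon

   def listen_for_speed_changes(on_port_speed_change):
       """Watch CONFIG_DB PORT entries and report speed/lanes changes."""
       config_db = swsscommon.DBConnector("CONFIG_DB", 0)
       port_tbl = swsscommon.SubscriberStateTable(config_db, "PORT")
       sel = swsscommon.Select()
       sel.addSelectable(port_tbl)

       port_speed_lanes = {}  # per-port cache of (speed, lanes)
       while True:
           state, _ = sel.select(1000)  # wake up at least once a second
           if state != swsscommon.Select.OBJECT:
               continue
           port, op, fvs = port_tbl.pop()
           if op != "SET":
               continue
           fvs = dict(fvs)
           new_value = (fvs.get("speed"), fvs.get("lanes"))
           if port_speed_lanes.get(port) != new_value:
               port_speed_lanes[port] = new_value
               # re-apply SI settings, e.g. by running notify_media_settings()
               on_port_speed_change(port, *new_value)
   ```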

+ + +2. The _XCVRD::notify_media_settings()_ function should be modified to support the updated format of _media_settings.json_: + + - The method _get_media_settings_key()_ should be extended: + + We need to extend the key used for lookup in the '_media_settings.json_' file to consider lane speed.
+ Currently, there are two types of keys: 'vendor_key' (vendor name + vendor part number, for example: 'AMPHENOL-1234') and 'media_key' (media type + media_compliance_code + media length, for example: 'QSFP28-40GBASE-CR4-1M').
+ In the new format of '_media_settings.json_', the '_get_media_settings_key()_' method will return three values instead of the two values described above. The additional value returned from this method will be the 'lane_speed_key', for example: 'speed:400GAUI-8' (where 'speed:' is the lane_speed_key prefix, '400' refers to the port speed and '8' refers to the lane count).

+ + How will the 'lane_speed_key' be calculated?
+ Each module contains a list called Application Advertisement in its EEPROM, which lists all the speeds the module is compatible with. For example:

   ```
   Application Advertisement:
   400GAUI-8 C2M (Annex 120E) - Host Assign (0x1) - Active Cable assembly with BER < 2.6x10^-4 - Media Assign (0x1)
   IB EDR (Arch.Spec.Vol.2) - Host Assign (0x11) - Active Cable assembly with BER < 10^-12 - Media Assign (0x11)
   200GAUI-4 C2M (Annex 120E) - Host Assign (0x11) - Active Cable assembly with BER < 2.6x10^-4 - Media Assign (0x11)
   CAUI-4 C2M (Annex 83E) without FEC - Host Assign (0x11) - Active Cable assembly with BER < 10^-12 - Media Assign (0x11)
   ```

   We will use this list to derive the lane_speed_key: we will iterate over the list and return the item whose port speed and lane count match those of the corresponding port, as extracted from CONFIG_DB. The existing method '_get_cmis_application_desired()_' performs exactly this task, so we will use it to calculate the new key. A simplified sketch of this selection is shown below. +
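   The following illustrative sketch shows that selection on an already-parsed advertisement list (the data layout and helper name are assumptions for illustration; the real implementation reuses _get_cmis_application_desired()_):

   ```python
   def derive_lane_speed_key(advertised_apps, port_speed_gbps, lane_count):
       """Pick the advertised application matching the port's speed and lane count.

       advertised_apps: list of (name, speed_gbps, host_lanes) tuples parsed from
       the module's Application Advertisement.
       Returns a key such as "speed:400GAUI-8", or None if nothing matches.
       """
       for name, speed_gbps, host_lanes in advertised_apps:
           if speed_gbps == port_speed_gbps and host_lanes == lane_count:
               return "speed:" + name
       return None

   # Example: a 400G port configured over 8 lanes
   apps = [("400GAUI-8", 400, 8), ("200GAUI-4", 200, 4), ("CAUI-4", 100, 4)]
   assert derive_lane_speed_key(apps, 400, 8) == "speed:400GAUI-8"
   ```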
+ + + - The method _get_media_settings_value()_ needs to be modified to enable lookup in both the extended format JSON and the current one: + + The following diagram describes the updated parser flow for _media_settings.json_:
+ ![Parser Flow](parser_flow.drawio.svg) +
_The red blocks represent additions to the existing flow._

   Determining whether the JSON file supports per-speed SI parameters or not will be done by searching for the presence of the string "speed:" in the relevant hierarchy level, which is the prefix of each lane_speed_key. This determination is essential to ensure the compatibility of the code with vendors whose '_media_settings.json_' does not include per-speed SI parameters.

   It is important to note that this lookup mechanism maintains backward compatibility - the parser is capable of handling JSONs that contain the lane speed key as well as JSONs without it.

   Here is the full code:

   ```python
   def get_media_settings_value(physical_port, key):
       media_dict = {}  # guards the lookups below when the port has no entry
       # vendor_key, media_key, lane_speed_key and default_dict are assumed to be
       # unpacked/derived from 'key' and the global section in the full function;
       # this excerpt shows only the per-port lookup.

       if PORT_MEDIA_SETTINGS_KEY in g_dict:
           for keys in g_dict[PORT_MEDIA_SETTINGS_KEY]:
               if int(keys) == physical_port:
                   media_dict = g_dict[PORT_MEDIA_SETTINGS_KEY][keys]
                   break

           if vendor_key in media_dict:
               if is_si_per_speed_supported(media_dict[vendor_key]):  # new code
                   if lane_speed_key in media_dict[vendor_key]:       # new code
                       return media_dict[vendor_key][lane_speed_key]  # new code
                   else:                                              # new code
                       return {}                                      # new code
               else:
                   return media_dict[vendor_key]
           elif media_key in media_dict:
               if is_si_per_speed_supported(media_dict[media_key]):   # new code
                   if lane_speed_key in media_dict[media_key]:        # new code
                       return media_dict[media_key][lane_speed_key]   # new code
                   else:                                              # new code
                       return {}                                      # new code
               else:
                   return media_dict[media_key]
           elif DEFAULT_KEY in media_dict:
               return media_dict[DEFAULT_KEY]
           elif len(default_dict) != 0:
               return default_dict

       return {}
   ```
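   The helper _is_si_per_speed_supported()_ is not shown in the excerpt above; based on the "speed:" prefix check described earlier, a minimal sketch could be:

   ```python
   LANE_SPEED_KEY_PREFIX = "speed:"

   def is_si_per_speed_supported(media_dict):
       """Return True if this vendor/media section carries per-speed SI keys."""
       return any(key.startswith(LANE_SPEED_KEY_PREFIX) for key in media_dict)
   ```

   Checking only for the prefix keeps legacy JSONs working: in a legacy section this hierarchy level holds SI field names, none of which start with "speed:".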
+ +3. Ports Orchagent Additions: Introducing the new SI parameters into the current data flow between APP_DB and SAI. +

+# Unit Test
+- Generation of keys in the new format: Expand the _test_get_media_settings_key()_ method to create a dictionary that contains a mapping between a port and its port speed and lane count. Then call _get_media_settings_key()_ with that dictionary and assert that a valid lane_speed_key was composed.
+
+- The lookup functionality works seamlessly with both the new and legacy JSON formats: Create a new test, _test_get_media_settings_value()_, that gets the (vendor_key, media_key, lane_speed_key) tuple. This test will perform the lookup using this tuple in two instances of the media_settings.json file: one in the updated format and one in the current format. The only difference between these two JSONs is that the first contains the hierarchy level that corresponds to the lane_speed_key received, while the latter doesn't. Everything else is identical between the two JSONs. Both lookups should end up with a match, extracting the same values from the JSONs.
+This test verifies backward compatibility and ensures that the updated JSON format does not cause any issues for other vendors.
+
+- PortsOrchagent tests:
+Verify the SAI object is created properly with the new SI parameters: Create an instance of _media_settings.json_ that contains all 6 new SI parameters for a certain module, call _notify_media_settings()_ and ensure PortsOrchagent creates a SAI object that contains all the new parameters. diff --git a/doc/port-si/media_settings_template.png b/doc/port-si/media_settings_template.png new file mode 100755 index 00000000000..8af07fc0aa9 Binary files /dev/null and b/doc/port-si/media_settings_template.png differ diff --git a/doc/port-si/parser_flow.drawio.svg b/doc/port-si/parser_flow.drawio.svg new file mode 100755 index 00000000000..9197dd1d607 --- /dev/null +++ b/doc/port-si/parser_flow.drawio.svg @@ -0,0 +1,475 @@
[drawio SVG: "parser flow" flowchart for the media_settings.json lookup. Search vendor_key (e.g. AMPHENOL-1234); if not found, search media_key (e.g. QSFP28-40GBASE-CR4-1M); if neither is found, search the 'Default' key and fetch the default values, else return empty data. When a vendor/media key is found, check if the JSON supports per-speed SI parameters: if it does, search the lane_speed_key (e.g. speed:400GAUI-8) and fetch the SI values on a match, otherwise return empty data; if per-speed is not supported, fetch the SI values directly.]
\ No newline at end of file diff --git a/doc/qos/ECN_and_WRED_statistics_HLD.md b/doc/qos/ECN_and_WRED_statistics_HLD.md new file mode 100644 index 00000000000..6a779b0c5df --- /dev/null +++ b/doc/qos/ECN_and_WRED_statistics_HLD.md @@ -0,0 +1,454 @@ +# WRED and ECN Statistics + + +## Table of Contents +- [WRED and ECN Statistics](#wred-and-ecn-statistics) + - [Table of Contents](#table-of-contents) + - [Revision](#revision) + - [Scope](#scope) + - [Abbreviations](#abbreviations) + - [Overview](#overview) + - [Requirements](#requirements) + - [Architecture Design](#architecture-design) + - [High-Level Design](#high-level-design) + - [Changes in CONFIG_DB](#changes-in-config_db) + - [Changes in STATE_DB](#changes-in-state_db) + - [Changes in FLEX_COUNTER_DB](#changes-in-flex_counter_db) + - [Changes in COUNTERS_DB](#changes-in-counters_db) + - [Changes in Orchagent](#changes-in-orchagent) + - [CLI Changes](#cli-changes) + - [CLI output on a WRED and ECN queue statistics supported platform](#cli-output-on-a-wred-and-ECN-queue-statistics-supported-platform) + - [CLI output on a platform which supports WRED drop statistics and does not support ECN statistics](#cli-output-on-a-platform-which-supports-wred-drop-statistics-and-does-not-support-ecn-statistics) + - [CLI output on a platform which supports ECN statistics and does not support WRED statistics](#cli-output-on-a-platform-which-supports-ecn-statistics-and-does-not-support-wred-statistics) + - [show interface counters CLI output on a WRED drop statistics supported platform](#show-interface-counters-cli-output-on-a-wred-drop-statistics-supported-platform) + - [show interface counters on a platform which does not support WRED drop statistics](#show-interface-counters-cli-output-on-a-platform-which-does-not-support-wred-drop-statistics) + - [SAI API](#sai-api) + - [Configuration and management](#configuration-and-management) + - [Warmboot and Fastboot Design Impact](#warmboot-and-fastboot-design-impact) + - [Restrictions/Limitations](#restrictionslimitations) + - [Testing Requirements/Design](#testing-requirementsdesign) + - [Unit Test cases](#unit-test-cases) + - [System Test cases](#system-test-cases) + +### Revision + +| Rev | Date | Author | Change Description | +|:---:|:--------:|:---------------------------:|--------------------| +| 0.1 |23/Feb/23 | Rajesh Perumal **(Marvell)**| Initial Version | + +### Scope + +This document provides the high level design for WRED and ECN Statistics in SONiC + +### Abbreviations + + + +| Abbreviation | Description | +|:-------------:|---------------------------------| +| __ECN__ | Explicit Congestion Notification| +| __WRED__ | Weighted Random Early Detection | +| __CLI__ | Command Line interface | +| __SAI__ | Switch Abstraction Interface | + + +### Overview + +The main goal of this feature is to provide better WRED impact visibility in SONiC by providing a mechanism to count the packets that are dropped or ECN-marked due to WRED. + +The other goal of this feature is to display these statistics only if the underlying platform supports it. [CLI changes](#cli-changes) section explains this in detail. Every platform may have unique statistics capabilities, and they change over time, and so it is important for this feature to be capability based. + +We will accomplish both the goals by adding statistics support for per-queue WRED dropped packets/bytes, per-queue ECN marked packets/bytes and per-port WRED dropped packets. 
The existing “show interface counters detailed” CLI will be enhanced to display the port-level WRED statistics. A new CLI will be introduced for the queue-level WRED and ECN statistics.
+
+In this document, we will be using the term "WRED and ECN statistics" for the combined statistics of per-queue WRED dropped packets/bytes, per-queue ECN marked packets/bytes and per-port WRED dropped packets.
+
+### Requirements
+
+1. Support per-queue total ECN marked packets counters (ECN marked in the local switch)
+2. Support per-queue total ECN marked byte counters (ECN marked in the local switch)
+3. Support per-queue total WRED dropped packets counters
+4. Support per-queue total WRED dropped byte counters
+5. Support per-port WRED dropped packets counters (per-color and total count)
+6. Support per-port total ECN marked packets (in the local switch) counters [to be supported in the next phase of enhancement]
+7. Support configuration control to enable and disable the above statistics
+8. Non-supported platforms will not display these statistics on CLI
+
+
+### Architecture Design
+
+There are no architectural design changes as part of this design.
+
+### High-Level Design
+
+This section covers the high level design of the WRED and ECN statistics feature. The following step-by-step short description provides an overview of the operations involved in this feature,
+
+1. Orchagent fetches the platform statistics capability for WRED and ECN statistics from SAI
+2. The stats capability will be updated to STATE_DB by orchagent
+3. Based on the stats capability and CONFIG_DB status of the respective statistics, Orchagent sets stat-ids to FLEX_COUNTERS_DB
+   * In case the platform is capable of WRED and ECN statistics,
+     * Per-queue WRED and ECN counters will create and use the new flexcounter group WRED_ECN_QUEUE
+     * Per-port WRED and ECN counters will create and use the new flexcounter group WRED_ECN_PORT
+4. Syncd, which subscribes to FLEX_COUNTER_DB, will set up the flex counters
+5. Flex counters periodically query the platform counters and publish the data to COUNTERS_DB
+6. The CLI looks up the capability in STATE_DB
+7. Only the supported statistics will be fetched and displayed on the CLI output
+
+#### Changes in CONFIG_DB
+
+CONFIG_DB changes are required to enable and disable these statistics globally. For that purpose, two new flexcounter groups that poll for WRED and ECN statistics will be added to FLEX_COUNTER_TABLE of CONFIG_DB. Flexcounters WRED_ECN_QUEUE and WRED_ECN_PORT will be added to FLEX_COUNTER_TABLE as shown below,
+
+```
+{
+    "FLEX_COUNTER_TABLE": {
+        "WRED_ECN_QUEUE": {
+            "FLEX_COUNTER_STATUS": "enable",
+            "POLL_INTERVAL": "10000"
+        },
+        "WRED_ECN_PORT": {
+            "FLEX_COUNTER_STATUS": "enable",
+            "POLL_INTERVAL": "1000"
+        }
+    }
+}
+```
+
+The default polling intervals for the flexcounter groups WRED_ECN_QUEUE and WRED_ECN_PORT are 10000 milliseconds and 1000 milliseconds, respectively. By default, these flexcounter groups are disabled for polling.
+
+#### Changes in STATE_DB
+State DB will store information about WRED and ECN statistics support as per the platform capability. The statistics capabilities will be populated during Orchagent startup by checking the platform capability. These capabilities are used in the CLI to display only the supported counters to the user.
+
+```
+"QUEUE_COUNTER_CAPABILITIES": {
+    "WRED_ECN_QUEUE_ECN_MARKED_PKT_COUNTER": {
+        "isSupported": "true"
+    },
+    "WRED_ECN_QUEUE_ECN_MARKED_BYTE_COUNTER": {
+        "isSupported": "true"
+    },
+    "WRED_ECN_QUEUE_WRED_DROPPED_PKT_COUNTER": {
+        "isSupported": "true"
+    },
+    "WRED_ECN_QUEUE_WRED_DROPPED_BYTE_COUNTER": {
+        "isSupported": "true"
+    }
+}
+
+"PORT_COUNTER_CAPABILITIES": {
+    "WRED_ECN_PORT_WRED_GREEN_DROP_COUNTER": {
+        "isSupported": "true"
+    },
+    "WRED_ECN_PORT_WRED_YELLOW_DROP_COUNTER": {
+        "isSupported": "true"
+    },
+    "WRED_ECN_PORT_WRED_RED_DROP_COUNTER": {
+        "isSupported": "true"
+    },
+    "WRED_ECN_PORT_WRED_TOTAL_DROP_COUNTER": {
+        "isSupported": "true"
+    }
+}
+
+```
+
+The default capability will be isSupported=false for all the above statistics.
+
+#### Changes in FLEX_COUNTER_DB
+
+The flexcounter groups need to be created for polling the required statistics. Two new flex counter groups will be introduced for this feature. These are created during Orchagent startup.
+
+On supported platforms,
+* The WRED and ECN queue counters will use the new flexcounter group WRED_ECN_QUEUE for the following list of counters,
+  * SAI_QUEUE_STAT_WRED_ECN_MARKED_PACKETS
+  * SAI_QUEUE_STAT_WRED_ECN_MARKED_BYTES
+  * SAI_QUEUE_STAT_WRED_DROPPED_PACKETS
+  * SAI_QUEUE_STAT_WRED_DROPPED_BYTES
+
+* The WRED port counters will use the new flex counter group WRED_ECN_PORT for the following list of counters,
+  * SAI_PORT_STAT_GREEN_WRED_DROPPED_PACKETS
+  * SAI_PORT_STAT_YELLOW_WRED_DROPPED_PACKETS
+  * SAI_PORT_STAT_RED_WRED_DROPPED_PACKETS
+  * SAI_PORT_STAT_WRED_DROPPED_PACKETS
+  * SAI_PORT_STAT_ECN_MARKED_PACKETS [to be supported in next phase of Enhancement]
+
+
+#### Changes in COUNTERS_DB
+
+The following new port counters will be added along with the existing counters on supported platforms
+
+* COUNTERS:oid:port_oid
+  * SAI_PORT_STAT_GREEN_WRED_DROPPED_PACKETS
+  * SAI_PORT_STAT_YELLOW_WRED_DROPPED_PACKETS
+  * SAI_PORT_STAT_RED_WRED_DROPPED_PACKETS
+  * SAI_PORT_STAT_WRED_DROPPED_PACKETS
+  * SAI_PORT_STAT_ECN_MARKED_PACKETS [to be supported in next phase of Enhancement]
+
+For every egress queue, the following statistics will be added along with the existing queue counters on supported platforms
+
+* COUNTERS:oid:queue_oid
+  * SAI_QUEUE_STAT_WRED_ECN_MARKED_PACKETS
+  * SAI_QUEUE_STAT_WRED_ECN_MARKED_BYTES
+  * SAI_QUEUE_STAT_WRED_DROPPED_PACKETS
+  * SAI_QUEUE_STAT_WRED_DROPPED_BYTES
+
+
+### Changes in Orchagent
+Orchagent gets the WRED and ECN statistics capability during startup and updates STATE_DB with the supported statistics. If a counter is supported, the respective capability will be set to true; otherwise it will be set to false. Based on the capability in STATE_DB, FLEX_COUNTER_DB will be updated with the supported statistics for polling.
+

+![StateDB syncd interactions](ecn-wred-stats-images/orchagent_db_state_flow.png)

+ +Once WRED_ECN_QUEUE or WRED_ECN_PORT is enabled in FLEX_COUNTER_TABLE, Orchagent will enable the respective flexcounter group, as the following diagram illustrates: +

+![StateDB syncd interactions](ecn-wred-stats-images/orchagent_stats_enable_disable.png)

+
+If the user enables the WRED and ECN statistics on a platform where none of the statistics of a flexcounter group are supported, an error message will be logged to syslog. For example, if none of the queue-level WRED and ECN statistics are supported on a platform, enabling that flexcounter group will log a syslog error.
+
+### CLI Changes
+
+A few new CLIs and CLI tokens are introduced for this feature, and the output of the existing "show interface counters detailed" CLI changes as well. The details are illustrated below in this section. The show CLIs will display the WRED and ECN statistics only if the capability is supported by the platform. They get the capability from STATE_DB and query only the supported statistics from COUNTERS_DB.
+
+* New CLI tokens are introduced under the existing ```counterpoll``` CLI for enabling and disabling the WRED and ECN statistics polling globally,
+  * Enable/Disable the queue level counters : ```counterpoll wredqueue <enable|disable>```
+  * Enable/Disable the port level counters : ```counterpoll wredport <enable|disable>```
+* New CLI tokens are introduced under the existing ```counterpoll``` CLI for setting the polling interval of the statistics,
+  * Set polling interval for queue level counters: ```counterpoll wredqueue interval <interval in ms>```
+  * Set polling interval for port level counters: ```counterpoll wredport interval <interval in ms>```
+
+* The existing ```counterpoll``` CLI can be used to display the counter status and polling interval,
+  * Display the status of the counters : ```counterpoll show```
+
+* The following new CLIs are introduced for the per-queue WRED and ECN statistics,
+  * Statistics are cleared on user request : ```sonic-clear queue wredcounters```
+  * Display the statistics on the console : ```show queue wredcounters [interface-name]```
+
+* The following existing CLIs are used for the per-port WRED statistics,
+  * Statistics are cleared on user request : ```sonic-clear counters```
+  * Display the statistics on the console : ```show interfaces counters detailed <interface-name>```
+
+#### CLI output on a WRED and ECN queue statistics supported platform
+
+```
+sonic-dut:~# show queue wredcounters Ethernet16
+      Port    TxQ    WredDrp/pkts    WredDrp/bytes    EcnMarked/pkts    EcnMarked/bytes
+----------  -----  --------------  ---------------  ----------------  -----------------
+Ethernet16    UC0               0                0                 0                  0
+Ethernet16    UC1               1              120                 0                  0
+Ethernet16    UC2               0                0                 0                  0
+Ethernet16    UC3               0                0                 0                  0
+Ethernet16    UC4               0                0                 0                  0
+Ethernet16    UC5               0                0                 0                  0
+Ethernet16    UC6               0                0                 0                  0
+Ethernet16    UC7               0                0                 0                  0
+```
+#### CLI output on a platform which supports WRED drop statistics and does not support ECN statistics
+```
+sonic-dut:~# show queue wredcounters Ethernet8
+     Port    TxQ    WredDrp/pkts    WredDrp/bytes    EcnMarked/pkts    EcnMarked/bytes
+---------  -----  --------------  ---------------  ----------------  -----------------
+Ethernet8    UC0               0                0               N/A                N/A
+Ethernet8    UC1               1              120               N/A                N/A
+Ethernet8    UC2               0                0               N/A                N/A
+Ethernet8    UC3               0                0               N/A                N/A
+Ethernet8    UC4               0                0               N/A                N/A
+Ethernet8    UC5               0                0               N/A                N/A
+Ethernet8    UC6               0                0               N/A                N/A
+Ethernet8    UC7               0                0               N/A                N/A
+
+```
+#### CLI output on a platform which supports ECN statistics and does not support WRED statistics
+```
+sonic-dut:~# show queue wredcounters Ethernet16
+      Port    TxQ    WredDrp/pkts    WredDrp/bytes    EcnMarked/pkts    EcnMarked/bytes
+----------  -----  --------------  ---------------  ----------------  -----------------
+Ethernet16    UC0             N/A              N/A                 0                  0
+Ethernet16    UC1             N/A              N/A                 1                120
+Ethernet16    UC2             N/A              N/A                 0                  0
+Ethernet16    UC3             N/A              N/A                 0                  0
+Ethernet16    UC4             N/A              N/A                 0                  0
+Ethernet16    UC5             N/A              N/A                 0                  0
+Ethernet16    UC6             N/A              N/A                 0                  0
+Ethernet16    UC7             N/A              N/A                 0                  0
+
+```
+#### show interface counters CLI output on a WRED drop statistics supported platform +``` +root@sonic-dut:~# show interfaces counters detailed Ethernet8 +Packets Received 64 Octets..................... 0 +Packets Received 65-127 Octets................. 2 +Packets Received 128-255 Octets................ 0 +Packets Received 256-511 Octets................ 0 +Packets Received 512-1023 Octets............... 0 +Packets Received 1024-1518 Octets.............. 0 +Packets Received 1519-2047 Octets.............. 0 +Packets Received 2048-4095 Octets.............. 0 +Packets Received 4096-9216 Octets.............. 0 +Packets Received 9217-16383 Octets............. 0 + +Total Packets Received Without Errors.......... 2 +Unicast Packets Received....................... 0 +Multicast Packets Received..................... 2 +Broadcast Packets Received..................... 0 + +Jabbers Received............................... N/A +Fragments Received............................. N/A +Undersize Received............................. 0 +Overruns Received.............................. 0 + +Packets Transmitted 64 Octets.................. 32,893 +Packets Transmitted 65-127 Octets.............. 16,449 +Packets Transmitted 128-255 Octets............. 3 +Packets Transmitted 256-511 Octets............. 2,387 +Packets Transmitted 512-1023 Octets............ 0 +Packets Transmitted 1024-1518 Octets........... 0 +Packets Transmitted 1519-2047 Octets........... 0 +Packets Transmitted 2048-4095 Octets........... 0 +Packets Transmitted 4096-9216 Octets........... 0 +Packets Transmitted 9217-16383 Octets.......... 0 + +Total Packets Transmitted Successfully......... 51,732 +Unicast Packets Transmitted.................... 0 +Multicast Packets Transmitted.................. 18,840 +Broadcast Packets Transmitted.................. 32,892 +Time Since Counters Last Cleared............... None + +WRED Green Dropped Packets..................... 1 +WRED Yellow Dropped Packets.................... 3 +WRED RED Dropped Packets....................... 10 +WRED Total Dropped Packets..................... 14 + +``` + +#### show interface counters CLI output on a platform which does not support WRED drop statistics +``` +root@sonic-dut:~# show interfaces counters detailed Ethernet8 +Packets Received 64 Octets..................... 0 +Packets Received 65-127 Octets................. 2 +Packets Received 128-255 Octets................ 0 +Packets Received 256-511 Octets................ 0 +Packets Received 512-1023 Octets............... 0 +Packets Received 1024-1518 Octets.............. 0 +Packets Received 1519-2047 Octets.............. 0 +Packets Received 2048-4095 Octets.............. 0 +Packets Received 4096-9216 Octets.............. 0 +Packets Received 9217-16383 Octets............. 0 + +Total Packets Received Without Errors.......... 2 +Unicast Packets Received....................... 0 +Multicast Packets Received..................... 2 +Broadcast Packets Received..................... 0 + +Jabbers Received............................... N/A +Fragments Received............................. N/A +Undersize Received............................. 0 +Overruns Received.............................. 0 + +Packets Transmitted 64 Octets.................. 32,893 +Packets Transmitted 65-127 Octets.............. 16,449 +Packets Transmitted 128-255 Octets............. 3 +Packets Transmitted 256-511 Octets............. 2,387 +Packets Transmitted 512-1023 Octets............ 0 +Packets Transmitted 1024-1518 Octets........... 
0
+Packets Transmitted 1519-2047 Octets........... 0
+Packets Transmitted 2048-4095 Octets........... 0
+Packets Transmitted 4096-9216 Octets........... 0
+Packets Transmitted 9217-16383 Octets.......... 0
+
+Total Packets Transmitted Successfully......... 51,732
+Unicast Packets Transmitted.................... 0
+Multicast Packets Transmitted.................. 18,840
+Broadcast Packets Transmitted.................. 32,892
+Time Since Counters Last Cleared............... None
+```
+
+The "show queue wredcounters" CLI fetches the statistics from COUNTERS_DB and displays them on the console based on the capability of the platform. The sequence diagram below logically explains the "show queue wredcounters" interaction among the CLI, STATE_DB and COUNTERS_DB for ECN and WRED statistics on a supported platform:
+

+![queue CLI interactions](ecn-wred-stats-images/queue_counter_cli_flow.png)

+ +The "show interface counters detailed" will follow the same flow as of now for existing statistics. The below sequence diagram explains the "show interface counters detailed" interaction among CLI, STATE_DB and COUNTERS_DB for WRED drop statistics on supported platform, +

+![port CLI interactions](ecn-wred-stats-images/port_counter_cli_flow.png)

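As a rough illustration of the capability lookup step in the flows above, a CLI-side check could read STATE_DB as follows (a minimal sketch using the swsscommon Python bindings; the table and field names are taken from the STATE_DB schema described earlier, the function name is illustrative):

```python
from swsscommon import swsscommon

def is_counter_supported(capability_table_name, counter_name):
    """Return True if STATE_DB marks the given counter as supported."""
    state_db = swsscommon.DBConnector("STATE_DB", 0)
    capability_table = swsscommon.Table(state_db, capability_table_name)
    found, value = capability_table.hget(counter_name, "isSupported")
    return found and value == "true"

# Example: decide whether to render the ECN-marked-packets column
show_ecn_pkts = is_counter_supported("QUEUE_COUNTER_CAPABILITIES",
                                     "WRED_ECN_QUEUE_ECN_MARKED_PKT_COUNTER")
```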
+
+
+### SAI API
+
+The following SAI statistics are used in this feature,
+
+* SAI counters list,
+  * SAI_QUEUE_STAT_WRED_ECN_MARKED_PACKETS
+  * SAI_QUEUE_STAT_WRED_ECN_MARKED_BYTES
+  * SAI_QUEUE_STAT_WRED_DROPPED_PACKETS
+  * SAI_QUEUE_STAT_WRED_DROPPED_BYTES
+  * SAI_PORT_STAT_GREEN_WRED_DROPPED_PACKETS
+  * SAI_PORT_STAT_YELLOW_WRED_DROPPED_PACKETS
+  * SAI_PORT_STAT_RED_WRED_DROPPED_PACKETS
+  * SAI_PORT_STAT_WRED_DROPPED_PACKETS
+  * SAI_PORT_STAT_ECN_MARKED_PACKETS [to be supported in next phase of Enhancement]
+
+* sai_query_stats_capability() will be used to identify the supported statistics
+
+* For statistics get and clear, the existing APIs are used as-is
+
+### Configuration and management
+
+The config CLI commands ```counterpoll wredqueue <enable|disable>``` and ```counterpoll wredport <enable|disable>``` will enable and disable the WRED and ECN statistics by interacting with the FLEX_COUNTER_TABLE of CONFIG_DB.
+
+### Manifest (if the feature is an Application Extension)
+Not applicable.
+
+### CLI/YANG model Enhancements
+The sonic-flex_counter.yang will be updated with new containers to reflect the proposed CONFIG_DB changes as shown below,
+
+    {
+        container WRED_ECN_QUEUE {
+            /* WRED_ECN_QUEUE_FLEX_COUNTER_GROUP */
+            leaf FLEX_COUNTER_STATUS {
+                type flex_status;
+            }
+            leaf FLEX_COUNTER_DELAY_STATUS {
+                type flex_delay_status;
+            }
+            leaf POLL_INTERVAL {
+                type poll_interval;
+            }
+        },
+        container WRED_ECN_PORT {
+            /* WRED_ECN_PORT_FLEX_COUNTER_GROUP */
+            leaf FLEX_COUNTER_STATUS {
+                type flex_status;
+            }
+            leaf FLEX_COUNTER_DELAY_STATUS {
+                type flex_delay_status;
+            }
+            leaf POLL_INTERVAL {
+                type poll_interval;
+            }
+        }
+    }
+
+
+### Warmboot and Fastboot Design Impact
+There is no impact to warmboot or fastboot.
+
+
+### Restrictions/Limitations
+
+* None
+
+### Testing Requirements/Design
+
+#### Unit Test cases
+- On supported platforms, verify that the queuestat CLI display has the WRED and ECN queue statistics
+- On supported platforms, verify that the port statistics CLI display has the WRED statistics
+- On ECN-WRED stats non-supported platforms,
+  - Verify that the CLI does not show the respective headers in queue stats
+  - Verify that the CLI does not show the respective rows in port stats
+  - Verify that the stat capability get is not logging errors to syslog
+
+#### System Test cases
+* A new sonic-mgmt (PTF) ECN/WRED statistics testcase will be created to verify the statistics on supported platforms diff --git a/doc/qos/ecn-wred-stats-images/orchagent_db_state_flow.png b/doc/qos/ecn-wred-stats-images/orchagent_db_state_flow.png new file mode 100644 index 00000000000..712e88473d1 Binary files /dev/null and b/doc/qos/ecn-wred-stats-images/orchagent_db_state_flow.png differ diff --git a/doc/qos/ecn-wred-stats-images/orchagent_stats_enable_disable.png b/doc/qos/ecn-wred-stats-images/orchagent_stats_enable_disable.png new file mode 100644 index 00000000000..7d186be6620 Binary files /dev/null and b/doc/qos/ecn-wred-stats-images/orchagent_stats_enable_disable.png differ diff --git a/doc/qos/ecn-wred-stats-images/port_counter_cli_flow.png b/doc/qos/ecn-wred-stats-images/port_counter_cli_flow.png new file mode 100644 index 00000000000..5bbd63b4784 Binary files /dev/null and b/doc/qos/ecn-wred-stats-images/port_counter_cli_flow.png differ diff --git a/doc/qos/ecn-wred-stats-images/queue_counter_cli_flow.png b/doc/qos/ecn-wred-stats-images/queue_counter_cli_flow.png new file mode 100644 index 00000000000..5344a08a8b9 Binary files /dev/null and b/doc/qos/ecn-wred-stats-images/queue_counter_cli_flow.png differ diff --git
a/doc/sfputil/read_write_eeprom_by_page.md b/doc/sfputil/read_write_eeprom_by_page.md new file mode 100644 index 00000000000..6be1ae20b7d --- /dev/null +++ b/doc/sfputil/read_write_eeprom_by_page.md @@ -0,0 +1,199 @@
+# sfputil | add the ability to read/write any byte from EEPROM both by page and offset #
+
+## Table of Content
+
+### Revision
+
+### Scope
+
+This document describes the SONiC feature to read/write cable EEPROM data via the SONiC CLI.
+
+### Definitions/Abbreviations
+
+N/A
+
+### Overview
+
+The CLI sfputil shall be extended to support reading/writing cable EEPROM by page and offset. This implementation is based on the existing platform APIs `sfp.read_eeprom` and `sfp.write_eeprom`.
+
+### Requirements
+
+- Support reading/writing cable EEPROM data by page, offset and size. For sff8472, wire address "a0h" or "a2h" must be provided by the user.
+- Support basic validation for input parameters such as page, offset and size.
+- Support reading/writing cable EEPROM for all types of cables except RJ45.
+- The user shall provide page and offset according to the standard. For example, CMIS page 1h starts at offset 128; an offset less than 128 shall be treated as invalid.
+- Vendors who do not support `sfp.read_eeprom` and `sfp.write_eeprom` are expected to raise `NotImplementedError`; this error shall be properly handled.
+- Other errors shall be treated as read/write failures.
+
+
+### Architecture Design
+
+The current architecture is not changed.
+
+### High-Level Design
+
+Submodule sonic-utilities shall be extended to support this feature.
+
+#### sonic-utilities
+
+Two new CLIs shall be added to the sfputil module:
+
+- sfputil read-eeprom
+- sfputil write-eeprom
+
+For details, please check the chapter [CLI/YANG model Enhancements](#cliyang-model-enhancements)
+
+The existing APIs `sfp.read_eeprom` and `sfp.write_eeprom` accept an "overall offset" as parameter. They have no concept of pages. If users want to use them, they have to convert page and offset to the "overall offset" manually, and different cable types use different conversion methods according to the standard. So, it is not user friendly to expose such APIs directly to the user. sonic-utilities shall provide the function to translate from page and offset to the overall offset.
+
+##### CMIS validation
+
+Passive cable:
+- Valid page: 0
+- Valid offset: 0-255
+
+Active cable:
+- Valid page: 0-255
+- Valid offset: page 0 (0-255), other (128-255)
+
+For active cables, there is no "perfect" page validation as it would be too complicated. The user is responsible for making sure the page exists according to the cable's user manual.
+
+Example:
+```
+sfputil read-eeprom -p Ethernet0 -n 0 -o 255 -s 1 # valid
+sfputil read-eeprom -p Ethernet0 -n 0 -o 255 -s 2 # invalid size, out of range, 255+2=257 is not a valid offset
+sfputil read-eeprom -p Ethernet0 -n 0 -o 256 -s 1 # invalid offset 256 for page 0, must be in range [0, 255]
+sfputil read-eeprom -p Ethernet0 -n 1 -o 0 -s 1 # invalid offset 0 for page 1, must be >=128
+```
+
+##### sff8436 and sff8636 validation
+
+Passive cable:
+- Valid page: 0
+- Valid offset: 0-255
+
+Active cable:
+- Valid page: 0-255
+- Valid offset: page 0 (0-255), other (128-255)
+
+For active cables, there is no "perfect" page validation as it would be too complicated. The user is responsible for making sure the page exists according to the cable's user manual.
+
+```
+sfputil write-eeprom -p Ethernet0 -n 0 -o 255 -d ff # valid
+sfputil write-eeprom -p Ethernet0 -n 0 -o 255 -d ff00 # invalid size, out of range, 255+2=257 is not a valid offset
+sfputil write-eeprom -p Ethernet0 -n 0 -o 256 -d ff # invalid offset 256 for page 0, must be in range [0, 255]
+sfputil write-eeprom -p Ethernet0 -n 1 -o 0 -d ff # invalid offset 0 for page 1, must be >=128
+```
+
+##### sff8472 validation
+
+Passive cable:
+- Valid wire address: [A0h] (case insensitive)
+- Valid offset: A0h (0-128)
+
+Active cable:
+- Valid wire address: [A0h, A2h] (case insensitive)
+- Valid offset: A0h (0-255), A2h (0-255)
+
+```
+sfputil read-eeprom -p Ethernet0 -n 0 -o 0 -s 1 --wire-addr a0h # valid
+sfputil read-eeprom -p Ethernet0 -n 0 -o 0 -s 2 --wire-addr A0h # invalid size, out of range, 255+2=257 is not a valid offset
+sfputil read-eeprom -p Ethernet0 -n 1 -o 0 -s 1 --wire-addr a2h # invalid offset 256 for page 0, must be in range [0, 255]
+sfputil read-eeprom -p Ethernet0 -n 1 -o 0 -s 1 --wire-addr a0h # invalid offset 0 for page 1, must be >=128
+```
+
+### SAI API
+
+N/A
+
+### Configuration and management
+
+#### Manifest (if the feature is an Application Extension)
+
+N/A
+
+#### CLI/YANG model Enhancements
+
+##### sfputil read-eeprom
+
+```
+admin@sonic:~$ sfputil read-eeprom --help
+Usage: sfputil read-eeprom [OPTIONS]
+
+  Read SFP EEPROM data
+
+Options:
+  -p, --port        Logical port name  [required]
+  -n, --page        EEPROM page number  [required]
+  -o, --offset      EEPROM offset within the page  [required]
+  -s, --size        Size of byte to be read  [required]
+  --no-format       Display non formatted data
+  --wire-addr TEXT  Wire address of sff8472
+  --help            Show this message and exit.
+```
+
+Example:
+
+```
+sfputil read-eeprom -p Ethernet0 -n 0 -o 100 -s 2
+        00000064 4a 44                                            |..|
+
+sfputil read-eeprom -p Ethernet0 -n 0 -o 0 -s 32
+        00000000 11 08 06 00 00 00 00 00 00 00 00 00 00 00 00 00 |................|
+        00000010 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 |................|
+
+sfputil read-eeprom -p Ethernet0 -n 0 -o 100 -s 2 --no-format
+4a44
+```
+
+##### sfputil write-eeprom
+
+```
+admin@sonic:~$ sfputil write-eeprom --help
+Usage: sfputil write-eeprom [OPTIONS]
+
+  Write SFP EEPROM data
+
+Options:
+  -p, --port        Logical port name  [required]
+  -n, --page        EEPROM page number  [required]
+  -o, --offset      EEPROM offset within the page  [required]
+  -d, --data        Hex string EEPROM data  [required]
+  --wire-addr TEXT  Wire address of sff8472
+  --verify          Verify the data by reading back
+  --help            Show this message and exit.
+```
+
+Example:
+
+```
+sfputil write-eeprom -p Ethernet0 -n 0 -o 100 -d 4a44
+
+sfputil write-eeprom -p Ethernet0 -n 0 -o 100 -d 4a44 --verify
+Error: Write data failed! Write: 4a44, read: 0000.
+```
+
+#### Config DB Enhancements
+
+N/A
+
+### Warmboot and Fastboot Design Impact
+
+N/A
+
+### Memory Consumption
+
+No memory consumption is expected when the feature is disabled via compilation, and no growing memory consumption is expected while the feature is disabled by configuration.
+
+### Restrictions/Limitations
+
+- Vendors should support the platform APIs `sfp.read_eeprom` and `sfp.write_eeprom` to support this feature.
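As background for the page/offset handling above, a simplified model of the page-and-offset to overall-offset translation could look as follows (an illustrative sketch of the convention described in this document, not the actual sonic-utilities code; the sff8472 mapping assumes the A2h bytes follow the 256 A0h bytes):

```python
PAGE_SIZE = 128

def overall_offset_general(page, offset):
    """Translate (page, offset) to a flat EEPROM offset for CMIS/sff8436/sff8636.

    Page 0 covers flat offsets 0-255; every upper page N >= 1 covers
    offsets 128-255 and is assumed to be mapped at N * 128 + offset.
    """
    if page == 0:
        if not 0 <= offset <= 255:
            raise ValueError("offset for page 0 must be in [0, 255]")
        return offset
    if not 128 <= offset <= 255:
        raise ValueError("offset for page %d must be in [128, 255]" % page)
    return page * PAGE_SIZE + offset

def overall_offset_sff8472(wire_addr, offset):
    """Translate (wire address, offset) for sff8472 under the stated assumption."""
    if not 0 <= offset <= 255:
        raise ValueError("offset must be in [0, 255]")
    return offset if wire_addr.lower() == "a0h" else 256 + offset

# Example: CMIS page 1, offset 128 maps to flat offset 256
assert overall_offset_general(1, 128) == 256
```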
+
+### Testing Requirements/Design
+
+#### Unit Test cases
+
+- sonic-utilities unit tests shall be extended to cover the new subcommands
+
+#### System Test cases
+
+### Open/Action items - if any diff --git a/doc/smart-switch/high-availability/images/eni-creation-step-1.svg b/doc/smart-switch/high-availability/images/eni-creation-step-1.svg new file mode 100644 index 00000000000..aefc8891d9a --- /dev/null +++ b/doc/smart-switch/high-availability/images/eni-creation-step-1.svg @@ -0,0 +1,4 @@
[drawio SVG: "ENI creation" — Smart Switch 1/2/3, each with an NPU running hamgrd holding the placement "ENI1->DPU2, DPU3", plus Mgmt Port, FP-Ports and DPU cards; ENI1 is created on DPU Card 2 and DPU Card 3, driven by the Upstream Service. Legend: upstream service channel, HA control plane data channel, HA control plane control channel, internal redis/zmq based channel.]
\ No newline at end of file diff --git a/doc/smart-switch/high-availability/images/eni-creation-step-2.svg b/doc/smart-switch/high-availability/images/eni-creation-step-2.svg new file mode 100644 index 00000000000..bc0a0ee7b5c --- /dev/null +++ b/doc/smart-switch/high-availability/images/eni-creation-step-2.svg @@ -0,0 +1,4 @@ + + + +
[drawio SVG: "Primary Election" — same three-switch topology; ENI1 becomes Standby on DPU Card 2 and Active on DPU Card 3. Legend adds: primary election, new primary notification, channel establishment.]
\ No newline at end of file diff --git a/doc/smart-switch/high-availability/images/eni-creation-step-3.svg b/doc/smart-switch/high-availability/images/eni-creation-step-3.svg new file mode 100644 index 00000000000..ae7719e7ece --- /dev/null +++ b/doc/smart-switch/high-availability/images/eni-creation-step-3.svg @@ -0,0 +1,4 @@ + + + +
[drawio SVG: "Probe update" — same three-switch topology; each hamgrd placement table now reads "ENI1->DPU2, DPU3 (DPU2: Down, DPU3: Up)" while ENI1 stays Standby on DPU Card 2 and Active on DPU Card 3.]
\ No newline at end of file diff --git a/doc/smart-switch/high-availability/images/eni-pair-placement.svg b/doc/smart-switch/high-availability/images/eni-pair-placement.svg new file mode 100644 index 00000000000..0898d053efb --- /dev/null +++ b/doc/smart-switch/high-availability/images/eni-pair-placement.svg @@ -0,0 +1,4 @@ + + + +
[drawio SVG: "T1 Set (Card level pairing)" — T1-A through T1-D; DPU-A pairs with DPU-E, DPU-B with DPU-H, DPU-C with DPU-G, and DPU-D with DPU-F; each DPU in a pair hosts the same two ENIs (e.g. ENI1/ENI2 on both DPU-A and DPU-E).]
\ No newline at end of file diff --git a/doc/smart-switch/high-availability/images/eni-programming.svg b/doc/smart-switch/high-availability/images/eni-programming.svg new file mode 100644 index 00000000000..b71b1ce3ae1 --- /dev/null +++ b/doc/smart-switch/high-availability/images/eni-programming.svg @@ -0,0 +1,4 @@ + + + +
[drawio SVG: "ENI programming" — same three-switch topology as the ENI creation step; ENI1 is programmed on DPU Card 2 and DPU Card 3 through the upstream service channel and the HA control plane channels.]
\ No newline at end of file diff --git a/doc/smart-switch/high-availability/images/flow-replication-data-path.svg b/doc/smart-switch/high-availability/images/flow-replication-data-path.svg new file mode 100644 index 00000000000..1f02dce9682 --- /dev/null +++ b/doc/smart-switch/high-availability/images/flow-replication-data-path.svg @@ -0,0 +1,4 @@ + + + +
[drawio SVG: flow replication data path — DPU1 (VIP 2.2.2.1, PA 10.0.0.1, Active) behind NPU1 (lo 1.1.1.1) on Smart Switch 1, DPU2 (VIP 2.2.2.1, PA 10.0.0.2, Standby) behind NPU2 (lo 1.1.1.2) on Smart Switch 2, and VM1 (CA 10.0.0.1) on Node1. Numbered packet walk covering: the initial request (VxLAN from VM1 PA to DPU1 VIP, Outbound VNI, inner SRC=VM1 CA / DST=VM2 CA); flow replication (UDP from DPU1 PA to DPU2 PA on the data-plane channel port, carrying vendor-defined flow-replication metadata around the customer packet); the flow replication ack in the reverse direction; and the outgoing VxLAN packet toward the VM2 DPU pair (VIP 2.2.2.2, VNET VNI). Notes: both DPUs advertise the same VIP, but NPU2 knows the active DPU and forwards there; the ConnTrack update on DPU1 creates a forwarding flow (match Protocol/Src=VM1 CA:Port/Dst=VM2 CA:Port, encap to VM2 DPU VIP from policy) and a reverse flow (match Src=VM2 CA:Port/Dst=VM1 CA:Port, encap to VM1 PA from the packet); each DPU PA is routable within the regional network; the standby DPU recreates the flow info from the vendor-defined metadata in the outer packet; the DPU does not set up BGP sessions for the VIP to the NPU, which instead uses the inner source MAC to pick the DPU to forward to.]
\ No newline at end of file diff --git a/doc/smart-switch/high-availability/images/ha-active-standby-setup.svg b/doc/smart-switch/high-availability/images/ha-active-standby-setup.svg new file mode 100644 index 00000000000..16173369d49 --- /dev/null +++ b/doc/smart-switch/high-availability/images/ha-active-standby-setup.svg @@ -0,0 +1,4 @@ + + + +
[drawio SVG: "Active-Standby" setup — numbered packet paths between Smart Switch 1 (NPU1, DPU1 Active) and Smart Switch 2 (NPU2, DPU2 Standby) over the FP-Ports and DPU ports. Legend: packets to active, packets to standby.]
\ No newline at end of file diff --git a/doc/smart-switch/high-availability/images/ha-control-plane-control-channel-data-path.svg b/doc/smart-switch/high-availability/images/ha-control-plane-control-channel-data-path.svg new file mode 100644 index 00000000000..0c8ce1226b1 --- /dev/null +++ b/doc/smart-switch/high-availability/images/ha-control-plane-control-channel-data-path.svg @@ -0,0 +1,4 @@ + + + +
[Figure: HA control plane control channel data path — Smart Switch 1 with NPU1 (lo: 1.1.1.1) and DPU1 (VIP: 2.2.2.1, PA: 10.0.0.1, Active); Smart Switch 2 with NPU2 (lo: 1.1.1.2) and DPU2 (VIP: 2.2.2.1, PA: 10.0.0.2, Standby). Request packet: Eth SRC=<DPU1 MAC (Initial)>; TCP SRC=<DPU1 PA>:*, DST=<DPU2 PA>:<CP Port>; SONiC-defined message payload. Response packet: the reverse (TCP SRC=<DPU2 PA>:<CP Port>, DST=<DPU1 PA>:*). Legend: request (e.g. Syn) vs. response (e.g. SynAck).]
\ No newline at end of file diff --git a/doc/smart-switch/high-availability/images/ha-control-plane-overview.svg b/doc/smart-switch/high-availability/images/ha-control-plane-overview.svg new file mode 100644 index 00000000000..f22f8cb6a1f --- /dev/null +++ b/doc/smart-switch/high-availability/images/ha-control-plane-overview.svg @@ -0,0 +1,4 @@ + + + +
[Figure: HA control plane overview — an upstream service connects to two smart switches. Each NPU runs hamgrd and swbusd (ha container), a gNMI agent (mgmt container), Redis (database container), orchagent (swss) and syncd over the switch ASIC and PCIe bus, with front-panel ports (FP-Ports), back-panel ports (BP-Ports) and a management port. Each DPU card runs its own Redis (database), swss (*orch*) and syncd over the DPU ASIC / ARM cores and PCIe bus. Legend: upstream service channel, HA control plane data channel, HA control plane control channel, internal redis/zmq based channel, data plane channel.]
\ No newline at end of file diff --git a/doc/smart-switch/high-availability/images/ha-network-topology.svg b/doc/smart-switch/high-availability/images/ha-network-topology.svg new file mode 100644 index 00000000000..8958649d431 --- /dev/null +++ b/doc/smart-switch/high-availability/images/ha-network-topology.svg @@ -0,0 +1,4 @@ + + + +
[Figure: HA network topology — Tier-2 switches connect to Tier-1 smart switches (Smart Switch 0: NPU0 + DPU0 (Active); Smart Switch 1: NPU1 + DPU1 (Standby); through Smart Switch N), which connect to Tier-0 switches.]
\ No newline at end of file diff --git a/doc/smart-switch/high-availability/images/ha-planned-events-launch-clean.svg b/doc/smart-switch/high-availability/images/ha-planned-events-launch-clean.svg new file mode 100644 index 00000000000..5c1a4e30c15 --- /dev/null +++ b/doc/smart-switch/high-availability/images/ha-planned-events-launch-clean.svg @@ -0,0 +1,4 @@ + + + +
[Figure: planned events, clean launch — Step 1: DPU0 and DPU1 both Dead (term 0), NPU next hops unset. Step 2: both Dead→Connecting; the DPUs connect. Step 3: both Connecting→Connected; a vote is requested. Step 4: DPU0 Connected→InitializingToActive, DPU1 Connected→InitializingToStandby (HAState Changed). Step 5: after bulk sync is done, DPU0 InitializingToActive→Active (term 0→1) and both NPU next hops move None→DPU0. Step 6: DPU1 InitializingToStandby→Standby (term 0→1), HAState Changed.]
\ No newline at end of file diff --git a/doc/smart-switch/high-availability/images/ha-planned-events-launch-rejoin.svg b/doc/smart-switch/high-availability/images/ha-planned-events-launch-rejoin.svg new file mode 100644 index 00000000000..efa55ab1ed6 --- /dev/null +++ b/doc/smart-switch/high-availability/images/ha-planned-events-launch-rejoin.svg @@ -0,0 +1,4 @@ + + + +
[Figure: planned events, launch with rejoin (steps summarized in order) — Step 1: DPU0 Standalone (term N), DPU1 Dead→Connecting; the DPUs connect. Step 2: DPU1 Connecting→Connected; a vote is requested. Step 3: DPU0 Standalone→Active (term N→N+1), DPU1 Connected→InitializingToStandby (term 0→N), HAState Changed. Step 4: bulk sync from DPU0 (Active, term N+1) to DPU1 (InitializingToStandby). Step 5: bulk sync done; DPU1 InitializingToStandby→Standby (term N+1). NPU next hops stay on DPU0 throughout.]
\ No newline at end of file diff --git a/doc/smart-switch/high-availability/images/ha-planned-events-launch-timeout.svg b/doc/smart-switch/high-availability/images/ha-planned-events-launch-timeout.svg new file mode 100644 index 00000000000..76136a8af2d --- /dev/null +++ b/doc/smart-switch/high-availability/images/ha-planned-events-launch-timeout.svg @@ -0,0 +1,4 @@ + + + +
[Figure: planned events, launch with peer timeout — Step 1: DPU0 stays Dead (term 0); DPU1 Dead→Connecting, and the connect attempt ends with PeerDead / PeerNotFound. Step 2: DPU1 Connecting→Standalone (term 0→1); both NPU next hops move None→DPU1.]
\ No newline at end of file diff --git a/doc/smart-switch/high-availability/images/ha-planned-events-shutdown-step-1.svg b/doc/smart-switch/high-availability/images/ha-planned-events-shutdown-step-1.svg new file mode 100644 index 00000000000..def37c02de4 --- /dev/null +++ b/doc/smart-switch/high-availability/images/ha-planned-events-shutdown-step-1.svg @@ -0,0 +1,4 @@ + + + +
[Figure: planned shutdown, step 1 — DPU0 Active (term N), DPU1 Standby (term N), next hops on DPU0; the upstream service sets DesiredState to Dead (on DPU1, the standby, per the following steps).]
\ No newline at end of file diff --git a/doc/smart-switch/high-availability/images/ha-planned-events-shutdown-step-2.svg b/doc/smart-switch/high-availability/images/ha-planned-events-shutdown-step-2.svg new file mode 100644 index 00000000000..24289b0b493 --- /dev/null +++ b/doc/smart-switch/high-availability/images/ha-planned-events-shutdown-step-2.svg @@ -0,0 +1,4 @@ + + + +
[Figure: planned shutdown, step 2 — states unchanged; a Request Shutdown message is exchanged between the two sides.]
\ No newline at end of file diff --git a/doc/smart-switch/high-availability/images/ha-planned-events-shutdown-step-3.svg b/doc/smart-switch/high-availability/images/ha-planned-events-shutdown-step-3.svg new file mode 100644 index 00000000000..bbcc8254535 --- /dev/null +++ b/doc/smart-switch/high-availability/images/ha-planned-events-shutdown-step-3.svg @@ -0,0 +1,4 @@ + + + +
[Figure: planned shutdown, step 3 — DPU0 Active→Standalone (term N→N+1), with a Shutdown Standby message toward DPU1 (still Standby, term N).]
\ No newline at end of file diff --git a/doc/smart-switch/high-availability/images/ha-planned-events-shutdown-step-4.svg b/doc/smart-switch/high-availability/images/ha-planned-events-shutdown-step-4.svg new file mode 100644 index 00000000000..1fa18aaf9fb --- /dev/null +++ b/doc/smart-switch/high-availability/images/ha-planned-events-shutdown-step-4.svg @@ -0,0 +1,4 @@ + + + +
[Figure: planned shutdown, step 4 — DPU1 Standby→Destroying (term N), HAState Changed; DPU0 Standalone (term N+1); next hops remain on DPU0.]
\ No newline at end of file diff --git a/doc/smart-switch/high-availability/images/ha-planned-events-shutdown-step-5.svg b/doc/smart-switch/high-availability/images/ha-planned-events-shutdown-step-5.svg new file mode 100644 index 00000000000..2211f49f760 --- /dev/null +++ b/doc/smart-switch/high-availability/images/ha-planned-events-shutdown-step-5.svg @@ -0,0 +1,4 @@ + + + +
[Figure: planned shutdown, step 5 — DPU1 Destroying→Dead (term N→0), HAState Changed; DPU0 Standalone (term N+1); next hops remain on DPU0.]
\ No newline at end of file diff --git a/doc/smart-switch/high-availability/images/ha-planned-events-standalone-pin.svg b/doc/smart-switch/high-availability/images/ha-planned-events-standalone-pin.svg new file mode 100644 index 00000000000..a43198d3ee6 --- /dev/null +++ b/doc/smart-switch/high-availability/images/ha-planned-events-standalone-pin.svg @@ -0,0 +1,4 @@ + + + +
[Figure: pinning standalone — the upstream service sets the desired state of DPU0 to Standalone; DPU0 Active→Standalone (term N→N+1), DPU1 stays Standby (term N); next hops on NPU0, NPU1 and NPU2 stay on DPU0.]
\ No newline at end of file diff --git a/doc/smart-switch/high-availability/images/ha-planned-events-standalone-unpin.svg b/doc/smart-switch/high-availability/images/ha-planned-events-standalone-unpin.svg new file mode 100644 index 00000000000..9d273f26c07 --- /dev/null +++ b/doc/smart-switch/high-availability/images/ha-planned-events-standalone-unpin.svg @@ -0,0 +1,4 @@ + + + +
[Figure: unpinning standalone — the upstream service sets the desired state of DPU0 back to Active; DPU0 Standalone→Active (term N→N+1), DPU1 stays Standby (term N); next hops unchanged (DPU0).]
\ No newline at end of file diff --git a/doc/smart-switch/high-availability/images/ha-planned-events-switchover-step-1.svg b/doc/smart-switch/high-availability/images/ha-planned-events-switchover-step-1.svg new file mode 100644 index 00000000000..8b73b5d046e --- /dev/null +++ b/doc/smart-switch/high-availability/images/ha-planned-events-switchover-step-1.svg @@ -0,0 +1,4 @@ + + + +
[Figure: planned switchover, step 1 — DPU0 Active (term N), DPU1 Standby (term N), next hops on DPU0; the upstream service sets DesiredState to Active on the standby (DPU1) and to None on the current active (DPU0).]
\ No newline at end of file diff --git a/doc/smart-switch/high-availability/images/ha-planned-events-switchover-step-2.svg b/doc/smart-switch/high-availability/images/ha-planned-events-switchover-step-2.svg new file mode 100644 index 00000000000..93ca5ccd5b2 --- /dev/null +++ b/doc/smart-switch/high-availability/images/ha-planned-events-switchover-step-2.svg @@ -0,0 +1,4 @@ + + + +
[Figure: planned switchover, step 2 — states unchanged (DPU0 Active, DPU1 Standby, term N); a Request Switchover notification goes to the upstream service, which sets the approved switchover id.]
\ No newline at end of file diff --git a/doc/smart-switch/high-availability/images/ha-planned-events-switchover-step-3.svg b/doc/smart-switch/high-availability/images/ha-planned-events-switchover-step-3.svg new file mode 100644 index 00000000000..6c8f7b06e41 --- /dev/null +++ b/doc/smart-switch/high-availability/images/ha-planned-events-switchover-step-3.svg @@ -0,0 +1,4 @@ + + + +
[Figure: planned switchover, step 3 — on the Switch Over command, DPU1 Standby→SwitchingToActive (term N); DPU0 still Active (term N); next hops still on DPU0.]
\ No newline at end of file diff --git a/doc/smart-switch/high-availability/images/ha-planned-events-switchover-step-4.svg b/doc/smart-switch/high-availability/images/ha-planned-events-switchover-step-4.svg new file mode 100644 index 00000000000..8cc0b9cd36c --- /dev/null +++ b/doc/smart-switch/high-availability/images/ha-planned-events-switchover-step-4.svg @@ -0,0 +1,4 @@ + + + +
[Figure: planned switchover, step 4 — DPU0 Active→SwitchingToStandby (term N), HAState Changed; DPU1 SwitchingToActive (term N).]
\ No newline at end of file diff --git a/doc/smart-switch/high-availability/images/ha-planned-events-switchover-step-5.svg b/doc/smart-switch/high-availability/images/ha-planned-events-switchover-step-5.svg new file mode 100644 index 00000000000..b09da4b2e7f --- /dev/null +++ b/doc/smart-switch/high-availability/images/ha-planned-events-switchover-step-5.svg @@ -0,0 +1,4 @@ + + + +
[Figure: planned switchover, step 5 — DPU1 SwitchingToActive→Active (term N), HAState Changed; Update NextHop moves the next hops on NPU0 and NPU1 from DPU0→DPU1.]
\ No newline at end of file diff --git a/doc/smart-switch/high-availability/images/ha-planned-events-switchover-step-6.svg b/doc/smart-switch/high-availability/images/ha-planned-events-switchover-step-6.svg new file mode 100644 index 00000000000..d2296cc6f8a --- /dev/null +++ b/doc/smart-switch/high-availability/images/ha-planned-events-switchover-step-6.svg @@ -0,0 +1,4 @@ + + + +
[Figure: planned switchover, step 6 — DPU0 SwitchingToStandby→Standby (term N), HAState Changed; next hops now on DPU1.]
\ No newline at end of file diff --git a/doc/smart-switch/high-availability/images/ha-scope.svg b/doc/smart-switch/high-availability/images/ha-scope.svg new file mode 100644 index 00000000000..e0692c5085f --- /dev/null +++ b/doc/smart-switch/high-availability/images/ha-scope.svg @@ -0,0 +1,4 @@ + + + +
[Figure: HA scope — HA pairing is per ENI, not per DPU: Smart Switch 0 (NPU0) hosts ENI1 (Active) and ENI2 (Standby), while Smart Switch 1 (NPU1) hosts ENI1 (Standby) and ENI2 (Active).]
\ No newline at end of file diff --git a/doc/smart-switch/high-availability/images/ha-state-transition.svg b/doc/smart-switch/high-availability/images/ha-state-transition.svg new file mode 100644 index 00000000000..2ed22f16883 --- /dev/null +++ b/doc/smart-switch/high-availability/images/ha-state-transition.svg @@ -0,0 +1,4 @@ + + + +
[Figure: HA state transition diagram — states: Dead, Connecting, Connected, InitializingToActive, InitializingToStandby, Active, Standby, Standalone, SwitchingToActive, SwitchingToStandby, Destroying. Labeled transitions include: "After launching" (Dead→Connecting), "Connected to peer" (Connecting→Connected), "Win the vote" / "Lose the vote" (Connected→InitializingToActive / InitializingToStandby), "After standby is ready", "After bulk sync is done", "Received switchover request" and "After standby is ready for switchover" (the SwitchingToActive / SwitchingToStandby pair), "After old standby becoming active", "After old active starts tunneling traffic across", "Peer planned shutdown" (→Destroying), "After traffic is drained" (→Dead), plus the unplanned edges "Problem detected and win the check" / "Problem detected and lose the check" (entry into Standalone), "Bulk sync done", and "After problem resolved" (recovery out of Standalone). Legend: planned vs. unplanned transitions.]
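
To make the diagram easier to consume programmatically, here is a minimal Python sketch of the same state machine. The state names come from the diagram; the transition set is reconstructed from the planned/unplanned event figures in this folder, so treat it as illustrative rather than as the hamgrd implementation:

```python
from enum import Enum

class HaState(Enum):
    DEAD = "Dead"
    CONNECTING = "Connecting"
    CONNECTED = "Connected"
    INITIALIZING_TO_ACTIVE = "InitializingToActive"
    INITIALIZING_TO_STANDBY = "InitializingToStandby"
    ACTIVE = "Active"
    STANDBY = "Standby"
    STANDALONE = "Standalone"
    SWITCHING_TO_ACTIVE = "SwitchingToActive"
    SWITCHING_TO_STANDBY = "SwitchingToStandby"
    DESTROYING = "Destroying"

S = HaState
ALLOWED = {
    (S.DEAD, S.CONNECTING),                    # after launching
    (S.CONNECTING, S.CONNECTED),               # connected to peer
    (S.CONNECTING, S.STANDALONE),              # peer dead / peer not found
    (S.CONNECTED, S.INITIALIZING_TO_ACTIVE),   # win the vote
    (S.CONNECTED, S.INITIALIZING_TO_STANDBY),  # lose the vote
    (S.INITIALIZING_TO_ACTIVE, S.ACTIVE),
    (S.INITIALIZING_TO_STANDBY, S.STANDBY),    # after bulk sync is done
    (S.ACTIVE, S.SWITCHING_TO_STANDBY),        # planned switchover
    (S.STANDBY, S.SWITCHING_TO_ACTIVE),        # received switchover request
    (S.SWITCHING_TO_ACTIVE, S.ACTIVE),
    (S.SWITCHING_TO_STANDBY, S.STANDBY),       # after old standby becomes active
    (S.ACTIVE, S.STANDALONE),                  # pinned / peer shutdown / problem detected
    (S.STANDBY, S.STANDALONE),                 # peer DPU lost
    (S.STANDALONE, S.ACTIVE),                  # peer rejoins / unpinned / problem resolved
    (S.STANDBY, S.INITIALIZING_TO_STANDBY),    # re-sync when the peer exits standalone
    (S.STANDBY, S.DESTROYING),                 # planned shutdown of the standby
    (S.DESTROYING, S.DEAD),                    # teardown complete
}

def transit(cur: HaState, nxt: HaState) -> HaState:
    """Validate a transition against the diagram before applying it."""
    if (cur, nxt) not in ALLOWED:
        raise ValueError(f"illegal HA transition: {cur.value} -> {nxt.value}")
    return nxt
```
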
\ No newline at end of file diff --git a/doc/smart-switch/high-availability/images/ha-unplanned-events-enter-standalone-with-peer-down-step-1.svg b/doc/smart-switch/high-availability/images/ha-unplanned-events-enter-standalone-with-peer-down-step-1.svg new file mode 100644 index 00000000000..87d1a9d23ba --- /dev/null +++ b/doc/smart-switch/high-availability/images/ha-unplanned-events-enter-standalone-with-peer-down-step-1.svg @@ -0,0 +1,4 @@ + + + +
[Figure: unplanned events, entering standalone with peer down, step 1 — DPU0 Active (term N), DPU1 Standby (term N); next hops on NPU0, NPU1 and NPU2 point to DPU0; a DPU critical event fires on DPU0.]
\ No newline at end of file diff --git a/doc/smart-switch/high-availability/images/ha-unplanned-events-enter-standalone-with-peer-down-step-2.svg b/doc/smart-switch/high-availability/images/ha-unplanned-events-enter-standalone-with-peer-down-step-2.svg new file mode 100644 index 00000000000..41a892a4c1f --- /dev/null +++ b/doc/smart-switch/high-availability/images/ha-unplanned-events-enter-standalone-with-peer-down-step-2.svg @@ -0,0 +1,4 @@ + + + +
[Figure: entering standalone with peer down, step 2 — DPU0 goes Dead (term 0); on Peer DPU Lost, DPU1 Standby→Standalone (term N→N+1), HAState Changed; probe updates move the next hops on NPU0, NPU1 and NPU2 from DPU0→DPU1.]
\ No newline at end of file diff --git a/doc/smart-switch/high-availability/images/ha-unplanned-events-enter-standalone-with-peer-up-step-1.svg b/doc/smart-switch/high-availability/images/ha-unplanned-events-enter-standalone-with-peer-up-step-1.svg new file mode 100644 index 00000000000..f1887c8fe63 --- /dev/null +++ b/doc/smart-switch/high-availability/images/ha-unplanned-events-enter-standalone-with-peer-up-step-1.svg @@ -0,0 +1,4 @@ + + + +
[Figure: entering standalone with peer up, step 1 — DPU0 Active (term N), DPU1 Standby (term N); next hops on DPU0 (NPU0, NPU1, NPU2); packet loss is detected.]
\ No newline at end of file diff --git a/doc/smart-switch/high-availability/images/ha-unplanned-events-enter-standalone-with-peer-up-step-2-standby.svg b/doc/smart-switch/high-availability/images/ha-unplanned-events-enter-standalone-with-peer-up-step-2-standby.svg new file mode 100644 index 00000000000..cde83858ead --- /dev/null +++ b/doc/smart-switch/high-availability/images/ha-unplanned-events-enter-standalone-with-peer-up-step-2-standby.svg @@ -0,0 +1,4 @@ + + + +
[Figure: entering standalone with peer up, step 2 (standby to standalone) — DPU0 Standby→Standalone (term N→N+1), DPU1 Active→Standby (term N), HAState Changed; UpdateNextHop moves the next hops on NPU0, NPU1 and NPU2 from DPU1→DPU0.]
\ No newline at end of file diff --git a/doc/smart-switch/high-availability/images/ha-unplanned-events-enter-standalone-with-peer-up-step-2.svg b/doc/smart-switch/high-availability/images/ha-unplanned-events-enter-standalone-with-peer-up-step-2.svg new file mode 100644 index 00000000000..f066f72baf1 --- /dev/null +++ b/doc/smart-switch/high-availability/images/ha-unplanned-events-enter-standalone-with-peer-up-step-2.svg @@ -0,0 +1,4 @@ + + + +
[Figure: entering standalone with peer up, step 2 (active to standalone) — DPU0 Active→Standalone (term N→N+1), HAState Changed; DPU1 stays Standby (term N); next hops stay on DPU0.]
\ No newline at end of file diff --git a/doc/smart-switch/high-availability/images/ha-unplanned-events-exit-standalone-step-1.svg b/doc/smart-switch/high-availability/images/ha-unplanned-events-exit-standalone-step-1.svg new file mode 100644 index 00000000000..b66d51ff6dd --- /dev/null +++ b/doc/smart-switch/high-availability/images/ha-unplanned-events-exit-standalone-step-1.svg @@ -0,0 +1,4 @@ + + + +
[Figure: exiting standalone, step 1 — DPU0 Standalone (term N+1), DPU1 Standby (term N), next hops on DPU0; a problem-solved signal arrives.]
\ No newline at end of file diff --git a/doc/smart-switch/high-availability/images/ha-unplanned-events-exit-standalone-step-2.svg b/doc/smart-switch/high-availability/images/ha-unplanned-events-exit-standalone-step-2.svg new file mode 100644 index 00000000000..d5f5a3f7f72 --- /dev/null +++ b/doc/smart-switch/high-availability/images/ha-unplanned-events-exit-standalone-step-2.svg @@ -0,0 +1,4 @@ + + + +
[Figure: exiting standalone, step 2 — on Exit Standalone, DPU1 Standby→InitializingToStandby (term N); DPU0 still Standalone (term N+1).]
\ No newline at end of file diff --git a/doc/smart-switch/high-availability/images/ha-unplanned-events-exit-standalone-step-3.svg b/doc/smart-switch/high-availability/images/ha-unplanned-events-exit-standalone-step-3.svg new file mode 100644 index 00000000000..da79ff99349 --- /dev/null +++ b/doc/smart-switch/high-availability/images/ha-unplanned-events-exit-standalone-step-3.svg @@ -0,0 +1,4 @@ + + + +
[Figure: exiting standalone, step 3 — DPU0 Standalone→Active (term N+1→N+2), HAState Changed; bulk sync toward DPU1 (InitializingToStandby, term N) starts.]
\ No newline at end of file diff --git a/doc/smart-switch/high-availability/images/ha-unplanned-events-exit-standalone-step-4.svg b/doc/smart-switch/high-availability/images/ha-unplanned-events-exit-standalone-step-4.svg new file mode 100644 index 00000000000..1b4148775d3 --- /dev/null +++ b/doc/smart-switch/high-availability/images/ha-unplanned-events-exit-standalone-step-4.svg @@ -0,0 +1,4 @@ + + + +
[Figure: exiting standalone, step 4 — bulk sync done; DPU1 InitializingToStandby→Standby (term N→N+2); DPU0 Active (term N+2); next hops stay on DPU0.]
\ No newline at end of file diff --git a/doc/smart-switch/high-availability/images/npu-to-dpu-card-level-probe-local.svg b/doc/smart-switch/high-availability/images/npu-to-dpu-card-level-probe-local.svg new file mode 100644 index 00000000000..5d6c680bf9c --- /dev/null +++ b/doc/smart-switch/high-availability/images/npu-to-dpu-card-level-probe-local.svg @@ -0,0 +1,4 @@ + + + +
[Figure: DPU probing (local) — NPU1 (lo: 1.1.1.1) runs a single-hop BFD session to its local DPU1 (VIP: 2.2.2.1, lo: 10.0.0.1) through FP-PortX and the DPU ports. Probe packet: Eth SRC=<NPU1 MAC>; IP SRC=<NPU1 IP>, DST=<DPU1 IP>; UDP DstPort=3784; BFD payload — answered symmetrically by the DPU. Legend: request (NPU to DPU) vs. request (DPU to NPU).]
\ No newline at end of file diff --git a/doc/smart-switch/high-availability/images/npu-to-dpu-card-level-probe-remote.svg b/doc/smart-switch/high-availability/images/npu-to-dpu-card-level-probe-remote.svg new file mode 100644 index 00000000000..d5051699414 --- /dev/null +++ b/doc/smart-switch/high-availability/images/npu-to-dpu-card-level-probe-remote.svg @@ -0,0 +1,4 @@ + + + +
[Figure: DPU probing (remote, multi-hop) — NPU1 (lo: 1.1.1.1) on Smart Switch 1 runs a multi-hop BFD session to DPU2 (VIP: 2.2.2.2, lo: 10.0.0.2) on Smart Switch 2. Probe packet: Eth SRC=<NPU1 MAC>; IP SRC=<NPU1 IP>, DST=<DPU2 IP>; UDP DstPort=4784; BFD payload — answered symmetrically. Note: the DPU IP is advertised in the entire network, so NPU1 can directly use DPU2's IP to talk to it.]
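
An illustrative scapy sketch of the two probe encapsulations shown in the last two figures. The destination ports follow the diagrams (and match RFC 5881 / RFC 5883): 3784 for the local single-hop probe and 4784 for the remote multi-hop probe; the addresses are placeholders and the BFD control payload is stubbed out:

```python
from scapy.all import IP, UDP, Raw

bfd_stub = Raw(b"\x00" * 24)  # stand-in for a real BFD control packet

# NPU1 (lo: 1.1.1.1) probing its local DPU1 (lo: 10.0.0.1): single-hop BFD.
local_probe = IP(src="1.1.1.1", dst="10.0.0.1") / UDP(dport=3784) / bfd_stub

# NPU1 probing the remote DPU2 (lo: 10.0.0.2): multi-hop BFD.
remote_probe = IP(src="1.1.1.1", dst="10.0.0.2") / UDP(dport=4784) / bfd_stub
```
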
\ No newline at end of file diff --git a/doc/smart-switch/high-availability/images/npu-to-dpu-tunnel-2-stages.svg b/doc/smart-switch/high-availability/images/npu-to-dpu-tunnel-2-stages.svg new file mode 100644 index 00000000000..2c8b5392d39 --- /dev/null +++ b/doc/smart-switch/high-availability/images/npu-to-dpu-tunnel-2-stages.svg @@ -0,0 +1,4 @@ + + + +
[Figure: NPU-to-DPU tunnel, two-stage forwarding — VM1 (CA: 10.0.0.1) on Node1 sends the initial packet (Eth SRC=<VM1 MAC (Initial)>; VxLan SRC=<VM1 PA>, DST=<VM1 DPU VIP>, VNI=<Outbound VNI>; inner Eth/IP SRC=<VM1 CA>, DST=<VM2 CA>). It lands on Smart Switch 3 (NPU3, lo: 1.1.1.3), which wraps it in an NPU-to-NPU tunnel (VxLan SRC=<NPU3 PA>, DST=<NPU2 PA>, VNI=<NPU Tunnel VNI>) toward Smart Switch 2. After decap, the standby DPU2 (VIP: 2.2.2.1, PA: 10.0.0.2) updates the existing outermost tunnel to forward the packet to the active DPU1 (standby-to-active packet: VxLan SRC=<VM1 PA>, DST=<DPU1 PA>, VNI=<Outbound VNI>), and DPU1 emits the outgoing packet (VxLan SRC=<DPU1 VIP>, DST=<DPU2 VIP>, VNI=<VNET VNI>) toward the VM2 DPU pair (VIP: 2.2.2.2). Notes: the PA of each DPU is routable within the regional network, just like normal machines; the flow replication process is omitted here — see the DPU HA (flow replication) graph for more details. Legend: NPU-to-NPU tunnel, initial, outgoing, and standby-to-active requests.]
\ No newline at end of file diff --git a/doc/smart-switch/high-availability/images/npu-to-dpu-tunnel-from-dpu.svg b/doc/smart-switch/high-availability/images/npu-to-dpu-tunnel-from-dpu.svg new file mode 100644 index 00000000000..4fc651952fa --- /dev/null +++ b/doc/smart-switch/high-availability/images/npu-to-dpu-tunnel-from-dpu.svg @@ -0,0 +1,4 @@ + + + +
[Figure: NPU-to-DPU tunnel, forwarding from a DPU — VM1's initial packet (VxLan SRC=<VM1 PA>, DST=<VM1 DPU VIP>, VNI=<Outbound VNI>) reaches the standby DPU2 (VIP: 2.2.2.1, PA: 10.0.0.2) on Smart Switch 2; DPU2 updates the existing outermost tunnel to forward the packet to the active DPU1 (standby-to-active packet: VxLan SRC=<VM1 PA>, DST=<DPU1 PA>, VNI=<Outbound VNI>); DPU1 then emits the outgoing packet (VxLan SRC=<VM1 DPU VIP>, DST=<VM2 DPU VIP>, VNI=<VNET VNI>) toward the VM2 DPU pair (VIP: 2.2.2.2). Notes: the PA of each DPU is routable within the regional network; the flow replication process is omitted here — see the DPU HA (flow replication) graph. Legend: NPU-to-DPU tunnel, initial, outgoing, and standby-to-active requests.]
\ No newline at end of file diff --git a/doc/smart-switch/high-availability/images/npu-to-dpu-tunnel-local.svg b/doc/smart-switch/high-availability/images/npu-to-dpu-tunnel-local.svg new file mode 100644 index 00000000000..8a6d3d3da48 --- /dev/null +++ b/doc/smart-switch/high-availability/images/npu-to-dpu-tunnel-local.svg @@ -0,0 +1,4 @@ + + + +
[Figure: NPU-to-DPU tunnel, local DPU — VM1 (CA: 10.0.0.1) on Node1 sends traffic toward its owning DPU1 (VIP: 2.2.2.1, IP: 100.0.0.1) on Smart Switch 1 (NPU1, lo: 1.1.1.1). Packet before DPU1: Eth SRC=<VM1 MAC (Initial)>; VxLan SRC=<VM1 PA>, DST=<DPU1 VIP>, VNI=<VM Tunnel VNI>; inner Eth SRC=<VM1 MAC>, DST=<VM2 MAC>; IP SRC=<VM1 CA>, DST=<VM2 CA>. Notes: traffic between a host and its owning DPU always uses the hardcoded reserved VNI (VM Tunnel VNI) for encap; for traffic coming from a SmartSwitch-enabled VNET, the NPU uses the protocol DST IP + inner source MAC to look up the next hop (DPU); for a local DPU, the NPU moves the packet directly to the DPU port and sends it out without encap — the NPU tunnel is used only when talking to remote DPUs; the NPU, not the DPU, uses BGP to advertise the DP VIPs to attract traffic.]
\ No newline at end of file diff --git a/doc/smart-switch/high-availability/images/npu-to-dpu-tunnel-remote.svg b/doc/smart-switch/high-availability/images/npu-to-dpu-tunnel-remote.svg new file mode 100644 index 00000000000..99b25e390d6 --- /dev/null +++ b/doc/smart-switch/high-availability/images/npu-to-dpu-tunnel-remote.svg @@ -0,0 +1,4 @@ + + + +
[Figure: NPU-to-DPU tunnel, remote DPU — reference note: to tell a flow replication packet from a tunneled packet, the tunneled packet is VxLan with a dedicated NPU tunnel VNI, while the flow replication packet is a generic UDP packet (format: Eth SRC=<DPU1 MAC (Initial)>; UDP SRC=<DPU1 PA>:*, DST=<DPU2 PA>:<DP Channel Port>; vendor-defined metadata plus the inner packet). Data path: VM1 (CA: 10.0.0.1) on Node1 sends the initial packet (VxLan SRC=<VM1 PA>, DST=<VM1 DPU VIP>, VNI=<Outbound VNI>) to NPU2 (Smart Switch 2), which matches DST IP + inner source MAC to find the active DPU and adds an NPU-to-NPU tunnel (VxLan SRC=<NPU2 PA>, DST=<NPU1 PA>, VNI=<NPU Tunnel VNI>); NPU1 terminates the tunnel, matches DST IP + inner source MAC again and moves the packet to its local, active DPU1 (VIP: 2.2.2.1, PA: 10.0.0.1), which emits the outgoing packet (VxLan SRC=<VM1 DPU VIP>, DST=<VM2 DPU VIP>, VNI=<VNET VNI>) toward the VM2 DPU pair (VIP: 2.2.2.2). Notes: the PA of each DPU is routable within the regional network; the flow replication process is omitted here; packet drop on source IP == destination IP is disabled, in case VM2 is also within this DPU pair. Legend: NPU-to-DPU tunnel, initial, and outgoing requests.]
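
The reference note above distinguishes the two packet types purely by their encapsulation, which is easy to express as a rough scapy-based classifier. The VNI and port values below are placeholders, not values defined by this document:

```python
from scapy.all import UDP, Packet
from scapy.layers.vxlan import VXLAN

NPU_TUNNEL_VNI = 4000    # placeholder for <NPU Tunnel VNI>
DP_CHANNEL_PORT = 11000  # placeholder for <DP Channel Port>
VXLAN_PORT = 4789        # standard VxLan UDP port

def classify(pkt: Packet) -> str:
    """Tell an NPU tunneled packet from a flow replication packet."""
    if UDP not in pkt:
        return "other"
    if pkt[UDP].dport == VXLAN_PORT and VXLAN in pkt and pkt[VXLAN].vni == NPU_TUNNEL_VNI:
        return "npu-to-npu-tunnel"
    if pkt[UDP].dport == DP_CHANNEL_PORT:
        return "flow-replication"   # generic UDP on the DP channel port
    return "other"
```
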
\ No newline at end of file
diff --git a/doc/smart-switch/high-availability/smart-switch-ha-detailed-design.md b/doc/smart-switch/high-availability/smart-switch-ha-detailed-design.md
new file mode 100644
index 00000000000..d53ab84e6fe
--- /dev/null
+++ b/doc/smart-switch/high-availability/smart-switch-ha-detailed-design.md
@@ -0,0 +1,614 @@
# SmartSwitch High Availability Detailed Design

| Rev | Date | Author | Change Description |
| --- | ---- | ------ | ------------------ |
| 0.1 | 10/14/2023 | Riff Jiang | Initial version |

1. [1. Database Schema](#1-database-schema)
   1. [1.1. High level data flow](#11-high-level-data-flow)
   2. [1.2. NPU DB schema](#12-npu-db-schema)
      1. [1.2.1. CONFIG DB](#121-config-db)
         1. [1.2.1.1. DPU / vDPU definitions](#1211-dpu--vdpu-definitions)
         2. [1.2.1.2. HA global configurations](#1212-ha-global-configurations)
      2. [1.2.2. APPL DB](#122-appl-db)
         1. [1.2.2.1. HA set configurations](#1221-ha-set-configurations)
         2. [1.2.2.2. ENI placement configurations](#1222-eni-placement-configurations)
         3. [1.2.2.3. ENI configurations](#1223-eni-configurations)
      3. [1.2.3. State DB](#123-state-db)
         1. [1.2.3.1. DPU / vDPU state](#1231-dpu--vdpu-state)
      4. [1.2.4. DASH State DB](#124-dash-state-db)
         1. [1.2.4.1. DPU / vDPU HA states](#1241-dpu--vdpu-ha-states)
         2. [1.2.4.2. ENI state](#1242-eni-state)
   3. [1.3. DPU DB schema](#13-dpu-db-schema)
      1. [1.3.1. APPL DB](#131-appl-db)
         1. [1.3.1.1. HA set configurations](#1311-ha-set-configurations)
         2. [1.3.1.2. DASH ENI object table](#1312-dash-eni-object-table)
      2. [1.3.2. State DB](#132-state-db)
         1. [1.3.2.1. ENI HA state](#1321-eni-ha-state)
2. [2. Telemetry](#2-telemetry)
   1. [2.1. HA state](#21-ha-state)
   2. [2.2. HA operation counters](#22-ha-operation-counters)
      1. [2.2.1. hamgrd HA operation counters](#221-hamgrd-ha-operation-counters)
      2. [2.2.2. HA SAI API counters](#222-ha-sai-api-counters)
   3. [2.3. HA control plane communication channel related](#23-ha-control-plane-communication-channel-related)
      1. [2.3.1. HA control plane control channel counters](#231-ha-control-plane-control-channel-counters)
      2. [2.3.2. HA control plane data channel counters](#232-ha-control-plane-data-channel-counters)
         1. [2.3.2.1. Per bulk sync flow receive server counters](#2321-per-bulk-sync-flow-receive-server-counters)
         2. [2.3.2.2. Per ENI counters](#2322-per-eni-counters)
   4. [2.4. NPU-to-DPU tunnel related (NPU side)](#24-npu-to-dpu-tunnel-related-npu-side)
      1. [2.4.1. NPU-to-DPU probe status](#241-npu-to-dpu-probe-status)
      2. [2.4.2. NPU-to-DPU data plane state](#242-npu-to-dpu-data-plane-state)
      3. [2.4.3. NPU-to-DPU tunnel counters](#243-npu-to-dpu-tunnel-counters)
   5. [2.5. NPU-to-DPU tunnel related (DPU side)](#25-npu-to-dpu-tunnel-related-dpu-side)
   6. [2.6. DPU-to-DPU data plane channel related](#26-dpu-to-dpu-data-plane-channel-related)
   7. [2.7. DPU ENI pipeline related](#27-dpu-eni-pipeline-related)
3. [3. SAI APIs](#3-sai-apis)
4. [4. CLI commands](#4-cli-commands)

## 1. Database Schema

NOTE: Only the configuration related to HA is listed here; please check the [SONiC-DASH HLD](https://github.com/sonic-net/SONiC/blob/master/doc/dash/dash-sonic-hld.md) for the other fields.
### 1.1. High level data flow

```mermaid
flowchart LR
    NC[Network Controllers]
    SC[SDN Controllers]

    subgraph NPU Components
        NPU_SWSS[swss]
        NPU_HAMGRD[hamgrd]

        subgraph CONFIG DB
            NPU_DPU[DPU_TABLE]
            NPU_VDPU[VDPU_TABLE]
            NPU_DASH_HA_GLOBAL_CONFIG[DASH_HA_GLOBAL_CONFIG]
        end

        subgraph APPL DB
            NPU_BFD_SESSION[BFD_SESSION_TABLE]
            NPU_ACL_TABLE[APP_TABLE_TABLE]
            NPU_ACL_RULE[APP_RULE_TABLE]
        end

        subgraph DASH APPL DB
            NPU_DASH_HA_SET[DASH_HA_SET_TABLE]
            NPU_DASH_ENI_PLACEMENT[DASH_ENI_PLACEMENT_TABLE]
            NPU_DASH_ENI_HA_CONFIG[DASH_ENI_HA_CONFIG_TABLE]
        end

        subgraph STATE DB
            NPU_DPU_STATE[DPU_TABLE]
            NPU_VDPU_STATE[VDPU_TABLE]
            NPU_BFD_SESSION_STATE[BFD_SESSION_TABLE]
            NPU_ACL_TABLE_STATE[APP_TABLE_TABLE]
            NPU_ACL_RULE_STATE[APP_RULE_TABLE]
        end

        subgraph DASH STATE DB
            NPU_DASH_DPU_HA_STATE[DASH_DPU_HA_STATE_TABLE]
            NPU_DASH_VDPU_HA_STATE[DASH_VDPU_HA_STATE_TABLE]
            NPU_DASH_ENI_HA_STATE[DASH_ENI_HA_STATE_TABLE]
            NPU_DASH_ENI_DP_STATE[DASH_ENI_DP_STATE_TABLE]
        end

        subgraph DASH COUNTER DB
            NPU_DASH_COUNTERS[DASH_*_COUNTER_TABLE]
        end
    end

    subgraph "DPU0 Components (Same for other DPUs)"
        DPU_SWSS[swss]

        subgraph DASH APPL DB
            DPU_DASH_HA_SET[DASH_HA_SET_TABLE]
            DPU_DASH_ENI[DASH_ENI_TABLE]
            DPU_DASH_ENI_HA_BULK_SYNC_SESSION[DASH_ENI_HA_BULK_SYNC_SESSION_TABLE]
        end

        subgraph DASH STATE DB
            DPU_DASH_ENI_HA_STATE[DASH_ENI_HA_STATE_TABLE]
        end

        subgraph DASH COUNTER DB
            DPU_DASH_COUNTERS[DASH_*_COUNTER_TABLE]
        end
    end

    %% Upstream services --> northbound interfaces:
    NC --> |gNMI| NPU_DPU
    NC --> |gNMI| NPU_VDPU
    NC --> |gNMI| NPU_DASH_HA_GLOBAL_CONFIG

    SC --> |gNMI| NPU_DASH_HA_SET
    SC --> |gNMI| NPU_DASH_ENI_PLACEMENT
    SC --> |gNMI| NPU_DASH_ENI_HA_CONFIG
    SC --> |gNMI| DPU_DASH_ENI

    %% NPU tables --> NPU side SWSS:
    NPU_DPU --> NPU_SWSS
    NPU_VDPU --> NPU_SWSS
    NPU_BFD_SESSION --> |ConsumerStateTable| NPU_SWSS
    NPU_ACL_TABLE --> |ConsumerStateTable| NPU_SWSS
    NPU_ACL_RULE --> |ConsumerStateTable| NPU_SWSS

    %% NPU side SWSS --> NPU tables:
    NPU_SWSS --> NPU_DPU_STATE
    NPU_SWSS --> NPU_VDPU_STATE
    NPU_SWSS --> NPU_BFD_SESSION_STATE
    NPU_SWSS --> NPU_ACL_TABLE_STATE
    NPU_SWSS --> NPU_ACL_RULE_STATE
    NPU_SWSS --> |Forward BFD Update| NPU_HAMGRD

    %% NPU tables --> hamgrd:
    NPU_DPU --> NPU_HAMGRD
    NPU_VDPU --> NPU_HAMGRD
    NPU_DPU_STATE --> |Direct Table Query| NPU_HAMGRD
    NPU_VDPU_STATE --> |Direct Table Query| NPU_HAMGRD
    NPU_DASH_HA_GLOBAL_CONFIG --> NPU_HAMGRD
    NPU_DASH_HA_SET --> NPU_HAMGRD
    NPU_DASH_ENI_PLACEMENT --> NPU_HAMGRD
    NPU_DASH_ENI_HA_CONFIG --> NPU_HAMGRD
    NPU_DASH_COUNTERS --> |Direct Table Query| NPU_HAMGRD

    %% DPU tables --> hamgrd:
    DPU_DASH_ENI_HA_STATE --> NPU_HAMGRD
    DPU_DASH_COUNTERS --> |Direct Table Query| NPU_HAMGRD

    %% hamgrd --> NPU tables:
    NPU_HAMGRD --> NPU_DASH_DPU_HA_STATE
    NPU_HAMGRD --> NPU_DASH_VDPU_HA_STATE
    NPU_HAMGRD --> NPU_DASH_ENI_HA_STATE
    NPU_HAMGRD --> NPU_DASH_ENI_DP_STATE
    NPU_HAMGRD --> |ProducerStateTable| NPU_BFD_SESSION
    NPU_HAMGRD --> |ProducerStateTable| NPU_ACL_TABLE
    NPU_HAMGRD --> |ProducerStateTable| NPU_ACL_RULE

    %% hamgrd --> DPU tables:
    NPU_HAMGRD --> DPU_DASH_HA_SET
    NPU_HAMGRD --> DPU_DASH_ENI_HA_BULK_SYNC_SESSION

    %% DPU tables --> DPU SWSS:
    DPU_DASH_ENI --> DPU_SWSS
    DPU_DASH_HA_SET --> DPU_SWSS
    DPU_DASH_ENI_HA_BULK_SYNC_SESSION --> DPU_SWSS

    %% DPU swss --> DPU tables:
    DPU_SWSS --> DPU_DASH_ENI_HA_STATE
    DPU_SWSS --> DPU_DASH_COUNTERS
```
### 1.2. NPU DB schema

#### 1.2.1. CONFIG DB

##### 1.2.1.1. DPU / vDPU definitions

* These tables are imported from the SmartSwitch HLD to make this doc more convenient to read; that doc should always be used as the source of truth.
* These tables should be prepopulated before any of the HA configuration tables below are programmed.

| Table | Key | Field | Description |
| --- | --- | --- | --- |
| DPU_TABLE | | | Physical DPU configuration. |
| | \<DPU_ID> | | Physical DPU ID. |
| | | type | Type of DPU. It can be "local", "cluster" or "external". |
| | | state | Admin state of the DPU device. |
| | | slot_id | Slot ID of the DPU. |
| | | pa_ipv4 | IPv4 address. |
| | | pa_ipv6 | IPv6 address. |
| | | npu_ipv4 | IPv4 address of its owning NPU loopback. |
| | | npu_ipv6 | IPv6 address of its owning NPU loopback. |
| | | probe_ip | Custom probe point if we prefer to use a different one from the DPU IP address. |
| VDPU_TABLE | | | Virtual DPU configuration. |
| | \<VDPU_ID> | | Virtual DPU ID. |
| | | profile | The profile of the vDPU. |
| | | tier | The tier of the vDPU. |
| | | main_dpu_ids | The IDs of the main physical DPUs. |

##### 1.2.1.2. HA global configurations

* The global configuration is shared by all HA sets and ENIs, and should be programmed on all switches.
* The global configuration should be programmed before any of the HA set configurations below.

| Table | Key | Field | Description |
| --- | --- | --- | --- |
| DASH_HA_GLOBAL_CONFIG | N/A | | HA global configurations. |
| | | dp_channel_dst_port | The destination port used when tunneling packets via the DPU-to-DPU data plane channel. |
| | | dp_channel_src_port_min | The min source port used when tunneling packets via the DPU-to-DPU data plane channel. |
| | | dp_channel_src_port_max | The max source port used when tunneling packets via the DPU-to-DPU data plane channel. |
| | | dp_channel_probe_interval_ms | The interval of sending each DPU-to-DPU data path probe. |
| | | dpu_bfd_probe_multiplier | The number of DPU BFD probe failures before the probe is considered down. |
| | | dpu_bfd_probe_interval_in_ms | The interval of the DPU BFD probe in milliseconds. |

#### 1.2.2. APPL DB

##### 1.2.2.1. HA set configurations

* The HA set table defines which DPUs should form the same HA set and how.
* The HA set table should be programmed on all switches, so we can program the ENI location information and set up the traffic forwarding rules.
* If the HA set contains a local vDPU, it will be copied to the DPU side DB by `hamgrd` as well.

| Table | Key | Field | Description |
| --- | --- | --- | --- |
| DASH_HA_SET_TABLE | | | HA set table, which describes the DPUs that form the HA set. |
| | \<HA_SET_ID> | | HA set ID. |
| | | version | Config version. |
| | | vdpu_ids | The IDs of the vDPUs. |
| | | mode | Mode of the HA set. It can be "activestandby". |
| | | pinned_vdpu_bfd_probe_states | Pinned probe states of the vDPUs, connected by ",". Each state can be "" (none), "up" or "down". |
| | | preferred_standalone_vdpu_index | Preferred vDPU index to be standalone when entering the standalone setup. |

##### 1.2.2.2. ENI placement configurations

* The ENI placement table defines which HA set this ENI belongs to, and how to forward the traffic.
* The ENI placement table should be programmed on all switches.
* Once this table is programmed, `hamgrd` will generate the BFD

| Table | Key | Field | Description |
| --- | --- | --- | --- |
| DASH_ENI_PLACEMENT_TABLE | | | ENI placement. |
| | \<ENI_ID> | | ENI ID. Used for identifying a single ENI. |
| | | version | Config version. |
| | | eni_mac | ENI MAC address. Used to create the NPU side ACL rules that match the incoming packets and forward them to the right DPUs. |
| | | ha_set_id | The HA set ID that this ENI is allocated to. |
| | | pinned_next_hop_index | The index of the pinned next hop DPU for this ENI forwarding rule. "" = not set. |
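
To make the two tables above concrete, here is a hedged sketch of how they could be populated for a two-vDPU active-standby pair hosting one ENI, written as plain Python dicts. All IDs, MACs and values are invented for illustration; only the field names come from the tables above:

```python
# DASH_HA_SET_TABLE entry, key: <HA_SET_ID> = "haset01" (illustrative)
ha_set_entry = {
    "version": "1",
    "vdpu_ids": "vdpu0,vdpu1",
    "mode": "activestandby",
    "pinned_vdpu_bfd_probe_states": ",",    # nothing pinned on either vDPU
    "preferred_standalone_vdpu_index": "0",
}

# DASH_ENI_PLACEMENT_TABLE entry, key: <ENI_ID> = "eni01" (illustrative)
eni_placement_entry = {
    "version": "1",
    "eni_mac": "F4:93:9F:EF:C4:7E",  # drives the NPU side ACL match
    "ha_set_id": "haset01",
    "pinned_next_hop_index": "",     # not pinned
}
```
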
##### 1.2.2.3. ENI configurations

* The ENI HA configuration table contains the ENI-level HA config.
* The ENI HA configuration table only contains the ENIs that are hosted on the local switch.

| Table | Key | Field | Description |
| --- | --- | --- | --- |
| DASH_ENI_HA_CONFIG_TABLE | | | ENI HA configuration. |
| | \<VDPU_ID> | | vDPU ID. Used to identify a single vDPU. |
| | \<ENI_ID> | | ENI ID. Used for identifying a single ENI. |
| | | version | Config version. |
| | | desired_ha_state | The desired state for this ENI. It can only be "" (none), "dead", "active" or "standalone". |
| | | approved_pending_operation_request_id | Approved pending-approval operation ID, e.g. for a switchover operation. |

#### 1.2.3. State DB

##### 1.2.3.1. DPU / vDPU state

The DPU/vDPU state tables store the health states of each DPU/vDPU. These data are collected by `pmon`.

| Table | Key | Field | Description |
| --- | --- | --- | --- |
| DPU_TABLE | | | Physical DPU state. |
| | \<DPU_ID> | | Physical DPU ID. |
| | | health_state | Health state of the DPU device. It can be "healthy" or "unhealthy". Only valid when the DPU is local. |
| | | ... | See the [SONiC-DASH HLD](https://github.com/sonic-net/SONiC/blob/master/doc/dash/dash-sonic-hld.md) for more details. |
| VDPU_TABLE | | | Virtual DPU state. |
| | \<VDPU_ID> | | Virtual DPU ID. |
| | | health_state | Health state of the vDPU. It can be "healthy" or "unhealthy". Only valid when the vDPU is local. |
| | | ... | See the [SONiC-DASH HLD](https://github.com/sonic-net/SONiC/blob/master/doc/dash/dash-sonic-hld.md) for more details. |

#### 1.2.4. DASH State DB

##### 1.2.4.1. DPU / vDPU HA states

| Table | Key | Field | Description |
| --- | --- | --- | --- |
| DASH_HA_DPU_STATE_TABLE | | | HA related physical DPU state. |
| | \<DPU_ID> | | Physical DPU ID. |
| | | card_level_probe_state | Card level probe state. It can be "unknown", "up" or "down". |
| DASH_HA_VDPU_STATE_TABLE | | | HA related virtual DPU state. |
| | \<VDPU_ID> | | Virtual DPU ID. |
| | | card_level_probe_state | Card level probe state. It can be "unknown", "up" or "down". |
##### 1.2.4.2. ENI state

On the NPU side, the ENI state table shows:

* The HA state of each local ENI.
* The traffic forwarding state of all known ENIs.

| Table | Key | Field | Description |
| --- | --- | --- | --- |
| DASH_ENI_HA_STATE_TABLE | | | Data plane state of each ENI that is hosted on the local switch. |
| | \<VDPU_ID> | | vDPU ID. Used to identify a single vDPU. |
| | \<ENI_ID> | | ENI ID. Used to identify a single ENI. |
| | | creation_time_in_ms | ENI creation time in milliseconds. |
| | | last_heartbeat_time_in_ms | ENI last heartbeat time in milliseconds. The heartbeat happens once per minute and will not change the last state updated time. |
| | | last_state_updated_time_in_ms | ENI state last updated time in milliseconds. |
| | | data_path_vip | Data path VIP of the ENI. |
| | | local_ha_state | The state of the HA state machine. This is the state in NPU hamgrd. |
| | | local_ha_state_last_update_time_in_ms | The time when the local target HA state was set. |
| | | local_ha_state_last_update_reason | The reason of the last HA state change. |
| | | local_target_asic_ha_state | The target HA state in the ASIC. This is the state that hamgrd generates and asks the DPU to move to. |
| | | local_acked_asic_ha_state | The HA state that the ASIC acked. |
| | | local_target_term | The current target term of the HA state machine. |
| | | local_acked_term | The current term acked by the ASIC. |
| | | local_bulk_sync_recv_server_endpoints | The IP endpoints used to receive flow records during bulk sync, connected by ",". |
| | | peer_ip | The IP of the peer DPU. |
| | | peer_ha_state | The state of the HA state machine in the peer DPU. |
| | | peer_term | The current term in the peer DPU. |
| | | peer_bulk_sync_recv_server_endpoints | The IP endpoints used to receive flow records during bulk sync, connected by ",". |
| | | ha_operation_type | HA operation type, e.g. "switchover". |
| | | ha_operation_id | HA operation ID (GUID). |
| | | ha_operation_state | HA operation state. It can be "created", "pendingapproval", "approved" or "inprogress". |
| | | ha_operation_start_time_in_ms | The time when the operation was created. |
| | | ha_operation_state_last_update_time_in_ms | The time when the operation state was last updated. |
| | | bulk_sync_start_time_in_ms | Bulk sync start time in milliseconds. |
| DASH_ENI_DP_STATE_TABLE | | | Data plane state of all known ENIs. |
| | \<ENI_ID> | | ENI ID. Used to identify a single ENI. |
| | | ha_set_mode | HA set mode. See [HA set configurations](#1221-ha-set-configurations) for more details. |
| | | next_hops | All possible next hops for this ENI. |
| | | next_hops_types | Type of each next hop, connected by ",". |
| | | next_hops_card_level_probe_states | Card level probe state for each next hop, connected by ",". It can be "unknown", "up" or "down". |
| | | next_hops_active_states | Whether each next hop is set as active by the ENI HA state machine. It can be "unknown", "true" or "false". |
| | | next_hops_final_state | Final state for each next hop, connected by ",". It can be "up" or "down". |

### 1.3. DPU DB schema

#### 1.3.1. APPL DB

##### 1.3.1.1. HA set configurations

If any HA set configuration is related to the local DPU, it will be parsed and programmed to the DPU side DB, where it will be translated into SAI API calls and sent to the ASIC by the DPU side swss.

| Table | Key | Field | Description |
| --- | --- | --- | --- |
| DASH_HA_SET_TABLE | | | HA set table, which describes the DPUs that form the HA set. |
| | \<HA_SET_ID> | | HA set ID. |
| | | version | Config version. |
| | | mode | Mode of the HA set. It can be "activestandby". |
| | | peer_dpu_ipv4 | The IPv4 address of the peer DPU. |
| | | peer_dpu_ipv6 | The IPv6 address of the peer DPU. |
| | | dp_channel_dst_port | The destination port used when tunneling packets via the DPU-to-DPU data plane channel. |
| | | dp_channel_src_port_min | The min source port used when tunneling packets via the DPU-to-DPU data plane channel. |
| | | dp_channel_src_port_max | The max source port used when tunneling packets via the DPU-to-DPU data plane channel. |
| | | dp_channel_probe_interval_ms | The interval of sending each DPU-to-DPU data path probe. |
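
A sketch of the DPU side DASH_HA_SET_TABLE entry that `hamgrd` could derive from the NPU side "haset01" example earlier, assuming the peer addresses come from the peer DPU definition and the dp_channel_* fields are copied from DASH_HA_GLOBAL_CONFIG. All values are illustrative:

```python
# Derived DPU side entry, key: <HA_SET_ID> = "haset01" (illustrative)
dpu_ha_set_entry = {
    "version": "1",
    "mode": "activestandby",
    "peer_dpu_ipv4": "10.0.0.2",        # assumed: peer DPU_TABLE pa_ipv4
    "peer_dpu_ipv6": "fc00::2",         # assumed: peer DPU_TABLE pa_ipv6
    "dp_channel_dst_port": "11000",     # assumed: copied from DASH_HA_GLOBAL_CONFIG
    "dp_channel_src_port_min": "32768",
    "dp_channel_src_port_max": "60999",
    "dp_channel_probe_interval_ms": "1000",
}
```
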
+| | | ... | see [SONiC-DASH HLD](https://github.com/sonic-net/SONiC/blob/master/doc/dash/dash-sonic-hld.md) for more details. |
+| DASH_ENI_HA_BULK_SYNC_SESSION_TABLE | | | HA bulk sync session table. |
+| | \<ENI_ID\> | | ENI ID. Used to identify a single ENI. |
+| | | session_id | Bulk sync session id. |
+| | | peer_bulk_sync_recv_server_endpoints | The IP endpoints used to receive flow records during bulk sync, connected by ",". |
+
+#### 1.3.2. State DB
+
+##### 1.3.2.1. ENI HA state
+
+* The ENI HA state table contains the ENI-level HA state.
+* The ENI HA state table only contains the ENIs that are hosted on the local DPU.
+
+| Table | Key | Field | Description |
+| --- | --- | --- | --- |
+| DASH_ENI_HA_STATE_TABLE | | | HA state of each ENI that is hosted on the local DPU. |
+| | \<ENI_ID\> | | ENI ID. Used to identify a single ENI. |
+| | | ha_role | The current HA role confirmed by ASIC. It can be "dead", "active", "standby", "standalone", "switching_to_active". |
+| | | term | The current term confirmed by ASIC. |
+| | | ha_role_last_update_time | The time when the HA role was last updated, in milliseconds. |
+| | | bulk_sync_recv_server_endpoints | The IP endpoints used to receive flow records during bulk sync, connected by ",". |
+| | | ongoing_bulk_sync_session_id | Ongoing bulk sync session id. |
+| | | ongoing_bulk_sync_session_start_time_in_ms | Ongoing bulk sync session start time in milliseconds. |
+
+## 2. Telemetry
+
+To properly monitor the HA-related features, we need to add telemetry for them.
+
+The telemetry will cover both states and counters, which can be mapped into `DASH_STATE_DB` or `DASH_COUNTER_DB`.
+
+* For ENI level states and counters in the NPU DB, we will have the `VDPU_ID` in the key as well as the `ENI_ID` to make each counter unique, because an ENI can migrate from one DPU to another on the same switch.
+* For ENI level states and counters in the DPU DB, we don't need to have the `VDPU_ID` in the key, because they are tied to a specific DPU, and we should know which DPU it is during logging.
+
+We will focus only on the HA counters below, which do not include basic counters, such as ENI creation/removal or generic DPU health/critical event counters, even though some of them work closely with HA workflows.
+
+### 2.1. HA state
+
+First of all, we need to store the HA states for us to check.
+
+Please refer to the [ENI state](#1242-eni-state) table in the NPU DB for the detailed DB schema design.
+
+### 2.2. HA operation counters
+
+Besides the HA states, we also need to log all the operations that are related to HA.
+
+HA operations mostly live in 2 places: `hamgrd` for operations coming from northbound interfaces, and syncd for the SAI APIs we call or the SAI notifications we handle related to HA.
+
+#### 2.2.1. hamgrd HA operation counters
+
+All the HA operation counters will be:
+
+* Saved in the NPU side `COUNTER_DB`, since `hamgrd` runs on the NPU side.
+* Partitioned with an ENI level key: `DASH_HA_OP_STATS|<VDPU_ID>|<ENI_ID>`.
+
+| Name | Description |
+| --- | --- |
+| *_state_enter_(req/success/failure)_count | Number of state transitions we have done (requested / succeeded / failed). |
+| total_(successful/failed)_*_state_enter_time_in_us | The total time used to transition to a specific state, in microseconds. Successful and failed transitions need to be tracked separately, as they will have different patterns. |
+| switchover_(req/success/failure)_count | Similar to above, but for switchover operations. |
+| total_(successful/failed)_switchover_time_in_us | Similar to above, but for switchover operations. |
+| shutdown_standby_(req/success/failure)_count | Similar to above, but for shutdown standby operations. |
+| total_(successful/failed)_shutdown_standby_time_in_us | Similar to above, but for shutdown standby operations. |
+| shutdown_self_(req/success/failure)_count | Similar to above, but for force shutdown operations. |
+| total_(successful/failed)_shutdown_self_time_in_us | Similar to above, but for force shutdown operations. |
+
+#### 2.2.2. HA SAI API counters
+
+All the HA SAI API counters will be:
+
+* Saved in the DPU side `DASH_COUNTER_DB`, as SAI APIs are called in the DPU side syncd.
+* Partitioned with an ENI level key: `DASH_SAI_CALL_STATS|<ENI_ID>`.
+
+| Name | Description |
+| --- | --- |
+| *_(req/success/failure)_count | Number of SAI APIs we call or notifications we handle, with success and failure counters too. |
+| total_*_(successful/failed)_time_in_us | Total time used for the SAI operations, in microseconds. Successful and failed operations should be tracked separately, as they will have different patterns. |
+
+### 2.3. HA control plane communication channel related
+
+#### 2.3.1. HA control plane control channel counters
+
+The HA control plane control channel runs on the NPU side and is mainly used for passing the HA control commands.
+
+The counters of this channel will be:
+
+* Collected by `hamgrd` on the NPU side.
+* Saved in the NPU side `DASH_COUNTER_DB`.
+* Stored with key: `DASH_HA_CP_CONTROL_CHANNEL_STATS|`.
+  * These counters don't need to be partitioned on a single switch, because the channel is shared by all ENIs.
+
+| Name | Description |
+| --- | --- |
+| is_alive | Is the channel alive for use. 0 = dead, 1 = alive. |
+| channel_connect_count | Number of connect calls for establishing the data channel. |
+| channel_connect_succeeded_count | Number of connect calls that succeeded. |
+| channel_connect_failed_count | Number of connect calls that failed for any reason other than timeout / unreachable. |
+| channel_connect_timeout_count | Number of connect calls that failed due to timeout / unreachable. |
+
+#### 2.3.2. HA control plane data channel counters
+
+The HA control plane data channel is composed of 2 parts: the SAI flow API calls and `swbusd` for flow forwarding. The first one is already covered by the SAI API counters above, so we will only focus on the `swbusd` part here. The counters will be:
+
+* Collected by `swbusd` on the NPU side.
+* Saved in the NPU side `DASH_COUNTER_DB`.
+
+##### 2.3.2.1. Per bulk sync flow receive server counters
+
+Since the data channel is formed by multiple flow receive servers, the data plane counters need to be logged per server: `DASH_HA_CP_DATA_CHANNEL_CONN_STATS|`.
+
+| Name | Description |
+| --- | --- |
+| is_alive | Is the channel alive for use. 0 = dead, 1 = alive. |
+| channel_connect_count | Number of connect calls for establishing the data channel. |
+| channel_connect_succeeded_count | Number of connect calls that succeeded. |
+| channel_connect_failed_count | Number of connect calls that failed for any reason other than timeout / unreachable. |
+| channel_connect_timeout_count | Number of connect calls that failed due to timeout / unreachable. |
+| bulk_sync_message_sent/received | Number of messages we send or receive for bulk sync via the data channel. |
+| bulk_sync_message_size_sent/received | The total size of messages we send or receive for bulk sync via the data channel. |
+| bulk_sync_flow_received_from_local | Number of flows received from the local DPU. |
+| bulk_sync_flow_forwarded_to_peer | Number of flows forwarded to the paired DPU. |
+
+##### 2.3.2.2. Per ENI counters
+
+Besides the per flow receive server counters, these should also be tracked on the ENI level, so we can have a more aggregated view for each ENI. The key can be: `DASH_HA_CP_DATA_CHANNEL_ENI_STATS|`.
+
+| Name | Description |
+| --- | --- |
+| bulk_sync_message_sent/received | Number of messages we send or receive for bulk sync via the data channel. |
+| bulk_sync_message_size_sent/received | The total size of messages we send or receive for bulk sync via the data channel. |
+| bulk_sync_flow_received_from_local | Number of flows received from the local DPU. |
+| bulk_sync_flow_forwarded_to_peer | Number of flows forwarded to the paired DPU. |
+
+> NOTE: We didn't add the ENI key in the per flow receive server counters, because multiple ENIs can share the same flow receive server. It is up to each vendor's implementation.
+
+### 2.4. NPU-to-DPU tunnel related (NPU side)
+
+The second part of HA is the NPU-to-DPU tunnel. This includes the probe status and traffic information on the tunnel.
+
+#### 2.4.1. NPU-to-DPU probe status
+
+The latest probe status is critical for checking how each card and ENI performs, and where the packets should be forwarded to.
+
+Please refer to the [DPU/vDPU HA state](#1241-dpu--vdpu-ha-states) tables in the NPU DB for the detailed DB schema design.
+
+#### 2.4.2. NPU-to-DPU data plane state
+
+Depending on the probe status and HA state, we will update the next hop for each ENI to forward the traffic. This also needs to be tracked.
+
+Please refer to the [DASH_ENI_DP_STATE_TABLE](#1242-eni-state) table in the NPU DB for the detailed DB schema design.
+
+#### 2.4.3. NPU-to-DPU tunnel counters
+
+On the NPU side, we should also have ENI level tunnel traffic counters:
+
+* Collected on the NPU side via SAI.
+* Saved in the NPU side `COUNTER_DB`.
+* Partitioned into ENI level with key: `DASH_HA_NPU_TO_ENI_TUNNEL_STATS|`.
+
+| Name | Description |
+| --- | --- |
+| packets_in/out | Number of packets received / sent. |
+| bytes_in/out | Total bytes received / sent. |
+| packets_discards_in/out | Number of incoming/outgoing packets that are discarded. |
+| packets_error_in/out | Number of incoming/outgoing packets with errors, e.g. CRC errors. |
+| packets_oversize_in/out | Number of incoming/outgoing packets that exceed the MTU. |
+
+> NOTE: In implementation, these counters might have a more SAI-friendly name.
+
+### 2.5. NPU-to-DPU tunnel related (DPU side)
+
+On the DPU side, the NPU-to-DPU tunnel traffic needs to be tracked on the ENI level as well:
+
+* Collected on the DPU side via SAI.
+* Saved in the DPU side `COUNTER_DB`.
+* Partitioned into ENI level with key: `DASH_HA_NPU_TO_ENI_TUNNEL_STATS|`.
+
+| Name | Description |
+| --- | --- |
+| packets_in/out | Number of packets received / sent. |
+| bytes_in/out | Total bytes received / sent. |
+| packets_discards_in/out | Number of incoming/outgoing packets that are discarded. |
+| packets_error_in/out | Number of incoming/outgoing packets with errors, e.g. CRC errors. |
+| packets_oversize_in/out | Number of incoming/outgoing packets that exceed the MTU. |
+
+> NOTE: In implementation, these counters might have a more SAI-friendly name.
+
+### 2.6. DPU-to-DPU data plane channel related
+
+The next part is the DPU-to-DPU data plane channel, which is used for inline flow replication.
+
+* Collected on the DPU side via SAI.
+* Saved in the DPU side `COUNTER_DB`.
+* Partitioned into ENI level with key: `DASH_HA_DPU_DATA_PLANE_STATS|`.
+
+| Name | Description |
+| --- | --- |
+| inline_sync_packet_in/out | Number of inline sync packets received / sent. |
+| inline_sync_ack_packet_in/out | Number of inline sync ack packets received / sent. |
+| meta_sync_packet_in/out | Number of metadata sync packets (generated by the DPU) received / sent. This is for flow sync packets of flow aging, etc. |
+| meta_sync_ack_packet_in/out | Number of metadata sync ack packets received / sent. This is for flow sync packets of flow aging, etc. |
+| probe_packet_in/out | Number of probe packets received from or sent to the paired ENI on the other DPU. This data is for the DPU-to-DPU data plane liveness probe. |
+| probe_packet_ack_in/out | Number of probe ack packets received from or sent to the paired ENI on the other DPU. This data is for the DPU-to-DPU data plane liveness probe. |
+
+> NOTE: In implementation, these counters might have a more SAI-friendly name.
+
+### 2.7. DPU ENI pipeline related
+
+The last part is how the DPU ENI pipeline works in terms of HA, which includes flow operations:
+
+* Collected on the DPU side via SAI.
+* Saved in the DPU side `COUNTER_DB`.
+* Partitioned into ENI level with key: `DASH_HA_DPU_PIPELINE_STATS|`.
+
+| Name | Description |
+| --- | --- |
+| flow_(creation/update/deletion)_count | Number of inline flow creation/update/deletion requests that failed for any reason, e.g. not enough memory, updating a non-existing flow, deleting a non-existing flow. |
+| inline_flow_(creation/update/deletion)_req_sent | Number of inline flow creation/update/deletion requests sent from the active node. Flow resimulation will be covered in flow update requests. |
+| inline_flow_(creation/update/deletion)_req_received | Number of inline flow creation/update/deletion requests received on the standby node. |
+| inline_flow_(creation/update/deletion)_req_succeeded | Number of inline flow creation/update/deletion requests that succeeded (ack received). |
+| flow_creation_conflict_count | Number of inline replicated flows that conflict with existing flows (the flow already exists and the action is different). |
+| flow_aging_req_sent | Number of flows that aged out on the active side and are replicated to the standby. |
+| flow_aging_req_received | Number of flow aging requests received from the active side. Requests can be batched, but in this counter 1 request = 1 flow. |
+| flow_aging_req_succeeded | Number of flow aging requests that succeeded (ack received). |
+
+Please note that we will also have counters for how many flows are created/updated/deleted (succeeded or failed), aged out or resimulated, but this is not in the scope of HA, hence omitted here.
+
+> NOTE: In implementation, these counters might have a more SAI-friendly name.
+
+## 3. SAI APIs
+
+Please refer to the HA session API and flow API HLDs in the DASH repo for the SAI API designs.
+
+## 4. CLI commands
+
+The following commands shall be added to the CLI for checking the HA config and states:
+
+* `show dash ha config`: Show the HA global configuration.
+* `show dash eni ha config`: Show the ENI level HA configuration.
+* `show dash eni ha status`: Show the ENI level HA status.
+* `show dash eni ha dp-status`: Show the ENI level data path status.
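+
+As a minimal sketch (not the final implementation), a `show dash eni ha status` style command could read the ENI state table through the generic `redis` Python client; the DB index, host and example IDs below are illustrative assumptions:
+
+```python
+import redis
+
+def get_eni_ha_status(db: redis.Redis, vdpu_id: str, eni_id: str) -> dict:
+    """Read one DASH_ENI_HA_STATE_TABLE entry from the NPU side DASH state DB."""
+    key = f"DASH_ENI_HA_STATE_TABLE|{vdpu_id}|{eni_id}"
+    # Each table entry is stored as a Redis hash; decode bytes for display.
+    return {k.decode(): v.decode() for k, v in db.hgetall(key).items()}
+
+# Example usage: print the local and peer HA states of one ENI.
+db = redis.Redis(host="127.0.0.1", port=6379, db=0)
+status = get_eni_ha_status(db, "vdpu0", "eni0")
+print(status.get("local_ha_state"), status.get("peer_ha_state"))
+```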
diff --git a/doc/smart-switch/high-availability/smart-switch-ha-hld.md b/doc/smart-switch/high-availability/smart-switch-ha-hld.md
new file mode 100644
index 00000000000..3c0d204e4e1
--- /dev/null
+++ b/doc/smart-switch/high-availability/smart-switch-ha-hld.md
@@ -0,0 +1,1915 @@
+# SmartSwitch High Availability High Level Design
+
+| Rev | Date | Author | Change Description |
+| --- | ---- | ------ | ------------------ |
+| 0.1 | 08/09/2023 | Riff Jiang | Initial version |
+| 0.2 | 08/10/2023 | Riff Jiang | Simplified ENI-level traffic control, primary election algorithm |
+| 0.3 | 08/14/2023 | Riff Jiang | Added DPU level standalone support |
+| 0.4 | 08/17/2023 | Riff Jiang | Redesigned HA control plane data channel |
+| 0.5 | 10/14/2023 | Riff Jiang | Merged resource placement and topology section and moved detailed design out for better readability |
+| 0.6 | 10/22/2023 | Riff Jiang | Added ENI leak detection |
+
+1. [1. Background](#1-background)
+2. [2. Terminology](#2-terminology)
+3. [3. Requirements, Assumptions and SLA](#3-requirements-assumptions-and-sla)
+   1. [3.1. Goals](#31-goals)
+   2. [3.2. Non-goals](#32-non-goals)
+   3. [3.3. Assumptions](#33-assumptions)
+4. [4. Network Physical Topology](#4-network-physical-topology)
+   1. [4.1. ENI placement](#41-eni-placement)
+   2. [4.2. Data path HA](#42-data-path-ha)
+      1. [4.2.1. ENI-level NPU-to-DPU traffic forwarding](#421-eni-level-npu-to-dpu-traffic-forwarding)
+         1. [4.2.1.1. Forwarding packet to local DPU](#4211-forwarding-packet-to-local-dpu)
+         2. [4.2.1.2. Forwarding packet to remote DPU](#4212-forwarding-packet-to-remote-dpu)
+   3. [4.3. Flow HA](#43-flow-ha)
+      1. [4.3.1. Active-standby pair for Flow HA](#431-active-standby-pair-for-flow-ha)
+      2. [4.3.2. Card-level ENI pair placement](#432-card-level-eni-pair-placement)
+      3. [4.3.3. ENI-level HA scope](#433-eni-level-ha-scope)
+      4. [4.3.4. ENI update domain (UD) / fault domain (FD) handling](#434-eni-update-domain-ud--fault-domain-fd-handling)
+      5. [4.3.5. DPU to DPU communication for flow HA](#435-dpu-to-dpu-communication-for-flow-ha)
+         1. [4.3.5.1. Standby to active DPU tunnel](#4351-standby-to-active-dpu-tunnel)
+         2. [4.3.5.2. DPU-to-DPU data plane channel](#4352-dpu-to-dpu-data-plane-channel)
+            1. [4.3.5.2.1. Multi-path data plane channel](#43521-multi-path-data-plane-channel)
+            2. [4.3.5.2.2. (Optional) Multi-path data plane availability tracking](#43522-optional-multi-path-data-plane-availability-tracking)
+   4. [4.4. VM to DPU data plane](#44-vm-to-dpu-data-plane)
+5. [5. ENI programming with HA setup](#5-eni-programming-with-ha-setup)
+   1. [5.1. Working with upstream service](#51-working-with-upstream-service)
+   2. [5.2. HA control plane overview](#52-ha-control-plane-overview)
+      1. [5.2.1. HA control plane components](#521-ha-control-plane-components)
+         1. [5.2.1.1. ha container](#5211-ha-container)
+         2. [5.2.1.2. swbusd](#5212-swbusd)
+         3. [5.2.1.3. hamgrd](#5213-hamgrd)
+      2. [5.2.2. HA Control Plane Channels](#522-ha-control-plane-channels)
+         1. [5.2.2.1. HA control plane control channel](#5221-ha-control-plane-control-channel)
+         2. [5.2.2.2. HA control plane data channel](#5222-ha-control-plane-data-channel)
+         3. [5.2.2.3. HA control plane channel data path HA](#5223-ha-control-plane-channel-data-path-ha)
+   3. [5.3. ENI creation](#53-eni-creation)
+   4. [5.4. ENI programming](#54-eni-programming)
+   5. [5.5. ENI removal](#55-eni-removal)
+6. [6. DPU liveness detection](#6-dpu-liveness-detection)
+   1. [6.1. 
Card level NPU-to-DPU liveness probe](#61-card-level-npu-to-dpu-liveness-probe) + 1. [6.1.1. BFD support on DPU](#611-bfd-support-on-dpu) + 1. [6.1.1.1. Lite-BFD server](#6111-lite-bfd-server) + 2. [6.1.1.2. BFD initialization on DPU boot](#6112-bfd-initialization-on-dpu-boot) + 3. [6.1.1.3. (Optional) Hardware BFD on DPU](#6113-optional-hardware-bfd-on-dpu) + 2. [6.1.2. BFD probing local DPU](#612-bfd-probing-local-dpu) + 3. [6.1.3. BFD probing remote DPU](#613-bfd-probing-remote-dpu) + 2. [6.2. ENI level NPU-to-DPU traffic control](#62-eni-level-npu-to-dpu-traffic-control) + 1. [6.2.1. Traffic control state update channel](#621-traffic-control-state-update-channel) + 2. [6.2.2. Traffic control mechanism](#622-traffic-control-mechanism) + 3. [6.3. DPU-to-DPU liveness probe](#63-dpu-to-dpu-liveness-probe) + 1. [6.3.1. Card-level and ENI-level data plane failure](#631-card-level-and-eni-level-data-plane-failure) + 2. [6.3.2. ENI-level pipeline failure](#632-eni-level-pipeline-failure) + 4. [6.4. Probe state pinning](#64-probe-state-pinning) + 1. [6.4.1. Pinning BFD probe](#641-pinning-bfd-probe) + 2. [6.4.2. Pinning ENI-level traffic control state](#642-pinning-eni-level-traffic-control-state) + 5. [6.5. All probe states altogether](#65-all-probe-states-altogether) +7. [7. HA state machine](#7-ha-state-machine) + 1. [7.1. HA state definition and behavior](#71-ha-state-definition-and-behavior) + 2. [7.2. State transition](#72-state-transition) + 3. [7.3. Primary election](#73-primary-election) + 4. [7.4. HA state persistence and rehydration](#74-ha-state-persistence-and-rehydration) +8. [8. Planned events and operations](#8-planned-events-and-operations) + 1. [8.1. Launch](#81-launch) + 1. [8.1.1. Clean launch on both sides](#811-clean-launch-on-both-sides) + 2. [8.1.2. Launch with standalone peer](#812-launch-with-standalone-peer) + 3. [8.1.3. Launch with no peer](#813-launch-with-no-peer) + 2. [8.2. Planned switchover](#82-planned-switchover) + 1. [8.2.1. Workflow](#821-workflow) + 2. [8.2.2. Packet racing during switchover](#822-packet-racing-during-switchover) + 3. [8.2.3. Failure handling during switchover](#823-failure-handling-during-switchover) + 3. [8.3. Planned shutdown](#83-planned-shutdown) + 1. [8.3.1. Planned shutdown standby node](#831-planned-shutdown-standby-node) + 2. [8.3.2. Planned shutdown active node](#832-planned-shutdown-active-node) + 4. [8.4. ENI-level DPU isolation / Standalone pinning](#84-eni-level-dpu-isolation--standalone-pinning) + 5. [8.5. ENI migration](#85-eni-migration) + 1. [8.5.1. Card-level ENI migration](#851-card-level-eni-migration) + 2. [8.5.2. Migrating active ENI](#852-migrating-active-eni) + 3. [8.5.3. Migrating standalone ENI](#853-migrating-standalone-eni) + 4. [8.5.4. Migrating ENI when upstream service cannot reach one side of switches](#854-migrating-eni-when-upstream-service-cannot-reach-one-side-of-switches) + 5. [8.5.5. Moving towards ENI-level ENI migration](#855-moving-towards-eni-level-eni-migration) +9. [9. Unplanned events](#9-unplanned-events) + 1. [9.1. Unplanned network failure](#91-unplanned-network-failure) + 1. [9.1.1. Upstream service channel failure](#911-upstream-service-channel-failure) + 1. [9.1.1.1. One side switch is not reachable](#9111-one-side-switch-is-not-reachable) + 2. [9.1.1.2. Both side switches are not reachable](#9112-both-side-switches-are-not-reachable) + 2. [9.1.2. HA control plane control channel failure](#912-ha-control-plane-control-channel-failure) + 3. [9.1.3. 
HA control plane data channel failure](#913-ha-control-plane-data-channel-failure) + 4. [9.1.4. Data plane channel failure](#914-data-plane-channel-failure) + 1. [9.1.4.1. Card level NPU-to-DPU probe failure](#9141-card-level-npu-to-dpu-probe-failure) + 2. [9.1.4.2. Data plane gray failure](#9142-data-plane-gray-failure) + 2. [9.2. Unplanned DPU failure](#92-unplanned-dpu-failure) + 1. [9.2.1. Syncd crash](#921-syncd-crash) + 2. [9.2.2. DPU hardware failure](#922-dpu-hardware-failure) + 3. [9.3. Unplanned NPU failure](#93-unplanned-npu-failure) + 1. [9.3.1. hamgrd crash](#931-hamgrd-crash) + 2. [9.3.2. Switch power down or kernel crash](#932-switch-power-down-or-kernel-crash) + 3. [9.3.3. Back panel port failure](#933-back-panel-port-failure) + 4. [9.3.4. Unplanned PCIe failure](#934-unplanned-pcie-failure) + 4. [9.4. Summary](#94-summary) +10. [10. Unplanned operations](#10-unplanned-operations) + 1. [10.1. Working as standalone setup](#101-working-as-standalone-setup) + 1. [10.1.1. Design considerations](#1011-design-considerations) + 1. [10.1.1.1. Standalone setup vs bulk sync](#10111-standalone-setup-vs-bulk-sync) + 2. [10.1.1.2. Card-level vs ENI-level standalone setup](#10112-card-level-vs-eni-level-standalone-setup) + 2. [10.1.2. Workflow triggers](#1012-workflow-triggers) + 3. [10.1.3. Determine desired standalone setup](#1013-determine-desired-standalone-setup) + 1. [10.1.3.1. Determine card-level standalone setup](#10131-determine-card-level-standalone-setup) + 2. [10.1.3.2. Determine ENI-level standalone setup](#10132-determine-eni-level-standalone-setup) + 4. [10.1.4. Entering standalone setup](#1014-entering-standalone-setup) + 1. [10.1.4.1. When peer is up](#10141-when-peer-is-up) + 2. [10.1.4.2. When peer is down](#10142-when-peer-is-down) + 5. [10.1.5. DPU restart in standalone setup](#1015-dpu-restart-in-standalone-setup) + 6. [10.1.6. Recovery from standalone setup](#1016-recovery-from-standalone-setup) + 7. [10.1.7. Force exit standalone (WIP)](#1017-force-exit-standalone-wip) + 2. [10.2. Force shutdown (WIP)](#102-force-shutdown-wip) +11. [11. Flow tracking and replication (Steady State)](#11-flow-tracking-and-replication-steady-state) + 1. [11.1. Flow lifetime management](#111-flow-lifetime-management) + 2. [11.2. Flow replication data path overview](#112-flow-replication-data-path-overview) + 1. [11.2.1. Data plane sync channel data path](#1121-data-plane-sync-channel-data-path) + 2. [11.2.2. Control plane sync channel data path](#1122-control-plane-sync-channel-data-path) + 3. [11.3. Flow creation and initial sync](#113-flow-creation-and-initial-sync) + 1. [11.3.1. Case 1: Flow creation for first packet and initial sync](#1131-case-1-flow-creation-for-first-packet-and-initial-sync) + 2. [11.3.2. Case 2: Flow recreation for non-first packet](#1132-case-2-flow-recreation-for-non-first-packet) + 1. [11.3.2.1. Challenge](#11321-challenge) + 2. [11.3.2.2. Solutions](#11322-solutions) + 1. [11.3.2.2.1. Missing connection-less flow](#113221-missing-connection-less-flow) + 2. [11.3.2.2.2. Missing connection flows](#113222-missing-connection-flows) + 4. [11.4. Flow destroy](#114-flow-destroy) + 1. [11.4.1. Flow destroy on explicit connection close](#1141-flow-destroy-on-explicit-connection-close) + 1. [11.4.1.1. Graceful connection termination with FIN packets](#11411-graceful-connection-termination-with-fin-packets) + 2. [11.4.1.2. Abrupt connection termination with RST packets](#11412-abrupt-connection-termination-with-rst-packets) + 2. [11.4.2. 
Flow destroy on flow aged out](#1142-flow-destroy-on-flow-aged-out)
+      3. [11.4.3. Flow destroy on request](#1143-flow-destroy-on-request)
+      4. [11.4.4. RST on flow destroy](#1144-rst-on-flow-destroy)
+         1. [11.4.4.1. Send RST when denying packets](#11441-send-rst-when-denying-packets)
+         2. [11.4.4.2. Bulk TCP seq syncing during flow aging](#11442-bulk-tcp-seq-syncing-during-flow-aging)
+         3. [11.4.4.3. Flow destroy on standby node](#11443-flow-destroy-on-standby-node)
+      5. [11.4.5. Packet racing on flow destroy](#1145-packet-racing-on-flow-destroy)
+   5. [11.5. Bulk sync](#115-bulk-sync)
+      1. [11.5.1. Perfect sync](#1151-perfect-sync)
+      2. [11.5.2. Range sync (ENI-level only)](#1152-range-sync-eni-level-only)
+         1. [11.5.2.1. Init phase](#11521-init-phase)
+         2. [11.5.2.2. Flow tracking in steady state](#11522-flow-tracking-in-steady-state)
+         3. [11.5.2.3. Tracking phase](#11523-tracking-phase)
+         4. [11.5.2.4. Syncing phase](#11524-syncing-phase)
+   6. [11.6. Flow re-simulation support](#116-flow-re-simulation-support)
+12. [12. Debuggability](#12-debuggability)
+   1. [12.1. ENI leak detection](#121-eni-leak-detection)
+13. [13. Detailed Design](#13-detailed-design)
+14. [14. Test Plan](#14-test-plan)
+
+## 1. Background
+
+To help disaggregate network functions, SmartSwitch provides a cost-effective and space-saving way to build SDN solutions, with low network latency. However, like all other network solutions, one of the biggest challenges we need to solve is how to provide a reliable network service.
+
+This doc covers High Availability and Scalability for SmartSwitch: requirements / SLA, the network physical topology, ENI allocation/placement, HA state management and transitions under planned and unplanned events, flow replication, and more.
+
+## 2. Terminology
+
+| Term | Explanation |
+| ---- | ----------- |
+| ENI / vNIC / vPort | Elastic Network Interface. It means a VM's NIC. |
+| T0/1/2 | Tier 0/1/2 switch. |
+| Session | A session represents a network connection between 2 sides of the network, for example, TCP or UDP. |
+| Flow | Each session has 2 flows in the network – a forwarding flow and a reverse flow for processing the network packets in both directions. |
+| HA | High Availability |
+| HA Set | The set of DPUs that are involved in providing high availability for a certain HA scope – switch, DPU or ENI. |
+| HA Scope | The minimum scope that failover can happen on, for example – the entire switch, the entire DPU or each ENI. Every HA scope tracks its own active instance, with the others being standby. |
+| HA Active | The active instance in the HA set. It is the only instance in the HA set that decides which packet transformation should be applied to every new flow. |
+| HA Standby | The standby instance in the HA set. It acts like a flow store and never makes any decisions. If any new connections happen to land on the standby instance, they will be tunneled to the active. |
+| DP | Data path. |
+| DP VIP | Data path Virtual IP. This is the VIP that SmartSwitch advertises via BGP for attracting traffic.<br/><br/>After enabling SmartSwitch, when a VM sends out traffic, it will be encap'ed with the destination IP set to the DP VIP. This makes all the VM traffic come to our SmartSwitch for packet processing and transformation. |
+| Upstream Service | The service that programs ENIs to smart switches, e.g., the SDN controller. |
+| UD / Update domain | It defines the groups of virtual machines and underlying physical hardware that can be rebooted at the same time. |
+| FD / Fault domain | It defines the group of virtual machines that share a common power source and network switch. |
+| SPOF | Single point of failure. |
+| FP port | Front panel port. The ports on the front panel of the switch. |
+| BP port | Back panel port. The ports that connect the NPU and DPU. |
+
+For more terminology, please refer to .
+
+## 3. Requirements, Assumptions and SLA
+
+### 3.1. Goals
+
+The SmartSwitch HA design tries to achieve the following goals:
+
+1. 0 downtime on planned switchover.
+2. <2 sec downtime on failover to standalone setup for each ENI.
+3. Ability to resume connections in the event of both planned and unplanned failover.
+4. After both planned and unplanned failover and recovery, the flows on all DPUs will be aligned.
+5. Only focus on the case in which 2 DPUs are in the HA set. Conceptually, we can have more than 2, but it is not in our current scope.
+6. No complicated logic or ECMP hashing requirement in other switches to ensure packets land on the right switch and DPU.
+7. If a switch receives a valid packet, it must not drop it due to flow replication delays.
+8. Ensure both inbound and outbound packets transit the same DPU for a given flow.
+
+### 3.2. Non-goals
+
+1. Handling data path failures perfectly is not one of the main goals of HA.
+   * Data plane failures can happen and affect our communications between the NPU and DPUs, such as link flaps or traffic blackholing, but these issues will impact all traffic, not just smart switch scenarios. They are monitored in other ways and also should be addressed in a more generic way.
+   * Of course, when a data path failure causes enough heartbeats or probes to be dropped, we will react to the change. But the goal is not to provide a mechanism to detect and handle all gray failures due to the ECMP setup. In short, if our probe says down, it is down, instead of 16% down.
+2. HA is not designed to provide perfect solutions for merging conflicting flows, but only to ensure the flows end up in a consistent state on all DPUs.
+   * This can happen in cases such as switchover, or when DPUs lose connection to each other and start to act as standalone. If the policies in these DPUs don't match, flow conflicts can happen when the connection later recovers.
+   * To resolve flow conflicts, HA will do its best effort, but it won't be able to solve the problem alone. In cases like planned switchovers, our upstream service must coordinate and make sure policies are aligned.
+3. More than 2 DPUs participating in a single HA set.
+
+### 3.3. Assumptions
+
+When designing HA for SmartSwitch, we have a few assumptions:
+
+1. Stable decision on packet transformation: The same packet will always receive the same packet transformation as long as the DASH policies stay the same. Changes, such as random port selection, are not allowed in the current pipeline.
+2. No delayed flow replication support: In [steady state](#43-flow-ha), when a flow is created, updated or deleted by a network packet, the change ***MUST*** be replicated to its pair by attaching the metadata to the original packet and forwarding the packet. This is called inline flow replication ([more described here](#11-flow-tracking-and-replication-steady-state)). We should never delay the flow replication, and before we receive the ack from the standby side, we should not consider the change applied.
+3. No warm boot support: Warm boot is used to keep the data path working while an upgrade happens. It is mostly used for upgrading switches that are considered a SPOF, as a gentler solution than a hard data plane down. However, it requires complicated coordination between many services on the NPU and DPU. Fortunately, since HA requires more than 1 switch to participate, we don't need to worry about the SPOF problem, hence we don't need to consider warm boot support in our design.
+
+## 4. Network Physical Topology
+
+> Today, in [SONiC DASH HLD](../../dash/dash-sonic-hld.md), a DASH pipeline is modeled as an ENI. Although all the discussion below is generically applicable to other scenarios, we will use ENI to refer to the DASH pipeline for simplicity.
+
+### 4.1. ENI placement
+
+To provide high availability for any network function / pipeline we model in the SmartSwitch, we must ensure the following redundancies are provided:
+
+* A single ENI **MUST** be placed on multiple DPUs forming a HA set. (Avoid single ENI or DPU failure)
+* Each ENI instance within a HA set **MUST** be placed on different DPUs which reside in different Tier-1 switches. (Avoid single DPU or switch failure)
+
+As we have multiple ENI instances forming a HA set, to properly define the impact radius for HA discussions, we need to make the resource allocation strategy clear.
+
+From a high level, the DASH pipeline placement is completely decided by our upstream service. SmartSwitch is not involved in the decision-making process at all.
+
+### 4.2. Data path HA
+
+When ensuring the data path high availability for SmartSwitch, we need to ensure the following to avoid a single link failure causing catastrophic impact:
+
+* Enough redundant links **MUST** be provided for the traffic to reach its target HA set.
+* Enough redundant links **MUST** be provided for all DPUs in the HA set to talk to each other.
+
+This drives us to model the SmartSwitches as our Tier-1 switches. The data path HA can then be achieved by the ECMP setup in the network, as the following diagram shows:
+

+
+#### 4.2.1. ENI-level NPU-to-DPU traffic forwarding
+
+To establish the data path for the DPU, the NPU will advertise the routes to the network via BGP, receive the traffic first, then forward the traffic to the corresponding DPU using ACL rules.
+
+At a high level, the NPU uses the data path VIP and the ENI MAC to find the ENI pipeline, and it always follows the logic below for traffic forwarding:
+
+* If the traffic is for a local ENI, and the ENI is **active**, it will forward the traffic to the local DPU by moving the packet to the DPU's ingress port.
+* If not, it will forward the traffic to one of the remote ***active*** DPUs as an ECMP group, using an NPU side VxLan tunnel.
+
+The data path is shown and explained below. This way, the DPU always receives traffic without any additional encap from the NPU, which speeds up packet processing.
+
+##### 4.2.1.1. Forwarding packet to local DPU
+
+Here is the data path when forwarding the packet to the local DPU. The packet will be forwarded by moving it to the DPU's ingress port directly, hence no extra encap will be added.
+

Forwarding packet to local DPU

+
+##### 4.2.1.2. Forwarding packet to remote DPU
+
+When forwarding the packet to a remote DPU, we will use an NPU side VxLan tunnel. The tunnel works as below at a high level:
+
+1. The tunneled packets ***MUST*** have a VxLan encap as the outer packet, with a specified NPU tunnel VNI number as the identifier.
+2. The outer packet ***MUST*** calculate the source port from the 5-tuple of the inner packet and fold it into a port range specified by SONiC.
+3. The outer packet ***MUST*** have the same DSCP fields as the inner packet.
+4. The tunnel ***MUST*** be terminated by the remote NPU, which then moves the inner packet to the remote DPU's ingress port.
+
+Here is what the data path looks like in detail:
+

Tunneling packet to remote DPU

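+
+To make the forwarding logic of section 4.2.1 concrete, here is a minimal Python sketch of the decision; the rule structure, field names and the CRC-based ECMP hash are illustrative assumptions, not the actual NPU implementation:
+
+```python
+import zlib
+from dataclasses import dataclass
+
+@dataclass
+class EniForwardingRule:
+    eni_is_local_and_active: bool
+    local_dpu_port: str            # NPU port facing the local DPU
+    remote_active_npu_ips: list    # tunnel endpoints for remote active DPUs
+
+def pick_next_hop(rule: EniForwardingRule, inner_five_tuple: bytes):
+    """Sketch of the per-ENI NPU forwarding decision from section 4.2.1."""
+    if rule.eni_is_local_and_active:
+        # Local active ENI: move the packet to the DPU ingress port, no encap.
+        return ("local-port", rule.local_dpu_port)
+    # Otherwise tunnel to one of the remote active DPUs, chosen ECMP-style
+    # by hashing the inner packet's 5-tuple.
+    idx = zlib.crc32(inner_five_tuple) % len(rule.remote_active_npu_ips)
+    return ("vxlan-tunnel", rule.remote_active_npu_ips[idx])
+```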
+
+### 4.3. Flow HA
+
+Besides data path HA, SmartSwitch also supports Flow HA to provide high availability for the flows - when one DPU or switch is down, we won't lose the flows permanently.
+
+#### 4.3.1. Active-standby pair for Flow HA
+
+To provide HA functionality, we are going to use an Active-Standby setup. In this setup:
+
+* One side will act as active and make flow decisions, while the non-active node acts as pure backup flow storage.
+* When packets land on the active DPU and create a new flow, the new flow will be replicated to the standby DPU inline (see the sketch after the diagram below).
+* When packets land on the standby side, they will be tunneled to the active DPU by the NPU (or the DPU as fallback).
+* When the active DPU runs into issues, we will fail over, make the standby the new active, and switch over the traffic to avoid further impact.
+

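+
+As a minimal sketch of the inline replication rule above (and of assumption 2 in section 3.3), the active side only treats a new flow as applied, and only forwards the packet, after the standby acks the replicated flow. All names and hooks below are hypothetical:
+
+```python
+from dataclasses import dataclass
+from typing import Callable
+
+@dataclass
+class Flow:
+    key: tuple   # the 5-tuple
+    action: str  # the packet transformation decided by the active side
+
+def handle_first_packet(
+    pkt_key: tuple,
+    decide: Callable[[tuple], Flow],               # DASH pipeline decision
+    replicate_to_standby: Callable[[Flow], bool],  # inline sync, returns ack
+    forward: Callable[[tuple, Flow], None],
+    flow_table: dict,
+) -> None:
+    flow = decide(pkt_key)
+    if not replicate_to_standby(flow):
+        # No ack from the standby yet: the change is not considered applied
+        # and the packet is not forwarded; a retransmit retries the sequence.
+        return
+    flow_table[pkt_key] = flow
+    forward(pkt_key, flow)
+```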
+ +#### 4.3.2. Card-level ENI pair placement + +How to place the ENIs is critical for defining the impact radius. Currently, we are going with card level pairing. It works as below: + +* Whenever we want to move the ENI, all ENIs on the card will always be moved together. (See "[ENI migration](#85-eni-migration)" for more details.) +* When one card fails, all ENIs on that card will failover to the paired card. +* Different DPUs on the same switch can be paired with any other DPU on any other switches. + +This allows us to spread the load to all switches. And taking down a switch will not cause traffic to be shifted to another single switch, as it can be spread on multiple DPUs in the T1 set. + +

ENI pair placement

+
+#### 4.3.3. ENI-level HA scope
+
+Although we are using card-level ENI pair placement, each ENI pair has its own active instance. In other words, the HA scope is the ENI level. This means that when failover happens, we only move the active for that single ENI to the standby.
+
+This helps us spread the traffic load and flow creation cost between the paired DPUs - instead of having the active DPU handle all the traffic, each DPU can have half of the active ENIs and handle half of the traffic. Especially when traffic is skewed, we can fail over some active ENIs to avoid one DPU being too busy.
+

ENI pair placement

+
+#### 4.3.4. ENI update domain (UD) / fault domain (FD) handling
+
+Many customer tenants have restrictions on how many nodes can be impacted when an upgrade or failure happens. Today in Azure, this is controlled by update domains and fault domains (see the "[Availability set overview](https://learn.microsoft.com/en-us/azure/virtual-machines/availability-set-overview)" doc).
+
+* An update domain defines the groups of virtual machines and underlying physical hardware that can be rebooted at the same time.
+* A fault domain defines the group of virtual machines that share a common power source and network switch.
+
+Since SmartSwitch is a shared network service, one single DPU could host multiple ENIs from the same UD or FD. With the ENI-level setup, these restrictions can also be supported by simply batching the ENI failovers based on the UD and FD information when upgrades or other maintenance work happen.
+
+#### 4.3.5. DPU to DPU communication for flow HA
+
+##### 4.3.5.1. Standby to active DPU tunnel
+
+In steady state, the NPU tunnels the ENI traffic to the active side. However, we cannot maintain strong consistency between NPU and DPU updates. So, in certain transient cases, the standby ENI might receive traffic from the NPU, and we should tunnel it over to the active side.
+
+Since the standby DPU knows the IP of the active DPU, this can be achieved easily by changing the outermost encap destination IP from the SmartSwitch data path VIP to the active DPU PA:
+

Tunneling packet between DPUs

+
+Now, let's add the NPU-to-NPU tunnels to the picture as well. This is what the data path looks like with all tunnels together:
+

Tunneling packet between DPUs

+
+##### 4.3.5.2. DPU-to-DPU data plane channel
+
+The DPU-to-DPU data plane channel is essentially a UDP-based tunnel between DPUs, implemented below SAI, used for inline flow replication or any other network-triggered inline data sync. Inline flow replication is triggered by a network packet: whenever a packet creates a new flow, this flow needs to be replicated to the standby side before the packet is sent to the destination.
+
+Conceptually, this channel works as below at a high level:
+
+1. The outer packet ***MUST*** use the "DP channel port" specified via SONiC.
+2. The outer packet ***MUST*** calculate the source port from the 5-tuple of the inner packet and fold it into a port range specified by SONiC.
+3. The outer packet ***MUST*** have the same DSCP fields as the inner packet.
+4. Different vendors ***CAN*** have their own definition of the metadata format as the payload.
+
+Here is an example from "[Flow replication data path overview – Data plane sync data path](#1121-data-plane-sync-channel-data-path)":
+

Flow replication data plane sync data path

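+
+Here is a minimal sketch of the source-port folding described in rule 2 above; the hash choice and example range are illustrative assumptions (any stable hash of the inner 5-tuple works):
+
+```python
+import hashlib
+
+def fold_src_port(inner_five_tuple: tuple, src_port_min: int, src_port_max: int) -> int:
+    """Fold a stable hash of the inner packet's 5-tuple into the
+    SONiC-specified outer source port range."""
+    digest = hashlib.sha256(repr(inner_five_tuple).encode()).digest()
+    h = int.from_bytes(digest[:4], "big")
+    return src_port_min + h % (src_port_max - src_port_min + 1)
+
+# All packets of one flow pick the same outer source port, so one flow sticks
+# to one ECMP path, while different flows spread across all possible paths.
+port = fold_src_port(("10.0.0.1", 1234, "10.0.0.2", 443, "udp"), 32768, 33791)
+```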
+
+###### 4.3.5.2.1. Multi-path data plane channel
+
+Since the data plane channel is inline, it is very important to make sure that whenever a gray network failure happens (traffic being partially blackholed), we avoid having the traffic for the entire ENI dropped.
+
+To provide HA for the DPU-to-DPU communication, we can also leverage our ECMP network setup. When sending the tunneled packet, we must calculate the source port from the 5-tuple of the inner packet and fold it into a port range specified by SONiC. This helps us spread the packets across all possible paths and avoid the blackhole problem.
+
+###### 4.3.5.2.2. (Optional) Multi-path data plane availability tracking
+
+Multi-path communication is great, but it also introduces gray failures. For example, flow replication fails 10% of the time instead of 100% of the time.
+
+To help us further avoid link problems, DPUs could track the failures of each possible path and avoid using a failing path when possible. This can be done by active probing (see "[DPU-to-DPU liveness probe](#63-dpu-to-dpu-liveness-probe)") or by tracking the reachability of each data path by checking the responses of data plane channel requests, e.g. flow replication.
+
+Here is an example of a possible implementation based on source port rotation:
+
+1. Since the data plane channel packet has a VxLan encap as its outermost header, and the network routes the packet by the outermost header too, the data path entropy is determined only by the source port of the outermost packet (the protocol, destination IP/port and source IP are all determined already).
+2. To leverage the ECMP setup, a range of source ports will be specified for us to use in the VxLan encap. Each flow can pick a source port from this range by hashing the 5-tuple of the inner packet.
+3. The flow replication packet will be sent out with the selected source port in the outermost packet, as well as in the metadata. On the other side, the ack packet can use the same source port too. This binds the packets for this flow to a single data path.
+4. If one of the links goes down, all the packets with certain source ports will start to fail. Hence, if we track the failures for each port in the pool, we can avoid using it.
+5. The DPU-to-DPU probes should also use the same ports in the range, in order to detect the data path problems. This also helps us determine whether a specific data path has recovered, so we can resume using that port again.
+
+### 4.4. VM to DPU data plane
+
+Again, it is very important to avoid completely killing the VM's connectivity if one link is blackholing traffic. Hence, when the VM sends traffic to the DPU, we should also make it leverage the ECMP setup.
+
+When the VM sends a packet out, the packet will be encap'ed with VxLan (see below). Hence, similar to the DPU-to-DPU data plane channel, we can also set the VxLan source port by calculating the hash of the inner packet. Then, different flows will go through different data paths to reach the SmartSwitch.
+

VM to DPU data plane

+
+## 5. ENI programming with HA setup
+
+After HA is enabled, instead of programming a single DPU for a single ENI, we now have to program 2 DPUs for a single ENI, so we need to define the behavior of SmartSwitch and its upstream service with the HA setup.
+
+### 5.1. Working with upstream service
+
+In SmartSwitch, the upstream service and the switches work together to provide the HA features.
+
+Their responsibilities are defined as below (HA-related only):
+
+1. The upstream service is responsible for:
+   1. Deciding ENI placement and pairing.
+   2. Deciding the preferred active ENI placement.
+   3. Triggering HA operations under planned events, such as planned switchover, planned shutdown for upgrades, evening out traffic, etc.
+   4. Triggering HA operations for manual live site mitigations.
+2. SmartSwitch is responsible for:
+   1. Driving the HA state machine transitions.
+   2. Reporting every ENI HA state change and its reasons, so the upstream service knows what is happening and can make decisions for planned events.
+   3. Handling HA-related requests from the upstream service.
+   4. Monitoring and handling unplanned events, and triggering defined mitigations, such as driving to standalone setup.
+
+> It is very important to enable SmartSwitch to drive the HA state machine transitions on its own. Under unplanned events such as network failures, we could lose connectivity to the upstream service. If we could not drive the state machines or make decisions, we would lose the ability to handle the failures.
+
+### 5.2. HA control plane overview
+
+Here is the overview of the HA control plane; we will walk through the components and communication channels in the following sections.
+

HA control plane overview

+
+#### 5.2.1. HA control plane components
+
+##### 5.2.1.1. ha container
+
+To support HA, we will need to add new programs that communicate between multiple smart switches, manage the HA state machine transitions, or sync other data, such as flow info. These programs will be running in a new container called `ha`.
+
+This container will be running on the NPU side, so when the DPU is down, we can still drive the HA state machine transitions properly and notify our upstream service about any state change.
+
+##### 5.2.1.2. swbusd
+
+`swbusd` is used to establish the connections between the switches:
+
+* `swbusd` will be running in the `ha` container.
+* `swbusd` will be responsible for establishing the HA control plane control channel and data channel between the switches.
+* `swbusd` will be responsible for running a gRPC server to receive and route the requests coming from `hamgrd` to the right switch.
+* `swbusd` will be responsible for running a gRPC server to receive flow info from the DPU ASIC or ARM cores directly, and route the flow info to the right switch.
+
+##### 5.2.1.3. hamgrd
+
+`hamgrd` is used to manage the HA state machine and state transitions:
+
+* `hamgrd` will be running in the `ha` container.
+* `hamgrd` will communicate with:
+  * Redis for:
+    * Learning HA-related setups, such as peer info and which switches will be involved in forwarding traffic.
+    * Receiving config updates from the upstream service, so we can initiate HA operations for planned events or live site mitigations, such as switchover.
+    * Notifying swss to control things like BFD probes, ENI traffic forwarding, etc.
+  * The gNMI agent for sending state change notifications, such as HA state change notifications.
+  * Syncd on the DPUs for calling SAI APIs to help with HA state transitions.
+
+#### 5.2.2. HA Control Plane Channels
+
+There are 2 channels in the HA control plane:
+
+* **Control Plane Control Channel** (Red channel above): This channel is used for transferring HA control messages, e.g. messages related to HA state machine transitions. This channel can be implemented with gRPC.
+* **Control Plane Data Channel** (Purple channel above): This channel is used for doing heavy data transfer between DPUs, such as bulk sync. This channel can also be implemented with gRPC.
+
+##### 5.2.2.1. HA control plane control channel
+
+The control plane control channel is used by the DPUs to talk to each other for things like state machine management. Implementation-wise, we will need a new service for HA communication and management.
+
+Having our own control channel is important. If we consider the switches as the data plane and our upstream services as the control plane, then the data plane should be able to operate on its own when the control plane is disconnected. For example, when the upstream service is down due to network problems or other reasons, we should still be able to react and adjust our services. Coupling the smart switch with our upstream service in the HA state machine transitions could cause us to fail to react to [unplanned events](#9-unplanned-events).
+
+The DPU state machine management will run on the NPU. Since each NPU manages multiple DPUs, all NPUs need to connect to all other NPUs, which forms a full mesh.
+
+The data path of this channel will look like below:
+

HA control plane control channel data path

+
+Since the DPU events are still coming from the DPU side SAI/syncd, we will need to pass these events to the NPU side via our internal service communication channel, which goes through the PCIe bus (see [HA control plane overview](#52-ha-control-plane-overview) for more details).
+
+##### 5.2.2.2. HA control plane data channel
+
+The data channel is used for transferring large chunks of data between DPUs, e.g. [flow bulk sync](#115-bulk-sync). Since control messages and data sync go through different channels, this helps us avoid head-of-line blocking for control messages, which ensures the HA control plane and state machine transitions are always responsive.
+
+Because we expect large chunks of data to be transferred, this channel is designed to:
+
+* Avoid using the PCIe bus.
+* Minimize the overhead on the DPU, due to the limited compute resources on the DPU.
+
+The data channel is designed to have 2 parts: from the DPU ASIC / ARM core to `swbusd`, and from `swbusd` to the other `swbusd`. This channel is established with the steps below:
+
+* During DPU initialization, SONiC will get how many channels are needed for bulk sync via a SAI API.
+* Once the number is returned, it will be forwarded to the local `swbusd` along with its pairing information to establish the data channels between the `swbusd` instances.
+* When bulk sync starts,
+  * SONiC will call the get flow SAI API with all channel information passed as SAI attributes, such as the gRPC server addresses.
+  * The DPU will send the flow info directly to `swbusd`, which forwards it to its paired DPU.
+  * The flow info format will be defined publicly as part of the SAI APIs, so we can also directly call the SAI APIs to get the flow info as well.
+
+##### 5.2.2.3. HA control plane channel data path HA
+
+Although it might be counter-intuitive, we can consider that we will always get a working connection within a bounded time for all HA control plane channels. This is achieved by leveraging our ECMP setup on the T1s.
+
+The approach is simply brute force: as long as we try a large enough number of connections with different source ports, we can get the channel recovered.
+
+This approach works because, with our current setup, it is extremely unlikely that 2 switches cannot communicate with each other.
+
+* If the probability of 1 link failing is $p$, and we have $N$ possible links between each pair of T1s, then a single path fails with probability $1 - (1 - p)^2 = 2p - p^2$ (either of its 2 links can fail), and the probability of the 2 switches failing to talk to each other will be $(2p - p^2)^{N}$.
+* Let's say the downtime of each link within a year is 1% and we have 10 links; then the probability will be $(0.02 - 0.0001)^{10} = 0.0199^{10} \approx 9.73 * 10^{-18}$. This equals roughly $3 * 10^{-10}$ seconds of downtime per year.
+
+Additionally, to reduce the time of finding a good path and meet our SLA requirement, we can also try to use multiple different source ports to establish the new connection in parallel, and choose whichever one works first.
+
+To summarize, as long as the peer switch NPU is running, with this approach we can basically consider that we can always get a working connection to our peer, which can be used to simplify our design.
+
+### 5.3. ENI creation
+
+First, we need to create the ENI on all DPUs:
+
+1. The upstream service will first decide where to put the new ENI and form the HA set, based on the load and other information.
+2. The upstream service calls the northbound interface and programs the following things on each SmartSwitch independently:
+   1. Program the HA set information and ENI placement information to all the switches that will receive the traffic.
+      1. This will make `hamgrd` set up the traffic forwarding ACLs on the NPU side.
+      2. This will make all the `hamgrd` instances connect to the `hamgrd` that owns the ENI and register the probe listener.
+         1. Whenever a probe listener is registered, the latest probe state will be pushed to make it in sync.
+         2. If the probe listener is lost, it means either the ENI placement information was removed or `hamgrd` crashed. For the former case, there won't be any problem. For the latter case, please see the "[`hamgrd` crash](#931-hamgrd-crash)" section for how to handle it.
+      3. If the ENI is placed on the local DPU, `hamgrd` will create the state machine for the ENI. The state will initially be set to `Dead`, until it sees the state report from the DPU.
+   2. Create the DASH ENI object on the selected DPUs with its HA set id.
+      1. The detailed HA set information will be copied from the NPU to the DPU by SWSS, and the DPU side SWSS will wait for the HA set information to be ready and then create the DASH objects.
+      2. With the detailed HA set information, when an ENI on the DPU becomes active or standby, it will know how to establish the data plane channel to its peer.
+   3. After the DASH ENI object is created, it will report its state into the DASH state table, which is then consumed by `hamgrd` to drive the state machine transitions.
+   4. When the upstream service creates the ENI, it can use the desired state to control which side should be active. This allows us to distribute the ENIs evenly on the paired DPUs. Here are the 2 typical ways of using this state:
+      1. Set the desired state of the preferred side to `Active`. When the ENI is created, `hamgrd` will start to drive the state machine transition immediately, and try to make the preferred side active with the [Primary election](#73-primary-election) process.
+      2. If we want to make sure the DASH objects are programmed before the ENI is launched, we can set the ENI to the `Dead` state first, then program the DASH objects, and then set the desired state to `Active` or empty (as standby).
+

ENI creation step 1

+
+3. Once programming is finished and the ENIs come out of the `Dead` state, they will start forming the HA pair automatically:
+   1. The ENI level control plane channels will be created.
+   2. The 2 ENI state machines will start to communicate with each other via the control plane control channel, elect the primary and form a HA set automatically.
+4. All state transitions will be reported to the upstream service via gNMI, so the upstream service will know which ENI becomes the active side.
+   1. However, the elected active may or may not be the preferred one. For example, if the DPU pair is running in a standalone setup and the non-preferred side is the one running in the standalone state, the non-preferred side will be elected instead.
+

ENI creation step 2

+ +5. In the meantime, the state machine transition will also update the state of all ENIs, making sure the traffic is forwarded to the right DPU. + +

ENI creation step 3

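+
+As a minimal sketch of how the desired state from step 2.4 could be programmed, here is an illustrative write into the NPU side `DASH_ENI_HA_CONFIG_TABLE` (defined in the detailed design DB schema); the `redis` client usage, DB index and IDs are assumptions for illustration only:
+
+```python
+import redis
+
+def set_desired_ha_state(db: redis.Redis, vdpu_id: str, eni_id: str, state: str) -> None:
+    """Set desired_ha_state ("" (none), "dead", "active" or "standalone") for one ENI."""
+    assert state in ("", "dead", "active", "standalone")
+    key = f"DASH_ENI_HA_CONFIG_TABLE|{vdpu_id}|{eni_id}"
+    db.hset(key, mapping={"desired_ha_state": state})
+
+# Step 2.4.2 flow: create the ENI as Dead, program the DASH objects,
+# then flip the preferred side to Active to trigger primary election.
+db = redis.Redis(host="127.0.0.1", port=6379, db=0)
+set_desired_ha_state(db, "vdpu0", "eni0", "dead")
+# ... program DASH objects on both DPUs ...
+set_desired_ha_state(db, "vdpu0", "eni0", "active")
+```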
+ +### 5.4. ENI programming + +Programming ENI follows the same rule as ENI creation – Upstream service programs each ENI independently via our northbound interface. + +* For all DASH objects, they will directly go to DPU side databases. +* For HA configurations, they will be programmed on NPU side databases, because HA control plane is running on NPU side. However, each ENI will still be programmed independently. + +

ENI programming

+
+### 5.5. ENI removal
+
+ENI removal works in a similar way to ENI programming. In an HA setup, every ENI exists on multiple DPUs. Since our upstream service knows where all the ENIs are allocated on the switches, it can send the request down to all related switches to shut down and remove the ENI via our northbound interface.
+
+1. The upstream service first programs all switches to remove the traffic forwarding rules to stop the traffic.
+2. The upstream service then programs the 2 DPUs to remove the ENIs.
+
+If we decide to remove only a single ENI instead of the pair, it is very important to avoid data path impact as much as possible. In this case, we can use "[Planned Shutdown](#83-planned-shutdown)" to pin the traffic to the other side before removing the ENI.
+
+## 6. DPU liveness detection
+
+In order to implement HA, the first step is to tell whether a DPU is up or down. This is determined in the ways described below.
+
+The target of liveness detection is to determine the DPU state; we are not trying to detect and handle the gray network failures here. In short, if our probe says down, it is down, instead of 16% down (see "[Requirements, Assumptions and SLA](#3-requirements-assumptions-and-sla)").
+
+To determine the DPU state, we use 3 approaches:
+
+1. **Card level NPU-to-DPU liveness probe**: Detects whether a DPU is reachable from a switch. The minimum scope of this probe is card level, so it can only control the traffic for the entire card. If this probe is up, the NPU will be ***allowed*** to forward traffic to this card; otherwise, the NPU will stop forwarding traffic to all ENIs on that card.
+2. **ENI level NPU-to-DPU probe report**: Each ENI uses the control channel to report its liveness status to all the switches that advertise the DP VIP and are involved in traffic forwarding.
+3. **DPU-to-DPU liveness probe**: Detects gray network failures between DPUs. It can be on card level or ENI level. Whenever this probe fails, or the failure rate is above a certain level, we will start to consider the peer dead and drive into standalone mode.
+
+### 6.1. Card level NPU-to-DPU liveness probe
+
+Today, we will use BFD sessions to probe the local and remote DPUs in SmartSwitch.
+BFD is already supported by SONiC and can be offloaded to the switch ASIC. After we set up the BFD sessions, the switch ASIC will generate BFD packets to the local or remote DPUs to test whether the DPUs are alive.
+
+To control this probe behavior, we have 3 configurations:
+
+* The interval between probes.
+* How many probes must succeed continuously before we probe it up.
+* How many probes must fail continuously before we probe it down.
+
+And here are the details on how the card level probe controls the traffic:
+
+| DPU0 | DPU1 | Is forwarding to DPU0 allowed? | Is forwarding to DPU1 allowed? | Comments |
+| --- | --- | --- | --- | --- |
+| Down | Down | Yes | Yes | Both down is essentially the same as both up, hence the effect is the same as Up+Up. |
+| Down | Up | No | Yes | The NPU will forward all traffic to DPU1, even when the ENI's active runs on DPU0. |
+| Up | Down | Yes | No | The NPU will forward all traffic to DPU0, even when the ENI's active runs on DPU1. |
+| Up | Up | Yes | Yes | Respect the ENI-level traffic control (active-standby). |
+
+#### 6.1.1. BFD support on DPU
+
+Currently, BFD runs as part of FRR in the BGP container. However, on the DPU, we will not run the FRR/BGP stack, hence we don't have anything on the DPU side to support BFD today.
+
+To make BFD work, the practical and memory/space saving way is to implement a light-weight BFD server on the DPU.
+ +This approach allows us to reflect the true DPU liveness. A DPU being up doesn't mean the programs we run on it are ready to take traffic, e.g. in standby state. And having a custom-implemented BFD server can help us better control the traffic. + +##### 6.1.1.1. Lite-BFD server + +The lite-BFD server will be a standalone program running on the DPU. + +##### 6.1.1.2. BFD initialization on DPU boot + +During DPU boot and initialization, certain parts of the DPU packet processing pipeline might not be fully running, while our programs are already launched. If we started the BFD session at this moment, packets might come in and get dropped. + +To reflect the true liveness, we will postpone the BFD session setup until after SAI create switch is done. And SAI create switch should make sure every component in the system is fully running. + +Additionally, if any parts are broken, they will be reported via SAI. Once we receive the notification, we will shut down all ENIs on that card, and the BFD sessions as well. + +##### 6.1.1.3. (Optional) Hardware BFD on DPU + +It is also recommended to make the DPU support BFD hardware offload. This will make BFD more stable in cases such as a service crash. + +#### 6.1.2. BFD probing local DPU + +Probing the local DPU is straightforward: we can directly set up the BFD session in the NPU. The data path will look like below: + +
*Figure: BFD probing local DPU*
+ +#### 6.1.3. BFD probing remote DPU + +We will use multi-hop BFD to probe the remote DPU, with certain changes in the NPU. + +In order to make it work, we need the DPU IP to be advertised in the network, so that we can directly send BFD packets to it. With this, the data path will look like below: + +
*Figure: BFD probing remote DPU*
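To make this concrete, here is a minimal sketch of how the two probe sessions above could be programmed from the NPU side through SONiC's `BFD_SESSION_TABLE` in APPL_DB (the table consumed by the existing BFD orchagent). The key format and field names follow the existing SONiC BFD schema as we understand it; the addresses and timer values are purely illustrative.

```python
from swsscommon import swsscommon

appl_db = swsscommon.DBConnector("APPL_DB", 0)
bfd_table = swsscommon.ProducerStateTable(appl_db, "BFD_SESSION_TABLE")

def add_dpu_probe(peer_ip: str, local_ip: str, multihop: bool) -> None:
    """Create one NPU-to-DPU BFD session, offloadable to the switch ASIC."""
    fvs = swsscommon.FieldValuePairs([
        ("local_addr", local_ip),
        ("tx_interval", "100"),  # interval between probes, in ms (illustrative)
        ("rx_interval", "100"),
        ("multiplier", "3"),     # consecutive misses before probing it down
        ("multihop", "true" if multihop else "false"),
    ])
    # Key format: {vrf}:{interface}:{peer_ip}; "default" is used for both here.
    bfd_table.set("default:default:" + peer_ip, fvs)

add_dpu_probe("169.254.0.1", "10.0.0.1", multihop=False)  # local DPU, single hop
add_dpu_probe("10.1.0.33", "10.0.0.1", multihop=True)     # remote DPU, multi-hop
```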
+ +### 6.2. ENI level NPU-to-DPU traffic control + +Since we need to support failing over each ENI independently, we need a mechanism for managing the probe state at ENI level. However, the BFD session runs at card level and cannot control the traffic at ENI level. + +#### 6.2.1. Traffic control state update channel + +Because of the network setup for achieving data path HA, all T1s need to know which DPU to forward the traffic to. Hence, even if we somehow made BFD work at ENI level, we would end up creating more BFD sessions than we can offload to hardware. + +To solve this problem, we need a new way of controlling the per-ENI traffic forwarding behavior: instead of sending probes, each ENI broadcasts its state to all T1s for traffic control. + +To ensure the HA state machine can operate on its own, even when the upstream service is disconnected, we created the [HA control plane control channel](#5221-ha-control-plane-control-channel), which enables the DPU to send states to its peer DPU. Hence, we can use this channel directly to control the state. + +Briefly, the traffic control state update workflow works as below: + +1. Our upstream service decides which T1s are involved in receiving traffic and forming the HA set. +2. Our upstream service updates all involved T1s with the DP VIP, its related ENIs and the NPU/DPU info in the HA set. This makes the T1s create all the initial NPU traffic forwarding routes, as well as set up the control plane control channel and register for state update notifications properly. +3. Once the probe update notification is registered, the latest probe decision will be pushed over as the initial reconciliation. +4. Whenever we need to change the probe during an HA transition, we use this channel to talk to the other side and sync the probe across. + +#### 6.2.2. Traffic control mechanism + +Since currently only the active side handles the traffic in steady state, the ENI-level traffic control simply works as a route that sets up the next hop for the ENI. + +The traffic control state update mechanism is really simple: the last up wins. Whenever an ENI reports itself as the latest next hop, we start to use it from then on and shift the traffic to it. + +### 6.3. DPU-to-DPU liveness probe + +The purpose of the DPU-to-DPU liveness probe is to find gray failures in the system. Currently, gray failures can happen in 2 ways: + +* **Data path failure**: Some links are dropping traffic, and we need a way to avoid these links. +* **ENI pipeline failure**: The pipeline for a certain ENI is failing, but not all of them. + +#### 6.3.1. Card-level and ENI-level data plane failure + +To detect data plane failure, there are 2 things we should do: + +1. Have counters on how many flow replication packets are dropped, overall and for each ENI. +2. Send probe packets, so we can detect data path gray failures actively. + 1. The probe packets should have their source port rotated. + 1. The failure count should be tracked at both card level and ENI level, so we can use it to detect the data path being fully broken. + +When failures are detected by these 2 things, we will start to drive the ENI into standalone mode. + +Furthermore, we could also consider tracking the data path availability for each link/source port independently and avoiding the bad paths actively. Please see "[Multi-path data plane availability tracking](#43522-optional-multi-path-data-plane-availability-tracking)" for more information.
But this is totally optional; as we stated in "[Requirements, Assumptions and SLA](#3-requirements-assumptions-and-sla)", handling data path gray failure is nice to have, but it is not one of our main goals. + +#### 6.3.2. ENI-level pipeline failure + +The hard problem is how to really detect gray failures with a generic probe mechanism. + +* If it is for detecting hardware failures, they can and should be monitored by hardware monitors. +* If it is for detecting whether the hardware is booted and ready to use, delaying BFD responses until create switch is done, and hardware BFD, can both handle it well (see the "[BFD support on DPU](#611-bfd-support-on-dpu)" section). +* If it is for detecting a certain scenario failing, we don’t have enough knowledge to build the required packets to make sure they exercise the expected policies and assert the expected final transformation. Hence, a scenario-based probing service will be a better solution. + +Hence, to summarize, we are skipping designing an ENI-level pipeline probe here. + +### 6.4. Probe state pinning + +For each ENI, besides probing and handling the probe state updates, we can also choose to pin the probe state to whatever state we like and use it as the final state. This is mostly designed for failure handling, where we want to force the packets to go to a certain place. + +Since we have 2 probes to control the traffic – card level and ENI level – we need to support pinning both probe states, in case it is needed by an HA state transition or a manual live-site mitigation. + +The underlying mechanism for both types of probes works in a similar way: + +* Pinning a probe state will not stop the underlying probe mechanism from running. The original probe state will still be updated whenever it changes. +* Whenever a pinned probe state exists, we will always honor it. However, due to certain limitations, the detailed effect of a pinned probe state will be slightly different. Please see the sub-sections below. + +#### 6.4.1. Pinning BFD probe + +When a BFD probe state is pinned, it will control the traffic for the entire switch or card, depending on the HA scope. + +* If pinned to up, it will allow the traffic to its owning card to be forwarded. +* If pinned to down, it will stop all traffic to its owning card from being forwarded. + +The detailed effects of probe pinning are listed below. + +| Original probe state | Pinned probe state | Final probe state | +| --- | --- | --- | +| Down | None | Down | +| Up | None | Up | +| Down | Up | Up | +| Up | Up | Up | +| Down | Down | Down | +| Up | Down | Down | + +A few things to note: + +* The BFD probe is a big hammer: it only controls the traffic at a higher level. The ENI-level probe will still be honored when forwarding traffic. Hence, the traffic will only be forwarded when both the BFD probe and the ENI-level probe are up. +* It is ok for both cards to be pinned up, because that allows the traffic to be forwarded to either side, but won’t force the traffic to be forwarded to both sides. + +#### 6.4.2. Pinning ENI-level traffic control state + +Since ENI-level traffic control works differently from the card level BFD probes, ENI-level probe pinning also works differently. Instead of allowing both sides to be pinned to a certain state independently, we limit the pin to only 3 cases: None, DPU1 or DPU2. Whenever a pin exists, we honor the pin instead of the original state.
+ +The detailed effects are listed below: + +| Next hop | Pinned next hop | Final next hop | +| --- | --- | --- | +| None | None | None | +| DPU1 | None | DPU1 | +| DPU2 | None | DPU2 | +| None | DPU1 | DPU1 | +| DPU1 | DPU1 | DPU1 | +| DPU2 | DPU1 | DPU1 | +| None | DPU2 | DPU2 | +| DPU2 | DPU2 | DPU2 | +| DPU1 | DPU2 | DPU2 | + +### 6.5. All probe states altogether + +Now, let's put all probe states together and see how the next hop gets decided: + +1. If the incoming packet already has the NPU tunnel encap, skip the process and forward it as is. +2. Check the card level probe states for both DPUs. If a pinned state exists, use the pinned state and continue checking below. +3. If only one side's card level probe state is up, forward the packet to that side. Otherwise continue. +4. Do an ENI lookup on the packet and get the ENI-level probe state. +5. If the ENI-level probe state is pinned, forward the packet to the pinned DPU. +6. If only one side's ENI-level probe state is up, forward the packet to that side. +7. If both are up, forward the packet to the last updated one. +8. Otherwise (both are down), drop the packet. + +In the real implementation, we could consider pre-calculating certain decisions, such as the ENI-level next hop. The decision list is sketched as code below.
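For reference, here is a small Python sketch of the decision list above. It models the ENI-level state as the latest reported next hop (per the last-up-wins rule in section 6.2.2) rather than as per-side up/down flags; all names here are illustrative, not actual SONiC code.

```python
from typing import Dict, Optional

FORWARD_AS_IS, DPU0, DPU1, DROP = "forward-as-is", "dpu0", "dpu1", "drop"

def pick_next_hop(has_npu_encap: bool,
                  card_probe_up: Dict[str, bool],       # BFD probe result per DPU
                  card_pin: Dict[str, Optional[bool]],  # pinned BFD state per DPU
                  eni_next_hop: Optional[str],          # last reported ENI next hop
                  eni_pin: Optional[str]) -> str:       # pinned ENI next hop
    # 1. Already tunneled by another NPU: forward as-is.
    if has_npu_encap:
        return FORWARD_AS_IS
    # 2. Card-level probe state, with the pinned state taking precedence.
    dpu0_up = card_pin[DPU0] if card_pin[DPU0] is not None else card_probe_up[DPU0]
    dpu1_up = card_pin[DPU1] if card_pin[DPU1] is not None else card_probe_up[DPU1]
    # 3. Only one card up: forward there (both down behaves the same as both up).
    if dpu0_up != dpu1_up:
        return DPU0 if dpu0_up else DPU1
    # 4/5. The ENI-level pin wins over the reported state.
    if eni_pin is not None:
        return eni_pin
    # 6/7. Otherwise honor the last reported ENI-level next hop.
    if eni_next_hop is not None:
        return eni_next_hop
    # 8. Neither side is usable: drop.
    return DROP
```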
+ +## 7. HA state machine + +Since switches and DPUs can be restarted, and ENIs can be created/moved/destroyed, for each HA scope, i.e. ENI, we will need to manage its states during its lifecycle. + +Again – the goals of the HA design are: + +* Providing a failover plan with only 2 instances in a HA set, forming a HA pair. If there are more than 2 instances, we will need a more complicated state machine or underlying mechanism to make it work. +* At the end of each planned or unplanned event w/ recovery, the flows in all HA participants are matched. +* Stale flows can be created due to the policy being not up to date. It is the responsibility of the upstream services to make sure the policies on all DPUs are up to date and aligned, to avoid stale/conflicting flows from being created. + +> In this section, we will have a lot of graphs showing how "Term" is changed during the state transitions. For what a term is and how it affects flow replication, please refer to the "[Primary election](#73-primary-election)" and "[Range sync](#1152-range-sync-eni-level-only)" sections for more explanation. + +### 7.1. HA state definition and behavior + +Here are all the states that each HA participant will go through to help us achieve HA. Their definitions and behaviors are listed below. + +| State | Definition | Receive traffic from NPU? | Make decision? | Old flow handling? | Respond flow sync? | Init flow sync? | Init Bulk sync? | Comment | +| --- | --- | --- | --- | --- | --- | --- | --- | --- | +| Dead | HA participant is just getting created, and not connected yet. | No | Drop | Drop | No | No | No | | +| Connecting | Connecting to its peer. | No | Drop | Drop | No | No | No | | +| Connected | Connected to its peer, but starting primary election to join the HA set. | No | Drop | Drop | No | No | No | | +| InitializingToActive | Connected to pair for the first time, voted to be active. | No | Drop | Drop | No | No | No | We have to drop the packets at this moment, because we need to ensure the peer is ready to respond to the flow replication traffic; otherwise that traffic would be dropped. | +| InitializingToStandby | Connected to pair for the first time, voted to be standby. | No | Tunneled to pair | Tunneled to pair | Yes | No | No | It needs to wait for existing flows to be replicated before going to the next state. | +| Destroying | Preparing to be destroyed. Waiting for existing traffic to drain out. | No | Tunneled to pair | Tunneled to pair | No | No | No | Waits for traffic to drain and doesn’t take any new traffic load. | +| Active | Connected to pair and acting as the decision maker. | Yes | Yes | Yes | No | Yes | Yes | | +| Standby | Connected to pair and acting as the backup flow store. | No | Tunneled to pair | Tunneled to pair | Yes | No | No | | +| Standalone | Heartbeat to pair is lost. Acting like a standalone setup. | Yes | Yes | Yes | Yes | No | No | Acting as if the peer doesn’t exist, hence skipping all data syncs.<br><br>However, the peer can still be connected, because in certain failure cases, we have to drive the DPU to standalone mode for mitigation. | +| SwitchingToActive | Connected and preparing to switch over to active. | No | Yes | Yes | Yes | Yes | No | SwitchingToActive is a transient state to help the old active move to standby; hence it accepts flow sync from the old active, as well as making decisions and syncing flows back once the old active has moved to standby.<br><br>Bulk sync is not used in SwitchOver at all. | +| SwitchingToStandby | Connected, leaving active state and preparing to become standby. | Yes | Tunneled to pair | Tunneled to pair | Yes | No | No | SwitchingToStandby is a transient state to help the new active move from the SwitchingToActive state to the Active state. It is identical to Standby, except that it keeps responding to the NPU probe while tunneling traffic across. This is needed to make sure we always have only 1 decider at any moment during the transition.<br><br>Bulk sync is not used in SwitchOver at all. | + +Here are the definitions of the columns above: + +* **Receive traffic from NPU**: The ENI level NPU-to-DPU traffic control state. If yes, we will set this DPU as the next hop for the ENI, so the NPU will forward traffic to it. +* **Make decision**: Will this DPU create new flows for new connections (not found in the flow table, including both syn and non-syn cases, see the "[Flow creation and initial sync](#113-flow-creation-and-initial-sync)" section)? +* **Old flow handling**: When packets of an existing flow arrive at this DPU, what will the DPU do? +* **Respond flow sync**: Will this DPU accept flow replication requests from its paired DPU? +* **Init flow sync**: Will this DPU send flow replication requests to its paired DPU when a new flow is created? +* **Init bulk sync**: Will this DPU start bulk sync in this state? + +### 7.2. State transition + +The state transition graph is shown below: + +
*Figure: HA state transition*
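To make the definitions above easier to check against, the per-state behaviors from the table in section 7.1 can be captured as data. The sketch below is illustrative Python, not the actual `hamgrd` implementation; all names are ours.

```python
from dataclasses import dataclass
from enum import Enum

class FlowHandling(Enum):
    DROP = "drop"                # drop the packet
    HANDLE = "handle"            # make/apply the flow decision locally
    TUNNEL_TO_PAIR = "tunnel"    # forward the packet to the paired DPU

@dataclass(frozen=True)
class StateBehavior:
    receives_npu_traffic: bool   # "Receive traffic from NPU?" column
    new_flow: FlowHandling       # "Make decision?" column
    old_flow: FlowHandling       # "Old flow handling?" column
    responds_flow_sync: bool     # "Respond flow sync?" column
    inits_flow_sync: bool        # "Init flow sync?" column
    inits_bulk_sync: bool        # "Init Bulk sync?" column

D, H, T = FlowHandling.DROP, FlowHandling.HANDLE, FlowHandling.TUNNEL_TO_PAIR
BEHAVIORS = {
    "Dead":                  StateBehavior(False, D, D, False, False, False),
    "Connecting":            StateBehavior(False, D, D, False, False, False),
    "Connected":             StateBehavior(False, D, D, False, False, False),
    "InitializingToActive":  StateBehavior(False, D, D, False, False, False),
    "InitializingToStandby": StateBehavior(False, T, T, True,  False, False),
    "Destroying":            StateBehavior(False, T, T, False, False, False),
    "Active":                StateBehavior(True,  H, H, False, True,  True),
    "Standby":               StateBehavior(False, T, T, True,  False, False),
    "Standalone":            StateBehavior(True,  H, H, True,  False, False),
    "SwitchingToActive":     StateBehavior(False, H, H, True,  True,  False),
    "SwitchingToStandby":    StateBehavior(True,  T, T, True,  False, False),
}
```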
+ +### 7.3. Primary election + +When 2 DPUs reconnect to each other, they need to decide which one should be the active side. + +The key idea is to find the DPU that has lived the longest and knows the most about the historical flow changes. To do it, we introduce a concept called "Term": + +1. In the beginning or in steady state, both DPUs will have the same set of flows, as all flows are inline sync’ed. Hence, the 2 DPUs will have the same term. +2. If a DPU can make flow decisions, but it cannot inline sync to its peer anymore, their flow data will start to diverge, and at this moment, we move to the next term. +3. Whenever the inline sync channel can be established again, the flow data will stop diverging, so we can move to the next term and resume the inline sync. +4. Hence, whenever the term is larger, it means this node knows more about the flow history, and it should be preferred. + +With this idea, the implementation becomes really simple – we only need to compare the terms. However, we might have some cases where we prefer the active on one side for better traffic distribution, or might end up with situations where both sides have the same term. Hence, to make the algorithm more practical, we add a few more things to the primary election algorithm. + +The key variables are shown below: + +* **Term** (Int, Default = 0): The term of the ENI. It changes in 2 cases: + * Increases whenever the ENI establishes or loses the inline sync channel with its peer, while it can make flow decisions. + * Matched to the active side, when we are back to the stable state – the Active-Standby pair. +* **CurrentState** (Enum): The current HA state of the ENI. This is to avoid comparing with ENIs before the Connected state. +* **DesiredState** (Enum): When both sides are in the same term, set the active to the preferred side. The initial value comes from the ENI configuration. +* **RetryCount** (Int, Default = 0): How many times we have retried. This is to avoid retrying forever. + +Hence, when receiving a `RequestVote` message, our primary election algorithm works as below (see the sketch after this list): + +1. If the ENI is not found: + 1. If `RetryCount` is smaller than a specified threshold, respond `RetryLater`. + 2. Otherwise, respond `BecomeStandalone`. +2. If `DesiredState` is `Standalone` (pinned), respond `RetryLater`. +3. If the ENI is running in `Active` state, continue to be active and respond `BecomeStandby`. +4. If both `CurrentState` and `DesiredState` are `Dead`, or `CurrentState` is `Destroying`, respond `BecomeStandalone`. +5. If `CurrentState` is `Dead` or `Connecting`: + 1. If `RetryCount` is smaller than a specified threshold, respond `RetryLater`. + 2. Otherwise, respond `BecomeStandalone`. +6. If our `Term` is higher, respond `BecomeStandby` to the other side. If our `Term` is smaller, respond `BecomeActive`. Otherwise, continue. +7. If `DesiredState` is set to `Active` locally, but not on the other side, respond `BecomeStandby`. If it is set on the other side, but not locally, respond `BecomeActive`. Otherwise continue (it could be set on both or neither). +8. Return `RetryLater`. If `RetryCount` is larger than a certain threshold, fire alerts, because our upstream is not updating the ENI config properly.
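Put together, the election handler can be sketched as below. This is illustrative Python that follows the steps just listed, with an assumed retry threshold; it is not the actual `hamgrd` code.

```python
from dataclasses import dataclass
from typing import Optional

RETRY_LATER, BECOME_ACTIVE = "RetryLater", "BecomeActive"
BECOME_STANDBY, BECOME_STANDALONE = "BecomeStandby", "BecomeStandalone"
RETRY_THRESHOLD = 3  # illustrative value

@dataclass
class EniHaState:
    term: int
    current_state: str  # "Dead", "Connecting", "Connected", "Active", ...
    desired_state: str  # "Active", "Dead", "Standalone", "None", ...
    retry_count: int

def on_request_vote(eni: Optional[EniHaState], peer_term: int,
                    peer_desired_state: str, retry_count: int) -> str:
    # 1. The ENI is not programmed on this side yet.
    if eni is None:
        return RETRY_LATER if retry_count < RETRY_THRESHOLD else BECOME_STANDALONE
    # 2. Pinned to standalone: never hand out a vote result.
    if eni.desired_state == "Standalone":
        return RETRY_LATER
    # 3. Already active: stay active, make the peer standby.
    if eni.current_state == "Active":
        return BECOME_STANDBY
    # 4. Dead or being destroyed: the peer should run alone.
    if (eni.current_state == "Dead" and eni.desired_state == "Dead") or \
            eni.current_state == "Destroying":
        return BECOME_STANDALONE
    # 5. Not connected ourselves yet.
    if eni.current_state in ("Dead", "Connecting"):
        return RETRY_LATER if eni.retry_count < RETRY_THRESHOLD else BECOME_STANDALONE
    # 6. The higher term knows more flow history and wins.
    if eni.term > peer_term:
        return BECOME_STANDBY
    if eni.term < peer_term:
        return BECOME_ACTIVE
    # 7. Same term: break the tie with the configured preferred side.
    if eni.desired_state == "Active" and peer_desired_state != "Active":
        return BECOME_STANDBY
    if peer_desired_state == "Active" and eni.desired_state != "Active":
        return BECOME_ACTIVE
    # 8. Still tied: retry, and alert if this keeps happening.
    return RETRY_LATER
```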
+ +### 7.4. HA state persistence and rehydration + +Since `hamgrd` could crash in the middle of an HA state transition and lose all its in-memory state, we need to persist the HA state whenever it changes, and provide a way to rehydrate it back. + +This is done by 2 key things: + +1. All states related to HA state machine transitions will be saved to Redis whenever they change. This includes the per-ENI current HA state, the cached peer state, etc. +2. All actions that we do during state transitions are idempotent. For example: + * **Move the current state to a specific state**: If we repeat this action, it won’t have any effect. + * **Update all probe states to Up or Down**: If we repeat this action, we will get the same result. + * **RequestVote**: If we won or lost the vote previously, we shall win or lose the vote now, as long as the peer has not crashed. + * **Start bulk sync**: If we start it again, we should either get an in-progress response, or start a new one if the previous one is done. + * **Bulk sync done**: This action is usually equivalent to "Update state to Standby". If we are already in standby, it is a no-op. + * **SwitchOver/ShutdownStandby**: If we are already in the next state, the operation will be a no-op. + * ... + +This makes the state rehydratable, as sketched below. Once we move into a certain state, we can update Redis to make sure it is persisted, then invoke whatever actions we want and move on. If we crash, we can read the states back from Redis and retry all actions safely. + +During rehydration, we also need to check the current config of the ENI, to ensure it has not been removed.
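As an illustration of this persist-then-act pattern, here is a minimal Python sketch. It assumes the per-ENI HA state lives in a Redis hash; the key layout, field names and `redis` client usage are ours, not the actual `hamgrd` schema.

```python
import redis

r = redis.Redis()  # in practice, the switch's state database

def transition(eni_id: str, new_state: str, actions) -> None:
    """Persist the new state first, then run the idempotent actions."""
    r.hset(f"DASH_HA_STATE:{eni_id}", "state", new_state)
    for action in actions:
        action()  # each action is safe to re-run after a crash

def rehydrate(eni_id: str, actions_for_state) -> None:
    """On hamgrd restart, reload the saved state and retry its actions."""
    saved = r.hget(f"DASH_HA_STATE:{eni_id}", "state")
    if saved is None:
        return  # the ENI was never programmed, or its config was removed
    for action in actions_for_state(saved.decode()):
        action()
```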
+ +## 8. Planned events and operations + +### 8.1. Launch + +After a new HA participant is created, e.g. an ENI, we need to make it join the existing HA set, get the existing flows from its peer and start taking traffic. + +#### 8.1.1. Clean launch on both sides + +A clean launch starts with our upstream service creating the ENIs on the DPUs. Please see the "[ENI creation](#53-eni-creation)" section for more information. + +Once the ENIs are created, they will start the clean launch process: + +*Figure: HA clean launch*
+ +1. Both DPUs will start in the `Dead` state. +2. During launch, they will move to the `Connecting` state and start to connect to their peers. +3. Once connected, they will start to send `RequestVote` to their peers with their own information, such as the Term, to start the primary election process. This will decide which of them will become Active or Standby. +4. Say DPU0 becomes active. Then, DPU0 will start to move to the `InitializingToActive` state, while its peer moves to the `InitializingToStandby` state. +5. Once DPU0 receives the `HAStateChanged` event and knows DPU1 is ready, DPU0 will: + 1. Move to the `Active` state. This will cause the ENI traffic forwarding rule to be updated, and make the traffic start to flow. + 2. Send back the `BulkSyncDone` event, since there is nothing to sync. +6. Once it receives the `BulkSyncDone` event, DPU1 will move to the `Standby` state. DPU1 will also receive the `HAStateChanged` event from DPU0, which updates the Term from 0 to 1. + +To summarize the workflow, here is the sequence diagram: + +```mermaid +sequenceDiagram + autonumber + + participant S0D as Switch 0 DPU
(Desired Active) + participant S0N as Switch 0 NPU + participant S1N as Switch 1 NPU + participant S1D as Switch 1 DPU
(Desired Standby) + participant SA as All Switches + + S0N->>S0N: Move to Connecting state + S1N->>S1N: Move to Connecting state + + S0N->>S1N: Connect to peer and
send RequestVote with its
term and desired state + S1N->>S0N: Connect to peer and
send RequestVote with its
term and desired state + S0N->>S1N: Respond BecomeStandby + S1N->>S0N: Respond BecomeActive + + S0N->>S0N: Move to InitializingToActive state + S1N->>S1N: Move to InitializingToStandby state + S1N->>S1D: Activate standby role + S1N->>S0N: Send HAStateChanged event + + S0N->>S1N: Send BulkSyncDone event + S0N->>S0N: Move to Active state
and move to next term (Term 1) + S0N->>S0D: Activate active role with term 1 + S0N->>SA: Update ENI traffic forwarding
rule next hop to DPU0 + Note over S0N,SA: From now on, traffic will be forwarded
from all NPUs to DPU0 + S0N->>S1N: Send HAStateChanged event + + S1N->>S1N: Move to Standby state and update its term +``` + +#### 8.1.2. Launch with standalone peer + +When an ENI is created, its peer might already be created and running. This can happen in multiple cases, such as recovering from a planned/unplanned failure, or an ENI being moved from 1 switch to another. + +Once created, the new ENI will go through similar steps as above and join the HA set/pair. Say DPU1 is the one trying to start: + +
*Figure: HA launch with standalone peer*
+ +1. DPU1 will start in the `Dead` state, then move to the `Connecting` state and start to connect to its HA pair. +2. After successfully connecting to its peer, the new node will move to the `Connected` state and send out a `RequestVote` message to its peer to start the primary election (see "[Primary Election](#73-primary-election)" for the detailed algorithm). +3. The primary election response will move the new node to the `InitializingToStandby` state. + * Using the `RequestVote` method keeps the launch process simple: the pair could also be doing a clean start, so we are not guaranteed to become the standby, but might also become the active. + * If DPU0 is pinned to the standalone state, it can reject the `RequestVote` message with a retry-later response. +4. DPU1 sends the `HAStateChanged` event to DPU0, so DPU0 knows DPU1 is ready. Then DPU0 moves to the `Active` state, which will: + 1. Start replicating flows inline. + 2. Start bulk sync to replicate all old flows. + 3. Move to the next term, because the flow replication channel is re-established. +5. When entering the `Active` state, DPU0 will also send out the `HAStateChanged` event to sync the term across, but DPU1 should save this update and postpone applying it until it moves to the standby state. This is to ensure all previous flows are copied over. After bulk sync is done, DPU0 will send the `BulkSyncDone` event to DPU1 to move DPU1 to `Standby`. + +To summarize the workflow, here is the sequence diagram: + +```mermaid +sequenceDiagram + autonumber + + participant S0D as Switch 0 DPU
(Initial Standalone) + participant S0N as Switch 0 NPU + participant S1N as Switch 1 NPU + participant S1D as Switch 1 DPU
(Initial Dead) + participant SA as All Switches + + S1N->>S1N: Move to Connecting state + + S1N->>S0N: Connect to peer + S1N->>S1N: Move to Connected state + + S1N->>S0N: Send RequestVote with its
term and desired state + S0N->>S1N: Respond BecomeStandby + + S1N->>S1N: Move to InitializingToStandby state + S1N->>S1D: Activate standby role + S1N->>S0N: Send HAStateChanged event + + S0N->>S0N: Move to Active state and move to next term + S0N->>S0D: Activate active role with new term + S0N->>S0D: Start bulk sync + S0N->>S1N: Send HAStateChanged event + + S0N->>S1N: Send BulkSyncDone event + S1N->>S1N: Move to Standby state and update its term to match its peer +``` + +#### 8.1.3. Launch with no peer + +However, when the DPU is launching or the ENI is being created, its peer might not exist, either because it has not been created yet, or because it has crashed and not recovered yet (if the peer ENI were running, we would be able to connect to it). + +In this case, we go through the same checks as for handling unplanned events and drive the new ENI to the standalone state. + +The detailed steps are shown below: + +
*Figure: HA launch with no peer*
+ +1. After DPU1 moves to the Connecting state, it will keep retrying to establish the control channel. If the network is having trouble, we might see some intermittent failures, but they will soon be addressed thanks to our ECMP network setup. +2. Since DPU0 is dead, switch 0's NPU will return BecomeStandalone with reason PeerDead or PeerNotFound to DPU1. +3. DPU1 will directly move its state to Standalone and move its term to the next term. + +To summarize the workflow, here is the sequence diagram: + +```mermaid +sequenceDiagram + autonumber + + participant S0D as Switch 0 DPU
(Dead) + participant S0N as Switch 0 NPU + participant S1N as Switch 1 NPU + participant S1D as Switch 1 DPU
(Initial Dead) + participant SA as All Switches + + S1N->>S1N: Move to Connecting state + + S1N->>S0N: Connect to peer and
send RequestVote with its
term and desired state + S0N->>S1N: Respond BecomeStandalone
with Reason PeerDead or PeerNotFound + + S1N->>S1N: Move to Standalone state and move to next term + S1N->>S1D: Activate Standalone role
with the new term + S1N->>SA: Update ENI traffic forwarding
rule next hop to DPU1 + Note over S1N,SA: From now on, traffic will be forwarded
from all NPUs to DPU1 +``` + +Afterwards, when DPU0 comes back, it will launch again, following the "[Launch with standalone peer](#812-launch-with-standalone-peer)" sequence above. + +### 8.2. Planned switchover + +Planned switchovers are usually done for reasons like maintenance. + +With the planned switchover, we are trying to achieve the following goals: + +1. Close to zero loss, coordinating between the active and standby to achieve this goal (see the "[Requirements, Assumption and SLA](#3-requirements-assumptions-and-sla)" section). +2. Avoid extra flow merges during and after switchover, which includes: + 1. Only 1 instance takes traffic and makes flow decisions at any moment during the entire switchover. + 1. Avoid using bulk sync for any flow replication. During the transition, all flows should be replicated inline. + +#### 8.2.1. Workflow + +Planned switchover starts from a standby node, because in order to avoid flow loss, we need to make sure we have 1 valid standby node that works. Here are the main steps: + +1. The upstream service sets the standby's (DPU1) desired state to `Active` and the current active's (DPU0) to `None` to initiate the switchover. + +
*Figure: HA planned switchover step 1*
+ + To ensure we know which config is the latest one, because the 2 ENIs are programmed independently, the HA config should have a version number, which is increased whenever the config is changed. And the upstream service should set the desired state to `Active` with the latest version number. + +2. Switch 1 detects that the desired state does not match, and updates its state to trigger a gNMI notification to the upstream service, requesting a switchover. The upstream service sets the approved switchover id after ensuring the policies match and pausing future policy updates. + +
*Figure: HA planned switchover step 1*
+ + This extra step is also used when we [recover from standalone setup](#1016-recovery-from-standalone-setup). + +3. DPU1 moves to the `SwitchingToActive` state and sends `SwitchOver` to DPU0 (the active). + +
*Figure: HA planned switchover step 2*
+ + We cannot move DPU1 directly to the `Active` state; we need the `SwitchingToActive` state to help us do the transition, because DPU0 doesn’t accept flow replication while it is still running in the `Active` state. + + To ensure we always have 1 decision maker at all times, the `SwitchingToActive` state will: + + 1. Support making decisions and handling traffic. + 1. Set the ENI probe to down, but wait for the other side to give up the active state, and tunnel the traffic across. + 1. Accept flow replication requests, because the other side is still running in the active state and replicating flows. + + If DPU0 is not in the `Active` state, or also has its desired state set to `Active`, it will reject the `SwitchOver` request, which moves DPU1 back to `Standby`. Later on, DPU1 will retry the switchover. + +4. Once switch 1 receives approval, DPU0 moves itself to the `SwitchingToStandby` state and notifies DPU1 that it is ready with `HAStateChanged`. + +
*Figure: HA planned switchover step 3*
+ + We cannot transit to the `Standby` state directly, because that would set the ENI probe to down on both sides and drop all traffic. + + In the `SwitchingToStandby` state, the DPU will: + 1. Continue to respond to the NPU probe, so traffic is still sent to it. + 1. Stop making decisions and tunnel all traffic to the pair, no matter whether it is a new flow or an existing flow. + + Hence, although the traffic can land on both DPUs, there is only 1 decision maker – DPU1. + +5. After DPU1 receives the `HAStateChanged` event from DPU0, it drives itself to the `Active` state and notifies DPU0 back with `HAStateChanged`. + +
*Figure: HA planned switchover step 4*
+ + This will probe DPU1 up on both NPUs. And due to the last-win rule, all traffic will be forwarded to DPU1. + +6. Finally, the old active DPU0 drives itself to the `Standby` state. This probes DPU0 down on both NPUs. All traffic will stay on DPU1, as the NextHop is not changed. + +
*Figure: HA planned switchover step 5*
+ +To make the workflow clearer, here is the whole workflow summarized as a sequence diagram: + +```mermaid +sequenceDiagram + autonumber + + participant U as Upstream Service + participant S0D as Switch 0 DPU
(Initial Active) + participant S0N as Switch 0 NPU + participant S1N as Switch 1 NPU + participant S1D as Switch 1 DPU
(Initial Standby) + participant SA as All Switches + + U->>S1N: Set desired state to Active + U->>S0N: Set desired state to None + + S1N->>U: Send request switchover notification + U->>S1N: Set approved switchover id + + S1N->>S1N: Move to SwitchingToActive state + S1N->>S1D: Activate SwitchingToActive role + S1N->>S0N: Send SwitchOver to active ENI + + S0N->>S0N: Move to SwitchingToStandby state + S0N->>S0D: Activate Standby role + Note over S0N,S0D: From now on, traffic will be
tunneled from old active ENI to new + S0N->>S1N: Send HAStateChanged to standby ENI + + S1N->>S1N: Move to Active state + S1N->>S1D: Activate Active role + S1N->>SA: Update ENI next hop to DPU1 + Note over S1N,SA: From now on, traffic will be forwarded
from all NPUs to DPU1 + S1N->>S0N: Send HAStateChanged to old active ENI + + S0N->>S0N: Move to Standby state + S0N->>S1N: Send HAStateChanged to new active ENI +``` + +#### 8.2.2. Packet racing during switchover + +Although, from the state machine point of view, we have ensured that there will be only 1 flow decider at any time, in practice there could still be some packet racing during the switchover, because flow replication packets can be delayed by flow insertion latency and network latency. + +```mermaid +sequenceDiagram + autonumber + + participant C as Customer + participant S0D as DPU0
(Initial Active) + participant S1D as DPU1
(Initial Standby) + + C->>S0D: First packet lands on DPU0 + S0D->>S0D: First packet creates flow on DPU0
and start replication + note over S0D,S1D: NPU shifted traffic to DPU1 + C->>S1D: Second packet lands on DPU1
before flow replication packet arrives + S1D->>S1D: Second packet creates another flow on DPU1 + S0D->>S1D: Flow replication packet arrives on DPU1 + note over S0D,S1D: Flow conflict!!! +``` + +This case will mostly happen with UDP traffic or other connection-less traffic, because TCP traffic goes through a handshake, which usually won't run into this problem. + +To avoid this issue causing problems, we do the 2 things below during switchover: + +1. In the case above, the flow replication packet will be dropped, because the flow is already created on DPU1. This prevents the flow created on DPU0 from being installed over it, and solves this problem. +2. In switchover step 2 above, we request approval from our upstream service, which will ensure both DPUs run the same set of policies and stop future policy programming until the switchover is finished. This helps us ensure the flows created on both sides are identical, if this ever happens due to bugs, etc. (see the [stable decision assumption](#33-assumptions)). + +This helps us avoid conflicting / stale flows being created during the switchover. + +#### 8.2.3. Failure handling during switchover + +During switchover, both NPU and DPU liveness detection are still running in the background. If any node dies in the middle, instead of rolling states back, we start the unplanned event workflow right away by driving the HA pair into a standalone setup. + +See the discussions for [unplanned events](#9-unplanned-events) below. + +### 8.3. Planned shutdown + +Planned shutdown is usually used for maintenance purposes, like upgrades. Same as planned switchover, we aim to have close to zero traffic loss (see the "[Requirements, Assumption and SLA](#3-requirements-assumptions-and-sla)" section). + +#### 8.3.1. Planned shutdown standby node + +Because each ENI is programmed independently via our northbound interface, the request will start from the standby side. However, when doing a planned shutdown of the standby node, we need to make sure the current active works, so we ask the active node to initiate the shutdown process. + +1. The upstream service updates the standby's (DPU1) desired state to `Dead`. + +
*Figure: HA planned shutdown standby step 1*
+ +2. The standby (DPU1) sends a `RequestShutdown` message to the current active (DPU0) to initiate the shutdown process. + +
*Figure: HA planned shutdown standby step 2*
+ + If DPU0 is not in the `Active` state, it will reject the `RequestShutdown` and DPU1 will retry later. + +3. If DPU0 is in the `Active` state, it will move to the `Standalone` state, stop flow replication to avoid later data path impact due to flow replication packet drops, and send a `ShutdownStandby` message to DPU1. + +
*Figure: HA planned shutdown standby step 3*
+ +4. DPU1 can now move to the `Destroying` state. In the `Destroying` state, the DPU will wait for the existing flow replication traffic to drain. + + Since DPU1's state has changed, it will still send back an `HAStateChanged` message, but the message will not trigger any action. + +
*Figure: HA planned shutdown standby step 4*
+ + If the standby side is stuck and times out responding to the `ShutdownStandby` message, we could be running as a "Standalone-Standby" pair, which has no data path impact either. The standby side can then be force-shutdown afterwards, as in the unplanned case. + +5. Once all flow replication traffic is drained, DPU1 moves to the `Dead` state, so it will not restart launching unless it is explicitly moved back to the `Connecting` state to restart the launch process. + +
*Figure: HA planned shutdown standby step 5*
+ +To make the workflow clearer, here is the whole workflow summarized as a sequence diagram: + +```mermaid +sequenceDiagram + autonumber + + participant U as Upstream Service + participant S0D as Switch 0 DPU
(Initial Active) + participant S0N as Switch 0 NPU + participant S1N as Switch 1 NPU + participant S1D as Switch 1 DPU
(Initial Standby) + + U->>S1N: Set desired state to Dead + + S1N->>S0N: Send RequestShutdown to active ENI + + S0N->>S0N: Move to Standalone state + S0N->>S0D: Activate Standalone role + S0N->>S1N: Send ShutdownStandby to standby ENI + + S1N->>S1N: Move to Destroying state + S1N->>S1D: Shutdown ENI + S1N->>S0N: Send HAStateChanged to active ENI + S1N->>S1D: Pull DPU counters and wait for traffic to drain + + S1N->>S1D: Move to Dead state + S1N->>S0N: Send HAStateChanged to active ENI +``` + +#### 8.3.2. Planned shutdown active node + +To shut down the active node, the most important thing is to make sure the new active node will be running properly. Hence, this can be split into 2 steps: + +1. Switch over the current active to the desired active node (see "[Planned switchover](#82-planned-switchover)" for the detailed workflow). +2. Shut down the old active node, which is now running as standby (see "[Planned shutdown standby node](#831-planned-shutdown-standby-node)" for the detailed workflow). + +### 8.4. ENI-level DPU isolation / Standalone pinning + +If we suspect an ENI pipeline is not working right, and it is caused by an unknown DPU hardware/software problem, we could try to isolate this DPU for this ENI. This can be achieved in 2 steps: + +1. Use [Planned Switchover](#82-planned-switchover) to move the active to the other DPU. +2. Set the active node's desired state to `Standalone` to drive this ENI into a standalone setup. + +
*Figure: HA pin standalone*
+ +3. To recover, the upstream service sets the standalone node's desired state back to `Active` to remove the pinning. A sketch of both operations follows the figure below. + +
*Figure: HA unpin standalone*
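To illustrate steps 2 and 3 above, here is a hypothetical sketch of the upstream service driving the desired state through CONFIG_DB. The table name `DASH_ENI_HA_CONFIG` and its fields are placeholders for whatever the real northbound schema defines.

```python
from swsscommon import swsscommon

config_db = swsscommon.DBConnector("CONFIG_DB", 0)
# Hypothetical table name; the real one comes from the HA northbound interface.
ha_cfg = swsscommon.Table(config_db, "DASH_ENI_HA_CONFIG")

def set_desired_state(eni_id: str, state: str, version: int) -> None:
    ha_cfg.set(eni_id, swsscommon.FieldValuePairs([
        ("desired_state", state),
        ("version", str(version)),  # must increase on every config change
    ]))

set_desired_state("eni-1", "Standalone", version=42)  # step 2: pin to standalone
set_desired_state("eni-1", "Active", version=43)      # step 3: unpin / recover
```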
+ +### 8.5. ENI migration + +There are a few cases that require us to move an ENI from one DPU to another. For example: + +* When doing upgrades, we might unfortunately end up upgrading both DPUs for the same ENI at the same time. +* Moving an ENI due to capacity reasons. +* When the upstream service loses connection to one side of the switches and would like to establish a new pair somewhere else. + +#### 8.5.1. Card-level ENI migration + +ENI migration is limited by ENI pair placement – with card-level ENI pair placement, we can only migrate all ENIs on the card together. + +ENI migration can be done in 3 steps: + +1. Remove all the ENIs on the DPU that we don’t want anymore (see "[ENI removal](#55-eni-removal)"), which will update the traffic forwarding rules and move their peers to the Standalone state. +2. Update the standalone ENIs with the latest peer info, so they will start connecting to the new peer, which doesn’t exist yet. +3. Create all the removed ENIs on the new DPU with the peer info pointing to the standalone ENIs. This will make them form a new pair again (see "[ENI creation](#53-eni-creation)") and get the new traffic forwarding rules set up too. + +#### 8.5.2. Migrating active ENI + +Migrating an active ENI directly will cause data path impact and should be avoided as much as possible. Please use the "[Planned Switchover](#82-planned-switchover)" workflow to move the active away before removing it. + +#### 8.5.3. Migrating standalone ENI + +If an ENI is running in the standalone state, it means something unplanned is already happening; removing the ENI will definitely cause flow loss and data path impact. Hence, by default, this operation should be rejected. + +If we really want to do it, e.g. as a manual operation for live-site mitigation, we shall start the workflow with an explicit force flag (in the upstream service), so we can avoid unexpected requests. + +#### 8.5.4. Migrating ENI when upstream service cannot reach one side of switches + +This action should be rejected. Because each ENI is programmed independently, if we cannot reach one side of the switches, we won't be able to remove the ENI on that side, which leads to an ENI leak. + +#### 8.5.5. Moving towards ENI-level ENI migration + +In certain cases, we will have to migrate some ENIs from one DPU to another, for example when the traffic going to a single HA set becomes way too high and we have to move some ENIs to another HA set to balance the traffic. However, this is not supported with today's design. + +Essentially, this requires us to use ENI-level pair placement, which will require ENI-level bulk sync to be supported. + +## 9. Unplanned events + +In unplanned events, network interruption will be unavoidable, and we are aiming to react and recover within 2 seconds (see the "[Requirements, Assumption and SLA](#3-requirements-assumptions-and-sla)" section). + +All unplanned events **MUST** be monitored, and alerts must be fired when they last for a long time. The auto mitigations we are applying here are meant to minimize the impact of unplanned events on a best-effort basis, but they might not be perfect. Also, they will not solve the underlying problem, so in these events, we need manual confirmation and interventions anyway. + +### 9.1. Unplanned network failure + +Unplanned network failures can happen at any moment. This type of event is usually caused by link flaps or wrong configuration causing traffic to be blackholed for a certain amount of time. + +From our data path setup, unplanned network failures can happen in several different ways: + +1. Upstream service channel failure.
+2. HA control plane control channel failure. +3. HA control plane data channel failure. +4. Data plane channel failure. + +Hence, we need solutions to handle all of these types of failures. + +> When a network failure happens, it can be a gray failure, which means certain paths might be blackholing traffic, but not all of them. The main goal of HA is not detecting and handling gray failures as gray failures; those should be monitored and handled in a generic way. So when a failure happens and is bad enough to kill our probe, we will consider it a complete failure. + +#### 9.1.1. Upstream service channel failure + +When this happens, our upstream service might not be able to talk to one side or both sides of the switches. + +In our HA design, although our upstream service controls the desired state for HA, the HA state machine transitions are all driven by the HA control plane within our switches. Hence, when the upstream service is lost, we can still react to any type of unplanned failure on our own. This minimizes the impact of upstream service failures. + +This is very important; otherwise, whenever a failure happened and killed the upstream service, all switches would be stuck in their last state and could not recover from it. The only thing we could do would be firing alerts and waiting for manual mitigation. + +More discussions on each case are placed below. + +##### 9.1.1.1. One side switch is not reachable + +If only one side switch is not reachable, we won’t be able to send the latest policy to it. This will eventually build up enough difference between the latest policy and the one running on the switch, and cause a live site to happen. + +Whenever our upstream service detects this issue, we can [pin the reachable side to standalone](#84-eni-level-dpu-isolation--standalone-pinning) as mitigation. And since the policies don't match on both sides, we assume the reachable side has the newer policy, hence flow resimulation should be triggered to make sure the flow actions are up to date. + +##### 9.1.1.2. Both side switches are not reachable + +Since switch HA management is decoupled from the upstream service, the switches will automatically move to the standalone state if they cannot talk to each other. Hence, the data path will continue to work, although policies cannot be updated anymore. + +The good news is that the policies on both sides will not be updated in this case, hence if we are lucky and both sides run the same set of policies, we will at least stay consistent to react to any other type of failure. + +If this lasts for a long time, fire an alert, so we can see how to get the connectivity back. + +#### 9.1.2. HA control plane control channel failure + +If the control channel is down, the HA pair will not be able to share states or properly drive the state machine on the peer. + +Whenever this happens, it can be caused by 2 reasons: a network failure, or the peer NPU/DPU going down. Here we only cover the network failure case, with the steps listed below: + +* Keep the state as it is without change, because all intermediate states in planned events can be considered working states due to the 0 loss requirement. +* Recover the control channel as described in the "[HA control plane channel data path HA](#5223-ha-control-plane-channel-data-path-ha)" section above. +* Once recovered, we move on with the state machine transitions.
+ +The case of the DPU going down is covered by the "[Unplanned DPU failure](#92-unplanned-dpu-failure)" section, and entire switch failure is covered by the "[Unplanned NPU failure](#93-unplanned-npu-failure)" section. + +#### 9.1.3. HA control plane data channel failure + +If the data channel is down, the 2 DPUs will not be able to run bulk sync. + +Since bulk sync is fairly rare, we can delay the reaction: + +1. Keep retrying to connect to the other side without driving the HA states, until the control channel also goes down or we somehow move to the standalone state. +2. When we want to drive the HA state out of the standalone state, we can check whether the data channel is alive. If not, we consider ourselves not ready to exit standalone, and postpone any HA transition until it recovers. + +This also needs to be monitored. If it cannot recover by rotating the ports, something weird must be happening, and we should fire alerts for it. + +#### 9.1.4. Data plane channel failure + +##### 9.1.4.1. Card level NPU-to-DPU probe failure + +The card level NPU-to-DPU probe only controls which DPU PA the NPU will use to forward the traffic. We will not use it to trigger HA state machine updates. + +This probe is mostly used for determining whether the NPU (no matter if it owns the DPU) can reach the wanted DPU directly. If the DPU cannot be reached, the NPU will send traffic to the other one, which will tunnel the traffic to the right one again. + +The detailed mechanism is described in the "[Card level NPU-to-DPU liveness probe](#61-card-level-npu-to-dpu-liveness-probe)" and "[All probe states altogether](#65-all-probe-states-altogether)" sections. The detailed data path is described in the "[NPU-to-DPU traffic forwarding](#421-eni-level-npu-to-dpu-traffic-forwarding)" section. + +##### 9.1.4.2. Data plane gray failure + +When a data plane channel failure occurs, it is very important to avoid turning a gray failure into a 100% failure. See "[DPU-to-DPU Data Plane Channel](#4352-dpu-to-dpu-data-plane-channel)" for more information. + +Furthermore, we can monitor 2 things (see the sketch below): + +1. How many packets are lost in the data plane, e.g. the flow replication request ack miss rate: 1 - ack packet count / (request packet count + ack packet count). +2. How large is the DPU-to-DPU data plane probe failure rate? + +If these 2 rates are larger than a certain threshold, we can start to [drive the DPU to standalone setup](#101-working-as-standalone-setup), avoiding unnecessary packet drops.
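For concreteness, here is the ack miss rate computation above as a small Python helper. The threshold value and the decision function are illustrative assumptions, not values from this design.

```python
def ack_miss_rate(request_pkts: int, ack_pkts: int) -> float:
    """1 - ack / (request + ack): the share of replication requests never acked."""
    total = request_pkts + ack_pkts
    if total == 0:
        return 0.0
    return 1.0 - ack_pkts / total

FAILURE_RATE_THRESHOLD = 0.05  # illustrative

def should_enter_standalone(request_pkts: int, ack_pkts: int,
                            probe_failure_rate: float) -> bool:
    """True when either monitored rate crosses the threshold."""
    return (ack_miss_rate(request_pkts, ack_pkts) > FAILURE_RATE_THRESHOLD
            or probe_failure_rate > FAILURE_RATE_THRESHOLD)
```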
+ +### 9.2. Unplanned DPU failure + +Unplanned DPU failures, such as shutdowns, crashes or even power failures, can happen on one of the DPUs or switches at any time. In this case, all states could be lost, and things might start from clean. + +Same as for other unplanned events, network interruption will be unavoidable, and we are aiming to react and recover within 2 seconds (see the "[Requirements, Assumption and SLA](#3-requirements-assumptions-and-sla)" section). + +#### 9.2.1. Syncd crash + +When syncd on the DPU crashes, from the HA perspective, the impact would be: + +* `hamgrd` will fail to talk to syncd. If it lasts long, it will look like a DPU hardware failure or a PCIe bus failure. +* `hamgrd` cannot invoke SAI APIs to initiate bulk sync, or receive SAI notifications for bulk sync being done. + +However, when this happens, we will ***NOT*** drive the HA pair into a standalone setup; instead, we keep the HA pair running as is and wait for syncd to recover. + +When syncd recovers, all SAI objects will be redriven by syncd, which recovers all the ENIs we created and the DASH-SAI objects we programmed, such as the bulk sync sessions, HA state machine states, etc. During the syncd crash, no new policies can be programmed, but this should not cause any flows to be dropped. + +To differentiate this from the cases of the DPU having a hardware failure or the PCIe bus going down, we will check the signals from pmon, such as the DPU state, PCIe status or other hardware monitor statuses. + +#### 9.2.2. DPU hardware failure + +When a DPU dies, it could kill both active and standby ENIs. In both cases, the network interruption will still be unavoidable: + +* When standby ENIs die, although they don’t directly serve the traffic, some flow replication packets will still be lost, and there is no way to get them back (we don’t store packets on the active side). +* When active ENIs die, some packets will be dropped for sure, until the traffic is shifted to the other side. + +To detect this issue, we will use 2 ways: + +1. For the local DPU, the NPU will receive DPU status changes as critical events from pmon. +2. For the peer DPU, we will receive a SAI notification about the peer being lost, and we can also use SAI APIs to check the DPU-to-DPU probe state or data path availability to confirm. + +Once detected, we will start to [drive the HA pair into standalone setup](#101-working-as-standalone-setup). The recovery path in this case will be the same as launch. Please see the "[Launch](#81-launch)" section for more details. + +### 9.3. Unplanned NPU failure + +Besides failures on the DPU, the NPU could also run into certain problems. + +#### 9.3.1. hamgrd crash + +Whenever `hamgrd` crashes, 2 things will happen: + +1. HA state machine transitions will stop. +2. Both card-level probe state updates and ENI level traffic control state updates will stop, which will stop us from shifting traffic from one DPU to another during HA state transitions. + +Problem 1 will be handled by the mechanism we described in the "[HA state persistence and rehydration](#74-ha-state-persistence-and-rehydration)" section. We will focus on problem 2 here. + +To solve problem 2, we have 2 mechanisms: + +1. **DPU packet tunneling**: Once an ENI is connected to its peer and knows the peer is also ready, every HA state is defined to either handle the traffic directly or tunnel the traffic to its peer via the NPU-to-DPU traffic forwarding tunnel. If a packet happens to be sent to the wrong side, it will be forwarded to the right side by the DPU. +2. **DP VIP range BGP withdrawal**: We will also withdraw all DP VIP routes, so the traffic shifts to other switches and gets forwarded to the right side directly. + +With these 2 mechanisms, as long as we have 1 switch working, the traffic will be forwarded correctly. + +For the detailed data path, please see the "[Standby to active DPU tunnel](#4351-standby-to-active-dpu-tunnel)" section. + +#### 9.3.2. Switch power down or kernel crash + +When a switch completely goes down, all DPUs on this switch go down with it. This will show up and be handled in the same way as a "[DPU hardware failure](#922-dpu-hardware-failure)". The good side will receive the peer lost SAI notification, which drives the HA set into a standalone setup. + +#### 9.3.3. Back panel port failure + +This could be caused by a hardware problem, or by a configuration problem that kills the port. Today, this is one of the few SPOFs in the current setup, since we only have 1 port connected to each DPU.
+ +Whenever this failure happens: + +* Both the HA data channel and the data plane channel will stop working, as they all go through the back panel ports. This will cause the peer lost SAI notification to happen. +* The HA control channel will still be working, and `hamgrd` can still talk to syncd. +* Ports/serdes will be monitored by pmon on the DPU side. Depending on how the monitor is implemented, we will see this failure soon and use it as a signal to react on. + +To mitigate this issue, we can: + +1. Trigger alerts, so we can be aware of this issue and RMA the switch. +2. Switch over all active ENIs to the other side, then [pin them to Standalone state](#84-eni-level-dpu-isolation--standalone-pinning). + +#### 9.3.4. Unplanned PCIe failure + +When PCIe fails, we will not be able to talk to the local DPUs from the NPU. This will be detected by pmon. + +PCIe should be really hard to fail, and whenever it fails, something serious could be happening on the hardware. So, to mitigate this, we will treat it as a DPU hard down and force the DPU to be powered off, then handle it as a [DPU hardware failure](#922-dpu-hardware-failure). + +### 9.4. Summary + +First of all, please note that all unplanned events will be monitored, and alerts will be fired if an event lasts for a certain time. This is not repeated in the mitigations below, but we should consider it always included. + +| Category | Problem | Detected By | Resolve Signal | Mitigation | +| --- | --- | --- | --- | --- | +| Upstream service failure | | | | | +| | Upstream service cannot talk to one side of switches | Upstream service | Upstream service can talk to both sides of switches | [Upstream pins the reachable side to standalone](#84-eni-level-dpu-isolation--standalone-pinning) | +| | Upstream service cannot talk to both sides of switches | Upstream service | Upstream service can talk to both sides of switches | No-op | +| HA control plane channel failure | | | | | +| | HA control plane control channel failure | hamgrd | HA control plane control channel recovers | Spray network to re-establish the data path | +| | HA control plane data channel failure | swbusd | HA control plane data channel recovers | Rotate source port to re-establish the data path (w/o spray) | +| Data plane channel failure | | | | | +| | High data plane packet drop rate | DPU counter updates | Data plane packet drop rate recovers | [Drive to standalone setup](#101-working-as-standalone-setup) | +| DPU failure | | | | | +| | syncd crash | syncd container supervisor | syncd recovers | No-op, but syncd will hard reinit all SAI objects in the ASIC, including ENI, HA session, etc., which resets the states in hamgrd as well | +| | DPU hardware failure | **Local**: DPU state change by pmon<br><br>**Peer**: Peer state update to Dead by hamgrd | **Local**: DPU state change by pmon<br><br>**Peer**: Peer state update to Connected | [Drive to standalone setup](#101-working-as-standalone-setup) | +| NPU failure | | | | | +| | hamgrd crash | ha container supervisor | hamgrd recovers | Withdraw SmartSwitch BGP routes, until hamgrd recovers | +| | Switch power down or kernel crash | PeerLost SAI notification on the peer switch | PeerConnected SAI notification | [Drive to standalone setup](#101-working-as-standalone-setup) | +| | Back panel port failure | PeerLost SAI notification on the peer switch | PeerConnected SAI notification | [Drive to standalone setup](#101-working-as-standalone-setup) | +| | PCIe failure | pmon | PCIe recovers | Treat as hard down and power cycle the DPU from NPU | + +## 10. Unplanned operations + +### 10.1. Working as standalone setup + +The standalone setup is used whenever we have only one side's DPU running, or we detect a large data path impact. Once we move a DPU into the standalone state, it will stop flow replication, which reduces the chance of packets landing on bad links and getting dropped. + +#### 10.1.1. Design considerations + +The standalone setup is designed with the following considerations and principles: + +1. An ENI ***MUST*** only be served by a single DPU, just like in the Active-Standby setup. + * If both DPUs ran as a Standalone-Standalone pair, the incoming and outgoing traffic might land on 2 different DPUs. Since both DPUs can make flow decisions, we would create unexpected flows and cause packets to be dropped. +2. The standalone setup is a best-effort mitigation. When the network is dropping packets, no matter what we do, we will never be able to avoid the impact perfectly. We can only try to reduce the network hops to reduce the chance of packet drops. +3. Automated actions ***MUST*** be safe, otherwise they will cause more damage than good. If something cannot be handled automatically, we will fire alerts and wait for manual mitigation. +4. If we are running in a standalone setup for a long time, we shall raise alerts, because it means something is going wrong and we cannot recover. +5. Driving into a standalone setup requires the 2 DPUs to work together. This communication is still done via the HA control channel, because [as long as the peer DPU is running fine, we can always get a working control channel within a bounded time](#5223-ha-control-plane-channel-data-path-ha). + +Because of these, the standalone setup is designed to be a Standalone-Standby pair. + +##### 10.1.1.1. Standalone setup vs bulk sync + +Once the HA pair starts to run as a standalone setup, the inline sync will stop working, and their saved flows will start to diverge: + +1. New flows can be created on one side, but not the other. +2. Existing flows can be terminated on one side, but not the other. +3. Existing flows can be aged out on one side, but not the other, depending on how we manage the lifetime of the flows. +4. Due to policy updates, the same flow might get different packet transformations now, e.g. flow resimulation or flow recreation after a policy update. + +During recovery, we need to merge these 2 sets of flows back into one using "[bulk sync](#115-bulk-sync)". + +##### 10.1.1.2. Card-level vs ENI-level standalone setup + +There are 2 ways to implement the standalone setup: card-level and ENI-level. + +* Card-level standalone setup: All ENIs on the paired cards will fail over to one side and go into the `Standalone` state, so only one card can have ENIs running in the `Standalone` state.
+* ENI-level standalone setup: Each ENI can fail over by itself and go into the `Standalone` state, so both cards can have ENIs running in `Standalone` state.
+
+Obviously, card-level standalone setup is easier to implement, but any ENI-level operation that involves the standalone state, such as planned shutdown, will cause all ENIs to fail over to one side of the card, even if only one ENI is having a problem. On the other hand, ENI-level standalone setup is much more flexible and safe, but also harder to implement.
+
+However, which one is supported depends on the bulk sync ability. If ENI-level bulk sync is supported, we can drive the HA pair into standalone setup at the ENI level. Otherwise, we can only do it at the card level.
+
+More details on bulk sync implementation can be found later in the "[Bulk sync](#115-bulk-sync)" section.
+
+#### 10.1.2. Workflow triggers
+
+The standalone setup can be triggered by the following types of signals. Each signal will work as a different type of pinning. Upon request (or any valid signal we consider), we start to drive the HA pair into standalone setup. And we will not come out of standalone setup unless all signals are reset.
+
+| Problem | Trigger | Resolve signal |
+| --- | --- | --- |
+| Peer shutdown | Planned shutdown request | HAStateChanged with Connected state |
+| Peer DPU lost | Peer lost SAI notification | Peer connected SAI notification |
+| Peer DPU dead | HAStateChanged with dead peer | HAStateChanged with non-dead peer |
+| High data plane packet drop rate | ENI-level data plane counters | ENI-level data plane counters |
+| Manual Pinned | ENI-level DPU isolation | Isolation removed |
+| Card pinned to standalone | Card pinned to standalone | Pinning removed |
+
+#### 10.1.3. Determine desired standalone setup
+
+The key to driving the HA pair into standalone setup is to determine which side should be the standalone. The steps are different for card-level and ENI-level standalone setup.
+
+##### 10.1.3.1. Determine card-level standalone setup
+
+Since many signals we are detecting are on the ENI level, we need to merge them into the card level first. If any trigger is hit, we will start the process as below.
+
+First, we need to check the DPU health signals:
+
+1. If the card is already pinned to standalone, we drive ourselves to standalone (no-op).
+2. If the signals have "Peer DPU lost" or "Peer DPU dead", we drive ourselves to standalone.
+   * This covers the cases where the paired card or entire switch is dead, as well as manual operations.
+
+At this moment, both DPUs should be running fine, so we start to check the ENI status and data path status. To ensure we have the latest state, we will send the `DPURequestEnterStandalone` message with aggregated signals and ENI states to the peer DPU. And upon receiving the message, we will run the following checks:
+
+1. If the signals from the local DPU have "Card pinned to standalone", we return `Deny` to the peer DPU.
+2. Check "Manual Pinned" for manual operations:
+   1. If the signals from the local DPU have "Manual Pinned", which the peer DPU doesn't have, we return `Deny` to the peer DPU.
+   2. If the signals from the peer DPU have "Manual Pinned", which the local DPU doesn't have, we return `Allow` to the peer DPU.
+   3. If both sides have "Manual Pinned", we return `Deny` and raise an alert after retrying a certain number of times.
+3. Check "Peer shutdown" for planned shutdown:
+   1. If the signals from the local DPU have "Peer shutdown", we return `Deny` to the peer DPU.
+   2. If the signals from the peer DPU have "Peer shutdown", we return `Allow` to the peer DPU.
+   3. If both sides have "Peer shutdown", we return `Deny` and raise an alert after retrying a certain number of times.
+4. If PreferredStandalone is set to the local DPU, we return `Deny` to the peer DPU, otherwise, we return `Allow`.
+   * This covers the last case - high data plane packet drop rate.
+   * The reason we use a predefined side here is that there is no perfect solution in this case, and both sides can send requests at the same time. If we use information such as packet drop rate, we might get inconsistent data and make conflicting decisions. So, instead of being too smart, we prefer to be stable.
+
+Once we have determined which DPU should be the standalone, we notify all ENIs on the DPU with the "Card pinned to standalone" signal, so they can start to drive themselves to standalone.
+
+##### 10.1.3.2. Determine ENI-level standalone setup
+
+If ENI-level standalone is supported, each ENI decides on its own with the following steps:
+
+1. If the ENI is already pinned to standalone, we drive ourselves to standalone (no-op).
+2. If the signals have "Manual Pinned", we drive ourselves to standalone.
+3. If the signals have "Peer DPU lost" or "Peer DPU dead", we drive ourselves to standalone.
+   * This should cover the DPU or switch level failure.
+4. Use primary election (maybe w/ local cached data) and use the primary as the preferred standalone.
+   * This covers cases where both DPUs and services are running fine, but the network is having a gray failure and killing flow replication packets.
+
+When doing these tests, we won't change the HA state. Remaining stable and avoiding bad moves are very important when we don't have enough confidence. Worse than doing nothing would be breaking the current steady state and causing more damage.
+
+#### 10.1.4. Entering standalone setup
+
+Once we have decided who should be the standalone, we will drive the HA pair into a Standalone-Standby pair. And no matter whether we are using card-level or ENI-level standalone setup, each ENI will drive itself to standalone state independently.
+
+##### 10.1.4.1. When peer is up
+
+This is usually caused by data plane gray failure. The detailed steps are listed below:
+
+1. Firstly, packet loss will be detected by our counters on the active side (DPU0).
+
+*Figure: Entering standalone setup with peer up step 1*
+
+2. After checking the conditions in step 1 above, the active side will start to move to Standalone. Also, since the inline sync channel is shutting down, we will move to the next Term.
+
+*Figure: Entering standalone setup with peer up step 2*
+
+   Since the other side can be in a state other than `Dead`, `Destroying` and `Standby`, e.g. `Active`, the HAStateChanged will be sent to the peer (DPU1), which will force the peer into `Standby` state.
+
+*Figure: Entering standalone setup with peer up step 3*
+
+##### 10.1.4.2. When peer is down
+
+When the peer DPU is dead, the other DPU will start to drive itself to standalone, no matter which state it is in.
+
+1. On the dead DPU side, we will detect that the DPU went down via a DPU critical event from pmon.
+
+*Figure: Entering standalone setup with peer down step 1*
+
+2. On the peer DPU side, we will receive 2 notifications:
+
+   1. HAStateChanged indicating ENIs on DPU0 are going to Dead state.
+   2. SAI notification saying the peer DPU is 100% unreachable, indicating peer lost.
+
+   On these 2 signals, we can run the checks in step 1, which will eventually move ourselves to Standalone state and update the next hop for this ENI to DPU1. And per our term definition, this will move DPU1 to the next term.
+
+*Figure: Entering standalone setup with peer down step 2*
+
+3. Later, when DPU0 launches, it will join back the HA set as already discussed in the "[Launch with standalone peer](#812-launch-with-standalone-peer)" section, which will cause all existing flows to be replicated from DPU1 to DPU0 and move into a new Active-Standby pair.
+
+#### 10.1.5. DPU restart in standalone setup
+
+When a DPU is running in standalone state, its flow replication will be stopped. Hence, if this DPU dies somehow, we will not be able to recover the new flows.
+
+However, we are not expecting the standalone state to last long (if it does, alerts will fire), hence, we are not going to do anything special to handle this case.
+
+#### 10.1.6. Recovery from standalone setup
+
+After all problems that we detected are solved, we will start moving the standalone setup back to an active-standby pair automatically, and sync the flows to the standby side with bulk sync.
+
+The detailed steps are similar to "[Launch with standalone peer](#812-launch-with-standalone-peer)", as below:
+
+1. When we see all problems we detected are resolved on the standalone node (DPU0), we start to drive ourselves out of standalone to active.
+
+*Figure: Recovery from standalone setup step 1*
+
+2. The standalone (DPU0) will first try to make sure the standby side is truly ready, by sending an `ExitStandalone` message to the standby (DPU1). This will move DPU1 to `InitializingToStandby` state.
+
+*Figure: Recovery from standalone setup step 2*
+
+3. Now, we are back to the same track as "[Launch with standalone peer](#812-launch-with-standalone-peer)". DPU1 will send `HAStateChanged` back to DPU0, which drives DPU0 to active and initiates bulk sync.
+
+*Figure: Recovery from standalone setup step 3*
+
+4. When bulk sync is done, we are back to our steady state: Active-Standby pair.
+
+*Figure: Recovery from standalone setup step 4*
+
+During the whole process, probe states are never changed, hence the transition should be seamless.
+
+#### 10.1.7. Force exit standalone (WIP)
+
+### 10.2. Force shutdown (WIP)
+
+## 11. Flow tracking and replication (Steady State)
+
+For deterministically applying the actions to the network packets of each connection as well as offloading the actions to hardware, we will create a session for every connection (3 tuple or 5 tuple depending on scenarios). Although sessions can be implemented in different ways, conceptually, each session will always have a few characteristics:
+
+* When a session is created, we will create a pair of flows – a forwarding flow and a reverse flow.
+* Each flow will have 2 parts: a match condition for matching the network packets, and flow actions to specify what transformations we need to apply to every packet of this connection.
+* The forwarding flow matches the initial direction packet, e.g. syn, while the reverse flow matches the return packets, e.g. syn-ack.
+* Each session will also maintain other metadata, such as epoch, for supporting idle timeout or flow re-simulation.
+
+Once a session is created, it can be offloaded to hardware for packet processing, which essentially runs the match and actions in its flows.
+
+However, when a node is down, all flows on it will be lost, which is unacceptable in a real-life environment. To ease the problem, flows need to be replicated across the HA set for each ENI. This section covers the high-level approach and requirements for flow tracking and replication, which is not bound to any vendor's detailed implementation.
+
+### 11.1. Flow lifetime management
+
+With flows being replicated, we need to make sure the lifetimes of the flows are properly managed. Essentially, in our HA set setup, the lifetime of all flows is always controlled by the active node:
+
+* When a new connection is created or an existing connection is closed or re-simulated, the flow change ***MUST*** be replicated inline to the standby side.
+* When a connection is aged out or terminated by other non-packet triggered reasons, the active side ***MUST*** send a notification to the standby side to close the flows.
+* The standby side ***MUST*** never age flows by itself, because it doesn't handle any traffic, so only the active side knows the real flow TTL.
+* When a node is rejoining the HA set, the active side ***MUST*** use [bulk sync](#115-bulk-sync) to send all existing flows to the standby side.
+* [TCP sequence numbers should be sync'ed in bulk during the flow age out process](#11442-bulk-tcp-seq-syncing-during-flow-aging) for idle flows, so we can have RST on flow age out working properly.
+
+### 11.2. Flow replication data path overview
+
+Currently, flows can be replicated in 2 different ways: data plane sync and control plane sync, such as bulk sync.
+
+#### 11.2.1. Data plane sync channel data path
+
+Whenever a flow is created or destroyed due to a network packet arriving, inline flow replication will be triggered.
+
+This is done by using the [data plane channel](#4352-dpu-to-dpu-data-plane-channel) defined above. And since different vendors may have different implementations of flows, the metadata used for replicating the flows will be kept in a vendor-defined format.
+
+*Figure: Flow replication data plane sync data path*
+
+Since the metadata format is vendor defined, it is particularly important to make the format backward compatible. Otherwise, there is no way to upgrade the device anymore, since we would need to upgrade both sides at the same time, which would cause all flows to be lost.
+
+Data plane sync can also be DPU triggered instead of packet triggered, for example, flow aging / idle timeout.
+
+#### 11.2.2. Control plane sync channel data path
+
+Besides data plane sync, flow sync can also happen during bulk operations, such as bulk sync.
+
+This is done by using the [HA control plane data channel](#5222-ha-control-plane-data-channel), which runs from syncd on one side of the DPU to the syncd on its peer DPU.
+
+Unlike the data plane sync data path, which is usually triggered by a network packet and handled directly within the data plane (under SAI), control plane sync requires the data to be sent to SONiC via SAI as a notification event; SONiC will then send it to the other side and program it back into the ASIC by calling a SAI API.
+
+### 11.3. Flow creation and initial sync
+
+Flows can be created in multiple cases when we don't have any existing flow:
+
+1. When the first packet for a connection (or specific 3 or 5 tuple) arrives:
+   1. When a TCP syn packet arrives.
+   2. When the first packet of any other connection-less protocol, such as UDP, arrives.
+2. When a non-first packet for a connection arrives, but there is no flow created for it:
+   1. When a TCP non-syn packet arrives.
+   2. In the middle of any other protocol traffic.
+
+Case 1 is straightforward to understand – whenever we receive a new connection, we need to create a flow for it. However, case 2 might not be, and we will discuss it later in section "[Flow recreation for non-first packet](#1132-case-2-flow-recreation-for-non-first-packet)".
+
+#### 11.3.1. Case 1: Flow creation for first packet and initial sync
+
+Whenever a new flow is created, for HA purposes, we need to sync the flow across all the DPUs in the HA set for this ENI. This is called initial sync.
+
+Initial sync will always be done inline with the very first packet, using the [DPU-to-DPU data plane channel](#4352-dpu-to-dpu-data-plane-channel).
+
+#### 11.3.2. Case 2: Flow recreation for non-first packet
+
+This can happen in the cases below:
+
+1. Planned or unplanned events cause certain flows not to be fully replicated over the network, specifically for connection-less protocols, like UDP. For example:
+   * During switchover, the first UDP packet goes to DPU 1 while the second packet right behind it goes to DPU 2.
+   * In standalone setup, the standalone side crashed, and we have to shift to the standby side. However, the new flows are not replicated over.
+2. Flow being dropped:
+   * Usually, this is caused by a flow aging out due to a misconfigured TTL or missing keep-alive.
+   * Flow conflicts when doing flow merge after recovering from standalone setup can somehow cause a flow to be dropped.
+
+This requires us to have the ability to recreate the flow whenever it is needed.
+
+##### 11.3.2.1. Challenge
+
+The problem is the asymmetrical policy on the 2 traffic directions.
+
+A typical case is network ACL. A customer can create a VM with an ACL set to "only allow outbound connections while blocking all inbound connections", which will work as a per-connection ACL. Say we created a connection from A to B:
+
+* For the initial packet, the packet will flow from A to B, so we will hit the outgoing side ACL, which allows the packet to go out and creates the flow.
+  When the return packet comes back from B, the packet will hit the flow instead of going through the ACLs again, hence it will be allowed.
+* But, if we lose the flow in the middle of the traffic, the return packet will be dropped, because it will match the incoming side ACLs and hit the drop policy.
+
+Furthermore, dropping the packet is not the worst case; if we somehow apply a wrong packet transformation and send the packet out, it will be even worse.
+
+##### 11.3.2.2. Solutions
+
+###### 11.3.2.2.1. Missing connection-less flow
+
+For UDP or other connection-less protocol packets, we cannot tell if it is the first packet anyway, so the policy should be set up in a way that always works. If a policy is changed, e.g., an ACL, it will apply to all flows anyway, hence recreating the flows is acceptable.
+
+###### 11.3.2.2.2. Missing connection flows
+
+In this case, we should help notify the packet source to terminate the connection. Hence, for TCP, we should drop the packet and respond with an RST packet when the packet is denied.
+
+For TCP packets, this missing-replication issue should not happen during connection creation, as the syn packet will not be forwarded out before flow replication completes.
+
+For certain connection-ish cases like CONENAT, there is not much we can do here, because there is no way to close the "connection". Also, we cannot allow the flow to be created, because that would totally defeat the purpose of CONENAT.
+
+### 11.4. Flow destroy
+
+When a connection is closed, the corresponding session needs to be destroyed. This includes 3 cases: explicit connection close, flow age out, and upon request.
+
+#### 11.4.1. Flow destroy on explicit connection close
+
+Explicit connection close happens when we receive the FIN/RST packet in a TCP connection. Upon those packets, we need to destroy the flow and replicate the change to the other instances in the HA set. But the way to handle these 2 types of packets is slightly different.
+
+##### 11.4.1.1. Graceful connection termination with FIN packets
+
+The FIN packets need to be handled as follows:
+
+1. When FIN arrives, we need to mark the flow for this direction as closing, but the flow shall continue to work.
+2. When FIN-ACK arrives from the other side, we need to mark the flow for the original direction as closed. And same as 1, the flow shall still remain working, because we need it to deliver the FIN-ACK packet to the other side later.
+3. When both flows are marked as closed, we can destroy the flow.
+
+Since the termination handshake changes the TCP state machine, all 4 packets need to trigger the inline flow replication logic to make sure the connection state in standby nodes is up-to-date. Please see [data plane channel](#4352-dpu-to-dpu-data-plane-channel) for more details.
+
+##### 11.4.1.2. Abrupt connection termination with RST packets
+
+In TCP, an RST packet will force the connection to be closed in both directions, and there is no ACK packet for an RST packet either. Hence, when RST arrives, we can destroy the flow in both directions after forwarding the packet.
+
+Since there is only 1 packet in this case, the flow state needs to be sync'ed to the other nodes in-band with the RST packet, using the same [data plane channel](#4352-dpu-to-dpu-data-plane-channel) as above.
+
+#### 11.4.2. Flow destroy on flow aged out
+
+Every session, even a connection-less one like UDP, will have a property called "idle timeout" or "TTL". Whenever a session is not matched by any packet (idle) for a certain period of time, we can consider the connection closed and destroy the flow.
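+
+To make the aging rule concrete, here is a minimal Python sketch of active-side idle aging, combining it with the lifetime-management requirement from section 11.1 (the active side notifies the standby; the standby never ages flows by itself). All names (`flow_table`, `notify_standby`, `last_hit`) are hypothetical illustrations, not the vendor data plane implementation:
+
+```python
+import time
+
+IDLE_TIMEOUT_S = 60.0  # per-session "idle timeout" / TTL (example value)
+
+def age_out_idle_flows(flow_table: dict, notify_standby) -> None:
+    """Run on the active side only; the standby never ages flows itself."""
+    now = time.monotonic()
+    for key, flow in list(flow_table.items()):
+        if now - flow["last_hit"] > IDLE_TIMEOUT_S:
+            del flow_table[key]                 # destroy the local flow...
+            notify_standby("destroy", key)      # ...and tell the standby to close it too
+```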
+
+With the HA setup, a flow will exist in both the active and standby nodes, so we need to ensure their lifetimes stay the same, otherwise flow leaks will happen.
+
+Please see the "[Flow lifetime management](#111-flow-lifetime-management)" section for more details.
+
+#### 11.4.3. Flow destroy on request
+
+Upon request, flows that match certain criteria can be destroyed on demand. This is required for supporting multiple cases:
+
+1. Connection draining ([AWS](https://aws.amazon.com/cn/blogs/aws/elb-connection-draining-remove-instances-from-service-with-care/), [GCP](https://cloud.google.com/load-balancing/docs/enabling-connection-draining)). When removing an instance or probing down an endpoint, a grace period, called the draining period, will be given to this instance to allow all existing flows to be closed gracefully. At the end of the draining period, all remaining connections will be forcibly terminated.
+2. VM crash causing all flows to be lost.
+3. [VM service healing](https://azure.microsoft.com/en-us/blog/service-healing-auto-recovery-of-virtual-machines/).
+
+Whenever a flow is destroyed because of this, we also need to send the notification to its peer and destroy the flows in standby, just like flow aging.
+
+This feature is supported by providing flow APIs in SAI for explicitly killing selected flows or flows created by a certain rule. The design of this API is outside of the HA scope, so it won't be covered here.
+
+#### 11.4.4. RST on flow destroy
+
+In certain cases, we are required to send an RST packet to both sides whenever the flow is destroyed for any reason other than explicit connection close, such as flow age out or on request. With this feature, the client will know early that something bad is happening and can retry if it wants.
+
+This is particularly useful for long-running database queries, as the TCP keep-alive is usually set to a very large timeout, e.g. 30s to 15mins. Without this feature, the client will blindly wait for a long time, until it sends the next packet and waits for a timeout.
+
+However, this feature causes certain challenges in the HA design:
+
+1. To send the RST packet, we need to have the right sequence number for every TCP connection. However, we are not going to sync every TCP sequence number update, so we might not have an up-to-date one on the standby side. And after switchover, we might not be able to craft the RST packet.
+
+##### 11.4.4.1. Send RST when denying packets
+
+The first thing we can do is to actively send an RST packet back whenever a packet gets dropped by ACL or for any other reason.
+
+This can cover the case where we don't care that much about terminating the connection on time, because:
+
+1. Per [RFC 793 – Reset Generation](https://www.ietf.org/rfc/rfc793.txt), an RST packet must be sent whenever a segment arrives which apparently is not intended for the current connection. This means after a connection is dead, any packet should trigger an RST being sent back.
+2. In our case, 2 things might happen, and both result in the same behavior:
+   1. A new packet comes and gets dropped, because the flow is not there anymore. We can directly send RST back.
+   2. A new packet comes and somehow recreates a flow with a good or bad packet transformation. The packet will be sent to the destination, which will trigger an RST packet back, because the connection is dead there too.
+
+Some typical scenarios are:
+
+1. Asymmetrical ACL, where outbound connections are allowed while all inbound connections are denied.
+   Once the flow is aged out, the incoming packet will be dropped by the ACL, and at that moment, we can send the RST packet back directly.
+2. Syn-ack attack. Since the flow cannot be found, an RST packet will be sent back directly without forwarding to the VM.
+
+##### 11.4.4.2. Bulk TCP seq syncing during flow aging
+
+Furthermore, we could also consider syncing the TCP sequence number, so we can actively craft the RST packet after failover.
+
+To make it more practical, we can implement the TCP seq syncing as below:
+
+1. When doing flow aging, we can collect the TCP sequence number of each flow and batch sync them over. The data can be sent over via the data plane channel.
+2. We only need to sync the flows that haven't received traffic for certain periods of time.
+   1. The reason is that we don't need to update the flows with active traffic, since their sequence numbers will be updated soon after switchover anyway.
+   2. If we choose flow keep-alive as our flow lifetime management method, this can be inline and bulk, and the seq number can be sync'ed as part of the keep-alive metadata.
+
+##### 11.4.4.3. Flow destroy on standby node
+
+If we are not running in one of the states which can make flow decisions, such as standby, we should never generate RST packets when a flow is destroyed, otherwise the RST packets will be sent twice.
+
+#### 11.4.5. Packet racing on flow destroy
+
+When destroying a flow, we could run into a small race condition problem due to HA being involved. It happens when a flow gets hit while it is terminating.
+
+One example is when a packet arrives while its flow is right in the middle of aging out: the flow replication packet has been sent out but hasn't come back yet. In this case, the flow still exists on the active side and will still be hit.
+
+To simplify this problem, whenever a flow is being destroyed on the active side, we can consider this flow already gone. The main goal of HA is to ensure the flows on the 2 sides match, and this is a very rare case.
+
+Hence, when destroying a flow, we can consider removing the flow from the regular flow table and putting it into a temporary flow storage, only waiting for the flow replication ACK packet to come back. During this time, if any new packet shows up, we can treat it as the first packet for a new flow.
+
+### 11.5. Bulk sync
+
+Bulk sync is used for replicating the flows from one DPU to its peer to make sure the flows match on both DPUs.
+
+Bulk sync is not a cheap operation. When bulk sync is triggered, it will check all the relevant flows, gather all the info that needs to be sync'ed across, then forward it to its peer. This process can usually take 1 minute or so to complete, so we should avoid it as much as possible.
+
+With the current design, if we are running in steady state or doing a planned switchover, inline sync is enough to keep all DPUs in sync. However, there are still some cases (not limited to the following) when bulk sync is needed, such as whenever a DPU joins back to an HA set, e.g. after a fresh restart, an upgrade, recovery from failure/crash, or recovery from standalone setup.
+
+For the data path, bulk sync uses our [HA control plane data channel](#5222-ha-control-plane-data-channel) to deliver the flow info to the other side.
+
+#### 11.5.1. Perfect sync
+
+By default, we can use perfect sync as our flow sync algorithm.
+
+The algorithm is straightforward and already used in the current DASH design. In short, whenever the peer is connected and paired, we shift to the next color and bulk sync all flows that are not the current color.
+
+More details can be found here: .
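+
+As an illustration only, here is a minimal Python sketch of the color-rotation idea described above (the names `FlowTable` and `send_to_peer` are hypothetical, not the DASH/SAI API):
+
+```python
+class FlowTable:
+    """Minimal sketch of perfect sync: flows are stamped with a color,
+    and on (re)pairing we rotate the color and replicate every flow
+    that does not carry the new color yet."""
+
+    def __init__(self):
+        self.current_color = 0
+        self.flows = {}  # flow_key -> color
+
+    def add_flow(self, key):
+        # New flows always get the current color.
+        self.flows[key] = self.current_color
+
+    def perfect_sync(self, send_to_peer):
+        # Shift to the next color, then bulk sync all flows that are not
+        # the current color; flows created while the sync is running
+        # already carry the new color and are skipped.
+        self.current_color += 1
+        for key, color in list(self.flows.items()):
+            if color != self.current_color:
+                send_to_peer(key)
+                self.flows[key] = self.current_color
+```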
+
+#### 11.5.2. Range sync (ENI-level only)
+
+However, perfect sync can be slow, as it needs to enumerate, serialize, and send all existing flows across to the other side. This is not ideal for handling short-time outages, such as the ones caused by a link flap. Hence, we are also proposing a new way to do flow sync here. This new sync approach can be opt-in, as perfect sync is a sub case of it.
+
+The high-level idea is to find and sync only the flows that are not yet sync'ed, so we can reduce the sync time for cases like coming back from a short outage.
+
+The algorithm assumes that no flow can be created or changed on the active side if the inline flow sync channel is broken. Today, this is ensured by delaying the flow operations until the active side receives the flow replication ack packet from standby.
+
+And the key idea is to find the start and end times whenever the DPUs are not in sync, then only sync the flow changes within this range:
+
+* ImpactStart = Whenever the inline flow sync channel breaks:
+  * This means the DPU is in a state that can initiate inline flow sync but its peer enters a state which doesn't respond to flow sync.
+  * There will be a small time period from the actual impact start time to whenever we find the impact start. However, the flows will stay the same on both sides, because no flow can be created/deleted/updated unless the old active receives the flow replication ack.
+* ImpactEnd = Whenever the inline flow sync channel is reestablished:
+  * This means the DPU is in a state that can initiate inline flow sync while its peer also enters a state that responds to flow sync.
+
+Once the impact ends, we need to sync all flow changes during the impact time, which includes:
+
+* New flows created after the impact starts.
+* Flows that were created before the impact started but deleted during the impact period, including explicit close, but not flow age out.
+* Flows that were created before the impact started but updated during the impact period, such as flow resimulation.
+
+Based on this, we can design the algorithm as below:
+
+##### 11.5.2.1. Init phase
+
+Every DPU maintains a set of variables. When an HA scope, such as an ENI, is created, init all of them to 0.
+
+1. `LocalFlowVer`: The flow version of the current node. It is updated in 2 cases:
+   1. Whenever we detect impact starting or ending as defined above, it increases by 1.
+   2. Whenever we go back to steady state (forming an active-standby pair), the standby side updates it to match the active side.
+2. `ToSyncFlowVerMin` / `ToSyncFlowVerMax`: The min and max flow versions of the flows that need to be sync'ed.
+
+##### 11.5.2.2. Flow tracking in steady state
+
+Whenever any flow is created or updated (flow re-simulation), update the flow version to the latest `LocalFlowVer`.
+
+##### 11.5.2.3. Tracking phase
+
+1. When the impact starts, increase `LocalFlowVer` by 1. Then, call the SAI API to pass in the new HA state and `LocalFlowVer`, and start tracking all flows that are deleted and have a lower flow version than `LocalFlowVer`. If `ToSyncFlowVerMin` is 0, set it to `LocalFlowVer`.
+2. When the impact ends, set `ToSyncFlowVerMax` to the current `LocalFlowVer` and then increase `LocalFlowVer` by 1. Then, call the SAI API to pass in the new HA state and `LocalFlowVer`, and stop tracking the existing flow deletions.
+
+##### 11.5.2.4. Syncing phase
+
+1. Call the bulk sync SAI API with the flow version range: [`ToSyncFlowVerMin`, `ToSyncFlowVerMax`].
+   1. During bulk sync, if any flow is deleted, no matter whether it is new or not, we cannot really delete the flow, but only mark it as deleted.
+      This is because existing flows can be deleted during the bulk sync, and really deleting them could cause a flow to be inserted back after it was deleted.
+2. Handle the flow change events from the ASIC and send them over to the peer.
+3. Once a flow change event is received on the peer side, update the flow table accordingly.
+   1. For all existing flows (not only new ones), if the flow doesn't exist on the peer side or has a lower flow version, insert or update it. This covers the new flow and flow re-simulation cases.
+   2. For flow deletions, only delete the flow if the flow exists and the flow version matches.
+4. Handle the bulk sync done event from the ASIC, which will be sent after all flow change events are notified.
+5. Call the bulk sync completed SAI API, so the ASIC can delete all tracked flow deletion records. Also reset `ToSyncFlowVerMin` and `ToSyncFlowVerMax` to 0, because there is nothing to sync anymore.
+
+### 11.6. Flow re-simulation support
+
+When certain policies are updated, we will have to update the existing flows to ensure the latest policy takes effect. This is called "flow re-simulation".
+
+Flow re-simulation is needed in, and not limited to, the following cases:
+
+* ACL updated. Existing connections are now getting denied.
+* CA-PA mapping update for cases like a VM getting moved from one machine to another.
+
+To support flow re-simulation, we can use a few different ways:
+
+* Adding a new attribute on certain DASH objects, say `resimulate_on_update`. Then, whenever the object is updated, we will resimulate all corresponding flows.
+* Providing an API to resimulate all or any specific flows.
+
+Although the approach of adding a flag makes re-simulation simple, it is quite expensive to update all flows, especially whenever the policy changes. To save the cost, the re-simulation should be postponed until the next packet that matches the forwarding flow comes in (direction is important here), using that packet to trigger the policy re-evaluation.
+
+In the HA setup, whenever a flow gets re-simulated, we will need to sync the latest flow states over to the standby side.
+
+## 12. Debuggability
+
+### 12.1. ENI leak detection
+
+In order to avoid an ENI being leaked after unplanned events, such as an ENI being migrated away but not cleaned up properly, each ENI will have a timer to update its state as a heartbeat, which can be used by our upstream service or telemetry for ENI leak detection.
+
+For a more detailed design, please refer to the detailed design doc.
+
+## 13. Detailed Design
+
+Please refer to the detailed design doc for the DB schema, telemetry, SAI API and CLI design.
+
+## 14. Test Plan
+
+Please refer to the HA test docs for detailed test bed setup and test case design.
diff --git a/doc/smart-switch/high-availability/smart-switch-ha-overview-slides.pdf b/doc/smart-switch/high-availability/smart-switch-ha-overview-slides.pdf new file mode 100644 index 00000000000..03ba8e29ba1 Binary files /dev/null and b/doc/smart-switch/high-availability/smart-switch-ha-overview-slides.pdf differ diff --git a/doc/smart-switch/high-availability/smart-switch-ha-overview-slides.pptx b/doc/smart-switch/high-availability/smart-switch-ha-overview-slides.pptx new file mode 100644 index 00000000000..c6aa0758601 Binary files /dev/null and b/doc/smart-switch/high-availability/smart-switch-ha-overview-slides.pptx differ diff --git a/doc/smart-switch/hld/OCP23G-SmartSwitch-Microsoft-final.pptx b/doc/smart-switch/hld/OCP23G-SmartSwitch-Microsoft-final.pptx new file mode 100644 index 00000000000..b287112a0e2 Binary files /dev/null and b/doc/smart-switch/hld/OCP23G-SmartSwitch-Microsoft-final.pptx differ diff --git a/doc/smart-switch/ip-address-assigment/smart-switch-ip-address-assignment.md b/doc/smart-switch/ip-address-assigment/smart-switch-ip-address-assignment.md new file mode 100755 index 00000000000..4e1254fdd57 --- /dev/null +++ b/doc/smart-switch/ip-address-assigment/smart-switch-ip-address-assignment.md @@ -0,0 +1,436 @@ +# Smart Switch IP address assignment # + +## Table of Content ## + +- [Smart Switch IP address assignment](#smart-switch-ip-address-assignment) + - [Table of Content](#table-of-content) + - [Revision](#revision) + - [Scope](#scope) + - [Definitions/Abbreviations](#definitionsabbreviations) + - [Overview](#overview) + - [Requirements](#requirements) + - [IP address assignment requirements](#ip-address-assignment-requirements) + - [Architecture Design](#architecture-design) + - [Device data and PLATFORM](#device-data-and-platform) + - [NPU platform.json](#npu-platformjson) + - [DPU platform.json](#dpu-platformjson) + - [DPU and PCIe interfaces naming convention](#dpu-and-pcie-interfaces-naming-convention) + - [Configuration generation for IP address assignment](#configuration-generation-for-ip-address-assignment) + - [DPU IP address allocation](#dpu-ip-address-allocation) + - [Smart Switch configuration](#smart-switch-configuration) + - [Midplane network configuration flow](#midplane-network-configuration-flow) + - [NPU](#npu) + - [DPU](#dpu) + - [IP assignment flow](#ip-assignment-flow) + - [High-Level Design](#high-level-design) + - [SAI API](#sai-api) + - [Configuration and management](#configuration-and-management) + - [CLI/YANG model Enhancements](#cliyang-model-enhancements) + - [YANG model for the switch side configuration](#yang-model-for-the-switch-side-configuration) + - [DEVICE\_METADATA table](#device_metadata-table) + - [MID\_PLANE\_BRIDGE and DPUS tables](#mid_plane_bridge-and-dpus-tables) + - [Warmboot and Fastboot Design Impact](#warmboot-and-fastboot-design-impact) + - [Memory Consumption](#memory-consumption) + - [Restrictions/Limitations](#restrictionslimitations) + - [Testing Requirements/Design](#testing-requirementsdesign) + - [Unit Test cases](#unit-test-cases) + - [System Test cases](#system-test-cases) + - [Open/Action items - if any](#openaction-items---if-any) + +## Revision ## + +| Rev | Date | Author | Change Description | +| :---: | :---: | :----------------: | -------------------------------------- | +| 0.1 | | Oleksandr Ivantsiv | Initial version. IP address assignment | +| 0.2 | | Ze Gan | Update the services flow | + +## Scope ## + +This document provides a high-level design for Smart Switch IP address assignment flow. 
+
+## Definitions/Abbreviations ##
+
+| Term | Meaning |
+| ---- | ------------------------------------------------------- |
+| NPU | Network Processing Unit |
+| DPU | Data Processing Unit |
+| PCIe | PCI Express (Peripheral Component Interconnect Express) |
+
+## Overview ##
+
+A DASH smart switch is a merging of a datacenter switch and one or more DPUs into an integrated device. The "front-panel" network interfaces of the DPU(s) are wired directly into the switching fabric instead of being presented externally, saving cabling, electronics, space and power. There can also be some consolidation of software stacks; for example, see SONiC Multi-ASIC for how this is accomplished in standard SONiC multi-ASIC devices.
+
+![image](https://github.com/raw/sonic-net/DASH/main/documentation/general/images/hld/dash-high-level-smart-switch.svg)
+
+The interconnection between the control planes in the NPU and DPUs is organized via PCIe interfaces. For each DPU there is a pair of PCIe interfaces. One endpoint is on the NPU side and the other endpoint is on the DPU side. For each PCIe endpoint, there is a netdev representation in the Linux Kernel.
+
+![image](./smart-switch-ip-assignment-pci-interfaces.png)
+
+## Requirements ##
+
+### IP address assignment requirements ###
+
+- Uniform procedure for assigning IP addresses on DPU and switch side interfaces
+- Deterministic behavior
+- Stateless operation of the DPU
+- All logic is kept on the switch side
+- Compatibility with network boot scenarios, like OOB PXE on the DPU
+
+## Architecture Design ##
+
+To implement a uniform procedure of IP address assignment for DPUs in the smart switch, the DHCP server on the switch side shall be used. The DHCP server allows keeping the IP address assignment logic on the switch side and makes the DPU IP address assignment implementation stateless from the DPU point of view. The IP address assignment shall be port-based, which guarantees deterministic behavior: the same DPU shall always receive the same IP address on request. The feature requires different implementations for the switch and DPU platforms. However, the implementation fits into the existing SONiC architecture. The feature shall reuse the existing DHCP server container introduced in the [PR](https://github.com/sonic-net/SONiC/pull/1282). DHCP-based address assignment provides compatibility with the network boot scenarios (like PXE) where static IP address assignment cannot be utilized.
+
+To organize the DPU PCIe interfaces on the switch side and provide a single source of configuration for all the DPU PCIe interfaces, the bridge interface ("midplane bridge") shall be used. To organize communication between the NPU and DPUs, a link-local subnet shall be used. An IPv4 link-local subnetwork is chosen because it is a relatively unused network that is mostly used for communication with directly connected hosts. This allows us to assume that it shall not interfere with the other advertised networks.
+
+To implement the switch side functionality the following changes shall be done:
+
+- SONiC device data should be extended to include and support the following:
+  - New `t1-smartswitch` topology. The topology shall be used as a default topology for the Smart Switch platform.
+  - The HW SKU schema shall be extended to include the following optional information:
+    - DPU modules available on the platform.
+    - DPU to netdev mapping.
+- The Config DB schema should be extended to include the following information:
+  - Support of the new `SmartSwitch` subtype in the DEVICE_METADATA config DB table. The new subtype shall allow identifying when SONiC is running on a Smart Switch.
+  - Support of the new `MID_PLANE_BRIDGE` table with the following information:
+    - Midplane bridge configuration with the list of the DPU PCIe interfaces that should be added to the bridge for the Smart Switch case.
+- The sonic-cfggen utility shall be extended to generate the following sample configuration based on the `t1-smartswitch` topology:
+  - Midplane bridge and DHCP server configuration based on the DPU information provided in the PLATFORM file.
+- systemd-networkd.service shall configure the midplane network of the NPU and DPUs:
+  - NPU side: create a midplane bridge with the corresponding configuration in the Linux Kernel.
+  - DPU side: start a DHCP client on the PCIe interface.
+- midplane-network-(npu/dpu).service is a oneshot service that waits for the midplane network to be initialized.
+- The DHCP server container should be included in the switch image. The DHCP server feature should be enabled by default.
+
+The midplane bridge configuration shall include two stages: 1. Create the Linux Kernel bridge and assign a fixed IP address to it in the midplane-network-(npu/dpu).service. 2. Add the PCIe interfaces to the midplane bridge. The DHCP server configuration shall be generated by the sonic-cfggen utility. No user involvement is required.
+
+### Device data and PLATFORM ###
+
+The Smart Switch platform by default shall use the `t1-smartswitch` topology. The topology shall be used together with the DPU information for the sample configuration generation.
+
+The PLATFORM configuration shall be extended to provide the information about the DPUs available in the system.
+
+**Smart Switch**
+
+#### NPU platform.json
+
+```json
+{
+    "DPUS": {
+        "dpu0": {
+            "midplane_interface": "dpu0"
+        },
+        "dpu1": {
+            "midplane_interface": "dpu1"
+        }
+    },
+    "midplane_network": {
+        "bridge_name": "bridge-midplane",
+        "bridge_address": "169.254.200.254/24"
+    }
+}
+```
+
+#### DPU platform.json
+
+```json
+{
+    "DPU": {},
+    "midplane_network": {
+        "bridge_address": "169.254.200.254/24"
+    }
+}
+```
+
+### DPU and PCIe interfaces naming convention ###
+
+To follow the SONiC interface naming convention, the names of the DPUs in the DB and the PCIe interfaces that are connected to the DPUs shall start with the "dpu" prefix. The indexes shall start from "0" and shall represent the ID of the DPU, for example dpu0, dpu1, etc. It is up to the vendors' implementation to ensure that interfaces are assigned the correct name during system initialization.
+
+### Configuration generation for IP address assignment ###
+
+The configuration generation flow can be triggered in the following cases:
+
+- During the first boot of the system, when there is no Config DB configuration file available yet.
+- During configuration recovery, when the Config DB configuration file is removed and a config-setup.service restart is triggered by the user.
+
+![image](./smart-switch-ip-assignment-configuration-generation.png)
+
+#### DPU IP address allocation ####
+
+The IP address allocation for the DPUs shall be the following:
+
+- Midplane bridge network address plus DPU ID plus 1. For example, for the bridge with the 169.254.200.254/24 IP address, the DPU0 IP address shall be 169.254.200.1 (see the sketch below).
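+
+For illustration, the allocation rule can be expressed as a small Python sketch (this is not the sonic-cfggen implementation, just the arithmetic the rule above implies):
+
+```python
+import ipaddress
+
+def dpu_ip(bridge_address: str, dpu_id: int) -> str:
+    """Midplane bridge network address plus DPU ID plus 1."""
+    bridge = ipaddress.ip_interface(bridge_address)
+    return str(bridge.network.network_address + dpu_id + 1)
+
+assert dpu_ip("169.254.200.254/24", 0) == "169.254.200.1"
+assert dpu_ip("169.254.200.254/24", 1) == "169.254.200.2"
+```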
+
+#### Smart Switch configuration ####
+
+Based on the preset `t1-smartswitch` default topology, the configuration generated by the sonic-cfggen utility will be the following:
+
+```json
+{
+    "DEVICE_METADATA": {
+        "localhost": {
+            "switch_type": "switch",
+            "type": "LeafRouter",
+            "subtype": "SmartSwitch"
+        }
+    },
+    "MID_PLANE_BRIDGE": {
+        "GLOBAL": {
+            "bridge": "bridge-midplane"
+        }
+    },
+    "DHCP_SERVER_IPV4": {
+        "bridge-midplane": {
+            "gateway": "169.254.200.254",
+            "lease_time": "infinite",
+            "mode": "PORT",
+            "netmask": "255.255.255.0",
+            "state": "enabled"
+        }
+    },
+    "DHCP_SERVER_IPV4_PORT": {
+        "bridge-midplane|dpu0": {
+            "ips": [
+                "169.254.200.1"
+            ]
+        },
+        "bridge-midplane|dpu1": {
+            "ips": [
+                "169.254.200.2"
+            ]
+        }
+    }
+}
+```
+
+The DEVICE_METADATA table includes the following:
+
+- The "switch_type" field that indicates this device is a switch.
+- The "subtype" field that indicates that this is a Smart Switch.
+
+The MID_PLANE_BRIDGE table includes the following:
+
+- The name that shall be used to create the bridge in the Kernel.
+- The address of the bridge with its network.
+- The list of the netdevs that shall be added to the bridge as members.
+
+The DHCP_SERVER_IPV4 table includes the following:
+
+- The name of the interface to listen on for requests. Same as the bridge name.
+- The gateway IP and network, same as the bridge.
+
+The DHCP_SERVER_IPV4_PORT table includes the following:
+
+- For each DPU available in the system, the IP address that should be assigned to the DPU.
+
+### Midplane network configuration flow ###
+
+![image](./smart-switch-ip-assignment-midplane-network-configuration-flow.svg)
+
+#### NPU ####
+
+1. systemd-sonic-generator renders the configuration files of systemd-networkd and midplane-network-npu.service according to platform.json.
+
+``` text
+# bridge-midplane.netdev
+[NetDev]
+Name=bridge-midplane
+Kind=bridge
+```
+
+``` text
+# bridge-midplane.network
+[Match]
+Name=bridge-midplane
+
+[Network]
+Address=169.254.200.254/24
+```
+
+``` text
+# midplane-network-npu.network
+[Match]
+Name=dpu*
+
+[Network]
+Bridge=bridge-midplane
+```
+
+2. systemd-networkd helps to create the bridge-midplane interface and assign the specific IP address according to the above configuration. Meanwhile, systemd-networkd will monitor the DPU PCIe interfaces. Once a PCIe interface is created, it will automatically be added into bridge-midplane.
+3. midplane-network-npu.service will be used to wait for the midplane bridge to be configured.
+
+``` text
+# midplane-network-npu.service
+
+[Unit]
+Description=Midplane network service
+Requires=systemd-networkd.service
+After=systemd-networkd.service
+Before=database.service
+
+[Service]
+Type=oneshot
+User=root
+ExecStart=/usr/lib/systemd/systemd-networkd-wait-online -i bridge-midplane
+
+[Install]
+WantedBy=multi-user.target
+```
+
+#### DPU ####
+
+On the DPU side, the steps are almost the same as on the NPU, but the generated configuration is simpler.
+
+``` text
+# midplane-network-dpu.network
+[Match]
+Name=eth0-midplane
+
+[Network]
+DHCP=yes
+```
+
+``` text
+# midplane-network-dpu.service
+
+[Unit]
+Description=Midplane network service
+Requires=systemd-networkd.service
+After=systemd-networkd.service
+Before=database.service
+
+[Service]
+Type=oneshot
+User=root
+ExecStart=/usr/lib/systemd/systemd-networkd-wait-online -i eth0-midplane --timeout=600
+
+[Install]
+WantedBy=multi-user.target
+```
+
+Step 2: systemd-networkd helps to start the DHCP client for eth0-midplane.
+Step 3: midplane-network-dpu.service will be used to wait for an IP to be assigned to eth0-midplane by the DHCP server. We expect eth0-midplane to be configured within 10 minutes.
+
+### IP assignment flow ###
+
+![image](./smart-switch-ip-assignment-ip-assignment-flow.png)
+
+- (1) sonic-cfggen, based on platform.json, renders the midplane network and DHCP server configuration.
+- (2) sonic-cfggen pushes the configuration into the Config DB.
+- (3-4) The DHCP server and DHCP relay containers upon start consume the configuration from the config DB and start listening for requests from DHCP clients.
+- (5) The DHCP client on the DPU sends a request over the eth0-midplane interface. The request comes to the DHCP relay through the midplane bridge. The DHCP relay inserts option 82 into the request with the information about the interface from which the request came. The DHCP relay forwards the packet to the DHCP server.
+- (6) The DHCP server sends a reply with the IP configuration to the DHCP client on the DPU.
+
+### High-Level Design ###
+
+### SAI API ###
+
+N/A
+
+### Configuration and management ###
+
+No new CLI commands are required.
+
+#### CLI/YANG model Enhancements ####
+
+The YANG model shown in this section is provided as a reference. The complete model shall be provided in a separate PR.
+
+#### YANG model for the switch side configuration ####
+
+##### DEVICE_METADATA table #####
+
+```yang
+    container sonic-device_metadata {
+
+        container DEVICE_METADATA {
+            leaf subtype {
+                type string {
+                    length 1..255;
+                    pattern "|SmartSwitch";
+                }
+            }
+        }
+    }
+```
+
+##### MID_PLANE_BRIDGE and DPUS tables #####
+
+```yang
+    container sonic-smart-switch {
+
+        container MID_PLANE_BRIDGE {
+
+            description "MID_PLANE_BRIDGE part of config_db.json";
+
+            container GLOBAL {
+                leaf bridge {
+                    type string {
+                        pattern "bridge-midplane";
+                    }
+                    description "Name of the midplane bridge";
+                }
+
+                leaf ip_prefix {
+                    type inet:ipv4-prefix;
+                    description "IP prefix of the midplane bridge";
+                }
+            }
+            /* end of container GLOBAL */
+        }
+        /* end of container MID_PLANE_BRIDGE */
+
+        container DPUS {
+            description "DPUS part of config_db.json";
+
+            list DPUS_LIST {
+                key "dpu_name";
+
+                leaf dpu_name {
+                    description "Name of the DPU";
+                    type string {
+                        pattern "dpu[0-9]+";
+                    }
+                }
+
+                leaf midplane_interface {
+                    description "Name of the interface that represents the DPU";
+                    type string {
+                        pattern "dpu[0-9]+";
+                    }
+                }
+            }
+            /* end of list DPUS_LIST */
+        }
+        /* end of container DPUS */
+    }
+    /* end of container sonic-smart-switch */
+```
+
+## Warmboot and Fastboot Design Impact ##
+
+The feature has no requirements for warm and fast boot.
+
+## Memory Consumption ##
+
+The feature has minimal impact on memory consumption. Overall it requires less than 1KB of disk and RAM space to store the new configuration.
+
+## Restrictions/Limitations ##
+
+## Testing Requirements/Design ##
+
+### Unit Test cases ###
+
+1. Add new YANG model tests for the new MID_PLANE_BRIDGE and DPUS tables.
+2. Extend existing YANG model tests to cover the MGMT_INTERFACE table changes.
+3. Add new tests to cover the handling of the MID_PLANE_BRIDGE and DPUS tables in the sonic-cfggen utility.
+4. Add new tests to cover the DHCP server configuration generation for the Smart Switch in the sonic-cfggen utility.
+5. Test to verify the midplane network configuration generation on the NPU and DPU side.
+
+### System Test cases ###
+
+No separate test for this feature is required. The feature will be tested implicitly by the other DASH tests.
+
+## Open/Action items - if any ##
+
+- Do we need to extend the minigraph configuration to support the Smart Switch configuration?
diff --git a/doc/smart-switch/ip-address-assigment/smart-switch-ip-assignment-configuration-generation.png b/doc/smart-switch/ip-address-assigment/smart-switch-ip-assignment-configuration-generation.png
new file mode 100644
index 00000000000..723def2d0e7
Binary files /dev/null and b/doc/smart-switch/ip-address-assigment/smart-switch-ip-assignment-configuration-generation.png differ
diff --git a/doc/smart-switch/ip-address-assigment/smart-switch-ip-assignment-ip-assignment-flow.png b/doc/smart-switch/ip-address-assigment/smart-switch-ip-assignment-ip-assignment-flow.png
new file mode 100644
index 00000000000..73b96073820
Binary files /dev/null and b/doc/smart-switch/ip-address-assigment/smart-switch-ip-assignment-ip-assignment-flow.png differ
diff --git a/doc/smart-switch/ip-address-assigment/smart-switch-ip-assignment-midplane-network-configuration-flow.svg b/doc/smart-switch/ip-address-assigment/smart-switch-ip-assignment-midplane-network-configuration-flow.svg
new file mode 100644
index 00000000000..a8a8218ab18
--- /dev/null
+++ b/doc/smart-switch/ip-address-assigment/smart-switch-ip-assignment-midplane-network-configuration-flow.svg
@@ -0,0 +1,250 @@
[SVG source omitted: midplane network configuration flow diagram showing platform.json, systemd-sonic-generator, systemd-networkd.service, midplane-network.service, database.service and Config DB, with numbered steps 1-3]
diff --git a/doc/smart-switch/ip-address-assigment/smart-switch-ip-assignment-pci-interfaces.png b/doc/smart-switch/ip-address-assigment/smart-switch-ip-assignment-pci-interfaces.png
new file mode 100644
index 00000000000..8aab0182f3b
Binary files /dev/null and b/doc/smart-switch/ip-address-assigment/smart-switch-ip-assignment-pci-interfaces.png differ
diff --git a/doc/snmp/snmp-changes-to-support-ipv6.md b/doc/snmp/snmp-changes-to-support-ipv6.md
new file mode 100644
index 00000000000..4e79142de94
--- /dev/null
+++ b/doc/snmp/snmp-changes-to-support-ipv6.md
@@ -0,0 +1,139 @@
+# SONiC SNMP Changes to support IPv6 #
+
+This document captures the changes required to support SNMP over IPv6.
+
+## Motivation ##
+
+SNMP queries over an IPv6 address fail in certain scenarios on single asic platforms.
+Ideally, SNMP queries should be successful over both IPv4 and IPv6 addresses.
+
+## Current configuration for SNMP ##
+
+Currently, the snmpd process inside the SNMP docker uses snmpd.conf as the configuration file.
+One of the configuration directives in snmpd.conf is *agentaddress*.
+*agentaddress* defines the list of listening addresses on which SNMP requests can be received.
+In SONiC, the default listening address is 'any ip'.
+```
+agentAddress udp:161
+agentAddress udp6:161
+```
+The other method is to use the below configuration command to configure the agent address.
+```
+config snmpagentaddress add
+```
+
+## Issue seen with IPv6 ##
+In case of an SNMP query over IPv6, the SNMP query fails with a timeout.
+The SNMP query fails over IPv6 because the SNMP response goes out with an incorrect SRC IP.
+The SRC IP in the SNMP packet is incorrect because snmpd does not keep track of the DST IP from the SNMP request packet.
+Below is a packet capture showing an SNMP request packet sent to the DUT Loopback IPv6 address with the SRC IP being the port-channel IP of the cEOS neighbor. The SNMP response packet goes out with the SRC address of the DUT port-channel IP whereas it should have been the DUT Loopback IPv6 address.
+
+```
+23:18:51.620897 In 22:26:27:e6:e0:07 ethertype IPv6 (0x86dd), length 105: fc00::72.41725 > fc00:1::32.161: C="public" GetRequest(28) .1.3.6.1.2.1.1.1.0
+23:18:51.621441 Out 28:99:3a:a0:97:30 ethertype IPv6 (0x86dd), length 241: fc00::71.161 > fc00::72.41725: C="public" GetResponse(162) .1.3.6.1.2.1.1.1.0="SONiC Software Version: SONiC.xxx - HwSku: xx - Distribution: Debian 10.13 - Kernel: 4.19.0-12-2-amd64"
+```
+**Sequence of SNMP request and response**
+
+1. The SNMP request will be sent with SRC IP fc00::72 and DST IP fc00:1::32.
+2. The SNMP request received at the SONiC device is sent to snmpd, which is listening on port 161 (:::161).
+3. The snmpd process will parse the request, create a response, and send it to DST IP fc00::72.
+4. The snmpd process does not track the DST IP on which the SNMP request was received, which in this case is the Loopback IP.
+The snmpd process will only keep track of the IP to which the response should be sent.
+5. The snmpd process will send the response packet.
+The kernel will do a route lookup on the destination IP and find the best path.
+ip -6 route get fc00::72
+fc00::72 from :: dev PortChannel101 proto kernel src fc00::71 metric 256 pref medium
+6. Using the "src" ip from above, the response is sent out. This SRC ip is that of the PortChannel and not the device Loopback IP.
+
+The same issue is seen when the SNMP query is sent from a remote server over the Management IP.
+SONiC device eth0 --------- Remote server
+The SNMP request comes with SRC IP Remote_server_IP and DST IP Management IP.
+If the kernel finds that the best route to Remote_server_IP is via BGP neighbors, then it will send the response via a front-panel interface with the SRC IP as the Loopback IP instead of the Management IP.
+
+The main issue is that, in the case of IPv6, snmpd ignores the IP address to which the SNMP request was sent.
+In the case of IPv4, snmpd keeps track of the DST IP of the SNMP request; it will keep track of whether the SNMP request was sent to the mgmt IP or the Loopback IP.
+Later, this IP is used in ipi_spec_dst as the SRC IP, which helps the kernel find the route based on the DST IP using the right SRC IP.
+https://github.com/net-snmp/net-snmp/blob/master/snmplib/transports/snmpUDPBaseDomain.c#L300
+ipi.ipi_spec_dst.s_addr = srcip->s_addr
+Reference: https://man7.org/linux/man-pages/man7/ip.7.html
+
+**Why should the SRC IP in the SNMP response match the DST IP in the SNMP request?**
+
+In an SNMP PDU, there is a request-id, and the response has to match the request-id.
+
+Reference RFC 1157:
+
+_In GetResponse-PDU, the value of the request-id field of the GetResponse-PDU is that of the received message._
+
+**This issue is not seen on multi-asic platforms, why?**
+
+On multi-asic platforms, there exist different network namespaces.
+The SNMP docker with the snmpd process runs in the host namespace.
+The Management interface belongs to the host namespace.
+Loopback0 is configured in the asic namespaces.
+Additional information on how a packet coming over the Loopback IP reaches the snmpd process running in the host namespace: #5420
+Because of this separation of network namespaces, the route lookup of the destination IP is confined to the routing table of the specific namespace where the packet is received.
+If a packet is received over the management interface, the SNMP response is also sent out of the management interface. The same goes for a packet received over the Loopback IP.
+
+## Changes done to workaround the issue ##
+
+Currently snmpd listens on any ip by default; after this change:
+1. If minigraph.xml is used to load the initial configuration, then SNMP_AGENT_ADDRESS_CONFIG in config_db will be updated with the Loopback0 and Management interface IP addresses during the parsing of minigraph.xml.
+2. No change will be done if config_db.json is used to load the configuration.
+The config_db.json file could have the SNMP_AGENT_ADDRESS_CONFIG table updated with the required IP addresses for snmpd to listen on.
+
+Before the change:
+snmpd listens on any IP; snmpd binds to IPv4 and IPv6 sockets as below:
+```
+netsnmp_udpbase: binding socket: 7 to UDP: [0.0.0.0]:0->[0.0.0.0]:161
+trace: netsnmp_udp6_transport_bind(): transports/snmpUDPIPv6Domain.c, 303:
+netsnmp_udpbase: binding socket: 8 to UDP/IPv6: [::]:161
+```
+When an IPv4 response is sent, it goes out of fd 7, and an IPv6 response goes out of fd 8.
+When an IPv6 response is sent, it does not have the right SRC IP, which can lead to the issue described.
+
+If minigraph.xml is used, then SNMP_AGENT_ADDRESS_CONFIG will be configured with the Loopback0 and Management interface IP addresses.
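+
+For illustration, the resulting config_db entries might look like the snippet below, reusing the Loopback0 and Management addresses from the traces in this document. The `IP|port|vrf` key format shown here is an assumption for illustration, not the authoritative schema:
+
+```json
+{
+    "SNMP_AGENT_ADDRESS_CONFIG": {
+        "10.1.0.32|161|": {},
+        "fc00:1::32|161|": {},
+        "10.250.0.101|161|": {},
+        "fc00:2::32|161|": {}
+    }
+}
+```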
+When snmpd listens on specific Loopback/Management IPs, snmpd binds to different sockets:
+```
+trace: netsnmp_udpipv4base_transport_bind(): transports/snmpUDPIPv4BaseDomain.c, 207:
+netsnmp_udpbase: binding socket: 7 to UDP: [0.0.0.0]:0->[10.250.0.101]:161
+trace: netsnmp_udpipv4base_transport_bind(): transports/snmpUDPIPv4BaseDomain.c, 207:
+netsnmp_udpbase: binding socket: 8 to UDP: [0.0.0.0]:0->[10.1.0.32]:161
+trace: netsnmp_register_agent_nsap(): snmp_agent.c, 1261:
+netsnmp_register_agent_nsap: fd 8
+netsnmp_udpbase: binding socket: 10 to UDP/IPv6: [fc00:1::32]:161
+trace: netsnmp_register_agent_nsap(): snmp_agent.c, 1261:
+netsnmp_register_agent_nsap: fd 10
+netsnmp_ipv6: fmtaddr: t = (nil), data = 0x7fffed4c85d0, len = 28
+trace: netsnmp_udp6_transport_bind(): transports/snmpUDPIPv6Domain.c, 303:
+netsnmp_udpbase: binding socket: 9 to UDP/IPv6: [fc00:2::32]:161
+```
+
+When an SNMP request comes in via the Loopback IPv4 address, the SNMP response is sent out of fd 8:
+```
+trace: netsnmp_udpbase_send(): transports/snmpUDPBaseDomain.c, 511:
+netsnmp_udp: send 170 bytes from 0x5581f2fbe30a to UDP: [10.0.0.33]:46089->[10.1.0.32]:161 on fd 8
+```
+When an SNMP request comes in via the Loopback IPv6 address, the SNMP response is sent out of fd 10:
+```
+netsnmp_ipv6: fmtaddr: t = (nil), data = 0x5581f2fc2ff0, len = 28
+trace: netsnmp_udp6_send(): transports/snmpUDPIPv6Domain.c, 164:
+netsnmp_udp6: send 170 bytes from 0x5581f2fbe30a to UDP/IPv6: [fc00::42]:43750 on fd 10
+```
+This separation of socket fds ensures that the SNMP response goes out with the right SRC address.
+
+This change is also a more secure approach than listening on any IP.
+
+
+### Effects of this change ###
+1. If minigraph.xml is used to load the initial configuration, then snmpd will listen on the Loopback0/Management IPs on single-asic platforms.
+2. If config_db.json is used to load the configuration, then snmpd will listen on any IP and the IPv6 issue will be seen. To overcome the IPv6 issue, the SNMP_AGENT_ADDRESS_CONFIG table should be updated to listen on specific IPv4 or IPv6 addresses using "config snmpagentaddress add <ip>".
+3. No change is required or done for multi-asic platforms.
+
+### Pull requests to support this change ###
+
+1. [SNMP][IPv6]: Fix SNMP IPv6 reachability issue in certain scenarios https://github.com/sonic-net/sonic-buildimage/pull/15487
+2. [SNMP][IPv6]: Fix to use link local IPv6 address as snmp agentAddress https://github.com/sonic-net/sonic-buildimage/pull/16013
+3. [SNMP]: Modify minigraph parser to update SNMP_AGENT_ADDRESS_CONFIG https://github.com/sonic-net/sonic-buildimage/pull/17045
\ No newline at end of file
diff --git a/doc/sonic-application-extension/img/TPCM_Flow_Diagram.png b/doc/sonic-application-extension/img/TPCM_Flow_Diagram.png
new file mode 100644
index 00000000000..7dc7856e86a
Binary files /dev/null and b/doc/sonic-application-extension/img/TPCM_Flow_Diagram.png differ
diff --git a/doc/sonic-application-extension/tpcm_app_ext.md b/doc/sonic-application-extension/tpcm_app_ext.md
new file mode 100644
index 00000000000..cf01f80c2cf
--- /dev/null
+++ b/doc/sonic-application-extension/tpcm_app_ext.md
@@ -0,0 +1,429 @@
+# Third Party Container Management Enhancements to SONiC Application Extensions framework
+
+#### Rev 0.3
+
+## Table of Content
+ * [List of Tables](#list-of-tables)
+ * [Revision](#revision)
+ * [Scope](#scope)
+ * [Definitions/Abbreviations](#definitionsabbreviations)
+ * [Document/References](#documentreferences)
+ * [Overview](#overview)
+ * [Requirements](#requirements)
+ * [Architecture Design](#architecture-design)
+ * [High-Level Design](#High-Level-Design)
+ * [SAI API](#SAI-API)
+ * [Configuration and management](#Configuration-and-management)
+ * [Restrictions/Limitations](#restrictionslimitations)
+ * [Testing Requirements/Design](#testing-requirementsdesign)
+
+## List of Tables
+- [Table 1: Abbreviations](#table-1-abbreviations)
+- [Table 2: References](#table-2-references)
+
+### Revision
+| Rev | Date | Author | Change Description |
+|:----:|:----------:|:-----------------:|:-----------------------------------------------:|
+| 0.1 | 10/14/2022 |Kalimuthu Velappan, Senthil Guruswamy, Babu Rajaram | Initial version |
+| 0.2 | 03/10/2023 |Senthil Guruswamy | Update |
+| 0.3 | 09/19/2023 |Senthil Guruswamy | Update |
+
+This document provides high-level design information on extending the SONiC Application Extension Infrastructure to seamlessly support Third Party Applications.
+
+### Scope
+There are many third-party application dockers that can be used in SONiC to provision, manage, and monitor SONiC devices. The dockers need not be compatible with SONiC, and need not be predefined or qualified with SONiC. These are extensions to SONiC and can be dynamically installed, similar to the debian apt install or pip install paradigms. To support this, we need additional capabilities to seamlessly integrate them with SONiC, related to installation, upgrade, and configuration. The SONiC Application Extension Framework already provides an infrastructure to integrate applications, and this document details the enhancements.
+
+
+### Definitions/Abbreviations
+
+#### Table 1 Abbreviations
+
+| **Abbreviation** | **Definition** |
+| ---------------- | ------------------------------------- |
+| SONiC | Software for Open Networking in Cloud |
+| DB | Database |
+| API | Application Programming Interface |
+| SAI | Switch Abstraction Interface |
+| YANG | Yet Another Next Generation |
+| JSON | JavaScript Object Notation |
+| XML | eXtensible Markup Language |
+| gNMI | gRPC Network Management Interface |
+| TPC | Third Party Container |
+| TPCM | Third Party Container Management |
+| AEF | Application Extension Framework |
+
+### Document/References
+
+#### Table 2 References
+
+| **Document** | **Location** |
+|------------------------------------|---------------|
+| SONiC Application Extension Infrastructure HLD | [sonic-application-extention-hld.md](https://github.com/stepanblyschak/SONiC/blob/8d1811f7592812bb5a2cd0b7af5023f5b6449219/doc/sonic-application-extention/sonic-application-extention-hld.md) |
+
+### Overview
+
+SONiC is an open and extensible operating system. SONiC can be extended with Third Party Containers to help with orchestration, monitoring, and extending any features or capabilities of the SONiC NOS. These can be custom built or readily available in the docker community. The rich open ecosystem provides multiple docker-based applications which can be used in SONiC, and these need not be pre-defined packages in SONiC. While the SONiC Application Extensions framework provides the infrastructure to package and integrate a TPC into SONiC, we need more capabilities to support a dynamic environment:
+
+- Dynamically install TPCs from various sources without pre-packaging them in SONiC.
+- Ensure a TPC has the right system resources and privileges
+- Ability to configure startup command arguments for a TPC
+- Configure resource limits for TPCs, to ensure they do not starve core SONiC containers
+- Ability to seamlessly integrate into the SONiC services architecture, enabling start, auto restart, and establishing dependencies
+
+### Requirements
+
+These open TPCs help extend SONiC capabilities, and thus the following requirements are outlined to integrate them seamlessly into SONiC:
+
+- Dynamically install a TPC from a docker registry, or a docker image from the local file system, SCP, SFTP, or a URL. These remote download options are introduced now to SONiC AEF.
+- Upgrade TPCs from a docker registry, or a docker image from the local file system, SCP, SFTP, or a URL
+- Provide a default manifest file during install for a docker image without a manifest.
+- Provide runtime install capability to pass various docker startup arguments and parameters through a local custom manifest file.
+- Specify system resource limits for TPCs to restrict CPU and memory usage
+- Provide update capability to update various TPC configurations like their memory, cpu, dependencies, etc. This capability is new to SONiC AEF.
+
+### Architecture Design
+
+There is no change to the SONiC architecture.
+
+### High-Level Design
+
+The SONiC Application Extension Infrastructure provides the framework to integrate SONiC-compatible dockers. However, there are many open source third-party applications which can be installed on a SONiC system to extend its capabilities, and these typically are standalone or have little interaction with the SONiC system itself. So it is not necessary for these docker applications to be SONiC compliant or to provide their corresponding manifest.json. These TPCs can be installed dynamically on a SONiC device and can be managed.
+
+- This feature enables the installation and management of third-party Docker packages without manifest files through the Sonic Package Manager.
+- This feature enables users to create a local custom manifest file (from a default manifest template file) on the fly, which can be updated or deleted as needed.
+- Additionally, the Sonic Package Manager provides CLI commands to create, update, and delete the local custom manifest file.
+
+###### Figure. SONiC TPCM Support

+Figure. SONiC TPCM Flow Diagram
+![SONiC TPCM Flow Diagram](img/TPCM_Flow_Diagram.png)

+
+
+
+#### TPC Install
+
+- The TPCM support feature will leverage the existing Sonic Package Manager framework.
+- The ability to download image tarballs for the Docker packages through SCP, SFTP, and a URL before installing them is introduced.
+- If a manifest file is not found in the docker image during installation, a default local TPC manifest file is used to install the package.
+- At the end of the TPC installation using sonic-package-manager install, if no manifest file is found in the image and the --name option is used, a new manifest file with the specified name is created.
+- The user can also create a custom local manifest file to be used during installation by specifying the "--use-local-manifest" option along with a custom name using the "--name" option in the "sonic-package-manager install" command.
+- The custom local manifest file can be created using the "sonic-package-manager manifests create" command.
+- The custom local manifest file will be created under the directory /var/lib/sonic-package-manager/manifests/.
+- The "--name" option takes effect only for TPC packages (that do not have a manifest file in them), and the custom local manifest file with the specified name is created.
+- The --name option enables the installation of multiple TPC containers using the same TPC docker image.
+- The default TPC manifest file will have the following properties:
+
+```
+{
+    "version": "1.0.0",
+    "package": {
+        "version": "1.0.0",
+        "depends": [],
+        "name": "default"
+    },
+    "service": {
+        "name": "default",
+        "requires": [
+            "docker"
+        ],
+        "after": [
+            "docker"
+        ],
+        "before": [],
+        "dependent-of": [],
+        "asic-service": false,
+        "host-service": false,
+        "warm-shutdown": {
+            "after": [],
+            "before": []
+        },
+        "fast-shutdown": {
+            "after": [],
+            "before": []
+        },
+        "syslog": {
+            "support-rate-limit": false
+        }
+    },
+    "container": {
+        "privileged": false,
+        "volumes": [],
+        "tmpfs": [],
+        "entrypoint": ""
+    },
+    "cli": {
+        "config": "",
+        "show": "",
+        "clear": ""
+    }
+}
+```
+
+- The 'entrypoint' attribute is used to update the docker startup command arguments. For example, setting "command": "--path.rootfs=/host" would configure the host filesystem as the root filesystem path for the container.
+
+#### Custom manifest creation process
+
+- The "sonic-package-manager manifests create <name>" command allows users to create a custom manifest file with the specified name.
+- It uses the same manifest directory (/var/lib/sonic-package-manager/manifests) for both creating and updating custom manifests.
+- The manifest file format and validations are to be ensured.
+- It takes a json file as input.
+
+#### Custom manifest update process
+
+- The local manifest file contents can be updated for a TPCM Docker package using the "sonic-package-manager manifests update <name> --from-json <json-file>" command.
+- Executing this command will lead to the creation of a new file with the extension .edit for the manifest file that matches the provided name in the same location.
+Conditions (see the sketch after this list):
+  - If the package is already installed:
+    - If the name.edit file does not exist, create it by copying from the original name file.
+    - If the name.edit file already exists, update it directly.
+  - If the package is not installed:
+    - Update the name file directly.
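+
+A minimal Python sketch of the `.edit` bookkeeping described above (illustrative only; the function and its name are ours, not the actual sonic-package-manager API):
+
+```python
+import os
+import shutil
+
+MANIFESTS_DIR = "/var/lib/sonic-package-manager/manifests"
+
+def manifest_file_to_update(name: str, installed: bool) -> str:
+    """Pick which file 'manifests update' should write, per the conditions above."""
+    original = os.path.join(MANIFESTS_DIR, name)
+    edit = os.path.join(MANIFESTS_DIR, name + ".edit")
+    if installed:
+        if not os.path.exists(edit):
+            shutil.copyfile(original, edit)  # create name.edit from name
+        return edit                          # installed: update name.edit
+    return original                          # not installed: update name directly
+```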
+
+#### TPC Update
+
+- After updating the local manifest file, the TPC container can be updated with the "sonic-package-manager update" command. This command will reinstall the TPC container using the new manifest file.
+
+- At the completion of the sonic-package-manager update process, if it succeeds, the manifests/name.edit file is moved to manifests/name. Otherwise, if the update process does not succeed, the name.edit file remains unchanged.
+
+- During the sonic-package-manager update, the update process can be performed by taking the old package from name and the new package from name.edit.
+
+#### TPC Uninstall
+
+- The uninstallation process for TPC packages using 'sonic-package-manager uninstall' follows the same procedure as regular SONiC packages,
+  where the corresponding Docker container and service are stopped and removed, and the package image is deleted if it is not in use.
+  The local manifest file associated with the TPC package is also removed during uninstallation.
+
+- At the end of sonic-package-manager uninstall name, the manifest files name and name.edit are deleted.
+
+#### TPC Upgrade
+
+- The upgrade process for TPC packages using 'sonic-package-manager install' follows the same procedure as regular SONiC packages.
+- Since a manifest package version mismatch is the trigger for a package upgrade, the user must use the --force option for a TPC upgrade.
+- The new version of the TPC package is downloaded and installed, and a new manifest is created if necessary.
+- The old version of the TPC package is then removed if not in use by another package, and the running container is restarted with the new version.
+
+### SAI API
+
+Not applicable
+
+### Configuration and management
+
+#### CLI Enhancements
+
+The SONiC Package Manager is another executable utility available in the base SONiC OS called *sonic-package-manager*, abbreviated to *spm*. It is extended to support these new TPCM capabilities. The command line interfaces are given below:
+
+#### CLI
+
+#### TPC Installation
+
+This section shows the additional options that would be added for TPC installation in the sonic-package-manager CLI:
+
+```
+admin@sonic:~$ sudo sonic-package-manager install --help
+
+Usage: sonic-package-manager install [OPTIONS] [PACKAGE_EXPR]
+
+  Install SONiC package.
+
+Options:
+  --from-tarball        Install using the tarball from local/SCP/SFTP/HTTPS url
+  --use-local-manifest  Use locally created manifest file
+  --name                Custom name for the TPCM package
+```
+
+###### Examples
+
+```
+admin@sonic:~$ sudo sonic-package-manager install httpd:latest --name my_httpd
+```
+
+Install from url:
+```
+admin@sonic:~$ sudo sonic-package-manager install --from-tarball https://tpc.local-server.com/home/tpcs/httpd.tar.gz --name=my_url_httpd
+```
+
+Install from scp:
+```
+admin@sonic:~$ sudo sonic-package-manager install --from-tarball scp://username@10.171.112.156/home/tpcs/httpd.tar.gz --name my_scp_httpd
+```
+
+Install from sftp:
+```
+admin@sonic:~$ sudo sonic-package-manager install --from-tarball sftp://username@10.171.112.156/home/tpcs/httpd.tar.gz --name my_sftp_httpd
+```
+
+Install from local FS:
+```
+admin@sonic:~$ sudo sonic-package-manager install --from-tarball /usb1/tpcs/httpd.tar.gz --name my_local_httpd
+```
+
+#### TPC Uninstallation
+
+The sonic-package-manager uninstall option shall be used to uninstall a TPC.
+
+###### Examples
+
+```
+admin@sonic:~$ sudo sonic-package-manager uninstall my_scp_httpd
+```
+
+#### TPC Update CLI
+
+A new option is introduced under the sonic-package-manager CLI to support updating various parameters of the manifest file for a given TPC through a json file.
+
+###### Examples
+
+```
+admin@sonic:~$ sudo sonic-package-manager manifests update my_node_exporter --from-json /tmp/new.json
+admin@sonic:~$ sudo sonic-package-manager update my_node_exporter --use-local-manifest
+```
+
+#### TPC Upgrade CLI
+
+No change in the sonic-package-manager CLI for upgrade, which is 'sonic-package-manager install'.
+
+#### TPC Manifests CLI
+
+##### Manifest Create
+
+sonic-package-manager manifests create <name> --from-json <json-file>
+
+###### Examples
+Create a default local tpcm manifest file:
+```
+admin@sonic:~$ sudo sonic-package-manager manifests create my_httpd
+```
+
+Create a local tpcm manifest file with a json file:
+```
+admin@sonic:~$ sudo sonic-package-manager manifests create my_httpd --from-json /tmp/new.json
+```
+
+##### Manifest Update
+
+sonic-package-manager manifests update <name> --from-json <json-file>
+
+###### Examples
+Update a tpcm manifest:
+```
+admin@sonic:~$ sudo sonic-package-manager manifests update my_httpd --from-json /tmp/modified.json
+```
+
+##### Manifest show
+
+sonic-package-manager manifests show <name>
+
+###### Examples
+Display the contents of a local tpcm manifest file:
+```
+admin@sonic:~$ sudo sonic-package-manager manifests show my_httpd
+```
+
+##### Manifest delete
+
+sonic-package-manager manifests delete <name>
+
+###### Examples
+Delete a local tpcm manifest file:
+```
+admin@sonic:~$ sudo sonic-package-manager manifests delete my_httpd
+```
+
+##### Manifest list
+
+sonic-package-manager manifests list
+
+###### Examples
+Display all the local tpcm manifest files from the TPC manifest folder:
+```
+admin@sonic:~$ sudo sonic-package-manager manifests list
+```
+
+### Restrictions/Limitations
+
+TPC support for the management VRF can be provided later; it is currently not supported.
+TPC resource limits will be provided in a follow-up.
+
+### Future Enhancements Proposal
+
+- A local custom manifest file could be used (created/updated) for SONiC packages as well.
+- A TPC could be configured to start right after the system becomes ready so that SONiC package bootup won't be delayed.
+- Data preservation in TPCM migration
+- TPC and SONiC containers to have resource limits in place in config db.
+
+
+### Testing Requirements/Design
+
+#### Unit Test cases
+
+Installation test case:
+- Verify that TPCM packages can be installed successfully using the sonic package manager with a custom local manifest file.
+- Verify with SCP
+- Verify with SFTP
+- Verify with URL
+
+Uninstallation test case:
+- Verify that TPCM packages can be uninstalled using the sonic package manager without any issues.
+
+Upgrade test case:
+- Verify that TPCM packages can be upgraded to newer versions using the sonic package manager without any issues.
+- Verify TPC migration on a SONiC-to-SONiC upgrade
+
+Custom manifest file test case:
+- Verify that a custom manifest file can be created for TPCM packages using the sonic-package-manager manifests create command and used during TPCM installation.
+
+Multiple TPC containers test case:
+- Verify that multiple TPC containers can be installed on the same system using the --name option and that they can be managed independently.
+
+Manifest file update test case:
+- Verify that the local manifest file can be updated using the sonic-package-manager manifests update command and that the changes are reflected during TPCM installation.
+
+Error handling test case:
+- Verify that the sonic package manager can handle errors gracefully and provide meaningful error messages to users in case of failures during TPCM installation, upgrade, or uninstallation.
+
+#### System Test cases
+
+- Verify that TPC containers can be installed and run alongside SONiC containers without conflicts or issues.
+- Verify that TPC containers can be scaled up and down as needed to meet changing workload demands, and that SONiC containers can coexist with these changes.
+
diff --git a/doc/srv6/images/srv6_sid_l3adj_sequence_diagram.png b/doc/srv6/images/srv6_sid_l3adj_sequence_diagram.png
new file mode 100644
index 00000000000..def2ff31f71
Binary files /dev/null and b/doc/srv6/images/srv6_sid_l3adj_sequence_diagram.png differ
diff --git a/doc/srv6/srv6_sid_l3adj.md b/doc/srv6/srv6_sid_l3adj.md
new file mode 100644
index 00000000000..33d7350e33d
--- /dev/null
+++ b/doc/srv6/srv6_sid_l3adj.md
@@ -0,0 +1,128 @@
+# SRv6 SID L3Adj #
+
+## Table of Content
+
+- [Revision](#revision)
+- [Scope](#scope)
+- [Definitions/Abbreviations](#definitionsabbreviations)
+- [Overview](#overview)
+- [High-Level Design](#high-level-design)
+  - [SRv6Orch Changes](#srv6orch-changes)
+- [SAI API](#sai-api)
+- [Testing Requirements/Design](#testing-requirementsdesign)
+  - [Unit Test cases](#unit-test-cases)
+- [Open/Action items](#openaction-items)
+- [References](#references)
+
+## Revision
+
+| Rev  | Date      | Author                              | Change Description      |
+| :--: | :-------: | :---------------------------------: | :---------------------: |
+| 0.1  | 05/9/2023 | Carmine Scarpitta, Ahmed Abdelsalam | Initial version         |
+
+## Scope
+
+This document covers extending the SRv6Orch to support the programming of the L3Adj associated with the SRv6 uA, End.X, uDX4, uDX6, End.DX4, and End.DX6 behaviors.
+
+## Definitions/Abbreviations
+
+| **Term** | **Definition** |
+|--------------------------|----------------------------------------------------|
+| End.X | L3 Cross-Connect |
+| End.DX4 | Decapsulation and IPv4 Cross-Connect |
+| End.DX6 | Decapsulation and IPv6 Cross-Connect |
+| L3Adj | Layer 3 Adjacency |
+| SID | Segment Routing Identifier |
+| SRv6 | Segment Routing over IPv6 |
+| uA | End.X behavior with NEXT-CSID, PSP and USD flavors |
+| uDX4 | End.DX4 behavior with NEXT-CSID flavor |
+| uDX6 | End.DX6 behavior with NEXT-CSID flavor |
+
+## Overview
+
+The support of SRv6 has been defined in this HLD: [Segment Routing over IPv6 (SRv6) HLD](https://github.com/sonic-net/SONiC/blob/master/doc/srv6/srv6_hld.md).
+
+The HLD includes several SRv6 behaviors defined in [RFC 8986](https://datatracker.ietf.org/doc/html/rfc8986).
+The Appl DB was extended to support these behaviors.
+
+The Appl DB includes the L3Adj attribute, which is used with these behaviors: uA, End.X, uDX4, uDX6, End.DX4, and End.DX6.
+
+The current implementation of SRv6Orch does not process the L3Adj.
+
+In this HLD, we extend the SRv6Orch to process the L3Adj and program it in the ASIC DB.
+
+## High-Level Design
+
+The following diagram shows the SRv6Orch workflow to process an SRv6 SID associated with an L3Adj in SONiC:

+
+![SRv6 SID L3Adj Sequence Diagram](images/srv6_sid_l3adj_sequence_diagram.png "SRv6 SID L3Adj Sequence Diagram")
+
+- SRv6Orch is an APPL_DB subscriber.
+- SRv6Orch receives a `SRV6_MY_SID_TABLE` update notification about the SID. The SID is associated with an L3 adjacency carried in the `adj` parameter.
+- SRv6Orch gets the nexthop ID of the adjacency from NeighOrch.
+- SRv6Orch sets the nexthop ID attribute of the SID.
+- SRv6Orch invokes the sairedis `sai_srv6_api->create_my_sid_entry()` API to create the SRv6 SID entry in the ASIC DB.
+
+The next subsections describe the SRv6Orch changes required to support the HLD described above.
+
+### SRv6Orch Changes
+
+We extend SRv6Orch to support the programming of the L3Adj associated with the uA, End.X, uDX4, uDX6, End.DX4, and End.DX6 behaviors.
+
+When SRv6Orch receives a SID with the `adj` parameter set, it calls the function `neighOrch->hasNextHop()` to make sure a nexthop associated with the adjacency exists.
+- If the nexthop exists, SRv6Orch invokes the sairedis API `sai_srv6_api->create_my_sid_entry()` to create an entry `SAI_OBJECT_TYPE_MY_SID_ENTRY` in the ASIC DB. The `SAI_MY_SID_ENTRY_ATTR_NEXT_HOP_ID` of the entry is set to the nexthop ID.
+- If the nexthop does not exist or is not ready, SRv6Orch keeps the SID in a new data structure called `m_pendingSRv6MySIDEntries`.
+
+SRv6Orch subscribes to Neighbor Change notifications. When the neighbor becomes ready, SRv6Orch receives a Neighbor ADD notification, walks through the `m_pendingSRv6MySIDEntries` list, and installs all pending SIDs into the ASIC.
+
+We also handle the case when the neighbor associated with a SID is removed from the system. In this case, SRv6Orch receives a Neighbor DELETE notification, removes the SID from the ASIC, and adds the SID back to `m_pendingSRv6MySIDEntries`. When the neighbor comes back, the SID is installed again into the ASIC.
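+
+This bookkeeping can be summarized with the following illustrative Python sketch (SRv6Orch itself is C++ in sonic-swss; the helpers below are stand-ins mirroring the HLD, not the real orchagent API):
+
+```python
+from collections import defaultdict
+
+installed_sids = set()                           # SIDs programmed in the ASIC
+ready_neighbors = set()                          # adjacencies with a nexthop
+m_pendingSRv6MySIDEntries = defaultdict(list)    # adjacency -> SIDs waiting on it
+
+def on_my_sid_update(sid, adj):
+    if adj in ready_neighbors:                   # stands in for neighOrch->hasNextHop()
+        installed_sids.add(sid)                  # stands in for create_my_sid_entry()
+    else:
+        m_pendingSRv6MySIDEntries[adj].append(sid)
+
+def on_neighbor_add(adj):
+    ready_neighbors.add(adj)
+    for sid in m_pendingSRv6MySIDEntries.pop(adj, []):
+        installed_sids.add(sid)                  # install every SID pending on adj
+
+def on_neighbor_del(adj, sids_using_adj):
+    ready_neighbors.discard(adj)
+    for sid in sids_using_adj:
+        installed_sids.discard(sid)              # remove the SID from the ASIC
+        m_pendingSRv6MySIDEntries[adj].append(sid)   # park until adj returns
+```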
+
+## SAI API
+
+The `SAI_OBJECT_TYPE_MY_SID_ENTRY` object already supports the `SAI_MY_SID_ENTRY_ATTR_NEXT_HOP_ID` attribute required to associate an L3Adj with a SID.
+
+```
+
+/**
+ * @brief Attribute list for My SID
+ */
+typedef enum _sai_my_sid_entry_attr_t
+{
+
+...
+    /**
+     * @brief Next hop for cross-connect functions
+     *
+     * @type sai_object_id_t
+     * @flags CREATE_AND_SET
+     * @objects SAI_OBJECT_TYPE_NEXT_HOP, SAI_OBJECT_TYPE_NEXT_HOP_GROUP, SAI_OBJECT_TYPE_ROUTER_INTERFACE
+     * @allownull true
+     * @default SAI_NULL_OBJECT_ID
+     * @validonly SAI_MY_SID_ENTRY_ATTR_ENDPOINT_BEHAVIOR == SAI_MY_SID_ENTRY_ENDPOINT_BEHAVIOR_X or SAI_MY_SID_ENTRY_ATTR_ENDPOINT_BEHAVIOR == SAI_MY_SID_ENTRY_ENDPOINT_BEHAVIOR_DX4 or SAI_MY_SID_ENTRY_ATTR_ENDPOINT_BEHAVIOR == SAI_MY_SID_ENTRY_ENDPOINT_BEHAVIOR_DX6 or SAI_MY_SID_ENTRY_ATTR_ENDPOINT_BEHAVIOR == SAI_MY_SID_ENTRY_ENDPOINT_BEHAVIOR_B6_ENCAPS or SAI_MY_SID_ENTRY_ATTR_ENDPOINT_BEHAVIOR == SAI_MY_SID_ENTRY_ENDPOINT_BEHAVIOR_B6_ENCAPS_RED or SAI_MY_SID_ENTRY_ATTR_ENDPOINT_BEHAVIOR == SAI_MY_SID_ENTRY_ENDPOINT_BEHAVIOR_B6_INSERT or SAI_MY_SID_ENTRY_ATTR_ENDPOINT_BEHAVIOR == SAI_MY_SID_ENTRY_ENDPOINT_BEHAVIOR_B6_INSERT_RED
+     */
+    SAI_MY_SID_ENTRY_ATTR_NEXT_HOP_ID,
+...
+
+} sai_my_sid_entry_attr_t;
+```
+
+There is no SAI modification required to support the L3Adj parameter.
+
+
+## Testing Requirements/Design
+
+### Unit Test cases
+
+To validate the SRv6 SID L3Adj parameter, we extend the existing `test_mysid` test cases contained in the `test_srv6.py` unit test. We create a new SRv6 SID entry associated with an L3 adjacency in the `SRV6_MY_SID_TABLE` of the Appl DB and we verify that the SID entry is created in the ASIC DB. This test is performed for all the behaviors that require an L3 adjacency: uA, End.X, End.DX4, End.DX6, uDX4, uDX6.
+
+## Open/Action items
+
+The changes proposed in this document depend on the following PRs:
+
+- [[orchagent]: Extend the SRv6Orch to support the programming of the L3Adj](https://github.com/sonic-net/sonic-swss/pull/2902)
+
+## References
+
+- [Segment Routing over IPv6 (SRv6) Network Programming](https://datatracker.ietf.org/doc/html/rfc8986)
+- [Segment Routing over IPv6 (SRv6) HLD](https://github.com/sonic-net/SONiC/blob/master/doc/srv6/srv6_hld.md)
diff --git a/doc/system_health_monitoring/diagrams/System-ready-HLD.drawio b/doc/system_health_monitoring/diagrams/System-ready-HLD.drawio
new file mode 100644
index 00000000000..b8ea4a3ee6c
--- /dev/null
+++ b/doc/system_health_monitoring/diagrams/System-ready-HLD.drawio
@@ -0,0 +1 @@
+[compressed draw.io XML payload omitted - source for the System-ready-HLD diagrams]
\ No newline at end of file
diff --git a/doc/system_health_monitoring/diagrams/system-chart.png b/doc/system_health_monitoring/diagrams/system-chart.png
new file mode 100644
index 00000000000..e28aa94ca0f
Binary files /dev/null and b/doc/system_health_monitoring/diagrams/system-chart.png differ
diff --git a/doc/system_health_monitoring/diagrams/system-ready-disabled-flow.png b/doc/system_health_monitoring/diagrams/system-ready-disabled-flow.png
new file mode 100644
index 00000000000..11cb206d0d5
Binary files /dev/null and b/doc/system_health_monitoring/diagrams/system-ready-disabled-flow.png differ
diff --git a/doc/system_health_monitoring/diagrams/system-ready-ok-flow.png b/doc/system_health_monitoring/diagrams/system-ready-ok-flow.png
new file mode 100644
index 00000000000..5df5bd46b3b
Binary files /dev/null and b/doc/system_health_monitoring/diagrams/system-ready-ok-flow.png differ
diff --git a/doc/system_health_monitoring/diagrams/system-ready-timeout-flow.png b/doc/system_health_monitoring/diagrams/system-ready-timeout-flow.png
new file mode 100644
index 00000000000..6dd99323030
Binary files /dev/null and b/doc/system_health_monitoring/diagrams/system-ready-timeout-flow.png differ
diff --git a/doc/system_health_monitoring/diagrams/system-use-case.png b/doc/system_health_monitoring/diagrams/system-use-case.png
new file mode 100644
index 00000000000..06e2aabe3e5
Binary files /dev/null and b/doc/system_health_monitoring/diagrams/system-use-case.png differ
diff --git a/doc/system_health_monitoring/system-ready-HLD.md b/doc/system_health_monitoring/system-ready-HLD.md
index c55698c1ec8..73076aea982 100644
--- a/doc/system_health_monitoring/system-ready-HLD.md
+++ b/doc/system_health_monitoring/system-ready-HLD.md
@@ -40,26 +40,36 @@
 - [8 Unit Test Cases ](#8-unit-test-cases)
 - [9 References ](#9-references)
 
+# List of Figures
+
+- [Figure 1: System ready system chart](#figure-1-system-ready-system-chart)
+- [Figure 2: System ready use-case diagram](#figure-2-system-ready-use-case-diagram)
+- [Figure 3: System status OK sequence diagram](#figure-3-system-status-ok-sequence-diagram)
+- [Figure 4: System status DOWN sequence diagram](#figure-4-system-status-down-sequence-diagram)
+- [Figure 5: System ready feature disabled flow](#figure-5-system-ready-feature-disabled-flow)
+
 # List of Tables
 
-[Table 1: Abbreviations](#table-1-abbreviations)
+- [Table 1: Abbreviations](#table-1-abbreviations)
 
 # Revision
 
-| Rev | Date | Author | Change Description |
-|:--:|:--------:|:-----------------:|:------------------------------------------------------------:|
-| 0.1 | | Senthil Kumar Guruswamy | Initial version |
-| 0.2 | | Senthil Kumar Guruswamy | Update as per review comments |
-| 0.3 | | Senthil Kumar Guruswamy | Integrate systemready to system-health |
+| Rev | Date | Author | Change Description |
+|:---:|:----------------:|:-----------------------:|:------------------------------------------------------------:|
+| 0.1 | | Senthil Kumar Guruswamy | Initial version |
+| 0.2 | | Senthil Kumar Guruswamy | Update as per review comments |
+| 0.3 | | Senthil Kumar Guruswamy | Integrate systemready to system-health |
+| 0.4 | 16 June 2023 | Yevhen Fastiuk 🇺🇦 | Report host daemons status. System status permanent. System ready admin state |
 
 # Definition/Abbreviation
 
 ### Table 1: Abbreviations
 
-| **Term** | **Meaning** |
-| -------- | ----------------------------------------- |
-| FEATURE | Docker/Service |
-| App | Docker/Service |
+| **Term**    | **Meaning**                                    |
+| ----------- | ---------------------------------------------- |
+| FEATURE     | Docker/Service                                 |
+| App         | Docker/Service                                 |
+| Host daemon | The daemonized application running on the host |
 
 # About this Manual
 
@@ -79,6 +89,8 @@ A new python based System monitor tool is introduced to monitor all the essentia
 This framework gives provision for docker apps to notify its closest up status.
 CLIs are provided to fetch the current system status and also the service running status and its app ready status, along with the failure reason if any.
 This feature will be part of system-health framework.
+![System chart](diagrams/system-chart.png "Figure 1: System ready system chart")
+###### Figure 1: System ready system chart
 
 ## 1.1 Limitation of Existing tools:
 - Monit tool is a poll based approach which monitors the configured services every 1 minute.
@@ -90,7 +102,7 @@ This feature will be part of system-health framework.
 - Event based model where the feedback is immediate
 - Know the overall system status through syslog as well as through CLIs
 - It brings in the concept of application readiness to allow each application/service/docker to declare themselves as ready based on different application specific criteria.
-  - Combatibility with application extension framework.
+  - Compatibility with application extension framework.
 
 SONiC package installation process will register new feature in CONFIG DB.
 Third party dockers (signature verified) get integrated into the sonic os and run similar to the existing dockers, accessing the db etc.
 Now, once the feature is enabled, it becomes part of either sonic.target or multi-user.target, and when it starts, it automatically comes under the system monitor framework watchlist.
@@ -106,15 +118,22 @@ Following requirements are addressed by the design presented in this document:
 1. Identify the list of sonic services to be monitored.
 2. system-health to include the sysmon framework to check the system status of all the service units and receive service state change notifications to declare the system ready status.
 3. Provision for apps to notify their closest up status in STATE DB. This should internally cover the Port ready status. Also support the application extension framework.
-4. Appropriate system ready syslogs to be raised.
-5. New CLI to be introduced to know the current system status all services.
+4. Allow host daemons to report their app's ready status
+5. Appropriate system ready syslogs to be raised.
+6. New CLI to be introduced to know the current system status of all services.
    - "show system-health sysready-status" covers the overall system status.
-6. During the techsupport data collection, the new CLI to be included for debugging.
+7. During the techsupport data collection, the new CLI to be included for debugging.
+8. The feature should have an enable/disable configuration.
+   - By default it is enabled, so it preserves all the behavior described in this document.
+   - In the disabled state it will still report system ready status, but it will wait for only one event: `PortInitDone`.
+9. The feature should respect multi-asic according to [this design](https://github.com/sonic-net/SONiC/blob/master/doc/multi_asic/SONiC_multi_asic_hld.md#2421-systemd-services). If a service is configured to be ignored, or the system ready feature should wait for its app status, wait for all instances of that service.
 
+![System ready use-case diagram](diagrams/system-use-case.png "Figure 2: System ready use-case diagram")
+###### Figure 2: System ready use-case diagram
 
 ## 2.2 Configuration and Management Requirements
 
-This feature will support CLI and no configuration command is provided for any congiruations.
+This feature will support CLI, and one configuration command is provided.
 
 ## 2.3 Scalability Requirements
 
@@ -131,8 +150,8 @@ warmboot-finalizer sonic service to be monitored as part of all services.
 This feature provides a framework to determine the current system status to declare the system is (almost) ready for network traffic.
 System ready is arrived at considering the following factors.
 
-1. All sonic docker services and its UP status(including Portready status)
-2. All sonic host services
+1. Configured sonic docker services and their UP status (including Portready status)
+2. Configured sonic host services
 
 # 4 Feature Design
 
@@ -142,6 +161,18 @@ System ready is arrived at considering the following factors.
 - When sysmonitor daemon boots up, it polls for the service list status once, maintains the operational data in STATE_DB, and publishes the system ready status in the form of syslog as well as in STATE_DB.
 - Subsequently, when any service state changes, sysmonitor gets the event notification for that service to be checked for its status and updates the STATE_DB promptly.
 - Hence the system status is always up-to-date, notified to the user in the form of syslog and STATE_DB updates, and can also be fetched by the appropriate CLIs.
+- Once the system declares the status (`UP`, `DOWN`, or `FAILED`), the applications which were waiting for it can continue execution and take actions according to the received status.
+  - `UP` system status should be considered a healthy system status.
+  - `DOWN` system status means that the required daemon(s) didn't notify their ready status within the timeout period.
+  - `FAILED` system status means that some daemon failed during its execution or a SONiC application reported a `false` `up_status`.
+
+System status OK flow:
+![System status OK sequence diagram](diagrams/system-ready-ok-flow.png "Figure 3: System status OK sequence diagram")
+###### Figure 3: System status OK sequence diagram
+
+System status DOWN (by timeout):
+![System status DOWN sequence diagram](diagrams/system-ready-timeout-flow.png "Figure 4: System status DOWN sequence diagram")
+###### Figure 4: System status DOWN sequence diagram
 
 ## 4.2 Sysmonitor
 
@@ -153,10 +184,14 @@ Sysmonitor is the subtask of system-health service which does the job of checkin
 
 1. subscribe to system dbus
    - With the dbus subscription, any systemd events get notified to this task, and it puts the event in the multiprocessing queue.
 
-2. subscribe to the new FEATURE table in STATE_DB of Redis database
+1. subscribe to the new FEATURE table in STATE_DB of Redis database
    - With the STATE_DB feature table subscription, any input to the FEATURE table gets notified to this task, and it puts the event in the queue.
 
+1. Timeout task
+   - The timeout can be configured through the platform's `system_health_monitoring_config.json` file via the `timeout` field.
+     The system will be declared DOWN once the timeout is reached.
+
-3. Main task
+1. Main task
   - Runs through the polling of all service status checks once, and listens for events in the queue populated by the dbus task and statedb task, to take the appropriate action of checking the specific event unit status and updating the system status in the STATE_DB.
 
 ## 4.3 Service Identification
 
 - It covers the enabled services from FEATURE table of CONFIG_DB.
-- Also, since the idea is to cover only the sonic services but not the general system services, sysmonitor tracks services under "multi-user.target" and "sonic.target"
+- Also, since the idea is to cover only the sonic services but not the general system services, sysmonitor tracks services under "multi-user.target" and "sonic.target". It is also important to track all "generated" systemd services from the `/run/systemd/generator/` folder, such as `ntp-config`, `interfaces-config`, etc. A sketch of this enumeration follows after this list.
 - This covers all the sonic docker services and most of the sonic host services.
+- Additionally, in `system_health_monitoring_config.json` we introduce new fields: `services_to_wait` and `services_to_report_app_status`.
+  - `services_to_wait` - holds the explicit list of services we would like to wait for in order to declare the system ready state.
+    This list shouldn't include the SONiC applications, because it is up to them to specify the effect on system ready by parameterizing the FEATURE table.
+  - `services_to_report_app_status` - some daemons may want to notify their readiness to systemd earlier than their functional readiness.
+    That parameter will hold all services that should notify their app ready state by themselves, using the same mechanism as SONiC applications.
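+
+A hedged illustration of the unit-enumeration rule above (the real logic lives in the system-health sysmonitor; the `systemctl` invocation here is an assumption):
+
+```python
+import subprocess
+
+def candidate_sonic_units():
+    """List .service units reachable from the targets sysmonitor tracks."""
+    units = set()
+    for target in ("multi-user.target", "sonic.target"):
+        out = subprocess.run(
+            ["systemctl", "list-dependencies", "--plain", target],
+            capture_output=True, text=True, check=False).stdout
+        units.update(line.strip() for line in out.splitlines()
+                     if line.strip().endswith(".service"))
+    return units
+```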
 
 ## 4.4 System ready Framework logic
 
@@ -177,14 +216,21 @@ but align the services within framework to flag the status as "Down" if the serv
 
 - For services:
   - Loaded, enabled/enabled-runtime/static, active & running, active & exited state services are considered 'OK'.
   - For active and running services, the up_status marked by the docker app should be True to be considered 'OK'.
-  - Failed state services are considered 'Down'.
+  - Failed state services are considered 'Failed'.
  - Activating state services are considered as 'Starting'.
  - Deactivating state services are considered as 'Stopping'.
  - Inactive state services category:
    - oneshot services are considered as 'OK'.
    - Special services with a condition pathexists check failure are considered as 'OK'.
    - Other services in inactive state are considered to be 'Down'.
+   - Services exited with an error code are considered 'Failed'.
  - Any service type other than oneshot that by default goes to the inactive state needs a RemainAfterExit=yes entry added to its service unit file to be in line with the framework.
+  - Host daemons that marked their status as `true` via the `up_status` field in STATE_DB are considered 'OK'.
+  - Host daemons that marked their status as `false` via the `up_status` field in STATE_DB are considered 'Failed'.
+
+System ready feature disabled flow:
+![System ready feature disabled flow](diagrams/system-ready-disabled-flow.png "Figure 5: System ready feature disabled flow")
+###### Figure 5: System ready feature disabled flow
 
 ## 4.5 Provision for apps to mark closest UP status
 
@@ -197,6 +243,7 @@ In simple, each app is responsible in marking its closest up status in STATE_DB.
 Docker apps marking their UP status in STATE_DB will input an entry in the FEATURE table of CONFIG_DB with the check_up_status flag set to true through a /etc/sonic/init_cfg.json file change.
 Sysmonitor checks for the check_up_status flag in CONFIG_DB before reading the app ready status from STATE_DB. If the flag does not exist or is set to False, then sysmonitor will not read the app ready status but just checks the running status of the service.
+Docker applications can set the `irrel_for_sysready` field in the `FEATURE` table to instruct sysmonitor to ignore the application's status.
 
 For application extension package support,
 a new manifest variable is introduced to control whether "check_up_status" should be true or false, which is also an indication of whether the docker implements marking the up_status flag in STATE_DB.
@@ -210,19 +257,48 @@ a new manifest variable is introduced to control whether "check_up_status" shoul
     "<feature_name>": {
         ...
         "state": "enabled",
-        "check_up_status": "true"
+        "check_up_status": "true",
+        "irrel_for_sysready": "true"
     }
   }
 }
 ```
 
+The feature configuration is controlled by the `sysready_state` field of the `DEVICE_METADATA` table.
+```yang
+module sonic-device_metadata {
+    ...
+
+    container sonic-device_metadata {
+
+        container DEVICE_METADATA {
+
+            description "DEVICE_METADATA part of config_db.json";
+
+            container localhost {
+                ...
+
+                leaf sysready_state {
+                    type stypes:state;
+                }
+            }
+            /* end of container localhost */
+        }
+        /* end of container DEVICE_METADATA */
+    }
+    /* end of top level container */
+}
+/* end of module sonic-device_metadata */
+```
 
 
 ### 4.5.2 STATE_DB Changes
 
 - Docker apps which rely on config can mark 'up_status' to true in STATE_DB when they are ready to receive configs from CONFIG_DB and/or some extra dependencies are met.
 - Respective apps should mark their up_status considering the Port ready status. Hence there is no separate logic check needed by the system monitoring tool.
 - Any docker app which has multiple independent daemons can maintain a separate intermediate key-value in the redis-db for each of the daemons, and the startup script that invokes each of these daemons can determine the status from the redis entries by each daemon and finally update the STATE_DB up_status.
 - Along with up_status, docker apps should update the fail_reason field with an appropriate reason in case of failure, or an empty string in case of success.
 - Also, the update_time field is to be fed in as well, in the format of epoch time.
 
+- Daemon applications mentioned in `services_to_report_app_status` must report their status in the `up_status` field of the `SERVICE_APP` table in `STATE_DB`.
 
 For instance,
 - swss docker app can wait for port init done and wait for Vrfmgr, Intfmgr and Vxlanmgr to be ready before marking its up status.
@@ -232,14 +308,16 @@ For instances,
 
 STATE_DB:
+- For SONiC applications the `<table_name>` is `FEATURE`
+- For daemon applications mentioned in `services_to_report_app_status` the `<table_name>` is `SERVICE_APP`
 ```
-- sonic-db-cli STATE_DB HSET "<table_name>|<app_name>" up_status true
-- sonic-db-cli STATE_DB HSET "<table_name>|<app_name>" fail_reason "<fail_reason>" / ""
-- sonic-db-cli STATE_DB HSET "<table_name>|<app_name>" update_time "<epoch_time>"
 
 - Schema in STATE_DB
   sonic-db-dump -n STATE_DB output
-    "FEATURE|<app_name>": {
+    "<table_name>|<app_name>": {
         "type": "hash",
         "value": {
             "up_status": "true",
@@ -250,7 +328,7 @@ STATE_DB:
        },
 
 - Example:
-    "FEATURE|bgp": {
+    "<table_name>|bgp": {
        "type": "hash",
        "value": {
            "fail_reason": "",
@@ -268,12 +346,36 @@ In addition to this, sysmonitor posts the system status to SYSTEM_READY table in
 
     "SYSTEM_READY|SYSTEM_STATE": {
        "type": "hash",
       "value": {
-           "status": "up"
+           "Status": "UP"
       }
   }
 ```
|bgp": { "type": "hash", "value": { "fail_reason": "", @@ -268,12 +346,36 @@ In addition to this, sysmonitor posts the system status to SYSTEM_READY table in "SYSTEM_READY|SYSTEM_STATE": { "type": "hash", "value": { - "status": "up" + "Status": "UP" } } ``` -### 4.5.3 Feature yang Changes +### 4.5.3 Health configuration file changes +As it was mentioned before that feature will use `system_health_monitoring_config.json` file as configuration. +The example of that file is here: +```json +{ + "services_to_ignore": ["rsyslog", "syncd", "redis", "orchagent", "portsyncd", "portmgrd", "pmon"], + "services_to_wait": ["ntp-config", "interfaces-config", "hostcfgd"], + "services_to_report_app_status": ["hostcfgd"], + "timeout": 10, + "devices_to_ignore": [], + "user_defined_checkers": [], + "polling_interval": 3, + "led_color": { + "fault": "orange", + "normal": "green", + "booting": "orange_blink" + } +} +``` +- `services_to_ignore` - is used to filter services we don't want to wait for +- `services_to_wait` - is explicit list of services we would like to wait for +- `services_to_report_app_status` - the list of services which must report their status in order to declare system ready +- `timeout` - the timeout after which sysmonitor will consider the system is `DOWN` + +### 4.5.4 Feature yang Changes Following field is added to the sonic-feature.yang file. diff --git a/doc/wol/Wake-on-LAN-HLD.md b/doc/wol/Wake-on-LAN-HLD.md new file mode 100644 index 00000000000..4fa10153734 --- /dev/null +++ b/doc/wol/Wake-on-LAN-HLD.md @@ -0,0 +1,174 @@ +# Wake-on-LAN in SONiC + +## Table of Content + +- [Overview](#overview) +- [Background](#background) +- [Components](#components) +- [Magic Packet](#magic-packet) +- [CLI Design](#cli-design) +- [gNOI Design](#gnoi-design) +- [Test Plan](#test-plan) +- [Reference](#reference) + +## Revision + +| Revision | Date | Author | Change Description | +| -------- | ---------- | ---------- | ------------------ | +| 1.0 | Nov 7 2023 | Zhijian Li | Initial proposal | + +## Definitions/Abbreviations + +| Abbreviation | Definition | +|--------------|---------------------------------------------------| +| WoL | **W**ake-**o**n-**L**AN | +| gNOI | **g**RPC **N**etwork **O**perations **I**nterface | +| NIC | **N**etwork **I**nterface **C**ontroller | + +## Overview + +**Wake-on-LAN** (**WoL** or **WOL**) is an Ethernet or Token Ring computer networking standard that allows a computer to be turned on or awakened from sleep mode by a network message[^1]. This document describes the Wake-on-LAN feature design in SONiC. + +## Background + +Below diagram describes a common usage of WoL on SONiC switch, this HLD will focus on the green part: + +![WoL Background](./img/background.png) + +## Components + +### `wol` CLI script in sonic-utilities + +A `wol` script will be introduced in [sonic-utilities](https://github.com/sonic-net/sonic-utilities). The workflow of command line utility `wol` is: + +0. User login to the SONiC switch and enter the `wol` command. +1. The `wol` script send magic packet to specific interface, VLAN or port-channel. + +![Component Utilities](./img/component-utilities.png) + +### gNOI service + +A new gNOI service `SONiCWolService` will be implemented in sonic-gnmi container. The workflow is: + +0. User initialize a gNOI Client to communicate with gNOI server. +1. The gNOI Client call the RPC function `SONiCWolService.Wol`. +2. The gNOI server send a D-Bus request to sonic-host-service[^2]. +3. sonic-host-service call `wol` CLI to send the magic packet. 
+
+## Magic Packet
+
+A **Magic Packet** is an Ethernet frame with the following structure:
+
+* **Ethernet Frame Header**:
+  * **Destination MAC address**: broadcast MAC address (ff:ff:ff:ff:ff:ff) or the target device's MAC address. [6 bytes]
+  * **Source MAC address**. [6 bytes]
+  * **EtherType**: `0x0842`. [2 bytes]
+* **Ethernet Frame Payload**:
+  * Six bytes of all `0xff`. [6 bytes]
+  * Sixteen repetitions of the target device's MAC address. [96 bytes]
+  * (Optional) A four or six byte password. [4 or 6 bytes]
+
+```
+Byte     |
+Offset   |0     |1     |2     |3     |4     |5     |
+---------+------+------+------+------+------+------+
+    0    |             Destination MAC             |
+         +------+------+------+------+------+------+  ETHERNET FRAME
+    6    |               Source MAC                |  HEADER
+         +------+------+---------------------------+
+   12    | 0x08 | 0x42 |  \    \    \    \
+         +------+------+------+------+------+------+------------------------------
+   14    | 0xFF | 0xFF | 0xFF | 0xFF | 0xFF | 0xFF |
+         +------+------+------+------+------+------+  ETHERNET FRAME
+   20    | Target MAC (repeat 16 times, 96 bytes)  |  PAYLOAD
+         +-----------------------------------------+
+  116    |    Password (optional, 4 or 6 bytes)    |
+         +-----------------------------------------+
+```
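+
+To make the layout above concrete, here is a minimal Python sketch that assembles and transmits such a frame. It is an illustration only, not the actual sonic-utilities implementation; the helper names, the raw `AF_PACKET` socket, and the root-privilege requirement are assumptions:
+
+```python
+import socket
+
+def build_payload(target_mac, password=b""):
+    """Magic packet payload: 6 x 0xff, then 16 repetitions of the target MAC."""
+    mac = bytes.fromhex(target_mac.replace(":", ""))  # 6 bytes
+    return b"\xff" * 6 + mac * 16 + password          # 102 bytes (+4 or +6)
+
+def send_magic_packet(interface, target_mac, broadcast=False):
+    """Send one magic packet out of `interface` (Linux AF_PACKET, needs root)."""
+    dst = b"\xff" * 6 if broadcast else bytes.fromhex(target_mac.replace(":", ""))
+    with socket.socket(socket.AF_PACKET, socket.SOCK_RAW) as sock:
+        sock.bind((interface, 0))
+        src = sock.getsockname()[4][:6]  # MAC address of the egress interface
+        frame = dst + src + b"\x08\x42" + build_payload(target_mac)  # EtherType 0x0842
+        sock.send(frame)
+```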
+
+## CLI Design
+
+The `wol` command is used to send a magic packet to the target device.
+
+### Usage
+
+```
+wol <interface> <target_mac> [-b] [-p password] [-c count] [-i interval]
+```
+
+- `interface`: SONiC interface name.
+- `target_mac`: a comma-separated list of target devices' MAC addresses.
+- `-b`: Use the broadcast MAC address instead of the target device's MAC address as the **Destination MAC Address in the Ethernet Frame Header**.
+- `-p password`: An optional 4 or 6 byte password, in ethernet hex format or quad-dotted decimal[^3].
+- `-c count`: For each target MAC address, the `count` of magic packets to send. `count` must be between 1 and 5. The default value is 1. This parameter must be used together with `-i`.
+- `-i interval`: Wait `interval` milliseconds between sending each magic packet. `interval` must be between 0 and 2000. The default value is 0. This parameter must be used together with `-c`.
+
+### Example
+
+```
+admin@sonic:~$ wol Ethernet10 00:11:22:33:44:55
+admin@sonic:~$ wol Ethernet10 00:11:22:33:44:55 -b
+admin@sonic:~$ wol Vlan1000 00:11:22:33:44:55,11:33:55:77:99:bb -p 00:22:44:66:88:aa
+admin@sonic:~$ wol Vlan1000 00:11:22:33:44:55,11:33:55:77:99:bb -p 192.168.1.1 -c 3 -i 2000
+```
+
+The fourth example specifies two target MAC addresses and a `count` of 3, so it sends six magic packets in total.
+
+## gNOI Design
+
+The gNOI service `SONiCWolService` will provide a `Wol` RPC to the user, which can be called to send a magic packet to the target device:
+
+```proto
+syntax = "proto3";
+
+package gnoi.sonic_wol;
+
+//option (types.gnoi_version) = "0.1.0";
+import "github.com/gogo/protobuf/gogoproto/gogo.proto";
+
+option (gogoproto.marshaler_all) = true;
+option (gogoproto.unmarshaler_all) = true;
+
+service SonicWolService {
+  rpc Wol(WolRequest) returns (WolResponse) {}
+}
+
+message WolRequest {
+  string interface = 1;            // SONiC interface name
+  repeated string target_mac = 2;  // Target device's MAC addresses
+  optional bool broadcast = 3;     // Default false
+  optional string password = 4;    // In ethernet hex format or quad-dotted decimal
+  optional int32 count = 5;        // For each target MAC address, the count of magic packets to send. Must be used together with interval.
+  optional int32 interval = 6;     // Wait interval milliseconds between sending each magic packet. Must be used together with count.
+}
+
+message WolResponse {
+  SonicOutput output = 1;
+}
+```
+
+## Test Plan
+
+### Unit Tests for `wol` CLI
+| Case Description | Expected Result |
+| :- | :- |
+| Input a valid SONiC interface name. | Parameter validation passes; magic packet is sent |
+| Input an invalid SONiC interface name. | Parameter validation fails |
+| Input a valid SONiC interface name, but the interface status is not `up`. | Return `Error: interface not up` |
+| Input `target_mac` or `password` with an invalid format | Parameter validation fails |
+| Input valid `count` and `interval`. | Parameter validation passes; magic packet is sent |
+| Input a `count` or `interval` value that is out of range. | Parameter validation fails |
+| Params `count` and `interval` do not appear in the input together. | Parameter validation fails |
+| Mock a magic packet send failure (e.g., socket error) | Return a user-friendly error message |
+
+### Functional Test
+
+The functional test plan will be published in [sonic-net/sonic-mgmt](https://github.com/sonic-net/sonic-mgmt).
+
+## Reference
+
+[^1]: [Wake-on-LAN - Wikipedia](https://en.wikipedia.org/wiki/Wake-on-LAN)
+[^2]: [Docker to Host communication.md - sonic-net/SONiC](https://github.com/sonic-net/SONiC/blob/master/doc/mgmt/Docker%20to%20Host%20communication.md)
+[^3]: [Dot-decimal notation - Wikipedia](https://en.wikipedia.org/wiki/Dot-decimal_notation)
diff --git a/doc/wol/img/background.png b/doc/wol/img/background.png
new file mode 100644
index 00000000000..cdd29f9f74c
Binary files /dev/null and b/doc/wol/img/background.png differ
diff --git a/doc/wol/img/component-gnoi.png b/doc/wol/img/component-gnoi.png
new file mode 100644
index 00000000000..6aa68c9742b
Binary files /dev/null and b/doc/wol/img/component-gnoi.png differ
diff --git a/doc/wol/img/component-utilities.png b/doc/wol/img/component-utilities.png
new file mode 100644
index 00000000000..cb97b341236
Binary files /dev/null and b/doc/wol/img/component-utilities.png differ
diff --git a/doc/xrcvd/Interface-Link-bring-up-sequence-on-sff-modules.md b/doc/xrcvd/Interface-Link-bring-up-sequence-on-sff-modules.md
new file mode 100644
index 00000000000..c28ffe55e2b
--- /dev/null
+++ b/doc/xrcvd/Interface-Link-bring-up-sequence-on-sff-modules.md
@@ -0,0 +1,181 @@
+# Feature Name
+Deterministic Approach for Interface Link bring-up sequence on SFF compliant modules
+
+# High Level Design Document
+#### Rev 0.1
+
+# Table of Contents
+ * [List of Tables](#list-of-tables)
+ * [Revision](#revision)
+ * [About This Manual](#about-this-manual)
+ * [Abbreviation](#abbreviation)
+ * [References](#references)
+ * [Problem Definition](#problem-definition)
+ * [Background](#background)
+ * [Objective](#objective)
+ * [Plan](#plan)
+ * [Breakout handling](#breakout-handling)
+ * [Feature enablement](#feature-enablement)
+ * [Pre-requisite](#pre-requisite)
+ * [Proposed Work-Flows](#proposed-work-flows)
+
+# List of Tables
+ * [Table 1: Definitions](#table-1-definitions)
+ * [Table 2: References](#table-2-references)
+
+# Revision
+| Rev | Date | Author | Change Description |
+|:---:|:-----------:|:----------------------------------:|-------------------------------------|
+| 0.1 | 07/12/2023 | Longyin Huang | Initial version |
+
+
+# About this Manual
+This is a high-level design document describing the need for a deterministic approach to the interface link bring-up sequence on SFF compliant modules, and the workflows for use-cases around it.
+
+The parent HLD
[Interface-Link-bring-up-sequence.md](https://github.com/sonic-net/SONiC/blob/master/doc/sfp-cmis/Interface-Link-bring-up-sequence.md) focuses on the generic high-level background, idea and details for CMIS modules, while this HLD covers the details for SFF modules.
+
+# Abbreviation
+
+# Table 1: Definitions
+| **Term** | **Definition** |
+| -------------- | ------------------------------------------------ |
+| pmon | Platform Monitoring Service |
+| xcvr | Transceiver |
+| xcvrd | Transceiver Daemon |
+| gbsyncd | Gearbox (External PHY) docker container |
+
+# References
+
+# Table 2: References
+
+| **Document** | **Location** |
+|---------------------------------------------------------|---------------|
+| Deterministic Approach for Interface Link bring-up sequence for CMIS and SFF modules | [Interface-Link-bring-up-sequence.md](https://github.com/sonic-net/SONiC/blob/master/doc/sfp-cmis/Interface-Link-bring-up-sequence.md) |
+
+
+
+# Problem Definition
+According to the parent [HLD](https://github.com/sonic-net/SONiC/blob/master/doc/sfp-cmis/Interface-Link-bring-up-sequence.md#plan), as already discussed with the sonic-chassis workgroup and the OCP community:
+
+1. Presently in SONiC, for SFF compliant modules (100G/40G), there is no synchronization between enabling the optical module Tx and enabling the ASIC (NPU/PHY) Tx, which may cause link instability during administrative interface enable (“config interface startup Ethernet”) configuration and bootup scenarios.
+
+2. During administrative interface disable (“config interface shutdown Ethernet”), only the ASIC (NPU) Tx is disabled and not the optical module Tx/laser.
+   This leads to power wastage, unnecessary fan power consumption to keep the module temperature in the operating range, and a potential lab hazard when the port is shut off but the laser is still on.
+
+# Background
+
+ Refer to the parent [HLD](https://github.com/sonic-net/SONiC/blob/master/doc/sfp-cmis/Interface-Link-bring-up-sequence.md#background)
+
+# Objective
+
+According to the parent [HLD](https://github.com/sonic-net/SONiC/blob/master/doc/sfp-cmis/Interface-Link-bring-up-sequence.md#objective), have a deterministic approach for the interface link bring-up sequence for SFF compliant modules (100G/40G), i.e., the sequence below is to be followed:
+ 1. Initialize and enable the NPU Tx and Rx path
+ 2. For systems with an 'external' PHY: initialize and enable PHY Tx and Rx on both line and host sides; ensure the host side link is up
+ 3. Then perform optics Tx enable
+
+# Plan
+
+The plan is to follow this high-level work-flow sequence to accomplish the objective:
+- Add a new SFF task manager thread (called sff_mgr) inside xcvrd that subscribes to the existing “host_tx_ready” field in the port table of STATE_DB
+- “host_tx_ready” is set to true only when admin_status is up and setting admin_status to syncd/gbsyncd is successful. (As part of setting admin_status to syncd/gbsyncd successfully, the NPU/PHY Tx is enabled/disabled)
+- sff_mgr processes the “host_tx_ready” value change event and enables/disables optics Tx using the tx_disable API
+
+# Breakout Handling
+
+Refer to the parent [HLD](https://github.com/sonic-net/SONiC/blob/master/doc/sfp-cmis/Interface-Link-bring-up-sequence.md#breakout-handling)
+
+# Feature enablement
+
+ This feature (optics interface link bring-up sequence) would be enabled on a per-platform basis.
+ There could be cases where vendor(s)/platform(s) may take time to shift from the existing codebase to the model (work-flows) described in this document.
+- By default, the sff_mgr feature is disabled.
+- In order to enable the sff_mgr feature, the platform would set ‘enable_xcvrd_sff_mgr’ to ‘true’ in its respective pmon_daemon_control.json (see the sketch below). Xcvrd would parse ‘enable_xcvrd_sff_mgr’ and, if it is found 'true', launch the SFF task manager (sff_mgr).
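+
+For illustration, a hypothetical fragment of a platform's pmon_daemon_control.json with the feature turned on; the `enable_xcvrd_sff_mgr` key comes from this HLD, while the other key is only an example of what such a file may contain:
+
+```json
+{
+    "skip_ledd": true,
+    "enable_xcvrd_sff_mgr": true
+}
+```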
+
+# Pre-requisite
+
+In addition to the parent HLD's [pre-requisite](https://github.com/sonic-net/SONiC/blob/master/doc/sfp-cmis/Interface-Link-bring-up-sequence.md#pre-requisite),
+
+> **_Pre-requisite for enabling sff_mgr:_**
+The platform needs to leave the transceiver (if capable of disabling Tx) in Tx disabled state when a module is inserted or during boot-up. This is to make sure the transceiver is not transmitting with Tx enabled before host_tx_ready is True.
+
+# Proposed Work-Flows
+
+ - ### Flow of pre-requisite for platform in insertion/bootup cases
+   ```mermaid
+   graph TD;
+   A[platform brings module out of RESET]
+   B[platform keeps module in Tx disabled state immediately after module out-of-RESET]
+   C[xcvrd detects module insertion via platform API get_transceiver_change_event, and updates module status/info in the DB]
+   D[Upon module insertion event, sff_mgr takes action accordingly if needed]
+
+   Start --> A
+   A --> B
+   B --> C
+   C --> D
+   D --> End
+   ```
+ - ### Feature enablement flow -- how xcvrd spawns the sff_mgr thread based on the enable_xcvrd_sff_mgr flag
+   ```mermaid
+   graph TD;
+   A[wait for PortConfigDone]
+   B[check if enable_xcvrd_sff_mgr flag exists and is set to true]
+   C[spawn sff_mgr]
+   D[proceed to other thread spawning and tasks]
+
+   Start --> A
+   A --> B
+   B -- true --> C
+   C --> D
+   B -- false --> D
+   D --> End
+   ```
+ - ### Flow of calculating the target tx_disable value:
+   - When ```tx_disable value/status``` is ```True```, it means Tx is disabled
+   - When ```tx_disable value/status``` is ```False```, it means Tx is enabled
+   - In short: target tx_disable = NOT (host_tx_ready is True AND admin_status is UP)
+   ```mermaid
+   graph TD;
+
+   A[check if both host_tx_ready is True AND admin_status is UP]
+   B[target tx_disable value is set to False, Tx should be turned ON]
+   C[target tx_disable value is set to True, Tx should be turned OFF]
+
+   Start --> A
+   A -- true --> B
+   A -- false --> C
+   B --> End
+   C --> End
+   ```
+ - ### Main flow of sff_mgr, covering the cases below:
+   - system bootup
+   - transceiver insertion
+   - admin enable/disable configurations
+   ```mermaid
+   graph TD;
+   A[subscribe to events]
+   B[while task_stopping_event is not set]
+   C[check insertion event, host_tx_ready change event and admin_status change event for each intended port]
+   D[double check if module is present]
+   E[fetch DB and update host_tx_ready value in local cache, if not available locally]
+   E2[fetch DB and update admin_status value in local cache, if not available locally]
+   F[calculate target tx_disable value based on host_tx_ready and admin_status]
+   G[check if tx_disable status on module is already the target value]
+   H[go ahead to enable/disable Tx based on the target tx_disable value]
+
+   Start --> A
+   A --> B
+   B -- true --> C
+   C -- if either event happened --> E
+   C -- if neither event happened --> B
+   E --> E2
+   E2 --> D
+   D -- true --> F
+   D -- false --> B
+   F --> G
+   G -- true --> B
+   G -- false --> H
+   H --> B
+   B -- false --> End
+   ```
+
+# Out of Scope
+Refer to the parent [HLD](https://github.com/sonic-net/SONiC/blob/master/doc/sfp-cmis/Interface-Link-bring-up-sequence.md)
diff --git a/supported_devices_platforms_md.sh b/supported_devices_platforms_md.sh
index dc748d0d1c8..89dcb167db9 100644
--- a/supported_devices_platforms_md.sh
+++ b/supported_devices_platforms_md.sh
@@ -176,24 +176,28 @@ echo "| 99 | Nvidia | SN3420 | Nvidia | Spectrum
2 | 48 echo "| 100 | Nvidia | SN3700 | Nvidia | Spectrum 2 | 32x200G | [SONiC-ONIE-Mellanox]($(echo "${ARTF_MLNX}" | sed 's/format=zip/format=file\&subpath=\/target\/sonic-mellanox.bin/')) |" >> supported_devices_platforms.md echo "| 101 | Nvidia | SN3700C | Nvidia | Spectrum 2 | 32x100G | [SONiC-ONIE-Mellanox]($(echo "${ARTF_MLNX}" | sed 's/format=zip/format=file\&subpath=\/target\/sonic-mellanox.bin/')) |" >> supported_devices_platforms.md echo "| 102 | Nvidia | SN3800 | Nvidia | Spectrum 2 | 64x100G | [SONiC-ONIE-Mellanox]($(echo "${ARTF_MLNX}" | sed 's/format=zip/format=file\&subpath=\/target\/sonic-mellanox.bin/')) |" >> supported_devices_platforms.md -echo "| 103 | Nvidia | SN4600C | Nvidia | Spectrum 3 | 64x100G | [SONiC-ONIE-Mellanox]($(echo "${ARTF_MLNX}" | sed 's/format=zip/format=file\&subpath=\/target\/sonic-mellanox.bin/')) |" >> supported_devices_platforms.md -echo "| 104 | Nvidia | SN4700 | Nvidia | Spectrum 3 | 32x400G | [SONiC-ONIE-Mellanox]($(echo "${ARTF_MLNX}" | sed 's/format=zip/format=file\&subpath=\/target\/sonic-mellanox.bin/')) |" >> supported_devices_platforms.md -echo "| 105 | Pegatron | Porsche | Nephos | Taurus | 48x25G + 6x100G | [SONiC-ONIE-Nephos]($(echo "${ARTF_NPH}" | sed 's/format=zip/format=file\&subpath=\/target\/sonic-nephos.bin/')) |" >> supported_devices_platforms.md -echo "| 106 | Quanta | T3032-IX7 | Broadcom | Trident 3 | 32x100G | [SONiC-ONIE-Broadcom]($(echo "${ARTF_BRCM}" | sed 's/format=zip/format=file\&subpath=\/target\/sonic-broadcom.bin/')) |" >> supported_devices_platforms.md -echo "| 107 | Quanta | T4048-IX8 | Broadcom | Trident 3 | 48x25G + 8x100G | [SONiC-ONIE-Broadcom]($(echo "${ARTF_BRCM}" | sed 's/format=zip/format=file\&subpath=\/target\/sonic-broadcom.bin/')) |" >> supported_devices_platforms.md -echo "| 108 | Quanta | T4048-IX8C | Broadcom | Trident 3 | 48x25G + 8x100G | [SONiC-ONIE-Broadcom]($(echo "${ARTF_BRCM}" | sed 's/format=zip/format=file\&subpath=\/target\/sonic-broadcom.bin/')) |" >> supported_devices_platforms.md -echo "| 109 | Quanta | T7032-IX1B | Broadcom | Tomahawk | 32x100G | [SONiC-ONIE-Broadcom]($(echo "${ARTF_BRCM}" | sed 's/format=zip/format=file\&subpath=\/target\/sonic-broadcom.bin/')) |" >> supported_devices_platforms.md -echo "| 110 | Quanta | T9032-IX9 | Broadcom | Tomahawk 3 | 32x400G | [SONiC-ONIE-Broadcom]($(echo "${ARTF_BRCM}" | sed 's/format=zip/format=file\&subpath=\/target\/sonic-broadcom.bin/')) |" >> supported_devices_platforms.md -echo "| 111 | Ragile | RA-B6510-48V8C | Broadcom | Trident 3 | 48x25G+8x100G | [SONiC-ONIE-Broadcom]($(echo "${ARTF_BRCM}" | sed 's/format=zip/format=file\&subpath=\/target\/sonic-broadcom.bin/')) |" >> supported_devices_platforms.md -echo "| 112 | Ragile | RA-B6910-64C | Broadcom | Tomahawk 2 | 64x100G | [SONiC-ONIE-Broadcom]($(echo "${ARTF_BRCM}" | sed 's/format=zip/format=file\&subpath=\/target\/sonic-broadcom.bin/')) |" >> supported_devices_platforms.md -echo "| 113 | Ragile | RA-B6510-32C | Broadcom | Trident 3 | 32x100G | [SONiC-ONIE-Broadcom]($(echo "${ARTF_BRCM}" | sed 's/format=zip/format=file\&subpath=\/target\/sonic-broadcom.bin/')) |" >> supported_devices_platforms.md -echo "| 114 | Ragile | RA-B6920-4S | Broadcom | Tomahawk 3 | 128x100G | [SONiC-ONIE-Broadcom]($(echo "${ARTF_BRCM}" | sed 's/format=zip/format=file\&subpath=\/target\/sonic-broadcom.bin/')) |" >> supported_devices_platforms.md -echo "| 115 | Ragile | RA-B6010-48GT4X | Centec | Centec | 48x1G+4x10G | [SONiC-ONIE-Centec]($(echo "${ARTF_CTC}" | sed 
's/format=zip/format=file\&subpath=\/target\/sonic-centec.bin/')) |" >> supported_devices_platforms.md -echo "| 116 | Tencent | TCS8400-24CC8CD | Broadcom | Trident 4 | 24x200G + 8x400G | [SONiC-ONIE-Broadcom]($(echo "${ARTF_BRCM}" | sed 's/format=zip/format=file\&subpath=\/target\/sonic-broadcom.bin/')) |" >> supported_devices_platforms.md -echo "| 117 | Tencent | TCS9400-128CC | Broadcom | Tomahawk 4 | 128x200G | [SONiC-ONIE-Broadcom]($(echo "${ARTF_BRCM}" | sed 's/format=zip/format=file\&subpath=\/target\/sonic-broadcom.bin/')) |" >> supported_devices_platforms.md -echo "| 118 | Wistron | sw-to3200k | Marvell | Teralynx 7 | 32x400G |[SONiC-ONIE-Innovium]($(echo "${ARTF_INNO}" | sed 's/format=zip/format=file\&subpath=\/target\/sonic-innovium-dbg.bin/')) |" >> supported_devices_platforms.md -echo "| 119 | Wistron | 6512-32r | Marvell | Teralynx 7 | 32x400G |[SONiC-ONIE-Innovium]($(echo "${ARTF_INNO}" | sed 's/format=zip/format=file\&subpath=\/target\/sonic-innovium-dbg.bin/')) |" >> supported_devices_platforms.md -echo "| 120 | Wnc | OSW1800 | Intel | Tofino | 48x25G + 6x100G | [SONiC-ONIE-Barefoot]($(echo "${ARTF_BFT}" | sed 's/format=zip/format=file\&subpath=\/target\/sonic-barefoot.bin/')) |" >> supported_devices_platforms.md +echo "| 103 | Nvidia | SN4410 | Nvidia | Spectrum 3 | 48x100 + 8x400 | [SONiC-ONIE-Mellanox]($(echo "${ARTF_MLNX}" | sed 's/format=zip/format=file\&subpath=\/target\/sonic-mellanox.bin/')) |" >> supported_devices_platforms.md +echo "| 104 | Nvidia | SN4600C | Nvidia | Spectrum 3 | 64x100G | [SONiC-ONIE-Mellanox]($(echo "${ARTF_MLNX}" | sed 's/format=zip/format=file\&subpath=\/target\/sonic-mellanox.bin/')) |" >> supported_devices_platforms.md +echo "| 105 | Nvidia | SN4600V | Nvidia | Spectrum 3 | 64x200G | [SONiC-ONIE-Mellanox]($(echo "${ARTF_MLNX}" | sed 's/format=zip/format=file\&subpath=\/target\/sonic-mellanox.bin/')) |" >> supported_devices_platforms.md +echo "| 106 | Nvidia | SN4700 | Nvidia | Spectrum 3 | 32x400G | [SONiC-ONIE-Mellanox]($(echo "${ARTF_MLNX}" | sed 's/format=zip/format=file\&subpath=\/target\/sonic-mellanox.bin/')) |" >> supported_devices_platforms.md +echo "| 107 | Nvidia | SN5600 | Nvidia | Spectrum 4 | 64x800G | [SONiC-ONIE-Mellanox]($(echo "${ARTF_MLNX}" | sed 's/format=zip/format=file\&subpath=\/target\/sonic-mellanox.bin/')) |" >> supported_devices_platforms.md +echo "| 108 | Pegatron | Porsche | Nephos | Taurus | 48x25G + 6x100G | [SONiC-ONIE-Nephos]($(echo "${ARTF_NPH}" | sed 's/format=zip/format=file\&subpath=\/target\/sonic-nephos.bin/')) |" >> supported_devices_platforms.md +echo "| 109 | Quanta | T3032-IX7 | Broadcom | Trident 3 | 32x100G | [SONiC-ONIE-Broadcom]($(echo "${ARTF_BRCM}" | sed 's/format=zip/format=file\&subpath=\/target\/sonic-broadcom.bin/')) |" >> supported_devices_platforms.md +echo "| 110 | Quanta | T4048-IX8 | Broadcom | Trident 3 | 48x25G + 8x100G | [SONiC-ONIE-Broadcom]($(echo "${ARTF_BRCM}" | sed 's/format=zip/format=file\&subpath=\/target\/sonic-broadcom.bin/')) |" >> supported_devices_platforms.md +echo "| 111 | Quanta | T4048-IX8C | Broadcom | Trident 3 | 48x25G + 8x100G | [SONiC-ONIE-Broadcom]($(echo "${ARTF_BRCM}" | sed 's/format=zip/format=file\&subpath=\/target\/sonic-broadcom.bin/')) |" >> supported_devices_platforms.md +echo "| 112 | Quanta | T7032-IX1B | Broadcom | Tomahawk | 32x100G | [SONiC-ONIE-Broadcom]($(echo "${ARTF_BRCM}" | sed 's/format=zip/format=file\&subpath=\/target\/sonic-broadcom.bin/')) |" >> supported_devices_platforms.md +echo "| 113 | Quanta | T9032-IX9 | Broadcom | Tomahawk 3 | 
32x400G | [SONiC-ONIE-Broadcom]($(echo "${ARTF_BRCM}" | sed 's/format=zip/format=file\&subpath=\/target\/sonic-broadcom.bin/')) |" >> supported_devices_platforms.md +echo "| 114 | Ragile | RA-B6510-48V8C | Broadcom | Trident 3 | 48x25G+8x100G | [SONiC-ONIE-Broadcom]($(echo "${ARTF_BRCM}" | sed 's/format=zip/format=file\&subpath=\/target\/sonic-broadcom.bin/')) |" >> supported_devices_platforms.md +echo "| 115 | Ragile | RA-B6910-64C | Broadcom | Tomahawk 2 | 64x100G | [SONiC-ONIE-Broadcom]($(echo "${ARTF_BRCM}" | sed 's/format=zip/format=file\&subpath=\/target\/sonic-broadcom.bin/')) |" >> supported_devices_platforms.md +echo "| 116 | Ragile | RA-B6510-32C | Broadcom | Trident 3 | 32x100G | [SONiC-ONIE-Broadcom]($(echo "${ARTF_BRCM}" | sed 's/format=zip/format=file\&subpath=\/target\/sonic-broadcom.bin/')) |" >> supported_devices_platforms.md +echo "| 117 | Ragile | RA-B6920-4S | Broadcom | Tomahawk 3 | 128x100G | [SONiC-ONIE-Broadcom]($(echo "${ARTF_BRCM}" | sed 's/format=zip/format=file\&subpath=\/target\/sonic-broadcom.bin/')) |" >> supported_devices_platforms.md +echo "| 118 | Ragile | RA-B6010-48GT4X | Centec | Centec | 48x1G+4x10G | [SONiC-ONIE-Centec]($(echo "${ARTF_CTC}" | sed 's/format=zip/format=file\&subpath=\/target\/sonic-centec.bin/')) |" >> supported_devices_platforms.md +echo "| 119 | Ruijie | B6510-48VS8CQ | Broadcom | Trident 3 | 48x25G + 8x100G | [SONiC-ONIE-Broadcom]($(echo "${ARTF_BRCM}" | sed 's/format=zip/format=file\&subpath=\/target\/sonic-broadcom.bin/')) |" >> supported_devices_platforms.md +echo "| 120 | Tencent | TCS8400-24CC8CD | Broadcom | Trident 4 | 24x200G + 8x400G | [SONiC-ONIE-Broadcom]($(echo "${ARTF_BRCM}" | sed 's/format=zip/format=file\&subpath=\/target\/sonic-broadcom.bin/')) |" >> supported_devices_platforms.md +echo "| 121 | Tencent | TCS9400-128CC | Broadcom | Tomahawk 4 | 128x200G | [SONiC-ONIE-Broadcom]($(echo "${ARTF_BRCM}" | sed 's/format=zip/format=file\&subpath=\/target\/sonic-broadcom.bin/')) |" >> supported_devices_platforms.md +echo "| 122 | Wistron | sw-to3200k | Marvell | Teralynx 7 | 32x400G |[SONiC-ONIE-Innovium]($(echo "${ARTF_INNO}" | sed 's/format=zip/format=file\&subpath=\/target\/sonic-innovium-dbg.bin/')) |" >> supported_devices_platforms.md +echo "| 123 | Wistron | 6512-32r | Marvell | Teralynx 7 | 32x400G |[SONiC-ONIE-Innovium]($(echo "${ARTF_INNO}" | sed 's/format=zip/format=file\&subpath=\/target\/sonic-innovium-dbg.bin/')) |" >> supported_devices_platforms.md +echo "| 124 | Wnc | OSW1800 | Intel | Tofino | 48x25G + 6x100G | [SONiC-ONIE-Barefoot]($(echo "${ARTF_BFT}" | sed 's/format=zip/format=file\&subpath=\/target\/sonic-barefoot.bin/')) |" >> supported_devices_platforms.md echo "#### Note- Dell S6000-ON* supports the build till 202012 release image. No other image after 202012 release works on S6000." >> supported_devices_platforms.md echo "#### [click here](https://sonic-build.azurewebsites.net/ui/sonic/Pipelines) for previous builds. " >> supported_devices_platforms.md diff --git a/tsc/frr/frr.png b/tsc/frr/frr.png new file mode 100644 index 00000000000..66fe10949d6 Binary files /dev/null and b/tsc/frr/frr.png differ diff --git a/tsc/frr/sonic_frr_update_process.md b/tsc/frr/sonic_frr_update_process.md new file mode 100644 index 00000000000..03a6832caf9 --- /dev/null +++ b/tsc/frr/sonic_frr_update_process.md @@ -0,0 +1,80 @@ +# **SONiC Community FRR Upgrade Process Proposal** + +FRR upgrade and patch cadence is largely on need basis in the current SONiC release program. 
This SONiC FRR upgrade process proposal formalizes the FRR upgrade cadence and process for future SONiC release programs.
+
+# SONiC FRR Version Upgrade History
+
+

+ +

+
+
+# FRR Project Release Cadence
+- FRR release numbering scheme: x.y.z-s#
+- New FRR releases roughly every 4 months
+  - Apr 2023 - Release 8.5.1 (48 fixes)
+  - Mar 2023 - Release 8.5 (947+ commits)
+  - Jan 2023 - Release 8.4.2 (22 fixes)
+  - Nov 2022 - Release 8.4.1 (16 fixes)
+  - Nov 2022 - Release 8.4 (700+ commits)
+  - Aug 2022 - Release 8.3.1 (14 fixes)
+  - Jul 2022 - Release 8.3 (1000+ commits)
+  - Mar 2022 - Release 8.2.2 (800+ commits)
+  - Nov 2021 - Release 8.1.0 (1200+ commits)
+  - Jul 2021 - Release 8.0.0 (2200+ commits)
+
+- SONiC should avoid taking major/minor releases (x.y) directly and instead use patch releases (.z) for stability (e.g., FRR 8.3.1 instead of 8.3 for the 202211 release). As another example, if at the time of the SONiC FRR upgrade the available FRR versions are 9.0.1, 8.5.3 and 9.0, the guidance is to upgrade to the latest patch release, 9.0.1
+
+# SONiC Release FRR Upgrade
+- SONiC defaults to rebasing FRR in every November community release
+- SONiC FRR upgrade test requirements
+  - MANDATORY: Pass all Azure pipeline build tests and obtain LGTM as required by the standard code PR merge process
+  - OPTIONAL: Additional tests with respect to specific changesets in the upgrade, as deemed necessary; manual tests should be automated and submitted to improve future test coverage
+- Rotate SONiC FRR maintenance duty among the repo maintainer organizations and others (BRCM, MSFT, Alibaba, NVDA, DELL)
+- Responsibilities of the SONiC FRR release maintainer
+  - Default 12-month assignment
+  - Upgrade the FRR version in the November release and resolve SONiC FRR upgrade integration issues
+  - Triage and fix SONiC FRR issues when applicable. Fixes may come from SONiC contributors or from the FRR community; the maintainer is responsible for driving the fix to unblock the SONiC community
+  - Submit fixes to the FRR project, and submit new FRR topo tests to the FRR project if there is a gap
+  - Subscribe to the FRR project and be the FRR point of contact on behalf of SONiC
+  - Bring FRR vulnerabilities and critical patches into SONiC
+  - If there is a need for a May release upgrade due to feature requirements or other reasons, it should be requested in the May release plan and consensus reached with the FRR release maintainer. The person or organization requesting the upgrade is responsible for carrying out the FRR upgrade with the same FRR upgrade process described in this proposal.
The FRR release maintainer will need to work closely with the person or organization to oversee the mid-term upgrade, and will continue in the role of FRR release maintainer until the end of their term
+
+# SONiC FRR vulnerability and patch upgrade between SONiC releases
+
+- FRR CVE Fixes
+  - Reference the nvd.nist.gov CVSS v3.x [rating](https://nvd.nist.gov/vuln-metrics/cvss#)
+  - Bring Critical and High FRR CVE patches into SONiC
+  - Sample of [Critical](https://nvd.nist.gov/vuln/search/results?form_type=Advanced&results_type=overview&search_type=all&isCpeNameSearch=false&cpe_vendor=cpe%3A%2F%3Afrrouting&cpe_product=cpe%3A%2F%3A%3Afrrouting&cvss_version=3&cvss_v3_severity=CRITICAL), sample of [High](https://nvd.nist.gov/vuln/search/results?form_type=Advanced&results_type=overview&search_type=all&isCpeNameSearch=false&cpe_vendor=cpe%3A%2F%3Afrrouting&cpe_product=cpe%3A%2F%3A%3Afrrouting&cvss_version=3&cvss_v3_severity=HIGH)
+  - There are regular security scans in the Azure pipeline; CVEs will be filed to sonic-buildimage/issues
+  - The SONiC FRR release maintainer should subscribe to nvd.nist.gov for FRR alerts
+  - A process is also needed to bring CVE fixes into earlier SONiC releases (open to suggestions)
+
+- Patch FRR Bug Fixes
+  - The SONiC FRR release maintainer should subscribe to the FRR project to bring in critical patches that are applicable to SONiC
+
+# SONiC FRR Upgrade Steps
+- Create a sonic-frr branch for the target FRR version
+  - Contact the release manager
+- Find new package dependencies
+  - Upload newly required packages to a common location (Azure)
+- Submodule update to the new FRR commit id
+- Code changes
+  - Version change in Makefiles
+  - New Makefiles for new packages (if any)
+  - Port patches
+    - Evaluate whether the existing FRR patches are still applicable to the new FRR version
+    - Apply the old patches to the new FRR version and generate new patch files, keeping the original credits
+    - If the changes are already present in the new FRR version, discard the old patch file
+    - If a patch does not apply, manually merge the changes and resolve any conflicts
+  - Review the existing FRR commands in SONiC techsupport. Add, remove, or modify the FRR commands in the generate_dump script based on the new FRR version: https://github.com/sonic-net/sonic-utilities/blob/master/scripts/generate_dump
+  - Build and verify
+    - Use PTF on a local server, or
+    - Manually verify BGP, VRF, IPv4, IPv6 (on sonic-vs)
+  - Create a PR with the following template
+    - [https://github.com/sonic-net/sonic-buildimage/pull/15965](https://github.com/sonic-net/sonic-buildimage/pull/15965)
+- FRR upgrade PRs for reference
+  - [https://github.com/sonic-net/sonic-buildimage/pull/15965](https://github.com/sonic-net/sonic-buildimage/pull/15965)
+  - [https://github.com/sonic-net/sonic-buildimage/pull/10691](https://github.com/sonic-net/sonic-buildimage/pull/10691)
+  - [https://github.com/sonic-net/sonic-buildimage/pull/11502](https://github.com/sonic-net/sonic-buildimage/pull/11502)
+  - [https://github.com/sonic-net/sonic-buildimage/pull/10947](https://github.com/sonic-net/sonic-buildimage/pull/10947)