- Revision
- About this manual
- Scope
- Abbreviations
- 1 Introduction
- 2 Design
- 3 Test plan
Rev | Date | Author | Description |
---|---|---|---|
0.1 | 24/01/2024 | Nazarii Hnydyn | Initial version |
This document provides general information about the WCMP implementation in SONiC.
This document describes the high-level design of the WCMP feature in SONiC.
In scope:
- WCMP L3 configuration
Out of scope:
- WCMP EVPN Type-5 configuration
Term | Meaning |
---|---|
SONiC | Software for Open Networking in the Cloud |
WCMP | Weighted-Cost Multi-Path |
ECMP | Equal-Cost Multi-Path |
UCMP | Unequal-Cost Multi-Path |
EVPN | Ethernet Virtual Private Network |
FRR | Free Range Routing |
BGP | Border Gateway Protocol |
IP | Internet Protocol |
L3 | Layer 3 |
OA | Orchestration Agent |
DB | Database |
NH | Next Hop |
NHG | Next Hop Group |
ToR | Top-Of-Rack |
API | Application Programming Interface |
SAI | Switch Abstraction Interface |
ASIC | Application-Specific Integrated Circuit |
SWSS | Switch State Service |
CLI | Command-line Interface |
JSON | JavaScript Object Notation |
YANG | Yet Another Next Generation |
PYTEST | Python Testing Framework |
VS | Virtual Switch |
PTF | Packet Test Framework |
Figure 1: WCMP design
Figure 2: WCMP OA design
Figure 3: WCMP update flow
Figure 4: WCMP show flow
Table 1: Frontend event logging
Table 2: Backend event logging
In normal ECMP, the route to a destination has multiple next hops and traffic is expected to be equally distributed
across these next hops. In practice, flow-based hashing is used so that all traffic associated with a particular flow
uses the same next hop, and by extension, the same path across the network.
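Flow-based hashing can be illustrated with a minimal sketch. The function name `pick_next_hop`, the sample addresses, and the use of CRC32 as the hash are assumptions made for illustration only; a real ASIC uses its own hash function and field selection.

```python
import zlib

def pick_next_hop(flow, next_hops):
    """Map a flow 5-tuple to one of the equal-cost next hops."""
    # Serialize the 5-tuple deterministically, then hash it.
    key = "|".join(str(field) for field in flow).encode()
    # The same flow always yields the same hash, hence the same next hop.
    return next_hops[zlib.crc32(key) % len(next_hops)]

next_hops = ["10.0.0.1", "10.0.0.2", "10.0.0.3", "10.0.0.4"]
# (source IP, destination IP, protocol, source port, destination port)
flow = ("192.168.1.10", "172.16.0.5", 6, 49152, 443)
chosen = pick_next_hop(flow, next_hops)
```

Because the selection depends only on the packet header fields, every packet of a flow follows the same path and packet reordering within the flow is avoided.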
Weighted ECMP using BGP link bandwidth introduces support for network-wide UCMP to an IP destination.
The unequal cost load balancing is implemented by the forwarding plane based on the weights associated
with the next hops of the IP prefix. These weights are computed based on the bandwidths of the corresponding
multipaths which are encoded in the BGP link bandwidth extended community. Exchange of an appropriate
BGP link bandwidth value for a prefix across the network results in network-wide unequal cost multipathing.
WCMP is applicable in a pure L3 network as well as in an EVPN network.
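One way a forwarding plane could honor next hop weights is to split the hash space into buckets proportional to each weight. The sketch below is illustrative only: `pick_weighted_next_hop` and the CRC32 hash are assumptions, and hardware implementations typically realize the same idea differently (for example, by replicating group members).

```python
import zlib
from bisect import bisect_right
from itertools import accumulate

def pick_weighted_next_hop(flow, weighted_next_hops):
    """Pick a next hop for a flow in proportion to its weight."""
    hops = [hop for hop, _ in weighted_next_hops]
    # Cumulative weights partition the range [0, total) into one bucket per hop.
    cumulative = list(accumulate(weight for _, weight in weighted_next_hops))
    key = "|".join(str(field) for field in flow).encode()
    bucket = zlib.crc32(key) % cumulative[-1]
    # The first cumulative weight strictly greater than the bucket owns it.
    return hops[bisect_right(cumulative, bucket)]

# Next hop A has three times the weight of next hop B.
group = [("A", 3), ("B", 1)]
flow = ("192.168.1.10", "172.16.0.5", 6, 49152, 443)
selected = pick_weighted_next_hop(flow, group)
```

With weights 3 and 1, roughly three quarters of the flows land on A, while each individual flow still maps to a single next hop.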
This feature will support the following functionality:
- WCMP BGP device global configuration
- Warm/Fast reboot
This feature will support the following commands:
- config: set switch WCMP global configuration
- show: display switch WCMP global configuration
This feature will provide error handling for the following situations:
- Missing parameter value
- Invalid parameter value
This feature will provide error handling for the following situations:
- Invalid parameter value
- Parameter removal
- Configuration removal
This feature will provide event logging for the following situations:
- Configuration update
Event | Severity |
---|---|
Configuration update: success | NOTICE |
Configuration update: error | ERROR |
This feature will provide event logging for the following situations:
- Invalid parameter value
- Parameter removal
- Configuration removal
- Configuration update
Event | Severity |
---|---|
Invalid parameter value | ERROR |
Parameter removal | ERROR |
Configuration removal | ERROR |
Configuration update: success | NOTICE |
Configuration update: error | ERROR |
A network fabric is a type of network topology where all nodes, in this case switches and endpoints,
are interconnected to all other nodes. In such a topology, full bisection bandwidth is often expected.
The full bisection bandwidth allows one half of the network nodes to communicate simultaneously
with the other half of the nodes.
When a link fails, the routing topology must not only converge quickly but also converge
in a way that maintains the best possible performance. For example, with multiple links connecting
each ToR and Spine, a single link failure does not change reachability, but it significantly
reduces performance because capacity is lost.
Weighted-Equal Cost-Multipath (W-ECMP) offers a solution to this problem in which the relative link weight
reflecting the available capacity is distributed using the BGP bandwidth community attribute.
Dataplane configuration flow:
- The BGP link bandwidth extended community carries information about the available link capacity for the route
prefixes through the network, which gets mapped to the weight of the corresponding next hop. The mapping
factors the bandwidth value of a particular path against the total bandwidth values of all possible paths,
normalized to the range 1 to 255
- FRR converts the incoming BGP link bandwidth extended community values into proportional weights
among the ECMP members in such a way that the cumulative value of the individual weights is normalized to 255
- BGP gets notified about the weight change for the prefixes towards the switch that experienced the failure
of the link connecting the Spine and the remote ToR
- FRR/Zebra recalculates the weight and makes a new NHG
- NH weights on routes are reported to fpmsyncd and further via Redis DB to Route OA
- Route OA programs weight values to syncd via SAI Redis
- ASIC is provisioned with weight values by syncd via SAI API
The NHG member weights can be used by a dataplane to make better forwarding decisions.
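The bandwidth-to-weight mapping described in the flow above can be sketched as follows. This is a simplified illustration under stated assumptions, not FRR's actual code: the function name `bandwidths_to_weights`, the integer rounding, and the floor of 1 for very small shares are all choices made for this example.

```python
def bandwidths_to_weights(bandwidths, max_weight=255):
    """Scale per-path bandwidths to weights in the range 1..max_weight."""
    total = sum(bandwidths)
    if total <= 0:
        raise ValueError("at least one path must advertise a positive bandwidth")
    # Each weight is the path's share of the total bandwidth, scaled to max_weight.
    # A floor of 1 keeps a very small path usable instead of dropping it.
    return [max(1, bw * max_weight // total) for bw in bandwidths]

# Two paths of 100 units and one of 50 units.
weights = bandwidths_to_weights([100, 100, 50])  # → [102, 102, 51]
```

In this example the weights keep the 2:2:1 ratio of the advertised bandwidths and add up to 255. When a path loses capacity, recomputing the weights shifts traffic away from it in proportion.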
The configuration for weighted ECMP using BGP link bandwidth requires using a route-map to inject
the link bandwidth extended community.
There is no configuration necessary to process received link bandwidth and translate it into the weight
associated with the corresponding next hop; that happens by default. If some of the multipaths do not have
the link bandwidth extended community, the default behavior is to revert to normal ECMP.
At the entry point router that is injecting the prefix to which weighted load balancing must be performed,
a route-map must be configured to attach the link bandwidth extended community.
For the use case of providing weighted load balancing for an anycast service, this configuration will typically
need to be applied at the ToR or Leaf router that is connected to servers which provide the anycast service
and the bandwidth would be based on the number of multipaths for the destination.
Skeleton code:
!
route-map wcmp-map permit 100
 set extcommunity bandwidth num-multipaths
exit
!
router bgp 65100
 neighbor SPINE peer-group
 neighbor 1.1.1.1 peer-group SPINE
 neighbor 1.1.1.1 remote-as 65200
 neighbor 2.2.2.2 peer-group SPINE
 neighbor 2.2.2.2 remote-as 65200
 !
 address-family ipv4 unicast
  neighbor SPINE route-map wcmp-map out
  neighbor SPINE activate
 exit-address-family
!
end
Skeleton code:
!
route-map wcmp-map permit 100
 set extcommunity bandwidth num-multipaths
exit
!
router bgp 65100 vrf vrf1
 neighbor SPINE peer-group
 neighbor 1.1.1.1 peer-group SPINE
 neighbor 1.1.1.1 remote-as 65200
 neighbor 2.2.2.2 peer-group SPINE
 neighbor 2.2.2.2 remote-as 65200
 !
 address-family l2vpn evpn
  advertise ipv4 unicast route-map wcmp-map
  neighbor SPINE activate
 exit-address-family
!
end
Note: EVPN configuration is only relevant when docker_routing_config_mode is either split or split-unified.
The existing DeviceGlobalCfgMgr class will be extended with new APIs to implement the WCMP feature.
OA will be extended with a new WCMP Config DB schema and set/unset template support.
WCMP updates will be processed by OA based on Config DB changes.
Some updates will be handled and some will be considered as invalid.
Note: NHG member weight support is already part of SWSS
Class DeviceGlobalCfgMgr holds a set of methods matching the generic Manager class pattern to handle
Config DB updates. For that purpose, a SubscriberStateTable mechanism (implemented in sonic-swss-common)
is used. Method DeviceGlobalCfgMgr::handler will be called on WCMP update. It will distribute handling
of DB updates between other handlers based on the table key updated (Redis Keyspace Notifications).
This class is responsible for:
- Processing updates of WCMP
- Partial input data validation
- Replicating data from Config DB to FRR configuration
- Caching objects in order to handle updates
The WCMP object is stored under the BGP_DEVICE_GLOBAL|STATE key in Config DB. On BGP_DEVICE_GLOBAL update,
method DeviceGlobalCfgMgr::set_wcmp will be called to process the change.
Regular WCMP update will refresh the internal class structures and appropriate FRR objects.
Skeleton code:
class DeviceGlobalCfgMgr(Manager):
    """ This class responds to change in device-specific state """

    def __init__(self, common_objs, db, table):
        """
        Initialize the object
        :param common_objs: common object dictionary
        :param db: name of the db
        :param table: name of the table in the db
        """
        ...
        self.wcmp_templates = {
            "set": common_objs['tf'].from_file("bgpd/wcmp/bgpd.wcmp.set.conf.j2"),
            "unset": common_objs['tf'].from_file("bgpd/wcmp/bgpd.wcmp.unset.conf.j2"),
        }

    def set_handler(self, key, data):
        """ Handle device TSA/WCMP state change """
        ...
        status = False
        if "wcmp_enabled" in data:
            if self.is_update_required("wcmp_enabled", data["wcmp_enabled"]):
                self.set_wcmp(data["wcmp_enabled"])
                self.directory.put(self.db_name, self.table_name, "wcmp_enabled", data["wcmp_enabled"])
                status = True
        return status

    def del_handler(self, key):
        """ Handle device TSA/WCMP state remove """
        ...

    def is_update_required(self, key, value):
        """ API to check if configuration update required """
        ...

    def set_wcmp(self, wcmp_status):
        """ API to set/unset WCMP """
        ...
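The caching behaviour implied by is_update_required can be illustrated with a standalone sketch. The class `WcmpStateHandler`, its `cache` and `pushed` attributes, and the return-value convention are hypothetical stand-ins invented for this example; they are not the bgpcfgd implementation, and the elided parts of the skeleton above are intentionally left unfilled.

```python
class WcmpStateHandler:
    """Hypothetical, simplified stand-in for the WCMP part of DeviceGlobalCfgMgr."""

    VALID_VALUES = ("true", "false")

    def __init__(self):
        self.cache = {}   # last applied value per Config DB field
        self.pushed = []  # stands in for configuration pushed to FRR

    def is_update_required(self, key, value):
        """Return True when the value differs from the cached one."""
        return self.cache.get(key) != value

    def set_handler(self, data):
        """Apply a state update; return True only if something was pushed."""
        value = data.get("wcmp_enabled")
        if value not in self.VALID_VALUES:
            return False  # missing or invalid value: nothing is applied
        if not self.is_update_required("wcmp_enabled", value):
            return False  # same value as before: skip the redundant push
        # Choose the "set" or "unset" action, mirroring the two templates.
        self.pushed.append("set" if value == "true" else "unset")
        self.cache["wcmp_enabled"] = value
        return True

handler = WcmpStateHandler()
handler.set_handler({"wcmp_enabled": "true"})   # pushes "set"
handler.set_handler({"wcmp_enabled": "true"})   # cached, nothing pushed
```

Comparing against the cached value keeps repeated Config DB notifications with an unchanged value from reconfiguring FRR again.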
Note: WCMP BGP configuration via FRR management framework is not considered
BGP OA uses policies.conf.j2 at sonic-buildimage/dockers/docker-fpm-frr/frr/bgpd/templates/general
in order to generate relevant route-map infrastructure.
WCMP BGP requires using an outbound route-map to inject the link bandwidth extended community.
For this purpose, the following definitions will be used:
- TO_BGP_PEER_V4
- TO_BGP_PEER_V6
BGP OA will use bgpd.wcmp.conf.j2 at sonic-buildimage/dockers/docker-fpm-frr/frr/bgpd/wcmp
to enable/disable WCMP.
Skeleton code:
!
! template: bgpd/wcmp/bgpd.wcmp.conf.j2
!
route-map TO_BGP_PEER_V4 permit 100
{%- if wcmp_enabled == 'true' %}
set extcommunity bandwidth num-multipaths
{%- else %}
no set extcommunity bandwidth
{%- endif %}
exit
!
route-map TO_BGP_PEER_V6 permit 100
{%- if wcmp_enabled == 'true' %}
set extcommunity bandwidth num-multipaths
{%- else %}
no set extcommunity bandwidth
{%- endif %}
exit
!
! end of template: bgpd/wcmp/bgpd.wcmp.conf.j2
!
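To make the effect of the template concrete, the sketch below produces equivalent output in plain Python, without Jinja2. The function name `render_wcmp_route_maps` and the one-space indentation of the rendered statement are assumptions for illustration; the real rendering is done by the template engine used by BGP OA.

```python
def render_wcmp_route_maps(wcmp_enabled):
    """Build the route-map statements the template yields for a given state."""
    # The template branches on the string value stored in Config DB.
    if wcmp_enabled == "true":
        action = "set extcommunity bandwidth num-multipaths"
    else:
        action = "no set extcommunity bandwidth"
    lines = []
    # The same statement is applied to the IPv4 and IPv6 outbound route-maps.
    for name in ("TO_BGP_PEER_V4", "TO_BGP_PEER_V6"):
        lines += ["!", f"route-map {name} permit 100", f" {action}", "exit"]
    lines.append("!")
    return "\n".join(lines)

enabled_config = render_wcmp_route_maps("true")
disabled_config = render_wcmp_route_maps("false")
```

A check of the kind listed in the test plan could then look for the `set extcommunity bandwidth num-multipaths` line under both route-maps after a WCMP config update.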
; defines schema for WCMP configuration attributes
key = BGP_DEVICE_GLOBAL|STATE ; device global state. Must be unique
; field = value
wcmp_enabled = wcmp-state ; enable/disable WCMP using BGP link bandwidth
; value annotations
wcmp-state = "true" / "false"
Config DB:
redis-cli -n 4 HGETALL 'BGP_DEVICE_GLOBAL|STATE'
1) "wcmp_enabled"
2) "true"
WCMP configuration:
{
    "BGP_DEVICE_GLOBAL": {
        "STATE": {
            "wcmp_enabled": "true"
        }
    }
}
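A consumer of this configuration might read and validate the field as sketched below. The helper `read_wcmp_enabled` is hypothetical; the fallback to "false" when the field is absent mirrors the default declared in the YANG model.

```python
import json

def read_wcmp_enabled(config_text):
    """Return the WCMP state from a Config DB JSON snippet as a bool."""
    config = json.loads(config_text)
    state = config.get("BGP_DEVICE_GLOBAL", {}).get("STATE", {})
    # An absent field falls back to the schema default.
    value = state.get("wcmp_enabled", "false")
    # The schema allows only the strings "true" and "false".
    if value not in ("true", "false"):
        raise ValueError(f"invalid wcmp_enabled value: {value!r}")
    return value == "true"

sample = '{"BGP_DEVICE_GLOBAL": {"STATE": {"wcmp_enabled": "true"}}}'
wcmp_on = read_wcmp_enabled(sample)  # → True
```

Rejecting any value other than "true" or "false" corresponds to the "Invalid parameter value" error case listed earlier.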
WCMP initial configuration will be updated at sonic-buildimage/files/build_templates/init_cfg.json.j2
in order to expose the default state.
Skeleton code:
{
    ...
    "BGP_DEVICE_GLOBAL": {
        "STATE": {
            "wcmp_enabled": "false"
        }
    },
    ...
}
User interface:
config
|--- bgp
     |--- device-global
          |--- wcmp <ARG>
show
|--- bgp
     |--- device-global OPTIONS
Arguments:
config bgp device-global wcmp
- enabled|disabled: enable/disable WCMP using BGP link bandwidth
Options:
show bgp device-global
- -j|--json: display in JSON format
The following command updates WCMP configuration:
config bgp device-global wcmp enabled
The following command shows WCMP configuration:
show bgp device-global
TSA WCMP
--------- --------
Disabled Enabled
show bgp device-global --json
{
    "tsa": "disabled",
    "wcmp": "enabled"
}
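The mapping between the CLI keywords and the Config DB values can be sketched as follows. Everything here is illustrative: the functions `config_wcmp` and `show_device_global`, the in-memory dictionary standing in for Config DB, and the `tsa_enabled` field name are assumptions, and the table layout only approximates the output shown above.

```python
import json

CLI_TO_DB = {"enabled": "true", "disabled": "false"}
DB_TO_CLI = {"true": "enabled", "false": "disabled"}

def config_wcmp(config_db, arg):
    """Store the WCMP state for a CLI argument in a dict standing in for Config DB."""
    if arg not in CLI_TO_DB:
        raise ValueError(f"invalid argument: {arg!r}")
    state = config_db.setdefault("BGP_DEVICE_GLOBAL", {}).setdefault("STATE", {})
    state["wcmp_enabled"] = CLI_TO_DB[arg]

def show_device_global(config_db, as_json=False):
    """Format the device-global state as a table or as JSON."""
    state = config_db.get("BGP_DEVICE_GLOBAL", {}).get("STATE", {})
    view = {
        "tsa": DB_TO_CLI[state.get("tsa_enabled", "false")],  # assumed field name
        "wcmp": DB_TO_CLI[state.get("wcmp_enabled", "false")],
    }
    if as_json:
        return json.dumps(view, indent=4)
    names = [name.upper() for name in view]
    values = [value.capitalize() for value in view.values()]
    widths = [max(len(n), len(v)) for n, v in zip(names, values)]
    rows = [names, ["-" * w for w in widths], values]
    return "\n".join(
        "  ".join(cell.ljust(w) for cell, w in zip(row, widths)).rstrip()
        for row in rows
    )

db = {}
config_wcmp(db, "enabled")
table_output = show_device_global(db)
json_output = show_device_global(db, as_json=True)
```

Keeping the keyword translation in one place means the CLI can present enabled/disabled while Config DB keeps the "true"/"false" strings required by the schema.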
The existing YANG model sonic-bgp-device-global.yang at sonic-buildimage/src/sonic-yang-models/yang-models
will be extended with a new schema in order to provide support for WCMP.
Skeleton code:
module sonic-bgp-device-global {
    yang-version 1.1;
    namespace "http://github.com/sonic-net/sonic-bgp-device-global";
    prefix bgp_device_global;
    organization "SONiC";
    contact "SONiC";
    description "SONIC Device-specific BGP global data";
    revision 2024-01-28 {
        description "Add Weighted ECMP using BGP link bandwidth";
    }
    revision 2022-06-26 {
        description "Initial revision";
    }
    container sonic-bgp-device-global {
        container BGP_DEVICE_GLOBAL {
            container STATE {
                ...
                leaf wcmp_enabled {
                    description "Enables/Disables Weighted-Equal Cost-Multipath using BGP link bandwidth";
                    type boolean;
                    default "false";
                }
            }
            /* end of STATE container */
        }
        /* end of BGP_DEVICE_GLOBAL container */
    }
    /* end of sonic-bgp-device-global container */
}
/* end of module sonic-bgp-device-global */
No special handling is required
WCMP basic configuration test:
- Verify FRR route-map is generated properly after WCMP config update
TBD
TBD