| Rev | Date | Author | Change Description |
|:---:|:-----------:|:------------------:|-----------------------------------|
| 0.1 | Aug 16 2023 | FengSheng Yang | Initial Draft |
| 0.2 | Aug 29 2023 | Yijiao Qin | Second Draft |
| 0.3 | Sep 5 2023 | Nikhil Kelapure | Supplement of Async SAI Part |
| 1.0 | Feb 1 2024 | Yijiao Qin | Update test strategy |

<!-- omit in toc -->
## Table of Contents
Once `orchagent` `doTask` writes data to ASIC_DB, it waits for the response from `syncd`.

## High-Level Proposal

### 1. Reduce Redis I/O traffic between Fpmsyncd and Orchagent
#### 1.1 Remove _PUBLISH_ in the lua script
> Fpmsyncd uses _ProducerStateTable_ to send data out, while Orchagent uses _ConsumerStateTable_ to read data in.
The producer's APIs are backed by Lua scripts for their Redis operations, and each script ends with a _PUBLISH_ command to notify the downstream consumers.

Since we employ a Redis pipeline to queue commands up and flush them in a batch, a _PUBLISH_ per command is unnecessary. We can instead append a single _PUBLISH_ to the end of the command queue when the pipeline flushes, so the whole batch shares it, reducing notification traffic from O(n) _PUBLISH_ commands to O(1) per flush.
<figure align=center>
<img src="images/BatchPub.png" height=auto>
</figure>
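To illustrate the idea, here is a minimal sketch of a batching pipeline that publishes once per channel at flush time. The class and helper names are hypothetical; the actual change lives in `RedisPipeline` and the producer's Lua scripts (see the low-level section below).

```c++
#include <string>
#include <unordered_set>
#include <vector>

// Hypothetical sketch: queue commands, remember which channels they would
// have notified, and publish once per channel when the batch is flushed.
class BatchedPipeline {
public:
    void push(const std::string& command, const std::string& channel) {
        commands_.push_back(command);
        pendingChannels_.insert(channel);   // dedup: one entry per channel
    }

    void flush() {
        for (const auto& cmd : commands_)
            sendToRedis(cmd);                     // the queued SET/DEL traffic
        for (const auto& ch : pendingChannels_)
            sendToRedis("PUBLISH " + ch + " G");  // one notification per channel
        commands_.clear();
        pendingChannels_.clear();
    }

private:
    void sendToRedis(const std::string& /*cmd*/) { /* write to the connection */ }

    std::vector<std::string> commands_;
    std::unordered_set<std::string> pendingChannels_;
};
```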

#### 1.2 Reduce pipeline flush frequency

> The Redis pipeline flushes itself when it's full; otherwise it temporarily holds the Redis commands in its buffer.
Commands cannot get stuck in the pipeline, because Fpmsyncd also flushes it, and that flush is event-triggered via the _select_ method.

Firstly, we could increase the pipeline buffer size from the default 125 to 10k, which decreases how often the pipeline flushes itself.
Secondly, we could skip a Fpmsyncd flush when too little time has passed since the last one, with a _flush timeout_ as the threshold.
To avoid commands lingering in the pipeline after a skip, we change Fpmsyncd's _select timeout_ from _infinity_ to _flush timeout_ right after a successful skip, which ensures the skipped commands are eventually flushed; once those lingering commands are flushed, the _select timeout_ changes back to _infinity_.
Since the number of _PUBLISH_ notifications now equals the number of flushes, while the consumer pop size stays at 125, the consumer needs to pop multiple times per _PUBLISH_ to retrieve all the modified data, as sketched below.
If a batch of fewer than 10 routes arrives at the pipeline, it is flushed directly, in case it carries important routes.
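The draining loop on the consumer side can be sketched as follows; `popUpTo` and `process` are hypothetical stand-ins for the `ConsumerStateTable` pop and the orchagent handling logic.

```c++
#include <deque>
#include <string>

constexpr size_t POP_BATCH_SIZE = 125;   // consumer pop size stays at 125

struct Entry { std::string key; std::string op; };   // stand-in for swss tuples

std::deque<Entry> popUpTo(size_t n);            // pops at most n modified entries
void process(const std::deque<Entry>& batch);   // applies them in orchagent

// One PUBLISH may now announce far more than POP_BATCH_SIZE entries, so the
// consumer keeps popping until the KeySet is drained instead of popping once.
void onNotification() {
    while (true) {
        std::deque<Entry> batch = popUpTo(POP_BATCH_SIZE);
        if (batch.empty())
            break;
        process(batch);
    }
}
```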

#### 1.3 Discard the state table with prefix `_`
> The pipeline flushes data into the state table and uses the _KeySet_ data structure to mark modified keys.
When Orchagent is notified of new data, it identifies the modified keys via _KeySet_, transfers the data from the state table to the stable table, and then deletes the state-table entries.

We propose to discard the state tables and flush data directly to the stable tables, while keeping the _KeySet_ data structure to track modified keys. For example, a route entry would be written straight to `ROUTE_TABLE:10.0.0.0/24` instead of being staged under `_ROUTE_TABLE:10.0.0.0/24` first.

### 2. Split the monolithic Orchagent/Syncd workflow into multiple threads
Take `orchagent` for example: a single task consists of three sub-tasks (`pops`, `addToSync` and `doTask`), which `orchagent` originally performs serially, so a new `pops` sub-task can only begin after the previous `doTask` has finished. The proposed design runs `pops` on a separate thread, decoupling it from `addToSync` and `doTask`. As the figure shows, in the new pipeline architecture a new `pops` sub-task begins as soon as it is ready, without waiting for the previous `addToSync` and `doTask` to finish.

<figure align=center>
<img src="images/pipeline-timeline.png">
  <figcaption>Figure 7. Pipeline architecture compared with the original serial architecture</figcaption>
</figure>

#### Ring buffer for low-cost thread coordination
Since multiple threads are employed, we adopt a lock-free design, using a ring buffer as an asynchronous communication channel between them.
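A minimal single-producer/single-consumer sketch of such a channel is shown below. It is illustrative only, not the actual orchagent code; because exactly one thread writes `tail_` and exactly one writes `head_`, atomic loads and stores suffice and no mutex is needed.

```c++
#include <array>
#include <atomic>
#include <cstddef>
#include <optional>

template <typename T, size_t N>
class RingBuffer {
public:
    bool push(const T& item) {   // producer thread only
        size_t tail = tail_.load(std::memory_order_relaxed);
        size_t next = (tail + 1) % N;
        if (next == head_.load(std::memory_order_acquire))
            return false;        // full: caller retries later
        buf_[tail] = item;
        tail_.store(next, std::memory_order_release);
        return true;
    }

    std::optional<T> pop() {     // consumer thread only
        size_t head = head_.load(std::memory_order_relaxed);
        if (head == tail_.load(std::memory_order_acquire))
            return std::nullopt; // empty
        T item = buf_[head];
        head_.store((head + 1) % N, std::memory_order_release);
        return item;
    }

    bool empty() const {
        return head_.load(std::memory_order_acquire) ==
               tail_.load(std::memory_order_acquire);
    }

private:
    std::array<T, N> buf_{};
    std::atomic<size_t> head_{0};   // next slot to pop
    std::atomic<size_t> tail_{0};   // next slot to fill
};
```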

#### Asynchronous sairedis API usage
The `sairedis` API is used in asynchronous mode, and `orchagent` maintains a list of contexts for the messages whose responses are pending, so that each response can be processed when it is received.

#### New ResponseThread in OA
A new `ResponseThread` in `orchagent` processes each response when it is received, so that the other threads can continue processing new routing messages.
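The cooperation between the worker threads and `ResponseThread` can be sketched as below, with hypothetical names; the real code would use the `sairedis` asynchronous API and its reply channel.

```c++
#include <iostream>
#include <mutex>
#include <queue>
#include <string>

// Context recorded for every request whose response is still outstanding.
struct PendingContext { std::string prefix; };

std::mutex gMtx;
std::queue<PendingContext> gPending;

void sendRouteAsync(const std::string& prefix) {
    // In asynchronous mode the sairedis call returns immediately; we only
    // remember enough context to deal with the response later.
    std::lock_guard<std::mutex> lk(gMtx);
    gPending.push(PendingContext{prefix});
}

// Called by ResponseThread for each response read from the reply channel,
// so the other threads keep processing new routing messages meanwhile.
void handleOneResponse(bool success) {
    std::lock_guard<std::mutex> lk(gMtx);
    if (gPending.empty())
        return;
    PendingContext ctx = gPending.front();
    gPending.pop();
    if (!success)
        std::cerr << "handle failure for " << ctx.prefix << "\n";
}
```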

## Low-Level Implementation

### Fpmsyncd

#### Lower frequency of the fpmsyncd flush & APPL_DB publish
The Redis pipeline flushes itself when it's full; to save TCP traffic, we choose _10000_ rather than the default 125 as the pipeline size.
```c++
#define FPMSYNCD_PIPELINE_SIZE 10000
#define FLUSH_TIMEOUT 200
#define BLOCKING -1

RedisPipeline pipeline(&db, FPMSYNCD_PIPELINE_SIZE);

/**
 * @brief fpmsyncd's flush logic
 *
 * Although the pipeline flushes itself when it's full, fpmsyncd also
 * flushes it in some cases. This function implements fpmsyncd's flush
 * policy and returns whether the flush was skipped.
 *
 * @param pipeline Pointer to the pipeline to be flushed.
 * @param interval The expected interval between two flushes.
 * @param forced   If true, the flush is guaranteed to happen.
 * @return true only if the flush is skipped while the pipeline is non-empty.
 */
bool flushPPLine(RedisPipeline* pipeline, int interval, bool forced = false);

int select_timeout = BLOCKING;
while (true)
{
    Selectable *temps;
    // s is fpmsyncd's swss::Select instance, set up with its event sources.
    auto ret = s.select(&temps, select_timeout);
    // ... handle the selected event (e.g. incoming FPM routes) ...
    if (select_timeout == FLUSH_TIMEOUT && ret == Select::TIMEOUT) {
        // Nothing arrived within FLUSH_TIMEOUT after a skipped flush:
        // force a flush so no commands linger in the pipeline.
        flushPPLine(&pipeline, FLUSH_TIMEOUT, true);
        select_timeout = BLOCKING;
    } else if (flushPPLine(&pipeline, FLUSH_TIMEOUT)) {
        // The flush was skipped; arm a select timeout so the lingering
        // commands are eventually flushed even without new events.
        select_timeout = FLUSH_TIMEOUT;
    }
}
```
### Lua scripts
<!-- omit in toc -->
#### sonic-swss-common/common/producerstatetable.cpp
Take _luaSet_ for example. The original script:
```lua
local added = redis.call('SADD', KEYS[2], ARGV[2])
for i = 0, #KEYS - 3 do
    redis.call('HSET', KEYS[3 + i], ARGV[3 + i * 2], ARGV[4 + i * 2])
end
if added > 0 then
    redis.call('PUBLISH', KEYS[1], ARGV[1])
end
```
changed to:
```lua
local added = redis.call('SADD', KEYS[2], ARGV[2])
for i = 0, #KEYS - 3 do
    redis.call('HSET', KEYS[3 + i], ARGV[3 + i * 2], ARGV[4 + i * 2])
end
```
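The same modification applies to _luaDel_ for the Redis `DEL` operation. Note that the original scripts keep working for the other modules; only the fpmsyncd path is modified. With this change, `SET`/`DEL` is decoupled from `PUBLISH`: the single `PUBLISH` is instead bound to fpmsyncd's flush in `RedisPipeline->flush()`, so the subscribers get notified each time fpmsyncd flushes data to `APPL_DB`.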

<!-- omit in toc -->
#### sonic-swss-common/common/consumer_state_table_pops.lua

We remove the `DEL` and `HSET` operations from the original script, which optimizes `Table->pops()`:
```lua
redis.replicate_commands()
local ret = {}
local tablename = KEYS[2]
local keys = redis.call('SPOP', KEYS[1], ARGV[1])

if keys and #keys > 0 then
    local n = #keys
    for i = 1, n do
        local key = keys[i]
        local fieldvalues = redis.call('HGETALL', tablename..key)
        table.insert(ret, {key, fieldvalues})
    end
end

return ret
```
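**NOTE:** As measured on the earlier draft, this change doubles the performance of `Table->pops()`, making routing from `fpmsyncd` to `orchagent` via APPL_DB roughly 10% faster; the script change is limited to the `routeorch` module.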

## WarmRestart scenario
This proposal takes compatibility with the SONiC `WarmRestart` feature into account. For example, when a user updates the config, a warm restart may be needed for the update to take effect. SONiC's main thread calls the `dumpPendingTasks()` function to save the current system state and restore it after the warm restart. Since this HLD introduces a new thread and a new `ring buffer` structure that holds data, we have to ensure that all the data in the `ring buffer` is processed before the warm restart. During warm start, the main thread modifies the variable `m_toSync`, which the new thread also accesses; therefore the new thread should be blocked during warm restart to avoid conflicts.

Take orchagent for example: we need to make sure the ring buffer is empty and the new thread is idle before we call `dumpPendingTasks()`.
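A minimal sketch of that check, assuming the ring-buffer sketch above and a hypothetical idle flag maintained by the consumer thread:

```c++
#include <atomic>
#include <chrono>
#include <thread>

std::atomic<bool> gConsumerIdle{false};   // set by the ring-buffer consumer

// Block until the ring buffer is drained and the consumer thread is idle,
// so that dumpPendingTasks() observes a consistent m_toSync.
template <typename Ring>
void waitUntilQuiescent(const Ring& ring) {
    while (!ring.empty() || !gConsumerIdle.load(std::memory_order_acquire))
        std::this_thread::sleep_for(std::chrono::milliseconds(1));
}
```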

### Multi-threaded orchagent with a ring buffer

New pthread in orchagent
<figcaption>Figure 10. Async sairedis workflow</figcaption>
</figure>
## Testing Requirements/Design
### System test