| Rev | Date | Author | Change Description |
|:---:|:-----------:|:------------------:|-----------------------------------|
| 0.1 | Aug 16 2023 | FengSheng Yang | Initial Draft |
| 0.2 | Aug 29 2023 | Yijiao Qin | Second Draft |
| 0.3 | Sep 5 2023 | Nikhil Kelapure | Supplement of Async SAI Part |
| 1.0 | Feb 1 2024 | Yijiao Qin | Update test strategy |

<!-- omit in toc -->
## Table of Contents
Once `orchagent` `doTask` writes data to ASIC_DB, it waits for the response from `syncd`.

## High-Level Proposal

### 1. Reduce Redis I/O traffic between Fpmsyncd and Orchagent
#### 1.1 Remove _PUBLISH_ in the lua script
> Fpmsyncd uses _ProducerStateTable_ to send data out, while Orchagent uses _ConsumerStateTable_ to read data in.
The producer's APIs are backed by Lua scripts for their Redis operations, and each script ends with a _PUBLISH_ command to notify the downstream consumers.

Since we employ a Redis pipeline to queue commands up and flush them in a batch, a _PUBLISH_ per command is unnecessary. We can instead append a single _PUBLISH_ to the end of the command queue when the pipeline flushes, so the whole batch shares it, reducing notification traffic from O(n) _PUBLISH_ commands to O(1) per flush.
<figure align=center>
<img src="images/BatchPub.png" height=auto>
</figure>
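To illustrate the idea, here is a minimal sketch of a batching pipeline that publishes once per channel at flush time. The class and helper names are hypothetical; the actual change lives in `RedisPipeline` and the producer's Lua scripts (see the low-level section below).

```c++
#include <string>
#include <unordered_set>
#include <vector>

// Hypothetical sketch: queue commands, remember which channels they would
// have notified, and publish once per channel when the batch is flushed.
class BatchedPipeline {
public:
    void push(const std::string& command, const std::string& channel) {
        commands_.push_back(command);
        pendingChannels_.insert(channel);   // dedup: one entry per channel
    }

    void flush() {
        for (const auto& cmd : commands_)
            sendToRedis(cmd);                     // the queued SET/DEL traffic
        for (const auto& ch : pendingChannels_)
            sendToRedis("PUBLISH " + ch + " G");  // one notification per channel
        commands_.clear();
        pendingChannels_.clear();
    }

private:
    void sendToRedis(const std::string& /*cmd*/) { /* write to the connection */ }

    std::vector<std::string> commands_;
    std::unordered_set<std::string> pendingChannels_;
};
```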

#### 1.2 Reduce pipeline flush frequency

> The Redis pipeline flushes itself when it's full; otherwise it temporarily holds the Redis commands in its buffer.
Commands cannot get stuck in the pipeline, because Fpmsyncd also flushes it, and that flush is event-triggered via the _select_ method.

Firstly, we could increase the pipeline buffer size from the default 125 to 10k, which decreases how often the pipeline flushes itself.
Secondly, we could skip a Fpmsyncd flush when too little time has passed since the last one, with a _flush timeout_ as the threshold.
To avoid commands lingering in the pipeline after a skip, we change Fpmsyncd's _select timeout_ from _infinity_ to _flush timeout_ right after a successful skip, which ensures the skipped commands are eventually flushed; once those lingering commands are flushed, the _select timeout_ changes back to _infinity_.
Since the number of _PUBLISH_ notifications now equals the number of flushes, while the consumer pop size stays at 125, the consumer needs to pop multiple times per _PUBLISH_ to retrieve all the modified data, as sketched below.
If a batch of fewer than 10 routes arrives at the pipeline, it is flushed directly, in case it carries important routes.
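The draining loop on the consumer side can be sketched as follows; `popUpTo` and `process` are hypothetical stand-ins for the `ConsumerStateTable` pop and the orchagent handling logic.

```c++
#include <deque>
#include <string>

constexpr size_t POP_BATCH_SIZE = 125;   // consumer pop size stays at 125

struct Entry { std::string key; std::string op; };   // stand-in for swss tuples

std::deque<Entry> popUpTo(size_t n);            // pops at most n modified entries
void process(const std::deque<Entry>& batch);   // applies them in orchagent

// One PUBLISH may now announce far more than POP_BATCH_SIZE entries, so the
// consumer keeps popping until the KeySet is drained instead of popping once.
void onNotification() {
    while (true) {
        std::deque<Entry> batch = popUpTo(POP_BATCH_SIZE);
        if (batch.empty())
            break;
        process(batch);
    }
}
```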

#### 1.3 Discard the state table with prefix `_`
> The pipeline flushes data into the state table and uses the _KeySet_ data structure to mark modified keys.
When Orchagent is notified of new data, it identifies the modified keys via _KeySet_, transfers the data from the state table to the stable table, and then deletes the state-table entries.

We propose to discard the state tables and flush data directly to the stable tables, while keeping the _KeySet_ data structure to track modified keys. For example, a route entry would be written straight to `ROUTE_TABLE:10.0.0.0/24` instead of being staged under `_ROUTE_TABLE:10.0.0.0/24` first.

### 2. Split the monolithic Orchagent/Syncd workflow into multiple threads
Take `orchagent` for example: a single task consists of three sub-tasks (`pops`, `addToSync` and `doTask`), which `orchagent` originally performs serially, so a new `pops` sub-task can only begin after the previous `doTask` has finished. The proposed design runs `pops` on a separate thread, decoupling it from `addToSync` and `doTask`. As the figure shows, in the new pipeline architecture a new `pops` sub-task begins as soon as it is ready, without waiting for the previous `addToSync` and `doTask` to finish.

<figure align=center>
<img src="images/pipeline-timeline.png">
  <figcaption>Figure 7. Pipeline architecture compared with the original serial architecture</figcaption>
</figure>

#### Ring buffer for low-cost thread coordination
Since multiple threads are employed, we adopt a lock-free design, using a ring buffer as an asynchronous communication channel between them.
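A minimal single-producer/single-consumer sketch of such a channel is shown below. It is illustrative only, not the actual orchagent code; because exactly one thread writes `tail_` and exactly one writes `head_`, atomic loads and stores suffice and no mutex is needed.

```c++
#include <array>
#include <atomic>
#include <cstddef>
#include <optional>

template <typename T, size_t N>
class RingBuffer {
public:
    bool push(const T& item) {   // producer thread only
        size_t tail = tail_.load(std::memory_order_relaxed);
        size_t next = (tail + 1) % N;
        if (next == head_.load(std::memory_order_acquire))
            return false;        // full: caller retries later
        buf_[tail] = item;
        tail_.store(next, std::memory_order_release);
        return true;
    }

    std::optional<T> pop() {     // consumer thread only
        size_t head = head_.load(std::memory_order_relaxed);
        if (head == tail_.load(std::memory_order_acquire))
            return std::nullopt; // empty
        T item = buf_[head];
        head_.store((head + 1) % N, std::memory_order_release);
        return item;
    }

    bool empty() const {
        return head_.load(std::memory_order_acquire) ==
               tail_.load(std::memory_order_acquire);
    }

private:
    std::array<T, N> buf_{};
    std::atomic<size_t> head_{0};   // next slot to pop
    std::atomic<size_t> tail_{0};   // next slot to fill
};
```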

#### Asynchronous sairedis API usage
The `sairedis` API is used in asynchronous mode, and `orchagent` maintains a list of contexts for the messages whose responses are pending, so that each response can be processed when it is received.

#### New ResponseThread in OA
A new `ResponseThread` in `orchagent` processes each response when it is received, so that the other threads can continue processing new routing messages.
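The cooperation between the worker threads and `ResponseThread` can be sketched as below, with hypothetical names; the real code would use the `sairedis` asynchronous API and its reply channel.

```c++
#include <iostream>
#include <mutex>
#include <queue>
#include <string>

// Context recorded for every request whose response is still outstanding.
struct PendingContext { std::string prefix; };

std::mutex gMtx;
std::queue<PendingContext> gPending;

void sendRouteAsync(const std::string& prefix) {
    // In asynchronous mode the sairedis call returns immediately; we only
    // remember enough context to deal with the response later.
    std::lock_guard<std::mutex> lk(gMtx);
    gPending.push(PendingContext{prefix});
}

// Called by ResponseThread for each response read from the reply channel,
// so the other threads keep processing new routing messages meanwhile.
void handleOneResponse(bool success) {
    std::lock_guard<std::mutex> lk(gMtx);
    if (gPending.empty())
        return;
    PendingContext ctx = gPending.front();
    gPending.pop();
    if (!success)
        std::cerr << "handle failure for " << ctx.prefix << "\n";
}
```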

## Low-Level Implementation

### Fpmsyncd

#### Lower frequency of the fpmsyncd flush & APPL_DB publish
The Redis pipeline flushes itself when it's full; to save TCP traffic, we choose _10000_ rather than the default 125 as the pipeline size.
```c++
#define FPMSYNCD_PIPELINE_SIZE 10000
#define FLUSH_TIMEOUT 200
#define BLOCKING -1

RedisPipeline pipeline(&db, FPMSYNCD_PIPELINE_SIZE);

/**
 * @brief fpmsyncd's flush logic
 *
 * Although the pipeline flushes itself when it's full, fpmsyncd also
 * flushes it in some cases. This function implements fpmsyncd's flush
 * policy and returns whether the flush was skipped.
 *
 * @param pipeline Pointer to the pipeline to be flushed.
 * @param interval The expected interval between two flushes.
 * @param forced   If true, the flush is guaranteed to happen.
 * @return true only if the flush is skipped while the pipeline is non-empty.
 */
bool flushPPLine(RedisPipeline* pipeline, int interval, bool forced = false);

int select_timeout = BLOCKING;
while (true)
{
    Selectable *temps;
    // s is fpmsyncd's swss::Select instance, set up with its event sources.
    auto ret = s.select(&temps, select_timeout);
    // ... handle the selected event (e.g. incoming FPM routes) ...
    if (select_timeout == FLUSH_TIMEOUT && ret == Select::TIMEOUT) {
        // Nothing arrived within FLUSH_TIMEOUT after a skipped flush:
        // force a flush so no commands linger in the pipeline.
        flushPPLine(&pipeline, FLUSH_TIMEOUT, true);
        select_timeout = BLOCKING;
    } else if (flushPPLine(&pipeline, FLUSH_TIMEOUT)) {
        // The flush was skipped; arm a select timeout so the lingering
        // commands are eventually flushed even without new events.
        select_timeout = FLUSH_TIMEOUT;
    }
}
```
### Lua scripts
<!-- omit in toc -->
#### sonic-swss-common/common/producerstatetable.cpp
Take _luaSet_ for example. The original script:
```lua
local added = redis.call('SADD', KEYS[2], ARGV[2])
for i = 0, #KEYS - 3 do
    redis.call('HSET', KEYS[3 + i], ARGV[3 + i * 2], ARGV[4 + i * 2])
end
if added > 0 then
    redis.call('PUBLISH', KEYS[1], ARGV[1])
end
```
changed to:
```lua
local added = redis.call('SADD', KEYS[2], ARGV[2])
for i = 0, #KEYS - 3 do
    redis.call('HSET', KEYS[3 + i], ARGV[3 + i * 2], ARGV[4 + i * 2])
end
```
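The same modification applies to _luaDel_ for the Redis `DEL` operation. Note that the original scripts keep working for the other modules; only the fpmsyncd path is modified. With this change, `SET`/`DEL` is decoupled from `PUBLISH`: the single `PUBLISH` is instead bound to fpmsyncd's flush in `RedisPipeline->flush()`, so the subscribers get notified each time fpmsyncd flushes data to `APPL_DB`.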

<!-- omit in toc -->
#### sonic-swss-common/common/consumer_state_table_pops.lua

We remove the `DEL` and `HSET` operations from the original script, which optimizes `Table->pops()`:
```lua
redis.replicate_commands()
local ret = {}
local tablename = KEYS[2]
local keys = redis.call('SPOP', KEYS[1], ARGV[1])

if keys and #keys > 0 then
    local n = #keys
    for i = 1, n do
        local key = keys[i]
        local fieldvalues = redis.call('HGETALL', tablename..key)
        table.insert(ret, {key, fieldvalues})
    end
end

return ret
```
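**NOTE:** As measured on the earlier draft, this change doubles the performance of `Table->pops()`, making routing from `fpmsyncd` to `orchagent` via APPL_DB roughly 10% faster; the script change is limited to the `routeorch` module.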

## WarmRestart scenario
This proposal takes compatibility with the SONiC `WarmRestart` feature into account. For example, when a user updates the config, a warm restart may be needed for the update to take effect. SONiC's main thread calls the `dumpPendingTasks()` function to save the current system state and restore it after the warm restart. Since this HLD introduces a new thread and a new `ring buffer` structure that holds data, we have to ensure that all the data in the `ring buffer` is processed before the warm restart. During warm start, the main thread modifies the variable `m_toSync`, which the new thread also accesses; therefore the new thread should be blocked during warm restart to avoid conflicts.

Take orchagent for example: we need to make sure the ring buffer is empty and the new thread is idle before we call `dumpPendingTasks()`.
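A minimal sketch of that check, assuming the ring-buffer sketch above and a hypothetical idle flag maintained by the consumer thread:

```c++
#include <atomic>
#include <chrono>
#include <thread>

std::atomic<bool> gConsumerIdle{false};   // set by the ring-buffer consumer

// Block until the ring buffer is drained and the consumer thread is idle,
// so that dumpPendingTasks() observes a consistent m_toSync.
template <typename Ring>
void waitUntilQuiescent(const Ring& ring) {
    while (!ring.empty() || !gConsumerIdle.load(std::memory_order_acquire))
        std::this_thread::sleep_for(std::chrono::milliseconds(1));
}
```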

### Multi-threaded orchagent with a ring buffer

New pthread in orchagent
<figcaption>Figure 10. Async sairedis workflow</figcaption>
</figure>
## Testing Requirements/Design
### System test