[2.4] Add reliable xgboost documentation #2471

Merged
Binary file added docs/resources/fed_xgb_detail.png
Binary file added docs/resources/loose_xgb.png
Binary file added docs/resources/tight_xgb.png
1 change: 1 addition & 0 deletions docs/user_guide.rst
@@ -23,3 +23,4 @@ please refer to the :ref:`programming_guide`.
user_guide/helm_chart
user_guide/confidential_computing
user_guide/hierarchy_unification_bridge
user_guide/federated_xgboost
26 changes: 26 additions & 0 deletions docs/user_guide/federated_xgboost.rst
@@ -0,0 +1,26 @@
##############################
Federated XGBoost with NVFlare
##############################

XGBoost (https://github.com/dmlc/xgboost) is an open-source project that
implements machine learning algorithms under the Gradient Boosting framework.
It is an optimized distributed gradient boosting library designed to be highly
efficient, flexible, and portable.
This implementation uses MPI (message passing interface) for client
communication and synchronization.

MPI requires the underlying communication network to be perfect: a single
dropped message causes the training to fail.

This reliability is usually achieved with a special-purpose communication stack such as NCCL.

The open-source XGBoost supports a federated paradigm, where clients are in different
locations and communicate with each other via gRPC over internet connections.

We introduce federated XGBoost with NVFlare for a more reliable federated setup.

.. toctree::
   :maxdepth: 1

   federated_xgboost/implementation
   federated_xgboost/timeout
65 changes: 65 additions & 0 deletions docs/user_guide/federated_xgboost/implementation.rst
@@ -0,0 +1,65 @@
#################################
Reliable Federated XGBoost Design
#################################


*************************
Flare as XGBoost Launcher
*************************

NVFLARE serves as a launchpad to start the XGBoost system.
Once started, the XGBoost system runs independently of FLARE,
as illustrated in the following figure.

.. figure:: ../../resources/loose_xgb.png
   :height: 500px

There are a few potential problems with this approach:

- As noted above, MPI requires a perfect communication network,
  whereas plain gRPC over the internet can be unstable.

- For each job, the XGBoost Server must open a port for clients to connect to.
  In real-world deployments, this adds the burden of requesting an additional port from IT.
  Even if a fixed port is allowed to be opened and reused,
  multiple XGBoost jobs cannot run at the same time,
  since each XGBoost job requires its own port.


*****************************
Flare as XGBoost Communicator
*****************************

FLARE provides a highly flexible, scalable and reliable communication mechanism.
We enhance the reliability of federated XGBoost by using FLARE as the communicator of XGBoost,
as shown here:

.. figure:: ../../resources/tight_xgb.png
   :height: 500px

Detailed Design
===============

The open-source Federated XGBoost (c++) uses gRPC as the communication protocol.
To use FLARE as the communicator, we simply route XGBoost's gRPC messages through FLARE.
To do so, we change the server endpoint of each XGBoost client to a local gRPC server
(LGS) within the FLARE client.

.. figure:: ../../resources/fed_xgb_detail.png
   :height: 500px

As shown in this diagram, there is a local gRPC server (LGS) for each site
that serves as the server endpoint for the XGBoost client on that site.
Similarly, there is a local gRPC client (LGC) on the FL Server that
interacts with the XGBoost Server. The message path between the XGBoost Client and
the XGBoost Server is as follows:

1. The XGBoost client generates a gRPC message and sends it to the LGS in the FLARE Client.
2. The FLARE Client forwards the message to the FLARE Server. This is a reliable FLARE message.
3. The FLARE Server uses the LGC to send the message to the XGBoost Server.
4. The XGBoost Server sends the response back to the LGC in the FLARE Server.
5. The FLARE Server sends the response back to the FLARE Client.
6. The FLARE Client sends the response back to the XGBoost Client via the LGS.
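The six-step round trip can be sketched in plain Python (an illustrative toy, not the actual NVFlare API; the function names are hypothetical stand-ins for the real gRPC/FLARE components):

```python
# Minimal sketch of the 6-step relay between XGBoost Client and XGBoost Server.

def xgb_server(msg):
    # Step 4: the XGBoost Server processes the request and replies.
    return {"reply_to": msg["op"], "status": "ok"}

def lgc_send(msg):
    # Step 3: the LGC on the FLARE Server talks to the XGBoost Server.
    return xgb_server(msg)

def flare_server_handle(msg):
    # Steps 3-5: the FLARE Server relays via the LGC and returns the response.
    return lgc_send(msg)

def flare_client_send(msg):
    # Steps 2 and 5: a reliable FLARE message between FLARE Client and FLARE Server.
    return flare_server_handle(msg)

def lgs_handle(msg):
    # Steps 1 and 6: the LGS is the XGBoost client's local server endpoint.
    return flare_client_send(msg)

# Step 1: the XGBoost client sends its gRPC message to the local LGS;
# the response travels back along the same path (steps 4-6).
response = lgs_handle({"op": "Allreduce"})
```

The key point the sketch illustrates is that only the FLARE Client/Server hop crosses the internet; the LGS and LGC connections are local.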

Please note that the XGBoost Client (C++) component can run as a separate process
or within the same process as the FLARE Client.
93 changes: 93 additions & 0 deletions docs/user_guide/federated_xgboost/timeout.rst
@@ -0,0 +1,93 @@
############################################
Reliable Federated XGBoost Timeout Mechanism
############################################

NVFlare introduces a tightly-coupled integration between XGBoost and NVFlare.
NVFlare implements the ReliableMessage mechanism to make XGBoost’s server/client
interactions more robust over unstable internet connections.

An unstable internet connection is one where the connections between
the communication endpoints suffer random disconnects/reconnects and fluctuating speed.
It does not mean an extended internet outage.

ReliableMessage does not mean guaranteed delivery.
It means only that it tries its best to deliver the message to the peer.
If one attempt fails, it keeps trying until either the message is
successfully delivered or a specified "transaction timeout" is reached.
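Conceptually, the retry behavior can be sketched as follows (a simplified illustration, not NVFlare's actual ReliableMessage code; ``send_once`` is a hypothetical one-shot sender):

```python
import time

def reliable_send(send_once, msg, per_msg_timeout, tx_timeout, retry_interval=0.1):
    """Resend msg until it succeeds or tx_timeout elapses.

    send_once(msg, timeout) is a hypothetical one-shot sender that
    returns a reply or raises on failure.
    """
    if tx_timeout <= per_msg_timeout:
        # Degenerates to simple messaging: one attempt, no retry.
        return send_once(msg, per_msg_timeout)
    deadline = time.monotonic() + tx_timeout
    last_error = None
    while time.monotonic() < deadline:
        try:
            return send_once(msg, per_msg_timeout)
        except Exception as e:
            last_error = e               # transient failure: retry until deadline
            time.sleep(retry_interval)
    raise TimeoutError(f"transaction timed out after {tx_timeout}s") from last_error
```

The degenerate branch reflects the documented behavior that a transaction timeout no larger than the per-message timeout results in simple messaging with no retries.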

*****************
Timeout Mechanism
*****************

At runtime, the FLARE system is configured with a few important timeout parameters.

ReliableMessage Timeout
=======================

There are two timeout values to control the behavior of ReliableMessage (RM).

Per-message Timeout
-------------------

Essentially, RM resends the message until it is delivered successfully.
Each resend of the message is governed by this timeout value.
The value should be chosen based on the message size, overall network speed,
and the amount of time needed to process the message in a normal situation.
For example, if an XGBoost message takes no more than 5 seconds to be
sent, processed, and replied to, then the per-message timeout should be
set to 5 seconds.

.. note::

   Note that the initial XGBoost message might take more than 100 seconds,
   depending on the dataset size.

Transaction Timeout
-------------------

This value defines how long you want RM to keep retrying until done, in case
of an unstable connection.
It should be chosen based on the overall stability of the connection,
the nature of the outages, and how quickly connections are restored.
For occasional connection glitches, this value does not have to be large
(e.g. 20 seconds).
However, if outages can last longer (say 60 seconds or more), then this value
must be large enough to cover them.

.. note::

   Note that even if you think the connection has been restored (e.g. you
   replugged the network cable or reactivated Wi-Fi), the underlying connection
   layer may take much longer to actually restore connections (e.g. up to
   a few minutes)!

.. note::

   If the transaction timeout is less than or equal to the per-message
   timeout, the message is sent through simple messaging: no retry is done
   in case of failure.

XGBoost Client Operation Timeout
================================

To prevent an XGBoost client from running forever, the XGBoost/FLARE
integration lets you define a server-side parameter (``max_client_op_interval``)
that controls the maximum amount of time a client is permitted to be
silent (i.e. send no messages to the server).
The default value of this parameter is 900 seconds, meaning that if no XGBoost
message is received from a client for over 900 seconds, that client
is considered dead, and the whole job is aborted.
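The server-side silence check can be pictured with a small sketch (illustrative only, not NVFlare's implementation; the class name is made up):

```python
import time

class ClientLivenessMonitor:
    """Track the last time each client sent a message; report a client as
    dead if it stays silent longer than max_client_op_interval."""

    def __init__(self, max_client_op_interval=900.0, clock=time.monotonic):
        self.max_interval = max_client_op_interval
        self.clock = clock            # injectable clock makes this testable
        self.last_seen = {}

    def on_message(self, client_name):
        # Called whenever any XGBoost message arrives from the client.
        self.last_seen[client_name] = self.clock()

    def dead_clients(self):
        # Clients silent for longer than the allowed interval.
        now = self.clock()
        return [name for name, t in self.last_seen.items()
                if now - t > self.max_interval]
```

In the real integration, detecting any dead client aborts the whole job rather than just that client.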

***************************
Configure Timeouts Properly
***************************

These timeout values are related. For example, if the transaction timeout
is greater than the XGBoost client operation timeout, the extra retry time is
wasted, since the server will treat the client as dead once that operation
timeout is reached anyway.

In general, follow this rule:

Per-message Timeout < Transaction Timeout < XGBoost Client Operation Timeout
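A quick sanity check of this ordering might look like the following (an illustrative helper, not part of NVFlare; the parameter names follow the config keys used in the examples: ``per_msg_timeout``, ``tx_timeout``, ``max_client_op_interval``):

```python
def check_timeouts(per_msg_timeout, tx_timeout, max_client_op_interval=900):
    """Enforce: per-message timeout < transaction timeout < client op timeout."""
    if not per_msg_timeout < tx_timeout:
        raise ValueError(
            "tx_timeout should exceed per_msg_timeout; otherwise ReliableMessage "
            "falls back to simple messaging with no retries")
    if not tx_timeout < max_client_op_interval:
        raise ValueError(
            "tx_timeout should stay below max_client_op_interval; otherwise the "
            "server may declare the client dead while retries are still in progress")
    return True
```

For instance, the example configs in this PR use ``per_msg_timeout = 100`` and ``tx_timeout = 500``, which satisfies the rule with the default 900-second client operation interval.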
36 changes: 18 additions & 18 deletions examples/advanced/xgboost/histogram-based/README.md
@@ -28,27 +28,15 @@ Model accuracy can be visualized in tensorboard:
tensorboard --logdir /tmp/nvflare/xgboost_v2_workspace/simulate_job/tb_events
```

### Run federated experiments in real world
## Timeout configuration

To run in a federated setting, follow [Real-World FL](https://nvflare.readthedocs.io/en/main/real_world_fl.html) to
start the overseer, FL servers and FL clients.

You need to download the HIGGS data on each client site.
You will also need to install the xgboost on each client site and server site.

You can still generate the data splits and job configs using the scripts provided.

You will need to copy the generated data split file into each client site.
You might also need to modify the `data_path` in the `data_site-XXX.json`
inside the `/tmp/nvflare/xgboost_higgs_dataset` folder,
since each site might save the HIGGS dataset in different places.

Then you can use admin client to submit the job via `submit_job` command.
Please refer to [Reliable Federated XGBoost Timeout Mechanism](https://nvflare.readthedocs.io/en/2.4/user_guide/reliable_xgboost.html)

## Customization

The provided XGBoost executor can be customized using Boost parameters
provided in `xgb_params` argument.
The provided FedXGBHistogramExecutor can be customized by passing
[xgboost parameters](https://xgboost.readthedocs.io/en/stable/parameter.html)
in the `xgb_params` argument.

If the parameter change alone is not sufficient and code changes are required,
a custom executor can be implemented to make calls to xgboost library directly.
@@ -59,13 +47,25 @@ overwrite the `xgb_train()` method.
To use another dataset, inherit the base class `XGBDataLoader` and
implement the `load_data()` method.

## Run in real world

To run in a federated setting, follow [Real-World FL](https://nvflare.readthedocs.io/en/main/real_world_fl.html) to
start the overseer, FL servers and FL clients.

1. Each participating site needs to install xgboost and nvflare.
2. Each participating site needs its own data loader, or can use the same
   data loader with a different data location
   (refer to higgs_data_loader.py as an example for writing one for your own data).

Then you can use admin client to submit the job via `submit_job` command.

## GPU support
By default, CPU-based training is used.

If CUDA is installed on the site, tree construction and prediction can be
accelerated using GPUs.

To enable GPU accelerated training, in `config_fed_client` set the args of
`FedXGBHistogramExecutor` to `"use_gpus": true` and set `"tree_method": "hist"`
in `xgb_params`.
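For example, the executor args might look like this (a sketch adapted from the histogram-based example's config; other args omitted):

```json
"args": {
  "use_gpus": true,
  "xgb_params": {
    "eval_metric": "auc",
    "tree_method": "hist",
    "nthread": 16
  }
}
```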

@@ -1,6 +1,5 @@
{
"format_version": 2,
"num_rounds": 100,
"executors": [
{
"tasks": [
@@ -20,7 +19,9 @@
"eval_metric": "auc",
"tree_method": "hist",
"nthread": 16
}
},
"per_msg_timeout": 100,
"tx_timeout": 500
}
}
}
2 changes: 2 additions & 0 deletions job_templates/vertical_xgb/config_fed_client.conf
@@ -24,6 +24,8 @@ executors = [
use_gpus = false
metrics_writer_id = "metrics_writer"
model_file_name = "test.model.json"
per_msg_timeout = 100
tx_timeout = 500
}
}
}