[2.4] Add reliable xgboost documentation #2471

Merged
Binary file added docs/resources/fed_xgb_detail.png
Binary file added docs/resources/loose_xgb.png
Binary file added docs/resources/tight_xgb.png
1 change: 1 addition & 0 deletions docs/user_guide.rst
@@ -23,3 +23,4 @@ please refer to the :ref:`programming_guide`.
user_guide/helm_chart
user_guide/confidential_computing
user_guide/hierarchy_unification_bridge
user_guide/federated_xgboost
26 changes: 26 additions & 0 deletions docs/user_guide/federated_xgboost.rst
@@ -0,0 +1,26 @@
##############################
Federated XGBoost with NVFlare
##############################

XGBoost (https://github.com/dmlc/xgboost) is an open-source project that
implements machine learning algorithms under the Gradient Boosting framework.
It is an optimized distributed gradient boosting library designed to be highly
efficient, flexible, and portable.
This implementation uses MPI (message passing interface) for client
communication and synchronization.

MPI requires the underlying communication network to be perfect: a single
dropped message causes the training to fail.

This reliability is usually achieved with a special-purpose communication stack such as NCCL.

The open-source XGBoost supports a federated paradigm, where clients are in different
locations and communicate with each other via gRPC over internet connections.

We introduce federated XGBoost with NVFlare for a more reliable federated setup.

.. toctree::
   :maxdepth: 1

   federated_xgboost/implementation
   federated_xgboost/timeout
65 changes: 65 additions & 0 deletions docs/user_guide/federated_xgboost/implementation.rst
@@ -0,0 +1,65 @@
#################################
Reliable Federated XGBoost Design
#################################


*************************
Flare as XGBoost Launcher
*************************

NVFLARE serves as a launchpad to start the XGBoost system.
Once started, the XGBoost system runs independently of FLARE,
as illustrated in the following figure.

.. figure:: ../../resources/loose_xgb.png
   :height: 500px

There are a few potential problems with this approach:

- As noted above, MPI requires a perfect communication network,
  whereas plain gRPC over the internet can be unstable.

- For each job, the XGBoost Server must open a port for clients to connect to.
  In real-world deployments, this adds the burden of requesting an additional port from IT.
  Even if a fixed port is allowed to be opened and reused,
  multiple XGBoost jobs cannot run at the same time,
  since each XGBoost job requires its own port.


*****************************
Flare as XGBoost Communicator
*****************************

FLARE provides a highly flexible, scalable and reliable communication mechanism.
We enhance the reliability of federated XGBoost by using FLARE as the communicator of XGBoost,
as shown here:

.. figure:: ../../resources/tight_xgb.png
   :height: 500px

Detailed Design
===============

The open-source Federated XGBoost (c++) uses gRPC as the communication protocol.
To use FLARE as the communicator, we simply route XGBoost's gRPC messages through FLARE.
To do so, we change the server endpoint of each XGBoost client to a local gRPC server
(LGS) within the FLARE client.

.. figure:: ../../resources/fed_xgb_detail.png
   :height: 500px

As shown in this diagram, there is a local gRPC server (LGS) for each site
that serves as the server endpoint for the XGBoost client on that site.
Similarly, there is a local gRPC client (LGC) on the FL Server that
interacts with the XGBoost Server. The message path between the XGBoost Client and
the XGBoost Server is as follows:

1. The XGBoost client generates a gRPC message and sends it to the LGS in the FLARE Client.
2. The FLARE Client forwards the message to the FLARE Server. This is a reliable FLARE message.
3. The FLARE Server uses the LGC to send the message to the XGBoost Server.
4. The XGBoost Server sends the response back to the LGC in the FLARE Server.
5. The FLARE Server sends the response back to the FLARE Client.
6. The FLARE Client sends the response back to the XGBoost Client via the LGS.
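The six-step round trip can be sketched in plain Python (an illustrative toy, not the actual NVFlare API; the function names are hypothetical stand-ins for the real gRPC/FLARE components):

```python
# Minimal sketch of the 6-step relay between XGBoost Client and XGBoost Server.

def xgb_server(msg):
    # Step 4: the XGBoost Server processes the request and replies.
    return {"reply_to": msg["op"], "status": "ok"}

def lgc_send(msg):
    # Step 3: the LGC on the FLARE Server talks to the XGBoost Server.
    return xgb_server(msg)

def flare_server_handle(msg):
    # Steps 3-5: the FLARE Server relays via the LGC and returns the response.
    return lgc_send(msg)

def flare_client_send(msg):
    # Steps 2 and 5: a reliable FLARE message between FLARE Client and FLARE Server.
    return flare_server_handle(msg)

def lgs_handle(msg):
    # Steps 1 and 6: the LGS is the XGBoost client's local server endpoint.
    return flare_client_send(msg)

# Step 1: the XGBoost client sends its gRPC message to the local LGS;
# the response travels back along the same path (steps 4-6).
response = lgs_handle({"op": "Allreduce"})
```

The key point the sketch illustrates is that only the FLARE Client/Server hop crosses the internet; the LGS and LGC connections are local.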

Please note that the XGBoost Client (C++) component can run as a separate process
or within the same process as the FLARE Client.
93 changes: 93 additions & 0 deletions docs/user_guide/federated_xgboost/timeout.rst
@@ -0,0 +1,93 @@
############################################
Reliable Federated XGBoost Timeout Mechanism
############################################

NVFlare introduces a tightly-coupled integration between XGBoost and NVFlare.
NVFlare implements the ReliableMessage mechanism to make XGBoost’s server/client
interactions more robust over unstable internet connections.

An unstable internet connection is one where the connections between
the communication endpoints suffer random disconnects/reconnects and fluctuating speed.
It does not mean an extended internet outage.

ReliableMessage does not mean guaranteed delivery.
It means only that it tries its best to deliver the message to the peer.
If one attempt fails, it keeps trying until either the message is
successfully delivered or a specified "transaction timeout" is reached.
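Conceptually, the retry behavior can be sketched as follows (a simplified illustration, not NVFlare's actual ReliableMessage code; ``send_once`` is a hypothetical one-shot sender):

```python
import time

def reliable_send(send_once, msg, per_msg_timeout, tx_timeout, retry_interval=0.1):
    """Resend msg until it succeeds or tx_timeout elapses.

    send_once(msg, timeout) is a hypothetical one-shot sender that
    returns a reply or raises on failure.
    """
    if tx_timeout <= per_msg_timeout:
        # Degenerates to simple messaging: one attempt, no retry.
        return send_once(msg, per_msg_timeout)
    deadline = time.monotonic() + tx_timeout
    last_error = None
    while time.monotonic() < deadline:
        try:
            return send_once(msg, per_msg_timeout)
        except Exception as e:
            last_error = e               # transient failure: retry until deadline
            time.sleep(retry_interval)
    raise TimeoutError(f"transaction timed out after {tx_timeout}s") from last_error
```

The degenerate branch reflects the documented behavior that a transaction timeout no larger than the per-message timeout results in simple messaging with no retries.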

*****************
Timeout Mechanism
*****************

At runtime, the FLARE system is configured with a few important timeout parameters.

ReliableMessage Timeout
=======================

There are two timeout values to control the behavior of ReliableMessage (RM).

Per-message Timeout
-------------------

Essentially, RM resends the message until it is delivered successfully.
Each resend of the message is governed by this timeout value.
The value should be chosen based on the message size, overall network speed,
and the amount of time needed to process the message in a normal situation.
For example, if an XGBoost message takes no more than 5 seconds to be
sent, processed, and replied to, then the per-message timeout should be
set to 5 seconds.

.. note::

   Note that the initial XGBoost message might take more than 100 seconds,
   depending on the dataset size.

Transaction Timeout
-------------------

This value defines how long you want RM to keep retrying until done, in case
of an unstable connection.
It should be chosen based on the overall stability of the connection,
the nature of the outages, and how quickly connections are restored.
For occasional connection glitches, this value does not have to be large
(e.g. 20 seconds).
However, if outages can last longer (say 60 seconds or more), then this value
must be large enough to cover them.

.. note::

   Note that even if you think the connection has been restored (e.g. you
   replugged the network cable or reactivated Wi-Fi), the underlying connection
   layer may take much longer to actually restore connections (e.g. up to
   a few minutes)!

.. note::

   If the transaction timeout is less than or equal to the per-message
   timeout, the message is sent through simple messaging: no retry is done
   in case of failure.

XGBoost Client Operation Timeout
================================

To prevent an XGBoost client from running forever, the XGBoost/FLARE
integration lets you define a server-side parameter (``max_client_op_interval``)
that controls the maximum amount of time a client is permitted to be
silent (i.e. send no messages to the server).
The default value of this parameter is 900 seconds, meaning that if no XGBoost
message is received from a client for over 900 seconds, that client
is considered dead, and the whole job is aborted.
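The server-side silence check can be pictured with a small sketch (illustrative only, not NVFlare's implementation; the class name is made up):

```python
import time

class ClientLivenessMonitor:
    """Track the last time each client sent a message; report a client as
    dead if it stays silent longer than max_client_op_interval."""

    def __init__(self, max_client_op_interval=900.0, clock=time.monotonic):
        self.max_interval = max_client_op_interval
        self.clock = clock            # injectable clock makes this testable
        self.last_seen = {}

    def on_message(self, client_name):
        # Called whenever any XGBoost message arrives from the client.
        self.last_seen[client_name] = self.clock()

    def dead_clients(self):
        # Clients silent for longer than the allowed interval.
        now = self.clock()
        return [name for name, t in self.last_seen.items()
                if now - t > self.max_interval]
```

In the real integration, detecting any dead client aborts the whole job rather than just that client.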

***************************
Configure Timeouts Properly
***************************

These timeout values are related. For example, if the transaction timeout
is greater than the XGBoost client operation timeout, the extra retry time is
wasted, since the server will treat the client as dead once that operation
timeout is reached anyway.

In general, follow this rule:

Per-message Timeout < Transaction Timeout < XGBoost Client Operation Timeout
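A quick sanity check of this ordering might look like the following (an illustrative helper, not part of NVFlare; the parameter names follow the config keys used in the examples: ``per_msg_timeout``, ``tx_timeout``, ``max_client_op_interval``):

```python
def check_timeouts(per_msg_timeout, tx_timeout, max_client_op_interval=900):
    """Enforce: per-message timeout < transaction timeout < client op timeout."""
    if not per_msg_timeout < tx_timeout:
        raise ValueError(
            "tx_timeout should exceed per_msg_timeout; otherwise ReliableMessage "
            "falls back to simple messaging with no retries")
    if not tx_timeout < max_client_op_interval:
        raise ValueError(
            "tx_timeout should stay below max_client_op_interval; otherwise the "
            "server may declare the client dead while retries are still in progress")
    return True
```

For instance, the example configs in this PR use ``per_msg_timeout = 100`` and ``tx_timeout = 500``, which satisfies the rule with the default 900-second client operation interval.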
36 changes: 18 additions & 18 deletions examples/advanced/xgboost/histogram-based/README.md
@@ -28,27 +28,15 @@ Model accuracy can be visualized in tensorboard:
tensorboard --logdir /tmp/nvflare/xgboost_v2_workspace/simulate_job/tb_events
```

### Run federated experiments in real world
## Timeout configuration

To run in a federated setting, follow [Real-World FL](https://nvflare.readthedocs.io/en/main/real_world_fl.html) to
start the overseer, FL servers and FL clients.

You need to download the HIGGS data on each client site.
You will also need to install the xgboost on each client site and server site.

You can still generate the data splits and job configs using the scripts provided.

You will need to copy the generated data split file into each client site.
You might also need to modify the `data_path` in the `data_site-XXX.json`
inside the `/tmp/nvflare/xgboost_higgs_dataset` folder,
since each site might save the HIGGS dataset in different places.

Then you can use admin client to submit the job via `submit_job` command.
Please refer to [Reliable Federated XGBoost Timeout Mechanism](https://nvflare.readthedocs.io/en/2.4/user_guide/reliable_xgboost.html)

## Customization

The provided XGBoost executor can be customized using Boost parameters
provided in `xgb_params` argument.
The provided FedXGBHistogramExecutor can be customized by passing
[xgboost parameters](https://xgboost.readthedocs.io/en/stable/parameter.html)
in the `xgb_params` argument.

If the parameter change alone is not sufficient and code changes are required,
a custom executor can be implemented to make calls to xgboost library directly.
@@ -59,13 +47,25 @@ overwrite the `xgb_train()` method.
To use another dataset, inherit the base class `XGBDataLoader` and
implement the `load_data()` method.

## Run in real world

To run in a federated setting, follow [Real-World FL](https://nvflare.readthedocs.io/en/main/real_world_fl.html) to
start the overseer, FL servers and FL clients.

1. Each participating site needs to install xgboost and nvflare.
2. Each participating site needs its own data loader, or can use the same
   data loader with a different data location
   (refer to higgs_data_loader.py as an example for writing one for your own data).

Then you can use admin client to submit the job via `submit_job` command.

## GPU support
By default, CPU-based training is used.

If CUDA is installed on the site, tree construction and prediction can be
accelerated using GPUs.

To enable GPU accelerated training, in `config_fed_client` set the args of
`FedXGBHistogramExecutor` to `"use_gpus": true` and set `"tree_method": "hist"`
in `xgb_params`.
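For example, the executor args might look like this (a sketch adapted from the histogram-based example's config; other args omitted):

```json
"args": {
  "use_gpus": true,
  "xgb_params": {
    "eval_metric": "auc",
    "tree_method": "hist",
    "nthread": 16
  }
}
```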

@@ -1,6 +1,5 @@
{
"format_version": 2,
"num_rounds": 100,
"executors": [
{
"tasks": [
@@ -20,7 +19,9 @@
"eval_metric": "auc",
"tree_method": "hist",
"nthread": 16
}
},
"per_msg_timeout": 100,
"tx_timeout": 500
}
}
}
2 changes: 2 additions & 0 deletions job_templates/vertical_xgb/config_fed_client.conf
@@ -24,6 +24,8 @@ executors = [
use_gpus = false
metrics_writer_id = "metrics_writer"
model_file_name = "test.model.json"
per_msg_timeout = 100
tx_timeout = 500
}
}
}