Commit

chore: update readme
NOBLES5E authored Aug 1, 2024
1 parent 90a33ad commit 3e480d2
Showing 1 changed file with 16 additions and 16 deletions.
README.md: 32 changes (16 additions, 16 deletions)
@@ -4,24 +4,24 @@
<hr/>

<div align="center">
<a href="https://tutorials.baguasys.com/"><img src="https://img.shields.io/badge/tutorials-passing-green" alt="tutorials"></a> <a href="http://bagua.readthedocs.io/?badge=latest"><img src="https://readthedocs.org/projects/bagua/badge/?version=latest" alt="Documentation Status"></a> <a href="https://pypi.org/project/bagua/"><img src="https://pepy.tech/badge/bagua/month" alt="Downloads"></a> <a href="https://hub.docker.com/r/baguasys/bagua"><img src="https://img.shields.io/docker/pulls/baguasys/bagua" alt="Docker Pulls"></a> <a href="https://hub.docker.com/r/baguasys/bagua"><img src="https://img.shields.io/docker/cloud/build/baguasys/bagua" alt="Docker Cloud Build Status"></a> <a href="https://github.com/BaguaSys/bagua/blob/master/LICENSE"><img src="https://img.shields.io/github/license/BaguaSys/bagua" alt="GitHub license"></a>
<a href="https://tutorials-8ro.pages.dev/"><img src="https://img.shields.io/badge/tutorials-passing-green" alt="tutorials"></a> <a href="http://bagua.readthedocs.io/?badge=latest"><img src="https://readthedocs.org/projects/bagua/badge/?version=latest" alt="Documentation Status"></a> <a href="https://pypi.org/project/bagua/"><img src="https://pepy.tech/badge/bagua/month" alt="Downloads"></a> <a href="https://hub.docker.com/r/baguasys/bagua"><img src="https://img.shields.io/docker/pulls/baguasys/bagua" alt="Docker Pulls"></a> <a href="https://hub.docker.com/r/baguasys/bagua"><img src="https://img.shields.io/docker/cloud/build/baguasys/bagua" alt="Docker Cloud Build Status"></a> <a href="https://github.com/BaguaSys/bagua/blob/master/LICENSE"><img src="https://img.shields.io/github/license/BaguaSys/bagua" alt="GitHub license"></a>
</div>

<br/>

- *WARNING: THIS PROJECT IS CURRENTLY IN MAINTENANCE MODE, DUE TO COMPANY REORGANIZATION.*
+ *WARNING: THIS PROJECT IS CURRENTLY NOT MAINTAINED, DUE TO COMPANY REORGANIZATION.*

Bagua is a deep learning training acceleration framework for PyTorch developed by [AI platform@Kuaishou Technology](https://www.kuaishou.com/en) and [DS3 Lab@ETH Zürich](https://ds3lab.inf.ethz.ch/). Bagua currently supports:

- - **Advanced Distributed Training Algorithms**: Users can extend training on a single GPU to multiple GPUs (possibly across multiple machines) by simply adding a few lines of code (optionally in [elastic mode](https://tutorials.baguasys.com/elastic-training/)). One prominent feature of Bagua is its flexible system abstraction, which supports state-of-the-art system relaxation techniques for distributed training. So far, Bagua has integrated communication primitives including
-   - Centralized Synchronous Communication (e.g. [Gradient AllReduce](https://tutorials.baguasys.com/algorithms/gradient-allreduce))
-   - Decentralized Synchronous Communication (e.g. [Decentralized SGD](https://tutorials.baguasys.com/algorithms/decentralized))
-   - Low Precision Communication (e.g. [ByteGrad](https://tutorials.baguasys.com/algorithms/bytegrad))
-   - Asynchronous Communication (e.g. [Async Model Average](https://tutorials.baguasys.com/algorithms/async-model-average))
- - [**Cached Dataset**](https://tutorials.baguasys.com/more-optimizations/cached-dataset): When data loading is slow or data preprocessing is tedious, either can become a major bottleneck of the whole training process. Bagua provides a cached dataset to speed up this process by caching data samples in memory, so that reading these samples after the first time becomes much faster.
- - [**TCP Communication Acceleration (Bagua-Net)**](https://tutorials.baguasys.com/more-optimizations/bagua-net): Bagua-Net is a low-level communication acceleration feature provided by Bagua. It can greatly improve the throughput of AllReduce on TCP networks. You can enable Bagua-Net optimization on any distributed training job that uses NCCL to do GPU communication (this includes PyTorch-DDP, Horovod, DeepSpeed, and more).
- - [**Performance Autotuning**](https://tutorials.baguasys.com/performance-autotuning/): Bagua can automatically tune system parameters to achieve the highest throughput.
- - [**Generic Fused Optimizer**](https://tutorials.baguasys.com/more-optimizations/generic-fused-optimizer): Bagua provides a generic fused optimizer which improves performance by fusing the optimizer `.step()` operation across multiple layers. It can be applied to arbitrary PyTorch optimizers, in contrast to [NVIDIA Apex](https://nvidia.github.io/apex/optimizers.html)'s approach, where only some specific optimizers are implemented.
+ - **Advanced Distributed Training Algorithms**: Users can extend training on a single GPU to multiple GPUs (possibly across multiple machines) by simply adding a few lines of code (optionally in [elastic mode](https://tutorials-8ro.pages.dev/elastic-training/)). One prominent feature of Bagua is its flexible system abstraction, which supports state-of-the-art system relaxation techniques for distributed training. So far, Bagua has integrated communication primitives including
+   - Centralized Synchronous Communication (e.g. [Gradient AllReduce](https://tutorials-8ro.pages.dev/algorithms/gradient-allreduce))
+   - Decentralized Synchronous Communication (e.g. [Decentralized SGD](https://tutorials-8ro.pages.dev/algorithms/decentralized))
+   - Low Precision Communication (e.g. [ByteGrad](https://tutorials-8ro.pages.dev/algorithms/bytegrad))
+   - Asynchronous Communication (e.g. [Async Model Average](https://tutorials-8ro.pages.dev/algorithms/async-model-average))
+ - [**Cached Dataset**](https://tutorials-8ro.pages.dev/more-optimizations/cached-dataset): When data loading is slow or data preprocessing is tedious, either can become a major bottleneck of the whole training process. Bagua provides a cached dataset to speed up this process by caching data samples in memory, so that reading these samples after the first time becomes much faster.
+ - [**TCP Communication Acceleration (Bagua-Net)**](https://tutorials-8ro.pages.dev/more-optimizations/bagua-net): Bagua-Net is a low-level communication acceleration feature provided by Bagua. It can greatly improve the throughput of AllReduce on TCP networks. You can enable Bagua-Net optimization on any distributed training job that uses NCCL to do GPU communication (this includes PyTorch-DDP, Horovod, DeepSpeed, and more).
+ - [**Performance Autotuning**](https://tutorials-8ro.pages.dev/performance-autotuning/): Bagua can automatically tune system parameters to achieve the highest throughput.
+ - [**Generic Fused Optimizer**](https://tutorials-8ro.pages.dev/more-optimizations/generic-fused-optimizer): Bagua provides a generic fused optimizer which improves performance by fusing the optimizer `.step()` operation across multiple layers. It can be applied to arbitrary PyTorch optimizers, in contrast to [NVIDIA Apex](https://nvidia.github.io/apex/optimizers.html)'s approach, where only some specific optimizers are implemented.
- [**Load Balanced Data Loader**](https://bagua.readthedocs.io/en/latest/autoapi/bagua/torch_api/contrib/load_balancing_data_loader/index.html): When the computational complexity of samples in the training data varies, for example in NLP and speech tasks where each sample has a different length, distributed training throughput can be greatly improved by using Bagua's load balanced data loader, which distributes samples so that each worker's workload is similar.
- [**Integration with PyTorch Lightning**](https://pytorch-lightning.readthedocs.io/en/latest/accelerators/gpu.html#bagua): Are you using [PyTorch Lightning](https://www.pytorchlightning.ai/) for your distributed training job? Now you can use Bagua in PyTorch Lightning by simply setting `strategy=BaguaStrategy` in your Trainer. This enables you to take advantage of a range of advanced training algorithms, including decentralized methods, asynchronous methods, communication compression, and their combinations! (Minimal sketches of the native API and the Lightning integration follow this list.)
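
The "few lines of code" mentioned in the first bullet look roughly like the following. This is a minimal sketch assembled from the Bagua getting-started tutorial rather than from this commit; the module paths and the toy model are assumptions to verify against the linked docs.

```python
import torch
import bagua.torch_api as bagua
from bagua.torch_api.algorithms import gradient_allreduce


def main():
    # One process per GPU; the Bagua launcher sets the rank environment variables.
    torch.cuda.set_device(bagua.get_local_rank())
    bagua.init_process_group()

    # Toy model and optimizer so the sketch is self-contained.
    model = torch.nn.Linear(10, 1).cuda()
    optimizer = torch.optim.SGD(model.parameters(), lr=0.01)

    # Wrap the model with a Bagua algorithm; the other algorithms listed above
    # (decentralized SGD, ByteGrad, async model averaging) are selected by
    # importing and passing a different algorithm object here.
    model = model.with_bagua(
        [optimizer], gradient_allreduce.GradientAllReduceAlgorithm()
    )

    # ... regular PyTorch training loop goes here ...


if __name__ == "__main__":
    main()
```

Per the tutorials, such a script is started with Bagua's launcher, e.g. `python -m bagua.distributed.launch --nproc_per_node=8 train.py`; the exact flags may vary between releases, so check the getting-started guide.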

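For the PyTorch Lightning path described in the last bullet, Bagua is selected through a Lightning strategy. A minimal sketch, assuming a Lightning release that ships `BaguaStrategy` under `pytorch_lightning.strategies` and accepts an `algorithm` argument (check the Lightning docs linked above for your version):

```python
import torch
from torch.utils.data import DataLoader, TensorDataset
import pytorch_lightning as pl
from pytorch_lightning.strategies import BaguaStrategy


class LitRegressor(pl.LightningModule):
    """Tiny LightningModule so the example is self-contained."""

    def __init__(self):
        super().__init__()
        self.layer = torch.nn.Linear(10, 1)

    def training_step(self, batch, batch_idx):
        x, y = batch
        return torch.nn.functional.mse_loss(self.layer(x), y)

    def configure_optimizers(self):
        return torch.optim.SGD(self.parameters(), lr=0.01)


# Toy dataset for illustration only.
dataset = TensorDataset(torch.randn(256, 10), torch.randn(256, 1))
loader = DataLoader(dataset, batch_size=32)

# Select Bagua as the distributed strategy. The "gradient_allreduce" algorithm
# name is an assumption based on the Bagua docs; other algorithms listed above
# are exposed through the same argument.
trainer = pl.Trainer(
    accelerator="gpu",
    devices=4,
    strategy=BaguaStrategy(algorithm="gradient_allreduce"),
    max_epochs=1,
)
trainer.fit(LitRegressor(), loader)
```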
@@ -30,14 +30,14 @@ Its effectiveness has been evaluated in various scenarios, including VGG and Res
## Links

* [Bagua Main Git Repo](https://github.com/BaguaSys/bagua)
- * [Bagua Tutorials](https://tutorials.baguasys.com/)
+ * [Bagua Tutorials](https://tutorials-8ro.pages.dev/)
* [Bagua Examples](https://github.com/BaguaSys/bagua/tree/master/examples)
* [Bagua API Documentation](https://bagua.readthedocs.io/)

## Performance

<p align="center">
<img src="https://tutorials.baguasys.com/benchmark/figures/e2e_vgg16_128.png" width="600"/>
<img src="https://tutorials-8ro.pages.dev/benchmark/figures/e2e_vgg16_128.png" width="600"/>
</p>
<p align="center">
The performance of different systems and algorithms on VGG16 with 128 GPUs under different network bandwidths.
@@ -47,13 +47,13 @@ Its effectiveness has been evaluated in various scenarios, including VGG and Res
<br/>

<p align="center">
<img src="https://tutorials.baguasys.com/benchmark/figures/tradeoff_network_bert-large-bandwidth.png" width="250"/><img src="https://tutorials.baguasys.com/benchmark/figures/tradeoff_network_bert-large-latency.png" width="250"/><img src="https://tutorials.baguasys.com/benchmark/figures/tradeoff_network_legend.png" width="260"/>
<img src="https://tutorials-8ro.pages.dev/benchmark/figures/tradeoff_network_bert-large-bandwidth.png" width="250"/><img src="https://tutorials-8ro.pages.dev/benchmark/figures/tradeoff_network_bert-large-latency.png" width="250"/><img src="https://tutorials-8ro.pages.dev/benchmark/figures/tradeoff_network_legend.png" width="260"/>
</p>
<p align="center">
Epoch time of BERT-Large fine-tuning under different network conditions for different systems.
</p>

- For more comprehensive and up-to-date results, refer to the [Bagua benchmark page](https://tutorials.baguasys.com/benchmark/index.html).
+ For more comprehensive and up-to-date results, refer to the [Bagua benchmark page](https://tutorials-8ro.pages.dev/benchmark/index.html).

## Installation

@@ -67,7 +67,7 @@ Wheels (precompiled binary packages) are available for Linux (x86_64). Package n
| >= v11.5 | `pip install bagua-cuda115` |
| >= v11.6 | `pip install bagua-cuda116` |

- Add `--pre` to `pip install` commands to install pre-release (development) versions. See the [Bagua tutorials](https://tutorials.baguasys.com/getting-started/) for a quick start guide and more installation options.
+ Add `--pre` to `pip install` commands to install pre-release (development) versions. See the [Bagua tutorials](https://tutorials-8ro.pages.dev/getting-started/) for a quick start guide and more installation options.

## Quick Start on AWS

