
Updated the examples folder readme file #7208

Merged (3 commits) on Jun 14, 2024

Conversation

sitamgithub-MSIT (Contributor):

This PR updates the README file in the examples folder, as requested in the linked issue.

cc: @duncantech @JackCaoG

The following is the git diff for the changed files:
diff --git a/examples/README.md b/examples/README.md
index 1ad0018c..04cc05bf 100644
--- a/examples/README.md
+++ b/examples/README.md
@@ -1,17 +1,42 @@
 ## Overview
-This repo aims to provide some basic examples of how to run an existing pytorch model with PyTorch/XLA. `train_resnet_base.py` is a minimal trainer to run ResNet50 with fake data on a single device. `train_decoder_only_base.py` is similar to `train_resnet_base.py` but with a decoder only model.
+This repo aims to provide some basic examples of how to run an existing PyTorch model with PyTorch/XLA. `train_resnet_base.py` is a minimal trainer to run ResNet50 with fake data on a single device. `train_decoder_only_base.py` is similar to `train_resnet_base.py` but with a decoder-only model.
 
-Other examples will import the `train_resnet_base` or `train_decoder_only_base` and demonstrate how to enable different features(distributed training, profiling, dynamo etc) on PyTorch/XLA.The objective of this repository is to offer fundamental examples of executing an existing PyTorch model utilizing PyTorch/XLA.
+Other examples will import the `train_resnet_base` or `train_decoder_only_base` and demonstrate how to enable different features (distributed training, profiling, dynamo, etc.) on PyTorch/XLA. The objective of this repository is to offer fundamental examples of executing an existing PyTorch model utilizing PyTorch/XLA.
 
 ## Setup
-Follow our [README](https://github.com/pytorch/xla#getting-started) to install latest release of torch_xla. Check out this [link](https://github.com/pytorch/xla#python-packages) for torch_xla at other versions. To install the nightly torchvision(required for the resnet) you can do
+Follow our [README](https://github.com/pytorch/xla#getting-started) to install the latest release of torch_xla. Check out this [link](https://github.com/pytorch/xla#python-packages) for torch_xla at other versions. To install the nightly torchvision(required for the resnet) you can do
 
 ```shell
 pip install --no-deps --pre torchvision -i https://download.pytorch.org/whl/nightly/cu118

 ```

 ## Run the example
-You can run all models directly. Only environment you want to set is `PJRT_DEVICE`.
+You can run all models directly. The only environment you want to set is `PJRT_DEVICE`.

 ```
 PJRT_DEVICE=TPU python fsdp/train_decoder_only_fsdp_v2.py
 ```

+## Examples and Description
+- `train_resnet_base.py`: A minimal example of training ResNet50. This is the baseline example for comparing performance with other training strategies.
+- `train_decoder_only_base.py`: A minimal example of training a decoder-only model. This serves as a baseline for comparison with other training strategies.
+- `train_resnet_amp.py`: Shows how to use Automatic Mixed Precision (AMP) with PyTorch/XLA to improve performance. This example demonstrates the benefits of AMP for reducing memory usage and accelerating training.
+
+- data_parallel: A trainer implementation to run ResNet50 on multiple devices using data-parallel.
+  - `train_resnet_ddp.py`: Shows how to use PyTorch's DDP implementation for distributed training on TPUs. This example showcases how to integrate PyTorch's DDP with PyTorch/XLA for distributed training.
+  - `train_resnet_spmd_data_parallel.py`: Leverages SPMD (Single Program Multiple Data) for distributed training. It shards the batch dimension across multiple devices and demonstrates how to achieve higher performance than DDP for specific workloads.
+  - `train_resnet_xla_ddp.py`: Shows how to use PyTorch/XLA's built-in DDP implementation for distributed training on TPUs. It demonstrates the benefits of distributed training and the simplicity of using PyTorch/XLA's DDP.
+
+- debug: A trainer implementation to run ResNet50 with debug mode.
+  - `train_resnet_profile.py`: Captures performance insights with PyTorch/XLA's profiler to identify bottlenecks. Helps diagnose and optimize model performance.
+  - `train_resnet_benchmark.py`: Provides a simple way to benchmark PyTorch/XLA, measuring device execution and tracing time for overall efficiency analysis.
+
+- flash_attention: A trainer implementation to run a decoder-only model using Flash Attention.
+  - `train_decoder_only_flash_attention.py`: Incorporates flash attention, an efficient attention mechanism, utilizing custom kernels for accelerated training.
+  - `train_resnet_flash_attention_fsdp_v2.py`: Combines flash attention with FSDP, showcasing the integration of custom kernels with FSDP for scalable and efficient model training.
+
+- fsdp: A trainer implementation to run a decoder-only model using FSDP (Fully Sharded Data Parallelism).
+  - `train_decoder_only_fsdp_v2.py`: Employs FSDP for training the decoder-only model, demonstrating parallel training of large transformer models on TPUs.
+  - `train_resnet_fsdp_auto_wrap.py`: Demonstrates FSDP (Fully Sharded Data Parallel) for model training, automatically wrapping model parts based on size or type criteria.
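
For context on the minimal baseline examples described above, a single-device PyTorch/XLA training loop with fake data, in the spirit of `train_resnet_base.py`, might look roughly like the sketch below. This is an illustrative sketch only, not the contents of the actual example file; the step count, batch size, and optimizer settings are placeholders.

```python
# Illustrative sketch only -- not the actual examples/train_resnet_base.py.
# Assumes torch, torchvision, and torch_xla are installed.
import torch
import torch.nn as nn
import torch.optim as optim
import torchvision
import torch_xla.core.xla_model as xm


def train(num_steps=10, batch_size=8):
    device = xm.xla_device()  # the single XLA device (e.g. one TPU core)
    model = torchvision.models.resnet50().to(device)
    optimizer = optim.SGD(model.parameters(), lr=0.01, momentum=0.9)
    loss_fn = nn.CrossEntropyLoss()

    model.train()
    for step in range(num_steps):
        # Fake data, mirroring the "fake data" idea described in the README.
        data = torch.randn(batch_size, 3, 224, 224, device=device)
        target = torch.randint(0, 1000, (batch_size,), device=device)

        optimizer.zero_grad()
        loss = loss_fn(model(data), target)
        loss.backward()
        optimizer.step()
        xm.mark_step()  # cut the lazy graph and dispatch it to the device


if __name__ == "__main__":
    train()
```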


```shell
pip install --no-deps --pre torchvision -i https://download.pytorch.org/whl/nightly/cu118
```

## Run the example
You can run all models directly. The only environment you want to set is `PJRT_DEVICE`.
Collaborator:

Add "variable" after environment for specificity.

Contributor Author:

Done! Updated in the latest commit.
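
As a side note for anyone reproducing the command discussed here: `PJRT_DEVICE` is normally set on the shell command line, but a script can also set it before the XLA runtime is initialized. The sketch below assumes a build where `TPU`, `CUDA`, or `CPU` are valid values for the variable; adjust to whatever backends your install supports.

```python
# Sketch: selecting the PJRT backend from inside a script rather than the shell.
# PJRT_DEVICE must be set before the XLA runtime is initialized.
import os

os.environ.setdefault("PJRT_DEVICE", "TPU")  # or "CPU" / "CUDA", depending on the install

import torch_xla.core.xla_model as xm

print(xm.xla_device())  # e.g. xla:0 once the runtime picks up PJRT_DEVICE
```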

- `train_resnet_flash_attention_fsdp_v2.py`: Combines flash attention with FSDP, showcasing the integration of custom kernels with FSDP for scalable and efficient model training.

- fsdp: A trainer implementation to run a decoder-only model using FSDP (Fully Sharded Data Parallelism).
- `train_decoder_only_fsdp_v2.py`: Employs FSDP for training the decoder-only model, demonstrating parallel training of large transformer models on TPUs.
Collaborator:

Employs FSDP -> Employs FSDPv2 (FSDP algorithm implemented with PyTorch/XLA GSPMD)

Contributor Author:

Completed
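
For readers who have not met the FSDPv2 terminology the reviewer is asking for: FSDPv2 is the FSDP algorithm expressed through PyTorch/XLA's GSPMD sharding machinery. The sketch below shows only the underlying GSPMD primitives (a device mesh plus `mark_sharding`); it is not the FSDPv2 wrapper used by `train_decoder_only_fsdp_v2.py`, and the tensor shape and axis name are placeholders.

```python
# Sketch of the GSPMD primitives that FSDPv2 builds on -- not the FSDPv2 wrapper itself.
import numpy as np
import torch
import torch_xla.core.xla_model as xm
import torch_xla.runtime as xr
import torch_xla.distributed.spmd as xs

xr.use_spmd()  # put the runtime into SPMD mode

# One-dimensional mesh over all addressable devices, with a single 'fsdp' axis.
num_devices = xr.global_runtime_device_count()
mesh = xs.Mesh(np.arange(num_devices), (num_devices,), ('fsdp',))

# Sharding a (hypothetical) parameter along the 'fsdp' axis spreads its rows
# across devices -- the "fully sharded" part of fully sharded data parallelism.
weight = torch.randn(4096, 1024, device=xm.xla_device())
xs.mark_sharding(weight, mesh, ('fsdp', None))
```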

- flash_attention: A trainer implementation to run a decoder-only model using Flash Attention.

- `train_decoder_only_flash_attention.py`: Incorporates flash attention, an efficient attention mechanism, utilizing custom kernels for accelerated training.
- `train_resnet_flash_attention_fsdp_v2.py`: Combines flash attention with FSDP, showcasing the integration of custom kernels with FSDP for scalable and efficient model training.
Collaborator:

with FSDP -> with FSDPv2

Contributor Author:

Done in latest commit
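
On the flash attention side of this thread, the custom-kernel call is, to the best of my understanding, along the lines of the sketch below. Treat the import path and signature of `flash_attention` as assumptions; they may differ between torch_xla releases and from what `train_decoder_only_flash_attention.py` actually uses. Shapes follow the usual (batch, heads, sequence, head_dim) layout and are placeholders.

```python
# Rough sketch only: the flash_attention import path and signature are assumptions
# and may not match the torch_xla release used by the examples.
import torch
import torch_xla.core.xla_model as xm
from torch_xla.experimental.custom_kernel import flash_attention

device = xm.xla_device()
batch, heads, seq_len, head_dim = 4, 8, 1024, 64

q = torch.randn(batch, heads, seq_len, head_dim, device=device)
k = torch.randn(batch, heads, seq_len, head_dim, device=device)
v = torch.randn(batch, heads, seq_len, head_dim, device=device)

# Fused attention via a custom (Pallas) kernel instead of materializing the
# full softmax(q @ k^T) attention matrix.
out = flash_attention(q, k, v, causal=True)
```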

@JackCaoG merged commit ca72f1a into pytorch:master on Jun 14, 2024.