
accelerate deepspeed and gradient accumulation integrate #23236

Merged: 37 commits merged into main from smangrul/accelerate-deepspeed-integrate on May 31, 2023

Conversation

@pacman100 (Contributor) commented May 9, 2023

What does this PR do?

  1. Shifts the DeepSpeed integration to Accelerate.
  2. Shifts gradient accumulation to Accelerate (a minimal usage sketch follows this list).
  3. To be merged after "shift torch dynamo handling to accelerate" #23168.
  4. No user-facing change. Users can now use `accelerate launch` with the Trainer for DeepSpeed, e.g.:
accelerate launch --num_processes=2 --mixed_precision=bf16 --use_deepspeed --gradient_accumulation_steps=1 --gradient_clipping=1 --zero3_init_flag=True --zero3_save_16bit_model=False --zero_stage=3 --offload_optimizer_device=none --offload_param_device=none ./examples/pytorch/text-classification/run_glue.py  --model_name_or_path bert-base-cased   --task_name $TASK_NAME   --do_train   --do_eval   --max_seq_length 128   --per_device_train_batch_size 16   --learning_rate 5e-5   --num_train_epochs 3   --output_dir /tmp/$TASK_NAME/ --overwrite_output_dir --bf16

The usual run using torchrun and Trainer args is unaffected:

torchrun --nnodes 1 --nproc-per-node 2 ./examples/pytorch/text-classification/run_glue.py  --model_name_or_path bert-base-cased   --task_name $TASK_NAME   --do_train   --do_eval   --max_seq_length 128   --per_device_train_batch_size 16   --learning_rate 5e-5   --num_train_epochs 3   --output_dir /tmp/$TASK_NAME/ --overwrite_output_dir --deepspeed ~/transformers/tests/deepspeed/ds_config_zero2.json
  5. Save and load utils are changed accordingly.
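
To illustrate points 1, 2, and 5, here is a minimal, hypothetical sketch (not the actual Trainer implementation) of what deferring DeepSpeed and gradient accumulation to Accelerate looks like; the toy model, optimizer, and dataset below are placeholders added for this example:

```python
# Hypothetical, simplified sketch (not the actual Trainer code) of a training
# loop that defers gradient accumulation and DeepSpeed to Accelerate.
import torch
from torch.utils.data import DataLoader, TensorDataset

from accelerate import Accelerator

# Toy placeholders standing in for the real model, optimizer, and data.
model = torch.nn.Linear(128, 2)
optimizer = torch.optim.AdamW(model.parameters(), lr=5e-5)
dataset = TensorDataset(torch.randn(64, 128), torch.randint(0, 2, (64,)))
dataloader = DataLoader(dataset, batch_size=16)

# When the script is started with `accelerate launch --use_deepspeed ...`, the
# Accelerator picks up the DeepSpeed plugin from the launcher's environment;
# plain python/torchrun runs fall back to DDP or single-process training.
accelerator = Accelerator(gradient_accumulation_steps=4)
model, optimizer, dataloader = accelerator.prepare(model, optimizer, dataloader)

model.train()
for inputs, labels in dataloader:
    # `accumulate` tracks the accumulation boundary, so the loop does not need
    # manual `step % gradient_accumulation_steps` bookkeeping.
    with accelerator.accumulate(model):
        loss = torch.nn.functional.cross_entropy(model(inputs), labels)
        accelerator.backward(loss)  # routes to the DeepSpeed engine when active
        optimizer.step()
        optimizer.zero_grad()
```

Because the same `Accelerator` object builds the DeepSpeed engine, the save/load utilities mentioned in point 5 can also route through Accelerate (for example, consolidating ZeRO-3 shards via `accelerator.get_state_dict(model)`) instead of Trainer-specific DeepSpeed code.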

@pacman100 pacman100 requested review from sgugger and muellerzr and removed request for sgugger May 9, 2023 15:11
@HuggingFaceDocBuilderDev commented May 9, 2023

The documentation is not available anymore as the PR was closed or merged.

@muellerzr (Contributor) left a comment

Thanks for working on this! Looks great, once tests pass fully :)

@pacman100 pacman100 changed the base branch from smangrul/accelerate-dynamo-integrate to main May 10, 2023 04:10
@pacman100 pacman100 changed the base branch from main to smangrul/accelerate-dynamo-integrate May 10, 2023 04:14
@pacman100 pacman100 changed the base branch from smangrul/accelerate-dynamo-integrate to main May 10, 2023 04:15
@pacman100 pacman100 changed the base branch from main to smangrul/accelerate-dynamo-integrate May 10, 2023 04:15
@muellerzr (Contributor) left a comment

LG2M, one clarification which indeed is true

src/transformers/trainer.py (review thread: resolved)
@sgugger (Collaborator) left a comment

Thanks for working on this. Is the diff longer than expected because of other PRs to be merged before?

Might be cool to have Stas take a look (not pinging him here too early) once this is ready to merge and tests are confirmed to all pass.

@pacman100 (Contributor, Author) commented:
> Thanks for working on this. Is the diff longer than expected because of other PRs to be merged before?

Due to updating from main, it is not showing the diff wrt previous branches. Weird.

> Might be cool to have Stas take a look (not pinging him here too early) once this is ready to merge and tests are confirmed to all pass.

Yes, definitely. All tests are passing already. Checked the slow tests offline.

@pacman100 pacman100 changed the base branch from smangrul/accelerate-dynamo-integrate to main May 10, 2023 19:07
@pacman100 pacman100 changed the base branch from main to smangrul/accelerate-dynamo-integrate May 10, 2023 19:08
@pacman100 (Contributor, Author) commented:
@sgugger, the diff is now specific only to the DeepSpeed changes + gradient accumulation changes + saving/loading changes w.r.t. the previous PR.

@pacman100 pacman100 requested a review from stas00 May 10, 2023 19:09
@pacman100 (Contributor, Author) commented:
Hello @stas00, please review this PR, which aims to shift the DeepSpeed handling in the Trainer to Accelerate. Thank you!

@stas00 (Contributor) left a comment

@pacman100, I think the main concern is that this PR appears to break backward compatibility (BC) in at least a few places - please correct me if I'm wrong. I think those cases are super minor and probably won't cause many breakages for users, if any. I'll leave it up to you to decide.

As I don't know all the nuances of Accelerate's Deepspeed integration I can't do a detailed review, but if all the SLOW tests pass it should be good.

Request: as Accelerate is taking over, please remove me from the issue/PR templates as the DeepSpeed integration maintainer while you're at it, since I won't be able to support users any longer.

Thank you!

docs/source/en/main_classes/deepspeed.mdx (review thread: outdated, resolved)
src/transformers/deepspeed.py (review thread: resolved)
@muellerzr (Contributor) left a comment

Thanks for doing this! Looks great

Base automatically changed from smangrul/accelerate-dynamo-integrate to main May 31, 2023 09:12
@pacman100 pacman100 changed the title Smangrul/accelerate deepspeed integrate accelerate deepspeed and gradient accumulation integrate May 31, 2023
@pacman100 pacman100 merged commit a73b1d5 into main May 31, 2023
@pacman100 pacman100 deleted the smangrul/accelerate-deepspeed-integrate branch May 31, 2023 09:46
sheonhan pushed a commit to sheonhan/transformers that referenced this pull request Jun 1, 2023
accelerate deepspeed and gradient accumulation integrate (#23236)

* mixed precision support via accelerate

* fix issues

* fix for the sharded ddp case

* fix flax and tf failing tests

* refactor the place to create `Accelerator` object

* move ddp prep to accelerate

* fix 😅

* resolving comments

* move fsdp handling to accelerate

* fixes

* fix saving

* shift torch dynamo handling to accelerate

* shift deepspeed integration and save & load utils to accelerate

* fix accelerate launcher support

* oops

* fix 🐛

* save ckpt fix

* Trigger CI

* nasty 🐛 😅

* as deepspeed needs grad_acc fixes, transfer grad_acc to accelerate

* make tests happy

* quality ✨

* loss tracked needs to account for grad_acc

* fixing the deepspeed tests

* quality ✨

* 😅😅😅

* tests 😡

* quality ✨

* Trigger CI

* resolve comments and fix the issue with the previous merge from branch

* Trigger CI

* accelerate took over deepspeed integration

---------

Co-authored-by: Stas Bekman <stas@stason.org>
gojiteji pushed a commit to gojiteji/transformers that referenced this pull request Jun 5, 2023
novice03 pushed a commit to novice03/transformers that referenced this pull request Jun 23, 2023