
Introduce PartialState as the device handler in the Trainer #22752

Merged: 27 commits merged into main on Apr 17, 2023

Conversation

muellerzr (Contributor)

What does this PR do?

This PR is the start of the Trainer integration of Accelerate.

The integration will follow multiple stages to ensure small changes happen iteratively. This first one simply changes the device handler/setter to be Accelerate's PartialState and nothing more. In a follow-up PR, I will start to include more utilities that use it.
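For illustration, a minimal sketch (not the exact PR code) of how Accelerate's PartialState can resolve the training device; it assumes only that accelerate is installed:

from accelerate import PartialState

state = PartialState()   # sets up the process/distributed state once per process
device = state.device    # e.g. cuda:<local_rank> under DDP, otherwise cuda:0 or cpu
print(device, state.distributed_type)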

Before submitting

  • This PR fixes a typo or improves the docs (you can dismiss the other checks if that's the case).
  • Did you read the contributor guideline,
    Pull Request section?
  • Was this discussed/approved via a GitHub issue or the forum? Please add a link
    to it if that's the case.
  • Did you make sure to update the documentation with your changes? Here are the
    documentation guidelines, and
    here are tips on formatting docstrings.
  • Did you write any new necessary tests?

Who can review?

Anyone in the community is free to review the PR once the tests have passed. Feel free to tag
members/contributors who may be interested in your PR.

@sgugger @pacman100

HuggingFaceDocBuilderDev commented Apr 13, 2023

The documentation is not available anymore as the PR was closed or merged.

pacman100 (Contributor) left a comment

Hello, my main question is: why aren't we creating or using an already passed Accelerator object? My vision was that, after creating the accelerator object here (muellerzr@12dcc13#diff-bfceaff300c851b8e24fc50dc6638482abaec8f7d2a718e877c3828c166bcf79R1489), you can fill in the local rank and all other distributed setup variables from the accelerator state.

muellerzr (Contributor, Author) commented Apr 14, 2023

@pacman100 I'm starting small with what is minimally needed from the API. For example, the AcceleratorState isn't needed until we get into parts such as mixed precision, and the device handling can be done separately altogether since it doesn't need to rely on an Accelerator object the user passes in.

This may eventually become the AcceleratorState, but for now starting with just the PartialState seemed okay to me.
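As a hedged illustration of that distinction (the names come from the accelerate library, not from this PR): PartialState only tracks process and device information, while AcceleratorState additionally tracks settings such as mixed precision.

from accelerate import PartialState

state = PartialState()
# Everything the device handler needs is already here:
print(state.device, state.process_index, state.num_processes, state.distributed_type)
# AcceleratorState (a superset of PartialState) would add e.g. mixed_precision,
# which is only needed for the later stages of the integration.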

sgugger (Collaborator) left a comment

There is a big problem in the changes to the tests with local_rank: local_rank == 0 means the training was launched in a distributed fashion but with only one process (for instance, a setup with multiple nodes and one process per node). Those tests need to be switched to use PartialState.distributed_type and compare it to NO, not to compare local_rank with 0.

@@ -416,8 +416,7 @@ def __init__(
             raise ValueError(
                 "Using --sharded_ddp xxx together with --fsdp is not possible, deactivate one of those flags."
             )

-        if args.local_rank == -1:
+        if args.local_rank == 0:
sgugger (Collaborator) commented on this change:
These checks should be on the distributed mode, i.e. PartialState.distributed_type == DistributedType.NO, as this did not error before when a user launched a distributed training with one process.
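A minimal sketch of the suggested check (assuming accelerate's PartialState and DistributedType), rather than comparing local_rank to 0:

from accelerate import PartialState
from accelerate.utils import DistributedType

state = PartialState()
# True only when training was not launched in a distributed fashion,
# even if it was launched with a single process per node.
not_distributed = state.distributed_type == DistributedType.NO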

sgugger (Collaborator) left a comment

Looking good! Before we merge this, can you run all trainer tests (on single and multi GPU) as well as the DeepSpeed tests to make sure this doesn't break anything?

src/transformers/trainer.py (outdated review comment, resolved)
muellerzr (Contributor, Author):

The PR is good for final review; tested on single GPU, multi GPU, and DeepSpeed, and it seemed to work fine. Ignore the examples_tensorflow failure, as that's actually a good failure (if one could exist).

@muellerzr muellerzr merged commit 0346287 into main Apr 17, 2023
@muellerzr muellerzr deleted the muellerzr-accelerate-device branch April 17, 2023 19:09
rennn2002 commented:

@muellerzr

I don't know if this is the right place to report it, but I think this merge is causing an error in Seq2SeqTrainingArguments.

Before this merge, I was able to run this code:

from transformers import Seq2SeqTrainingArguments

training_args = Seq2SeqTrainingArguments(
    output_dir="./train_test",  # change to a repo name of your choice
    per_device_train_batch_size=8,
    gradient_accumulation_steps=2,  # increase by 2x for every 2x decrease in batch size
    learning_rate=1e-5,
    warmup_steps=5,
    max_steps=40,
    gradient_checkpointing=True,
    fp16=True,
    group_by_length=True,
    evaluation_strategy="steps",
    per_device_eval_batch_size=8,
    predict_with_generate=True,
    generation_max_length=225,
    save_steps=10,
    eval_steps=10,
    logging_steps=25,
    report_to=["tensorboard"],
    load_best_model_at_end=True,
    metric_for_best_model="wer",
    greater_is_better=False,
    push_to_hub=False,
)

Error message:

---------------------------------------------------------------------------
NameError                                 Traceback (most recent call last)
[<ipython-input-25-7693116680cc>](https://localhost:8080/#) in <cell line: 3>()
      1 from transformers import Seq2SeqTrainingArguments
      2 
----> 3 training_args = Seq2SeqTrainingArguments(
      4     output_dir="./whisper-large-ja-7",  # change to a repo name of your choice
      5     per_device_train_batch_size=8,

usr/local/lib/python3.9/dist-packages/transformers/training_args_seq2seq.py in __init__(self, output_dir, overwrite_output_dir, do_train, do_eval, do_predict, evaluation_strategy, prediction_loss_only, per_device_train_batch_size, per_device_eval_batch_size, per_gpu_train_batch_size, per_gpu_eval_batch_size, gradient_accumulation_steps, eval_accumulation_steps, eval_delay, learning_rate, weight_decay, adam_beta1, adam_beta2, adam_epsilon, max_grad_norm, num_train_epochs, max_steps, lr_scheduler_type, warmup_ratio, warmup_steps, log_level, log_level_replica, log_on_each_node, logging_dir, logging_strategy, logging_first_step, logging_steps, logging_nan_inf_filter, save_strategy, save_steps, save_total_limit, save_safetensors, save_on_each_node, no_cuda, use_mps_device, seed, data_seed, jit_mode_eval, use_ipex, bf16, fp16, fp16_opt_level, half_precision_backend, bf16_full_eval, fp16_full_eval, tf32, local_rank, xpu_backend, tpu_num_cores, tpu_metrics_debug, debug, dataloader_drop_last, eval_steps, dataloader_num_workers, past_index, run_name, disable_tqdm, remove_unused_columns, label_names, load_best_model_at_end, metric_for_best_model, greater_is_better, ignore_data_skip, sharded_ddp, fsdp, fsdp_min_num_params, fsdp_config, fsdp_transformer_layer_cls_to_wrap, deepspeed, label_smoothing_factor, optim, optim_args, adafactor, group_by_length, length_column_name, report_to, ddp_find_unused_parameters, ddp_bucket_cap_mb, dataloader_pin_memory, skip_mem...

[/usr/local/lib/python3.9/dist-packages/transformers/training_args.py](https://localhost:8080/#) in __post_init__(self)
   1253             self.framework == "pt"
   1254             and is_torch_available()
-> 1255             and (self.device.type != "cuda")
   1256             and (get_xla_device_type(self.device) != "GPU")
   1257             and (self.fp16 or self.fp16_full_eval)

[/usr/local/lib/python3.9/dist-packages/transformers/training_args.py](https://localhost:8080/#) in device(self)
   1613         """
   1614         requires_backends(self, ["torch"])
-> 1615         return self._setup_devices
   1616 
   1617     @property

[/usr/local/lib/python3.9/dist-packages/transformers/utils/generic.py](https://localhost:8080/#) in __get__(self, obj, objtype)
     52         cached = getattr(obj, attr, None)
     53         if cached is None:
---> 54             cached = self.fget(obj)
     55             setattr(obj, attr, cached)
     56         return cached

[/usr/local/lib/python3.9/dist-packages/transformers/training_args.py](https://localhost:8080/#) in _setup_devices(self)
   1547             device = self.distributed_state.device
   1548         else:
-> 1549             self.distributed_state = PartialState(backend=self.xpu_backend)
   1550             device = self.distributed_state.device
   1551             self._n_gpu = 1

NameError: name 'PartialState' is not defined

amyeroberts (Collaborator) commented:

@rennn2002 Could you try updating your version of accelerate i.e. pip install --upgrade accelerate?
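For reference, a quick way to confirm the environment after upgrading (assumes nothing beyond accelerate itself being importable):

import accelerate
# PartialState is what the new Trainer device handling imports; an ImportError
# here means accelerate is missing or too old.
from accelerate import PartialState

print(accelerate.__version__)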

pacman100 (Contributor) left a comment

Hello, thank you @zach for adding this 🚀. Left a couple of comments.

if (
    torch.distributed.is_available()
    and torch.distributed.is_initialized()
    and self.distributed_state.distributed_type != DistributedType.NO
pacman100 (Contributor) commented on this change:
Should this be self.distributed_state.distributed_type == DistributedType.NO?

muellerzr (Contributor, Author) replied:

Good catch! Fixing in a follow up. Thanks! :)

    and self.distributed_state.distributed_type != DistributedType.NO
):
    logger.warning(
        "torch.distributed process group is initialized, but parallel_mode == ParallelMode.DISTRIBUTED. "
pacman100 (Contributor) commented on this change:
Should this be parallel_mode != ParallelMode.DISTRIBUTED?
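Putting both suggestions together, a hedged sketch of what the corrected check would look like; the standalone function name is illustrative, and the Trainer's distributed_state and logger are passed in rather than reproduced from the actual source:

import torch
from accelerate.utils import DistributedType

def warn_if_process_group_without_distributed_mode(distributed_state, logger):
    # Warn when a torch distributed process group exists but the Trainer's
    # state says "not distributed" (the two fixes suggested above).
    if (
        torch.distributed.is_available()
        and torch.distributed.is_initialized()
        and distributed_state.distributed_type == DistributedType.NO
    ):
        logger.warning(
            "torch.distributed process group is initialized, but parallel_mode != ParallelMode.DISTRIBUTED."
        )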

novice03 pushed a commit to novice03/transformers that referenced this pull request Jun 23, 2023
…ingface#22752)

* Use accelerate for device management

* Add accelerate to setup


Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>