
Introduce PartialState as the device handler in the Trainer #22752

Merged: 27 commits merged into main on Apr 17, 2023

Conversation

muellerzr (Contributor)

What does this PR do?

This PR is the start of the Trainer integration of Accelerate.

The integration will follow multiple stages to ensure small changes happen iteratively. This first one simply changes the device handler/setter to be Accelerate's PartialState and nothing more. In a follow-up PR, I will start to include more utilities that use it.
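For illustration, a minimal sketch (not the exact PR code) of how Accelerate's PartialState can resolve the training device; it assumes only that accelerate is installed:

from accelerate import PartialState

state = PartialState()   # sets up the process/distributed state once per process
device = state.device    # e.g. cuda:<local_rank> under DDP, otherwise cuda:0 or cpu
print(device, state.distributed_type)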

Before submitting

  • This PR fixes a typo or improves the docs (you can dismiss the other checks if that's the case).
  • Did you read the contributor guideline,
    Pull Request section?
  • Was this discussed/approved via a GitHub issue or the forum? Please add a link
    to it if that's the case.
  • Did you make sure to update the documentation with your changes? Here are the
    documentation guidelines, and
    here are tips on formatting docstrings.
  • Did you write any new necessary tests?

Who can review?

Anyone in the community is free to review the PR once the tests have passed. Feel free to tag
members/contributors who may be interested in your PR.

@sgugger @pacman100

HuggingFaceDocBuilderDev commented Apr 13, 2023

The documentation is not available anymore as the PR was closed or merged.

pacman100 (Contributor) left a comment

Hello, my main question is: why aren't we creating or using an already passed Accelerator object? My vision was that, after creating the accelerator object here (muellerzr@12dcc13#diff-bfceaff300c851b8e24fc50dc6638482abaec8f7d2a718e877c3828c166bcf79R1489), you can fill in the local rank and all other distributed setup variables from the accelerator state.

muellerzr (Contributor, Author) commented Apr 14, 2023

@pacman100 I'm starting small with what is minimally needed from the API. For example, the AcceleratorState isn't needed until we get into parts such as mixed precision, and the device handling can be done separately altogether since it doesn't need to rely on an Accelerator object the user passes in.

This may eventually become the AcceleratorState, but for now starting with just the PartialState seemed okay to me.
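As a hedged illustration of that distinction (the names come from the accelerate library, not from this PR): PartialState only tracks process and device information, while AcceleratorState additionally tracks settings such as mixed precision.

from accelerate import PartialState

state = PartialState()
# Everything the device handler needs is already here:
print(state.device, state.process_index, state.num_processes, state.distributed_type)
# AcceleratorState (a superset of PartialState) would add e.g. mixed_precision,
# which is only needed for the later stages of the integration.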

sgugger (Collaborator) left a comment

There is a big problem in the changes to the tests with local_rank: local_rank == 0 means the training was launched in a distributed fashion but with only one process (for instance, a setup with multiple nodes and one process per node). Those tests need to be switched to use PartialState.distributed_type and compare it to NO, not to compare local_rank with 0.

@@ -416,8 +416,7 @@ def __init__(
             raise ValueError(
                 "Using --sharded_ddp xxx together with --fsdp is not possible, deactivate one of those flags."
             )

-        if args.local_rank == -1:
+        if args.local_rank == 0:
sgugger (Collaborator) commented on this change:
These checks should be on the distributed mode, i.e. PartialState.distributed_type == DistributedType.NO, as this did not error before when a user launched a distributed training with one process.
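A minimal sketch of the suggested check (assuming accelerate's PartialState and DistributedType), rather than comparing local_rank to 0:

from accelerate import PartialState
from accelerate.utils import DistributedType

state = PartialState()
# True only when training was not launched in a distributed fashion,
# even if it was launched with a single process per node.
not_distributed = state.distributed_type == DistributedType.NO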

sgugger (Collaborator) left a comment

Looking good! Before we merge this, can you run all trainer tests (on single and multi GPU) as well as the DeepSpeed tests to make sure this doesn't break anything?

src/transformers/trainer.py (outdated review comment, resolved)
muellerzr (Contributor, Author):

The PR is good for final review; tested on single GPU, multi GPU, and DeepSpeed, and it seemed to work fine. Ignore the examples_tensorflow failure, as that's actually a good failure (if one could exist).

@muellerzr muellerzr merged commit 0346287 into main Apr 17, 2023
@muellerzr muellerzr deleted the muellerzr-accelerate-device branch April 17, 2023 19:09
rennn2002 commented:

@muellerzr

I don't know if this is the right place to report it, but I think this merge is causing an error in Seq2SeqTrainingArguments.

Before this merge, I was able to run this code:

from transformers import Seq2SeqTrainingArguments

training_args = Seq2SeqTrainingArguments(
    output_dir="./train_test",  # change to a repo name of your choice
    per_device_train_batch_size=8,
    gradient_accumulation_steps=2,  # increase by 2x for every 2x decrease in batch size
    learning_rate=1e-5,
    warmup_steps=5,
    max_steps=40,
    gradient_checkpointing=True,
    fp16=True,
    group_by_length=True,
    evaluation_strategy="steps",
    per_device_eval_batch_size=8,
    predict_with_generate=True,
    generation_max_length=225,
    save_steps=10,
    eval_steps=10,
    logging_steps=25,
    report_to=["tensorboard"],
    load_best_model_at_end=True,
    metric_for_best_model="wer",
    greater_is_better=False,
    push_to_hub=False,
)

Error message:

---------------------------------------------------------------------------
NameError                                 Traceback (most recent call last)
[<ipython-input-25-7693116680cc>](https://localhost:8080/#) in <cell line: 3>()
      1 from transformers import Seq2SeqTrainingArguments
      2 
----> 3 training_args = Seq2SeqTrainingArguments(
      4     output_dir="./whisper-large-ja-7",  # change to a repo name of your choice
      5     per_device_train_batch_size=8,

usr/local/lib/python3.9/dist-packages/transformers/training_args_seq2seq.py in __init__(self, output_dir, overwrite_output_dir, do_train, do_eval, do_predict, evaluation_strategy, prediction_loss_only, per_device_train_batch_size, per_device_eval_batch_size, per_gpu_train_batch_size, per_gpu_eval_batch_size, gradient_accumulation_steps, eval_accumulation_steps, eval_delay, learning_rate, weight_decay, adam_beta1, adam_beta2, adam_epsilon, max_grad_norm, num_train_epochs, max_steps, lr_scheduler_type, warmup_ratio, warmup_steps, log_level, log_level_replica, log_on_each_node, logging_dir, logging_strategy, logging_first_step, logging_steps, logging_nan_inf_filter, save_strategy, save_steps, save_total_limit, save_safetensors, save_on_each_node, no_cuda, use_mps_device, seed, data_seed, jit_mode_eval, use_ipex, bf16, fp16, fp16_opt_level, half_precision_backend, bf16_full_eval, fp16_full_eval, tf32, local_rank, xpu_backend, tpu_num_cores, tpu_metrics_debug, debug, dataloader_drop_last, eval_steps, dataloader_num_workers, past_index, run_name, disable_tqdm, remove_unused_columns, label_names, load_best_model_at_end, metric_for_best_model, greater_is_better, ignore_data_skip, sharded_ddp, fsdp, fsdp_min_num_params, fsdp_config, fsdp_transformer_layer_cls_to_wrap, deepspeed, label_smoothing_factor, optim, optim_args, adafactor, group_by_length, length_column_name, report_to, ddp_find_unused_parameters, ddp_bucket_cap_mb, dataloader_pin_memory, skip_mem...

[/usr/local/lib/python3.9/dist-packages/transformers/training_args.py](https://localhost:8080/#) in __post_init__(self)
   1253             self.framework == "pt"
   1254             and is_torch_available()
-> 1255             and (self.device.type != "cuda")
   1256             and (get_xla_device_type(self.device) != "GPU")
   1257             and (self.fp16 or self.fp16_full_eval)

[/usr/local/lib/python3.9/dist-packages/transformers/training_args.py](https://localhost:8080/#) in device(self)
   1613         """
   1614         requires_backends(self, ["torch"])
-> 1615         return self._setup_devices
   1616 
   1617     @property

[/usr/local/lib/python3.9/dist-packages/transformers/utils/generic.py](https://localhost:8080/#) in __get__(self, obj, objtype)
     52         cached = getattr(obj, attr, None)
     53         if cached is None:
---> 54             cached = self.fget(obj)
     55             setattr(obj, attr, cached)
     56         return cached

[/usr/local/lib/python3.9/dist-packages/transformers/training_args.py](https://localhost:8080/#) in _setup_devices(self)
   1547             device = self.distributed_state.device
   1548         else:
-> 1549             self.distributed_state = PartialState(backend=self.xpu_backend)
   1550             device = self.distributed_state.device
   1551             self._n_gpu = 1

NameError: name 'PartialState' is not defined

amyeroberts (Collaborator) commented:

@rennn2002 Could you try updating your version of accelerate i.e. pip install --upgrade accelerate?
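For reference, a quick way to confirm the environment after upgrading (assumes nothing beyond accelerate itself being importable):

import accelerate
# PartialState is what the new Trainer device handling imports; an ImportError
# here means accelerate is missing or too old.
from accelerate import PartialState

print(accelerate.__version__)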

pacman100 (Contributor) left a comment

Hello, thank you @zach for adding this 🚀. Left a couple of comments.

if (
    torch.distributed.is_available()
    and torch.distributed.is_initialized()
    and self.distributed_state.distributed_type != DistributedType.NO
pacman100 (Contributor) commented on this change:
Should this be self.distributed_state.distributed_type == DistributedType.NO?

muellerzr (Contributor, Author) replied:

Good catch! Fixing in a follow up. Thanks! :)

    and self.distributed_state.distributed_type != DistributedType.NO
):
    logger.warning(
        "torch.distributed process group is initialized, but parallel_mode == ParallelMode.DISTRIBUTED. "
pacman100 (Contributor) commented on this change:
Should this be parallel_mode != ParallelMode.DISTRIBUTED?
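Putting both suggestions together, a hedged sketch of what the corrected check would look like; the standalone function name is illustrative, and the Trainer's distributed_state and logger are passed in rather than reproduced from the actual source:

import torch
from accelerate.utils import DistributedType

def warn_if_process_group_without_distributed_mode(distributed_state, logger):
    # Warn when a torch distributed process group exists but the Trainer's
    # state says "not distributed" (the two fixes suggested above).
    if (
        torch.distributed.is_available()
        and torch.distributed.is_initialized()
        and distributed_state.distributed_type == DistributedType.NO
    ):
        logger.warning(
            "torch.distributed process group is initialized, but parallel_mode != ParallelMode.DISTRIBUTED."
        )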

novice03 pushed a commit to novice03/transformers that referenced this pull request Jun 23, 2023
…ingface#22752)

* Use accelerate for device management

* Add accelerate to setup


Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>