
Inappropriate reduce operation for "num_input_tokens_seen" is prone to getting training stuck. #28791

Closed
YouliangHUANG opened this issue Jan 31, 2024 · 11 comments · Fixed by #29099

Comments

@YouliangHUANG
Contributor

YouliangHUANG commented Jan 31, 2024

System Info

Trivial

Who can help?

@pacman100

Information

  • The official example scripts
  • My own modified scripts

Tasks

  • An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
  • My own task or dataset (give details below)

Reproduction

See src/transformers/trainer.py line 1870
self.state.num_input_tokens_seen += self.accelerator.gather(inputs[main_input_name]).numel()

The length of "inputs[main_input_name]" is not guaranteed to be the same across workers when using DDP, which can make the training process hang. Besides, in a distributed setup it is costly to gather the WHOLE input tensors from the different workers. It is better to call .numel() first and then .gather().

Ref: Stuck when using model.generate() and accelerator.gather() in the distributed setting

Expected behavior

Fix:

input_device = inputs[main_input_name].device
self.state.num_input_tokens_seen += torch.sum(
    self.accelerator.gather(
        torch.tensor(inputs[main_input_name].numel(), device=input_device, dtype=torch.int64)
    )
).item()
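
For reference, a minimal sketch of the same pattern written with plain torch.distributed (the helper name count_tokens_seen is illustrative and not part of transformers):

import torch
import torch.distributed as dist

def count_tokens_seen(input_ids: torch.Tensor) -> int:
    # Gather one int64 per rank and sum the results. A size-1 tensor is
    # always shape-aligned across ranks, so the collective cannot hang even
    # when every rank received a different batch shape, and it moves a few
    # bytes per rank instead of the full input tensors.
    # NOTE: with the NCCL backend the tensor must live on a CUDA device
    # (see the CPU-device error discussed below).
    local_count = torch.tensor([input_ids.numel()], device=input_ids.device, dtype=torch.int64)
    if dist.is_available() and dist.is_initialized():
        counts = [torch.zeros_like(local_count) for _ in range(dist.get_world_size())]
        dist.all_gather(counts, local_count)
        return int(torch.cat(counts).sum().item())
    return int(local_count.item())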

@thincal

thincal commented Feb 1, 2024

  File "/opt/conda/lib/python3.10/site-packages/transformers/trainer.py", line 1851, in _inner_training_loop
    self.state.num_input_tokens_seen += torch.sum(self.accelerator.gather(
  File "/opt/conda/lib/python3.10/site-packages/accelerate/accelerator.py", line 2159, in gather
    return gather(tensor)
  File "/opt/conda/lib/python3.10/site-packages/accelerate/utils/operations.py", line 344, in wrapper
    return function(*args, **kwargs)
  File "/opt/conda/lib/python3.10/site-packages/accelerate/utils/operations.py", line 405, in gather
    return _gpu_gather(tensor)
  File "/opt/conda/lib/python3.10/site-packages/accelerate/utils/operations.py", line 324, in _gpu_gather
    return recursively_apply(_gpu_gather_one, tensor, error_on_other_type=True)
  File "/opt/conda/lib/python3.10/site-packages/accelerate/utils/operations.py", line 129, in recursively_apply
    return func(data, *args, **kwargs)
  File "/opt/conda/lib/python3.10/site-packages/accelerate/utils/operations.py", line 321, in _gpu_gather_one
    torch.distributed.all_gather(output_tensors, tensor)
  File "/opt/conda/lib/python3.10/site-packages/torch/distributed/c10d_logger.py", line 47, in wrapper
    return func(*args, **kwargs)
  File "/opt/conda/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py", line 2806, in all_gather
    work = default_pg.allgather([tensor_list], [tensor])
RuntimeError: No backend type associated with device type cpu

Patched transformers with the above hotfix, but it seems the error still happens. Could you help take a look? Thanks.

@YouliangHUANG
Contributor Author

  File "/opt/conda/lib/python3.10/site-packages/transformers/trainer.py", line 1851, in _inner_training_loop
    self.state.num_input_tokens_seen += torch.sum(self.accelerator.gather(
  File "/opt/conda/lib/python3.10/site-packages/accelerate/accelerator.py", line 2159, in gather
    return gather(tensor)
  File "/opt/conda/lib/python3.10/site-packages/accelerate/utils/operations.py", line 344, in wrapper
    return function(*args, **kwargs)
  File "/opt/conda/lib/python3.10/site-packages/accelerate/utils/operations.py", line 405, in gather
    return _gpu_gather(tensor)
  File "/opt/conda/lib/python3.10/site-packages/accelerate/utils/operations.py", line 324, in _gpu_gather
    return recursively_apply(_gpu_gather_one, tensor, error_on_other_type=True)
  File "/opt/conda/lib/python3.10/site-packages/accelerate/utils/operations.py", line 129, in recursively_apply
    return func(data, *args, **kwargs)
  File "/opt/conda/lib/python3.10/site-packages/accelerate/utils/operations.py", line 321, in _gpu_gather_one
    torch.distributed.all_gather(output_tensors, tensor)
  File "/opt/conda/lib/python3.10/site-packages/torch/distributed/c10d_logger.py", line 47, in wrapper
    return func(*args, **kwargs)
  File "/opt/conda/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py", line 2806, in all_gather
    work = default_pg.allgather([tensor_list], [tensor])
RuntimeError: No backend type associated with device type cpu

Patched transformer with above hotfix, it seems the error still happened. Could you help have a look ? thanks.

I also encountered the same problem at first, which is why I added a statement to set the device via input_device = inputs[main_input_name].device.
Since the original code works, placing the new tensor on the same device should also work. Can you double-check the device assigned to the tensor?

@pacman100
Contributor

Thank you @YouliangHUANG for the issue as well as the suggested fix. It makes sense; it would be great if you could open a PR with it.

YouliangHUANG added a commit to YouliangHUANG/transformers-fix-num_input_tokens_seen that referenced this issue Feb 19, 2024
@thincal

thincal commented Feb 19, 2024

I also encountered the same problem at first, which is why I added a statement to set the device via input_device = inputs[main_input_name].device. Since the original code works, placing the new tensor on the same device should also work. Can you double-check the device assigned to the tensor?

@YouliangHUANG

inputs[main_input_name].numel(): 1336
inputs[main_input_name].device: cpu

After applying the fix, the same error happened (transformers==4.37.2):

  File "/opt/conda/lib/python3.10/site-packages/transformers/trainer.py", line 1855, in _inner_training_loop
    self.accelerator.gather(
  File "/opt/conda/lib/python3.10/site-packages/accelerate/accelerator.py", line 2161, in gather
    return gather(tensor)
  File "/opt/conda/lib/python3.10/site-packages/accelerate/utils/operations.py", line 376, in wrapper
    return function(*args, **kwargs)
  File "/opt/conda/lib/python3.10/site-packages/accelerate/utils/operations.py", line 437, in gather
    return _gpu_gather(tensor)
  File "/opt/conda/lib/python3.10/site-packages/accelerate/utils/operations.py", line 356, in _gpu_gather
    return recursively_apply(_gpu_gather_one, tensor, error_on_other_type=True)
  File "/opt/conda/lib/python3.10/site-packages/accelerate/utils/operations.py", line 135, in recursively_apply
    return func(data, *args, **kwargs)
  File "/opt/conda/lib/python3.10/site-packages/accelerate/utils/operations.py", line 353, in _gpu_gather_one
    torch.distributed.all_gather(output_tensors, tensor)
  File "/opt/conda/lib/python3.10/site-packages/torch/distributed/c10d_logger.py", line 47, in wrapper
    return func(*args, **kwargs)
  File "/opt/conda/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py", line 2806, in all_gather
    work = default_pg.allgather([tensor_list], [tensor])
RuntimeError: No backend type associated with device type cpu

@thincal

thincal commented Feb 19, 2024

It works now after forcing the device to 'cuda', so it seems the original error is caused by the all_gather op not being supported on the CPU device?

@YouliangHUANG
Contributor Author

YouliangHUANG commented Feb 19, 2024

It works now after forcing the device to 'cuda', so it seems the original error is caused by the all_gather op not being supported on the CPU device?

@thincal Please check your backend type, and refer to https://pytorch.org/docs/stable/distributed.html for more details.

@thincal

thincal commented Feb 19, 2024

@thincal Please check your backend type, and refer to https://pytorch.org/docs/stable/distributed.html for more details.

Yes, it's the NCCL backend that is used, which doesn't support CPU tensors.
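
For anyone hitting the same NCCL/CPU mismatch, a hedged sketch of placing the size-1 count on the accelerator's device instead of the input's device (self.accelerator.device is the standard accelerate attribute; whether the merged fix does exactly this is not shown in this thread):

# The communication backend dictates where the tensor must live:
# NCCL only gathers CUDA tensors, while gloo also handles CPU tensors.
count = torch.tensor(
    inputs[main_input_name].numel(),
    device=self.accelerator.device,  # cuda:<rank> under NCCL, cpu under gloo
    dtype=torch.int64,
)
self.state.num_input_tokens_seen += torch.sum(self.accelerator.gather(count)).item()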

@thincal

thincal commented Feb 19, 2024

The length of "inputs[main_input_name]" is not guaranteed to be the same across workers when using DDP, which can make the training process hang.

So which change solves this problem?

@YouliangHUANG
Contributor Author

The length of "inputs[main_input_name]" is not guaranteed to be the same across workers when using DDP, which can make the training process hang.

So which change solves this problem?

torch.tensor(inputs[main_input_name].numel(), device=input_device, dtype=torch.int64)
@thincal This code creates a tensor of size 1 that records how many input tokens the local worker has seen. The tensor length is therefore the same on every worker, so it can be gathered through self.accelerator.gather and summed into the total count.
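
As an aside, the same total can also be computed with a single sum-reduction of the scalar count; a sketch using plain torch.distributed (an alternative to gather-then-sum, not necessarily what the linked PR does):

import torch
import torch.distributed as dist

def total_tokens(input_ids: torch.Tensor) -> int:
    # One all_reduce of a single int64 replaces all_gather followed by sum.
    count = torch.tensor(input_ids.numel(), device=input_ids.device, dtype=torch.int64)
    if dist.is_available() and dist.is_initialized():
        dist.all_reduce(count, op=dist.ReduceOp.SUM)
    return int(count.item())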

@thincal

thincal commented Feb 19, 2024

torch.tensor(inputs[main_input_name].numel(), device=input_device, dtype=torch.int64) @thincal This code creates a tensor of size 1 that records how many input tokens the local worker has seen. The tensor length is therefore the same on every worker, so it can be gathered through self.accelerator.gather and summed into the total count.

OK, that's great. But it seems the device should be decided according to the DDP backend?
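
A hedged sketch of deriving the gather device from the initialized backend (the nccl/gloo mapping below is an assumption that only covers the common cases):

import torch
import torch.distributed as dist

def gather_device() -> torch.device:
    # NCCL communicates CUDA tensors only; gloo (or no distributed setup at
    # all) works with CPU tensors. Other backends need their own mapping.
    if dist.is_available() and dist.is_initialized() and dist.get_backend() == "nccl":
        return torch.device("cuda", torch.cuda.current_device())
    return torch.device("cpu")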


This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.

Please note that issues that do not follow the contributing guidelines are likely to be ignored.

ArthurZucker pushed a commit that referenced this issue Mar 25, 2024
fix the behavior of collecting 'num_input_tokens_seen'

See #28791 for more details.
hovnatan pushed a commit to hovnatan/transformers that referenced this issue Mar 27, 2024
…9099)

fix the behavior of collecting 'num_input_tokens_seen'

See huggingface#28791 for more details.
itazap pushed a commit that referenced this issue May 14, 2024
fix the behavior of collecting 'num_input_tokens_seen'

See #28791 for more details.