
Fix gather when collecting 'num_input_tokens_seen' #31974

Merged 4 commits into huggingface:main on Jul 16, 2024

Conversation

@CodeCreator (Contributor)

What does this PR do?

Following from #29099, the all-gather for num_input_tokens_seen still gets stuck in distributed training, since the tensors are still on CPU and have not yet been moved to the device in _prepare_inputs. This continues to produce torch.distributed errors such as:

File ".../site-packages/torch/distributed/distributed_c10d.py", line 2948, in all_gather_into_tensor
     work = group._allgather_base(output_tensor, input_tensor, opts)
ValueError: Tensors must be CUDA and dense

This PR simply moves the token count to self.args.device and moves it back to CPU after gathering.
This problem was also mentioned in the discussion of issue #28791, but that issue was closed when the padding was fixed.
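
For illustration, here is a minimal sketch of the pattern this PR describes. The helper name and the surrounding Trainer internals are assumptions for readability, not the exact merged diff:

import torch

# Sketch only, not the exact diff: create the token count directly on
# self.args.device so the distributed gather succeeds, then move the
# summed result back to CPU. `self`, `inputs`, and `main_input_name`
# stand in for Trainer internals.
def _gather_num_input_tokens(self, inputs, main_input_name="input_ids"):
    # inputs have not been through _prepare_inputs yet, so the batch
    # tensors still live on CPU at this point.
    token_count = torch.tensor(
        inputs[main_input_name].numel(), device=self.args.device, dtype=torch.int64
    )
    # all_gather requires device tensors; gather across ranks, sum,
    # and return a plain Python int on CPU.
    return self.accelerator.gather(token_count).sum().cpu().item()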

Who can review?

@pacman100
@muellerzr

@muellerzr (Contributor) left a comment


Note that pacman no longer works at HF, so you can ping me in the future :)

This makes sense since we need to eventually gather. cc @SunMarc

@muellerzr (Contributor)

Can you run pip install -e .[quality]; make style; make quality?

@SunMarc (Member) left a comment


Nice! Thanks for fixing!

@CodeCreator (Contributor, Author)

@muellerzr Thanks for the pointer! I ran the code formatter as suggested.

@HuggingFaceDocBuilderDev

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

@amyeroberts (Collaborator) left a comment


Thanks for fixing!

@amyeroberts amyeroberts merged commit e391706 into huggingface:main Jul 16, 2024
21 checks passed
amyeroberts pushed a commit to amyeroberts/transformers that referenced this pull request Jul 19, 2024
* Move token count to device before gathering

* Run 'make style; make quality'
MHRDYN7 pushed a commit to MHRDYN7/transformers that referenced this pull request Jul 23, 2024
zucchini-nlp pushed a commit to zucchini-nlp/transformers that referenced this pull request Jul 24, 2024
itazap pushed a commit that referenced this pull request Jul 25, 2024