
num_input_tokens_seen includes the pad tokens if a sample padding strategy is used #29889

Closed
thincal opened this issue Mar 27, 2024 · 5 comments

thincal commented Mar 27, 2024

System Info

latest transformers

Who can help?

@muellerzr @pacman100

Information

  • The official example scripts
  • My own modified scripts

Tasks

  • An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
  • My own task or dataset (give details below)

Reproduction

If training samples are batched with a padding strategy, num_input_tokens_seen is calculated including the pad tokens, which is not expected.
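A minimal sketch of the problem (pad_token_id = 0 and the example ids are assumptions for illustration only):

```python
import torch

pad_token_id = 0  # assumed pad id for this illustration
input_ids = torch.tensor([
    [101, 7592, 102, 0, 0],       # 3 real tokens + 2 pad tokens
    [101, 7592, 2088, 999, 102],  # 5 real tokens, no padding
])

# What the Trainer currently counts: every element, pads included.
naive_count = input_ids.numel()                        # 10
# What this issue expects: only the non-pad tokens.
real_count = (input_ids != pad_token_id).sum().item()  # 8

print(naive_count, real_count)  # 10 8
```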

Expected behavior

  1. Exclude the pad tokens when calculating num_input_tokens_seen; a suggested fix is below:
inputs_device = "cuda" if self.args.distributed_state.backend == "nccl" else "cpu"

self.state.num_input_tokens_seen += torch.sum(
    self.accelerator.gather(
        torch.tensor(
            torch.sum(inputs[main_input_name] != self.tokenizer.pad_token_id).item(),
            device=inputs_device,
            dtype=torch.int64,
        )
    )
).item()
  2. If the nccl backend is used but input_ids stays on the CPU, the gather also fails; this is fixed by choosing the device according to the backend:
inputs_device = "cuda" if self.args.distributed_state.backend == "nccl" else "cpu"
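Taken together, the two points can be sketched as a standalone helper (the function name and the gloo example are illustrative, not part of the actual Trainer patch):

```python
import torch

def count_non_pad_tokens(input_ids: torch.Tensor, pad_token_id: int, backend: str) -> torch.Tensor:
    """Count tokens excluding padding, placed on the device the collective
    backend expects (cuda for nccl, cpu otherwise) so a later gather
    does not fail."""
    device = "cuda" if backend == "nccl" else "cpu"
    n = torch.sum(input_ids != pad_token_id).item()
    return torch.tensor(n, device=device, dtype=torch.int64)

# With a CPU backend such as gloo, the count stays on the CPU.
tokens = count_non_pad_tokens(torch.tensor([[1, 2, 0], [3, 0, 0]]), pad_token_id=0, backend="gloo")
print(tokens.item(), tokens.device.type)  # 3 cpu
```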

thincal commented Mar 31, 2024

@pacman100 could you help review this reported issue? Thanks!

@amyeroberts (Collaborator)

Gentle ping @muellerzr @pacman100

@amyeroberts (Collaborator)

cc @muellerzr @SunMarc


SunMarc commented Jun 17, 2024

Hi @thincal, thanks for raising the issue and sorry for the delay! Removing the padded tokens and changing the device depending on the backend seems like a good idea! Would you like to open a PR?


This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.

Please note that issues that do not follow the contributing guidelines are likely to be ignored.
