
num_input_tokens_seen includes the pad tokens if a sample padding strategy is used #29889

Closed
thincal opened this issue Mar 27, 2024 · 5 comments

thincal commented Mar 27, 2024

System Info

latest transformers

Who can help?

@muellerzr @pacman100

Information

  • The official example scripts
  • My own modified scripts

Tasks

  • An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
  • My own task or dataset (give details below)

Reproduction

If training samples are batched with a padding strategy, num_input_tokens_seen is calculated including the pad tokens, which is not expected.
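A minimal sketch of the problem (pad_token_id = 0 and the example ids are assumptions for illustration only):

```python
import torch

pad_token_id = 0  # assumed pad id for this illustration
input_ids = torch.tensor([
    [101, 7592, 102, 0, 0],       # 3 real tokens + 2 pad tokens
    [101, 7592, 2088, 999, 102],  # 5 real tokens, no padding
])

# What the Trainer currently counts: every element, pads included.
naive_count = input_ids.numel()                        # 10
# What this issue expects: only the non-pad tokens.
real_count = (input_ids != pad_token_id).sum().item()  # 8

print(naive_count, real_count)  # 10 8
```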

Expected behavior

  1. Exclude the pad tokens when calculating num_input_tokens_seen; a suggested fix is below:
inputs_device = "cuda" if self.args.distributed_state.backend == "nccl" else "cpu"

self.state.num_input_tokens_seen += torch.sum(
    self.accelerator.gather(
        torch.tensor(
            torch.sum(inputs[main_input_name] != self.tokenizer.pad_token_id).item(),
            device=inputs_device,
            dtype=torch.int64,
        )
    )
).item()
  2. If the nccl backend is used but input_ids stays on the CPU, the gather also fails; this is fixed by choosing the device according to the backend:
inputs_device = "cuda" if self.args.distributed_state.backend == "nccl" else "cpu"
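Taken together, the two points can be sketched as a standalone helper (the function name and the gloo example are illustrative, not part of the actual Trainer patch):

```python
import torch

def count_non_pad_tokens(input_ids: torch.Tensor, pad_token_id: int, backend: str) -> torch.Tensor:
    """Count tokens excluding padding, placed on the device the collective
    backend expects (cuda for nccl, cpu otherwise) so a later gather
    does not fail."""
    device = "cuda" if backend == "nccl" else "cpu"
    n = torch.sum(input_ids != pad_token_id).item()
    return torch.tensor(n, device=device, dtype=torch.int64)

# With a CPU backend such as gloo, the count stays on the CPU.
tokens = count_non_pad_tokens(torch.tensor([[1, 2, 0], [3, 0, 0]]), pad_token_id=0, backend="gloo")
print(tokens.item(), tokens.device.type)  # 3 cpu
```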

thincal commented Mar 31, 2024

@pacman100 could you help review this reported issue? Thanks!

@amyeroberts (Collaborator)

Gentle ping @muellerzr @pacman100

@amyeroberts (Collaborator)

cc @muellerzr @SunMarc


SunMarc commented Jun 17, 2024

Hi @thincal, thanks for raising the issue and sorry for the delay! Removing the padded tokens and changing the device depending on the backend seems like a good idea! Would you like to open a PR?


This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.

Please note that issues that do not follow the contributing guidelines are likely to be ignored.
