
Incorrect training steps for distributed setting #137

Open
theyorubayesian opened this issue Jun 30, 2024 · 0 comments
theyorubayesian commented Jun 30, 2024

During distributed training with PyTorch, the number of training steps increases with the number of processes, when it should decrease as the dataset is sharded across them.

To reproduce:

- Transformers: 4.41.2
- Torch: 2.3.1
- Accelerate: 0.31.0

1. Distributing to 4 GPU devices trains for 500K steps.

```shell
torchrun --nproc_per_node=4 \
    -m tevatron.driver.train \
    --output_dir "$RUN_DIR" \
    --model_name_or_path "$MODEL_PATH" \
    --dataset_name "Tevatron/msmarco-passage" \
    --per_device_train_batch_size 32 \
    --num_train_epochs $NUM_EPOCHS \
    --dataloader_drop_last True \
```
2. Distributing to 2 GPU devices trains for 250K steps.

```shell
torchrun --nproc_per_node=2 \
    -m tevatron.driver.train \
    --output_dir "$RUN_DIR" \
    --model_name_or_path "$MODEL_PATH" \
    --dataset_name "Tevatron/msmarco-passage" \
    --per_device_train_batch_size 32 \
    --num_train_epochs $NUM_EPOCHS \
    --dataloader_drop_last True \
```

This happens because the dataloader is duplicated across the GPUs instead of being sharded. Hugging Face moved the sharding logic into the Accelerator: `Trainer` now relies on `self.accelerator.prepare` to shard the train dataloader for the current distributed configuration, so a dataloader that never passes through the accelerator is replicated on every process.

https://github.com/huggingface/transformers/blob/e65502951593a76844e872fee9c56b805598538a/src/transformers/trainer.py#L904
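For reference, here is a minimal standalone sketch (toy dataset, not Tevatron code; the script name is illustrative) of the behaviour `accelerator.prepare` is supposed to provide: once a dataloader is prepared, each process iterates only its own shard, so the per-process batch count shrinks by the world size.

```python
# Minimal sketch, run with e.g. `torchrun --nproc_per_node=4 shard_check.py`.
# Demonstrates that accelerator.prepare shards a dataloader across processes
# instead of duplicating it.
from accelerate import Accelerator
from torch.utils.data import DataLoader

accelerator = Accelerator()

dataset = list(range(1024))  # toy stand-in for the real training data
dataloader = DataLoader(dataset, batch_size=32, drop_last=True)

# Before prepare: every process sees all 32 batches (duplicated).
print(f"rank {accelerator.process_index}: unprepared = {len(dataloader)} batches")

# After prepare: batches are split across processes, e.g. 8 per process on 4 GPUs.
dataloader = accelerator.prepare(dataloader)
print(f"rank {accelerator.process_index}: prepared   = {len(dataloader)} batches")
```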

With the dataloader properly sharded, the correct number of training steps here should be 125K, since each process should only iterate its own shard of the data.
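A hedged sketch of one possible fix, assuming the Tevatron trainer builds its own train dataloader (the class and override below are illustrative, not the actual Tevatron code): route the dataloader through `self.accelerator.prepare`, mirroring what the linked `Trainer.get_train_dataloader` does internally.

```python
# Hypothetical sketch, not actual Tevatron code: a Trainer subclass whose
# custom dataloader is passed through the accelerator so it gets sharded.
from torch.utils.data import DataLoader
from transformers import Trainer


class ShardedTrainer(Trainer):  # illustrative name
    def get_train_dataloader(self) -> DataLoader:
        dataloader = DataLoader(
            self.train_dataset,
            batch_size=self.args.per_device_train_batch_size,
            collate_fn=self.data_collator,
            drop_last=self.args.dataloader_drop_last,
        )
        # Without this call each process iterates the full dataset, which is
        # exactly the duplicated-dataloader behaviour described above.
        return self.accelerator.prepare(dataloader)
```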
