
Fix batch-size on resume for multi-gpu #1942

Merged
Merged 1 commit into ultralytics:master from fix-resume-bs-multi on Jan 14, 2021

Conversation

@NanoCode012 (Contributor) commented on Jan 14, 2021

Fixes #1936

Tested on my own setup and together with the author of the referenced issue.

Commands:

python -m torch.distributed.launch --master_port 9963 --nproc_per_node 2 train.py --data coco128.yaml --cfg yolov5s.yaml --weights yolov5s.pt --epochs 3 --batch-size 64 --device 3,4
python -m torch.distributed.launch --master_port 9963 --nproc_per_node 2 train.py --resume
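
For context, here is a minimal sketch (not an excerpt from train.py) of how the per-GPU batch size relates to the --batch-size value in a torch.distributed.launch run. WORLD_SIZE is the environment variable that launcher sets for each process, and the variable names mirror the opt fields discussed below:

    # Sketch only: simplified DDP batch-size arithmetic under the assumptions above.
    import os

    world_size = int(os.environ.get("WORLD_SIZE", 1))   # 2 when using --nproc_per_node 2
    total_batch_size = 64                                # value passed via --batch-size
    batch_size = total_batch_size // world_size          # 32 images per GPU process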

🛠️ PR Summary

Made with ❤️ by Ultralytics Actions

🌟 Summary

Improves resume-training behaviour for multi-GPU YOLOv5 runs.

📊 Key Changes

  • Changed the train.py script to update opt.batch_size properly when resuming training.

🎯 Purpose & Impact

  • The purpose of this change is to ensure that the batch size used when resuming training from a checkpoint is set correctly to the total batch size (opt.total_batch_size).
  • This will prevent potential inconsistencies with batch sizes when training is paused and resumed, leading to more reliable and stable training runs for users. 🛠️

This update is particularly beneficial for users who conduct long training sessions that may need to be paused and resumed due to various reasons, such as hardware limitations or scheduling constraints. 🔄
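
As a rough sketch of the change described above (simplified, assuming the fix lives in train.py's resume branch; the exact diff may differ):

    # Sketch of the resume branch, not the literal patch.
    if opt.resume:
        # opt was reloaded from the checkpoint's saved options, where
        # batch_size may hold the stale per-GPU value from the previous run.
        opt.batch_size = opt.total_batch_size  # restore the global (total) batch size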

@glenn-jocher (Member) commented on Jan 14, 2021

@NanoCode012 I was looking back at the v4.0 release commit 69be8e7 and it seems I might have caused this by moving one line here in train.py:
[Screenshot: the relocated line in train.py]

@glenn-jocher (Member) commented on Jan 14, 2021

I had a strange situation where I was running some v3.1 multi-GPU trainings, and when I resumed one I hit a CUDA OOM error (the original run was already very close to OOM, and resuming perhaps used slightly more memory), so I modified a few of these lines to allow a --batch-size override on resume. I think I forgot to reset this one back to the default.

@glenn-jocher (Member)

@NanoCode012 do you think updating the PR to simply revert the earlier change would fix this, or do you think the current PR is best?

@NanoCode012 (Contributor, Author) commented on Jan 14, 2021

I don't think reverting fixes the problem. The problem happens on resume: the batch size should already be the previous run's total batch size when it reaches this line

yolov5/train.py, line 491 (commit b75c432):
    opt.total_batch_size = opt.batch_size

before it gets divided per GPU here

yolov5/train.py, line 499 (commit b75c432):
    opt.batch_size = opt.total_batch_size // opt.world_size

Edit: If you move that line up, I think you still need this PR.
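
To make the ordering concrete, here is a hypothetical walk-through with the numbers from the commands above, assuming the checkpoint's saved options store the per-GPU batch size (which is how the stale value would arise):

    # Hypothetical walk-through, not actual train.py code.
    world_size = 2                                   # --nproc_per_node 2

    # First run: --batch-size 64 on 2 GPUs.
    total_batch_size = 64
    batch_size = total_batch_size // world_size      # 32, saved with the checkpoint

    # Resume without this PR: the stale per-GPU value is treated as the total.
    total_batch_size = batch_size                    # line 491 -> 32
    batch_size = total_batch_size // world_size      # line 499 -> 16 (wrong)

    # Resume with this PR: batch_size is first reset to the saved total (64),
    # so the same two lines reproduce 64 and 32.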

@glenn-jocher (Member)

@NanoCode012 ok understood! Merging PR.

glenn-jocher merged commit 3a56cac into ultralytics:master on Jan 14, 2021
NanoCode012 deleted the fix-resume-bs-multi branch on February 10, 2021, 08:59
KMint1819 pushed a commit to KMint1819/yolov5 that referenced this pull request May 12, 2021
taicaile pushed a commit to taicaile/yolov5 that referenced this pull request Oct 12, 2021
BjarneKuehl pushed a commit to fhkiel-mlaip/yolov5 that referenced this pull request Aug 26, 2022
Closes: wrong batch size after --resume on multiple GPUs (#1936)