
Fix batch-size on resume for multi-gpu #1942

Merged
Merged 1 commit into ultralytics:master from fix-resume-bs-multi on Jan 14, 2021

Conversation

@NanoCode012 (Contributor) commented on Jan 14, 2021

Fixes #1936

Tested on my own setup and together with the author of the referenced issue.

Commands:

python -m torch.distributed.launch --master_port 9963 --nproc_per_node 2 train.py --data coco128.yaml --cfg yolov5s.yaml --weights yolov5s.pt --epochs 3 --batch-size 64 --device 3,4
python -m torch.distributed.launch --master_port 9963 --nproc_per_node 2 train.py --resume
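
For context, here is a minimal sketch (not an excerpt from train.py) of how the per-GPU batch size relates to the --batch-size value in a torch.distributed.launch run. WORLD_SIZE is the environment variable that launcher sets for each process, and the variable names mirror the opt fields discussed below:

    # Sketch only: simplified DDP batch-size arithmetic under the assumptions above.
    import os

    world_size = int(os.environ.get("WORLD_SIZE", 1))   # 2 when using --nproc_per_node 2
    total_batch_size = 64                                # value passed via --batch-size
    batch_size = total_batch_size // world_size          # 32 images per GPU process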

🛠️ PR Summary

Made with ❤️ by Ultralytics Actions

🌟 Summary

Improves resume-training behaviour for multi-GPU YOLOv5 runs.

📊 Key Changes

  • Changed the train.py script to update opt.batch_size properly when resuming training.

🎯 Purpose & Impact

  • The purpose of this change is to ensure that the batch size used when resuming training from a checkpoint is set correctly to the total batch size (opt.total_batch_size).
  • This will prevent potential inconsistencies with batch sizes when training is paused and resumed, leading to more reliable and stable training runs for users. 🛠️

This update is particularly beneficial for users who conduct long training sessions that may need to be paused and resumed due to various reasons, such as hardware limitations or scheduling constraints. 🔄
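
As a rough sketch of the change described above (simplified, assuming the fix lives in train.py's resume branch; the exact diff may differ):

    # Sketch of the resume branch, not the literal patch.
    if opt.resume:
        # opt was reloaded from the checkpoint's saved options, where
        # batch_size may hold the stale per-GPU value from the previous run.
        opt.batch_size = opt.total_batch_size  # restore the global (total) batch size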

@glenn-jocher (Member) commented on Jan 14, 2021

@NanoCode012 I was looking back at the v4.0 release commit 69be8e7 and it seems I might have caused this by moving one line here in train.py:
[Screenshot: the relocated line in train.py]

@glenn-jocher (Member) commented on Jan 14, 2021

I had a strange situation where I was running some v3.1 multi-GPU trainings, and when I resumed one I hit a CUDA OOM error (the original run was already very close to OOM, and resuming perhaps used slightly more memory), so I modified a few of these lines to allow a --batch-size override on resume. I think I forgot to reset this one back to the default.

@glenn-jocher (Member)

@NanoCode012 do you think updating the PR to simply revert the earlier change would fix this, or do you think the current PR is best?

@NanoCode012 (Contributor, Author) commented on Jan 14, 2021

I don't think reverting fixes the problem. The problem happens on resume: the batch size should already be the previous run's total batch size when it reaches this line

yolov5/train.py, line 491 (commit b75c432):
    opt.total_batch_size = opt.batch_size

before it gets divided per GPU here

yolov5/train.py, line 499 (commit b75c432):
    opt.batch_size = opt.total_batch_size // opt.world_size

Edit: If you move that line up, I think you still need this PR.
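
To make the ordering concrete, here is a hypothetical walk-through with the numbers from the commands above, assuming the checkpoint's saved options store the per-GPU batch size (which is how the stale value would arise):

    # Hypothetical walk-through, not actual train.py code.
    world_size = 2                                   # --nproc_per_node 2

    # First run: --batch-size 64 on 2 GPUs.
    total_batch_size = 64
    batch_size = total_batch_size // world_size      # 32, saved with the checkpoint

    # Resume without this PR: the stale per-GPU value is treated as the total.
    total_batch_size = batch_size                    # line 491 -> 32
    batch_size = total_batch_size // world_size      # line 499 -> 16 (wrong)

    # Resume with this PR: batch_size is first reset to the saved total (64),
    # so the same two lines reproduce 64 and 32.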

@glenn-jocher (Member)

@NanoCode012 ok understood! Merging PR.

glenn-jocher merged commit 3a56cac into ultralytics:master on Jan 14, 2021
NanoCode012 deleted the fix-resume-bs-multi branch on February 10, 2021, 08:59
KMint1819 pushed a commit to KMint1819/yolov5 that referenced this pull request May 12, 2021
taicaile pushed a commit to taicaile/yolov5 that referenced this pull request Oct 12, 2021
BjarneKuehl pushed a commit to fhkiel-mlaip/yolov5 that referenced this pull request Aug 26, 2022
Closes: wrong batch size after --resume on multiple GPUs (#1936)