DDP Multi-GPU --resume bug fix #1810

Merged: glenn-jocher merged 1 commit into master from multi_gpu_resume on Dec 30, 2020
Conversation

@glenn-jocher (Member) commented on Dec 30, 2020

Multi-GPU --resume bug fix for issue #851. This retains the values of opt.global_rank and opt.local_rank when DDP training is resumed.
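
The gist of the fix, as a minimal sketch: when --resume reloads the saved launch options from the run's opt.yaml, the live DDP rank values assigned to the current process by torch.distributed.launch are captured first and reinstated afterwards. The helper name resume_opt, the default checkpoint path, and the opt.yaml location below are illustrative assumptions, not the exact train.py code.

# Sketch only (not the exact train.py change): keep the current process's DDP ranks
# when reloading the saved options from the run's opt.yaml on --resume.
import argparse
from pathlib import Path

import yaml


def resume_opt(opt):
    # Resolve the checkpoint to resume from; the default path here is illustrative.
    ckpt = opt.resume if isinstance(opt.resume, str) else 'runs/train/exp/weights/last.pt'
    assert Path(ckpt).is_file(), f'ERROR: --resume checkpoint does not exist: {ckpt}'

    # Ranks are assigned per process by torch.distributed.launch at launch time,
    # so they must not be overwritten by the stale values stored in opt.yaml.
    apriori = opt.global_rank, opt.local_rank

    with open(Path(ckpt).parent.parent / 'opt.yaml') as f:
        opt = argparse.Namespace(**yaml.safe_load(f))  # reload the saved run's options

    opt.weights, opt.resume = ckpt, True
    opt.global_rank, opt.local_rank = apriori  # reinstate the live DDP ranks
    return opt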

🛠️ PR Summary

Made with ❤️ by Ultralytics Actions

🌟 Summary

Improved resume functionality in YOLOv5 training script.

📊 Key Changes

  • Introduced code to ensure distributed training settings are correctly retained when resuming training from a checkpoint.

🎯 Purpose & Impact

  • The change fixes an issue where resuming training disregarded the distributed training rank settings.
  • Users relying on distributed training will experience a more consistent and reliable resumption of their training process, particularly in environments using multiple GPUs or nodes. 🔄
  • Makes resuming training smoother and less error-prone, which saves time and computational resources. 💻🕒

@glenn-jocher changed the title from "Multi-GPU resume bug fix" to "DDP Multi-GPU --resume bug fix" on Dec 30, 2020
@glenn-jocher linked an issue on Dec 30, 2020 that may be closed by this pull request
@glenn-jocher (Member, Author) commented:

Running a test on a 2x T4 GCP VM. Test scenario:

# 1. Start DDP training
python -m torch.distributed.launch --nproc_per_node 2 train.py --batch-size 80 --data coco128.yaml --epochs 50 --weights yolov5s.pt

# 2. sudo reboot after about 25 epochs

# 3. Resume DDP training
python -m torch.distributed.launch --nproc_per_node 2 train.py --resume

Test results were as expected: training resumed properly from epoch 25 after the forced reboot, and an nvidia-smi check confirmed that both GPUs were in use. Everything appears to work. Note: not tested with W&B enabled.

[Image: results plot]

@glenn-jocher (Member, Author) commented:

CI passing and resume test scenario passing. Merging.

To --resume with DDP (2 GPUs)

ONLY add --resume or --resume path/to/last.pt; --resume accepts no other arguments (a short sketch of how such a flag behaves follows the command below).

python -m torch.distributed.launch --nproc_per_node 2 train.py --resume
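
For reference, an optional-value flag like this can be declared with argparse as sketched below. This is illustrative only and may differ from the exact definition in train.py, but a flag declared this way is False by default, True when passed bare, and a path string when one is supplied.

# Illustrative sketch: a --resume flag that accepts either no value or an optional checkpoint path.
import argparse

parser = argparse.ArgumentParser()
parser.add_argument('--resume', nargs='?', const=True, default=False,
                    help='resume most recent training, or resume from a given last.pt path')

print(parser.parse_args([]).resume)                       # False  -> start a fresh run
print(parser.parse_args(['--resume']).resume)             # True   -> resume the most recent run
print(parser.parse_args(['--resume', 'last.pt']).resume)  # 'last.pt' -> resume from that checkpoint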

@glenn-jocher merged commit 7180b22 into master on Dec 30, 2020
@glenn-jocher deleted the multi_gpu_resume branch on December 30, 2020 at 20:40
KMint1819 pushed a commit to KMint1819/yolov5 that referenced this pull request May 12, 2021
taicaile pushed a commit to taicaile/yolov5 that referenced this pull request Oct 12, 2021
BjarneKuehl pushed a commit to fhkiel-mlaip/yolov5 that referenced this pull request Aug 26, 2022
Successfully merging this pull request may close the following issue: Unable to resume training when using DDP