DDP Multi-GPU --resume bug fix #1810

Merged: glenn-jocher merged 1 commit into master from multi_gpu_resume on Dec 30, 2020
Conversation

@glenn-jocher (Member) commented on Dec 30, 2020

Multi-GPU --resume bug fix for issue #851. This retains the values of opt.global_rank and opt.local_rank when DDP training is resumed.
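
The gist of the fix, as a minimal sketch: when --resume reloads the saved launch options from the run's opt.yaml, the live DDP rank values assigned to the current process by torch.distributed.launch are captured first and reinstated afterwards. The helper name resume_opt, the default checkpoint path, and the opt.yaml location below are illustrative assumptions, not the exact train.py code.

# Sketch only (not the exact train.py change): keep the current process's DDP ranks
# when reloading the saved options from the run's opt.yaml on --resume.
import argparse
from pathlib import Path

import yaml


def resume_opt(opt):
    # Resolve the checkpoint to resume from; the default path here is illustrative.
    ckpt = opt.resume if isinstance(opt.resume, str) else 'runs/train/exp/weights/last.pt'
    assert Path(ckpt).is_file(), f'ERROR: --resume checkpoint does not exist: {ckpt}'

    # Ranks are assigned per process by torch.distributed.launch at launch time,
    # so they must not be overwritten by the stale values stored in opt.yaml.
    apriori = opt.global_rank, opt.local_rank

    with open(Path(ckpt).parent.parent / 'opt.yaml') as f:
        opt = argparse.Namespace(**yaml.safe_load(f))  # reload the saved run's options

    opt.weights, opt.resume = ckpt, True
    opt.global_rank, opt.local_rank = apriori  # reinstate the live DDP ranks
    return opt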

🛠️ PR Summary

Made with ❤️ by Ultralytics Actions

🌟 Summary

Improved resume functionality in YOLOv5 training script.

📊 Key Changes

  • Introduced code to ensure distributed training settings are correctly retained when resuming training from a checkpoint.

🎯 Purpose & Impact

  • The change fixes an issue where resuming training disregarded the distributed training rank settings.
  • Users relying on distributed training will experience a more consistent and reliable resumption of their training process, particularly in environments using multiple GPUs or nodes. 🔄
  • Makes resuming training smoother and less error-prone, which saves time and computational resources. 💻🕒

@glenn-jocher changed the title from "Multi-GPU resume bug fix" to "DDP Multi-GPU --resume bug fix" on Dec 30, 2020
@glenn-jocher linked an issue on Dec 30, 2020 that may be closed by this pull request
@glenn-jocher (Member, Author) commented:

Running a test on a 2x T4 GCP VM. Test scenario:

# 1. Start DDP training
python -m torch.distributed.launch --nproc_per_node 2 train.py --batch-size 80 --data coco128.yaml --epochs 50 --weights yolov5s.pt

# 2. sudo reboot after about 25 epochs

# 3. Resume DDP training
python -m torch.distributed.launch --nproc_per_node 2 train.py --resume

Test results were as expected: training resumed properly from epoch 25 after the forced reboot, and an nvidia-smi check confirmed that both GPUs were in use. Everything appears to work. Note: not tested with W&B enabled.

[Image: results plot]

@glenn-jocher (Member, Author) commented:

CI passing and resume test scenario passing. Merging.

To --resume with DDP (2 GPUs)

ONLY add --resume or --resume path/to/last.pt; --resume accepts no other arguments (a short sketch of how such a flag behaves follows the command below).

python -m torch.distributed.launch --nproc_per_node 2 train.py --resume
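
For reference, an optional-value flag like this can be declared with argparse as sketched below. This is illustrative only and may differ from the exact definition in train.py, but a flag declared this way is False by default, True when passed bare, and a path string when one is supplied.

# Illustrative sketch: a --resume flag that accepts either no value or an optional checkpoint path.
import argparse

parser = argparse.ArgumentParser()
parser.add_argument('--resume', nargs='?', const=True, default=False,
                    help='resume most recent training, or resume from a given last.pt path')

print(parser.parse_args([]).resume)                       # False  -> start a fresh run
print(parser.parse_args(['--resume']).resume)             # True   -> resume the most recent run
print(parser.parse_args(['--resume', 'last.pt']).resume)  # 'last.pt' -> resume from that checkpoint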

@glenn-jocher merged commit 7180b22 into master on Dec 30, 2020
@glenn-jocher deleted the multi_gpu_resume branch on December 30, 2020 at 20:40
KMint1819 pushed a commit to KMint1819/yolov5 that referenced this pull request May 12, 2021
taicaile pushed a commit to taicaile/yolov5 that referenced this pull request Oct 12, 2021
BjarneKuehl pushed a commit to fhkiel-mlaip/yolov5 that referenced this pull request Aug 26, 2022
Successfully merging this pull request may close the following issue: Unable to resume training when using DDP