
Pass LOCAL_RANK to torch_distributed_zero_first() #5114

Conversation

@qiningonline (Contributor) commented Oct 10, 2021

Resolves #5111

The change proposed in this pull request:

  • passing local_rank to torch_distributed_zero_first(), so that the distributed barrier receives a valid node-local CUDA device index and the CUDA device index error is avoided

Details of the issue are listed in the issue description and follow-up comment.
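For context, the synchronization primitive involved can be sketched as follows. This is a minimal, self-contained illustration of a `torch_distributed_zero_first`-style context manager (modeled on the helper in YOLOv5's `utils/torch_utils.py`); the `barrier` stub is an assumption standing in for `torch.distributed.barrier()` so the sketch runs without an initialized process group:

```python
from contextlib import contextmanager

def barrier(device_ids=None):
    """Stand-in for torch.distributed.barrier(); a real run would block
    until every process in the group reaches this point."""
    pass

@contextmanager
def torch_distributed_zero_first(local_rank: int):
    # Every process except local rank 0 (or -1, i.e. non-distributed)
    # waits here, so rank 0 performs the guarded work (e.g. dataset
    # download/caching) exactly once before the others proceed.
    if local_rank not in (-1, 0):
        barrier(device_ids=[local_rank])
    yield
    # Rank 0 releases the waiting processes after finishing its work.
    if local_rank == 0:
        barrier(device_ids=[0])

# Passing LOCAL_RANK matters because barrier(device_ids=[...]) needs a
# CUDA device index that exists on the current node; the global RANK can
# exceed the per-node GPU count in multi-node jobs.
with torch_distributed_zero_first(-1):
    print("guarded setup runs here")
```

The key point is that the integer handed to the context manager ends up in `device_ids`, which must index a GPU present on the local machine.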

πŸ› οΈ PR Summary

Made with ❀️ by Ultralytics Actions

🌟 Summary

Enhancement of multi-GPU training support in YOLOv5.

📊 Key Changes

  • Replaced RANK with LOCAL_RANK to better support distributed training on multiple GPUs.

🎯 Purpose & Impact

  • 🤝 Improved Compatibility: These changes enhance compatibility with distributed training frameworks, allowing more efficient utilization of multiple GPUs.
  • ⚡ Increased Efficiency: Better support for local GPU rank handling potentially leads to improved parallel processing performance and reduced training times.
  • 👨‍💻 Developer Experience: For developers training YOLOv5 on multiple GPUs, this update simplifies setup and could help avoid common issues related to device ranking in distributed environments.
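To make the RANK vs. LOCAL_RANK distinction concrete, here is a hedged sketch of how the two values typically reach a training script. The environment variables are the standard ones exported by `torchrun` / `torch.distributed.launch`; the 2-node × 4-GPU job and the `gpus_per_node` arithmetic below are hypothetical illustration, not code from this PR:

```python
import os

# torchrun exports these for every worker process:
#   RANK       - global index of the process across ALL nodes
#   LOCAL_RANK - index of the process on ITS OWN node
# Only LOCAL_RANK is guaranteed to be a valid CUDA device index.
RANK = int(os.getenv("RANK", -1))            # -1 => not distributed
LOCAL_RANK = int(os.getenv("LOCAL_RANK", -1))

# Hypothetical 2-node x 4-GPU job: the process with global index 5
# runs on node 1 as that node's second GPU worker.
rank, gpus_per_node = 5, 4
local_rank = rank % gpus_per_node
print(local_rank)  # cuda:5 does not exist on a 4-GPU node; cuda:1 does
```

This is why passing the global RANK as a device index works on a single node (where the two values coincide) but fails once a job spans multiple nodes.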


@github-actions bot left a comment


👋 Hello @qiningonline, thank you for submitting a 🚀 PR! To allow your work to be integrated as seamlessly as possible, we advise you to:

  • ✅ Verify your PR is up-to-date with origin/master. If your PR is behind origin/master, an automatic GitHub Actions rebase may be attempted by including the /rebase command in a comment body, or by running the following code, replacing 'feature' with the name of your local branch:
git remote add upstream https://github.com/ultralytics/yolov5.git
git fetch upstream
git checkout feature  # <----- replace 'feature' with local branch name
git merge upstream/master
git push -u origin -f
  • ✅ Verify all Continuous Integration (CI) checks are passing.
  • ✅ Reduce changes to the absolute minimum required for your bug fix or feature addition. "It is not daily increase but daily decrease, hack away the unessential. The closer to the source, the less wastage there is." -Bruce Lee

@glenn-jocher glenn-jocher changed the title #5111, pass local rank to torch_distributed_zero_first. Pass LOCAL_RANK to torch_distributed_zero_first() Oct 10, 2021
@glenn-jocher glenn-jocher merged commit 4a6dfff into ultralytics:master Oct 10, 2021
@glenn-jocher (Member) commented:

@qiningonline PR is merged. Thank you for your contributions to YOLOv5 🚀 and Vision AI ⭐

BjarneKuehl pushed a commit to fhkiel-mlaip/yolov5 that referenced this pull request Aug 26, 2022
Co-authored-by: qiningonline <qiningonline@gmail.com>

Successfully merging this pull request may close these issues.

CUDA device index error in distributed training