
Remove DDP destroy_process_group() on train end #8935

Merged
2 commits merged into master from remove/destroy_process_group on Aug 13, 2022

Conversation

glenn-jocher (Member) commented on Aug 11, 2022

May resolve #7307

πŸ› οΈ PR Summary

Made with ❀️ by Ultralytics Actions

🌟 Summary

Optimized distributed training cleanup process in YOLOv5.

πŸ“Š Key Changes

  • Removed the redundant code that manually destroyed the DDP process group at the end of training (a sketch of this pattern is shown below).
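
For context, here is a minimal sketch of the kind of explicit teardown this PR removes. It is illustrative only, not the exact YOLOv5 source; the `LOCAL_RANK` environment-variable convention and the helper name `cleanup_after_training` are assumptions, while the `torch.distributed` calls are standard PyTorch API.

```python
# Minimal sketch (assumption: not the exact YOLOv5 code) of an explicit DDP teardown.
# With torchrun/DDP each worker joins a process group at startup; calling
# destroy_process_group() at train end can hang in some environments (see the
# linked issue), and the group is released anyway when the worker process exits.
import os

import torch.distributed as dist

LOCAL_RANK = int(os.getenv("LOCAL_RANK", -1))  # -1 means single-process (no DDP)


def cleanup_after_training():
    # Pattern removed by this PR: manual teardown of the process group after training.
    if LOCAL_RANK != -1 and dist.is_initialized():
        dist.destroy_process_group()
```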

🎯 Purpose & Impact

  • 🚀 Purpose: Simplifies distributed training cleanup by relying on normal process exit instead of an explicit teardown call.
  • ✨ Impact: Can lead to a smoother shutdown of training processes, possibly reducing exit errors (such as hangs in destroy_process_group()) for users running distributed training. This change primarily affects users training on multiple GPUs or across multiple nodes.

@glenn-jocher glenn-jocher self-assigned this Aug 11, 2022
@glenn-jocher glenn-jocher merged commit f1214f2 into master Aug 13, 2022
@glenn-jocher glenn-jocher deleted the remove/destroy_process_group branch August 13, 2022 01:57
yzbx added a commit to modelai/ymir-executor-fork that referenced this pull request Aug 16, 2022
ctjanuhowski pushed a commit to ctjanuhowski/yolov5 that referenced this pull request Sep 8, 2022
Development

Successfully merging this pull request may close these issues.

Docker Multi-GPU DDP training hang on destroy_process_group()