
DDP --sync-bn bug with torch 1.9.0 #3998

Closed
simba0703 opened this issue Jul 14, 2021 · 5 comments · Fixed by #4032 or #4615
Labels
bug Something isn't working

Comments

@simba0703

When I use 'python -m torch.distributed.launch --nproc_per_node 3 train.py --batch-size 12 --data data/coco128.yaml --weights yolov5m6.pt --device 1,2,3 --adam --sync-bn', the training process hangs at epoch 0. If I do not use '--sync-bn', training runs fine.
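For context, a minimal sketch of what a --sync-bn code path typically does before DDP wrapping; the model and variable names below are illustrative, not the exact YOLOv5 code:

    import torch
    import torch.nn as nn

    # Illustrative stand-in for the model built by train.py.
    model = nn.Sequential(nn.Conv2d(3, 16, 3), nn.BatchNorm2d(16), nn.ReLU())

    # --sync-bn converts every BatchNorm layer to SyncBatchNorm so batch
    # statistics are all-reduced across DDP processes instead of per GPU.
    model = nn.SyncBatchNorm.convert_sync_batchnorm(model)

    # In the real script the converted model is then wrapped in DDP, e.g.
    # model = nn.parallel.DistributedDataParallel(model.cuda(rank), device_ids=[rank])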


@simba0703 simba0703 added the bug Something isn't working label Jul 14, 2021
@wudashuo
Contributor

I ran into the same problem a week ago: training gets stuck if I use --sync-bn, and runs fine once I remove it.
I tried to find out why, but failed.

@imyhxy
Contributor

imyhxy commented Jul 16, 2021

Encountered the same problem here. 🌗

@glenn-jocher
Member

@simba0703 @wudashuo @imyhxy thanks for the notice, guys. Yes, --sync-bn is broken with torch 1.9.0; I can't figure out what the problem is though :(

If you guys find a solution, please let us know! In the meantime I'll add an assert to let users know this is a known issue.

You can still train with DDP normally, however, which I would recommend anyway, as all of the official models were trained without --sync-bn.
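As a rough illustration of the guard described above (the exact message and its placement in train.py are assumptions, not the actual commit):

    import torch

    sync_bn = True  # illustrative stand-in for the parsed --sync-bn flag

    if sync_bn:
        # Fail fast instead of letting DDP training hang silently at epoch 0.
        assert not torch.__version__.startswith('1.9'), \
            '--sync-bn known issue with torch 1.9.0, see ' \
            'https://github.com/ultralytics/yolov5/issues/3998'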

@glenn-jocher glenn-jocher changed the title '--sync-bn' blocks the training process. DDP --sync-bn bug with torch 1.9.0 Jul 17, 2021
@glenn-jocher glenn-jocher linked a pull request Jul 17, 2021 that will close this issue
@jfpuget

jfpuget commented Aug 8, 2021

This may be the cause: pytorch/pytorch#37930

@glenn-jocher glenn-jocher linked a pull request Aug 30, 2021 that will close this issue
@glenn-jocher
Member

@simba0703 @wudashuo @imyhxy @jfpuget good news 😃! Your original issue may now be fixed ✅ in PR #4615. We discovered that the DDP --sync-bn issue was caused by TensorBoard add_graph() logging (used for visualizing the model interactively, example below). I don't know the exact cause and thus did not implement a fix; instead I implemented a workaround that skips TensorBoard model visualization when --sync-bn is used.

[screenshot: TensorBoard model graph visualization]

This means DDP training now works both with and without --sync-bn, but --sync-bn runs will not show a model visualization in TensorBoard.
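A minimal sketch of this workaround pattern, i.e. skipping TensorBoard graph logging when --sync-bn is active; the writer and variable names here are illustrative, not the exact PR #4615 code:

    import torch
    import torch.nn as nn
    from torch.utils.tensorboard import SummaryWriter

    # Illustrative stand-ins for the training script's objects.
    tb_writer = SummaryWriter()
    model = nn.Sequential(nn.Conv2d(3, 16, 3), nn.BatchNorm2d(16), nn.ReLU())
    imgs = torch.zeros(1, 3, 64, 64)
    sync_bn = True  # parsed from the --sync-bn flag in the real script

    # Workaround: only log the model graph when --sync-bn is off, since the
    # add_graph() trace was implicated in the torch 1.9.0 DDP hang.
    if not sync_bn:
        tb_writer.add_graph(model, imgs)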

To receive this update:

  • Git – git pull from within your yolov5/ directory, or git clone https://github.com/ultralytics/yolov5 again
  • PyTorch Hub – Force-reload with model = torch.hub.load('ultralytics/yolov5', 'yolov5s', force_reload=True)
  • Notebooks – View the updated notebooks (Open In Colab, Open In Kaggle)
  • Docker – sudo docker pull ultralytics/yolov5:latest to update your image

Thank you for spotting this issue and informing us of the problem. Please let us know if this update resolves the issue for you, and feel free to inform us of any other issues you discover or feature requests that come to mind. Happy trainings with YOLOv5 🚀!

@glenn-jocher glenn-jocher removed the TODO label Aug 30, 2021