
DDP --sync-bn bug with torch 1.9.0 #3998

Closed
simba0703 opened this issue Jul 14, 2021 · 5 comments · Fixed by #4032 or #4615
Labels
bug Something isn't working

Comments

@simba0703

When I use 'python -m torch.distributed.launch --nproc_per_node 3 train.py --batch-size 12 --data data/coco128.yaml --weights yolov5m6.pt --device 1,2,3 --adam --sync-bn', the training process hangs at epoch 0. If I do not use '--sync-bn', training runs fine.
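For context, a minimal sketch of what a --sync-bn code path typically does before DDP wrapping; the model and variable names below are illustrative, not the exact YOLOv5 code:

    import torch
    import torch.nn as nn

    # Illustrative stand-in for the model built by train.py.
    model = nn.Sequential(nn.Conv2d(3, 16, 3), nn.BatchNorm2d(16), nn.ReLU())

    # --sync-bn converts every BatchNorm layer to SyncBatchNorm so batch
    # statistics are all-reduced across DDP processes instead of per GPU.
    model = nn.SyncBatchNorm.convert_sync_batchnorm(model)

    # In the real script the converted model is then wrapped in DDP, e.g.
    # model = nn.parallel.DistributedDataParallel(model.cuda(rank), device_ids=[rank])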


@simba0703 simba0703 added the bug Something isn't working label Jul 14, 2021
@wudashuo
Contributor

I ran into the same problem a week ago: training gets stuck if I use --sync-bn, and runs fine once I remove it.
I tried to find out why, but failed.

@imyhxy
Contributor

imyhxy commented Jul 16, 2021

Encountered the same problem here. 🌗

@glenn-jocher
Member

@simba0703 @wudashuo @imyhxy thanks for the notice, guys. Yes, --sync-bn is broken with torch 1.9.0; I can't figure out what the problem is though :(

If you guys find a solution, please let us know! In the meantime I'll add an assert to let users know this is a known issue.

You can still train with DDP normally, however, which I would recommend anyway, as all of the official models were trained without --sync-bn.
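As a rough illustration of the guard described above (the exact message and its placement in train.py are assumptions, not the actual commit):

    import torch

    sync_bn = True  # illustrative stand-in for the parsed --sync-bn flag

    if sync_bn:
        # Fail fast instead of letting DDP training hang silently at epoch 0.
        assert not torch.__version__.startswith('1.9'), \
            '--sync-bn known issue with torch 1.9.0, see ' \
            'https://github.com/ultralytics/yolov5/issues/3998'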

@glenn-jocher glenn-jocher changed the title '--sync-bn' blocks the training process. DDP --sync-bn bug with torch 1.9.0 Jul 17, 2021
@glenn-jocher glenn-jocher linked a pull request Jul 17, 2021 that will close this issue
@jfpuget

jfpuget commented Aug 8, 2021

This may be the cause: pytorch/pytorch#37930

@glenn-jocher glenn-jocher linked a pull request Aug 30, 2021 that will close this issue
@glenn-jocher
Member

@simba0703 @wudashuo @imyhxy @jfpuget good news 😃! Your original issue may now be fixed ✅ in PR #4615. We discovered that the DDP --sync-bn issue was caused by TensorBoard add_graph() logging (used for visualizing the model interactively, example below). I don't know the exact cause and thus did not implement a fix; instead I implemented a workaround that skips TensorBoard model visualization when --sync-bn is used.

[screenshot: TensorBoard model graph visualization]

This means DDP training now works both with and without --sync-bn, but --sync-bn runs will not show a model visualization in TensorBoard.
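A minimal sketch of this workaround pattern, i.e. skipping TensorBoard graph logging when --sync-bn is active; the writer and variable names here are illustrative, not the exact PR #4615 code:

    import torch
    import torch.nn as nn
    from torch.utils.tensorboard import SummaryWriter

    # Illustrative stand-ins for the training script's objects.
    tb_writer = SummaryWriter()
    model = nn.Sequential(nn.Conv2d(3, 16, 3), nn.BatchNorm2d(16), nn.ReLU())
    imgs = torch.zeros(1, 3, 64, 64)
    sync_bn = True  # parsed from the --sync-bn flag in the real script

    # Workaround: only log the model graph when --sync-bn is off, since the
    # add_graph() trace was implicated in the torch 1.9.0 DDP hang.
    if not sync_bn:
        tb_writer.add_graph(model, imgs)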

To receive this update:

  • Git – git pull from within your yolov5/ directory, or git clone https://github.com/ultralytics/yolov5 again
  • PyTorch Hub – Force-reload with model = torch.hub.load('ultralytics/yolov5', 'yolov5s', force_reload=True)
  • Notebooks – View the updated notebooks (Open In Colab, Open In Kaggle)
  • Docker – sudo docker pull ultralytics/yolov5:latest to update your image

Thank you for spotting this issue and informing us of the problem. Please let us know if this update resolves the issue for you, and feel free to inform us of any other issues you discover or feature requests that come to mind. Happy trainings with YOLOv5 🚀!

@glenn-jocher glenn-jocher removed the TODO label Aug 30, 2021