torch.nn.modules.module.ModuleAttributeError in DP and DDP mode #682

Closed
NanoCode012 opened this issue Aug 9, 2020 · 2 comments · Fixed by #683
Labels
bug Something isn't working

Comments

NanoCode012 (Contributor) commented Aug 9, 2020

🐛 Bug

Due to the latest update in 3c6e2f7, DP and DDP modes error out because they wrap the model, so the stride attribute can no longer be accessed directly on it.
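A minimal standalone sketch of the failure mode (a toy module for illustration, not the actual YOLOv5 model): once the model is wrapped, stride is only reachable through .module.

import torch
import torch.nn as nn

model = nn.Linear(8, 8)                      # stand-in for the YOLOv5 model
model.stride = torch.tensor([8., 16., 32.])  # YOLOv5 stores strides as a plain attribute

wrapped = nn.DataParallel(model)             # DDP behaves the same way here
print(int(max(model.stride)))                # 32 -- works on the bare model
print(int(max(wrapped.module.stride)))       # 32 -- works via .module
print(int(max(wrapped.stride)))              # raises: 'DataParallel' object has no attribute 'stride'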

To Reproduce (REQUIRED)

Input:

python train.py --weights yolov5s.pt --epochs 3 --img 320 --device 0,1 # DP
python -m torch.distributed.launch --nproc_per_node 2 train.py --weights yolov5s.pt --epochs 3 --img 320 --device 0,1 # DDP

Output in DDP mode (DP mode output differs only slightly):

Transferred 370/370 items from yolov5s.pt
Optimizer groups: 62 .bias, 70 conv.weight, 59 other
Transferred 370/370 items from yolov5s.pt
Optimizer groups: 62 .bias, 70 conv.weight, 59 other
Traceback (most recent call last):
  File "train.py", line 439, in <module>
Traceback (most recent call last):
  File "train.py", line 439, in <module>
    train(hyp, opt, device, tb_writer)
  File "train.py", line 144, in train
    gs = int(max(model.stride))  # grid size (max stride)
  File ".conda/envs/py37/lib/python3.7/site-packages/torch/nn/modules/module.py", line 772, in __getattr__
    train(hyp, opt, device, tb_writer)
  File "train.py", line 144, in train
    gs = int(max(model.stride))  # grid size (max stride)
  File ".conda/envs/py37/lib/python3.7/site-packages/torch/nn/modules/module.py", line 772, in __getattr__
    type(self).__name__, name))
torch.nn.modules.module.ModuleAttributeError: 'DistributedDataParallel' object has no attribute 'stride'
    type(self).__name__, name))
torch.nn.modules.module.ModuleAttributeError: 'DistributedDataParallel' object has no attribute 'stride'
Traceback (most recent call last):
  File ".conda/envs/py37/lib/python3.7/runpy.py", line 193, in _run_module_as_main
    "__main__", mod_spec)
  File ".conda/envs/py37/lib/python3.7/runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File ".conda/envs/py37/lib/python3.7/site-packages/torch/distributed/launch.py", line 261, in <module>
    main()
  File ".conda/envs/py37/lib/python3.7/site-packages/torch/distributed/launch.py", line 257, in main
    cmd=cmd)
subprocess.CalledProcessError: Command '['.conda/envs/py37/bin/python', '-u', 'train.py', '--local_rank=1', '--weights', 'yolov5s.pt', '--epochs', '3', '--img', '320', '--device', '0,1']' returned non-zero exit status 1.

Expected behavior

Training should run as it does in single-GPU mode.

Environment

  • OS: Ubuntu
  • GPU: V100s

Additional context

The solution is to move the lines below (specifically line 144) above the DP/DDP wrappers.

yolov5/train.py

Lines 143 to 145 in a0ac5ad

# Image sizes
gs = int(max(model.stride)) # grid size (max stride)
imgsz, imgsz_test = [check_img_size(x, gs) for x in opt.img_size] # verify imgsz are gs-multiples

so that they sit above the DP-mode block at line 126:

yolov5/train.py

Lines 126 to 129 in a0ac5ad

# DP mode
if cuda and rank == -1 and torch.cuda.device_count() > 1:
model = torch.nn.DataParallel(model)
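For clarity, a sketch of the resulting order after the move (assuming the surrounding train.py names model, opt, cuda, rank, and check_img_size are in scope):

# Image sizes -- read stride while model is still the bare module
gs = int(max(model.stride))  # grid size (max stride)
imgsz, imgsz_test = [check_img_size(x, gs) for x in opt.img_size]  # verify imgsz are gs-multiples

# DP mode -- wrap only after model attributes have been read
if cuda and rank == -1 and torch.cuda.device_count() > 1:
    model = torch.nn.DataParallel(model)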

I added a PR for your convenience. It is the minimal change needed, and it passes my unit test. If you want to keep the image-size code near the dataloaders, it may be possible to move that whole block above the DP/DDP wrappers instead, but I'm not sure.

I think having CI or unit tests for DDP/DP mode is important, as it's easy to miss bugs like these. Of course, I understand that the resources are expensive.

On a side note, is this the right way to train from pretrained weights: just pass --weights yolov5s.pt without the yolov5s.yaml file?

My Unit test (includes DDP/DP mode)
set -e 
rm -rf yolov5 && git clone https://github.com/ultralytics/yolov5.git && cd yolov5

#pip install -r requirements.txt onnx
#python3 -c "from utils.google_utils import *; gdrive_download('1n_oKgR81BJtqk75b00eAjdv03qVCQn2f', 'coco128.zip')" && mv -n ./coco128 ../
export PYTHONPATH="$PWD" # to run *.py files in subdirectories
for x in yolov5s #yolov5m yolov5l yolov5x # models
do
  python -m torch.distributed.launch --master_port 9990 --nproc_per_node 2 train.py --weights $x.pt --epochs 3 --img 320 --device 0,1 # DDP train
  for di in 0,1 0 cpu # inference devices
  do
    python train.py --weights $x.pt --epochs 3 --img 320 --device $di  # train
    python detect.py --weights $x.pt --device $di  # detect official
    python detect.py --weights runs/exp0/weights/last.pt --device $di  # detect custom
    python test.py --weights $x.pt --device $di # test official
    python test.py --weights runs/exp0/weights/last.pt --device $di # test custom
  done
  python models/yolo.py --cfg $x.yaml # inspect
  python models/export.py --weights $x.pt --img 640 --batch 1 # export
done
glenn-jocher (Member) commented:

@NanoCode012 thanks, just saw your PR and merged!

Yes, single- and multi-GPU CI would be awesome. It's a very rare use case though, so I think there is only one company offering support for it, and it charges hourly. Alternatively, GitHub Actions can use self-hosted runners that you can point at a cloud instance. This article appeared just a few days ago:
https://github.blog/2020-08-04-github-actions-self-hosted-runners-on-google-cloud/

If this could spin up a 2x K80 GPU VM (the cheapest and slowest GPUs on GCP), run additional CI tests on Linux for at least single- and dual-GPU setups, and then immediately shut the VM down afterwards, the costs should be manageable.
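Roughly something like this (instance name, zone, and flags are only a sketch, not a tested workflow):

# create a cheap 2x K80 VM for the GPU CI job, use it as a self-hosted runner, then delete it
gcloud compute instances create yolov5-gpu-ci \
    --zone us-central1-a \
    --accelerator type=nvidia-tesla-k80,count=2 \
    --maintenance-policy TERMINATE
# ... register the VM as a self-hosted runner and let it run the single/multi-GPU tests ...
gcloud compute instances delete yolov5-gpu-ci --zone us-central1-a --quiet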

But the blog post also notes:

⚠️ Note that these use cases are considered experimental and not officially supported by GitHub at this time. Additionally, it’s recommended not to use self-hosted runners on public repositories for a number of security reasons.

glenn-jocher (Member) commented:

@NanoCode012 oh, about your other question: yes, we can now 'finetune', i.e. start training from pretrained weights, just by supplying --weights; --cfg is no longer required.

If you pass both a --cfg and --weights, the --cfg is used to create a model, and then any matching layers are transferred from the --weights. The anchors are on an exclude list of layers not to transfer, but I need to review this for the --resume use case.

Also, the hyps are now in their own file in data/hyp.yaml. If pretrained weights are supplied, the finetuning hyps are used; if no pretrained weights are supplied, the from-scratch hyps are used. If you supply your own --hyp, that is used instead. The two hyp files are identical for now, but may diverge in the future.
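For example (the flags follow the train.py argparser; the dataset and hyp paths here are just illustrative):

python train.py --weights yolov5s.pt --data data/coco128.yaml --epochs 3               # finetune: pretrained weights, finetuning hyps
python train.py --weights '' --cfg yolov5s.yaml --data data/coco128.yaml --epochs 3    # from scratch: model built from cfg, from-scratch hyps
python train.py --weights yolov5s.pt --data data/coco128.yaml --hyp path/to/hyp.yaml   # your own hyp file overrides the default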
