torch.nn.modules.module.ModuleAttributeError in DP and DDP mode #682

Closed
NanoCode012 opened this issue Aug 9, 2020 · 2 comments · Fixed by #683
Labels
bug Something isn't working

Comments

NanoCode012 (Contributor) commented Aug 9, 2020

🐛 Bug

Due to the latest update in 3c6e2f7, DP and DDP modes error out because they wrap the model, so the stride attribute can no longer be accessed directly on it.
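A minimal standalone sketch of the failure mode (a toy module for illustration, not the actual YOLOv5 model): once the model is wrapped, stride is only reachable through .module.

import torch
import torch.nn as nn

model = nn.Linear(8, 8)                      # stand-in for the YOLOv5 model
model.stride = torch.tensor([8., 16., 32.])  # YOLOv5 stores strides as a plain attribute

wrapped = nn.DataParallel(model)             # DDP behaves the same way here
print(int(max(model.stride)))                # 32 -- works on the bare model
print(int(max(wrapped.module.stride)))       # 32 -- works via .module
print(int(max(wrapped.stride)))              # raises: 'DataParallel' object has no attribute 'stride'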

To Reproduce (REQUIRED)

Input:

python train.py --weights yolov5s.pt --epochs 3 --img 320 --device 0,1 # DP
python -m torch.distributed.launch --nproc_per_node 2 train.py --weights yolov5s.pt --epochs 3 --img 320 --device 0,1 # DDP

Output in DDP mode (DP mode output differs only slightly):

Transferred 370/370 items from yolov5s.pt
Optimizer groups: 62 .bias, 70 conv.weight, 59 other
Transferred 370/370 items from yolov5s.pt
Optimizer groups: 62 .bias, 70 conv.weight, 59 other
Traceback (most recent call last):
  File "train.py", line 439, in <module>
Traceback (most recent call last):
  File "train.py", line 439, in <module>
    train(hyp, opt, device, tb_writer)
  File "train.py", line 144, in train
    gs = int(max(model.stride))  # grid size (max stride)
  File ".conda/envs/py37/lib/python3.7/site-packages/torch/nn/modules/module.py", line 772, in __getattr__
    train(hyp, opt, device, tb_writer)
  File "train.py", line 144, in train
    gs = int(max(model.stride))  # grid size (max stride)
  File ".conda/envs/py37/lib/python3.7/site-packages/torch/nn/modules/module.py", line 772, in __getattr__
    type(self).__name__, name))
torch.nn.modules.module.ModuleAttributeError: 'DistributedDataParallel' object has no attribute 'stride'
    type(self).__name__, name))
torch.nn.modules.module.ModuleAttributeError: 'DistributedDataParallel' object has no attribute 'stride'
Traceback (most recent call last):
  File ".conda/envs/py37/lib/python3.7/runpy.py", line 193, in _run_module_as_main
    "__main__", mod_spec)
  File ".conda/envs/py37/lib/python3.7/runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File ".conda/envs/py37/lib/python3.7/site-packages/torch/distributed/launch.py", line 261, in <module>
    main()
  File ".conda/envs/py37/lib/python3.7/site-packages/torch/distributed/launch.py", line 257, in main
    cmd=cmd)
subprocess.CalledProcessError: Command '['.conda/envs/py37/bin/python', '-u', 'train.py', '--local_rank=1', '--weights', 'yolov5s.pt', '--epochs', '3', '--img', '320', '--device', '0,1']' returned non-zero exit status 1.

Expected behavior

Training should run as it does in single-GPU mode.

Environment

  • OS: Ubuntu
  • GPU: V100s

Additional context

The solution is to move the lines below (specifically line 144) above the DP/DDP wrappers.

yolov5/train.py

Lines 143 to 145 in a0ac5ad

# Image sizes
gs = int(max(model.stride)) # grid size (max stride)
imgsz, imgsz_test = [check_img_size(x, gs) for x in opt.img_size] # verify imgsz are gs-multiples

so that they sit above the DP-mode block at line 126:

yolov5/train.py

Lines 126 to 129 in a0ac5ad

# DP mode
if cuda and rank == -1 and torch.cuda.device_count() > 1:
model = torch.nn.DataParallel(model)
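For clarity, a sketch of the resulting order after the move (assuming the surrounding train.py names model, opt, cuda, rank, and check_img_size are in scope):

# Image sizes -- read stride while model is still the bare module
gs = int(max(model.stride))  # grid size (max stride)
imgsz, imgsz_test = [check_img_size(x, gs) for x in opt.img_size]  # verify imgsz are gs-multiples

# DP mode -- wrap only after model attributes have been read
if cuda and rank == -1 and torch.cuda.device_count() > 1:
    model = torch.nn.DataParallel(model)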

I added a PR for your convenience. It is the minimal change needed, and it passes my unit test. If you want to keep the image-size code near the dataloaders, it may be possible to move that whole block above the DP/DDP wrappers instead, but I'm not sure.

I think having CI or unit tests for DDP/DP mode is important, as it's easy to miss bugs like these. Of course, I understand that the resources are expensive.

On a side note, is this the right way to train from pretrained weights: just pass --weights yolov5s.pt without the yolov5s.yaml file?

My Unit test (includes DDP/DP mode)
set -e 
rm -rf yolov5 && git clone https://github.com/ultralytics/yolov5.git && cd yolov5

#pip install -r requirements.txt onnx
#python3 -c "from utils.google_utils import *; gdrive_download('1n_oKgR81BJtqk75b00eAjdv03qVCQn2f', 'coco128.zip')" && mv -n ./coco128 ../
export PYTHONPATH="$PWD" # to run *.py files in subdirectories
for x in yolov5s #yolov5m yolov5l yolov5x # models
do
  python -m torch.distributed.launch --master_port 9990 --nproc_per_node 2 train.py --weights $x.pt --epochs 3 --img 320 --device 0,1 # DDP train
  for di in 0,1 0 cpu # inference devices
  do
    python train.py --weights $x.pt --epochs 3 --img 320 --device $di  # train
    python detect.py --weights $x.pt --device $di  # detect official
    python detect.py --weights runs/exp0/weights/last.pt --device $di  # detect custom
    python test.py --weights $x.pt --device $di # test official
    python test.py --weights runs/exp0/weights/last.pt --device $di # test custom
  done
  python models/yolo.py --cfg $x.yaml # inspect
  python models/export.py --weights $x.pt --img 640 --batch 1 # export
done
glenn-jocher (Member) commented:

@NanoCode012 thanks, just saw your PR and merged!

Yes, single- and multi-GPU CI would be awesome. It's a very rare use case though, so I think there is only one company offering support for it, and it charges hourly. Alternatively, GitHub Actions can use self-hosted runners that you can point at a cloud instance. This article appeared just a few days ago:
https://github.blog/2020-08-04-github-actions-self-hosted-runners-on-google-cloud/

If this could spin up a 2x K80 GPU VM (the cheapest and slowest GPUs on GCP), run additional CI tests on Linux for at least single- and dual-GPU setups, and then immediately shut the VM down afterwards, the costs should be manageable.
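Roughly something like this (instance name, zone, and flags are only a sketch, not a tested workflow):

# create a cheap 2x K80 VM for the GPU CI job, use it as a self-hosted runner, then delete it
gcloud compute instances create yolov5-gpu-ci \
    --zone us-central1-a \
    --accelerator type=nvidia-tesla-k80,count=2 \
    --maintenance-policy TERMINATE
# ... register the VM as a self-hosted runner and let it run the single/multi-GPU tests ...
gcloud compute instances delete yolov5-gpu-ci --zone us-central1-a --quiet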

But the blog post also notes:

⚠️ Note that these use cases are considered experimental and not officially supported by GitHub at this time. Additionally, it’s recommended not to use self-hosted runners on public repositories for a number of security reasons.

glenn-jocher (Member) commented:

@NanoCode012 oh, about your other question: yes, we can now 'finetune', i.e. start training from pretrained weights, just by supplying --weights; --cfg is no longer required.

If you pass both a --cfg and --weights, the --cfg is used to create a model, and then any matching layers are transferred from the --weights. The anchors are on an exclude list of layers not to transfer, but I need to review this for the --resume use case.

Also, the hyps are now in their own file in data/hyp.yaml. If pretrained weights are supplied, the finetuning hyps are used; if no pretrained weights are supplied, the from-scratch hyps are used. If you supply your own --hyp, that is used instead. The two hyp files are identical for now, but may diverge in the future.
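For example (the flags follow the train.py argparser; the dataset and hyp paths here are just illustrative):

python train.py --weights yolov5s.pt --data data/coco128.yaml --epochs 3               # finetune: pretrained weights, finetuning hyps
python train.py --weights '' --cfg yolov5s.yaml --data data/coco128.yaml --epochs 3    # from scratch: model built from cfg, from-scratch hyps
python train.py --weights yolov5s.pt --data data/coco128.yaml --hyp path/to/hyp.yaml   # your own hyp file overrides the default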
