wrong batch size after --resume on multiple GPUs #1936

Closed
AmirSa7 opened this issue Jan 14, 2021 · 3 comments · Fixed by #1942
Labels
bug Something isn't working

Comments

AmirSa7 commented Jan 14, 2021

🐛 Bug

After a multi-GPU training session is interrupted, --resume reads the wrong batch_size back from opt.yaml, causing an error.
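For context, the launcher splits the requested batch size evenly across the DDP processes, and the per-GPU value is what ends up in the run's opt.yaml. A minimal sketch of that split, using the opt.total_batch_size / opt.batch_size naming referenced later in this thread (not a verbatim copy of train.py):

```python
# Hypothetical sketch of the per-GPU batch-size split during a DDP launch;
# names mirror the train.py fields discussed below, not the exact upstream code.
total_batch_size = 30  # --batch-size 30 on the command line
world_size = 2         # --nproc_per_node 2

assert total_batch_size % world_size == 0, 'batch size must be divisible by GPU count'
batch_size = total_batch_size // world_size  # 15 per GPU

# The per-GPU value (15) is what gets serialized into runs/train/exp/opt.yaml,
# which --resume later reads back as if it were the total.
print(batch_size)  # 15
```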

To Reproduce (REQUIRED)

Run this line:

python -m torch.distributed.launch --nproc_per_node 2 train.py --batch-size 30 --data coco128.yaml --weights yolov5s.pt --device 1,2
  • Note that we use --batch-size 30, i.e. 15 per GPU.

Press Ctrl+C to stop the session.
Now, try to resume:

python -m torch.distributed.launch --nproc_per_node 2 train.py --batch-size 30 --data coco128.yaml --weights yolov5s.pt --device 1,2 --resume

Output:

*****************************************
Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed. 
*****************************************
github: Traceback (most recent call last):
  File "train.py", line 492, in <module>
    device = select_device(opt.device, batch_size=opt.batch_size)
  File "/user/detection/tests/yolov5/utils/torch_utils.py", line 68, in select_device
    assert batch_size % n == 0, f'batch-size {batch_size} not multiple of GPU count {n}'
AssertionError: batch-size 15 not multiple of GPU count 2
up to date with https://github.com/ultralytics/yolov5 ✅
Resuming training from ./runs/train/exp/weights/last.pt
Traceback (most recent call last):
  File "train.py", line 492, in <module>
    device = select_device(opt.device, batch_size=opt.batch_size)
  File "/user/detection/tests/yolov5/utils/torch_utils.py", line 68, in select_device
    assert batch_size % n == 0, f'batch-size {batch_size} not multiple of GPU count {n}'
AssertionError: batch-size 15 not multiple of GPU count 2
Traceback (most recent call last):
  File "/usr/lib/python3.8/runpy.py", line 194, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/usr/lib/python3.8/runpy.py", line 87, in _run_code
    exec(code, run_globals)
  File "/user/venv/yolov5/lib/python3.8/site-packages/torch/distributed/launch.py", line 260, in <module>
    main()
  File "/user/venv/yolov5/lib/python3.8/site-packages/torch/distributed/launch.py", line 255, in main
    raise subprocess.CalledProcessError(returncode=process.returncode,
subprocess.CalledProcessError: Command '['/user/venv/yolov5/bin/python', '-u', 'train.py', '--local_rank=1', '--batch-size', '30', '--data', 'coco128.yaml', '--weights', 'yolov5s.pt', '--device', '1,2', '--resume']' returned non-zero exit status 1.
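The failing check is the divisibility assertion in select_device shown in the traceback: on resume, opt.batch_size comes back from opt.yaml as the per-GPU value 15, which is not a multiple of the 2 requested devices. A standalone illustration of that condition (same assertion message as the traceback, simplified signature):

```python
# Standalone illustration of the divisibility check that fails above;
# only the assertion from utils/torch_utils.py:select_device is reproduced here.
def check_batch_size(batch_size: int, n: int) -> None:
    assert batch_size % n == 0, f'batch-size {batch_size} not multiple of GPU count {n}'

check_batch_size(30, 2)  # original launch: passes (total batch size)
check_batch_size(15, 2)  # resume: AssertionError, because opt.yaml stored the per-GPU value
```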

Expected behavior

The training should resume correctly, with the right batch size.

Environment

  • OS: Ubuntu 16.04
  • GPU: GTX 1080 Ti (x2)

Additional context

--

@AmirSa7 AmirSa7 added the bug Something isn't working label Jan 14, 2021
NanoCode012 commented Jan 14, 2021

I'm thinking that we need to modify the line below a bit.

yolov5/train.py

Line 480 in b75c432

opt.cfg, opt.weights, opt.resume, opt.global_rank, opt.local_rank = '', ckpt, True, *apriori # reinstate

to

opt.cfg, opt.weights, opt.resume, opt.batch_size, opt.global_rank, opt.local_rank = '', ckpt, True, opt.total_batch_size, *apriori  # reinstate

@AmirSa7, could you try to modify it and tell me how it goes?

Edit: I tested it quickly and it works.

Commands:

python -m torch.distributed.launch --master_port 9963 --nproc_per_node 2 train.py --data coco128.yaml --cfg yolov5s.yaml --weights yolov5s.pt --epochs 3 --batch-size 64 --device 3,4
python -m torch.distributed.launch --master_port 9963 --nproc_per_node 2 train.py --resume
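
For context, the line above sits in train.py's --resume branch, where opt is rebuilt from the saved opt.yaml (which holds the per-GPU batch size) and a few fields are then reinstated. A simplified sketch of that branch with the proposed change applied; not the exact upstream code, and the helper name is hypothetical:

```python
# Hypothetical sketch of the --resume branch around train.py L480 (b75c432);
# reinstate_opt() is not an upstream function name.
import yaml
from argparse import Namespace

def reinstate_opt(opt_yaml_path: str, ckpt: str, apriori: tuple) -> Namespace:
    # apriori is assumed to be (global_rank, local_rank) captured before reloading opt
    with open(opt_yaml_path) as f:
        opt = Namespace(**yaml.safe_load(f))  # opt.batch_size here is the saved per-GPU value (15)
    # Proposed fix: also restore opt.batch_size from opt.total_batch_size (30),
    # so select_device sees a value divisible by the GPU count again.
    opt.cfg, opt.weights, opt.resume, opt.batch_size, opt.global_rank, opt.local_rank = \
        '', ckpt, True, opt.total_batch_size, *apriori
    return opt
```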


AmirSa7 commented Jan 14, 2021

> I'm thinking that we need to modify the line below a bit. [...] Edit: I tested it quickly and it works.

Yup, it works 👍 Thanks.


glenn-jocher commented Jan 14, 2021

@NanoCode012 thanks for the PR! Your fix has been merged now.

@AmirSa7 the fix proposed by @NanoCode012 has been merged into master now. Please git pull to receive this update and let us know if you encounter any other issues!
