
"raise subprocess.CalledProcessError" shows when training with multiple GPUs in DDP mode #2294

Closed
Jelly123456 opened this issue Feb 25, 2021 · 8 comments
Labels
question Further information is requested

Comments

@Jelly123456

Jelly123456 commented Feb 25, 2021

❔Question

I ran the command below to train on my own custom dataset.
python -m torch.distributed.launch --nproc_per_node 2 train.py --batch-size 64 --data data/Allcls_one.yaml --weights weights/yolov5l.pt --cfg models/yolov5l_1cls.yaml --epochs 1 --device 0,1

Then the following error appears:
wandb: Synced 5 W&B file(s), 47 media file(s), 0 artifact file(s) and 0 other file(s)
wandb:
wandb: Synced exp17: https://wandb.ai/**/YOLOv5/runs/3bhvm9a3
Traceback (most recent call last):
  File "/anaconda3/envs/YoLo_V5_n74/lib/python3.9/runpy.py", line 197, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/anaconda3/envs/YoLo_V5_n74/lib/python3.9/runpy.py", line 87, in _run_code
    exec(code, run_globals)
  File "/anaconda3/envs/YoLo_V5_n74/lib/python3.9/site-packages/torch/distributed/launch.py", line 260, in <module>
    main()
  File "/anaconda3/envs/YoLo_V5_n74/lib/python3.9/site-packages/torch/distributed/launch.py", line 255, in main
    raise subprocess.CalledProcessError(returncode=process.returncode,
subprocess.CalledProcessError: Command '['**/anaconda3/envs/YoLo_V5_n74/bin/python', '-u', 'train.py', '--local_rank=1', '--batch-size', '64', '--data', 'data/Allcls_one.yaml', '--weights', 'weights/yolov5l.pt', '--cfg', 'models/yolov5l_1cls.yaml', '--epochs', '1', '--device', '0,1']' died with <Signals.SIGSEGV: 11>

Training environment

OS: Linux
GPUs: 2× Tesla V100S
Python version: 3.9
Other dependencies: installed per requirements.txt

Additional context

Previously I thought it might be a problem with my virtual environment, so I created a new virtual environment and retrained without error. But after training a few more times the error appears again, even if I remove the old environment and create another new one.

Reference files

I have also shared the configuration files I used for training.

Support needed

Could any experienced person help solve this problem?

[two screenshots of the training configuration files attached]

Jelly123456 added the question label on Feb 25, 2021
@glenn-jocher
Member

glenn-jocher commented Feb 25, 2021

@Jelly123456 we recently merged a DDP bug fix in #2295 for a bug introduced in #2292, but this all happened in the last few hours, so if you are using older code, a git pull now should get DDP working correctly.

The only other thing you might consider is falling back to a Python 3.8 environment, as 3.9 is pretty new and possibly not as mature as 3.8. Docker is also a great choice for DDP; it's basically a guaranteed working environment, and we run most of our remote trainings via the Docker container.
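For example, each of those options might look roughly like this (the conda environment name and Docker image tag below are illustrative assumptions rather than anything prescribed in this thread):

# pull the latest code containing the DDP fix
git pull

# or recreate the environment on Python 3.8 (environment name is just an example)
conda create -n yolov5-py38 python=3.8 -y
conda activate yolov5-py38
pip install -r requirements.txt

# or train inside the Docker container (requires the NVIDIA Container Toolkit for --gpus)
docker pull ultralytics/yolov5:latest
docker run --ipc=host -it --gpus all ultralytics/yolov5:latest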


@Jelly123456
Author

@glenn-jocher Thank you very much for your quick reply. I will try your recommendations.

I will tentatively leave this issue open for now. After I try, if I find no problems, I will come back and close it.

@glenn-jocher
Member

@Jelly123456 sounds good

@NanoCode012
Contributor

I was training the Transformer PR using DDP and got the same error when training ended (phew!)

Env: py3.9 , torch 1.7.1
Commit: fd96810

I think the commit just behind it is fab5085, but I'm not sure if it's related; a quick way to check is sketched after the traceback below.

Traceback (most recent call last):
  File "/.conda/envs/py39/lib/python3.9/runpy.py", line 197, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/.conda/envs/py39/lib/python3.9/runpy.py", line 87, in _run_code
    exec(code, run_globals)
  File "/.conda/envs/py39/lib/python3.9/site-packages/torch/distributed/launch.py", line 260, in <module>
    main()
  File "/.conda/envs/py39/lib/python3.9/site-packages/torch/distributed/launch.py", line 255, in main
    raise subprocess.CalledProcessError(returncode=process.returncode,
subprocess.CalledProcessError: Command '['/.conda/envs/py39/bin/python', '-u', 'train.py', '--local_rank=3', '--data', 'coco.yaml', '--cfg', 'models/yolotrl.yaml', '--weights', '', '--batch-size', '128', '--device', '3,4,5,6', '--name', '4_5trlv3']' died with <Signals.SIGSEGV: 11>.
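
One way to confirm whether fab5085 is actually part of the checked-out history (an illustrative git check, not something from the thread):

git log --oneline | grep fab5085   # is the commit in the current branch history?
git merge-base --is-ancestor fab5085 HEAD && echo "fab5085 is an ancestor of HEAD"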

@glenn-jocher
Member

@NanoCode012 @Jelly123456 for DDP (actually for all trainings) I always use the docker image, and I haven't seen an error like this, or any other in the last few months.

A segmentation fault may also be caused by an overloaded system rather than any GPU problem, i.e. a lack of CPU threads or RAM. Other than that I'm not sure what to say; these errors are usually not very repeatable, unfortunately.
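
If an overloaded system is the suspect, a quick check of CPU threads, RAM and shared memory before launching, plus lowering the dataloader worker count, is a reasonable first step (a sketch only; the --workers flag is assumed to be available in this version of train.py):

# check available resources
nproc            # CPU threads
free -h          # RAM
df -h /dev/shm   # shared memory used by dataloader workers

# relaunch with fewer dataloader workers per process
python -m torch.distributed.launch --nproc_per_node 2 train.py --batch-size 64 --data data/Allcls_one.yaml --weights weights/yolov5l.pt --cfg models/yolov5l_1cls.yaml --epochs 1 --device 0,1 --workers 4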

@Jelly123456
Author

I tested with the latest commit and Python 3.8. There were no errors after training for 50 epochs.

@glenn-jocher
Member

@Jelly123456 great! Same results here, no DDP errors with 2x or 4x GPUs after 50 epochs.

@ghost

ghost commented Aug 11, 2021

I ran into the same problem as you.
