
"raise subprocess.CalledProcessError" shows when training with multiple GPUs in DDP mode #2294

Closed
Jelly123456 opened this issue Feb 25, 2021 · 8 comments
Labels
question Further information is requested

Comments

@Jelly123456

Jelly123456 commented Feb 25, 2021

❔Question

I ran the command below to train on my own custom dataset.
python -m torch.distributed.launch --nproc_per_node 2 train.py --batch-size 64 --data data/Allcls_one.yaml --weights weights/yolov5l.pt --cfg models/yolov5l_1cls.yaml --epochs 1 --device 0,1

Then the following error appears:
wandb: Synced 5 W&B file(s), 47 media file(s), 0 artifact file(s) and 0 other file(s)
wandb:
wandb: Synced exp17: https://wandb.ai/**/YOLOv5/runs/3bhvm9a3
Traceback (most recent call last):
  File "/anaconda3/envs/YoLo_V5_n74/lib/python3.9/runpy.py", line 197, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/anaconda3/envs/YoLo_V5_n74/lib/python3.9/runpy.py", line 87, in _run_code
    exec(code, run_globals)
  File "/anaconda3/envs/YoLo_V5_n74/lib/python3.9/site-packages/torch/distributed/launch.py", line 260, in <module>
    main()
  File "/anaconda3/envs/YoLo_V5_n74/lib/python3.9/site-packages/torch/distributed/launch.py", line 255, in main
    raise subprocess.CalledProcessError(returncode=process.returncode,
subprocess.CalledProcessError: Command '['**/anaconda3/envs/YoLo_V5_n74/bin/python', '-u', 'train.py', '--local_rank=1', '--batch-size', '64', '--data', 'data/Allcls_one.yaml', '--weights', 'weights/yolov5l.pt', '--cfg', 'models/yolov5l_1cls.yaml', '--epochs', '1', '--device', '0,1']' died with <Signals.SIGSEGV: 11>

Training environment

OS: Linux
GPUs: 2× Tesla V100S
Python version: 3.9
Other dependencies: installed per requirements.txt

Additional context

Previously I thought it might be a problem with my virtual environment, so I created a new virtual environment and retrained without error. But after training a few more times the error appears again, even if I remove the old environment and create another new one.

Reference files

I have also shared the configuration files I used for training.

Support needed

Could any experienced person help solve this problem?

[two screenshots of the training configuration files attached]

Jelly123456 added the question label on Feb 25, 2021
@glenn-jocher
Member

glenn-jocher commented Feb 25, 2021

@Jelly123456 we recently merged a DDP bug fix in #2295 for a bug introduced in #2292, but this all happened in the last few hours, so if you are using older code, a git pull now should get DDP working correctly.

The only other thing you might consider is falling back to a Python 3.8 environment, as 3.9 is pretty new and possibly not as mature as 3.8. Docker is also a great choice for DDP; it's basically a guaranteed working environment, and we run most of our remote trainings via the Docker container.
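For example, each of those options might look roughly like this (the conda environment name and Docker image tag below are illustrative assumptions rather than anything prescribed in this thread):

# pull the latest code containing the DDP fix
git pull

# or recreate the environment on Python 3.8 (environment name is just an example)
conda create -n yolov5-py38 python=3.8 -y
conda activate yolov5-py38
pip install -r requirements.txt

# or train inside the Docker container (requires the NVIDIA Container Toolkit for --gpus)
docker pull ultralytics/yolov5:latest
docker run --ipc=host -it --gpus all ultralytics/yolov5:latest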


@Jelly123456
Author

@glenn-jocher Thank you very much for your quick reply. I will try your recommendations.

I will tentatively leave this issue open for now. After I try, if I find no problems, I will come back and close it.

@glenn-jocher
Member

@Jelly123456 sounds good

@NanoCode012
Contributor

I was training the Transformer PR using DDP and got the same error when training ended (phew!)

Env: py3.9 , torch 1.7.1
Commit: fd96810

I think the commit just behind it is fab5085, but I'm not sure if it's related; a quick way to check is sketched after the traceback below.

Traceback (most recent call last):
  File "/.conda/envs/py39/lib/python3.9/runpy.py", line 197, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/.conda/envs/py39/lib/python3.9/runpy.py", line 87, in _run_code
    exec(code, run_globals)
  File "/.conda/envs/py39/lib/python3.9/site-packages/torch/distributed/launch.py", line 260, in <module>
    main()
  File "/.conda/envs/py39/lib/python3.9/site-packages/torch/distributed/launch.py", line 255, in main
    raise subprocess.CalledProcessError(returncode=process.returncode,
subprocess.CalledProcessError: Command '['/.conda/envs/py39/bin/python', '-u', 'train.py', '--local_rank=3', '--data', 'coco.yaml', '--cfg', 'models/yolotrl.yaml', '--weights', '', '--batch-size', '128', '--device', '3,4,5,6', '--name', '4_5trlv3']' died with <Signals.SIGSEGV: 11>.
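
One way to confirm whether fab5085 is actually part of the checked-out history (an illustrative git check, not something from the thread):

git log --oneline | grep fab5085   # is the commit in the current branch history?
git merge-base --is-ancestor fab5085 HEAD && echo "fab5085 is an ancestor of HEAD"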

@glenn-jocher
Member

@NanoCode012 @Jelly123456 for DDP (actually for all trainings) I always use the docker image, and I haven't seen an error like this, or any other in the last few months.

A segmentation fault may also be caused by an overloaded system rather than any GPU problem, i.e. a lack of CPU threads or RAM. Other than that I'm not sure what to say; these errors are usually not very repeatable, unfortunately.
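
If an overloaded system is the suspect, a quick check of CPU threads, RAM and shared memory before launching, plus lowering the dataloader worker count, is a reasonable first step (a sketch only; the --workers flag is assumed to be available in this version of train.py):

# check available resources
nproc            # CPU threads
free -h          # RAM
df -h /dev/shm   # shared memory used by dataloader workers

# relaunch with fewer dataloader workers per process
python -m torch.distributed.launch --nproc_per_node 2 train.py --batch-size 64 --data data/Allcls_one.yaml --weights weights/yolov5l.pt --cfg models/yolov5l_1cls.yaml --epochs 1 --device 0,1 --workers 4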

@Jelly123456
Author

I tested with the latest commit and Python 3.8. There were no errors after training for 50 epochs.

@glenn-jocher
Member

@Jelly123456 great! Same results here, no DDP errors with 2x or 4x GPUs after 50 epochs.

@ghost

ghost commented Aug 11, 2021

I ran into the same problem as you.
