The program crashes in the last epoch's validation in multi-gpu (DDP) case #5557

bardia-az · 2021-11-08T07:21:18Z

I am training the yolov5x6 model on Tesla V100-SXM2-32GB GPUs. The command I used is:

python -m torch.distributed.launch --nproc_per_node 4 train.py --epochs 2 --freeze 5 --weights yolov5x6.pt --data data/coco.yaml --name test --batch-size 64 --img 1024 --exist-ok --hyp data/hyps/hyp.scratch-p6.yaml --device 0,1,2,3 --adam

In the last epoch's validation, the program crashes and gives this error:

terminate called after throwing an instance of 'std::system_error'
what(): Connection reset by peer

Do you have any idea what causes this? Could it be related to GPU 0 still running while the other GPUs have already finished their work?

glenn-jocher (Maintainer) · 2021-11-08T10:37:21Z

@bardia-az 👋 hi, thanks for letting us know about this possible problem with YOLOv5 🚀. We've created a few short guidelines below to help users provide what we need to start investigating a possible problem. In general, though, you should run all DDP commands in our Docker image for best results, and you should also use the newer torch.distributed.run command in place of torch.distributed.launch.
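For reference, here is a minimal sketch of the command from the original post moved to the newer launcher. It assumes the training arguments carry over unchanged and that train.py reads its local rank from the LOCAL_RANK environment variable, which torch.distributed.run sets for each worker. Neither assumption has been verified against this particular setup, and the RUN variable is a hypothetical dry-run switch added only for illustration.

```shell
# Training arguments copied unchanged from the original post.
ARGS="train.py --epochs 2 --freeze 5 --weights yolov5x6.pt --data data/coco.yaml --name test --batch-size 64 --img 1024 --exist-ok --hyp data/hyps/hyp.scratch-p6.yaml --device 0,1,2,3 --adam"

# Only the launcher module changes: torch.distributed.run replaces torch.distributed.launch.
# Assumption: train.py picks up LOCAL_RANK from the environment rather than a --local_rank flag.
CMD="python -m torch.distributed.run --nproc_per_node 4 $ARGS"

# Print the command for review. RUN is a hypothetical switch: set RUN=1 to actually launch.
echo "$CMD"
if [ "${RUN:-0}" = "1" ]; then
    $CMD
fi
```

By default the snippet only prints the command, so it can be checked before four GPUs are committed to a run. This changes the launcher only and may not resolve the crash by itself.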

How to create a Minimal, Reproducible Example

When asking a question, people will be better able to provide help if you provide code that they can easily understand and use to reproduce the problem. This is referred to by community members as creating a minimum reproducible example. Your code that reproduces the problem should be:

✅ Minimal – Use as little code as possible to produce the problem
✅ Complete – Provide all parts someone else needs to reproduce the problem
✅ Reproducible – Test the code you're about to provide to make sure it reproduces the problem

For Ultralytics to provide assistance, your code should also be:

✅ Current – Verify that your code is up-to-date with GitHub master, and if necessary git pull or git clone a new copy to ensure your problem has not already been solved in master.
✅ Unmodified – Your problem must be reproducible using official YOLOv5 code without changes. Ultralytics does not provide support for custom code ⚠️.

If you believe your problem meets all the above criteria, please close this issue and raise a new one using the 🐛 Bug Report template with a minimum reproducible example to help us better understand and diagnose your problem.

Thank you! 😃
