Replies: 1 comment
-
@bardia-az 👋 hi, thanks for letting us know about this possible problem with YOLOv5 🚀. We've created a few short guidelines below to help users provide what we need in order to get started investigating a possible problem. In general though you should run all DDP commands in our Docker image for best results, and you should also use the newer

How to create a Minimal, Reproducible Example

When asking a question, people will be better able to provide help if you provide code that they can easily understand and use to reproduce the problem. This is referred to by community members as creating a minimum reproducible example. Your code that reproduces the problem should be:
For Ultralytics to provide assistance your code should also be:
If you believe your problem meets all the above criteria, please close this issue and raise a new one using the 🐛 Bug Report template with a minimum reproducible example to help us better understand and diagnose your problem. Thank you! 😃
-
I am training a yolov5x6 model on Tesla V100-SXM2-32GB GPUs. The command I used is:
python -m torch.distributed.launch --nproc_per_node 4 train.py --epochs 2 --freeze 5 --weights yolov5x6.pt --data data/coco.yaml --name test --batch-size 64 --img 1024 --exist-ok --hyp data/hyps/hyp.scratch-p6.yaml --device 0,1,2,3 --adam
During validation in the last epoch, the program crashes with this error:
terminate called after throwing an instance of 'std::system_error'
what(): Connection reset by peer
Do you have any idea what causes this? Could it be related to GPU0 still running while the other GPUs have already finished their work?
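For context, here is a minimal sketch of what "Connection reset by peer" means at the OS level, using plain Python sockets on a POSIX system. It is not YOLOv5 or PyTorch code, and whether the same thing happens between the DDP processes in this run is an assumption rather than something confirmed in this thread. The function name `demo_connection_reset` and the thread standing in for "a process that finished early" are hypothetical, chosen only for illustration.

```python
import socket
import struct
import threading


def demo_connection_reset() -> str:
    """Return the name of the error seen by the side that is still running."""
    # The "still running" side listens on an ephemeral local port.
    server = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    server.bind(("127.0.0.1", 0))
    server.listen(1)
    port = server.getsockname()[1]

    accepted = threading.Event()

    def early_finisher() -> None:
        # Stand-in for a peer that goes away abruptly.
        peer = socket.create_connection(("127.0.0.1", port))
        accepted.wait(timeout=5)
        # SO_LINGER enabled with a 0 s timeout makes close() send a TCP RST
        # instead of a normal FIN handshake (POSIX struct layout: two ints).
        peer.setsockopt(socket.SOL_SOCKET, socket.SO_LINGER, struct.pack("ii", 1, 0))
        peer.close()

    worker = threading.Thread(target=early_finisher)
    worker.start()

    conn, _ = server.accept()
    conn.settimeout(5)  # avoid hanging forever if something goes wrong
    accepted.set()
    worker.join()  # the peer has now closed its end

    try:
        conn.recv(1024)  # the remaining side still tries to use the link
        result = "no error"
    except ConnectionResetError as exc:  # errno ECONNRESET
        result = type(exc).__name__
    finally:
        conn.close()
        server.close()
    return result


if __name__ == "__main__":
    print(demo_connection_reset())
```

The point of the sketch is only that the error is reported by the process that is still using the connection, not by the one that went away, which is consistent with the error showing up on whichever process is still busy.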