Multi GPU RuntimeError: Expected device cuda:0 but got device cuda:7 #15

zidanexu · 2020-06-04T11:22:08Z

Hi @glenn-jocher,
I am trying to reproduce the training results.

I am using the command above on 8 Tesla P40 GPUs.
After the first training epoch finished, the test process broke.

github-actions · 2020-06-04T11:22:47Z

Hello @zidanexu, thank you for your interest in our work! Please visit our Custom Training Tutorial to get started, and see our Google Colab Notebook, Docker Image, and GCP Quickstart Guide for example environments.

If this is a bug report, please provide screenshots and minimum viable code to reproduce your issue, otherwise we can not help you.

If this is a custom model or data training question, please note that Ultralytics does not provide free personal support. As a leader in vision ML and AI, we do offer professional consulting, from simple expert advice up to delivery of fully customized, end-to-end production solutions for our clients, such as:

Cloud-based AI surveillance systems operating on hundreds of HD video streams in realtime.
Edge AI integrated into custom iOS and Android apps for realtime 30 FPS video inference.
Custom data training, hyperparameter evolution, and model exportation to any destination.

For more information please visit https://www.ultralytics.com.

glenn-jocher · 2020-06-04T18:19:29Z

@zidanexu thank you for your bug report. We can successfully reproduce this issue. It appears to be caused by self.grid, a list attribute of the Detect() layer, which is sent to a device during training. Because it is a Python list rather than a tensor, it is not registered in the layer's buffer list, so it is not transferred with the normal parameters/buffers. We will look into this.
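For readers who want to see the mechanism in isolation, here is a minimal sketch of why a tensor held in a plain list is skipped when a module is moved. It is not the code from this repository or from the fix commit: the class names `ListGrid` and `BufferGrid` are hypothetical, and it uses a dtype change on CPU as a stand-in for a device change so that it runs without any GPU. `Module.to()` walks the same registered parameters and buffers in both cases.

```python
import torch
import torch.nn as nn


class ListGrid(nn.Module):
    """Hypothetical layer that caches its grid in a plain Python list."""

    def __init__(self):
        super().__init__()
        self.conv = nn.Conv2d(1, 1, 1)
        # A plain list is not a parameter or a buffer, so nn.Module
        # does not track the tensor stored inside it.
        self.grid = [torch.zeros(1)]


class BufferGrid(nn.Module):
    """Hypothetical layer that registers its grid as a buffer instead."""

    def __init__(self):
        super().__init__()
        self.conv = nn.Conv2d(1, 1, 1)
        # Registered buffers are converted together with the parameters.
        self.register_buffer("grid", torch.zeros(1))


# .to(dtype) stands in for .to(device): both only touch registered tensors.
list_model = ListGrid().to(torch.float64)
buffer_model = BufferGrid().to(torch.float64)

list_grid_dtype = list_model.grid[0].dtype    # untouched by .to()
buffer_grid_dtype = buffer_model.grid.dtype   # converted by .to()

print("list attribute:", list_grid_dtype)
print("registered buffer:", buffer_grid_dtype)
```

With a device in place of the dtype, the list-held tensor would likewise stay where it was created while the rest of the model moves, which is consistent with the "Expected device cuda:0 but got device cuda:7" message in the title. Registering a buffer is only one possible remedy; the actual change made in this repository may differ.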

glenn-jocher · 2020-06-04T22:24:54Z

Fix complete. git pull to receive fix.

lucasjinreal · 2020-06-11T07:15:02Z

This error still occurs with multi-GPU training on PyTorch 1.5:

RuntimeError: Model replicas must have an equal number of parameters.

It does not seem to be fixed yet. Any ideas?


glenn-jocher changed the title ~~multi gpu train issue~~ Multi GPU RuntimeError: Expected device cuda:0 but got device cuda:7 Jun 4, 2020

glenn-jocher added the bug (Something isn't working) label Jun 4, 2020

glenn-jocher closed this as completed in dbdee3a Jun 4, 2020

glenn-jocher mentioned this issue Jun 5, 2020

Multi GPU RuntimeError: Model replicas must have an equal number of parameters. #11

Closed

glenn-jocher added a commit that referenced this issue Jun 6, 2020

multi-gpu test bug fix #15

5c470d2

matinhosseiny mentioned this issue Jun 23, 2020

RuntimeError: cuDNN error: CUDNN_STATUS_BAD_PARAM #185

Closed

DLLXW mentioned this issue Jul 3, 2020

RuntimeError: CUDA error: no kernel image is available for execution on the device (nms_cuda at /tmp/pip-req-build-9d9zypi6/torchvision/csrc/cuda/nms_cuda.cu:127) #281

Closed

hawkinglai mentioned this issue Oct 30, 2020

I am trying to custom a model with ghostnet base on yolov5 #1249

Closed

wuzuiyuzui mentioned this issue Nov 27, 2020

RuntimeError: cuDNN error: CUDNN_STATUS_INTERNAL_ERROR #1546

Closed

jerryWTMH mentioned this issue Dec 22, 2020

RuntimeError: CUDA error: unspecified launch failure #1752

Closed

alicera mentioned this issue Jun 23, 2021

different gpus to train #3736

Closed

coallar mentioned this issue Sep 18, 2021

CUDA error: the launch timed out and was terminated #4851

Closed

liang-jingyi mentioned this issue Nov 22, 2021

The advantage of yolov5s #5730

Closed

1 task

HeChengHui mentioned this issue Mar 24, 2022

Joint dataset training question #6904

Closed

1 task

xxyClass mentioned this issue Jul 19, 2022

PermissionError: [WinError 5] 拒绝访问。 (Access is denied.) #8626

Closed

1 task

mohammadRezapor mentioned this issue Dec 3, 2022

I can not train yolov5 in jupyter #10384

Closed

2 tasks

manole-alexandru added a commit to manole-alexandru/yolov5-uolo that referenced this issue Apr 12, 2023

No warm up + Moved dropout back ultralytics#15

0ad7de0

Also changed constants to hyperparameters

manole-alexandru added a commit to manole-alexandru/yolov5-uolo that referenced this issue Apr 13, 2023

Reduced Dropout Rate ultralytics#15

3a23002

cool112624 mentioned this issue May 16, 2023

DDP training with multiple gpu using wsl #11519

Closed

1 task

jcluo1994 mentioned this issue Oct 10, 2023

Using multi-GPU training reports errors #12213

Closed

1 task
