Multi GPU RuntimeError: Expected device cuda:0 but got device cuda:7 #15

Closed
zidanexu opened this issue Jun 4, 2020 · 4 comments
Labels
bug Something isn't working

Comments

@zidanexu

zidanexu commented Jun 4, 2020

Hi @glenn-jocher,
I am trying to reproduce the training results.
image
I used the command above on 8 Tesla P40 GPUs.
After the first epoch of training finished, the test process broke.
image

@github-actions
Contributor

github-actions bot commented Jun 4, 2020

Hello @zidanexu, thank you for your interest in our work! Please visit our Custom Training Tutorial to get started, and see our Google Colab Notebook, Docker Image, and GCP Quickstart Guide for example environments.

If this is a bug report, please provide screenshots and minimum viable code to reproduce your issue, otherwise we cannot help you.

If this is a custom model or data training question, please note that Ultralytics does not provide free personal support. As a leader in vision ML and AI, we do offer professional consulting, from simple expert advice up to delivery of fully customized, end-to-end production solutions for our clients, such as:

  • Cloud-based AI surveillance systems operating on hundreds of HD video streams in realtime.
  • Edge AI integrated into custom iOS and Android apps for realtime 30 FPS video inference.
  • Custom data training, hyperparameter evolution, and model exportation to any destination.

For more information please visit https://www.ultralytics.com.

@glenn-jocher changed the title from "multi gpu train issue" to "Multi GPU RuntimeError: Expected device cuda:0 but got device cuda:7" Jun 4, 2020
@glenn-jocher added the bug (Something isn't working) label Jun 4, 2020
@glenn-jocher
Member

@zidanexu thank you for your bug report. We can successfully reproduce this issue. It appears to be caused by self.grid, a list attribute of the Detect() layer, whose contents are sent to a device during training. Because it is a list rather than a tensor, it is not in the layer's buffer list, so it is not transferred between devices like normal parameters/buffers. We will look into this.
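
The mechanism described above can be illustrated with a small sketch. This is not the repository's Detect() code or the fix that was applied: ListGrid, BufferGrid, and tracked_buffer_names are hypothetical names made up for this example. It runs on CPU and uses .double() as a stand-in for a device move, on the assumption that both go through the same parameter/buffer traversal in nn.Module.

```python
import torch
import torch.nn as nn


class ListGrid(nn.Module):
    """Hypothetical layer that caches grids in a plain Python list."""

    def __init__(self, n_layers=3):
        super().__init__()
        # Tensors inside a plain list are not registered with nn.Module,
        # so .to(), .cuda(), and .double() never visit them.
        self.grid = [torch.zeros(1) for _ in range(n_layers)]


class BufferGrid(nn.Module):
    """Hypothetical variant that registers each grid as a buffer."""

    def __init__(self, n_layers=3):
        super().__init__()
        for i in range(n_layers):
            # Registered buffers are converted together with the parameters.
            self.register_buffer(f"grid{i}", torch.zeros(1))


def tracked_buffer_names(module):
    """Return the names of the buffers that nn.Module knows about."""
    return [name for name, _ in module.named_buffers()]


list_model = ListGrid()
buffer_model = BufferGrid()

# Stand-in for a device transfer: converts parameters and buffers only.
list_model.double()
buffer_model.double()

print(tracked_buffer_names(list_model))    # no buffers are tracked
print(tracked_buffer_names(buffer_model))  # grid0, grid1, grid2
print(list_model.grid[0].dtype)            # unchanged by .double()
print(buffer_model.grid0.dtype)            # converted by .double()
```

In a multi-GPU run, the same gap would leave the list's tensors on whichever device they were created on while the rest of the layer moves, which is consistent with the "Expected device cuda:0 but got device cuda:7" message in the title.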

@glenn-jocher
Member

Fix complete. Run git pull to receive the fix.
Screen Shot 2020-06-04 at 3 19 41 PM

@lucasjinreal

This error still occurs with multi-GPU training on PyTorch 1.5:

RuntimeError: Model replicas must have an equal number of parameters.

It does not seem to be fixed yet. Any ideas?

manole-alexandru added a commit to manole-alexandru/yolov5-uolo that referenced this issue Apr 12, 2023
Also changed constants to hyperparameters
manole-alexandru added a commit to manole-alexandru/yolov5-uolo that referenced this issue Apr 13, 2023