
YOLOv5 issues with torch==1.12 on Multi-GPU systems #8395

Closed
1 of 2 tasks
glenn-jocher opened this issue Jun 29, 2022 · 16 comments · Fixed by #8497 or Go-Autonomous/yolov5#15
Labels
bug Something isn't working

Comments

@glenn-jocher (Member)

Search before asking

  • I have searched the YOLOv5 issues and found no similar bug report.

YOLOv5 Component

Training, Multi-GPU

Bug

With torch 1.12 and the current YOLOv5 master, all GPUs are utilized even when a single-GPU command is run (see the Minimal Reproducible Example below).

The error does not occur with torch==1.11.

@AyushExel FYI

Environment

Docker image

Minimal Reproducible Example

python train.py --epochs 10 --device 7

Additional

Temp workaround is to use torch 1.11

pip install torch==1.11.0+cu113 torchvision==0.12.0+cu113 torchaudio==0.11.0 --extra-index-url https://download.pytorch.org/whl/cu113
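For readers who want to check whether their installed torch is the affected release, a minimal sketch (the helper name is made up for illustration; it only inspects the version string, dropping any local build tag):

```python
# Hypothetical helper, not part of YOLOv5: flag the affected torch release.
def is_affected_torch(version: str) -> bool:
    base = version.split('+')[0]  # drop a local build tag like '+cu113'
    return base == '1.12.0'
```

Call it as e.g. `is_affected_torch(torch.__version__)`.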

Are you willing to submit a PR?

  • Yes I'd like to help by submitting a PR!
@glenn-jocher glenn-jocher added the bug Something isn't working label Jun 29, 2022

github-actions bot commented Jun 29, 2022

👋 Hello @glenn-jocher, thank you for your interest in YOLOv5 🚀! Please visit our ⭐️ Tutorials to get started, where you can find quickstart guides for simple tasks like Custom Data Training all the way to advanced concepts like Hyperparameter Evolution.

If this is a 🐛 Bug Report, please provide screenshots and minimum viable code to reproduce your issue, otherwise we can not help you.

If this is a custom training ❓ Question, please provide as much information as possible, including dataset images, training logs, screenshots, and a public link to online W&B logging if available.

For business inquiries or professional support requests please visit https://ultralytics.com or email support@ultralytics.com.

Requirements

Python>=3.7.0 with all requirements.txt installed including PyTorch>=1.7. To get started:

git clone https://github.com/ultralytics/yolov5  # clone
cd yolov5
pip install -r requirements.txt  # install

Environments

YOLOv5 may be run in any of the following up-to-date verified environments (with all dependencies including CUDA/CUDNN, Python and PyTorch preinstalled):

Status

CI CPU testing

If this badge is green, all YOLOv5 GitHub Actions Continuous Integration (CI) tests are currently passing. CI tests verify correct operation of YOLOv5 training (train.py), validation (val.py), inference (detect.py) and export (export.py) on macOS, Windows, and Ubuntu every 24 hours and on every commit.

@AyushExel (Contributor)

I wonder how a bug this big made it into the release


UnglvKitDe commented Jun 29, 2022

@glenn-jocher I cannot confirm this. Tested on GPU servers with 2, 4, and 8 GPUs (RTX 2080 Ti / A6000).
EDIT: Maybe a workaround is to use CUDA_VISIBLE_DEVICES?

@glenn-jocher (Member, Author)

@AyushExel it's probably not a torch bug but rather related to our specific implementation for selecting devices in select_device() below, which relies on defining CUDA_VISIBLE_DEVICES in the environment before torch reads it.

def select_device(device='', batch_size=0, newline=True):
    # device = None or 'cpu' or 0 or '0' or '0,1,2,3'
    s = f'YOLOv5 🚀 {git_describe() or file_date()} Python-{platform.python_version()} torch-{torch.__version__} '
    device = str(device).strip().lower().replace('cuda:', '').replace('none', '')  # to string, 'cuda:0' to '0'
    cpu = device == 'cpu'
    mps = device == 'mps'  # Apple Metal Performance Shaders (MPS)
    if cpu or mps:
        os.environ['CUDA_VISIBLE_DEVICES'] = '-1'  # force torch.cuda.is_available() = False
    elif device:  # non-cpu device requested
        os.environ['CUDA_VISIBLE_DEVICES'] = device  # set environment variable - must be before assert is_available()
        assert torch.cuda.is_available() and torch.cuda.device_count() >= len(device.replace(',', '')), \
            f"Invalid CUDA '--device {device}' requested, use '--device cpu' or pass valid CUDA device(s)"

    if not cpu and torch.cuda.is_available():  # prefer GPU if available
        devices = device.split(',') if device else '0'  # range(torch.cuda.device_count())  # i.e. 0,1,6,7
        n = len(devices)  # device count
        if n > 1 and batch_size > 0:  # check batch_size is divisible by device_count
            assert batch_size % n == 0, f'batch-size {batch_size} not multiple of GPU count {n}'
        space = ' ' * (len(s) + 1)
        for i, d in enumerate(devices):
            p = torch.cuda.get_device_properties(i)
            s += f"{'' if i == 0 else space}CUDA:{d} ({p.name}, {p.total_memory / (1 << 20):.0f}MiB)\n"  # bytes to MB
        arg = 'cuda:0'
    elif not cpu and getattr(torch, 'has_mps', False) and torch.backends.mps.is_available():  # prefer MPS if available
        s += 'MPS\n'
        arg = 'mps'
    else:  # revert to CPU
        s += 'CPU\n'
        arg = 'cpu'

    if not newline:
        s = s.rstrip()
    LOGGER.info(s.encode().decode('ascii', 'ignore') if platform.system() == 'Windows' else s)  # emoji-safe
    return torch.device(arg)
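The device-string normalization that select_device() performs can be isolated into a pure-Python helper for illustration (a hypothetical sketch, not YOLOv5 code; the name and return shape are made up):

```python
def parse_device_string(device=''):
    """Normalize a --device argument the way select_device() does.

    Returns ('cpu', []), ('mps', []) or ('cuda', [ids]).
    """
    # 'cuda:0' -> '0', None -> '', always a lowercase string
    device = str(device).strip().lower().replace('cuda:', '').replace('none', '')
    if device == 'cpu':
        return 'cpu', []
    if device == 'mps':
        return 'mps', []
    # '0,1,2,3' -> ['0', '1', '2', '3']; empty string defaults to device 0
    ids = [d for d in device.split(',') if d] if device else ['0']
    return 'cuda', ids
```

For example, `parse_device_string('cuda:0')` and `parse_device_string(0)` both normalize to `('cuda', ['0'])`.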

If I run the reproduce command above in Docker image, python train.py --epochs 10 --device 7, I see this on our server:

[Screenshot, 2022-06-30: nvidia-smi showing utilization across multiple GPUs]

@glenn-jocher (Member, Author)

@UnglvKitDe you tested a single-GPU training command above python train.py --epochs 10 --device 7 with torch 1.12 and did not see multi-GPU usage on nvidia-smi?


UnglvKitDe commented Jun 30, 2022

@glenn-jocher

@UnglvKitDe you tested a single-GPU training command above python train.py --epochs 10 --device 7 with torch 1.12 and did not see multi-GPU usage on nvidia-smi?

@glenn-jocher Not exactly, but in the end the same. I tested it on coco128 (torch 1.12 and CUDA 11.6), so python train.py --data coco128.yaml --device 0 --epochs 10.


glenn-jocher commented Jun 30, 2022

@UnglvKitDe oh, strange. This is basically the same command I used, but with device 7. If I try --device 0 I also get the bug though.

Well, ok, I'll experiment some more. In my experiments 1.11 works correctly but 1.12 does not (master branch in Docker) with CUDA 11.3.

@glenn-jocher glenn-jocher changed the title YOLOv5 issues with torch 1.12 on Multi-GPU systems YOLOv5 issues with torch==1.12 on Multi-GPU systems Jun 30, 2022

mjun0812 commented Jul 5, 2022

It appears that this problem occurred because of a change in the timing of reading environment variables in PyTorch 1.12.

My GPU environment is here.

CUDA:0 NVIDIA RTX A6000, 48685.3125MB
CUDA:1 NVIDIA GeForce RTX 3090, 24268.3125MB

The following code worked correctly in version 1.11.

import os
import torch

os.environ["CUDA_DEVICE_ORDER"] = "PCI_BUS_ID"
os.environ["CUDA_VISIBLE_DEVICES"] = "1"

# print using GPU Info
print(f"Using GPU is CUDA:{os.environ['CUDA_VISIBLE_DEVICES']}")

for i in range(torch.cuda.device_count()):
    info = torch.cuda.get_device_properties(i)
    print(f"CUDA:{i} {info.name}, {info.total_memory / 1024 ** 2}MB")

In version 1.11, the output of this code is below.

Using GPU is CUDA:1
CUDA:0 NVIDIA GeForce RTX 3090, 24268.3125MB

But version 1.12's output is below.

Using GPU is CUDA:1
CUDA:0 NVIDIA RTX A6000, 48685.3125MB
CUDA:1 NVIDIA GeForce RTX 3090, 24268.3125MB

As additional information, the environment variable change worked correctly when done before import torch.

import os

os.environ["CUDA_DEVICE_ORDER"] = "PCI_BUS_ID"
os.environ["CUDA_VISIBLE_DEVICES"] = "1"

import torch


# print using GPU Info
print(f"Using GPU is CUDA:{os.environ['CUDA_VISIBLE_DEVICES']}")

for i in range(torch.cuda.device_count()):
    info = torch.cuda.get_device_properties(i)
    print(f"CUDA:{i} {info.name}, {info.total_memory / 1024 ** 2}MB")

Output is below.

Using GPU is CUDA:1
CUDA:0 NVIDIA GeForce RTX 3090, 24268.3125MB

@glenn-jocher (Member, Author)

@mjun0812 interesting, thanks for the info! Unfortunately we can't specify the environment variables before loading torch. Does re-loading torch after defining the environment variables have any effect? i.e.:

import os
import torch

os.environ["CUDA_DEVICE_ORDER"] = "PCI_BUS_ID"
os.environ["CUDA_VISIBLE_DEVICES"] = "1"

import torch


# print using GPU Info
print(f"Using GPU is CUDA:{os.environ['CUDA_VISIBLE_DEVICES']}")

for i in range(torch.cuda.device_count()):
    info = torch.cuda.get_device_properties(i)
    print(f"CUDA:{i} {info.name}, {info.total_memory / 1024 ** 2}MB")


mjun0812 commented Jul 5, 2022

@glenn-jocher Thank you for your reply!
I tried your code and the following code.

Your suggestion code:

import os
import torch

os.environ["CUDA_DEVICE_ORDER"] = "PCI_BUS_ID"
os.environ["CUDA_VISIBLE_DEVICES"] = "1"

import torch

# print using GPU Info
print(f"Using GPU is CUDA:{os.environ['CUDA_VISIBLE_DEVICES']}")

for i in range(torch.cuda.device_count()):
    info = torch.cuda.get_device_properties(i)
    print(f"CUDA:{i} {info.name}, {info.total_memory / 1024 ** 2}MB")

and del torch

import os
import torch

os.environ["CUDA_DEVICE_ORDER"] = "PCI_BUS_ID"
os.environ["CUDA_VISIBLE_DEVICES"] = "1"

del torch
import torch

# print using GPU Info
print(f"Using GPU is CUDA:{os.environ['CUDA_VISIBLE_DEVICES']}")

for i in range(torch.cuda.device_count()):
    info = torch.cuda.get_device_properties(i)
    print(f"CUDA:{i} {info.name}, {info.total_memory / 1024 ** 2}MB")

and importlib.reload()

import os
import torch
import importlib

os.environ["CUDA_DEVICE_ORDER"] = "PCI_BUS_ID"
os.environ["CUDA_VISIBLE_DEVICES"] = "1"

importlib.reload(torch)

# print using GPU Info
print(f"Using GPU is CUDA:{os.environ['CUDA_VISIBLE_DEVICES']}")

for i in range(torch.cuda.device_count()):
    info = torch.cuda.get_device_properties(i)
    print(f"CUDA:{i} {info.name}, {info.total_memory / 1024 ** 2}MB")

The above outputs were the same.

Using GPU is CUDA:1
CUDA:0 NVIDIA RTX A6000, 48685.3125MB
CUDA:1 NVIDIA GeForce RTX 3090, 24268.3125MB

Unfortunately, none of these resolved the issue...
I will continue to investigate.
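Since torch 1.12 only honors CUDA_VISIBLE_DEVICES when it is set before `import torch`, one workaround sketch, echoing @UnglvKitDe's earlier CUDA_VISIBLE_DEVICES suggestion, is to set the variable in a parent process and launch the actual script in a fresh interpreter (the helper name here is hypothetical, not thread-proposed code):

```python
import os
import subprocess
import sys

def run_with_visible_devices(argv, device):
    """Launch a fresh Python process with CUDA_VISIBLE_DEVICES preset.

    Because torch 1.12.0 initializes CUDA during `import torch`, setting
    the variable before the child interpreter starts guarantees torch
    sees it in time.
    """
    env = dict(os.environ, CUDA_VISIBLE_DEVICES=str(device))
    return subprocess.run([sys.executable, *argv], env=env,
                          capture_output=True, text=True, check=True)
```

Usage would look like `run_with_visible_devices(['train.py', '--epochs', '10'], 7)`.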

@glenn-jocher (Member, Author)

@mjun0812 too bad. Yes let me know if you find a solution!


mjun0812 commented Jul 5, 2022

@glenn-jocher I raised this issue in the PyTorch repository, and it has been fixed in the latest master branch.
The cause is that CUDA initialization was done during import torch.
The binary version installable with pip will likely be fixed in the next torch release, 1.12.1.

Therefore, it may be necessary to modify requirements.txt as follows:

yolov5/requirements.txt

Lines 12 to 13 in fdc9d91

torch>=1.7.0
torchvision>=0.8.1

+ torch>=1.7.0,!=1.12.0  # https://github.com/ultralytics/yolov5/issues/8395
+ torchvision>=0.8.1,!=0.13.0  # https://github.com/ultralytics/yolov5/issues/8395
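The `!=` exclusion behaves as expected under pip's PEP 440 specifier rules; a small sketch using the `packaging` library (assumed installed, as it ships alongside modern pip/setuptools) confirms which versions the proposed requirement admits:

```python
from packaging.specifiers import SpecifierSet

# The proposed requirement: any torch >= 1.7.0 except the broken 1.12.0
spec = SpecifierSet(">=1.7.0,!=1.12.0")

print("1.11.0" in spec)  # older releases still allowed
print("1.12.0" in spec)  # the affected release is excluded
print("1.12.1" in spec)  # the fixed release is allowed
```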

I am ready to make a pull request for the above fixes.

@glenn-jocher (Member, Author)

@mjun0812 got it, thanks for the update! I see your PR, will take a look there.

glenn-jocher pushed a commit that referenced this issue Jul 6, 2022
Exclude torch==1.12.0, torchvision==0.13.0
@glenn-jocher (Member, Author)

@AyushExel torch 1.12 issue resolved in upcoming torch 1.12.1, so our fix is simply to exclude 1.12 in requirements.txt in #8497. Problem solved :)

Shivvrat pushed a commit to Shivvrat/epic-yolov5 that referenced this issue Jul 12, 2022
@zhiqwang zhiqwang mentioned this issue Jul 15, 2022
1 task
@DoSquared

How did you guys fix the PyTorch error? Should I install 1.12.1?

@glenn-jocher (Member, Author)

@DaliaMahdy unfortunately 1.12.1 is not out yet; the latest stable is 1.12.0. You can install the nightly build, for example, to resolve the issue, or simply use 1.11.0.
