
YOLOv5 issues with torch==1.12 on Multi-GPU systems #8395

Closed
1 of 2 tasks
glenn-jocher opened this issue Jun 29, 2022 · 16 comments · Fixed by #8497 or Go-Autonomous/yolov5#15
Labels
bug Something isn't working

Comments

@glenn-jocher (Member)

Search before asking

  • I have searched the YOLOv5 issues and found no similar bug report.

YOLOv5 Component

Training, Multi-GPU

Bug

With torch 1.12 and the current YOLOv5 master, all GPUs are utilized even when a single-GPU command is run (see the Minimal Reproducible Example below).

The error does not occur with torch==1.11.

@AyushExel FYI

Environment

Docker image

Minimal Reproducible Example

python train.py --epochs 10 --device 7

Additional

Temp workaround is to use torch 1.11

pip install torch==1.11.0+cu113 torchvision==0.12.0+cu113 torchaudio==0.11.0 --extra-index-url https://download.pytorch.org/whl/cu113
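For readers who want to check whether their installed torch is the affected release, a minimal sketch (the helper name is made up for illustration; it only inspects the version string, dropping any local build tag):

```python
# Hypothetical helper, not part of YOLOv5: flag the affected torch release.
def is_affected_torch(version: str) -> bool:
    base = version.split('+')[0]  # drop a local build tag like '+cu113'
    return base == '1.12.0'
```

Call it as e.g. `is_affected_torch(torch.__version__)`.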

Are you willing to submit a PR?

  • Yes I'd like to help by submitting a PR!
@glenn-jocher glenn-jocher added the bug Something isn't working label Jun 29, 2022

github-actions bot commented Jun 29, 2022

👋 Hello @glenn-jocher, thank you for your interest in YOLOv5 🚀! Please visit our ⭐️ Tutorials to get started, where you can find quickstart guides for simple tasks like Custom Data Training all the way to advanced concepts like Hyperparameter Evolution.

If this is a 🐛 Bug Report, please provide screenshots and minimum viable code to reproduce your issue, otherwise we can not help you.

If this is a custom training ❓ Question, please provide as much information as possible, including dataset images, training logs, screenshots, and a public link to online W&B logging if available.

For business inquiries or professional support requests please visit https://ultralytics.com or email support@ultralytics.com.

Requirements

Python>=3.7.0 with all requirements.txt installed including PyTorch>=1.7. To get started:

git clone https://github.com/ultralytics/yolov5  # clone
cd yolov5
pip install -r requirements.txt  # install

Environments

YOLOv5 may be run in any of the following up-to-date verified environments (with all dependencies including CUDA/CUDNN, Python and PyTorch preinstalled):

Status

CI CPU testing

If this badge is green, all YOLOv5 GitHub Actions Continuous Integration (CI) tests are currently passing. CI tests verify correct operation of YOLOv5 training (train.py), validation (val.py), inference (detect.py) and export (export.py) on macOS, Windows, and Ubuntu every 24 hours and on every commit.

@AyushExel (Contributor)

I wonder how a bug this big made it into the release


UnglvKitDe commented Jun 29, 2022

@glenn-jocher I cannot confirm this. Tested on GPU servers with 2, 4, and 8 GPUs (RTX 2080 Ti / A6000).
EDIT: Maybe a workaround is to use CUDA_VISIBLE_DEVICES?

@glenn-jocher (Member, Author)

@AyushExel it's probably not a torch bug but rather related to our specific implementation for selecting devices in select_device() below, which relies on defining CUDA_VISIBLE_DEVICES in the environment before torch reads it.

def select_device(device='', batch_size=0, newline=True):
    # device = None or 'cpu' or 0 or '0' or '0,1,2,3'
    s = f'YOLOv5 🚀 {git_describe() or file_date()} Python-{platform.python_version()} torch-{torch.__version__} '
    device = str(device).strip().lower().replace('cuda:', '').replace('none', '')  # to string, 'cuda:0' to '0'
    cpu = device == 'cpu'
    mps = device == 'mps'  # Apple Metal Performance Shaders (MPS)
    if cpu or mps:
        os.environ['CUDA_VISIBLE_DEVICES'] = '-1'  # force torch.cuda.is_available() = False
    elif device:  # non-cpu device requested
        os.environ['CUDA_VISIBLE_DEVICES'] = device  # set environment variable - must be before assert is_available()
        assert torch.cuda.is_available() and torch.cuda.device_count() >= len(device.replace(',', '')), \
            f"Invalid CUDA '--device {device}' requested, use '--device cpu' or pass valid CUDA device(s)"

    if not cpu and torch.cuda.is_available():  # prefer GPU if available
        devices = device.split(',') if device else '0'  # range(torch.cuda.device_count())  # i.e. 0,1,6,7
        n = len(devices)  # device count
        if n > 1 and batch_size > 0:  # check batch_size is divisible by device_count
            assert batch_size % n == 0, f'batch-size {batch_size} not multiple of GPU count {n}'
        space = ' ' * (len(s) + 1)
        for i, d in enumerate(devices):
            p = torch.cuda.get_device_properties(i)
            s += f"{'' if i == 0 else space}CUDA:{d} ({p.name}, {p.total_memory / (1 << 20):.0f}MiB)\n"  # bytes to MB
        arg = 'cuda:0'
    elif not cpu and getattr(torch, 'has_mps', False) and torch.backends.mps.is_available():  # prefer MPS if available
        s += 'MPS\n'
        arg = 'mps'
    else:  # revert to CPU
        s += 'CPU\n'
        arg = 'cpu'

    if not newline:
        s = s.rstrip()
    LOGGER.info(s.encode().decode('ascii', 'ignore') if platform.system() == 'Windows' else s)  # emoji-safe
    return torch.device(arg)
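The device-string normalization that select_device() performs can be isolated into a pure-Python helper for illustration (a hypothetical sketch, not YOLOv5 code; the name and return shape are made up):

```python
def parse_device_string(device=''):
    """Normalize a --device argument the way select_device() does.

    Returns ('cpu', []), ('mps', []) or ('cuda', [ids]).
    """
    # 'cuda:0' -> '0', None -> '', always a lowercase string
    device = str(device).strip().lower().replace('cuda:', '').replace('none', '')
    if device == 'cpu':
        return 'cpu', []
    if device == 'mps':
        return 'mps', []
    # '0,1,2,3' -> ['0', '1', '2', '3']; empty string defaults to device 0
    ids = [d for d in device.split(',') if d] if device else ['0']
    return 'cuda', ids
```

For example, `parse_device_string('cuda:0')` and `parse_device_string(0)` both normalize to `('cuda', ['0'])`.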

If I run the reproduce command above in Docker image, python train.py --epochs 10 --device 7, I see this on our server:

[Screenshot, 2022-06-30: nvidia-smi showing utilization across multiple GPUs]

@glenn-jocher (Member, Author)

@UnglvKitDe you tested a single-GPU training command above python train.py --epochs 10 --device 7 with torch 1.12 and did not see multi-GPU usage on nvidia-smi?


UnglvKitDe commented Jun 30, 2022

@glenn-jocher

@UnglvKitDe you tested a single-GPU training command above python train.py --epochs 10 --device 7 with torch 1.12 and did not see multi-GPU usage on nvidia-smi?

@glenn-jocher Not exactly, but in the end the same. I tested it on coco128 (torch 1.12 and CUDA 11.6), so python train.py --data coco128.yaml --device 0 --epochs 10.


glenn-jocher commented Jun 30, 2022

@UnglvKitDe oh, strange. This is basically the same command I used, but with device 7. If I try --device 0 I also get the bug though.

Well, ok, I'll experiment some more. In my experiments 1.11 works correctly but 1.12 does not (master branch in Docker) with CUDA 11.3.

@glenn-jocher glenn-jocher changed the title YOLOv5 issues with torch 1.12 on Multi-GPU systems YOLOv5 issues with torch==1.12 on Multi-GPU systems Jun 30, 2022

mjun0812 commented Jul 5, 2022

It appears that this problem occurred because of a change in the timing of reading environment variables in PyTorch 1.12.

My GPU environment is here.

CUDA:0 NVIDIA RTX A6000, 48685.3125MB
CUDA:1 NVIDIA GeForce RTX 3090, 24268.3125MB

The following code worked correctly in version 1.11.

import os
import torch

os.environ["CUDA_DEVICE_ORDER"] = "PCI_BUS_ID"
os.environ["CUDA_VISIBLE_DEVICES"] = "1"

# print using GPU Info
print(f"Using GPU is CUDA:{os.environ['CUDA_VISIBLE_DEVICES']}")

for i in range(torch.cuda.device_count()):
    info = torch.cuda.get_device_properties(i)
    print(f"CUDA:{i} {info.name}, {info.total_memory / 1024 ** 2}MB")

In version 1.11, the output of this code is below.

Using GPU is CUDA:1
CUDA:0 NVIDIA GeForce RTX 3090, 24268.3125MB

But version 1.12's output is below.

Using GPU is CUDA:1
CUDA:0 NVIDIA RTX A6000, 48685.3125MB
CUDA:1 NVIDIA GeForce RTX 3090, 24268.3125MB

As additional information, the environment variable change worked correctly when done before import torch.

import os

os.environ["CUDA_DEVICE_ORDER"] = "PCI_BUS_ID"
os.environ["CUDA_VISIBLE_DEVICES"] = "1"

import torch


# print using GPU Info
print(f"Using GPU is CUDA:{os.environ['CUDA_VISIBLE_DEVICES']}")

for i in range(torch.cuda.device_count()):
    info = torch.cuda.get_device_properties(i)
    print(f"CUDA:{i} {info.name}, {info.total_memory / 1024 ** 2}MB")

Output is below.

Using GPU is CUDA:1
CUDA:0 NVIDIA GeForce RTX 3090, 24268.3125MB

@glenn-jocher (Member, Author)

@mjun0812 interesting, thanks for the info! Unfortunately we can't specify the environment variables before loading torch. Does re-loading torch after defining the environment variables have any effect? i.e.:

import os
import torch

os.environ["CUDA_DEVICE_ORDER"] = "PCI_BUS_ID"
os.environ["CUDA_VISIBLE_DEVICES"] = "1"

import torch


# print using GPU Info
print(f"Using GPU is CUDA:{os.environ['CUDA_VISIBLE_DEVICES']}")

for i in range(torch.cuda.device_count()):
    info = torch.cuda.get_device_properties(i)
    print(f"CUDA:{i} {info.name}, {info.total_memory / 1024 ** 2}MB")


mjun0812 commented Jul 5, 2022

@glenn-jocher Thank you for your reply!
I tried your code and the following code.

Your suggestion code:

import os
import torch

os.environ["CUDA_DEVICE_ORDER"] = "PCI_BUS_ID"
os.environ["CUDA_VISIBLE_DEVICES"] = "1"

import torch

# print using GPU Info
print(f"Using GPU is CUDA:{os.environ['CUDA_VISIBLE_DEVICES']}")

for i in range(torch.cuda.device_count()):
    info = torch.cuda.get_device_properties(i)
    print(f"CUDA:{i} {info.name}, {info.total_memory / 1024 ** 2}MB")

and del torch

import os
import torch

os.environ["CUDA_DEVICE_ORDER"] = "PCI_BUS_ID"
os.environ["CUDA_VISIBLE_DEVICES"] = "1"

del torch
import torch

# print using GPU Info
print(f"Using GPU is CUDA:{os.environ['CUDA_VISIBLE_DEVICES']}")

for i in range(torch.cuda.device_count()):
    info = torch.cuda.get_device_properties(i)
    print(f"CUDA:{i} {info.name}, {info.total_memory / 1024 ** 2}MB")

and importlib.reload()

import os
import torch
import importlib

os.environ["CUDA_DEVICE_ORDER"] = "PCI_BUS_ID"
os.environ["CUDA_VISIBLE_DEVICES"] = "1"

importlib.reload(torch)

# print using GPU Info
print(f"Using GPU is CUDA:{os.environ['CUDA_VISIBLE_DEVICES']}")

for i in range(torch.cuda.device_count()):
    info = torch.cuda.get_device_properties(i)
    print(f"CUDA:{i} {info.name}, {info.total_memory / 1024 ** 2}MB")

The above outputs were the same.

Using GPU is CUDA:1
CUDA:0 NVIDIA RTX A6000, 48685.3125MB
CUDA:1 NVIDIA GeForce RTX 3090, 24268.3125MB

Unfortunately, none of these resolved the issue...
I will continue to investigate.
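Since torch 1.12 only honors CUDA_VISIBLE_DEVICES when it is set before `import torch`, one workaround sketch, echoing @UnglvKitDe's earlier CUDA_VISIBLE_DEVICES suggestion, is to set the variable in a parent process and launch the actual script in a fresh interpreter (the helper name here is hypothetical, not thread-proposed code):

```python
import os
import subprocess
import sys

def run_with_visible_devices(argv, device):
    """Launch a fresh Python process with CUDA_VISIBLE_DEVICES preset.

    Because torch 1.12.0 initializes CUDA during `import torch`, setting
    the variable before the child interpreter starts guarantees torch
    sees it in time.
    """
    env = dict(os.environ, CUDA_VISIBLE_DEVICES=str(device))
    return subprocess.run([sys.executable, *argv], env=env,
                          capture_output=True, text=True, check=True)
```

Usage would look like `run_with_visible_devices(['train.py', '--epochs', '10'], 7)`.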

@glenn-jocher (Member, Author)

@mjun0812 too bad. Yes let me know if you find a solution!


mjun0812 commented Jul 5, 2022

@glenn-jocher I raised this issue in the PyTorch repository, and it has been fixed in the latest master branch.
The cause is that CUDA initialization was done during import torch.
The binary version installable with pip will likely be fixed in the next torch release, 1.12.1.

Therefore, it may be necessary to modify requirements.txt as follows:

yolov5/requirements.txt

Lines 12 to 13 in fdc9d91

torch>=1.7.0
torchvision>=0.8.1

+ torch>=1.7.0,!=1.12.0  # https://github.com/ultralytics/yolov5/issues/8395
+ torchvision>=0.8.1,!=0.13.0  # https://github.com/ultralytics/yolov5/issues/8395
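The `!=` exclusion behaves as expected under pip's PEP 440 specifier rules; a small sketch using the `packaging` library (assumed installed, as it ships alongside modern pip/setuptools) confirms which versions the proposed requirement admits:

```python
from packaging.specifiers import SpecifierSet

# The proposed requirement: any torch >= 1.7.0 except the broken 1.12.0
spec = SpecifierSet(">=1.7.0,!=1.12.0")

print("1.11.0" in spec)  # older releases still allowed
print("1.12.0" in spec)  # the affected release is excluded
print("1.12.1" in spec)  # the fixed release is allowed
```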

I am ready to make a pull request for the above fixes.

@glenn-jocher (Member, Author)

@mjun0812 got it, thanks for the update! I see your PR, will take a look there.

glenn-jocher pushed a commit that referenced this issue Jul 6, 2022
Exclude torch==1.12.0, torchvision==0.13.0
@glenn-jocher (Member, Author)

@AyushExel torch 1.12 issue resolved in upcoming torch 1.12.1, so our fix is simply to exclude 1.12 in requirements.txt in #8497. Problem solved :)

Shivvrat pushed a commit to Shivvrat/epic-yolov5 that referenced this issue Jul 12, 2022
@zhiqwang zhiqwang mentioned this issue Jul 15, 2022
1 task
@DoSquared

How did you guys fix the PyTorch error? Should I install 1.12.1?

@glenn-jocher (Member, Author)

@DaliaMahdy unfortunately 1.12.1 is not out yet; the latest stable is 1.12.0. You can install the nightly build, for example, to resolve the issue, or simply use 1.11.0.
