I am getting nan and no predictions at all. #5815

LightCannon · 2021-11-28T19:55:03Z

Search before asking

I have searched the YOLOv5 issues and discussions and found no similar questions.

Question

Hello everyone, I am new to YOLOv5 and I have a problem whose cause I cannot figure out.
I am training with a custom dataset (trying a low number of epochs first), but the box and obj losses are nan. Also, no detections appear on the validation images.

I have used this command to train:
python train.py --img 412 --batch 2 --epochs 2 --data people.yaml --cfg models\yolov5s.yaml --name pm1 --workers 6

There is an issue here also discussing the same problem. However, its comments point to environment problems, and I still cannot figure out what is wrong with mine. Here is my environment:

Windows 10, 16 GB RAM
NVIDIA GeForce GTX 1660 Ti, 6144MiB
CUDA 11.3
Python 3.8
torch==1.10.0
torchaudio==0.10.0
torchvision==0.11.1

and I am working on this dataset: https://github.com/ucuapps/top-view-multi-person-tracking
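The version details listed above can be collected in one go. This is only a minimal sketch: the helper name collect_env is made up for illustration, and it assumes nothing beyond the standard library plus PyTorch if it happens to be installed.

```python
import platform


def collect_env():
    """Return a dict describing the Python / PyTorch / CUDA setup."""
    info = {"os": platform.platform(), "python": platform.python_version()}
    try:
        import torch
    except ImportError:
        info["torch"] = None  # PyTorch is not installed in this interpreter
        return info
    info["torch"] = torch.__version__
    # CUDA version the installed wheel was built against (None for CPU-only builds)
    info["cuda_build"] = torch.version.cuda
    info["cuda_available"] = torch.cuda.is_available()
    if info["cuda_available"]:
        info["gpu"] = torch.cuda.get_device_name(0)
        info["cudnn"] = torch.backends.cudnn.version()
    return info


if __name__ == "__main__":
    for key, value in collect_env().items():
        print(f"{key}: {value}")
```

Pasting this output into a report shows which CUDA build PyTorch itself uses, which can differ from the CUDA toolkit installed on the system.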

I would appreciate any help with fixing this problem and getting it to work well. Thanks

Additional

No response

github-actions · 2021-11-28T19:55:40Z

👋 Hello @LightCannon, thank you for your interest in YOLOv5 🚀! Please visit our ⭐️ Tutorials to get started, where you can find quickstart guides for simple tasks like Custom Data Training all the way to advanced concepts like Hyperparameter Evolution.

If this is a 🐛 Bug Report, please provide screenshots and minimum viable code to reproduce your issue, otherwise we can not help you.

If this is a custom training ❓ Question, please provide as much information as possible, including dataset images, training logs, screenshots, and a public link to online W&B logging if available.

For business inquiries or professional support requests please visit https://ultralytics.com or email Glenn Jocher at glenn.jocher@ultralytics.com.

Requirements

Python>=3.6.0 with all requirements.txt installed including PyTorch>=1.7. To get started:

$ git clone https://github.com/ultralytics/yolov5
$ cd yolov5
$ pip install -r requirements.txt

Environments

YOLOv5 may be run in any of the following up-to-date verified environments (with all dependencies including CUDA/CUDNN, Python and PyTorch preinstalled):

Google Colab and Kaggle notebooks with free GPU
Google Cloud Deep Learning VM. See GCP Quickstart Guide
Amazon Deep Learning AMI. See AWS Quickstart Guide
Docker Image. See Docker Quickstart Guide

Status

If this badge is green, all YOLOv5 GitHub Actions Continuous Integration (CI) tests are currently passing. CI tests verify correct operation of YOLOv5 training (train.py), validation (val.py), inference (detect.py) and export (export.py) on MacOS, Windows, and Ubuntu every 24 hours and on every commit.

glenn-jocher · 2021-11-28T21:28:48Z

@LightCannon this might be a windows/conda/CUDA11 bug that PyTorch has as mentioned in some other issues, in which case downgrading to CUDA 10 would solve this.

Or you may have some problems with your dataset labels. Check your mosaic jpgs to ensure your labels are correct and follow the instructions here:
https://docs.ultralytics.com/yolov5/tutorials/train_custom_data

LightCannon · 2021-11-28T21:39:51Z

I have downgraded to CUDA 10.2 and you are right, this is a bug from CUDA 11.3 and everything works now with CUDA 10.2. Thanks for your help.

Zengyf-CVer · 2021-11-29T02:48:19Z

@LightCannon
Because I did not see a screenshot of your virtual environment, I guess you installed PyTorch through pip install torchvision. If you want to use CUDA 11.x, you can try the install command from the official website, as shown in the figure:
pip3 install torch==1.10.0+cu113 torchvision==0.11.1+cu113 torchaudio===0.10.0+cu113 -f https://download.pytorch.org/whl/cu113/torch_stable.html

In addition, whether CUDA 11.x works depends on the graphics card model you are using. Some graphics cards work very well with CUDA 11.x.

github-actions · 2021-12-30T00:12:53Z

👋 Hello, this issue has been automatically marked as stale because it has not had recent activity. Please note it will be closed if no further activity occurs.

Access additional YOLOv5 🚀 resources:

Wiki – https://github.com/ultralytics/yolov5/wiki
Tutorials – https://docs.ultralytics.com/yolov5
Docs – https://docs.ultralytics.com

Access additional Ultralytics ⚡ resources:

Ultralytics HUB – https://ultralytics.com/hub
Vision API – https://ultralytics.com/yolov5
About Us – https://ultralytics.com/about
Join Our Team – https://ultralytics.com/work
Contact Us – https://ultralytics.com/contact

Feel free to inform us of any other issues you discover or feature requests that come to mind in the future. Pull Requests (PRs) are also always welcomed!

Thank you for your contributions to YOLOv5 🚀 and Vision AI ⭐!

ozett · 2022-01-17T16:24:01Z

glad that i found that here, because i trained 100 epochs and was wondering about no predictions...

on ubuntu 20.04 with GeForce 1660 and nvidia-run-driver (which gave me CUDA).

looks like a bug in 11.4 and i downgraded to 11.3...
still looking for a way to go further down...

ozett · 2022-01-17T16:32:28Z

Hopefully 11.1 has no nan-bug,
downgrade to 10.2 on ubuntu 20.04 seems difficult..

not to forget adjusting torch-install afterwards, i guess...
👻

ozett · 2022-01-17T16:45:44Z

cuda 11.3 seems to nan-nan

python train.py --img 412 --batch 2 --epochs 2

ozett · 2022-01-17T18:17:11Z

related, unsolved: https://docs.ultralytics.com/yolov5/tutorials/hyperparameter_evolution1
related, downgrade to 10.2 solved it: #4084
related, downgrade to 10.2 solved it: #4839

https://www.codestudyblog.com/cs2112pyc/1230044131.html
says that CUDNN has problems with GeForce 16xx...
will try to mess with CUDNN to see if that fixes this...

https://docs.nvidia.com/deeplearning/cudnn/release-notes/rel_8.html#rel-822

ozett · 2022-01-17T18:40:41Z

https://stackoverflow.com/questions/31326015/how-to-verify-cudnn-installation

looks like CUDNN is missing on my system.
maybe thats the whole problem?
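A quick way to see what PyTorch itself thinks about cuDNN, as a minimal sketch. The helper name cudnn_status is hypothetical. As far as I know, the CUDA builds of the PyTorch pip wheels ship their own cuDNN, so a missing system-wide cuDNN does not necessarily mean PyTorch runs without it.

```python
def cudnn_status():
    """Report whether PyTorch can see a cuDNN library; returns a small dict."""
    try:
        import torch
    except ImportError:
        return {"torch_installed": False}
    return {
        "torch_installed": True,
        "cudnn_available": torch.backends.cudnn.is_available(),
        "cudnn_version": torch.backends.cudnn.version(),  # None when cuDNN is unavailable
        "cudnn_enabled": torch.backends.cudnn.enabled,
    }


if __name__ == "__main__":
    print(cudnn_status())
    # Experiment only, not confirmed in this thread to fix the NaN losses:
    # make PyTorch skip its cuDNN kernels before training starts.
    # import torch
    # torch.backends.cudnn.enabled = False
```

If cudnn_available is True here, the library PyTorch uses comes from the installed wheel rather than from the system packages.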

ozett · 2022-01-17T19:13:19Z

wow, now some hours of driver re-install,
but pytorch 1.6 is the solution?

#1749

--- long way to go....

ozett · 2022-01-17T19:42:10Z

solved.
install specific pytorch version, with min-requirements for yolov5 and for cuda10.x
(install pytorch cuda10.2 even if v11.x is on your system. that looks like the solution)

https://pytorch.org/get-started/previous-versions/

pip install torch==1.7.1+cu101 torchvision==0.8.2+cu101 torchaudio==0.7.2 -f https://download.pytorch.org/whl/torch_stable.html
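The +cu101 suffix in the command above names the CUDA version the wheel was built for, independent of the CUDA toolkit on the system. A minimal sketch for reading that tag from a version string such as torch.__version__ follows; the function name split_torch_version is made up for illustration, and it assumes the last digit of the tag is the minor version.

```python
import re


def split_torch_version(version):
    """Split a PyTorch version string such as '1.9.0+cu102' into (release, cuda).

    cuda is a 'major.minor' string, or None when there is no '+cuXYZ' suffix
    (for example CPU-only builds tagged '+cpu').
    """
    release, _, local = version.partition("+")
    match = re.fullmatch(r"cu(\d+)(\d)", local)
    if match is None:
        return release, None
    return release, f"{match.group(1)}.{match.group(2)}"


print(split_torch_version("1.7.1+cu101"))  # ('1.7.1', '10.1')
print(split_torch_version("1.9.0+cu102"))  # ('1.9.0', '10.2')
print(split_torch_version("1.9.0+cpu"))    # ('1.9.0', None)
```

Running it on torch.__version__ after a reinstall confirms which build pip actually picked.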

my system is now filled up with unwanted packages from all nvidia-experiments.
i will have to set it up from scratch again.

maybe it only works because of a mismatch of cuda-packages 10 & 11 ?

i have to cross-check this with a fresh install of ubuntu ,
minimal nvidia driver and cuda 11,
and then a pytorch-version explicitly for cuda10

glenn-jocher · 2022-01-18T00:17:20Z

@ozett thanks for the feedback! Good to know CUDA 11.6 with driver 510.39.01 and torch==1.7.1+cu101 work well with consumer cards.

ozett · 2022-01-18T09:08:58Z

thanks for the encouragement.
i am also thankful that you have an eye on almost all processes and issues here. really great. even when i rehash old stuff here. great. but you should also sleep once in a while ... :-)

this case must be special with Geforce 16xx cards.

i have to cross-check the next days on a fresh ubuntu system 20.x
if newest nvidia-driver and newest cuda 11.6
are sufficient for older versions of pytorch/cuda combinations
and thus fix this "nan-nan" error on the geForce 16xx-card.
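Since the reports in this thread point at GeForce 16xx cards, a script could flag such a card before training. This is a minimal sketch: the helper is_gtx_16xx and its regex are assumptions based on device names like the ones quoted in this thread, not an official list of affected GPUs.

```python
import re

# Matches names such as "NVIDIA GeForce GTX 1660 Ti" or "GeForce GTX 1650".
_GTX16XX = re.compile(r"\bGTX\s*16[0-9]{2}\b", re.IGNORECASE)


def is_gtx_16xx(device_name):
    """Return True if the GPU name looks like a GeForce GTX 16xx card."""
    return _GTX16XX.search(device_name) is not None


print(is_gtx_16xx("NVIDIA GeForce GTX 1660 Ti"))  # True
print(is_gtx_16xx("NVIDIA GeForce RTX 3060"))     # False
```

With PyTorch installed and a GPU visible, the name to test would come from torch.cuda.get_device_name(0).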

edit: also i want the trained model to run on another install,
this has some other combination of pytorch installed without GPU
YOLOv5 🚀 v5.0-455-g59aae85 torch 1.9.1+cu102 CPU
that causes runtime-errors. i have to sort out which versions are compatible
to transfer the trained model to another system.
maybe some combinations will work...

that will take some time ... and i will report here the results briefly..

ozett · 2022-01-18T19:27:02Z

TESTED with fresh install:
regardless of the cuda-version actually installed on the ubuntu-os,
you must download and install the cuda10.2 version of pytorch.

that worked and fixed the nan-nan error.

detailed testrun:

#FIRST: install ubuntu 20.04.3 server.iso
#SECOND: Disable the nouveau driver, otherwise the install stops:

sudo apt-get install dkms build-essential linux-headers-generic
sudo bash -c "echo blacklist nouveau > /etc/modprobe.d/blacklist-nvidia-nouveau.conf"
sudo bash -c "echo options nouveau modeset=0 >> /etc/modprobe.d/blacklist-nvidia-nouveau.conf"
cat /etc/modprobe.d/blacklist-nvidia-nouveau.conf
sudo update-initramfs -u

# nvidia driver from driver-website is older,
#install driver 470 from nvidia.download
# CUDA 11.6 includes newer driver


#Install CUDA from NVIDIA (with newer driver)
#installs cuda 11.6 with driver 510
# https://developer.nvidia.com/cuda-downloads?target_os=Linux&target_arch=x86_64&Distribution=Ubuntu&target_version=18.04&target_type=runfile_local

#
# rather not ## install with deb(network) (way too much stuff)

# install with runfile...
wget https://developer.download.nvidia.com/compute/cuda/11.6.0/local_installers/cuda_11.6.0_510.39.01_linux.run
sudo sh cuda_11.6.0_510.39.01_linux.run

#check without reboot with nvidia-smi

#Install pip
curl -sSL https://bootstrap.pypa.io/get-pip.py -o get-pip.py
python get-pip.py

# Install torch 1.9 (cu111 build) on the CUDA 11.6 system
pip install torch==1.9.0+cu111 torchvision==0.10.0+cu111 torchaudio==0.9.0 -f https://download.pytorch.org/whl/torch_stable.html

# install yolo for training
git clone https://github.com/ultralytics/yolov5  # clone
cd yolov5
pip install -r requirements.txt  # install

#TEST torch for CUDA 11.6:
python train.py --img 412 --batch 2 --epochs 2

# -> ERROR nan-nan

pip uninstall torch
pip uninstall torch # run this command twice

# Install torch for Cuda 10.2
pip install torch==1.9.0+cu102 torchvision==0.10.0+cu102 torchaudio==0.9.0 -f https://download.pytorch.org/whl/torch_stable.html

#test
python train.py --img 412 --batch 2 --epochs 2
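The two test runs above are judged by eye from the loss values that train.py prints. A minimal sketch of the same check in code follows. It assumes the per-epoch losses have already been collected into plain dicts; the helper first_bad_epoch and the sample numbers are made up for illustration.

```python
import math


def first_bad_epoch(history):
    """Return the index of the first epoch whose losses contain NaN or inf, else None.

    history is a list of dicts mapping a loss name to a float, one dict per epoch.
    """
    for epoch, losses in enumerate(history):
        if any(not math.isfinite(value) for value in losses.values()):
            return epoch
    return None


history = [
    {"box": 0.11, "obj": 0.04},                  # made-up healthy epoch
    {"box": float("nan"), "obj": float("nan")},  # the "nan-nan" symptom
]
print(first_bad_epoch(history))  # prints 1
```

A result of None means every recorded loss was finite, which is what the cu102 run should show if the fix holds.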

weeix · 2023-01-17T14:49:04Z

@ozett thank you. I'm using YOLOv8 and had the same problem. Your comments saved me from excessive head scratching.

Environment: GeForce GTX 1650, Windows 11 64-bit, driver 528.02, python 3.9

Version that works for me: torch==1.9.0+cu102

Some other versions that I tried:

torch==1.9.1+cu102 -> dependency conflict
torch==1.10.2+cu102 -> 0% GPU utilization + Could not find module '...\torchvision\image.pyd' (or one of its dependencies)
torch==1.13.1+cu116 -> NaN + [WinError 1455] The paging file is too small for this operation to complete
torch==1.13.1+cu117 -> NaN

Also, because of how Python works in Windows, I had to reduce the number of workers to 1 in order to maximize GPU utilization.

Computer vision is tough.

LightCannon added the question Further information is requested label Nov 28, 2021

github-actions bot added the Stale label Dec 30, 2021

github-actions bot closed this as completed Jan 4, 2022

ozett mentioned this issue Jan 23, 2022

train_batch.jpg labels missing #1623

Closed

This was referenced Apr 2, 2022

nan in my custom dataset training hukaixuan19970627/yolov5_obb#239

Closed

nan in training on Windows hukaixuan19970627/yolov5_obb#260

Closed

monsieurpooh mentioned this issue May 11, 2022

"nan" losses issue for some small subset of users nerdyrodent/VQGAN-CLIP#151

Closed

This was referenced May 20, 2022

NaN tensor values problem for GTX16xx users (no problem on other devices) pytorch/pytorch#77955

Open

NaN tensor values problem for GTX16xx users (no problem on other devices) #7908

Closed
