I am getting nan and no predictions at all. #5815

Closed
1 task done
LightCannon opened this issue Nov 28, 2021 · 16 comments
Labels
question Further information is requested Stale

Comments

@LightCannon

Search before asking

Question

Hello everyone, I am new to YOLOv5 and I have a problem whose cause I cannot figure out.
I am training on a custom dataset (using a low number of epochs first), but the box and obj losses are nan. Also, no detections appear on the validation images.

image

I have used this command to train:
python train.py --img 412 --batch 2 --epochs 2 --data people.yaml --cfg models\yolov5s.yaml --name pm1 --workers 6

There is another issue here discussing the same problem. However, the comments there point to environment problems, and I still cannot figure out what the problem is. Here is my environment:

  • Windows 10 16 GB ram
  • NVIDIA GeForce GTX 1660 Ti, 6144MiB
  • Cuda 11.3
  • Python 3.8
  • torch==1.10.0
  • torchaudio==0.10.0
  • torchvision==0.11.1

and I am working on this dataset: https://github.com/ucuapps/top-view-multi-person-tracking

I would appreciate any help with fixing this problem and getting training to work well. Thanks

Additional

No response

@github-actions
Contributor

github-actions bot commented Nov 28, 2021

👋 Hello @LightCannon, thank you for your interest in YOLOv5 🚀! Please visit our ⭐️ Tutorials to get started, where you can find quickstart guides for simple tasks like Custom Data Training all the way to advanced concepts like Hyperparameter Evolution.

If this is a 🐛 Bug Report, please provide screenshots and minimum viable code to reproduce your issue, otherwise we can not help you.

If this is a custom training ❓ Question, please provide as much information as possible, including dataset images, training logs, screenshots, and a public link to online W&B logging if available.

For business inquiries or professional support requests please visit https://ultralytics.com or email Glenn Jocher at glenn.jocher@ultralytics.com.

Requirements

Python>=3.6.0 with all requirements.txt installed including PyTorch>=1.7. To get started:

$ git clone https://github.com/ultralytics/yolov5
$ cd yolov5
$ pip install -r requirements.txt


@glenn-jocher
Member

glenn-jocher commented Nov 28, 2021

@LightCannon this might be the Windows/conda/CUDA 11 bug in PyTorch that is mentioned in some other issues, in which case downgrading to CUDA 10 would solve it.

Or you may have some problems with your dataset labels. Check your mosaic jpgs to ensure your labels are correct and follow the instructions here:
https://docs.ultralytics.com/yolov5/tutorials/train_custom_data

@LightCannon
Author

I have downgraded to CUDA 10.2 and you are right, this is a bug from CUDA 11.3 and everything works now with CUDA 10.2. Thanks for your help.

@Zengyf-CVer
Contributor

@LightCannon
Because I did not see a screenshot of your virtual environment, I guess you installed PyTorch through pip install torchvision. If you want to use CUDA 11.x, you can try the install command from the official website (as shown in the figure):
pip3 install torch==1.10.0+cu113 torchvision==0.11.1+cu113 torchaudio==0.10.0+cu113 -f https://download.pytorch.org/whl/cu113/torch_stable.html
In addition, how well CUDA 11.x works depends on the graphics card model you are using. Some graphics cards are very friendly to CUDA 11.x.
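
The +cu113 suffix in these version strings is what identifies the CUDA build of a PyTorch wheel. As a minimal sketch, the suffix can be decoded with the standard library alone. The helper name cuda_tag is hypothetical (it is not part of PyTorch or YOLOv5), and the sketch assumes the usual +cuXYZ / +cpu suffix convention with a single-digit minor version; wheels without a suffix return None.

```python
import re


def cuda_tag(version):
    """Decode the build suffix of a PyTorch-style version string.

    Hypothetical helper: returns e.g. "11.3" for "+cu113", "cpu" for "+cpu",
    and None when the string carries no recognised suffix.
    """
    match = re.search(r"\+(cu(\d+)|cpu)$", version.strip())
    if match is None:
        return None
    if match.group(1) == "cpu":
        return "cpu"
    digits = match.group(2)
    # Assumption: the last digit is the minor version ("102" -> "10.2").
    return f"{digits[:-1]}.{digits[-1]}"


if __name__ == "__main__":
    for v in ("1.10.0+cu113", "1.9.0+cu102", "1.7.1+cu101", "1.9.1"):
        print(v, "->", cuda_tag(v))
```

Passing torch.__version__ to the helper shows which build is actually installed in the active environment, which may differ from the CUDA toolkit installed on the system.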

@github-actions
Contributor

github-actions bot commented Dec 30, 2021

👋 Hello, this issue has been automatically marked as stale because it has not had recent activity. Please note it will be closed if no further activity occurs.

Feel free to inform us of any other issues you discover or feature requests that come to mind in the future. Pull Requests (PRs) are also always welcomed!

Thank you for your contributions to YOLOv5 🚀 and Vision AI ⭐!

@ozett

ozett commented Jan 17, 2022

glad that i found that here, because i trained 100 epochs and was wondering about no predictions...

on ubuntu 20.04 with geforce 1660 and nvidia-run-driver (which gave me CUDA).

looks like a bug in 11.4 and i downgraded to 11.3...
still looking for a way to go further down...

image

@ozett

ozett commented Jan 17, 2022

Hopefully 11.1 has no nan-bug,
downgrade to 10.2 on ubuntu 20.04 seems difficult..

not to forget adjusting torch-install afterwards, i guess...
👻

image

@ozett

ozett commented Jan 17, 2022

cuda 11.3 seems to nan-nan

python train.py --img 412 --batch 2 --epochs 2

image

@ozett

ozett commented Jan 17, 2022

related, unsolved: https://docs.ultralytics.com/yolov5/tutorials/hyperparameter_evolution1
related, downgrade to 10.2 solved it: #4084
related, downgrade to 10.2 solved it: #4839

https://www.codestudyblog.com/cs2112pyc/1230044131.html
says that CUDNN has problems with GeForce 16xx...
will try to mess with CUDNN to see if that fixes this...

image
https://docs.nvidia.com/deeplearning/cudnn/release-notes/rel_8.html#rel-822

@ozett

ozett commented Jan 17, 2022

https://stackoverflow.com/questions/31326015/how-to-verify-cudnn-installation

looks like CUDNN is missing on my system.
maybe thats the whole problem?
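
A note on this check: PyTorch wheels installed with pip that carry a +cuXXX suffix generally ship their own cuDNN, so a missing system-wide cuDNN does not necessarily mean PyTorch has none. A minimal sketch for asking PyTorch directly, assuming torch may or may not be installed (the function name cudnn_report is made up for illustration):

```python
def cudnn_report():
    """Report the cuDNN state as seen by PyTorch, or None if torch is absent.

    Hypothetical helper; it only wraps existing torch.backends.cudnn calls.
    """
    try:
        import torch
    except ImportError:
        return None
    return {
        # True when the installed torch build can use cuDNN.
        "available": torch.backends.cudnn.is_available(),
        # Integer cuDNN version, or None when cuDNN is not available.
        "version": torch.backends.cudnn.version(),
        # Global switch that lets torch use cuDNN kernels.
        "enabled": torch.backends.cudnn.enabled,
    }


if __name__ == "__main__":
    print(cudnn_report())
```

If "available" is True here, the cuDNN in use is the one bundled with the wheel, regardless of what the operating system has installed.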

@ozett

ozett commented Jan 17, 2022

wow, now some hours of driver re-install,
but pytorch 1.6 is the solution?

image
#1749

--- long way to go....

image

@ozett

ozett commented Jan 17, 2022

solved.
install a specific pytorch version that meets the min-requirements for yolov5 and is built for cuda10.x
(install pytorch cuda10.2 even if v11.x is on your system. that looks like the solution)

https://pytorch.org/get-started/previous-versions/

pip install torch==1.7.1+cu101 torchvision==0.8.2+cu101 torchaudio==0.7.2 -f https://download.pytorch.org/whl/torch_stable.html

image

my system is now filled up with unwanted packages from all nvidia-experiments.
i will have to set it up from scratch again.

maybe it only works because of the mismatch of cuda packages 10 & 11?

i have to cross-check this with a fresh install of ubuntu,
minimal nvidia driver and cuda 11,
and then a pytorch version explicitly for cuda10

image
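
To see the two things being mixed here side by side (the CUDA build of the installed PyTorch wheel versus the NVIDIA driver on the system), a small report can help. This is only a sketch under assumptions: the function names are hypothetical, torch and nvidia-smi may each be missing, and in that case the corresponding function returns None.

```python
import shutil
import subprocess


def driver_version():
    """Return the NVIDIA driver version reported by nvidia-smi, or None."""
    if shutil.which("nvidia-smi") is None:
        return None
    try:
        result = subprocess.run(
            ["nvidia-smi", "--query-gpu=driver_version", "--format=csv,noheader"],
            capture_output=True, text=True, check=True, timeout=10,
        )
    except (subprocess.SubprocessError, OSError):
        return None
    lines = result.stdout.strip().splitlines()
    return lines[0].strip() if lines else None


def torch_build():
    """Return version info for the installed torch wheel, or None if absent."""
    try:
        import torch
    except ImportError:
        return None
    has_gpu = torch.cuda.is_available()
    return {
        "torch": torch.__version__,
        # CUDA version the wheel was built with (None for CPU-only builds).
        "cuda_build": torch.version.cuda,
        "gpu": torch.cuda.get_device_name(0) if has_gpu else None,
    }


if __name__ == "__main__":
    print("driver:", driver_version())
    print("torch :", torch_build())
```

On the setup described in this comment, one would expect cuda_build to show a 10.x value even though the system toolkit and driver are from the 11.x generation.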

@glenn-jocher
Member

@ozett thanks for the feedback! Good to know CUDA 11.6 with driver 510.39.01 and torch==1.7.1+cu101 work well with consumer cards.

@ozett

ozett commented Jan 18, 2022

thanks for the encouragement.
i am also thankful that you have an eye on almost all processes and issues here. really great. even when i rehash old stuff here. great. but you should also sleep once in a while ... :-)

this case must be special with Geforce 16xx cards.

i have to cross-check in the next days on a fresh ubuntu 20.x system
whether the newest nvidia driver and newest cuda 11.6
are sufficient for older pytorch/cuda combinations
and thus fix this "nan-nan" error on the geforce 16xx card.

edit: i also want the trained model to run on another install,
which has a different pytorch build installed, without GPU:
YOLOv5 🚀 v5.0-455-g59aae85 torch 1.9.1+cu102 CPU
that causes runtime errors. i have to sort out which versions are compatible
in order to transfer the trained model to another system.
maybe some combinations will work...

that will take some time ... and i will report here the results briefly..

@ozett

ozett commented Jan 18, 2022

TESTED with fresh install:
regardless of the cuda version actually installed on the ubuntu os,
you must download and install the cuda10.2 build of pytorch.

that worked and fixed the nan-nan error.

detailed testrun:

#FIRST: install ubuntu 20.04.3 server.iso
#SECOND: Disable nouveau driver, otherwise install stops:

sudo apt-get install dkms build-essential linux-headers-generic
sudo bash -c "echo blacklist nouveau > /etc/modprobe.d/blacklist-nvidia-nouveau.conf"
sudo bash -c "echo options nouveau modeset=0 >> /etc/modprobe.d/blacklist-nvidia-nouveau.conf"
cat /etc/modprobe.d/blacklist-nvidia-nouveau.conf
sudo update-initramfs -u

# nvidia driver from driver-website is older,
# install driver 470 from nvidia download
# CUDA 11.6 includes newer driver


#Install CUDA from NVIDIA (with newer driver)
#installs cuda 11.6 with driver 510
# https://developer.nvidia.com/cuda-downloads?target_os=Linux&target_arch=x86_64&Distribution=Ubuntu&target_version=18.04&target_type=runfile_local

#
# rather not ## install with deb(network) (way too much stuff)

# install with runfile...
wget https://developer.download.nvidia.com/compute/cuda/11.6.0/local_installers/cuda_11.6.0_510.39.01_linux.run
sudo sh cuda_11.6.0_510.39.01_linux.run

#check without reboot with nvidia-smi

#Install pip
curl -sSL https://bootstrap.pypa.io/get-pip.py -o get-pip.py
python get-pip.py

# Install torch 1.9 (cu111 build) on the cuda 11.6 system
pip install torch==1.9.0+cu111 torchvision==0.10.0+cu111 torchaudio==0.9.0 -f https://download.pytorch.org/whl/torch_stable.html

# install yolo for training
git clone https://github.com/ultralytics/yolov5  # clone
cd yolov5
pip install -r requirements.txt  # install

#TEST torch for CUDA 11.6:
python train.py --img 412 --batch 2 --epochs 2

# -> ERROR nan-nan

pip uninstall torch
pip uninstall torch # run this command twice

# Install torch for Cuda 10.2
pip install torch==1.9.0+cu102 torchvision==0.10.0+cu102 torchaudio==0.9.0 -f https://download.pytorch.org/whl/torch_stable.html

#test
python train.py --img 412 --batch 2 --epochs 2

image
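
A full train.py run is a slow way to find out whether a given torch build produces NaN on a card. As a rough sketch only, a tiny mixed-precision forward/backward pass can serve as a quicker probe. This rests on an assumption the thread does not confirm: that the NaN appears under automatic mixed precision on these GPUs. The function name amp_smoke_test is hypothetical, and a passing result does not guarantee that training is unaffected.

```python
def amp_smoke_test():
    """Run one small mixed-precision step on the GPU.

    Returns True if the loss is finite, False if it is NaN/inf,
    and None when torch or a CUDA device is not available.
    """
    try:
        import torch
    except ImportError:
        return None
    if not torch.cuda.is_available():
        return None

    torch.manual_seed(0)
    model = torch.nn.Sequential(
        torch.nn.Conv2d(3, 8, kernel_size=3, padding=1),
        torch.nn.BatchNorm2d(8),
        torch.nn.ReLU(),
    ).cuda()
    x = torch.randn(2, 3, 64, 64, device="cuda")

    # torch.cuda.amp.autocast is used because it exists in the older
    # torch versions discussed in this thread (1.7 to 1.10).
    with torch.cuda.amp.autocast():
        loss = model(x).float().mean()
    loss.backward()
    return bool(torch.isfinite(loss).item())


if __name__ == "__main__":
    print("finite loss under AMP:", amp_smoke_test())
```

Running this once after each pip install of a different torch build gives a quick first signal before starting the two-epoch training test.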

@weeix

weeix commented Jan 17, 2023

@ozett thank you. I'm using YOLOv8 and had the same problem. Your comments saved me from excessive head scratching.

Environment: GeForce GTX 1650, Windows 11 64-bit, driver 528.02, python 3.9

Version that works for me: torch==1.9.0+cu102

Some other versions that I tried:

  • torch==1.9.1+cu102 -> dependency conflict
  • torch==1.10.2+cu102 -> 0% GPU utilization + Could not find module '...\torchvision\image.pyd' (or one of its dependencies)
  • torch==1.13.1+cu116 -> NaN + [WinError 1455] The paging file is too small for this operation to complete
  • torch==1.13.1+cu117 -> NaN

Also, because of how Python works in Windows, I had to reduce the number of workers to 1 in order to maximize GPU utilization.

Computer vision is tough.
