CPU training works OK. GPU training always loss=nan and mAP50=0. GPU detection OK. #9709

jbeale1 · 2022-10-05T17:53:56Z

Search before asking

I have searched the YOLOv5 issues and found no similar bug report.

YOLOv5 Component

Training

Bug

I followed the instructions at https://blog.paperspace.com/train-yolov5-custom-data/ and got CPU training to work, although it was slow (over 12 hours to complete). Then I tried GPU training, starting with:
conda install pytorch torchvision torchaudio cudatoolkit=11.6 -c pytorch -c conda-forge
and then followed the YOLOv5 instructions above again, starting with:
git clone https://github.com/ultralytics/yolov5.git
I did have to add one line to "hyp.scratch.yaml" to fix a KeyError: 'copy_paste' error, as described here:
#4827 (comment)
The training command now runs without any obvious error, but the progress stats always show 'nan' for all three training losses and 0 for mAP50, and after training finishes, detection finds no objects. GPU detection does work with my earlier CPU-trained weights, and it runs about 10x faster than on CPU, so the GPU is clearly present and usable.

python train.py --img 640 --cfg yolov5s.yaml --hyp hyp.scratch.yaml --batch 4 --epochs 100 --data road_sign_data.yaml --weights yolov5s.pt --workers 24 --name yolo_road_det

train: weights=yolov5s.pt, cfg=yolov5s.yaml, data=road_sign_data.yaml, hyp=hyp.scratch.yaml, epochs=100, batch_size=4, imgsz=640, rect=False, resume=False, nosave=False, noval=False, noautoanchor=False, noplots=False, evolve=None, bucket=, cache=None, image_weights=False, device=, multi_scale=False, single_cls=False, optimizer=SGD, sync_bn=False, workers=24, project=runs\train, name=yolo_road_det, exist_ok=False, quad=False, cos_lr=False, label_smoothing=0.0, patience=100, freeze=[0], save_period=-1, seed=0, local_rank=-1, entity=None, upload_dataset=False, bbox_interval=-1, artifact_alias=latest
github: up to date with https://github.com/ultralytics/yolov5
YOLOv5 v6.2-185-ge4398cf Python-3.9.13 torch-1.12.1 CUDA:0 (Quadro T1000, 4096MiB)

hyperparameters: lr0=0.01, lrf=0.2, momentum=0.937, weight_decay=0.0005, warmup_epochs=3.0, warmup_momentum=0.8, warmup_bias_lr=0.1, box=0.05, cls=0.5, cls_pw=1.0, obj=1.0, obj_pw=1.0, iou_t=0.2, anchor_t=4.0, fl_gamma=0.0, hsv_h=0.015, hsv_s=0.7, hsv_v=0.4, degrees=0.0, translate=0.1, scale=0.5, shear=0.0, perspective=0.0, flipud=0.0, fliplr=0.5, mosaic=1.0, mixup=0.0, copy_paste=0.0
Weights & Biases: run 'pip install wandb' to automatically track and visualize YOLOv5 runs in Weights & Biases
ClearML: run 'pip install clearml' to automatically track, visualize and remotely train YOLOv5 in ClearML
Comet: run 'pip install comet_ml' to automatically track and visualize YOLOv5 runs in Comet
TensorBoard: Start with 'tensorboard --logdir runs\train', view at http://localhost:6006/
Overriding model.yaml nc=80 with nc=4
{ ... }

Starting training for 100 epochs...

  Epoch    GPU_mem   box_loss   obj_loss   cls_loss  Instances       Size
   0/99     0.923G        nan        nan        nan          1        640: 100%|██████████| 176/176 [01:44<00:00,  1.69it/s]

C:\Users\bealej\Miniconda3\lib\site-packages\torch\optim\lr_scheduler.py:131: UserWarning: Detected call of lr_scheduler.step() before optimizer.step(). In PyTorch 1.1.0 and later, you should call them in the opposite order: optimizer.step() before lr_scheduler.step(). Failure to do this will result in PyTorch skipping the first value of the learning rate schedule. See more details at https://pytorch.org/docs/stable/optim.html#how-to-adjust-learning-rate
warnings.warn("Detected call of lr_scheduler.step() before optimizer.step(). "
             Class     Images  Instances          P          R      mAP50   mAP50-95: 100%|██████████| 11/11 [00:03<00:00,  3.58it/s]
               all         88        132          0          0          0          0

  Epoch    GPU_mem   box_loss   obj_loss   cls_loss  Instances       Size
   1/99      1.13G        nan        nan        nan          6        640: 100%|██████████| 176/176 [01:41<00:00,  1.74it/s]
             Class     Images  Instances          P          R      mAP50   mAP50-95: 100%|██████████| 11/11 [00:03<00:00,  3.60it/s]
               all         88        132          0          0          0          0

{... and so forth and so on}
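For longer runs it can help to scan a saved copy of the console output for the first epoch whose losses are NaN. The following is a minimal sketch that assumes the progress-line layout shown above (epoch, GPU memory, then box, obj and cls losses). The helper names `parse_losses` and `epochs_with_nan` are made up for this example and are not part of YOLOv5.

```python
import math

# Two progress lines copied from the log above, plus one made-up healthy line
# (epoch 2) added only so the example has a non-NaN case.
SAMPLE_LOG = """\
   0/99     0.923G        nan        nan        nan          1        640: 100%
   1/99      1.13G        nan        nan        nan          6        640: 100%
   2/99      1.13G    0.08123    0.02411    0.01562          6        640: 100%
"""


def parse_losses(progress_line):
    """Return (epoch, box_loss, obj_loss, cls_loss), or None if the line is not a progress line."""
    fields = progress_line.split()
    # Progress lines start with "epoch/last_epoch", e.g. "0/99".
    if len(fields) < 5 or "/" not in fields[0]:
        return None
    try:
        epoch = int(fields[0].split("/")[0])
        losses = tuple(float(value) for value in fields[2:5])  # float("nan") parses fine
    except ValueError:
        return None
    return (epoch, *losses)


def epochs_with_nan(log_text):
    """List the epochs in which at least one of the three losses is NaN."""
    bad_epochs = []
    for line in log_text.splitlines():
        parsed = parse_losses(line)
        if parsed is not None and any(math.isnan(loss) for loss in parsed[1:]):
            bad_epochs.append(parsed[0])
    return bad_epochs


if __name__ == "__main__":
    print("epochs with NaN loss:", epochs_with_nan(SAMPLE_LOG))
```

If epoch 0 is already in the list, as it is here, the losses were never finite on the GPU, which points away from a slow divergence during training.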

C:\WINDOWS\system32>nvidia-smi
Wed Oct  5 10:34:32 2022
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 512.72       Driver Version: 512.72       CUDA Version: 11.6     |
|-------------------------------+----------------------+----------------------+
| GPU  Name            TCC/WDDM | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  Quadro T1000       WDDM  | 00000000:01:00.0 Off |                  N/A |
| N/A   68C    P0    23W /  N/A |   2050MiB /  4096MiB |     91%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|    0   N/A  N/A      7128      C   ...lej\Miniconda3\python.exe    N/A      |
+-----------------------------------------------------------------------------+

Environment

YOLOv5 v6.2-185-ge4398cf Python-3.9.13 torch-1.12.1 CUDA:0 (Quadro T1000, 4096MiB)
Windows 10 Version 21H2 (OS Build 19044.2006)
Python 3.9.13 | packaged by conda-forge | (main, May 27 2022, 16:51:29) [MSC v.1929 64 bit (AMD64)] on win32

Minimal Reproducible Example

I am following the steps exactly as described at https://blog.paperspace.com/train-yolov5-custom-data/ with the one additional line/bugfix for KeyError: 'copy_paste' as described above, and I reduced batch from 32 to 4 due to memory limitations.
python train.py --img 640 --cfg yolov5s.yaml --hyp hyp.scratch.yaml --batch 4 --epochs 100 --data road_sign_data.yaml --weights yolov5s.pt --workers 24 --name yolo_road_det

Additional

No response

Are you willing to submit a PR?

Yes I'd like to help by submitting a PR!


glenn-jocher · 2022-10-05T19:15:28Z

@jbeale1 it appears you may have environment problems. I'd recommend training in our Docker image if you are having issues locally.

Please ensure you meet all dependency requirements if you are attempting to run YOLOv5 locally. If in doubt, create a new virtual Python 3.9 environment, clone the latest repo (code changes daily), and run pip install -r requirements.txt again from scratch.

💡 ProTip! Try one of our verified environments below if you are having trouble with your local environment.

Requirements

Python>=3.7.0 with all requirements.txt installed including PyTorch>=1.7. To get started:

git clone https://github.com/ultralytics/yolov5  # clone
cd yolov5
pip install -r requirements.txt  # install

Models and datasets download automatically from the latest YOLOv5 release when first requested.

Environments

YOLOv5 may be run in any of the following up-to-date verified environments (with all dependencies including CUDA/CUDNN, Python and PyTorch preinstalled):

Notebooks with free GPU:
Google Cloud Deep Learning VM. See GCP Quickstart Guide
Amazon Deep Learning AMI. See AWS Quickstart Guide
Docker Image. See Docker Quickstart Guide

Status

If this badge is green, all YOLOv5 GitHub Actions Continuous Integration (CI) tests are currently passing. CI tests verify correct operation of YOLOv5 training, validation, inference, export and benchmarks on MacOS, Windows, and Ubuntu every 24 hours and on every commit.

GugaLincon · 2022-10-05T20:36:22Z

I am having exactly the same issue. All the dependencies from requirements.txt are satisfied.

Windows 10
Python 3.9.2
torch 1.12.1+cu116
torchvision 0.13.1+cu116

glenn-jocher · 2022-10-05T23:29:36Z

Windows training on CPU is part of our CI here:
https://github.com/ultralytics/yolov5/actions/runs/3186423817

However, we do not test on GPU with Windows. If you find a solution, please let us know, as we don't have any Windows GPU machines available to debug this ourselves.
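Nothing in this thread identifies a cause. As one way to narrow things down on an affected machine, the sketch below runs the same tiny forward and backward pass on the CPU in FP32, on the GPU in FP32, and on the GPU under FP16 autocast, and reports which configurations produce a finite loss and finite gradients. It assumes, without confirmation from this thread, that reduced-precision GPU arithmetic is worth ruling out. The toy model and the function name `finite_loss_report` are hypothetical stand-ins, not YOLOv5 code.

```python
import contextlib


def finite_loss_report():
    """Return {configuration: True if loss and gradients are finite} for what can run here."""
    try:
        import torch
    except ImportError:
        return {}  # PyTorch is not installed, so there is nothing to check

    def is_finite(device, use_autocast):
        torch.manual_seed(0)
        # Tiny stand-in network; this is NOT the YOLOv5 model.
        model = torch.nn.Sequential(
            torch.nn.Conv2d(3, 8, kernel_size=3, padding=1),
            torch.nn.SiLU(),
            torch.nn.Conv2d(8, 4, kernel_size=1),
        ).to(device)
        images = torch.rand(2, 3, 64, 64, device=device)
        target = torch.rand(2, 4, 64, 64, device=device)
        # FP16 autocast is only enabled for the GPU mixed-precision case.
        context = (
            torch.autocast(device_type="cuda", dtype=torch.float16)
            if use_autocast
            else contextlib.nullcontext()
        )
        with context:
            prediction = model(images)
        loss = torch.nn.functional.mse_loss(prediction.float(), target)
        loss.backward()
        grads_finite = all(
            bool(torch.isfinite(p.grad).all()) for p in model.parameters()
        )
        return bool(torch.isfinite(loss)) and grads_finite

    report = {"cpu_fp32": is_finite("cpu", use_autocast=False)}
    if torch.cuda.is_available():
        report["cuda_fp32"] = is_finite("cuda", use_autocast=False)
        report["cuda_fp16_autocast"] = is_finite("cuda", use_autocast=True)
    return report


if __name__ == "__main__":
    for name, ok in finite_loss_report().items():
        print(f"{name}: {'finite' if ok else 'NaN/inf detected'}")
```

If only the autocast configuration fails, that would suggest looking at mixed precision on that GPU and driver combination. If every configuration passes, the toy model simply did not reproduce the problem and no conclusion can be drawn.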

github-actions · 2022-11-05T00:26:26Z

👋 Hello, this issue has been automatically marked as stale because it has not had recent activity. Please note it will be closed if no further activity occurs.

Access additional YOLOv5 🚀 resources:

Wiki – https://github.com/ultralytics/yolov5/wiki
Tutorials – https://docs.ultralytics.com/yolov5
Docs – https://docs.ultralytics.com

Access additional Ultralytics ⚡ resources:

Ultralytics HUB – https://ultralytics.com/hub
Vision API – https://ultralytics.com/yolov5
About Us – https://ultralytics.com/about
Join Our Team – https://ultralytics.com/work
Contact Us – https://ultralytics.com/contact

Feel free to inform us of any other issues you discover or feature requests that come to mind in the future. Pull Requests (PRs) are also always welcomed!

Thank you for your contributions to YOLOv5 🚀 and Vision AI ⭐!

cswingley · 2022-11-15T23:41:22Z

I am also seeing nan when training on a Windows computer using the GPU and yolov5m.pt weights. I first tried this in an Anaconda Python 3.9 virtual environment, as per the instructions. Today I got the Docker image (ultralytics/yolov5:latest) running on the same system, as suggested by @glenn-jocher above. In this setup I am also getting nan for box_loss, obj_loss, and cls_loss. I have not run the training to completion, because in a non-GPU run those values are populated with real numbers immediately.

When I switch to yolov5s.pt weights (in the docker container), I am getting real values for the various training outputs.

Windows 11
Docker Desktop 4.14.0, Engine 20.10.21
uname -a: Linux 67706da8d61d 5.10.16.3-microsoft-standard-WSL2 #1 SMP Fri Apr 2 22:23:49 UTC 2021 x86_64
Python 3.8.13
torch 1.13.0a0+d0d6b1f (not sure why this doesn't have cuXXX at the end...)

When I start up the container, it does complain that the host is using driver version 516.94, which supports CUDA 11.7, whereas the container was built with CUDA 11.8, so it is being run in "Minor Version Compatibility mode".
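The version details above can also be collected programmatically, which makes reports from different machines easier to compare. This is a minimal sketch: the function name `describe_environment` and the dictionary keys are made up for this example. It uses `torch.version.cuda`, which reports the CUDA version a PyTorch build was compiled against even when the version string has no `+cuXXX` suffix, and it falls back gracefully when PyTorch is not installed.

```python
import platform


def describe_environment():
    """Collect OS, Python, and (if installed) PyTorch/CUDA details in one dictionary."""
    info = {
        "os": platform.platform(),
        "python": platform.python_version(),
        "torch": None,             # stays None if PyTorch is not installed
        "torch_cuda_build": None,  # CUDA version PyTorch was built with; None for CPU-only builds
        "cuda_available": False,
        "gpu": None,
    }
    try:
        import torch
    except ImportError:
        return info

    info["torch"] = torch.__version__
    info["torch_cuda_build"] = torch.version.cuda
    info["cuda_available"] = torch.cuda.is_available()
    if info["cuda_available"]:
        info["gpu"] = torch.cuda.get_device_name(0)
    return info


if __name__ == "__main__":
    for key, value in describe_environment().items():
        print(f"{key}: {value}")
```

Comparing `torch_cuda_build` with the CUDA version reported by nvidia-smi shows at a glance whether the build and the driver are on different CUDA releases.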

glenn-jocher · 2022-11-16T18:41:45Z

@cswingley minor version compatibility mode should be fine. You might just be running into Windows CUDA issues, which are not uncommon. If your dataset trains normally in a Linux environment like Colab, then the problem is likely in your local environment. If it doesn't train correctly in Docker or Colab, then there is likely a dataset issue, or you have modified the default hyperparameters in a way that causes instabilities.
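To rule out the dataset side, label files can be checked for malformed entries before training. The sketch below assumes the usual YOLO detection label format, one object per line as `class x_center y_center width height` with coordinates normalized to the range 0 to 1. The function names are hypothetical, and `num_classes=4` in the usage lines matches the nc=4 reported in the original training log.

```python
import math


def check_label_line(line, num_classes):
    """Return a list of problems found in one YOLO-format detection label line."""
    parts = line.split()
    if len(parts) != 5:
        return [f"expected 5 fields, found {len(parts)}"]
    try:
        class_id = int(parts[0])
        coords = [float(part) for part in parts[1:]]
    except ValueError:
        return ["non-numeric field"]

    problems = []
    if not 0 <= class_id < num_classes:
        problems.append(f"class id {class_id} outside 0..{num_classes - 1}")
    if any(not math.isfinite(value) for value in coords):
        problems.append("NaN or infinite coordinate")
    elif any(not 0.0 <= value <= 1.0 for value in coords):
        problems.append("coordinate outside [0, 1]")
    elif coords[2] <= 0 or coords[3] <= 0:
        problems.append("zero width or height")
    return problems


def check_label_file(path, num_classes):
    """Map 1-based line numbers to the problems found in one label file."""
    report = {}
    with open(path, encoding="utf-8") as handle:
        for number, line in enumerate(handle, start=1):
            if line.strip():  # ignore blank lines
                problems = check_label_line(line, num_classes)
                if problems:
                    report[number] = problems
    return report


if __name__ == "__main__":
    print(check_label_line("0 0.5 0.5 0.2 0.3", num_classes=4))  # well-formed line
    print(check_label_line("4 0.5 0.5 0.2 0.3", num_classes=4))  # class id too large
```

A clean report does not prove the dataset is fine, but a NaN coordinate or an out-of-range class id would be a concrete lead.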

github-actions · 2022-12-17T00:19:45Z

👋 Hello, this issue has been automatically marked as stale because it has not had recent activity. Please note it will be closed if no further activity occurs.


jbeale1 added the bug label Oct 5, 2022

github-actions bot added the Stale label Nov 5, 2022

github-actions bot removed the Stale label Nov 16, 2022

github-actions bot added the Stale label Dec 17, 2022

github-actions bot closed this as not planned Dec 27, 2022
