CPU training works OK. GPU training always loss=nan and mAP50=0. GPU detection OK. #9709

Closed
1 of 2 tasks
jbeale1 opened this issue Oct 5, 2022 · 7 comments
Labels
bug Something isn't working Stale

jbeale1 commented Oct 5, 2022

Search before asking

  • I have searched the YOLOv5 issues and found no similar bug report.

YOLOv5 Component

Training

Bug

I followed the instructions at https://blog.paperspace.com/train-yolov5-custom-data/ and got CPU training to work, although it was slow (over 12 hours to complete). I then set up GPU training, first installing PyTorch with CUDA support:
conda install pytorch torchvision torchaudio cudatoolkit=11.6 -c pytorch -c conda-forge
and then following the same YOLOv5 instructions again, starting with:
git clone https://github.com/ultralytics/yolov5.git
I had to add one line to "hyp.scratch.yaml" to fix a KeyError: 'copy_paste' error, as described here:
#4827 (comment)
Training now runs without any obvious error, but the progress stats always show 'nan' for all three training losses and 0 for mAP50, and after training finishes, detection with the resulting weights finds no objects. GPU detection does work with my earlier CPU-trained weights, about 10x faster than on the CPU, so the GPU is clearly present and usable.
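The `copy_paste` workaround described above amounts to making sure the hyperparameter file contains every key the training script looks up. A minimal sketch of that check, assuming the YAML file has already been loaded into a plain dict. The helper name `fill_missing_hyp` is hypothetical and not part of YOLOv5, and the fallback value `0.0` is simply the `copy_paste` value that appears in the hyperparameters line of the training log:

```python
def fill_missing_hyp(hyp, defaults):
    """Return (patched_copy, missing_keys) without modifying the input dict."""
    missing = [key for key in defaults if key not in hyp]
    patched = dict(hyp)
    for key in missing:
        patched[key] = defaults[key]
    return patched, missing


# Hypothetical example: an older hyp file that predates the copy_paste key.
old_hyp = {"lr0": 0.01, "mosaic": 1.0, "mixup": 0.0}
patched, missing = fill_missing_hyp(old_hyp, {"copy_paste": 0.0})
print(missing)  # ['copy_paste']
```

Reporting the missing keys rather than silently patching them makes it easier to spot a hyperparameter file that is out of date relative to the cloned repo.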

python train.py --img 640 --cfg yolov5s.yaml --hyp hyp.scratch.yaml --batch 4 --epochs 100 --data road_sign_data.yaml --weights yolov5s.pt --workers 24 --name yolo_road_det

train: weights=yolov5s.pt, cfg=yolov5s.yaml, data=road_sign_data.yaml, hyp=hyp.scratch.yaml, epochs=100, batch_size=4, imgsz=640, rect=False, resume=False, nosave=False, noval=False, noautoanchor=False, noplots=False, evolve=None, bucket=, cache=None, image_weights=False, device=, multi_scale=False, single_cls=False, optimizer=SGD, sync_bn=False, workers=24, project=runs\train, name=yolo_road_det, exist_ok=False, quad=False, cos_lr=False, label_smoothing=0.0, patience=100, freeze=[0], save_period=-1, seed=0, local_rank=-1, entity=None, upload_dataset=False, bbox_interval=-1, artifact_alias=latest
github: up to date with https://github.com/ultralytics/yolov5
YOLOv5 v6.2-185-ge4398cf Python-3.9.13 torch-1.12.1 CUDA:0 (Quadro T1000, 4096MiB)

hyperparameters: lr0=0.01, lrf=0.2, momentum=0.937, weight_decay=0.0005, warmup_epochs=3.0, warmup_momentum=0.8, warmup_bias_lr=0.1, box=0.05, cls=0.5, cls_pw=1.0, obj=1.0, obj_pw=1.0, iou_t=0.2, anchor_t=4.0, fl_gamma=0.0, hsv_h=0.015, hsv_s=0.7, hsv_v=0.4, degrees=0.0, translate=0.1, scale=0.5, shear=0.0, perspective=0.0, flipud=0.0, fliplr=0.5, mosaic=1.0, mixup=0.0, copy_paste=0.0
Weights & Biases: run 'pip install wandb' to automatically track and visualize YOLOv5 runs in Weights & Biases
ClearML: run 'pip install clearml' to automatically track, visualize and remotely train YOLOv5 in ClearML
Comet: run 'pip install comet_ml' to automatically track and visualize YOLOv5 runs in Comet
TensorBoard: Start with 'tensorboard --logdir runs\train', view at http://localhost:6006/
Overriding model.yaml nc=80 with nc=4
{ ... }

Starting training for 100 epochs...

  Epoch    GPU_mem   box_loss   obj_loss   cls_loss  Instances       Size
   0/99     0.923G        nan        nan        nan          1        640: 100%|██████████| 176/176 [01:44<00:00,  1.69it/s]

C:\Users\bealej\Miniconda3\lib\site-packages\torch\optim\lr_scheduler.py:131: UserWarning: Detected call of lr_scheduler.step() before optimizer.step(). In PyTorch 1.1.0 and later, you should call them in the opposite order: optimizer.step() before lr_scheduler.step(). Failure to do this will result in PyTorch skipping the first value of the learning rate schedule. See more details at https://pytorch.org/docs/stable/optim.html#how-to-adjust-learning-rate
warnings.warn("Detected call of lr_scheduler.step() before optimizer.step(). "
             Class     Images  Instances          P          R      mAP50   mAP50-95: 100%|██████████| 11/11 [00:03<00:00,  3.58it/s]
               all         88        132          0          0          0          0

  Epoch    GPU_mem   box_loss   obj_loss   cls_loss  Instances       Size
   1/99      1.13G        nan        nan        nan          6        640: 100%|██████████| 176/176 [01:41<00:00,  1.74it/s]
             Class     Images  Instances          P          R      mAP50   mAP50-95: 100%|██████████| 11/11 [00:03<00:00,  3.60it/s]
               all         88        132          0          0          0          0

{... and so forth and so on}
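Because the losses are already `nan` in the first epoch, there is little point in letting all 100 epochs run. The following is a rough sketch of how a progress row like the ones above could be checked automatically. The regular expression is an assumption based only on the column layout shown in this log, and the function names are hypothetical:

```python
import math
import re

# Matches the start of a training progress row, for example:
#   "   0/99     0.923G        nan        nan        nan          1        640: 100%|..."
# Columns: epoch/last_epoch, GPU_mem, box_loss, obj_loss, cls_loss, Instances, Size
ROW = re.compile(
    r"^\s*(\d+)/(\d+)\s+(\S+)\s+(\S+)\s+(\S+)\s+(\S+)\s+(\d+)\s+(\d+):"
)


def parse_progress_row(line):
    """Return {'epoch': int, 'losses': {...}} or None if the line is not a progress row."""
    m = ROW.match(line)
    if m is None:
        return None
    epoch, _last, _mem, box, obj, cls, _instances, _size = m.groups()
    try:
        losses = {
            "box_loss": float(box),  # float("nan") parses to a NaN value
            "obj_loss": float(obj),
            "cls_loss": float(cls),
        }
    except ValueError:
        return None
    return {"epoch": int(epoch), "losses": losses}


def has_nan_loss(row):
    """True if any of the three losses in a parsed row is NaN."""
    return any(math.isnan(value) for value in row["losses"].values())


line = "   0/99     0.923G        nan        nan        nan          1        640: 100%|"
row = parse_progress_row(line)
print(has_nan_loss(row))  # True
```

Header rows and validation rows (such as the `all ...` line) do not match the pattern and come back as `None`, so they can be skipped.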

C:\WINDOWS\system32>nvidia-smi
Wed Oct  5 10:34:32 2022
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 512.72       Driver Version: 512.72       CUDA Version: 11.6     |
|-------------------------------+----------------------+----------------------+
| GPU  Name            TCC/WDDM | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  Quadro T1000       WDDM  | 00000000:01:00.0 Off |                  N/A |
| N/A   68C    P0    23W /  N/A |   2050MiB /  4096MiB |     91%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|    0   N/A  N/A      7128      C   ...lej\Miniconda3\python.exe    N/A      |
+-----------------------------------------------------------------------------+
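The `nvidia-smi` table above is the evidence that the GPU is busy during training (2050MiB in use, 91% utilization). As a small sketch, the same numbers can be pulled out of that row with a regular expression. The pattern assumes the exact table layout shown here, and the function name is hypothetical:

```python
import re

# Looks for "<used>MiB / <total>MiB | <util>%" as it appears in the nvidia-smi table row.
USAGE = re.compile(r"(\d+)MiB\s*/\s*(\d+)MiB\s*\|\s*(\d+)%")


def parse_gpu_usage(line):
    """Return memory and utilization figures from one nvidia-smi table row, or None."""
    m = USAGE.search(line)
    if m is None:
        return None
    used, total, util = (int(group) for group in m.groups())
    return {"used_mib": used, "total_mib": total, "util_percent": util}


row = "| N/A   68C    P0    23W /  N/A |   2050MiB /  4096MiB |     91%      Default |"
usage = parse_gpu_usage(row)
print(usage)  # {'used_mib': 2050, 'total_mib': 4096, 'util_percent': 91}
```

High utilization only shows that the process is running on the GPU; it says nothing about whether the computed losses are valid.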

Environment

YOLOv5 v6.2-185-ge4398cf Python-3.9.13 torch-1.12.1 CUDA:0 (Quadro T1000, 4096MiB)
Windows 10 Version 21H2 (OS Build 19044.2006)
Python 3.9.13 | packaged by conda-forge | (main, May 27 2022, 16:51:29) [MSC v.1929 64 bit (AMD64)] on win32

Minimal Reproducible Example

I am following the steps exactly as described at https://blog.paperspace.com/train-yolov5-custom-data/ with the one additional line/bugfix for KeyError: 'copy_paste' as described above, and I reduced batch from 32 to 4 due to memory limitations.
python train.py --img 640 --cfg yolov5s.yaml --hyp hyp.scratch.yaml --batch 4 --epochs 100 --data road_sign_data.yaml --weights yolov5s.pt --workers 24 --name yolo_road_det
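Since the same command works on the CPU and fails on the GPU, one way to narrow things down is a short run on each device with everything else held constant. The sketch below only builds the two command lines and does not execute anything. `--device` corresponds to the `device=` option visible in the training log; the reduced `--epochs` and `--workers` values and the run names are assumptions chosen to keep the comparison quick, not recommendations from the maintainers:

```python
import shlex


def build_train_cmd(device, epochs=3, workers=4):
    """Build the train.py command from this issue for one device ('cpu' or '0')."""
    return [
        "python", "train.py",
        "--img", "640",
        "--cfg", "yolov5s.yaml",
        "--hyp", "hyp.scratch.yaml",
        "--batch", "4",
        "--epochs", str(epochs),          # shortened: nan shows up in epoch 0
        "--data", "road_sign_data.yaml",
        "--weights", "yolov5s.pt",
        "--workers", str(workers),        # assumed smaller value for a quick test
        "--device", device,
        "--name", f"nan_check_{device}",  # hypothetical run name
    ]


for device in ("cpu", "0"):
    print(shlex.join(build_train_cmd(device)))
```

If the CPU run reports real loss values within the first epoch and the GPU run reports `nan`, the dataset and hyperparameters are unlikely to be the cause.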

Additional

No response

Are you willing to submit a PR?

  • Yes I'd like to help by submitting a PR!
jbeale1 added the bug label Oct 5, 2022
@glenn-jocher
Member

glenn-jocher commented Oct 5, 2022

@jbeale1 it appears you may have environment problems. I'd recommend training in our Docker image if you are having issues locally.

Please ensure you meet all dependency requirements if you are attempting to run YOLOv5 locally. If in doubt, create a new virtual Python 3.9 environment, clone the latest repo (code changes daily), and run pip install -r requirements.txt again from scratch.

💡 ProTip! Try one of our verified environments below if you are having trouble with your local environment.

Requirements

Python>=3.7.0 with all requirements.txt installed including PyTorch>=1.7. To get started:

git clone https://github.com/ultralytics/yolov5  # clone
cd yolov5
pip install -r requirements.txt  # install
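The minimum versions stated above (Python>=3.7.0, PyTorch>=1.7) can be checked against the version strings reported in this thread. This is a simplified sketch: it compares only the leading numeric components and ignores local tags such as `+cu116` and pre-release ordering, so a dedicated library such as `packaging` would be the more robust choice. Both function names are hypothetical:

```python
import re


def version_tuple(text):
    """Extract leading numeric components, e.g. '1.12.1+cu116' -> (1, 12, 1)."""
    public = text.split("+", 1)[0]  # drop a local tag such as '+cu116'
    parts = []
    for piece in public.split("."):
        m = re.match(r"\d+", piece)
        if m is None:
            break
        parts.append(int(m.group()))
    return tuple(parts)


def meets_minimum(installed, minimum):
    """True if the installed version is at least the minimum (numeric parts only)."""
    a, b = version_tuple(installed), version_tuple(minimum)
    width = max(len(a), len(b))
    a += (0,) * (width - len(a))  # pad so (1, 7) compares as (1, 7, 0)
    b += (0,) * (width - len(b))
    return a >= b


print(meets_minimum("1.12.1+cu116", "1.7"))  # True
print(meets_minimum("3.9.13", "3.7.0"))      # True
```

Both environments reported in this issue pass this check, which is consistent with the commenters' statements that the requirements are satisfied.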

Models and datasets download automatically from the latest YOLOv5 release when first requested.

Environments

YOLOv5 may be run in any of the following up-to-date verified environments (with all dependencies including CUDA/CUDNN, Python and PyTorch preinstalled):

Status

YOLOv5 CI

If this badge is green, all YOLOv5 GitHub Actions Continuous Integration (CI) tests are currently passing. CI tests verify correct operation of YOLOv5 training, validation, inference, export and benchmarks on macOS, Windows, and Ubuntu every 24 hours and on every commit.

@GugaLincon

GugaLincon commented Oct 5, 2022

I am having exactly the same issue. All the dependencies from requirements.txt are satisfied.

Windows 10
Python 3.9.2
torch 1.12.1+cu116
torchvision 0.13.1+cu116

@glenn-jocher
Member

Windows training on CPU is part of our CI here:
https://github.com/ultralytics/yolov5/actions/runs/3186423817

However, we do not test on GPU with Windows. If you find a solution, please let us know; we don't have any Windows GPU machines available to debug this ourselves.

@github-actions
Contributor

github-actions bot commented Nov 5, 2022

👋 Hello, this issue has been automatically marked as stale because it has not had recent activity. Please note it will be closed if no further activity occurs.

Access additional YOLOv5 🚀 resources:

Access additional Ultralytics ⚡ resources:

Feel free to inform us of any other issues you discover or feature requests that come to mind in the future. Pull Requests (PRs) are also always welcomed!

Thank you for your contributions to YOLOv5 🚀 and Vision AI ⭐!

github-actions bot added the Stale label Nov 5, 2022
@cswingley

cswingley commented Nov 15, 2022

I am also seeing nan losses when training on a Windows computer using the GPU and yolov5m.pt weights. I first tried this in an Anaconda Python 3.9 virtual environment, as per the instructions. Today I got the Docker image (ultralytics/yolov5:latest) running on the same system, as suggested by @glenn-jocher above. In this setup I am also getting nan for box_loss, obj_loss, and cls_loss. I did not run the training to completion, because in a non-GPU run those values are populated with real numbers immediately.

When I switch to yolov5s.pt weights (in the Docker container), I get real values for the various training outputs.

Windows 11
Docker Desktop 4.14.0, Engine 20.10.21
uname -a: Linux 67706da8d61d 5.10.16.3-microsoft-standard-WSL2 #1 SMP Fri Apr 2 22:23:49 UTC 2021 x86_64
Python 3.8.13
torch 1.13.0a0+d0d6b1f (not sure why this doesn't have cuXXX at the end...)

When I start the container, it warns that the host driver is version 516.94, which supports CUDA 11.7, whereas the container was built with CUDA 11.8, so it runs in "Minor Version Compatibility mode".
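The warning described above follows from comparing the CUDA version the driver supports with the CUDA version the software was built against. A deliberately simplified sketch of that comparison follows. The function name and the three labels are hypothetical, and the real compatibility rules also depend on minimum driver versions and on which features are used:

```python
def cuda_mode(driver_cuda, toolkit_cuda):
    """Classify a driver/toolkit CUDA pairing from 'major.minor' version strings."""
    d_major, d_minor = (int(x) for x in driver_cuda.split(".")[:2])
    t_major, t_minor = (int(x) for x in toolkit_cuda.split(".")[:2])
    if (d_major, d_minor) >= (t_major, t_minor):
        return "native"                       # driver is at least as new as the toolkit
    if d_major == t_major:
        return "minor-version-compatibility"  # same major, older minor on the driver side
    return "driver-too-old"


print(cuda_mode("11.7", "11.8"))  # minor-version-compatibility
print(cuda_mode("11.6", "11.6"))  # native
```

The first call mirrors the Docker setup in this comment; the second mirrors the original report, where the driver reports CUDA 11.6 and PyTorch was installed with cudatoolkit=11.6.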

github-actions bot removed the Stale label Nov 16, 2022
@glenn-jocher
Member

@cswingley minor version compatibility mode should be fine. You might just be running into Windows CUDA issues, which are not uncommon. If your dataset trains normally in a Linux environment like Colab, then the problem is likely in your local environment. If it doesn't train correctly in Docker or Colab either, then there's likely a dataset issue, or you've modified the default hyperparameters in a way that causes instabilities.
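To rule out the dataset possibility mentioned above, the label files can be sanity-checked before training. The sketch below assumes the plain YOLOv5 bounding-box label format, one `class x_center y_center width height` line per object with coordinates normalized to [0, 1]. The function name is hypothetical, and `num_classes=4` matches the `nc=4` override shown in the original training log:

```python
import math


def check_label_line(line, num_classes):
    """Return a list of problems found in one 'class x y w h' label line."""
    fields = line.split()
    if len(fields) != 5:
        return [f"expected 5 fields, got {len(fields)}"]
    try:
        cls = int(fields[0])
    except ValueError:
        return [f"class id is not an integer: {fields[0]!r}"]
    problems = []
    if not 0 <= cls < num_classes:
        problems.append(f"class id {cls} is outside 0..{num_classes - 1}")
    names = ("x_center", "y_center", "width", "height")
    for name, raw in zip(names, fields[1:]):
        try:
            value = float(raw)
        except ValueError:
            problems.append(f"{name} is not a number: {raw!r}")
            continue
        if not math.isfinite(value):
            problems.append(f"{name} is not finite")
        elif not 0.0 <= value <= 1.0:
            problems.append(f"{name}={value} is outside [0, 1]")
    return problems


print(check_label_line("2 0.5 0.5 0.2 0.3", num_classes=4))  # []
print(check_label_line("4 0.5 0.5 0.2 0.3", num_classes=4))  # ['class id 4 is outside 0..3']
```

A dataset that passes this check and trains correctly on the CPU points back toward the GPU environment rather than the data.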

@github-actions
Contributor

github-actions bot commented Dec 17, 2022

👋 Hello, this issue has been automatically marked as stale because it has not had recent activity. Please note it will be closed if no further activity occurs.

github-actions bot added the Stale label Dec 17, 2022
github-actions bot closed this as not planned Dec 27, 2022