CPU training works OK. GPU training always loss=nan and mAP50=0. GPU detection OK. #9709
Comments
@jbeale1 it appears you may have environment problems. I'd recommend training in our Docker image if you are having issues locally. Please ensure you meet all dependency requirements if you are attempting to run YOLOv5 locally. If in doubt, create a new virtual Python 3.9 environment and clone the latest repo (code changes daily).
💡 ProTip! Try one of our verified environments below if you are having trouble with your local environment.
Requirements
Python>=3.7.0 with all requirements.txt installed including PyTorch>=1.7. To get started:
git clone https://github.com/ultralytics/yolov5 # clone
cd yolov5
pip install -r requirements.txt # install
Models and datasets download automatically from the latest YOLOv5 release when first requested.
Environments
YOLOv5 may be run in any of the following up-to-date verified environments (with all dependencies including CUDA/CUDNN, Python and PyTorch preinstalled):
Status
If this badge is green, all YOLOv5 GitHub Actions Continuous Integration (CI) tests are currently passing. CI tests verify correct operation of YOLOv5 training, validation, inference, export and benchmarks on MacOS, Windows, and Ubuntu every 24 hours and on every commit.
I am having exactly the same issue. All the dependencies from requirements.txt are satisfied. Windows 10
Windows training on CPU is part of our CI. We do not test on GPU with Windows, however. If you find a solution please let us know; we don't have any Windows GPU machines available to debug ourselves.
👋 Hello, this issue has been automatically marked as stale because it has not had recent activity. Please note it will be closed if no further activity occurs.
Feel free to inform us of any other issues you discover or feature requests that come to mind in the future. Pull Requests (PRs) are also always welcomed! Thank you for your contributions to YOLOv5 🚀 and Vision AI ⭐!
I am also seeing nan when training on a Windows computer using the GPU and yolov5m.pt weights. I first tried this in an Anaconda Python 3.9 virtual environment as per the instructions. Today I got the Docker image (ultralytics/yolov5:latest) running on the same system, as suggested by @glenn-jocher above. In this setup I am also getting nan for box_loss, obj_loss, and cls_loss. I have not run the training to completion, since in a non-GPU run those values are populated with numbers immediately.
When I switch to yolov5s.pt weights (in the Docker container), I get real values for the various training outputs.
Windows 11
When I start up the container it complains that the system is using driver version 516.94, which supports CUDA 11.7, while the container was built with CUDA 11.8, so it runs in "Minor Version Compatibility mode".
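The startup message boils down to comparing two CUDA versions: the highest one the installed driver supports (11.7 here) and the one the container was built with (11.8). A minimal sketch of that comparison follows, assuming the usual rule that a newer toolkit within the same major release runs in minor version compatibility mode. The helper names and the return labels are made up for illustration and are not part of any NVIDIA tool.

```python
def parse_version(text):
    """Turn a 'major.minor' string such as '11.7' into a tuple of ints."""
    major, minor = text.strip().split(".")[:2]
    return int(major), int(minor)


def cuda_compat_mode(driver_cuda, toolkit_cuda):
    """Classify a driver/toolkit pairing (hypothetical helper, illustrative labels)."""
    d_major, d_minor = parse_version(driver_cuda)
    t_major, t_minor = parse_version(toolkit_cuda)
    if (d_major, d_minor) >= (t_major, t_minor):
        return "native"  # driver supports at least the toolkit version
    if d_major == t_major:
        return "minor-version-compatibility"  # same major release, older driver
    return "driver-too-old"  # different major release


# Versions quoted in the comment above: driver supports 11.7, container built with 11.8.
print(cuda_compat_mode("11.7", "11.8"))
```

This only classifies the version pairing; it says nothing about whether a particular kernel behaves correctly in that mode.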
@cswingley minor compatibility mode should be fine. You might just be running into Windows CUDA issues, which are not uncommon. If your dataset trains normally in a Linux environment like Colab, then there might be something wrong with your local environment. If it doesn't train correctly in Docker or Colab, then there's likely a dataset issue, or you've modified the default hyps in a way that causes instabilities.
YOLOv5 Component
Training
Bug
I followed the instructions at https://blog.paperspace.com/train-yolov5-custom-data/ and got CPU training to work, although it was slow (over 12 hours to complete). Then I tried GPU training, starting with:
conda install pytorch torchvision torchaudio cudatoolkit=11.6 -c pytorch -c conda-forge
and then followed the YOLOv5 instructions above again, starting with
git clone https://github.com/ultralytics/yolov5.git
I did have to add one line to "hyp.scratch.yaml" to fix a "copy-paste" error, as described here:
#4827 (comment)
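The one-line fix amounts to giving the hyperparameter file a copy_paste entry so the lookup no longer raises KeyError. A minimal sketch of the same idea on an already-loaded dict is below. The helper name fill_missing_hyps is hypothetical, and the default of 0.0 is the value shown for copy_paste in the hyperparameters line of the training log.

```python
def fill_missing_hyps(hyp, defaults):
    """Return a copy of hyp with any missing keys filled in from defaults."""
    patched = dict(hyp)
    for key, value in defaults.items():
        patched.setdefault(key, value)  # existing values are never overwritten
    return patched


# A few entries standing in for an older hyp.scratch.yaml that lacks copy_paste.
old_hyp = {"lr0": 0.01, "mosaic": 1.0, "mixup": 0.0}
patched = fill_missing_hyps(old_hyp, {"copy_paste": 0.0})
print(patched["copy_paste"])
```

Editing the YAML file directly, as done here, has the same effect; the sketch just makes explicit that nothing else in the file changes.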
Now the training command runs without any obvious error, but the progress stats always show 'nan' for the training losses and 0 for mAP50, and after training finishes, detection finds no objects. However, GPU mode does work for detection using my earlier CPU-trained weights, and it runs about 10x faster than on CPU, so the GPU is clearly present and usable.
python train.py --img 640 --cfg yolov5s.yaml --hyp hyp.scratch.yaml --batch 4 --epochs 100 --data road_sign_data.yaml --weights yolov5s.pt --workers 24 --name yolo_road_det
train: weights=yolov5s.pt, cfg=yolov5s.yaml, data=road_sign_data.yaml, hyp=hyp.scratch.yaml, epochs=100, batch_size=4, imgsz=640, rect=False, resume=False, nosave=False, noval=False, noautoanchor=False, noplots=False, evolve=None, bucket=, cache=None, image_weights=False, device=, multi_scale=False, single_cls=False, optimizer=SGD, sync_bn=False, workers=24, project=runs\train, name=yolo_road_det, exist_ok=False, quad=False, cos_lr=False, label_smoothing=0.0, patience=100, freeze=[0], save_period=-1, seed=0, local_rank=-1, entity=None, upload_dataset=False, bbox_interval=-1, artifact_alias=latest
github: up to date with https://github.com/ultralytics/yolov5
YOLOv5 v6.2-185-ge4398cf Python-3.9.13 torch-1.12.1 CUDA:0 (Quadro T1000, 4096MiB)
hyperparameters: lr0=0.01, lrf=0.2, momentum=0.937, weight_decay=0.0005, warmup_epochs=3.0, warmup_momentum=0.8, warmup_bias_lr=0.1, box=0.05, cls=0.5, cls_pw=1.0, obj=1.0, obj_pw=1.0, iou_t=0.2, anchor_t=4.0, fl_gamma=0.0, hsv_h=0.015, hsv_s=0.7, hsv_v=0.4, degrees=0.0, translate=0.1, scale=0.5, shear=0.0, perspective=0.0, flipud=0.0, fliplr=0.5, mosaic=1.0, mixup=0.0, copy_paste=0.0
Weights & Biases: run 'pip install wandb' to automatically track and visualize YOLOv5 runs in Weights & Biases
ClearML: run 'pip install clearml' to automatically track, visualize and remotely train YOLOv5 in ClearML
Comet: run 'pip install comet_ml' to automatically track and visualize YOLOv5 runs in Comet
TensorBoard: Start with 'tensorboard --logdir runs\train', view at http://localhost:6006/
Overriding model.yaml nc=80 with nc=4
{ ... }
Starting training for 100 epochs...
C:\Users\bealej\Miniconda3\lib\site-packages\torch\optim\lr_scheduler.py:131: UserWarning: Detected call of `lr_scheduler.step()` before `optimizer.step()`. In PyTorch 1.1.0 and later, you should call them in the opposite order: `optimizer.step()` before `lr_scheduler.step()`. Failure to do this will result in PyTorch skipping the first value of the learning rate schedule. See more details at https://pytorch.org/docs/stable/optim.html#how-to-adjust-learning-rate
warnings.warn("Detected call of `lr_scheduler.step()` before `optimizer.step()`. "
Class Images Instances P R mAP50 mAP50-95: 100%|██████████| 11/11 [00:03<00:00, 3.58it/s]
all 88 132 0 0 0 0
{... and so forth and so on}
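To pin down when the losses first go bad, the per-epoch results can be scanned for NaN rather than read off the progress bar. The sketch below assumes the run directory contains a CSV of per-epoch metrics with one header row; the column names in the sample are assumptions, and the sample rows are made up rather than taken from this run.

```python
import csv
import io
import math


def first_nan_rows(csv_text):
    """Map each column name to the index of the first data row whose value is NaN."""
    reader = csv.reader(io.StringIO(csv_text))
    header = [name.strip() for name in next(reader)]
    first_nan = {}
    for row_index, row in enumerate(reader):
        for name, cell in zip(header, row):
            try:
                value = float(cell)
            except ValueError:
                continue  # skip non-numeric cells
            if math.isnan(value) and name not in first_nan:
                first_nan[name] = row_index
    return first_nan


# Made-up sample shaped like a per-epoch metrics file; not data from the run above.
sample = """epoch, train/box_loss, train/obj_loss, metrics/mAP_0.5
0, nan, nan, 0
1, nan, nan, 0
"""
print(first_nan_rows(sample))
```

A loss that is NaN from row 0 onward, as in this report, points at the very first training steps rather than at a slow divergence.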
Environment
YOLOv5 v6.2-185-ge4398cf Python-3.9.13 torch-1.12.1 CUDA:0 (Quadro T1000, 4096MiB)
Windows 10 Version 21H2 (OS Build 19044.2006)
Python 3.9.13 | packaged by conda-forge | (main, May 27 2022, 16:51:29) [MSC v.1929 64 bit (AMD64)] on win32
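Since the thread points toward an environment problem, it can help to collect these details the same way on every machine being compared. This is a minimal sketch, assuming PyTorch may or may not be importable; the function name environment_report and the dictionary keys are illustrative choices.

```python
import platform
import sys


def environment_report():
    """Collect the version details that matter for a CPU-vs-GPU comparison."""
    report = {
        "python": sys.version.split()[0],
        "os": platform.platform(),
        "torch": None,
        "cuda_build": None,
        "cuda_available": False,
        "gpu": None,
    }
    try:
        import torch
    except ImportError:
        return report  # torch is not installed; GPU fields stay empty
    report["torch"] = torch.__version__
    report["cuda_build"] = torch.version.cuda  # None for CPU-only builds
    report["cuda_available"] = torch.cuda.is_available()
    if report["cuda_available"]:
        report["gpu"] = torch.cuda.get_device_name(0)
    return report


for key, value in environment_report().items():
    print(f"{key}: {value}")
```

Running it once in the local conda environment and once inside the Docker container gives two reports that can be compared line by line.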
Minimal Reproducible Example
I am following the steps exactly as described at https://blog.paperspace.com/train-yolov5-custom-data/ with the one additional line/bugfix for KeyError: 'copy_paste' as described above, and I reduced batch from 32 to 4 due to memory limitations.
python train.py --img 640 --cfg yolov5s.yaml --hyp hyp.scratch.yaml --batch 4 --epochs 100 --data road_sign_data.yaml --weights yolov5s.pt --workers 24 --name yolo_road_det
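Because the same command trains on CPU and fails on GPU, a short side-by-side run on each device is a cheap way to confirm the difference. The sketch below only builds and prints the two command lines. It reuses the flags from the command above and adds --device, which appears as device= in the argument dump; the one-epoch setting and the run names are choices made for illustration.

```python
import shlex


def build_train_command(device, epochs=1, name="nan_check"):
    """Build the train.py argument list from the report, pinned to one device."""
    return [
        "python", "train.py",
        "--img", "640",
        "--cfg", "yolov5s.yaml",
        "--hyp", "hyp.scratch.yaml",
        "--batch", "4",
        "--epochs", str(epochs),
        "--data", "road_sign_data.yaml",
        "--weights", "yolov5s.pt",
        "--workers", "24",
        "--name", f"{name}_{device}",
        "--device", device,
    ]


# Print one command per device; "0" selects the first CUDA device.
for device in ("cpu", "0"):
    print(shlex.join(build_train_command(device)))
```

The printed lines can be pasted into a shell, or the lists can be passed to subprocess.run from the repository directory.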
Additional
No response