
NaN tensor values problem for GTX16xx users (no problem on other devices) #7908

Closed · Fixed by #7917 or #8804
YipKo opened this issue May 20, 2022 · 33 comments
Labels: bug (Something isn't working)

@YipKo commented May 20, 2022

Search before asking

  • I have searched the YOLOv5 issues and found no similar bug report.

YOLOv5 Component

Training, Validation

Bug

I used YOLOv5 to train on the demo dataset (coco128) and found that the box and obj losses are NaN, and no detections appear on the validation images. This only happens on GTX 1660 Ti devices in GPU mode; when I train on the CPU, on Google Colab (Tesla K80), or on an RTX 2070, everything works fine.

Environment

  • Windows 10 10.0.19044.1706
  • YOLOv5 v6.1
  • NVIDIA GTX 1660 Ti, 6 GB
  • Python 3.9
  • cudatoolkit-11.3.1
  • pytorch-1.11.0-py3.9_cuda11.3_cudnn8_0
  • (also tried pytorch-1.11.0-py3.9_cuda11.5_cudnn8_0)
  • (all dependencies installed correctly)

Minimal Reproducible Example

The command used for training is
python train.py

Additional

Other issues also discuss this same problem.

However, I have tried PyTorch built against CUDA 11.5 (whose cuDNN version is 8.3.0 > 8.2.2), and I also tried downloading cuDNN from NVIDIA and copying the DLL files into the relevant folder in torch/lib; the problem still persists.

Another workaround is to downgrade to a PyTorch build with CUDA 10.2 (tested, and it works), but this is currently not feasible as CUDA 10.2 PyTorch builds are no longer available for Windows.

Are you willing to submit a PR?

  • Yes I'd like to help by submitting a PR!
YipKo added the bug label on May 20, 2022
@github-actions bot (Contributor) commented May 20, 2022

👋 Hello @MarkDeia, thank you for your interest in YOLOv5 🚀! Please visit our ⭐️ Tutorials to get started, where you can find quickstart guides for simple tasks like Custom Data Training all the way to advanced concepts like Hyperparameter Evolution.

If this is a 🐛 Bug Report, please provide screenshots and minimum viable code to reproduce your issue, otherwise we cannot help you.

If this is a custom training ❓ Question, please provide as much information as possible, including dataset images, training logs, screenshots, and a public link to online W&B logging if available.

For business inquiries or professional support requests please visit https://ultralytics.com or email support@ultralytics.com.

Requirements

Python>=3.7.0 with all requirements.txt dependencies installed, including PyTorch>=1.7. To get started:

git clone https://github.com/ultralytics/yolov5  # clone
cd yolov5
pip install -r requirements.txt  # install

Environments

YOLOv5 may be run in any of the following up-to-date verified environments (with all dependencies including CUDA/CUDNN, Python and PyTorch preinstalled); the environment links are omitted here.

Status

If the CI badge is green, all YOLOv5 GitHub Actions Continuous Integration (CI) tests are currently passing. CI tests verify correct operation of YOLOv5 training (train.py), validation (val.py), inference (detect.py) and export (export.py) on macOS, Windows, and Ubuntu every 24 hours and on every commit.

@glenn-jocher (Member)

@MarkDeia you may be able to work around this by disabling AMP in train.py. Anywhere that says enabled=cuda set to enabled=False
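
A minimal, self-contained sketch of that workaround (the toy model and tensors here are illustrative, not YOLOv5's code; in train.py the same enabled=False argument goes to the existing GradScaler and autocast calls):

import torch

device = 'cuda' if torch.cuda.is_available() else 'cpu'
model = torch.nn.Linear(10, 10).to(device)   # illustrative stand-in for the YOLOv5 model
x = torch.randn(4, 10, device=device)

scaler = torch.cuda.amp.GradScaler(enabled=False)  # was: enabled=cuda
with torch.cuda.amp.autocast(enabled=False):       # was: enabled=cuda
    loss = model(x).sum()                          # forward pass runs in full precision

scaler.scale(loss).backward()  # scaling is a no-op when the scaler is disabled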

@YipKo (Author) commented May 20, 2022

> @MarkDeia you may be able to work around this by disabling AMP in train.py. Anywhere that says enabled=cuda set to enabled=False

@glenn-jocher Thanks for your reply. By turning off the automatic mixed precision function, the box/obj/cls values are back to normal, but the P, R, and mAP values during validation are still 0.

At first I thought the problem was the CUDA/cuDNN dependency that ships with PyTorch, but NVIDIA claims this problem was fixed in cuDNN 8.2.2.
Using the code from this issue (via MobileNetV2), I tested in a pytorch_with_cuda113 environment (with the cuDNN 8.2.2 DLL files copied in), and the outputs are normal.

I am very confused: the AMP and FP16 values seem to be fine. It looks like the problem of returning NaN at half precision has been fixed, yet the problem still exists in YOLOv5 training and validation.
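
For reference, a minimal sketch of that kind of FP16 sanity test (the exact script is in the linked issue; this toy version just checks a torchvision MobileNetV2 for NaNs at half precision):

import torch
import torchvision

model = torchvision.models.mobilenet_v2().cuda().half().eval()
x = torch.randn(1, 3, 224, 224, device='cuda', dtype=torch.half)
with torch.no_grad():
    y = model(x)
print(torch.isnan(y).any())  # tensor(False) on a healthy setup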

Also, detection works well with python detect.py.

@glenn-jocher (Member)

@MarkDeia 0 labels means you have zero labels. Without labels there won't be any metrics obviously.

@YipKo (Author) commented May 20, 2022

> @MarkDeia 0 labels means you have zero labels. Without labels there won't be any metrics obviously.
@glenn-jocher In fact I believe the labels of the validation dataset are set correctly, since in coco128 the validation dataset is the same as the training dataset.
As I mentioned, with the previous version of PyTorch (pytorch==1.10.1 torchvision==0.11.2 torchaudio==0.10.1 cudatoolkit=10.2) everything ran correctly; when I switched to PyTorch 1.11 (CUDA 11.3) via conda, the problem appeared. The code and dataset did not change at all between the two runs, and I launched both with python train.py.

@glenn-jocher (Member)

@MarkDeia they're two separate issues. Labels 0 indicates that there are simply no labels in your validation set, which has nothing to do with CUDA or your environment or hardware. There is no fundamental problem with detecting labels, as your training has box and cls losses.

@YipKo (Author) commented May 20, 2022

> @MarkDeia they're two separate issues. Labels 0 indicates that there are simply no labels in your validation set, which has nothing to do with CUDA or your environment or hardware. There is no fundamental problem with detecting labels, as your training has box and cls losses.

@glenn-jocher I don't quite understand, since I am new to this. What is causing there to be no labels in my validation set?

@glenn-jocher (Member) commented May 21, 2022

Your dataset is structured incorrectly. To train correctly your data must be in YOLOv5 format. Please see our Train Custom Data tutorial for full documentation on dataset setup and all steps required to start training your first model. A few excerpts from the tutorial:

1.1 Create dataset.yaml

COCO128 is an example small tutorial dataset composed of the first 128 images in COCO train2017. These same 128 images are used for both training and validation to verify our training pipeline is capable of overfitting. data/coco128.yaml, shown below, is the dataset config file that defines 1) the dataset root directory path and relative paths to train / val / test image directories (or *.txt files with image paths), 2) the number of classes nc and 3) a list of class names:

# Train/val/test sets as 1) dir: path/to/imgs, 2) file: path/to/imgs.txt, or 3) list: [path/to/imgs1, path/to/imgs2, ..]
path: ../datasets/coco128  # dataset root dir
train: images/train2017  # train images (relative to 'path') 128 images
val: images/train2017  # val images (relative to 'path') 128 images
test:  # test images (optional)

# Classes
nc: 80  # number of classes
names: [ 'person', 'bicycle', 'car', 'motorcycle', 'airplane', 'bus', 'train', 'truck', 'boat', 'traffic light',
         'fire hydrant', 'stop sign', 'parking meter', 'bench', 'bird', 'cat', 'dog', 'horse', 'sheep', 'cow',
         'elephant', 'bear', 'zebra', 'giraffe', 'backpack', 'umbrella', 'handbag', 'tie', 'suitcase', 'frisbee',
         'skis', 'snowboard', 'sports ball', 'kite', 'baseball bat', 'baseball glove', 'skateboard', 'surfboard',
         'tennis racket', 'bottle', 'wine glass', 'cup', 'fork', 'knife', 'spoon', 'bowl', 'banana', 'apple',
         'sandwich', 'orange', 'broccoli', 'carrot', 'hot dog', 'pizza', 'donut', 'cake', 'chair', 'couch',
         'potted plant', 'bed', 'dining table', 'toilet', 'tv', 'laptop', 'mouse', 'remote', 'keyboard', 'cell phone',
         'microwave', 'oven', 'toaster', 'sink', 'refrigerator', 'book', 'clock', 'vase', 'scissors', 'teddy bear',
         'hair drier', 'toothbrush' ]  # class names

1.2 Create Labels

After using a tool like Roboflow Annotate to label your images, export your labels to YOLO format, with one *.txt file per image (if no objects in image, no *.txt file is required). The *.txt file specifications are:

  • One row per object
  • Each row is class x_center y_center width height format.
  • Box coordinates must be in normalized xywh format (from 0 - 1). If your boxes are in pixels, divide x_center and width by image width, and y_center and height by image height.
  • Class numbers are zero-indexed (start from 0).

(Example image omitted.) The label file corresponding to the example image contains 2 persons (class 0) and a tie (class 27).
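
For illustration, such a label file might look like the following (one row per object in class x_center y_center width height format; these coordinates are made-up normalized values, not the real coco128 labels):

0 0.481 0.634 0.690 0.713
0 0.741 0.524 0.314 0.933
27 0.364 0.795 0.116 0.141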

1.3 Organize Directories

Organize your train and val images and labels according to the example below. YOLOv5 assumes /coco128 is inside a /datasets directory next to the /yolov5 directory. YOLOv5 locates labels automatically for each image by replacing the last instance of /images/ in each image path with /labels/. For example:

../datasets/coco128/images/im0.jpg  # image
../datasets/coco128/labels/im0.txt  # label

Good luck 🍀 and let us know if you have any other questions!

@YipKo (Author) commented May 21, 2022

@glenn-jocher I think you may not have understood me. I ran the same code and the same dataset in both the pytorch_cuda11.3 and pytorch_cuda10.2 environments, yet the problem only occurred in the pytorch_cuda11.3 environment. Furthermore, I was using the YOLOv5 demo dataset (coco128), so there should be no problem with the structure of my dataset (I confirm my coco128 data is in YOLOv5 format).
I am puzzled by this result, but since only the environment differed between the runs, I am inclined to think there is no problem with my dataset or the integrity of the code.

In any case, it is certain that part of the problem comes from the autocast function in torch\cuda\amp\autocast_mode.py.

@glenn-jocher (Member) commented May 21, 2022

@MarkDeia well, I can't really say what the issue might be. If you can help us recreate the problem with a minimum reproducible example we could get started debugging it, but given your hardware I don't think it's reproducible in other environments.

In any case I'd always recommend running in our Docker image if you are having issues with a local environment. See https://docs.ultralytics.com/yolov5/environments/docker_image_quickstart_tutorial/

@YipKo (Author) commented May 21, 2022

> @MarkDeia well, I can't really say what the issue might be. If you can help us recreate the problem with a minimum reproducible example we could get started debugging it, but given your hardware I don't think it's reproducible in other environments.
>
> In any case I'd always recommend running in our Docker image if you are having issues with a local environment. See https://docs.ultralytics.com/yolov5/environments/docker_image_quickstart_tutorial/

@glenn-jocher Thank you for your patience amid a busy schedule. 👍
Given how specific this is (NVIDIA GTX 16xx only), anyone who encounters the same problem, please chime in. By the way, maybe you should get more sleep :) After all, it's the weekend.

@glenn-jocher (Member) commented May 21, 2022

@MarkDeia well, what we can do, which won't solve your problem but will probably help a lot of people, is run a check before training to make sure everything works correctly, and if not, refer users to this issue or a tutorial about their options.

There have definitely been multiple users who have run into issues, usually with a combination of CUDA 11, Windows, Conda and consumer cards.

I'm not sure what the minimum test might be; after all, we don't want to run a short COCO128 training before everyone's actual trainings, as that would probably do more harm than good. OK, I've got it: we can run inference with and without AMP, and the check will be a torch.allclose() on the outputs. If you run this on your system, what do you see? On Colab we get the same detections, with boxes accurate to <1 pixel.

# PyTorch Hub
import torch

# Model
model = torch.hub.load('ultralytics/yolov5', 'yolov5s')

# Images
dir = 'https://ultralytics.com/images/'
imgs = [dir + f for f in ('zidane.jpg', 'bus.jpg')]  # batch of images

# Inference
results = model(imgs)
model.amp = True
results_amp = model(imgs)
print(results.xyxy[0] - results_amp.xyxy[0])

tensor([[-0.44983, -0.21283,  0.20471, -0.35834, -0.00050,  0.00000],
        [ 0.05951,  0.02808, -0.19067,  0.33899, -0.00065,  0.00000],
        [-0.05856, -0.06934, -0.00732,  0.04700,  0.00124,  0.00000],
        [-0.10693,  0.35675,  0.36877,  0.09174, -0.00141,  0.00000]], device='cuda:0')
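
For reference, a sketch of how that comparison could be wrapped into a pre-training check (an illustration of the idea, not the exact check_amp implementation that later landed in PR #7917; amp_ok is a hypothetical helper name):

import torch

def amp_ok(model, im, atol=1.0):
    # Run the same image through the model with AMP off and on, then compare detections.
    # `model` is a YOLOv5 PyTorch Hub (AutoShape) model, which exposes an `amp` attribute.
    model.amp = False
    fp32 = model(im).xyxy[0]
    model.amp = True
    amp = model(im).xyxy[0]
    if fp32.shape != amp.shape:        # e.g. AMP returned no detections at all
        return False
    return torch.allclose(fp32, amp, atol=atol)

model = torch.hub.load('ultralytics/yolov5', 'yolov5s')
print(amp_ok(model, 'https://ultralytics.com/images/zidane.jpg'))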

@YipKo (Author) commented May 21, 2022

@glenn-jocher Running

print(results_amp.xyxy[0])
print(results.xyxy[0])

the outputs are as follows:

tensor([], device='cuda:0', size=(0, 6))
tensor([[7.42550e+02, 4.80370e+01, 1.14120e+03, 7.16642e+02, 8.81825e-01, 0.00000e+00],
        [4.42060e+02, 4.37528e+02, 4.96809e+02, 7.09839e+02, 6.87341e-01, 2.70000e+01],
        [1.25191e+02, 1.93681e+02, 7.11992e+02, 7.13047e+02, 6.39421e-01, 0.00000e+00],
        [9.82893e+02, 3.08357e+02, 1.02737e+03, 4.20092e+02, 2.62013e-01, 2.70000e+01]], device='cuda:0')

Since they have different dimensions, they cannot be subtracted, and from the result we can tell that something apparently went wrong when running the AMP function. I will continue to try to find the root of the problem, but it may take a few weeks as I can only debug in my spare time.

@glenn-jocher (Member)

@MarkDeia perfect! That's all I need. I'll work on a PR.

glenn-jocher self-assigned this on May 21, 2022
glenn-jocher linked a pull request on May 21, 2022 that will close this issue
@glenn-jocher (Member) commented May 21, 2022

@MarkDeia can you run this code and verify that you get an AMP failure notice before training starts? This tests PR #7917 which automatically disables AMP if the two image results don't match just as I proposed earlier. This won't solve all the problems but hopefully it will help many users.


git clone https://github.com/ultralytics/yolov5 -b amp_check  # clone
cd yolov5
python train.py --epochs 3

@YipKo (Author) commented May 22, 2022

> @MarkDeia can you run this code and verify that you get an AMP failure notice before training starts? This tests PR #7917 which automatically disables AMP if the two image results don't match just as I proposed earlier. This won't solve all the problems but hopefully it will help many users.
>
> git clone https://github.com/ultralytics/yolov5 -b amp_check  # clone
> cd yolov5
> python train.py --epochs 3

@glenn-jocher Glad you added AMP verification. I still have problems with validation even after turning off AMP, but as you say, this won't solve all the problems, though hopefully it will help many users.
There is a slight error in the check_amp function in this PR, which I have commented on under #7917.

@tahvane1 commented Jun 5, 2022

I have this same issue on a 1080 Ti. Even after the fix you issued, labels are sometimes zeroed after training for a while. I also tried the --device cpu flag and got zero labels at some point as well. Sometimes training succeeds with the GPU...

@abadia24 commented Jun 20, 2022

Same issue here. I followed the tutorial, and these are the results from 1.5 hours of training. I also have a laptop GTX 1660 Ti.
results.csv

@glenn-jocher (Member) commented Jun 21, 2022

@abadia24 NaNs are unrecoverable, so if you ever see an epoch with them you can terminate immediately, as the rest of training will contain them.

In the meantime you might try training in Docker, which is a self-contained Linux environment with everything verified to work correctly.


glenn-jocher removed the TODO label on Jun 21, 2022
@evo11x commented Jun 21, 2022

It seems this problem is much broader. I have the same NaN problem running Ray with PyTorch on an RTX 3060 with CUDA 11.x (Windows 11 and Ubuntu 20).
I tried many CUDA version combinations with older PyTorch (everything except CUDA 10.x) with no success.
The problem appears when using multithreading with higher GPU throughput: the faster you run it, the sooner it crashes.

@mhw-Parker commented Jul 12, 2022

Hi, I've run into the same problem.
System: Windows 10
GPU: NVIDIA GTX 1660 Ti
CUDA: CUDA & cudatoolkit 11.3
PyTorch: 1.11

I have already used torch.cuda.is_available() to confirm that my GPU environment was set up successfully, and I can run detect.py on the GPU. But when I try to train on my own custom data, an error occurs with --device 0 (GPU); with --device cpu it trains normally (screenshots omitted). Can't I train on my GTX 1660 Ti?

@glenn-jocher (Member) commented Jul 12, 2022

@mhw-Parker are you using the latest version of YOLOv5? What does your AMP check say before training starts?

@Raziel619

Hi, I also have the same issue. I'm running the following:

YOLOv5 version: Latest from master (07/30/2022)
PyTorch version: 1.12.0
CUDA version: 11.6
GPU: GTX 1660

All AMP checks passed. When I run the same script with the same dataset on the CPU, I get valid results.

Note: I had to replace the torch requirements from the repo with the following for torch.cuda.is_available() to return True:
torch==1.12.0+cu116
torchaudio==0.12.0+cu116
torchvision==0.13.0+cu116

@glenn-jocher (Member) commented Jul 30, 2022

@Raziel619 that's strange that the AMP checks passed yet you're still seeing problems. You might try disabling AMP completely by setting amp=False here, i.e. simulating an AMP check failure:

yolov5/train.py, line 128 (commit 1e89807):

amp = check_amp(model)  # check AMP
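
A sketch of the suggested one-line edit (illustrative, applied to that line of train.py):

# amp = check_amp(model)  # original: run the AMP check
amp = False               # force-disable AMP, simulating a failed check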

@glenn-jocher (Member) commented Jul 30, 2022

@Raziel619 you might also try training inside the Docker image for the best stability.


@Raziel619

I disabled AMP completely and it did improve results somewhat: I'm no longer getting NaNs for train/box_loss, train/obj_loss, and train/cls_loss, but I'm getting all zeros or NaNs for almost everything else. See attached results.
results.csv

@glenn-jocher (Member) commented Jul 31, 2022

@Raziel619 hmm. Validation is done at half precision; maybe try adding half=False here to val.run()?

yolov5/train.py, lines 367 to 377 (commit 1e89807):

results, maps, _ = val.run(data_dict,
                           batch_size=batch_size // WORLD_SIZE * 2,
                           imgsz=imgsz,
                           model=ema.ema,
                           single_cls=single_cls,
                           dataloader=val_loader,
                           save_dir=save_dir,
                           plots=False,
                           callbacks=callbacks,
                           compute_loss=compute_loss)
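
The suggested change, sketched against that same call (only the half keyword is added):

results, maps, _ = val.run(data_dict,
                           batch_size=batch_size // WORLD_SIZE * 2,
                           imgsz=imgsz,
                           half=False,  # validate at FP32 instead of FP16
                           model=ema.ema,
                           single_cls=single_cls,
                           dataloader=val_loader,
                           save_dir=save_dir,
                           plots=False,
                           callbacks=callbacks,
                           compute_loss=compute_loss)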

@Raziel619

That fixes it! Yay! Thank you so much for the swift responses; I'm super excited to get started with some training.

@glenn-jocher (Member)

@Raziel619 good news 😃! Your original issue may now be partially resolved ✅ in PR #8804. This PR doesn't resolve the root cause, but it does disable FP16 validation if AMP checks fail, or simply if you manually set amp=False.

To receive this update:

  • Git – git pull from within your yolov5/ directory, or git clone https://github.com/ultralytics/yolov5 again
  • PyTorch Hub – force-reload with model = torch.hub.load('ultralytics/yolov5', 'yolov5s', force_reload=True)
  • Notebooks – view the updated notebooks (Colab, Kaggle)
  • Docker – sudo docker pull ultralytics/yolov5:latest to update your image

Thank you for spotting this issue and informing us of the problem. Please let us know if this update resolves the issue for you, and feel free to inform us of any other issues you discover or feature requests that come to mind. Happy trainings with YOLOv5 🚀!

@Yang-Jianzhang

Hi, I think this issue doesn't only happen on consumer cards, because I also have the same issue training on 8x Tesla V100.

When turning off AMP, the training time doubles.

I think the best way to handle this problem is downgrading CUDA to 10.x.

@Caterina1996 commented Dec 13, 2022

Hello! I was having the same problem on an NVIDIA GeForce GTX 1650 on Ubuntu 20, in a conda environment with CUDA 11: I was getting NaNs when training on coco128. The easiest way to solve it for me was switching to CUDA 10 in my environment with

conda install pytorch torchvision cuda100 -c pytorch

@Tommyisr

I disabled cuDNN in PyTorch and it solved the issue with NaN values, but I'm not sure whether it'll affect the performance of the training process.

torch.backends.cudnn.enabled = False

Windows 10
YOLOv8
NVIDIA GTX 1660 Super
Conda env
Python 3.9
cudatoolkit-11.3.1
pytorch-1.12

@glenn-jocher (Member)

@Tommyisr thank you for sharing your experience with the community! Disabling cuDNN can indeed resolve the NaN issue for some users, but it may come with a performance tradeoff. We recommend monitoring the training process to evaluate whether there are noticeable impacts on performance.

For anyone encountering similar issues, please feel free to try the solutions mentioned here and share your results. Your feedback helps the community improve the overall YOLOv5 experience.

For more information and troubleshooting tips, please refer to the Ultralytics YOLOv5 Documentation. If you have any further questions or issues, don't hesitate to reach out. Happy training!
