
some bugs when training #1547

Closed
wuzuiyuzui opened this issue Nov 28, 2020 · 6 comments
Labels: bug, Stale

Comments

@wuzuiyuzui

Hello, I ran into a difficult problem when using YOLOv5, and reinstalling the system did not help. I am very confused about what is happening, and it has been bothering me for several days. I have closed my previous issue and am reposting here with the detailed error output. Can you give me some help? I can run test and detect, but I cannot train.

🐛 Bug

Training crashes with RuntimeError: cuDNN error: CUDNN_STATUS_INTERNAL_ERROR as soon as the first epoch starts; test and detect run fine.

To Reproduce (REQUIRED)

Input:

Namespace(adam=False, batch_size=16, bucket='', cache_images=False, cfg='', data='data/coco128.yaml', device='', epochs=300, evolve=False, exist_ok=False, global_rank=-1, hyp='data/hyp.scratch.yaml', image_weights=False, img_size=[640, 640], local_rank=-1, log_imgs=16, multi_scale=False, name='exp', noautoanchor=False, nosave=False, notest=False, project='runs/train', rect=False, resume=False, save_dir='runs/train/exp10', single_cls=False, sync_bn=False, total_batch_size=16, weights='yolov5s.pt', workers=8, world_size=1)

Output:
Using torch 1.7.0+cu101 CUDA:0 (GeForce RTX 2080 Ti, 10997MB)

Namespace(adam=False, batch_size=16, bucket='', cache_images=False, cfg='', data='data/coco128.yaml', device='', epochs=300, evolve=False, exist_ok=False, global_rank=-1, hyp='data/hyp.scratch.yaml', image_weights=False, img_size=[640, 640], local_rank=-1, log_imgs=16, multi_scale=False, name='exp', noautoanchor=False, nosave=False, notest=False, project='runs/train', rect=False, resume=False, save_dir='runs/train/exp10', single_cls=False, sync_bn=False, total_batch_size=16, weights='yolov5s.pt', workers=8, world_size=1)
Start Tensorboard with "tensorboard --logdir runs/train", view at http://localhost:6006/
Hyperparameters {'lr0': 0.01, 'lrf': 0.2, 'momentum': 0.937, 'weight_decay': 0.0005, 'warmup_epochs': 3.0, 'warmup_momentum': 0.8, 'warmup_bias_lr': 0.1, 'box': 0.05, 'cls': 0.5, 'cls_pw': 1.0, 'obj': 1.0, 'obj_pw': 1.0, 'iou_t': 0.2, 'anchor_t': 4.0, 'fl_gamma': 0.0, 'hsv_h': 0.015, 'hsv_s': 0.7, 'hsv_v': 0.4, 'degrees': 0.0, 'translate': 0.1, 'scale': 0.5, 'shear': 0.0, 'perspective': 0.0, 'flipud': 0.0, 'fliplr': 0.5, 'mosaic': 1.0, 'mixup': 0.0}
Overriding model.yaml nc=80 with nc=7

             from  n    params  module                                  arguments                     

0 -1 1 3520 models.common.Focus [3, 32, 3]
1 -1 1 18560 models.common.Conv [32, 64, 3, 2]
2 -1 1 19904 models.common.BottleneckCSP [64, 64, 1]
3 -1 1 73984 models.common.Conv [64, 128, 3, 2]
4 -1 1 161152 models.common.BottleneckCSP [128, 128, 3]
5 -1 1 295424 models.common.Conv [128, 256, 3, 2]
6 -1 1 641792 models.common.BottleneckCSP [256, 256, 3]
7 -1 1 1180672 models.common.Conv [256, 512, 3, 2]
8 -1 1 656896 models.common.SPP [512, 512, [5, 9, 13]]
9 -1 1 1248768 models.common.BottleneckCSP [512, 512, 1, False]
10 -1 1 131584 models.common.Conv [512, 256, 1, 1]
11 -1 1 0 torch.nn.modules.upsampling.Upsample [None, 2, 'nearest']
12 [-1, 6] 1 0 models.common.Concat [1]
13 -1 1 378624 models.common.BottleneckCSP [512, 256, 1, False]
14 -1 1 33024 models.common.Conv [256, 128, 1, 1]
15 -1 1 0 torch.nn.modules.upsampling.Upsample [None, 2, 'nearest']
16 [-1, 4] 1 0 models.common.Concat [1]
17 -1 1 95104 models.common.BottleneckCSP [256, 128, 1, False]
18 -1 1 147712 models.common.Conv [128, 128, 3, 2]
19 [-1, 14] 1 0 models.common.Concat [1]
20 -1 1 313088 models.common.BottleneckCSP [256, 256, 1, False]
21 -1 1 590336 models.common.Conv [256, 256, 3, 2]
22 [-1, 10] 1 0 models.common.Concat [1]
23 -1 1 1248768 models.common.BottleneckCSP [512, 512, 1, False]
24 [17, 20, 23] 1 32364 models.yolo.Detect [7, [[10, 13, 16, 30, 33, 23], [30, 61, 62, 45, 59, 119], [116, 90, 156, 198, 373, 326]], [128, 256, 512]]
Model Summary: 283 layers, 7271276 parameters, 7271276 gradients

Transferred 364/370 items from yolov5s.pt
Optimizer groups: 62 .bias, 70 conv.weight, 59 other
Scanning 'coco128/labels/train2017.cache' for images and labels... 3219 found, 0 missing, 20 empty, 0 corrupted: 100%|██████████| 3219/3219 [00:00<?, ?it/s]
Scanning 'coco128/labels/val.cache' for images and labels... 246 found, 2 missing, 0 empty, 0 corrupted: 100%|██████████| 248/248 [00:00<?, ?it/s]

Analyzing anchors... anchors/target = 4.42, Best Possible Recall (BPR) = 0.9894
Image sizes 640 train, 640 test
Using 8 dataloader workers
Logging results to runs/train/exp10
Starting training for 300 epochs...

 Epoch   gpu_mem       box       obj       cls     total   targets  img_size

0%| | 0/202 [00:01<?, ?it/s]
Traceback (most recent call last):
File "/home/ljy/yolov5-master/train.py", line 492, in
train(hyp, opt, device, tb_writer, wandb)
File "/home/ljy/yolov5-master/train.py", line 293, in train
scaler.scale(loss).backward()
File "/home/ljy/anaconda3/envs/yolo/lib/python3.8/site-packages/torch/tensor.py", line 221, in backward
torch.autograd.backward(self, gradient, retain_graph, create_graph)
File "/home/ljy/anaconda3/envs/yolo/lib/python3.8/site-packages/torch/autograd/init.py", line 130, in backward
Variable._execution_engine.run_backward(
RuntimeError: cuDNN error: CUDNN_STATUS_INTERNAL_ERROR
You can try to repro this exception using the following code snippet. If that doesn't trigger the error, please include your original repro script when reporting this issue.

import torch
torch.backends.cuda.matmul.allow_tf32 = True
torch.backends.cudnn.benchmark = True
torch.backends.cudnn.deterministic = False
torch.backends.cudnn.allow_tf32 = True
data = torch.randn([16, 256, 20, 20], dtype=torch.half, device='cuda', requires_grad=True)
net = torch.nn.Conv2d(256, 256, kernel_size=[3, 3], padding=[1, 1], stride=[1, 1], dilation=[1, 1], groups=1)
net = net.cuda().half()
out = net(data)
out.backward(torch.randn_like(out))
torch.cuda.synchronize()

ConvolutionParams
data_type = CUDNN_DATA_HALF
padding = [1, 1, 0]
stride = [1, 1, 0]
dilation = [1, 1, 0]
groups = 1
deterministic = false
allow_tf32 = true
input: TensorDescriptor 0x7f8ba4002b60
type = CUDNN_DATA_HALF
nbDims = 4
dimA = 16, 256, 20, 20,
strideA = 102400, 400, 20, 1,
output: TensorDescriptor 0x7f8ba40033a0
type = CUDNN_DATA_HALF
nbDims = 4
dimA = 16, 256, 20, 20,
strideA = 102400, 400, 20, 1,
weight: FilterDescriptor 0x7f8ba403e080
type = CUDNN_DATA_HALF
tensor_format = CUDNN_TENSOR_NCHW
nbDims = 4
dimA = 256, 256, 3, 3,
Pointer addresses:
input: 0x7f8a73b60000
output: 0x7f8c792e0000
weight: 0x7f8d5b660000

Process finished with exit code 1

Environment

  • OS: Ubuntu 20.04
  • GPU: GeForce RTX 2080 Ti
  • torch 1.7.0+cu101
  • CUDA 10.1
  • cuDNN 7.6.4
  • NVIDIA driver 440.95

(Screenshots attached: 2020-11-28 16-06-58, 16-07-04, 16-08-32, 16-08-37, 16-08-59)


wuzuiyuzui added the bug label Nov 28, 2020
@glenn-jocher
Member

glenn-jocher commented Nov 28, 2020

@wuzuiyuzui it appears you may have environment problems. Please ensure you meet all dependency requirements if you are attempting to run YOLOv5 locally. If in doubt, create a new virtual Python 3.8 environment, clone the latest repo (code changes daily), and pip install -r requirements.txt again. We also highly recommend using one of our verified environments below.

Requirements

Python 3.8 or later with all requirements.txt dependencies installed, including torch>=1.6. To install run:

$ pip install -r requirements.txt
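
Not part of the original comment, but as a quick follow-up check: a few lines of Python can confirm that the installed stack matches what the training log expects (torch version, CUDA device, cuDNN). This is a generic sanity check, not YOLOv5 code.

import torch

print(torch.__version__)               # e.g. 1.7.0+cu101, must be >= 1.6
print(torch.version.cuda)              # CUDA version the wheel was built against
print(torch.backends.cudnn.version())  # cuDNN version bundled with the wheel
print(torch.cuda.is_available())       # True if a usable CUDA device is found
if torch.cuda.is_available():
    print(torch.cuda.get_device_name(0))  # e.g. GeForce RTX 2080 Ti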

Environments

YOLOv5 may be run in any of the following up-to-date verified environments (with all dependencies including CUDA/CUDNN, Python and PyTorch preinstalled):

Status

CI CPU testing

If this badge is green, all YOLOv5 GitHub Actions Continuous Integration (CI) tests are passing. These tests evaluate proper operation of basic YOLOv5 functionality, including training (train.py), testing (test.py), inference (detect.py) and export (export.py) on MacOS, Windows, and Ubuntu.

@louis-she

In utils/datasets.py, find the line pin_memory=True and change it to pin_memory=False.
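
For context, an illustrative sketch (not the actual utils/datasets.py source): the only change suggested here is the pin_memory flag passed when the DataLoader is built, roughly like this.

import torch
from torch.utils.data import DataLoader, TensorDataset

# Toy dataset standing in for LoadImagesAndLabels; shapes chosen only for illustration.
dataset = TensorDataset(torch.zeros(8, 3, 640, 640), torch.zeros(8, 6))

loader = DataLoader(dataset,
                    batch_size=4,
                    num_workers=2,
                    pin_memory=False)  # was pin_memory=True; False is the suggested workaround

for imgs, labels in loader:
    pass  # training loop would go here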

@glenn-jocher
Member

@louis-she oh interesting. Does this have performance drawbacks, i.e. slower dataloading?

@louis-she

louis-she commented Nov 30, 2020

@glenn-jocher A possible cause of the bug is that the labels yielded by the dataloader do not have a fixed size, because of how the collate_fn concatenates the per-image targets.

I haven't tested the performance impact, but I don't think it should make much of a difference.

Anyway, I have opened a PR that leaves the option at its default (which is False).
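
A rough, self-contained illustration of the point above (a simplified stand-in, not the real LoadImagesAndLabels.collate_fn): images stack to a fixed shape, but the concatenated label tensor's length depends on how many targets the batch happens to contain, so its size varies from batch to batch.

import torch

def collate_fn(batch):
    # Simplified YOLO-style collate: stack images, concatenate labels.
    imgs, labels = zip(*batch)
    return torch.stack(imgs, 0), torch.cat(labels, 0)

# Two fake samples with different numbers of targets (2 vs 5 boxes).
batch = [(torch.zeros(3, 640, 640), torch.zeros(2, 6)),
         (torch.zeros(3, 640, 640), torch.zeros(5, 6))]

imgs, labels = collate_fn(batch)
print(imgs.shape)    # torch.Size([2, 3, 640, 640]) -- fixed per batch
print(labels.shape)  # torch.Size([7, 6]) -- changes with the number of targets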

@glenn-jocher
Member

@wuzuiyuzui please try PR #1555 and confirm that this fixes the problem for you.

@github-actions
Contributor

github-actions bot commented Jan 2, 2021

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.
