
some bugs when training #1547

Closed
wuzuiyuzui opened this issue Nov 28, 2020 · 6 comments
Labels: bug, Stale

Comments

@wuzuiyuzui

Hello, I ran into a difficult problem when using YOLOv5, and reinstalling the system did not help. I am very confused about what is happening, and it has been bothering me for several days. I have closed my previous issue and am reposting here with the detailed error output. Can you give me some help? I can run test and detect, but I cannot train.

🐛 Bug

Training crashes with RuntimeError: cuDNN error: CUDNN_STATUS_INTERNAL_ERROR as soon as the first epoch starts; test and detect run fine.

To Reproduce (REQUIRED)

Input:

Namespace(adam=False, batch_size=16, bucket='', cache_images=False, cfg='', data='data/coco128.yaml', device='', epochs=300, evolve=False, exist_ok=False, global_rank=-1, hyp='data/hyp.scratch.yaml', image_weights=False, img_size=[640, 640], local_rank=-1, log_imgs=16, multi_scale=False, name='exp', noautoanchor=False, nosave=False, notest=False, project='runs/train', rect=False, resume=False, save_dir='runs/train/exp10', single_cls=False, sync_bn=False, total_batch_size=16, weights='yolov5s.pt', workers=8, world_size=1)

Output:
Using torch 1.7.0+cu101 CUDA:0 (GeForce RTX 2080 Ti, 10997MB)

Namespace(adam=False, batch_size=16, bucket='', cache_images=False, cfg='', data='data/coco128.yaml', device='', epochs=300, evolve=False, exist_ok=False, global_rank=-1, hyp='data/hyp.scratch.yaml', image_weights=False, img_size=[640, 640], local_rank=-1, log_imgs=16, multi_scale=False, name='exp', noautoanchor=False, nosave=False, notest=False, project='runs/train', rect=False, resume=False, save_dir='runs/train/exp10', single_cls=False, sync_bn=False, total_batch_size=16, weights='yolov5s.pt', workers=8, world_size=1)
Start Tensorboard with "tensorboard --logdir runs/train", view at http://localhost:6006/
Hyperparameters {'lr0': 0.01, 'lrf': 0.2, 'momentum': 0.937, 'weight_decay': 0.0005, 'warmup_epochs': 3.0, 'warmup_momentum': 0.8, 'warmup_bias_lr': 0.1, 'box': 0.05, 'cls': 0.5, 'cls_pw': 1.0, 'obj': 1.0, 'obj_pw': 1.0, 'iou_t': 0.2, 'anchor_t': 4.0, 'fl_gamma': 0.0, 'hsv_h': 0.015, 'hsv_s': 0.7, 'hsv_v': 0.4, 'degrees': 0.0, 'translate': 0.1, 'scale': 0.5, 'shear': 0.0, 'perspective': 0.0, 'flipud': 0.0, 'fliplr': 0.5, 'mosaic': 1.0, 'mixup': 0.0}
Overriding model.yaml nc=80 with nc=7

             from  n    params  module                                  arguments                     

0 -1 1 3520 models.common.Focus [3, 32, 3]
1 -1 1 18560 models.common.Conv [32, 64, 3, 2]
2 -1 1 19904 models.common.BottleneckCSP [64, 64, 1]
3 -1 1 73984 models.common.Conv [64, 128, 3, 2]
4 -1 1 161152 models.common.BottleneckCSP [128, 128, 3]
5 -1 1 295424 models.common.Conv [128, 256, 3, 2]
6 -1 1 641792 models.common.BottleneckCSP [256, 256, 3]
7 -1 1 1180672 models.common.Conv [256, 512, 3, 2]
8 -1 1 656896 models.common.SPP [512, 512, [5, 9, 13]]
9 -1 1 1248768 models.common.BottleneckCSP [512, 512, 1, False]
10 -1 1 131584 models.common.Conv [512, 256, 1, 1]
11 -1 1 0 torch.nn.modules.upsampling.Upsample [None, 2, 'nearest']
12 [-1, 6] 1 0 models.common.Concat [1]
13 -1 1 378624 models.common.BottleneckCSP [512, 256, 1, False]
14 -1 1 33024 models.common.Conv [256, 128, 1, 1]
15 -1 1 0 torch.nn.modules.upsampling.Upsample [None, 2, 'nearest']
16 [-1, 4] 1 0 models.common.Concat [1]
17 -1 1 95104 models.common.BottleneckCSP [256, 128, 1, False]
18 -1 1 147712 models.common.Conv [128, 128, 3, 2]
19 [-1, 14] 1 0 models.common.Concat [1]
20 -1 1 313088 models.common.BottleneckCSP [256, 256, 1, False]
21 -1 1 590336 models.common.Conv [256, 256, 3, 2]
22 [-1, 10] 1 0 models.common.Concat [1]
23 -1 1 1248768 models.common.BottleneckCSP [512, 512, 1, False]
24 [17, 20, 23] 1 32364 models.yolo.Detect [7, [[10, 13, 16, 30, 33, 23], [30, 61, 62, 45, 59, 119], [116, 90, 156, 198, 373, 326]], [128, 256, 512]]
Model Summary: 283 layers, 7271276 parameters, 7271276 gradients

Transferred 364/370 items from yolov5s.pt
Optimizer groups: 62 .bias, 70 conv.weight, 59 other
Scanning 'coco128/labels/train2017.cache' for images and labels... 3219 found, 0 missing, 20 empty, 0 corrupted: 100%|██████████| 3219/3219 [00:00<?, ?it/s]
Scanning 'coco128/labels/val.cache' for images and labels... 246 found, 2 missing, 0 empty, 0 corrupted: 100%|██████████| 248/248 [00:00<?, ?it/s]

Analyzing anchors... anchors/target = 4.42, Best Possible Recall (BPR) = 0.9894
Image sizes 640 train, 640 test
Using 8 dataloader workers
Logging results to runs/train/exp10
Starting training for 300 epochs...

 Epoch   gpu_mem       box       obj       cls     total   targets  img_size

0%| | 0/202 [00:01<?, ?it/s]
Traceback (most recent call last):
File "/home/ljy/yolov5-master/train.py", line 492, in
train(hyp, opt, device, tb_writer, wandb)
File "/home/ljy/yolov5-master/train.py", line 293, in train
scaler.scale(loss).backward()
File "/home/ljy/anaconda3/envs/yolo/lib/python3.8/site-packages/torch/tensor.py", line 221, in backward
torch.autograd.backward(self, gradient, retain_graph, create_graph)
File "/home/ljy/anaconda3/envs/yolo/lib/python3.8/site-packages/torch/autograd/init.py", line 130, in backward
Variable._execution_engine.run_backward(
RuntimeError: cuDNN error: CUDNN_STATUS_INTERNAL_ERROR
You can try to repro this exception using the following code snippet. If that doesn't trigger the error, please include your original repro script when reporting this issue.

import torch
torch.backends.cuda.matmul.allow_tf32 = True
torch.backends.cudnn.benchmark = True
torch.backends.cudnn.deterministic = False
torch.backends.cudnn.allow_tf32 = True
data = torch.randn([16, 256, 20, 20], dtype=torch.half, device='cuda', requires_grad=True)
net = torch.nn.Conv2d(256, 256, kernel_size=[3, 3], padding=[1, 1], stride=[1, 1], dilation=[1, 1], groups=1)
net = net.cuda().half()
out = net(data)
out.backward(torch.randn_like(out))
torch.cuda.synchronize()

ConvolutionParams
data_type = CUDNN_DATA_HALF
padding = [1, 1, 0]
stride = [1, 1, 0]
dilation = [1, 1, 0]
groups = 1
deterministic = false
allow_tf32 = true
input: TensorDescriptor 0x7f8ba4002b60
type = CUDNN_DATA_HALF
nbDims = 4
dimA = 16, 256, 20, 20,
strideA = 102400, 400, 20, 1,
output: TensorDescriptor 0x7f8ba40033a0
type = CUDNN_DATA_HALF
nbDims = 4
dimA = 16, 256, 20, 20,
strideA = 102400, 400, 20, 1,
weight: FilterDescriptor 0x7f8ba403e080
type = CUDNN_DATA_HALF
tensor_format = CUDNN_TENSOR_NCHW
nbDims = 4
dimA = 256, 256, 3, 3,
Pointer addresses:
input: 0x7f8a73b60000
output: 0x7f8c792e0000
weight: 0x7f8d5b660000

Process finished with exit code 1

Environment

  • OS: Ubuntu 20.04
  • GPU: GeForce RTX 2080 Ti
  • torch 1.7.0+cu101
  • CUDA 10.1
  • cuDNN 7.6.4
  • NVIDIA driver 440.95

(Screenshots attached: 2020-11-28 16-06-58, 16-07-04, 16-08-32, 16-08-37, 16-08-59)


wuzuiyuzui added the bug label Nov 28, 2020
@glenn-jocher
Member

glenn-jocher commented Nov 28, 2020

@wuzuiyuzui it appears you may have environment problems. Please ensure you meet all dependency requirements if you are attempting to run YOLOv5 locally. If in doubt, create a new virtual Python 3.8 environment, clone the latest repo (code changes daily), and pip install -r requirements.txt again. We also highly recommend using one of our verified environments below.

Requirements

Python 3.8 or later with all requirements.txt dependencies installed, including torch>=1.6. To install run:

$ pip install -r requirements.txt
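
Not part of the original comment, but as a quick follow-up check: a few lines of Python can confirm that the installed stack matches what the training log expects (torch version, CUDA device, cuDNN). This is a generic sanity check, not YOLOv5 code.

import torch

print(torch.__version__)               # e.g. 1.7.0+cu101, must be >= 1.6
print(torch.version.cuda)              # CUDA version the wheel was built against
print(torch.backends.cudnn.version())  # cuDNN version bundled with the wheel
print(torch.cuda.is_available())       # True if a usable CUDA device is found
if torch.cuda.is_available():
    print(torch.cuda.get_device_name(0))  # e.g. GeForce RTX 2080 Ti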

Environments

YOLOv5 may be run in any of the following up-to-date verified environments (with all dependencies including CUDA/CUDNN, Python and PyTorch preinstalled):

Status

CI CPU testing

If this badge is green, all YOLOv5 GitHub Actions Continuous Integration (CI) tests are passing. These tests evaluate proper operation of basic YOLOv5 functionality, including training (train.py), testing (test.py), inference (detect.py) and export (export.py) on MacOS, Windows, and Ubuntu.

@louis-she

In utils/datasets.py, find the line pin_memory=True and change it to pin_memory=False.
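
For context, an illustrative sketch (not the actual utils/datasets.py source): the only change suggested here is the pin_memory flag passed when the DataLoader is built, roughly like this.

import torch
from torch.utils.data import DataLoader, TensorDataset

# Toy dataset standing in for LoadImagesAndLabels; shapes chosen only for illustration.
dataset = TensorDataset(torch.zeros(8, 3, 640, 640), torch.zeros(8, 6))

loader = DataLoader(dataset,
                    batch_size=4,
                    num_workers=2,
                    pin_memory=False)  # was pin_memory=True; False is the suggested workaround

for imgs, labels in loader:
    pass  # training loop would go here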

@glenn-jocher
Member

@louis-she oh interesting. Does this have performance drawbacks, i.e. slower dataloading?

@louis-she

louis-she commented Nov 30, 2020

@glenn-jocher A possible cause of the bug is that the labels yielded by the dataloader do not have a fixed size, because of how the collate_fn concatenates the per-image targets.

I haven't tested the performance impact, but I don't think it should make much of a difference.

Anyway, I have opened a PR that leaves the option at its default (which is False).
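
A rough, self-contained illustration of the point above (a simplified stand-in, not the real LoadImagesAndLabels.collate_fn): images stack to a fixed shape, but the concatenated label tensor's length depends on how many targets the batch happens to contain, so its size varies from batch to batch.

import torch

def collate_fn(batch):
    # Simplified YOLO-style collate: stack images, concatenate labels.
    imgs, labels = zip(*batch)
    return torch.stack(imgs, 0), torch.cat(labels, 0)

# Two fake samples with different numbers of targets (2 vs 5 boxes).
batch = [(torch.zeros(3, 640, 640), torch.zeros(2, 6)),
         (torch.zeros(3, 640, 640), torch.zeros(5, 6))]

imgs, labels = collate_fn(batch)
print(imgs.shape)    # torch.Size([2, 3, 640, 640]) -- fixed per batch
print(labels.shape)  # torch.Size([7, 6]) -- changes with the number of targets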

@glenn-jocher
Member

@wuzuiyuzui please try PR #1555 and confirm that this fixes the problem for you.

@github-actions
Contributor

github-actions bot commented Jan 2, 2021

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.
