
Possible Bug Training on Empty Batch? #609

Closed
ZeKunZhang1998 opened this issue Aug 3, 2020 · 42 comments
Labels: bug (Something isn't working), question (Further information is requested)

Comments

@ZeKunZhang1998

❔Question

Traceback (most recent call last):
File "train.py", line 463, in
train(hyp, tb_writer, opt, device)
File "train.py", line 286, in train
loss, loss_items = compute_loss(pred, targets.to(device), model) # scaled by batch_size
File "/content/drive/My Drive/yolov5/utils/utils.py", line 443, in compute_loss
tcls, tbox, indices, anchors = build_targets(p, targets, model) # targets
File "/content/drive/My Drive/yolov5/utils/utils.py", line 542, in build_targets
b, c = t[:, :2].long().T # image, class
ValueError: too many values to unpack (expected 2)

Additional context

@ZeKunZhang1998 added the question (Further information is requested) label on Aug 3, 2020
@glenn-jocher
Member

glenn-jocher commented Aug 3, 2020

Hello, thank you for your interest in our work! This issue seems to lack the minimum requirements for a proper response, or is insufficiently detailed for us to help you. Please note that most technical problems are due to:

  • Your changes to the default repository. If your issue is not reproducible in a new git clone version of this repository we can not debug it. Before going further run this code and ensure your issue persists:
sudo rm -rf yolov5  # remove existing
git clone https://github.com/ultralytics/yolov5 && cd yolov5  # clone latest
python detect.py  # verify detection
# CODE TO REPRODUCE YOUR ISSUE HERE
  • Your custom data. If your issue is not reproducible with COCO or COCO128 data we can not debug it. Visit our Custom Training Tutorial for guidelines on training your custom data. Examine train_batch0.jpg and test_batch0.jpg for a sanity check of training and testing data.

  • Your environment. If your issue is not reproducible in one of the verified environments below we can not debug it. If you are running YOLOv5 locally, ensure your environment meets all of the requirements.txt dependencies specified below.

If none of these apply to you, we suggest you close this issue and raise a new one using the Bug Report template, providing screenshots and minimum viable code to reproduce your issue. Thank you!

Requirements

Python 3.8 or later with all requirements.txt dependencies installed, including torch>=1.6. To install run:

$ pip install -U -r requirements.txt

Environments

YOLOv5 may be run in any of the following up-to-date verified environments (with all dependencies including CUDA/CUDNN, Python and PyTorch preinstalled):

  • Google Colab Notebook with free GPU
  • Kaggle Notebook with free GPU
  • Google Cloud Deep Learning VM (see the GCP Quickstart Guide)
  • Docker Image (see the Docker Quickstart Guide)

Current Status

CI CPU testing

If this badge is green, all YOLOv5 GitHub Actions Continuous Integration (CI) tests are passing. These tests evaluate proper operation of basic YOLOv5 functionality, including training (train.py), testing (test.py), inference (detect.py) and export (export.py) on macOS, Windows, and Ubuntu.

@acai66

acai66 commented Aug 3, 2020

Changing the batch size will solve this issue; it occurs when a batch of images contains no objects.
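For anyone curious why an all-background batch triggers this exact ValueError, here is a minimal, self-contained sketch of the mechanism. It is illustrative only, not the repository's build_targets code: shapes are simplified and transpose(0, 2) stands in for the `.T` used in the repo.

```python
import torch

# Labels are rows of (image, class, x, y, w, h); build_targets repeats them once per
# anchor before filtering them down to the matched 2-D set.
na = 3                                # anchors per detection layer
targets = torch.zeros((0, 6))         # an all-background batch: zero label rows
t = targets.repeat(na, 1, 1)          # shape (3, 0, 6) - stays 3-D when nothing is matched

try:
    # With labels present, t is reduced to a 2-D (n, 6) tensor and its transpose has
    # exactly two leading rows to unpack. With zero labels the tensor is still 3-D,
    # so the transpose has six leading entries and the unpack fails.
    b, c = t[:, :2].long().transpose(0, 2)
except ValueError as e:
    print(e)                          # too many values to unpack (expected 2)
```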

@glenn-jocher
Member

@acai66 ah I see. I remember a similar issue about testing with no targets, but I believed this was resolved. Does this occur when training or testing? Can you supply code to reproduce?

@MiiaBestLamia

While changing the batch size helped prolong training, this issue still occurs for me. By printing the paths of the images in each batch I can check whether they contain an object, and there is definitely at least one object in every batch that crashes (my dataset doesn't have images with empty label files).

@glenn-jocher
Member

@MiiaBestLamia can you supply exact steps and code to reproduce this issue, following the steps outlined above (current repo, verified environment, common dataset)?

@MiiaBestLamia

MiiaBestLamia commented Aug 3, 2020

@glenn-jocher I'm using the most recent repository, and all of the requirements are satisfied except the Python version (I'm using 3.6.9), which might be the reason for the issue, though it seems strange that training works for a while and then crashes. I have altered only one line of code in the repository, the one that prints the paths of the files used in the batch. I'm using a dataset that I may not be allowed to share, so I cannot provide you with the files.
I'm using the .txt option for providing the train and val sets, I have only one class, and I'm training at image size 800 with batch size 4 and the yolov5l.pt weights (downloaded from the Google Drive). It would be nice to see what @ZeKunZhang1998 is working with; maybe that could narrow down the problem.

@glenn-jocher
Member

@MiiaBestLamia I would verify your issue is reproducible in one of the environments above. That's what they're there for.

@ZeKunZhang1998
Author

Changing the batch size will solve this issue; it occurs when a batch of images contains no objects.

When I change to batch size = 1, the error occurs on a different image.

@ZeKunZhang1998
Author

@glenn-jocher I'm using the most recent repository, and all of the requirements are satisfied except the Python version (I'm using 3.6.9), which might be the reason for the issue, though it seems strange that training works for a while and then crashes. I have altered only one line of code in the repository, the one that prints the paths of the files used in the batch. I'm using a dataset that I may not be allowed to share, so I cannot provide you with the files.
I'm using the .txt option for providing the train and val sets, I have only one class, and I'm training at image size 800 with batch size 4 and the yolov5l.pt weights (downloaded from the Google Drive). It would be nice to see what @ZeKunZhang1998 is working with; maybe that could narrow down the problem.

I use a private dataset. When I skip the problematic batch, the problem goes away. The problematic batch does contain objects; it is not an empty image.

@acai66

acai66 commented Aug 4, 2020

Maybe I am wrong about this issue, but it disappeared when I changed to another batch size.
I added print(targets.shape) in build_targets, and I got torch.Size([0, 6]) when ValueError: too many values to unpack (expected 2) was raised.
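If you want to confirm the same condition on your own data, a tiny helper like the one below can be called on the targets tensor before it reaches build_targets. This is purely a debugging sketch; the name log_if_empty and its placement are my own, not part of the repo.

```python
import torch

def log_if_empty(targets: torch.Tensor) -> torch.Tensor:
    """Debugging helper (hypothetical): report batches that arrive with zero label rows."""
    if targets.shape[0] == 0:
        print(f'empty target batch detected: shape={tuple(targets.shape)}')
    return targets

# Example: a label tensor with the (image, class, x, y, w, h) layout but no rows.
targets = log_if_empty(torch.zeros((0, 6)))   # prints: empty target batch detected: shape=(0, 6)
```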

@glenn-jocher
Member

@acai66 the only thing we can act on is a reproducible example in one of the verified environments.

@acai66

acai66 commented Aug 4, 2020

@acai66 the only thing we can act on is a reproducible example in one of the verified environments.

I will reinstall PyTorch from source and fetch the latest yolov5; I will upload my dataset if this issue occurs again.

@ZJU-lishuang

I am running into the same problem.
I think the reason is that after the box_candidates function there are no targets left.
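For context, box_candidates is the post-augmentation filter that drops boxes that have become too small, too distorted, or too shrunken; with tiny objects and aggressive scale jitter it can reject every box in an image. The sketch below is a paraphrase with illustrative thresholds, not the repository's exact function.

```python
import numpy as np

def box_candidates_sketch(before, after, wh_thr=2, ar_thr=20, area_thr=0.1):
    """Paraphrased filter: keep an augmented box only if it is still big enough,
    retains enough of its original area, and has a sane aspect ratio."""
    w1, h1 = before[2] - before[0], before[3] - before[1]
    w2, h2 = after[2] - after[0], after[3] - after[1]
    ar = np.maximum(w2 / (h2 + 1e-16), h2 / (w2 + 1e-16))       # aspect ratio after augmentation
    return (w2 > wh_thr) & (h2 > wh_thr) & \
           (w2 * h2 / (w1 * h1 + 1e-16) > area_thr) & (ar < ar_thr)

# A 6x6 px object shrunk by a strong scale jitter fails the size test, so the image
# contributes no labels; enough such images in one batch and the targets tensor is empty.
before = np.array([[0.0], [0.0], [6.0], [6.0]])   # columns are boxes in x1, y1, x2, y2 order
after = np.array([[0.0], [0.0], [1.5], [1.5]])
print(box_candidates_sketch(before, after))       # [False]
```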

@MiiaBestLamia

After changing some of the hyperparameters in train.py (lr0: 0.001, scale: 0.2), moving the project to a computer with a beefier GPU (from a 2060S to a 2080Ti, still on Python 3.6.9) and increasing the batch size to 8, training has been running properly for 4 epochs now. I can still reproduce the issue with my data on the 2080Ti by launching train.py with a batch size of 4, so I suppose this issue is caused by peculiarities in the data, not by problems with the network/code.

@acai66

acai66 commented Aug 4, 2020

@acai66 the only thing we can act on is a reproducible example in one of the verified environments.

Here is my dataset:
https://drive.google.com/file/d/12epqSyYELm7c4mXXIIJos33KDcj1l67f/view?usp=sharing

data yaml:
2020.yaml.txt

models yaml:
yolov5x_2020.yaml.txt

train command:
python train.py --cfg models/yolov5x_2020.yaml --data data/2020.yaml --epochs 300 --batch-size 8 --img-size 512 512 --cache-images --weights '' --name "yolov5x_2020_default" --single-cls

This issue disappeared when I changed the batch size to 12.

@ZeKunZhang1998
Author

When I use batch size = 4, it works.

@ZeKunZhang1998
Author

@acai66 the only thing we can act on is a reproducible example in one of the verified environments.

Hi, I think some data augmentations make the boxes disappear, right? If you use a bigger batch size, this issue will disappear.

@mrk230

mrk230 commented Aug 4, 2020

On a private dataset I have also had this issue with batch size = 16. I haven't fully tested further, but the dataset does include a fair number of images without objects, for what that's worth.

@buimanhlinh96

Does that mean the whole dataset must contain objects, i.e. if an image isn't labelled then something goes wrong?

@glenn-jocher
Member

@buimanhlinh96 that is not correct. COCO has over a thousand images without labels.

@buimanhlinh96

@glenn-jocher so what happened with this issue? Is it that within a batch we must have at least one labelled image?

@glenn-jocher
Member

@acai66 the only thing we can act on is a reproducible example in one of the verified environments.

Here is my dataset:
https://drive.google.com/file/d/12epqSyYELm7c4mXXIIJos33KDcj1l67f/view?usp=sharing

data yaml:
2020.yaml.txt

models yaml:
yolov5x_2020.yaml.txt

train command:
python train.py --cfg models/yolov5x_2020.yaml --data data/2020.yaml --epochs 300 --batch-size 8 --img-size 512 512 --cache-images --weights '' --name "yolov5x_2020_default" --single-cls

This issue disappeared when I changed the batch size to 12.

@acai66 thank you! I think I can work with this. I can only debug official models though, so I will use yolov5x.yaml in place of yours. Do you yourself see the error when running on the default models?

@glenn-jocher
Member

@buimanhlinh96 I don't know, I have not tried to reproduce yet. I know test.py operates correctly on datasets without labels, I don't know about train.py. Can you provide minimum viable code to reproduce your specific issue?

@glenn-jocher
Member

@acai66 also are you able to reproduce in one of the verified environments?

@buimanhlinh96

@glenn-jocher I tried some experiments and found that the batch size should be greater than or equal to 8.

@glenn-jocher
Member

@buimanhlinh96 there is no constraint on batch size, so you should be able to use batch size 1 to batch size x, whatever your hardware can handle. If this is not the case then there is a bug.

@acai66

acai66 commented Aug 5, 2020

@acai66 also are you able to reproduce in one of the verified environments?

You can try the default yolov5x.yaml; I actually only changed nc to 1 in yolov5x_2020_default.yaml.

@buimanhlinh96

@glenn-jocher Yes. Hopefully we can fix it ASAP. Love yolov5

@glenn-jocher
Member

@acai66 ah I see, of course. We actually updated train.py a few weeks back to inherit nc from the data.yaml in case of a mismatch with the model yaml nc, so you should be able to use your command with the default 80 class yolov5x.yaml as well, and it will still operate correctly.
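To make that concrete, the behaviour described above amounts to something like the following. This is a hedged sketch, not the actual train.py code; the inline YAML strings stand in for acai66's 2020.yaml and the default model yaml, and the names entry is a made-up placeholder.

```python
import yaml

# Stand-ins for the two config files; only the nc keys matter for this illustration.
data_cfg = yaml.safe_load("nc: 1\nnames: ['part']")                               # dataset definition (single class)
model_cfg = yaml.safe_load("nc: 80\ndepth_multiple: 1.33\nwidth_multiple: 1.25")  # default 80-class model yaml

# The class count declared by the dataset wins over the model yaml's nc on mismatch,
# so the stock 80-class yolov5x.yaml can be trained on a 1-class dataset unchanged.
if model_cfg['nc'] != data_cfg['nc']:
    print(f"Overriding model nc={model_cfg['nc']} with dataset nc={data_cfg['nc']}")
    model_cfg['nc'] = data_cfg['nc']
```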

Ok, I will try to reproduce this in a colab notebook today if I have time.

@glenn-jocher changed the title from "Hello! May be someting wrong with my data? when I change the batch size this issue occurs." to "Possible Bug Training on Empty Batch?" on Aug 5, 2020
@glenn-jocher added the bug (Something isn't working) label on Aug 5, 2020
@Jacobsolawetz
Contributor

@glenn-jocher I'm in the same boat.

  • Colab notebook environment with torch 1.6.0
  • yolov5x yaml with nc changed
  • private dataset

For me the bug hits right after the first epoch (which successfully completes), when moving to the second epoch.

It seems fixed by moving the batch size from 4 to 12 as suggested above (Colab runs out of memory on this dataset at 16).

@glenn-jocher
Member

@Jacobsolawetz hmm ok. Do you have a pretty sparse dataset, do you think it's possible a whole batch of 4 images might have no labels? Does the bug happen during training or testing?

@glenn-jocher
Member

@acai66 I'm able to reproduce this in a colab notebook:
https://colab.research.google.com/drive/1bCFd_1fyFG8pkXkQ8MubvRSgFsb9ZPhu#scrollTo=-AVqcyhjO89V

I see this midway through the first epoch:

     Epoch   gpu_mem      GIoU       obj       cls     total   targets  img_size
     0/299     4.78G   0.07214   0.01437         0   0.08651         3       512:  70% 105/150 [00:58<00:22,  2.02it/s]Traceback (most recent call last):
  File "train.py", line 477, in <module>
    train(hyp, opt, device, tb_writer)
  File "train.py", line 300, in train
    loss, loss_items = compute_loss(pred, targets.to(device), model)  # scaled by batch_size
  File "/content/yolov5/utils/general.py", line 446, in compute_loss
    tcls, tbox, indices, anchors = build_targets(p, targets, model)  # targets
  File "/content/yolov5/utils/general.py", line 545, in build_targets
    b, c = t[:, :2].long().T  # image, class
ValueError: too many values to unpack (expected 2)
     0/299     4.78G   0.07214   0.01437         0   0.08651         3       512:  70% 105/150 [00:58<00:25,  1.79it/s]

@glenn-jocher
Member

@ZeKunZhang1998 @mrk230 @Jacobsolawetz @acai66 @buimanhlinh96 this issue should be resolved now in 7eaf225. Please git pull to receive the latest updates and try again.
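For readers following along, the shape of the fix is roughly the guard sketched below: when a batch has no labels, keep the target tensor as an empty 2-D tensor so the later unpack still sees exactly two rows. This is an illustration of the idea only, not the code in 7eaf225.

```python
import torch

na = 3
targets = torch.zeros((0, 6))            # all-background batch
nt = targets.shape[0]                    # number of label rows

t = targets.repeat(na, 1, 1)             # (3, 0, 6)
if nt:
    pass                                 # anchor matching would filter t down to 2-D here
else:
    t = targets                          # no labels: fall back to the empty (0, 6) tensor
    # offsets would also be zeroed out at this point in the real function

b, c = t[:, :2].long().T                 # now unpacks cleanly into two empty tensors
print(b.shape, c.shape)                  # torch.Size([0]) torch.Size([0])
```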

Let us know if you run into any more problems, and good luck!

@glenn-jocher
Member

glenn-jocher commented Aug 5, 2020

@acai66 for your dataset I would recommend several changes:

  1. You have very small objects, you need to train at the highest viable resolution, even if it means using a smaller model.
  2. Start from pretrained weights for best results, but also try training from scratch to compare.
  3. Use the largest batch size that will fit into RAM.
  4. Your dataset is different enough from COCO that it may benefit from substantially different hyperparameters. See hyperparameter evolution tutorial: https://docs.ultralytics.com/yolov5

@acai66

acai66 commented Aug 5, 2020

@acai66 for your dataset I would recommend several changes:

  1. You have very small objects, you need to train at the highest viable resolution, even if it means using a smaller model.
  2. Start from pretrained weights for best results, but also try training from scratch to compare.
  3. Use the largest batch size that will fit into RAM.
  4. Your dataset is different enough from COCO that it may benefit from substantially different hyperparameters. See hyperparameter evolution tutorial: https://docs.ultralytics.com/yolov5

Thank you very much for your recommendations; I will try them. This issue was solved after pulling the latest commits.

@Jacobsolawetz
Contributor

@glenn-jocher yes... after some introspection, there are maybe 6 or so images in the dataset of 500 that do not have annotations. A random grouping of those may have caused the hiccup.

Thanks for fixing this bug so quickly!

@buimanhlinh96

@glenn-jocher Thank you very much!!!!!!!

@anhnktp

anhnktp commented Aug 6, 2020

@glenn-jocher I am also facing the same issue (cloned the latest code). Maybe the bug still remains; it is quite strange because training runs all the way to the final epoch before the error happens. I trained with yolov5-s.yaml and batch-size=100 (maybe it is too large?) on 2 RTX 2080Ti GPUs. Every image contains at least one object.
[screenshot: Screen Shot 2020-08-06 at 09 37 37]

@glenn-jocher
Member

@anhnktp no, you are incorrect, you are not using the latest code. L545 no longer contains the same code, so the error message you see is not possible to produce in origin/master.

@anhnktp

anhnktp commented Aug 6, 2020

@glenn-jocher oh, I see. I was on the YOLOv5 version from two days ago; you have added some code since then. I'll recheck. Thank you.

burglarhobbit pushed a commit to burglarhobbit/yolov5 that referenced this issue Jan 1, 2021
KMint1819 pushed a commit to KMint1819/yolov5 that referenced this issue May 12, 2021
@Kachasukintim

Hello YOLOv5 developers, I would like you to make the same update for YOLOv4 PyTorch in Google Colab; I tried it and YOLOv4 has the same problem. Please help me, and thank you in advance for your help.

@glenn-jocher
Member

glenn-jocher commented Aug 30, 2021

@Kachasukintim 👋 hi, thanks for letting us know about this problem with YOLOv5 🚀. We've created a few short guidelines below to help users provide what we need in order to get started investigating a possible problem.

How to create a Minimal, Reproducible Example

When asking a question, people will be better able to provide help if you provide code that they can easily understand and use to reproduce the problem. This is referred to by community members as creating a minimum reproducible example. Your code that reproduces the problem should be:

  • Minimal – Use as little code as possible that still produces the same problem
  • Complete – Provide all parts someone else needs to reproduce your problem in the question itself
  • Reproducible – Test the code you're about to provide to make sure it reproduces the problem

In addition to the above requirements, for Ultralytics to provide assistance your code should be:

  • Current – Verify that your code is up-to-date with current GitHub master, and if necessary git pull or git clone a new copy to ensure your problem has not already been resolved by previous commits.
  • Unmodified – Your problem must be reproducible without any modifications to the codebase in this repository. Ultralytics does not provide support for custom code ⚠️.

If you believe your problem meets all of the above criteria, please close this issue and raise a new one using the 🐛 Bug Report template and providing a minimum reproducible example to help us better understand and diagnose your problem.

Thank you! 😃
