
Possible Bug Training on Empty Batch? #609

Closed
ZeKunZhang1998 opened this issue Aug 3, 2020 · 42 comments
Labels: bug (Something isn't working), question (Further information is requested)

Comments

@ZeKunZhang1998

❔Question

Traceback (most recent call last):
File "train.py", line 463, in
train(hyp, tb_writer, opt, device)
File "train.py", line 286, in train
loss, loss_items = compute_loss(pred, targets.to(device), model) # scaled by batch_size
File "/content/drive/My Drive/yolov5/utils/utils.py", line 443, in compute_loss
tcls, tbox, indices, anchors = build_targets(p, targets, model) # targets
File "/content/drive/My Drive/yolov5/utils/utils.py", line 542, in build_targets
b, c = t[:, :2].long().T # image, class
ValueError: too many values to unpack (expected 2)

Additional context

@ZeKunZhang1998 added the question (Further information is requested) label on Aug 3, 2020
@glenn-jocher
Member

glenn-jocher commented Aug 3, 2020

Hello, thank you for your interest in our work! This issue seems to lack the minimum requirements for a proper response, or is insufficiently detailed for us to help you. Please note that most technical problems are due to:

  • Your changes to the default repository. If your issue is not reproducible in a new git clone version of this repository we can not debug it. Before going further run this code and ensure your issue persists:
sudo rm -rf yolov5  # remove existing
git clone https://github.com/ultralytics/yolov5 && cd yolov5  # clone latest
python detect.py  # verify detection
# CODE TO REPRODUCE YOUR ISSUE HERE
  • Your custom data. If your issue is not reproducible with COCO or COCO128 data we can not debug it. Visit our Custom Training Tutorial for guidelines on training your custom data. Examine train_batch0.jpg and test_batch0.jpg for a sanity check of training and testing data.

  • Your environment. If your issue is not reproducible in one of the verified environments below we can not debug it. If you are running YOLOv5 locally, ensure your environment meets all of the requirements.txt dependencies specified below.

If none of these apply to you, we suggest you close this issue and raise a new one using the Bug Report template, providing screenshots and minimum viable code to reproduce your issue. Thank you!

Requirements

Python 3.8 or later with all requirements.txt dependencies installed, including torch>=1.6. To install run:

$ pip install -U -r requirements.txt

Environments

YOLOv5 may be run in any of the following up-to-date verified environments (with all dependencies including CUDA/CUDNN, Python and PyTorch preinstalled):

  • Google Colab Notebook with free GPU
  • Kaggle Notebook with free GPU
  • Google Cloud Deep Learning VM (see the GCP Quickstart Guide)
  • Docker Image (see the Docker Quickstart Guide)

Current Status

CI CPU testing

If this badge is green, all YOLOv5 GitHub Actions Continuous Integration (CI) tests are passing. These tests evaluate proper operation of basic YOLOv5 functionality, including training (train.py), testing (test.py), inference (detect.py) and export (export.py) on macOS, Windows, and Ubuntu.

@acai66

acai66 commented Aug 3, 2020

Changing the batch size will solve this issue; it occurs when a batch of images contains no objects.
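For anyone curious why an all-background batch triggers this exact ValueError, here is a minimal, self-contained sketch of the mechanism. It is illustrative only, not the repository's build_targets code: shapes are simplified and transpose(0, 2) stands in for the `.T` used in the repo.

```python
import torch

# Labels are rows of (image, class, x, y, w, h); build_targets repeats them once per
# anchor before filtering them down to the matched 2-D set.
na = 3                                # anchors per detection layer
targets = torch.zeros((0, 6))         # an all-background batch: zero label rows
t = targets.repeat(na, 1, 1)          # shape (3, 0, 6) - stays 3-D when nothing is matched

try:
    # With labels present, t is reduced to a 2-D (n, 6) tensor and its transpose has
    # exactly two leading rows to unpack. With zero labels the tensor is still 3-D,
    # so the transpose has six leading entries and the unpack fails.
    b, c = t[:, :2].long().transpose(0, 2)
except ValueError as e:
    print(e)                          # too many values to unpack (expected 2)
```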

@glenn-jocher
Member

@acai66 ah I see. I remember a similar issue about testing with no targets, but I believed this was resolved. Does this occur when training or testing? Can you supply code to reproduce?

@MiiaBestLamia

While changing the batch size helped prolong training, this issue still occurs for me. By printing the paths of the images in each batch I can check whether they contain an object, and there is definitely at least one object in every batch that crashes (my dataset doesn't have images with empty label files).

@glenn-jocher
Member

@MiiaBestLamia can you supply exact steps and code to reproduce this issue, following the steps outlined above (current repo, verified environment, common dataset)?

@MiiaBestLamia

MiiaBestLamia commented Aug 3, 2020

@glenn-jocher I'm using the most recent repository, and all of the requirements are satisfied except the Python version (I'm using 3.6.9), which might be the reason for the issue, though it seems strange that training works for a while and then crashes. I have altered only one line of code in the repository, the one that prints the paths of the files used in the batch. I'm using a dataset that I may not be allowed to share, so I cannot provide you with the files.
I'm using the .txt option for providing the train and val sets, I have only one class, and I'm training at image size 800 with batch size 4 and the yolov5l.pt weights (downloaded from the Google Drive). It would be nice to see what @ZeKunZhang1998 is working with; maybe that could narrow down the problem.

@glenn-jocher
Member

@MiiaBestLamia I would verify your issue is reproducible in one of the environments above. That's what they're there for.

@ZeKunZhang1998
Author

Changing the batch size will solve this issue; it occurs when a batch of images contains no objects.

When I change to batch size = 1, the error occurs on a different image.

@ZeKunZhang1998
Author

@glenn-jocher I'm using the most recent repository, and all of the requirements are satisfied except the Python version (I'm using 3.6.9), which might be the reason for the issue, though it seems strange that training works for a while and then crashes. I have altered only one line of code in the repository, the one that prints the paths of the files used in the batch. I'm using a dataset that I may not be allowed to share, so I cannot provide you with the files.
I'm using the .txt option for providing the train and val sets, I have only one class, and I'm training at image size 800 with batch size 4 and the yolov5l.pt weights (downloaded from the Google Drive). It would be nice to see what @ZeKunZhang1998 is working with; maybe that could narrow down the problem.

I use a private dataset. When I skip the problematic batch, the problem goes away. The problematic batch does contain objects; it is not an empty image.

@acai66

acai66 commented Aug 4, 2020

Maybe I am wrong about this issue, but it disappeared when I changed to another batch size.
I added print(targets.shape) in build_targets, and I got torch.Size([0, 6]) when ValueError: too many values to unpack (expected 2) was raised.
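If you want to confirm the same condition on your own data, a tiny helper like the one below can be called on the targets tensor before it reaches build_targets. This is purely a debugging sketch; the name log_if_empty and its placement are my own, not part of the repo.

```python
import torch

def log_if_empty(targets: torch.Tensor) -> torch.Tensor:
    """Debugging helper (hypothetical): report batches that arrive with zero label rows."""
    if targets.shape[0] == 0:
        print(f'empty target batch detected: shape={tuple(targets.shape)}')
    return targets

# Example: a label tensor with the (image, class, x, y, w, h) layout but no rows.
targets = log_if_empty(torch.zeros((0, 6)))   # prints: empty target batch detected: shape=(0, 6)
```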

@glenn-jocher
Member

@acai66 the only thing we can act on is a reproducible example in one of the verified environments.

@acai66

acai66 commented Aug 4, 2020

@acai66 the only thing we can act on is a reproducible example in one of the verified environments.

I will reinstall PyTorch from source and fetch the latest yolov5; I will upload my dataset if this issue occurs again.

@ZJU-lishuang

I am running into the same problem.
I think the reason is that after the box_candidates function there are no targets left.
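For context, box_candidates is the post-augmentation filter that drops boxes that have become too small, too distorted, or too shrunken; with tiny objects and aggressive scale jitter it can reject every box in an image. The sketch below is a paraphrase with illustrative thresholds, not the repository's exact function.

```python
import numpy as np

def box_candidates_sketch(before, after, wh_thr=2, ar_thr=20, area_thr=0.1):
    """Paraphrased filter: keep an augmented box only if it is still big enough,
    retains enough of its original area, and has a sane aspect ratio."""
    w1, h1 = before[2] - before[0], before[3] - before[1]
    w2, h2 = after[2] - after[0], after[3] - after[1]
    ar = np.maximum(w2 / (h2 + 1e-16), h2 / (w2 + 1e-16))       # aspect ratio after augmentation
    return (w2 > wh_thr) & (h2 > wh_thr) & \
           (w2 * h2 / (w1 * h1 + 1e-16) > area_thr) & (ar < ar_thr)

# A 6x6 px object shrunk by a strong scale jitter fails the size test, so the image
# contributes no labels; enough such images in one batch and the targets tensor is empty.
before = np.array([[0.0], [0.0], [6.0], [6.0]])   # columns are boxes in x1, y1, x2, y2 order
after = np.array([[0.0], [0.0], [1.5], [1.5]])
print(box_candidates_sketch(before, after))       # [False]
```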

@MiiaBestLamia

After changing some of the hyperparameters in train.py (lr0: 0.001, scale: 0.2), moving the project to a computer with a beefier GPU (from a 2060S to a 2080Ti, still on Python 3.6.9) and increasing the batch size to 8, training has been running properly for 4 epochs now. I can still reproduce the issue with my data on the 2080Ti by launching train.py with a batch size of 4, so I suppose this issue is caused by peculiarities in the data, not by problems with the network/code.

@acai66

acai66 commented Aug 4, 2020

@acai66 the only thing we can act on is a reproducible example in one of the verified environments.

Here is my dataset:
https://drive.google.com/file/d/12epqSyYELm7c4mXXIIJos33KDcj1l67f/view?usp=sharing

data yaml:
2020.yaml.txt

models yaml:
yolov5x_2020.yaml.txt

train command:
python train.py --cfg models/yolov5x_2020.yaml --data data/2020.yaml --epochs 300 --batch-size 8 --img-size 512 512 --cache-images --weights '' --name "yolov5x_2020_default" --single-cls

This issue disappeared when I changed the batch size to 12.

@ZeKunZhang1998
Author

When I use batch size = 4, it works.

@ZeKunZhang1998
Author

@acai66 the only thing we can act on is a reproducible example in one of the verified environments.

Hi, I think some data augmentations make the boxes disappear, right? If you use a bigger batch size, this issue will disappear.

@mrk230

mrk230 commented Aug 4, 2020

On a private dataset I have also had this issue with batch size = 16. I haven't fully tested further, but the dataset does include a fair number of images without objects, for what that's worth.

@buimanhlinh96

Does that mean the whole dataset must contain objects, i.e. if an image isn't labelled then something goes wrong?

@glenn-jocher
Member

@buimanhlinh96 that is not correct. COCO has over a thousand images without labels.

@buimanhlinh96

@glenn-jocher so what happened with this issue? Is it that within a batch we must have at least one labelled image?

@glenn-jocher
Member

@acai66 the only thing we can act on is a reproducible example in one of the verified environments.

Here is my dataset:
https://drive.google.com/file/d/12epqSyYELm7c4mXXIIJos33KDcj1l67f/view?usp=sharing

data yaml:
2020.yaml.txt

models yaml:
yolov5x_2020.yaml.txt

train command:
python train.py --cfg models/yolov5x_2020.yaml --data data/2020.yaml --epochs 300 --batch-size 8 --img-size 512 512 --cache-images --weights '' --name "yolov5x_2020_default" --single-cls

This issue disappeared when I changed the batch size to 12.

@acai66 thank you! I think I can work with this. I can only debug official models though, so I will use yolov5x.yaml in place of yours. Do you yourself see the error when running on the default models?

@glenn-jocher
Member

@buimanhlinh96 I don't know, I have not tried to reproduce yet. I know test.py operates correctly on datasets without labels, I don't know about train.py. Can you provide minimum viable code to reproduce your specific issue?

@glenn-jocher
Member

@acai66 also are you able to reproduce in one of the verified environments?

@buimanhlinh96

@glenn-jocher I tried some experiments and found that the batch size should be greater than or equal to 8.

@glenn-jocher
Member

@buimanhlinh96 there is no constraint on batch size, so you should be able to use batch size 1 to batch size x, whatever your hardware can handle. If this is not the case then there is a bug.

@acai66

acai66 commented Aug 5, 2020

@acai66 also are you able to reproduce in one of the verified environments?

You can try the default yolov5x.yaml; I actually only changed nc to 1 in yolov5x_2020_default.yaml.

@buimanhlinh96

@glenn-jocher Yes. Hopefully we can fix it ASAP. Love yolov5

@glenn-jocher
Member

@acai66 ah I see, of course. We actually updated train.py a few weeks back to inherit nc from the data.yaml in case of a mismatch with the model yaml nc, so you should be able to use your command with the default 80 class yolov5x.yaml as well, and it will still operate correctly.
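To make that concrete, the behaviour described above amounts to something like the following. This is a hedged sketch, not the actual train.py code; the inline YAML strings stand in for acai66's 2020.yaml and the default model yaml, and the names entry is a made-up placeholder.

```python
import yaml

# Stand-ins for the two config files; only the nc keys matter for this illustration.
data_cfg = yaml.safe_load("nc: 1\nnames: ['part']")                               # dataset definition (single class)
model_cfg = yaml.safe_load("nc: 80\ndepth_multiple: 1.33\nwidth_multiple: 1.25")  # default 80-class model yaml

# The class count declared by the dataset wins over the model yaml's nc on mismatch,
# so the stock 80-class yolov5x.yaml can be trained on a 1-class dataset unchanged.
if model_cfg['nc'] != data_cfg['nc']:
    print(f"Overriding model nc={model_cfg['nc']} with dataset nc={data_cfg['nc']}")
    model_cfg['nc'] = data_cfg['nc']
```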

Ok, I will try to reproduce this in a colab notebook today if I have time.

@glenn-jocher changed the title from "Hello! May be someting wrong with my data? when I change the batch size this issue occurs." to "Possible Bug Training on Empty Batch?" on Aug 5, 2020
@glenn-jocher added the bug (Something isn't working) label on Aug 5, 2020
@Jacobsolawetz
Contributor

@glenn-jocher I'm in the same boat.

  • Colab notebook environment with torch 1.6.0
  • yolov5x yaml with nc changed
  • private dataset

For me the bug hits right after the first epoch (which successfully completes), when moving to the second epoch.

It seems fixed by moving the batch size from 4 to 12 as suggested above (Colab runs out of memory on this dataset at 16).

@glenn-jocher
Member

@Jacobsolawetz hmm ok. Do you have a pretty sparse dataset, do you think it's possible a whole batch of 4 images might have no labels? Does the bug happen during training or testing?

@glenn-jocher
Member

@acai66 I'm able to reproduce this in a colab notebook:
https://colab.research.google.com/drive/1bCFd_1fyFG8pkXkQ8MubvRSgFsb9ZPhu#scrollTo=-AVqcyhjO89V

I see this midway through the first epoch:

     Epoch   gpu_mem      GIoU       obj       cls     total   targets  img_size
     0/299     4.78G   0.07214   0.01437         0   0.08651         3       512:  70% 105/150 [00:58<00:22,  2.02it/s]Traceback (most recent call last):
  File "train.py", line 477, in <module>
    train(hyp, opt, device, tb_writer)
  File "train.py", line 300, in train
    loss, loss_items = compute_loss(pred, targets.to(device), model)  # scaled by batch_size
  File "/content/yolov5/utils/general.py", line 446, in compute_loss
    tcls, tbox, indices, anchors = build_targets(p, targets, model)  # targets
  File "/content/yolov5/utils/general.py", line 545, in build_targets
    b, c = t[:, :2].long().T  # image, class
ValueError: too many values to unpack (expected 2)
     0/299     4.78G   0.07214   0.01437         0   0.08651         3       512:  70% 105/150 [00:58<00:25,  1.79it/s]

@glenn-jocher
Member

@ZeKunZhang1998 @mrk230 @Jacobsolawetz @acai66 @buimanhlinh96 this issue should be resolved now in 7eaf225. Please git pull to receive the latest updates and try again.
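For readers following along, the shape of the fix is roughly the guard sketched below: when a batch has no labels, keep the target tensor as an empty 2-D tensor so the later unpack still sees exactly two rows. This is an illustration of the idea only, not the code in 7eaf225.

```python
import torch

na = 3
targets = torch.zeros((0, 6))            # all-background batch
nt = targets.shape[0]                    # number of label rows

t = targets.repeat(na, 1, 1)             # (3, 0, 6)
if nt:
    pass                                 # anchor matching would filter t down to 2-D here
else:
    t = targets                          # no labels: fall back to the empty (0, 6) tensor
    # offsets would also be zeroed out at this point in the real function

b, c = t[:, :2].long().T                 # now unpacks cleanly into two empty tensors
print(b.shape, c.shape)                  # torch.Size([0]) torch.Size([0])
```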

Let us know if you run into any more problems, and good luck!

@glenn-jocher
Member

glenn-jocher commented Aug 5, 2020

@acai66 for your dataset I would recommend several changes:

  1. You have very small objects, you need to train at the highest viable resolution, even if it means using a smaller model.
  2. Start from pretrained weights for best results, but also try training from scratch to compare.
  3. Use the largest batch size that will fit into RAM.
  4. Your dataset is different enough from COCO that it may benefit from substantially different hyperparameters. See hyperparameter evolution tutorial: https://docs.ultralytics.com/yolov5

@acai66

acai66 commented Aug 5, 2020

@acai66 for your dataset I would recommend several changes:

  1. You have very small objects, you need to train at the highest viable resolution, even if it means using a smaller model.
  2. Start from pretrained weights for best results, but also try training from scratch to compare.
  3. Use the largest batch size that will fit into RAM.
  4. Your dataset is different enough from COCO that it may benefit from substantially different hyperparameters. See hyperparameter evolution tutorial: https://docs.ultralytics.com/yolov5

Thank you very much for your recommendations; I will try them. This issue was solved after pulling the latest commits.

@Jacobsolawetz
Contributor

@glenn-jocher yes... after some introspection, there are maybe 6 or so images in the dataset of 500 that do not have annotations. A random grouping of those may have caused the hiccup.

Thanks for fixing this bug so quickly!

@buimanhlinh96

@glenn-jocher Thank you very much!!!!!!!

@anhnktp

anhnktp commented Aug 6, 2020

@glenn-jocher I am also facing the same issue (cloned the latest code). Maybe the bug still remains; it is quite strange because training runs all the way to the final epoch before the error happens. I trained with yolov5-s.yaml and batch-size=100 (maybe it is too large?) on 2 RTX 2080Ti GPUs. Every image contains at least one object.
[screenshot: Screen Shot 2020-08-06 at 09 37 37]

@glenn-jocher
Member

@anhnktp no, you are incorrect, you are not using the latest code. L545 no longer contains the same code, so the error message you see is not possible to produce in origin/master.

@anhnktp

anhnktp commented Aug 6, 2020

@glenn-jocher oh, I see. I was on the YOLOv5 version from two days ago; you have added some code since then. I'll recheck. Thank you.

burglarhobbit pushed a commit to burglarhobbit/yolov5 that referenced this issue Jan 1, 2021
KMint1819 pushed a commit to KMint1819/yolov5 that referenced this issue May 12, 2021
@Kachasukintim

Hello YOLOv5 developers, I would like you to make the same update for YOLOv4 PyTorch in Google Colab; I tried it and YOLOv4 has the same problem. Please help me, and thank you in advance for your help.

@glenn-jocher
Member

glenn-jocher commented Aug 30, 2021

@Kachasukintim 👋 hi, thanks for letting us know about this problem with YOLOv5 🚀. We've created a few short guidelines below to help users provide what we need in order to get started investigating a possible problem.

How to create a Minimal, Reproducible Example

When asking a question, people will be better able to provide help if you provide code that they can easily understand and use to reproduce the problem. This is referred to by community members as creating a minimum reproducible example. Your code that reproduces the problem should be:

  • Minimal – Use as little code as possible that still produces the same problem
  • Complete – Provide all parts someone else needs to reproduce your problem in the question itself
  • Reproducible – Test the code you're about to provide to make sure it reproduces the problem

In addition to the above requirements, for Ultralytics to provide assistance your code should be:

  • Current – Verify that your code is up-to-date with current GitHub master, and if necessary git pull or git clone a new copy to ensure your problem has not already been resolved by previous commits.
  • Unmodified – Your problem must be reproducible without any modifications to the codebase in this repository. Ultralytics does not provide support for custom code ⚠️.

If you believe your problem meets all of the above criteria, please close this issue and raise a new one using the 🐛 Bug Report template and providing a minimum reproducible example to help us better understand and diagnose your problem.

Thank you! 😃
