Possible Bug Training on Empty Batch? #609
Hello, thank you for your interest in our work! This issue seems to lack the minimum requirements for a proper response, or is insufficiently detailed for us to help you. Please note that most technical problems are due to:
sudo rm -rf yolov5 # remove existing
git clone https://github.com/ultralytics/yolov5 && cd yolov5 # clone latest
python detect.py # verify detection
# CODE TO REPRODUCE YOUR ISSUE HERE
If none of these apply to you, we suggest you close this issue and raise a new one using the Bug Report template, providing screenshots and minimum viable code to reproduce your issue. Thank you!

Requirements

Python 3.8 or later with all requirements.txt dependencies installed, including:

$ pip install -U -r requirements.txt

Environments

YOLOv5 may be run in any of the following up-to-date verified environments (with all dependencies including CUDA/CUDNN, Python and PyTorch preinstalled):
Current Status

If this badge is green, all YOLOv5 GitHub Actions Continuous Integration (CI) tests are passing. These tests evaluate proper operation of basic YOLOv5 functionality, including training (train.py), testing (test.py), inference (detect.py) and export (export.py) on macOS, Windows, and Ubuntu.
Changing the batch size will solve this issue; it occurs when a batch of images contains no objects.
@acai66 ah I see. I remember a similar issue about testing with no targets, but I believed this was resolved. Does this occur when training or testing? Can you supply code to reproduce?
While changing the batch size helped prolong training, this issue still occurs for me. By printing the paths of the images in each batch I can check whether they contain an object, and there is definitely at least one object present on each occasion of the crash (my dataset doesn't have images with empty label files).
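For anyone who wants to run the same check, here is a minimal sketch of that per-batch print, assuming the training dataloader yields (imgs, targets, paths, shapes) tuples as train.py's loader did at the time (the `dataloader` object itself is assumed to already exist):

```python
# Print, for every batch, the image paths and the number of label rows.
# The targets tensor holds one (image_idx, class, x, y, w, h) row per object,
# so zero rows means the whole batch arrived without labels.
for i, (imgs, targets, paths, shapes) in enumerate(dataloader):
    print(f'batch {i}: {targets.shape[0]} labels, paths: {paths}')
    if targets.shape[0] == 0:
        print(f'batch {i} contains no labels at all!')
```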
@MiiaBestLamia can you supply exact steps and code to reproduce this issue following the steps outlined before (current repo, valid environment, common dataset)?
@glenn-jocher I'm using the most recent repository, and all of the requirements are satisfied except the Python version (I'm using 3.6.9), which might be the reason for the issue, though it seems strange that training works for a bit and then crashes. I have altered only one line of code in the repository, the one that prints the paths of the files used in the batch. I'm using a dataset that I may not be allowed to share, so I cannot provide you with the files.
@MiiaBestLamia I would verify your issue is reproducible in one of the environments above. That's what they're there for.
When I change to batch size = 1, the error occurs on a different picture.
I use a private dataset. When I skip the problematic batch, the problem is solved. The problematic batch does contain objects; it is not an empty picture.
Maybe I am wrong about this, but the issue disappeared when I changed to another batch size.
@acai66 the only thing we can act on is a reproducible example in one of the verified environments.
I will reinstall PyTorch from source and fetch the latest yolov5. I will upload my dataset if this issue occurs again.
I am running into the same problem.
After changing some of the hyperparameters in train.py (lr0: 0.001, scale: 0.2), moving the project to a computer with a beefier GPU (from a 2060S to a 2080Ti, still using Python 3.6.9) and increasing the batch size to 8, training has been functioning properly for 4 epochs now. I can still reproduce the issue with my data on the 2080Ti by launching train.py with a batch size of 4, so I suppose this issue is caused by peculiarities in the data, not problems with the network/code.
Here is my dataset.
data yaml:
models yaml:
train command:
This issue disappeared when I changed the batch size to 12.
When I use batch size = 4, it works.
Hi, I think some data augmentations can make the boxes disappear, right? If you use a bigger batch size, this issue will disappear.
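That mechanism is plausible: geometric augmentations clip boxes to the transformed frame and drop degenerate survivors, so an image that had labels can come out of augmentation with none. A self-contained toy illustration of the effect (not YOLOv5's actual augmentation code):

```python
import numpy as np

def crop_boxes(boxes, x1, y1, x2, y2, min_wh=2):
    """Clip xyxy boxes to a crop window and drop near-degenerate survivors."""
    boxes = boxes.copy()
    boxes[:, [0, 2]] = boxes[:, [0, 2]].clip(x1, x2) - x1
    boxes[:, [1, 3]] = boxes[:, [1, 3]].clip(y1, y2) - y1
    keep = (boxes[:, 2] - boxes[:, 0] > min_wh) & (boxes[:, 3] - boxes[:, 1] > min_wh)
    return boxes[keep]

boxes = np.array([[10., 10., 40., 40.]])    # one labelled object, xyxy
print(crop_boxes(boxes, 50, 50, 200, 200))  # [] -- the crop removed every box
```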
On a private dataset I have also had this issue with batch size = 16. I haven't fully tested further, but the dataset does include a fair number of images without objects, for what that's worth.
Does that mean the whole dataset must contain objects, i.e. that an image without labels is an error?
@buimanhlinh96 that is not correct. COCO has over a thousand images without labels.
@glenn-jocher so what happened with this issue? Is it that within a batch we must have at least one labelled image?
@acai66 thank you! I think I can work with this. I can only debug official models though, so I will use yolov5x.yaml in place of yours. Do you yourself see the error when running on the default models?
@buimanhlinh96 I don't know, I have not tried to reproduce yet. I know test.py operates correctly on datasets without labels, I don't know about train.py. Can you provide minimum viable code to reproduce your specific issue?
@acai66 also are you able to reproduce in one of the verified environments?
@glenn-jocher I tried some experiments and found that the batch size should be greater than or equal to 8.
@buimanhlinh96 there is no constraint on batch size, so you should be able to use batch size 1 to batch size x, whatever your hardware can handle. If this is not the case then there is a bug.
You can try the default yolov5x.yaml. I just changed:
@glenn-jocher Yes. Hopefully we can fix it ASAP. Love yolov5.
@acai66 ah I see, of course. We actually updated train.py a few weeks back to inherit
OK, I will try to reproduce this in a Colab notebook today if I have time.
@glenn-jocher I'm in the same boat.
For me the bug hits right after the first epoch (which successfully completes), when moving to the second epoch. It seems fixed by moving the batch size from 4 to 12 as suggested above (Colab runs out of memory on this dataset at 16).
@Jacobsolawetz hmm, OK. Do you have a pretty sparse dataset? Do you think it's possible a whole batch of 4 images might have no labels? Does the bug happen during training or testing?
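One quick way to answer that question is to scan the label files directly. A minimal sketch, assuming YOLO-format .txt labels; the labels/train path is hypothetical and should point at your own label folder:

```python
from pathlib import Path

# Count label files that are empty (zero bytes), i.e. images without objects.
label_dir = Path('labels/train')  # hypothetical path
labels = list(label_dir.glob('*.txt'))
empty = [p for p in labels if p.stat().st_size == 0]
print(f'{len(empty)} of {len(labels)} label files are empty')
for p in empty:
    print(p)
```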
@acai66 I'm able to reproduce this in a Colab notebook. I see this midway through the first epoch:
@ZeKunZhang1998 @mrk230 @Jacobsolawetz @acai66 @buimanhlinh96 this issue should be resolved now in 7eaf225. Please let us know if you run into any more problems, and good luck!
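The idea behind that class of fix can be sketched as a guard on the zero-target case. This is a hedged illustration of the concept only, not the contents of commit 7eaf225:

```python
import torch

def unpack_image_class(t):
    """Unpack the image-index and class columns of a targets tensor,
    guarding the zero-target case. A sketch of the idea, not the commit."""
    if t.dim() == 3:                    # per-anchor tensor never filtered to 2-D
        t = t.reshape(-1, t.shape[-1])  # collapse to (n, 6); n == 0 for empty batches
    b, c = t[:, :2].long().T            # now always a (2, n) transpose
    return b, c

b, c = unpack_image_class(torch.zeros(3, 0, 6))
print(b.shape, c.shape)  # torch.Size([0]) torch.Size([0]) -- no crash
```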
@acai66 for your dataset I would recommend several changes:
Thank you very much for your recommendations; I will try to do that. This issue was solved after a git pull of the latest commits.
@glenn-jocher yes... after introspection, there are maybe 6 or so images in the dataset of 500 that do not have annotations. A random grouping of those may have caused the crash. Thanks for fixing this bug so quickly!
@glenn-jocher Thank you very much!!!!!!!
@glenn-jocher I am also facing the same issue (cloned the latest code). Maybe the bug still remains; it is quite strange because it can train to the final epoch before the error happens. I trained with yolov5-s.yaml and batch-size=100 (maybe it is too large?) on 2 RTX 2080Ti GPUs. Every image contains at least one object.
@anhnktp no, you are incorrect, you are not using the latest code. L545 no longer contains the same code, so the error message you see is not possible to produce in origin/master.
@glenn-jocher oh, I see. I was on the yolov5 version from 2 days ago; you have since added some code. I'll recheck. Thank you.
Hello yolov5 developers, I would like you to apply the same fix to yolov4 PyTorch in Google Colab; I tried it and yolov4 has the same problem. Please help me; thank you in advance for your help.
@Kachasukintim 👋 hi, thanks for letting us know about this problem with YOLOv5 🚀. We've created a few short guidelines below to help users provide what we need in order to get started investigating a possible problem.

How to create a Minimal, Reproducible Example

When asking a question, people will be better able to provide help if you provide code that they can easily understand and use to reproduce the problem. This is referred to by community members as creating a minimum reproducible example. Your code that reproduces the problem should be:
In addition to the above requirements, for Ultralytics to provide assistance your code should be:
If you believe your problem meets all of the above criteria, please close this issue and raise a new one using the 🐛 Bug Report template and providing a minimum reproducible example to help us better understand and diagnose your problem. Thank you! 😃
❔ Question
Traceback (most recent call last):
File "train.py", line 463, in
train(hyp, tb_writer, opt, device)
File "train.py", line 286, in train
loss, loss_items = compute_loss(pred, targets.to(device), model) # scaled by batch_size
File "/content/drive/My Drive/yolov5/utils/utils.py", line 443, in compute_loss
tcls, tbox, indices, anchors = build_targets(p, targets, model) # targets
File "/content/drive/My Drive/yolov5/utils/utils.py", line 542, in build_targets
b, c = t[:, :2].long().T # image, class
ValueError: too many values to unpack (expected 2)
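For reference, this exact unpack error can be reproduced in isolation: if the targets tensor reaching that line is still three-dimensional (as can happen when the per-anchor filtering branch is skipped on a zero-target batch), the transpose yields more than two leading rows. A minimal sketch of the failure mode, not the actual YOLOv5 code path:

```python
import torch

# A (num_anchors, num_targets, 6) tensor with zero targets that was never
# filtered down to 2-D; slicing then transposing leaves 6 leading rows.
# (Newer PyTorch versions warn that .T on non-2-D tensors is deprecated.)
t = torch.zeros(3, 0, 6)
print(t[:, :2].long().T.shape)  # torch.Size([6, 0, 3])
b, c = t[:, :2].long().T        # ValueError: too many values to unpack (expected 2)
```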
Additional context