
Possible AutoAnchor reversal in v2.0 #447

Closed
123456789mojtaba opened this issue Jul 19, 2020 · 25 comments

@123456789mojtaba

Hey guys,
I have trained YOLOv5 on VisDrone for cars and pedestrians, but it detects some cars and pedestrians with two bounding boxes instead of one. Does anyone know the cause?
[attached image: example of duplicate detections]

@123456789mojtaba 123456789mojtaba added the bug Something isn't working label Jul 19, 2020
@priteshgohil

I have a similar problem with YOLOv5s. I'm not sure why it predicts a small bounding box. Next, I will train on the default anchors instead of computing them during training. I suspect the anchors play a role here, because the anchors proposed for my dataset are smaller.

@TaoXieSZ

Can setting a higher IoU threshold help?
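
(For context: duplicate boxes are normally merged by NMS at inference. Here is a minimal sketch of how the IoU threshold interacts with duplicates, using torchvision.ops.nms for illustration with made-up box coordinates. NMS drops any box whose IoU with a higher-scoring box exceeds the threshold, so a lower threshold suppresses more near-duplicates:)

import torch
from torchvision.ops import nms

# Two overlapping detections of the same object (illustrative values, IoU ~0.54)
boxes = torch.tensor([[100., 100., 200., 200.],
                      [130., 100., 230., 200.]])  # xyxy format
scores = torch.tensor([0.9, 0.8])
print(nms(boxes, scores, iou_threshold=0.7))   # tensor([0, 1]) -> both kept, duplicate survives
print(nms(boxes, scores, iou_threshold=0.45))  # tensor([0])    -> duplicate suppressed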

@glenn-jocher glenn-jocher removed the bug Something isn't working label Jul 19, 2020
@glenn-jocher
Member

@123456789mojtaba do not use a bug label for training results that you don't understand.

@glenn-jocher
Member

@123456789mojtaba @priteshgohil First, without looking at your training results.png it is impossible to say whether you have trained properly; presenting anecdotal evidence of improper training on a custom dataset, out of context, makes it impossible for anyone to help you properly.

Second, 5s is naturally the smallest and least accurate model. If your goal is accuracy, 5s should obviously not be your first choice. You can see a comparison in our README table: https://github.com/ultralytics/yolov5#pretrained-checkpoints

@priteshgohil

priteshgohil commented Jul 22, 2020

@glenn-jocher There is no doubt about the dataset or the training. The problem occurs even with YOLOv5l. As I predicted, the fault lies with the calculated anchor boxes, because the check_anchors() function gives smaller anchor values for my dataset. I get very good results with the default anchors. I will post training results.png and prediction results by Saturday, 25.07.2020.

@glenn-jocher
Member

@priteshgohil hmm, that's strange. check_anchors() is supposed to check your anchors to make sure they are aligned with your stride order, i.e. both should run large-to-small or small-to-large, depending on your head.

@glenn-jocher
Member

@priteshgohil ah, never mind: check_anchors() recomputes new anchors if needed based on your dataset's BPR (best possible recall). You can disable it with python train.py --noautoanchor
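
For example (coco128.yaml ships with the repo; the other flags shown here are the usual training flags and exact names may vary by version):

python train.py --img 640 --batch 16 --epochs 30 --data coco128.yaml --weights yolov5s.pt --noautoanchor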

@priteshgohil

@glenn-jocher Thank you!!
So here are the results.png and predictions.

YOLOv5s with AutoAnchor:

[images: results5s, frankfurt_000001_077233_leftImg8bit]

YOLOv5s without AutoAnchor (i.e. --noautoanchor):

[images: results5l-noautoanchors, frankfurt_000001_077233_leftImg8bit]

@glenn-jocher
Member

glenn-jocher commented Jul 27, 2020

@priteshgohil ah, interesting. Yes, the second is definitely better. Can you report your anchors for both using:
print(torch.load('yolov5s.pt')['model'].model[-1].anchors)

AutoAnchor (actually any anchor evolution using our code) works under the assumption that the objects are spread across a range of sizes relative to the model output strides 8, 16 and 32. In theory, if your labels are composed solely of larger or smaller objects, then some output layers may be better off being completely removed or ignored than being assigned anchors far outside their receptive field size. In practice, though, it is difficult to determine actual receptive field dimensions.

@priteshgohil

priteshgohil commented Jul 27, 2020

Hi @glenn-jocher, thank you for explaining. We have the labels.png generated during training, which is really cool. Can you explain (or link to an explanation of) how to interpret this image?

I have the following values.

With AutoAnchor

The console output during training was:

thr=0.25: 0.9990 best possible recall, 4.61 anchors past thr
n=9, img_size=416, metric_all=0.313/0.732-mean/best, past_thr=0.488-mean: 6,6,  12,11,  12,25,  23,16,  37,26,  30,61,  62,40,  94,72,  139,123
thr=0.25: 0.9995 best possible recall, 5.22 anchors past thr
n=9, img_size=416, metric_all=0.345/0.757-mean/best, past_thr=0.493-mean: 5,4,  7,7,  13,10,  8,18,  21,17,  19,43,  36,28,  63,46,  113,88
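
(For reference, this is roughly how the "best possible recall" and "anchors past thr" numbers are computed; a paraphrased sketch of the AutoAnchor metric, with the function name mine:)

import torch

def anchor_metrics(k, wh, thr=0.25):
    # k: (9, 2) anchor sizes, wh: (n, 2) dataset label sizes, both in pixels
    r = wh[:, None] / k[None]              # per-dimension size ratios, shape (n, 9, 2)
    x = torch.min(r, 1 / r).min(2)[0]      # worst-case ratio per label-anchor pair, (n, 9)
    best = x.max(1)[0]                     # best anchor match per label
    bpr = (best > thr).float().mean()      # best possible recall, e.g. 0.9990
    aat = (x > thr).float().sum(1).mean()  # mean anchors past thr per label, e.g. 4.61
    return bpr, aat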

I have one question here: are these the newly calculated anchors? If so, why don't they match the following anchors saved in the model? It looks like the larger anchor group was divided by 8 and the smaller group by 32, whereas it should be the opposite, right? Correct me if I'm wrong.

tensor([[[ 4.49609,  3.44922],
         [ 7.89453,  5.73438],
         [14.11719, 11.00000]],

        [[ 0.49658,  1.11914],
         [ 1.30859,  1.06055],
         [ 1.21582,  2.69922]],

        [[ 0.14978,  0.13513],
         [ 0.23328,  0.22156],
         [ 0.41089,  0.31543]]], dtype=torch.float16)

Without AutoAnchor

These anchors match the values in the yolov5s.yaml file.

tensor([[[ 3.62500,  2.81250],
         [ 4.87500,  6.18750],
         [11.65625, 10.18750]],

        [[ 1.87500,  3.81250],
         [ 3.87500,  2.81250],
         [ 3.68750,  7.43750]],

        [[ 1.25000,  1.62500],
         [ 2.00000,  3.75000],
         [ 4.12500,  2.87500]]], dtype=torch.float16)
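
(A quick way to sanity-check the order is to multiply each anchor group back by an assumed small-to-large stride order and compare against the k-means output above; 'best.pt' here is a placeholder for the trained checkpoint:)

import torch

anchors = torch.load('best.pt')['model'].model[-1].anchors.float()  # stored in stride units
strides = torch.tensor([8., 16., 32.]).view(-1, 1, 1)               # assumed P3, P4, P5 order
print(anchors * strides)  # should recover the pixel anchors printed during training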

@glenn-jocher
Member

@priteshgohil the anchors displayed by this command are in stride units. You are using a pre-v2.0 version of the repo, so your anchors are reversed compared to the v2.0 anchors, but this is not a problem.

yolov5s.yaml:

# anchors
anchors:
  - [10,13, 16,30, 33,23]  # P3/8
  - [30,61, 62,45, 59,119]  # P4/16
  - [116,90, 156,198, 373,326]  # P5/32

yolov5s anchors:

print(torch.load('yolov5s.pt')['model'].model[-1].anchors)
tensor([[[ 1.25000,  1.62500],
         [ 2.00000,  3.75000],
         [ 4.12500,  2.87500]],
        [[ 1.87500,  3.81250],
         [ 3.87500,  2.81250],
         [ 3.68750,  7.43750]],
        [[ 3.62500,  2.81250],
         [ 4.87500,  6.18750],
         [11.65625, 10.18750]]], dtype=torch.float16)
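
(As a quick sanity check, the saved values are exactly the yaml pixel anchors divided by each layer's stride:)

import torch

yaml_anchors = torch.tensor([[10, 13, 16, 30, 33, 23],       # P3/8
                             [30, 61, 62, 45, 59, 119],      # P4/16
                             [116, 90, 156, 198, 373, 326]], # P5/32
                            dtype=torch.float32).view(3, 3, 2)
strides = torch.tensor([8., 16., 32.]).view(-1, 1, 1)
print(yaml_anchors / strides)  # matches the tensor above, e.g. 10/8 = 1.25, 13/8 = 1.625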

You have two anchor computations that both look similar, but they do not correspond to your autoanchor model output. Since your code is out of date, there are likely issues with it that have already been resolved. I would git clone the most recent repo and repeat your experiment, using all default settings (changing nothing except with and without autoanchor). It looks like you only need about 30 training epochs to make a comparison.

@priteshgohil

priteshgohil commented Jul 28, 2020

Hi @glenn-jocher, yes, you are right, thank you :). The problem is solved, and the results are good with the latest git pull.

The problem in v2.0 was the reversed anchors: the k-means-computed anchors were divided by the wrong stride values (32, 16, 8 instead of 8, 16, 32). However, I am also able to get perfect results in v2.0 by changing this line:

m.anchors[:] = new_anchors.clone().view_as(m.anchors) / m.stride.to(m.anchors.device).view(-1, 1, 1)  # loss

to:

m.anchors[:] = new_anchors.clone().view_as(m.anchors) / torch.flip(m.stride.to(m.anchors.device).view(-1, 1, 1), [0, 1])  # loss
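
(The flip works because in that version m.stride ran large to small, i.e. [32, 16, 8], while new_anchors came back sorted small to large; torch.flip reverses the stride tensor to [8, 16, 32] so each anchor group is divided by its matching stride. Flipping dim 1 is a no-op since that dimension has size 1.)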

@glenn-jocher
Member

@priteshgohil I don't understand. Are you saying that utils.py L99 in 7f8471e (current master) needs changing?

@glenn-jocher
Member

glenn-jocher commented Jul 28, 2020

L99 is the line that divides the anchors from pixel units into stride units. L100, right after it, is supposed to check the anchor order and reverse the anchors if necessary. Perhaps this region of the code should be updated to make it more robust to different scenarios. For now it should work fine with the public architectures offered (I'm currently training several models that rely on AutoAnchor, and they are training correctly).
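
(Roughly, the L100 check works like this; a paraphrased sketch of check_anchor_order(), not the exact source:)

def check_anchor_order(m):
    # Compare the anchor-area trend to the stride trend for the Detect() module m,
    # and flip the anchor order if the two disagree.
    a = m.anchors.prod(-1).view(-1)   # anchor areas
    da = a[-1] - a[0]                 # area trend across output layers
    ds = m.stride[-1] - m.stride[0]   # stride trend across output layers
    if da.sign() != ds.sign():        # orders disagree
        print('Reversing anchor order')
        m.anchors[:] = m.anchors.flip(0)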

@priteshgohil

priteshgohil commented Jul 28, 2020

Hi @glenn-jocher. Sorry for creating a misunderstanding. The current master (7f8471e) is perfectly fine and doesn't need any changes. The problem occurred when I was using an old version whose yolov5s.yaml used the following anchor order:

# anchors
anchors:
  - [116,90, 156,198, 373,326]  # P5/32
  - [30,61, 62,45, 59,119]  # P4/16
  - [10,13, 16,30, 33,23]  # P3/8

So L100 in utils.py will correct the order, but I think it should run before L99, so that the anchors are then divided by the correct stride values (correct me if I'm wrong).

In my old version of the repo, L99 had the following tensor values, where it is necessary to flip either the divisor tensor or the new-anchor tensor.

>> m.stride.to(m.anchors.device).view(-1, 1, 1)
tensor([[[32.]],
        [[16.]],
        [[ 8.]]])

>> new_anchors.clone().view_as(m.anchors)
tensor([[[  4.79442,   4.32408],
         [  7.46562,   7.09048],
         [ 13.14909,  10.09316]],

        [[  7.94588,  17.91208],
         [ 20.93719,  16.97507],
         [ 19.46055,  43.18595]],

        [[ 35.97452,  27.59841],
         [ 63.15837,  45.87284],
         [112.93896,  87.99326]]])

After L99:

tensor([[[ 0.14983,  0.13513],
         [ 0.23330,  0.22158],
         [ 0.41091,  0.31541]],

        [[ 0.49662,  1.11951],
         [ 1.30857,  1.06094],
         [ 1.21628,  2.69912]],

        [[ 4.49682,  3.44980],
         [ 7.89480,  5.73411],
         [14.11737, 10.99916]]])

After L100:

tensor([[[ 4.49682,  3.44980],
         [ 7.89480,  5.73411],
         [14.11737, 10.99916]],

        [[ 0.49662,  1.11951],
         [ 1.30857,  1.06094],
         [ 1.21628,  2.69912]],

        [[ 0.14983,  0.13513],
         [ 0.23330,  0.22158],
         [ 0.41091,  0.31541]]])

So do you see the problem? The anchors were divided by the wrong values, and check_anchor_order at L100 only changes their order.
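
(In other words, the proposed fix is to reorder first and divide second; a sketch using the variable names from this thread:)

m.anchors[:] = new_anchors.clone().view_as(m.anchors)      # still in pixel units
check_anchor_order(m)                                      # align anchor order with stride order first
m.anchors /= m.stride.to(m.anchors.device).view(-1, 1, 1)  # then convert pixels to stride units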

@glenn-jocher glenn-jocher changed the title from "inference" to "Possible AutoAnchor reversal in v2.0" Jul 28, 2020
@glenn-jocher glenn-jocher added bug Something isn't working TODO labels Jul 28, 2020
@glenn-jocher glenn-jocher self-assigned this Jul 28, 2020
@glenn-jocher glenn-jocher reopened this Jul 28, 2020
@glenn-jocher
Member

glenn-jocher commented Jul 28, 2020

@priteshgohil yes, I believe you are correct that we should adjust the order in conjunction with the strides to keep the two synchronized. The evolved anchors are sorted from small to large before being attached to the model and divided by stride, and in v2.0 model yamls the stride order is also always small to large.

But I just finished training a v2.0 AutoAnchor model, and while the training mAPs were good (better than the official model, actually), when I test the saved model I get about half the expected mAP. So it seems something is still not quite right.

@glenn-jocher
Member

@priteshgohil I've taken a quick look, and am very confused about what could be wrong. The same EMA gets passed to test.py during training as is saved each epoch, so there should not be any differences. If the EMA performs at x mAP during training then test.py should produce the same results independently.

Just to be clear: were you able to train a v2.0 model using AutoAnchor and observe good training results, and also, separately, once training was complete, observe good test.py results using best.pt or last.pt?

@priteshgohil

priteshgohil commented Jul 29, 2020

@glenn-jocher Yes, I completed training, and I observe that the results are very similar to YOLOv5s trained on the previous version without AutoAnchor, with just a little boost on the specific class that is more frequent than the other object categories in my dataset. results.png is almost the same as the one I posted earlier in this issue, except that the minimum objectness for both training and validation is 0.1 instead of 0.05.

@glenn-jocher
Member

@priteshgohil ok thanks. Maybe the problem is only in my dev branch then.

@glenn-jocher
Member

@priteshgohil I'm trying to figure out the status of this issue. Are you still seeing any problems in the current code, or would you say the original issue appears resolved now?

@priteshgohil

@glenn-jocher I don't see any problems now. I even tried altering the order of the anchors in the .yaml file, and it worked as expected.

@glenn-jocher
Member

@priteshgohil ok, great, thanks!

@github-actions
Contributor

github-actions bot commented Sep 7, 2020

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

@glenn-jocher
Member

TODO removed as original issue appears resolved.

@xiaomao19970819

@priteshgohil Hi, I found a problem in the latest version of the code, and I have the same opinion as you. I don't understand why we divide by the stride at line 58 of autoanchor.py before checking the area order of the anchors. I think check_anchor_order should run before line 58, rather than dividing by the stride first. This can even cause the size of an anchor to exceed the longest edge of the image. I don't know if I'm right; please correct me if I'm wrong.
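
(For example, assuming anchors are scaled back by stride at detection time: if the P5 pixel anchor 373 is divided by stride 8 instead of 32, it is stored as 46.6 stride units, and multiplying back by the true stride of 32 gives an effective anchor of about 1492 px, far larger than a 640 px image edge.)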
