
bipartite loss ? #1083

Closed
pfeatherstone opened this issue Oct 4, 2020 · 9 comments
Labels
question Further information is requested Stale

Comments

@pfeatherstone

❔Question

Is it possible to use a bipartite loss, as used in https://github.com/facebookresearch/detr?

This essentially uses the Hungarian algorithm to match predictions with targets, so the order of the targets is irrelevant: the Hungarian matching gives you the same loss regardless. In theory, you can then directly predict object bounding boxes without having to worry about the ordering in your labels.
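The matching idea can be sketched in a few lines. This is a toy, brute-force stand-in for the Hungarian algorithm (DETR uses a polynomial-time solver and a combined class/box/GIoU cost; here the cost is plain L1 distance over hypothetical (x, y, w, h) tuples), just to show the order-invariance:

```python
import itertools
import math

def bipartite_l1_loss(pred_boxes, target_boxes):
    """Order-invariant loss: match each target to a distinct prediction so that
    the total cost is minimal. Brute force over assignments for clarity; the
    Hungarian algorithm computes the same optimum in polynomial time."""
    def l1(a, b):  # illustrative cost: L1 distance between two (x, y, w, h) boxes
        return sum(abs(p - q) for p, q in zip(a, b))

    best = math.inf
    for perm in itertools.permutations(range(len(pred_boxes)), len(target_boxes)):
        cost = sum(l1(pred_boxes[p], t) for p, t in zip(perm, target_boxes))
        best = min(best, cost)
    return best / len(target_boxes)
```

Because the minimum is taken over all assignments, shuffling `target_boxes` leaves the loss unchanged, which is exactly the property that frees you from label ordering.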

So rather than have YOLO layers that give you 13x13x3 + 26x26x3 + 52x52x3 = 10647 predictions (with h, w = 416), you simply have a backbone + average pooling + linear layers, almost as if you were constructing a classifier, that predicts, let's say, a maximum of 100 objects per image.

Is this possible? Or can this only be done using a transformer and some learned 'queries'? It would be neat if an object detector could have the same architecture as a classifier and use a fancy loss function to give you predictions.

I know this would no longer be yolo, but this repo is as good a place as any to ask object detection related questions.

Thoughts anyone?

@pfeatherstone pfeatherstone added the question Further information is requested label Oct 4, 2020
@glenn-jocher
Member

@pfeatherstone well, from an architecture standpoint it's very easy to add average pooling and a fully connected / linear layer to the backbone, eliminating the head. In models/common.py we already have a Classify() module that essentially does this, which you could just append to the backbone. The only difference is you would replace the nn.Conv2d(c1, c2) with an nn.Linear(c1, c2).

import torch
import torch.nn as nn
# autopad() is defined alongside these modules in models/common.py


class Flatten(nn.Module):
    # Use after nn.AdaptiveAvgPool2d(1) to remove last 2 dimensions
    @staticmethod
    def forward(x):
        return x.view(x.size(0), -1)


class Classify(nn.Module):
    # Classification head, i.e. x(b,c1,20,20) to x(b,c2)
    def __init__(self, c1, c2, k=1, s=1, p=None, g=1):  # ch_in, ch_out, kernel, stride, padding, groups
        super(Classify, self).__init__()
        self.aap = nn.AdaptiveAvgPool2d(1)  # to x(b,c1,1,1)
        self.conv = nn.Conv2d(c1, c2, k, s, autopad(k, p), groups=g, bias=False)  # to x(b,c2,1,1)
        self.flat = Flatten()

    def forward(self, x):
        z = torch.cat([self.aap(y) for y in (x if isinstance(x, list) else [x])], 1)  # cat if list
        return self.flat(self.conv(z))  # flatten to x(b,c2)

The real question is the loss function and forming box candidates, etc. I'm not clear on how this would work. For 100 boxes your output vector would be 500 long I suppose, assuming 1 anchor only, and might represent linearly spaced regions in the normalized image space.
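For what it's worth, the head being discussed might look something like this. This is a hypothetical sketch, not code from this repo; the channel count and the 100 x 5 output layout are assumptions:

```python
import torch
import torch.nn as nn

class DirectSetHead(nn.Module):
    # Hypothetical head: average pool + linear layer emitting a fixed set of
    # box candidates, i.e. the 500-long vector reshaped to (b, 100, 5).
    def __init__(self, c1, num_boxes=100):
        super().__init__()
        self.aap = nn.AdaptiveAvgPool2d(1)      # (b, c1, h, w) -> (b, c1, 1, 1)
        self.fc = nn.Linear(c1, num_boxes * 5)  # 5 = x, y, w, h, confidence
        self.num_boxes = num_boxes

    def forward(self, x):
        z = self.aap(x).flatten(1)                     # (b, c1)
        return self.fc(z).view(-1, self.num_boxes, 5)  # (b, num_boxes, 5)
```

Applied to an x(2,512,20,20) feature map, `DirectSetHead(c1=512)` returns an x(2,100,5) tensor, one row per candidate box.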

But an averagepool would kill your spatial information I think. Classifiers use this op because they don't care about spatial information, but for a detector...?

@pfeatherstone
Author

Yeah, I don't know. Maybe Facebook tried this at first and it didn't work, hence the additional transformer DETR appends to the backbone. Still worth a try though. If a classifier can tell you what's in an image, I don't see why it can't tell you where it is, and therefore where multiple things are. You just need the right loss function, hence the bipartite loss. Maybe I'm being naive.

@glenn-jocher
Member

glenn-jocher commented Oct 5, 2020

You can think of a detector as a grid of classifiers. A classifier will stretch an image from x(1,3,640,640) to x(1,n) for n classes, where the detector analog does this at each grid point to form an x(1,n*5,20,20) grid, for example at stride 32.

The adaptive maxpool or averagepool op that converts the 20x20 grid into a single point destroys the location information while retaining the classification information.

i.e. if you imagine two boxes, say a cat at -1 position and a dog at +1 position, the averagepool op will return 0 position and 50% cat, 50% dog. You keep the classification info well, but the location info has been mostly eliminated.
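The cat/dog example above works out numerically like this (pure-Python sketch with made-up one-hot scores):

```python
# Two grid cells: a cat at position -1 and a dog at position +1.
positions = [-1.0, +1.0]
cat_scores = [1.0, 0.0]   # left cell is 100% cat
dog_scores = [0.0, 1.0]   # right cell is 100% dog

def global_avg_pool(values):
    # what a global average pool reduces the grid to: one number per channel
    return sum(values) / len(values)

print(global_avg_pool(positions))   # 0.0 -> both objects "located" in the middle
print(global_avg_pool(cat_scores))  # 0.5 -> 50% cat
print(global_avg_pool(dog_scores))  # 0.5 -> 50% dog
```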

I don't completely understand what DETR is doing, but I also haven't been motivated to find out due to their reported mAPs, which are not (yet) competitive with what we have here.

@pfeatherstone
Author

pfeatherstone commented Oct 5, 2020

Each grid cell in YOLO is predicting x, y offsets and widths and heights (well, log(width/anchor) and log(height/anchor)). So if you expand what each grid cell is doing to the whole image, it's not obvious to me why direct bbox prediction isn't possible with the right loss function. I can see your point that the average pooling destroys spatial information in terms of the grid cells, but you can't think of the grid cells as grid cells anymore. The receptive field after 5 downsampling layers might well extend beyond a 2^5-wide/high cell, so spatial information extends beyond the grid cell.
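A quick back-of-envelope check of the receptive field claim, assuming a simplified backbone of stacked 3x3 stride-2 convolutions (an assumption; real backbones interleave stride-1 layers, which only widens the field further):

```python
def receptive_field(layers, k=3, s=2):
    # standard receptive-field recurrence: each layer adds (k-1) * jump pixels,
    # and the stride multiplies the spacing ("jump") between output positions
    rf, jump = 1, 1
    for _ in range(layers):
        rf += (k - 1) * jump
        jump *= s
    return rf

print(receptive_field(5))  # 63 pixels, i.e. wider than the 2**5 = 32-pixel cell
```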

@pfeatherstone
Author

The main motivation is that training YOLO requires too many hyperparameters. The big one for me is the IoU threshold for the no-object loss; it's not obvious to me what that should be. Ideally it would also be a trainable parameter, but I doubt the network would be stable during training. Also, a grid cell might well have 2 or more objects that require the same anchor, in which case all but one are ignored. That's bad. So it's interesting to see whether a direct set prediction is possible using standard CNN layers. We know it's possible when using transformers, as shown by DETR, so why not CNNs? My main objection to DETR is that it takes 5 days to train on 8 GPUs. That's impractical.
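The anchor-collision point can be made concrete. In this toy sketch (hypothetical `grid_assign` helper, one anchor per cell assumed), two boxes that fall into the same 13x13 cell fight for the same slot and one is silently dropped:

```python
def grid_assign(targets, grid=13):
    # one (cell -> target) slot per cell, single anchor assumed:
    # a second target landing in the same cell overwrites the first
    slots = {}
    for (x, y, w, h) in targets:  # normalized xywh boxes
        cell = (int(x * grid), int(y * grid))
        slots[cell] = (x, y, w, h)
    return list(slots.values())

targets = [(0.50, 0.50, 0.20, 0.30), (0.52, 0.51, 0.10, 0.10)]  # same cell
print(len(grid_assign(targets)))  # 1 -> one of the two objects is ignored
```

A bipartite matcher has no such slot structure, so both targets would receive their own matched prediction.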

@pfeatherstone
Author

Also, we are struggling to use YOLO networks accurately with our datasets. It feels like most repos that train YOLO networks are super optimised for COCO, and apart from recalculating anchors, it's not obvious how to fine-tune all the other hyperparameters for custom datasets. Hence I thought it would be interesting to discuss what other object detectors do, what's good, and what's not.

@glenn-jocher
Member

glenn-jocher commented Oct 5, 2020

@pfeatherstone ah, well if you have a serious use case you'll definitely want to evolve hyperparameters on your custom dataset, see https://docs.ultralytics.com/yolov5/tutorials/hyperparameter_evolution. This usually produces up to a +10 AP increase on custom datasets (i.e. on VOC I saw AP increase from 80 with default hyps to 90+ with evolved hyps), and evolves everything from anchor count and the anchor threshold you mentioned to the augmentation policy. 300 generations will land you in a good local minimum, and you can generally use yolov5m for the evolution; the results can be applied to all 4 size models s/m/l/x.
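The evolution loop itself is conceptually simple. This is not the repo's actual implementation, just a toy hill-climbing sketch of the idea (mutate the best hyperparameters, keep the mutation if fitness improves):

```python
import random

def evolve(fitness, hyps, generations=300, sigma=0.2, seed=0):
    # greedy (1+1)-style evolution: multiplicative Gaussian mutation of the
    # incumbent hyperparameters, accepted only when fitness improves
    rng = random.Random(seed)
    best, best_fit = dict(hyps), fitness(hyps)
    for _ in range(generations):
        cand = {k: max(v * (1 + rng.gauss(0, sigma)), 1e-6) for k, v in best.items()}
        f = fitness(cand)
        if f > best_fit:
            best, best_fit = cand, f
    return best
```

With a toy fitness like `lambda h: -(h["lr"] - 0.01) ** 2`, starting from `{"lr": 0.1}` the loop walks the learning rate toward 0.01; in practice the fitness would be a weighted combination of mAP and other training metrics.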

In terms of experimenting with different heads and loss functions, that's a pretty high risk proposition, meaning that you might invest serious time and effort down that road with possibly no return for your effort.

It's possible you might find a way to reframe the problem or update the loss function to accommodate a max/average pool operation that does away with the grid, but doing so while also increasing AP would be quite a remarkable feat.

@pfeatherstone
Author

Oooh. Thank you for the link to hyperparameter evolution. Will definitely give that a try.

@github-actions
Contributor

github-actions bot commented Nov 5, 2020

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.
