
bipartite loss ? #1083

Closed
pfeatherstone opened this issue Oct 4, 2020 · 9 comments
Labels
question Further information is requested Stale

Comments

@pfeatherstone

❔Question

Is it possible to use a bipartite loss, as used in https://github.com/facebookresearch/detr?

This essentially uses the Hungarian algorithm to match predictions with targets, so the order of the targets is irrelevant: the Hungarian matching gives you the same loss regardless. In theory, you can then directly predict object bounding boxes without having to worry about the ordering in your labels.
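The matching idea can be sketched in a few lines. This is a toy, brute-force stand-in for the Hungarian algorithm (DETR uses a polynomial-time solver and a combined class/box/GIoU cost; here the cost is plain L1 distance over hypothetical (x, y, w, h) tuples), just to show the order-invariance:

```python
import itertools
import math

def bipartite_l1_loss(pred_boxes, target_boxes):
    """Order-invariant loss: match each target to a distinct prediction so that
    the total cost is minimal. Brute force over assignments for clarity; the
    Hungarian algorithm computes the same optimum in polynomial time."""
    def l1(a, b):  # illustrative cost: L1 distance between two (x, y, w, h) boxes
        return sum(abs(p - q) for p, q in zip(a, b))

    best = math.inf
    for perm in itertools.permutations(range(len(pred_boxes)), len(target_boxes)):
        cost = sum(l1(pred_boxes[p], t) for p, t in zip(perm, target_boxes))
        best = min(best, cost)
    return best / len(target_boxes)
```

Because the minimum is taken over all assignments, shuffling `target_boxes` leaves the loss unchanged, which is exactly the property that frees you from label ordering.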

So rather than have YOLO layers that give you 13x13x3 + 26x26x3 + 52x52x3 = 10647 predictions (with h, w = 416), you simply have a backbone + average pooling + linear layers, almost as if you were constructing a classifier, that predicts, let's say, a maximum of 100 objects per image.

Is this possible? Or can this only be done using a transformer and some learned 'queries'? It would be neat if an object detector could have the same architecture as a classifier and use a fancy loss function to give you predictions.

I know this would no longer be yolo, but this repo is as good a place as any to ask object detection related questions.

Thoughts anyone?

@pfeatherstone pfeatherstone added the question Further information is requested label Oct 4, 2020
@glenn-jocher
Member

@pfeatherstone well, from an architecture standpoint it's very easy to add average pooling and a fully connected / linear layer to the backbone, eliminating the head. In models/common.py we already have a Classify() module that essentially does this, which you could just append to the backbone. The only difference is you would replace the nn.Conv2d(c1, c2) with an nn.Linear(c1, c2).

import torch
import torch.nn as nn
# autopad() is defined alongside these modules in models/common.py


class Flatten(nn.Module):
    # Use after nn.AdaptiveAvgPool2d(1) to remove last 2 dimensions
    @staticmethod
    def forward(x):
        return x.view(x.size(0), -1)


class Classify(nn.Module):
    # Classification head, i.e. x(b,c1,20,20) to x(b,c2)
    def __init__(self, c1, c2, k=1, s=1, p=None, g=1):  # ch_in, ch_out, kernel, stride, padding, groups
        super(Classify, self).__init__()
        self.aap = nn.AdaptiveAvgPool2d(1)  # to x(b,c1,1,1)
        self.conv = nn.Conv2d(c1, c2, k, s, autopad(k, p), groups=g, bias=False)  # to x(b,c2,1,1)
        self.flat = Flatten()

    def forward(self, x):
        z = torch.cat([self.aap(y) for y in (x if isinstance(x, list) else [x])], 1)  # cat if list
        return self.flat(self.conv(z))  # flatten to x(b,c2)

The real question is the loss function and forming box candidates, etc. I'm not clear on how this would work. For 100 boxes your output vector would be 500 long I suppose, assuming 1 anchor only, and might represent linearly spaced regions in the normalized image space.
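For what it's worth, the head being discussed might look something like this. This is a hypothetical sketch, not code from this repo; the channel count and the 100 x 5 output layout are assumptions:

```python
import torch
import torch.nn as nn

class DirectSetHead(nn.Module):
    # Hypothetical head: average pool + linear layer emitting a fixed set of
    # box candidates, i.e. the 500-long vector reshaped to (b, 100, 5).
    def __init__(self, c1, num_boxes=100):
        super().__init__()
        self.aap = nn.AdaptiveAvgPool2d(1)      # (b, c1, h, w) -> (b, c1, 1, 1)
        self.fc = nn.Linear(c1, num_boxes * 5)  # 5 = x, y, w, h, confidence
        self.num_boxes = num_boxes

    def forward(self, x):
        z = self.aap(x).flatten(1)                     # (b, c1)
        return self.fc(z).view(-1, self.num_boxes, 5)  # (b, num_boxes, 5)
```

Applied to an x(2,512,20,20) feature map, `DirectSetHead(c1=512)` returns an x(2,100,5) tensor, one row per candidate box.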

But an averagepool would kill your spatial information I think. Classifiers use this op because they don't care about spatial information, but for a detector...?

@pfeatherstone
Author

Yeah, I don't know. Maybe Facebook tried this at first and it didn't work, hence the additional transformer DETR appends to the backbone. Still worth a try though. If a classifier can tell you what's in an image, I don't see why it can't tell you where it is, and therefore where multiple things are. You just need the right loss function, hence the bipartite loss. Maybe I'm being naive.

@glenn-jocher
Member

glenn-jocher commented Oct 5, 2020

You can think of a detector as a grid of classifiers. A classifier will stretch an image from x(1,3,640,640) to x(1,n) for n classes, where the detector analog does this at each grid point to form an x(1,n*5,20,20) grid, for example at stride 32.

The adaptive maxpool or averagepool op that converts the 20x20 grid into a single point destroys the location information while retaining the classification information.

i.e. if you imagine two boxes, say a cat at -1 position and a dog at +1 position, the averagepool op will return 0 position and 50% cat, 50% dog. You keep the classification info well, but the location info has been mostly eliminated.
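The cat/dog example above works out numerically like this (pure-Python sketch with made-up one-hot scores):

```python
# Two grid cells: a cat at position -1 and a dog at position +1.
positions = [-1.0, +1.0]
cat_scores = [1.0, 0.0]   # left cell is 100% cat
dog_scores = [0.0, 1.0]   # right cell is 100% dog

def global_avg_pool(values):
    # what a global average pool reduces the grid to: one number per channel
    return sum(values) / len(values)

print(global_avg_pool(positions))   # 0.0 -> both objects "located" in the middle
print(global_avg_pool(cat_scores))  # 0.5 -> 50% cat
print(global_avg_pool(dog_scores))  # 0.5 -> 50% dog
```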

I don't completely understand what DETR is doing, but I also haven't been motivated to find out due to their reported mAPs, which are not (yet) competitive with what we have here.

@pfeatherstone
Author

pfeatherstone commented Oct 5, 2020

Each grid cell in YOLO is predicting x, y offsets and widths and heights (well, log(width/anchor) and log(height/anchor)). So if you expand what each grid cell is doing to the whole image, it's not obvious to me why direct bbox prediction isn't possible with the right loss function. I can see your point that the average pooling destroys spatial information in terms of the grid cells, but you can't think of the grid cells as grid cells anymore. The receptive field after 5 downsampling layers might well extend beyond a 2^5-wide/high cell, so spatial information extends beyond the grid cell.
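A quick back-of-envelope check of the receptive field claim, assuming a simplified backbone of stacked 3x3 stride-2 convolutions (an assumption; real backbones interleave stride-1 layers, which only widens the field further):

```python
def receptive_field(layers, k=3, s=2):
    # standard receptive-field recurrence: each layer adds (k-1) * jump pixels,
    # and the stride multiplies the spacing ("jump") between output positions
    rf, jump = 1, 1
    for _ in range(layers):
        rf += (k - 1) * jump
        jump *= s
    return rf

print(receptive_field(5))  # 63 pixels, i.e. wider than the 2**5 = 32-pixel cell
```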

@pfeatherstone
Author

The main motivation is that training YOLO requires too many hyperparameters. The big one for me is the IoU threshold for the no-object loss; it's not obvious to me what that should be. Ideally it would also be a trainable parameter, but I doubt the network would be stable during training. Also, a grid cell might well have 2 or more objects that require the same anchor, in which case all but one are ignored. That's bad. So it's interesting to see whether a direct set prediction is possible using standard CNN layers. We know it's possible when using transformers, as shown by DETR, so why not CNNs? My main objection to DETR is that it takes 5 days to train on 8 GPUs. That's impractical.
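The anchor-collision point can be made concrete. In this toy sketch (hypothetical `grid_assign` helper, one anchor per cell assumed), two boxes that fall into the same 13x13 cell fight for the same slot and one is silently dropped:

```python
def grid_assign(targets, grid=13):
    # one (cell -> target) slot per cell, single anchor assumed:
    # a second target landing in the same cell overwrites the first
    slots = {}
    for (x, y, w, h) in targets:  # normalized xywh boxes
        cell = (int(x * grid), int(y * grid))
        slots[cell] = (x, y, w, h)
    return list(slots.values())

targets = [(0.50, 0.50, 0.20, 0.30), (0.52, 0.51, 0.10, 0.10)]  # same cell
print(len(grid_assign(targets)))  # 1 -> one of the two objects is ignored
```

A bipartite matcher has no such slot structure, so both targets would receive their own matched prediction.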

@pfeatherstone
Author

Also, we are struggling to use YOLO networks accurately with our datasets. It feels like most repos that train YOLO networks are super optimised for COCO, and apart from recalculating anchors, it's not obvious how to fine-tune all the other hyperparameters for custom datasets. Hence I thought it would be interesting to discuss what other object detectors do, what's good, and what's not.

@glenn-jocher
Member

glenn-jocher commented Oct 5, 2020

@pfeatherstone ah, well if you have a serious use case you'll definitely want to evolve hyperparameters on your custom dataset, see https://docs.ultralytics.com/yolov5/tutorials/hyperparameter_evolution. This usually produces up to a +10 AP increase on custom datasets (i.e. on VOC I saw AP increase from 80 with default hyps to 90+ with evolved hyps), and evolves everything from anchor count and the anchor threshold you mentioned to the augmentation policy. 300 generations will land you in a good local minimum, and you can generally use yolov5m for the evolution; the results can be applied to all 4 size models s/m/l/x.
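The evolution loop itself is conceptually simple. This is not the repo's actual implementation, just a toy hill-climbing sketch of the idea (mutate the best hyperparameters, keep the mutation if fitness improves):

```python
import random

def evolve(fitness, hyps, generations=300, sigma=0.2, seed=0):
    # greedy (1+1)-style evolution: multiplicative Gaussian mutation of the
    # incumbent hyperparameters, accepted only when fitness improves
    rng = random.Random(seed)
    best, best_fit = dict(hyps), fitness(hyps)
    for _ in range(generations):
        cand = {k: max(v * (1 + rng.gauss(0, sigma)), 1e-6) for k, v in best.items()}
        f = fitness(cand)
        if f > best_fit:
            best, best_fit = cand, f
    return best
```

With a toy fitness like `lambda h: -(h["lr"] - 0.01) ** 2`, starting from `{"lr": 0.1}` the loop walks the learning rate toward 0.01; in practice the fitness would be a weighted combination of mAP and other training metrics.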

In terms of experimenting with different heads and loss functions, that's a pretty high risk proposition, meaning that you might invest serious time and effort down that road with possibly no return for your effort.

It's possible you might find a way to reframe the problem or update the loss function to accommodate a max/average pool operation that does away with the grid, but doing so while also increasing AP would be quite a remarkable feat.

@pfeatherstone
Author

Oooh. Thank you for the link to hyperparameter evolution. Will definitely give that a try.

@github-actions
Contributor

github-actions bot commented Nov 5, 2020

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.
