
# Multi-Modal Open-Vocabulary Object Detection Model Zoo

## Introduction

This file documents the collection of models reported in our paper. All models are trained with four 32GB V100 GPUs.

## How to Read the Tables

The "Name" column contains a link to the config file. To train a model, run

```
python train_net_auto.py --num-gpus 4 --config-file /path/to/config/name.yaml
```
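For example, a training launch for the base model with GPT-3 descriptions might look like the following (the `configs/` directory is an assumption; use the config path linked in the table below):

```
python train_net_auto.py --num-gpus 4 --config-file configs/lvis-base_r50_4x_clip_gpt3_descriptions.yaml
```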

To evaluate a model with trained or pretrained weights, run

```
python train_net_auto.py --num-gpus 4 --config-file /path/to/config/name.yaml --eval-only MODEL.WEIGHTS /path/to/weight.pth
```
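For example, evaluating a downloaded checkpoint for the GPT-3 descriptions model might look like this (both paths are illustrative; substitute the actual locations of your config file and checkpoint):

```
python train_net_auto.py --num-gpus 4 --config-file configs/lvis-base_r50_4x_clip_gpt3_descriptions.yaml --eval-only MODEL.WEIGHTS output/lvis-base_r50_4x_clip_gpt3_descriptions/model_final.pth
```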

## Open-vocabulary LVIS

| Name | APr | mAP | Weights |
|------|-----|-----|---------|
| lvis-base_r50_4x_clip_gpt3_descriptions | 19.3 | 30.3 | model |
| lvis-base_r50_4x_clip_image_exemplars_avg | 14.8 | 28.8 | model |
| lvis-base_r50_4x_clip_image_exemplars_agg | 18.3 | 29.2 | model |
| lvis-base_r50_4x_clip_multi_modal_avg | 20.7 | 30.5 | model |
| lvis-base_r50_4x_clip_multi_modal_agg | 19.2 | 30.6 | model |
| lvis-base_in-l_r50_4x_4x_clip_gpt3_descriptions | 25.8 | 32.6 | model |
| lvis-base_in-l_r50_4x_4x_clip_image_exemplars_avg | 21.6 | 31.3 | model |
| lvis-base_in-l_r50_4x_4x_clip_image_exemplars_agg | 23.8 | 31.3 | model |
| lvis-base_in-l_r50_4x_4x_clip_multi_modal_avg | 26.5 | 32.8 | model |
| lvis-base_in-l_r50_4x_4x_clip_multi_modal_agg | 27.3 | 33.1 | model |

### Note

- The open-vocabulary LVIS setup trains on LVIS without rare class annotations and evaluates the rare classes as novel classes at test time.

- All models use CLIP embeddings as classifiers. This is why even the box-supervised models achieve non-zero mAP on novel classes.

- The models marked in-l additionally use the classes that overlap between ImageNet-21K and LVIS as image-labeled data.

- Models trained on in-l require the corresponding model trained without in-l (referenced via MODEL.WEIGHTS in their config files). Please train or download the model without in-l and place it under ${mm-ovod_ROOT}/output/.. before training the in-l model (check the config file); a sketch of this workflow follows below.
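A minimal sketch of that workflow for the GPT-3 descriptions variant (the directory and checkpoint names here are illustrative; the authoritative location is whatever MODEL.WEIGHTS in the in-l config specifies):

```
# First train or download lvis-base_r50_4x_clip_gpt3_descriptions (see above),
# then make sure its checkpoint sits where the in-l config expects it, e.g.
#   ${mm-ovod_ROOT}/output/lvis-base_r50_4x_clip_gpt3_descriptions/model_final.pth
# Then launch the in-l run, which initializes from that checkpoint:
python train_net_auto.py --num-gpus 4 --config-file configs/lvis-base_in-l_r50_4x_4x_clip_gpt3_descriptions.yaml
```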