
# Multi-Modal Open-Vocabulary Object Detection Model Zoo

## Introduction

This file documents the collection of models reported in our paper. All models are trained with four 32GB V100 GPUs.

## How to Read the Tables

The "Name" column contains a link to the config file. To train a model, run

```
python train_net_auto.py --num-gpus 4 --config-file /path/to/config/name.yaml
```
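For example, a training launch for the base model with GPT-3 descriptions might look like the following (the `configs/` directory is an assumption; use the config path linked in the table below):

```
python train_net_auto.py --num-gpus 4 --config-file configs/lvis-base_r50_4x_clip_gpt3_descriptions.yaml
```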

To evaluate a model with trained or pretrained weights, run

```
python train_net_auto.py --num-gpus 4 --config-file /path/to/config/name.yaml --eval-only MODEL.WEIGHTS /path/to/weight.pth
```
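For example, evaluating a downloaded checkpoint for the GPT-3 descriptions model might look like this (both paths are illustrative; substitute the actual locations of your config file and checkpoint):

```
python train_net_auto.py --num-gpus 4 --config-file configs/lvis-base_r50_4x_clip_gpt3_descriptions.yaml --eval-only MODEL.WEIGHTS output/lvis-base_r50_4x_clip_gpt3_descriptions/model_final.pth
```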

## Open-vocabulary LVIS

| Name | APr | mAP | Weights |
|------|-----|-----|---------|
| lvis-base_r50_4x_clip_gpt3_descriptions | 19.3 | 30.3 | model |
| lvis-base_r50_4x_clip_image_exemplars_avg | 14.8 | 28.8 | model |
| lvis-base_r50_4x_clip_image_exemplars_agg | 18.3 | 29.2 | model |
| lvis-base_r50_4x_clip_multi_modal_avg | 20.7 | 30.5 | model |
| lvis-base_r50_4x_clip_multi_modal_agg | 19.2 | 30.6 | model |
| lvis-base_in-l_r50_4x_4x_clip_gpt3_descriptions | 25.8 | 32.6 | model |
| lvis-base_in-l_r50_4x_4x_clip_image_exemplars_avg | 21.6 | 31.3 | model |
| lvis-base_in-l_r50_4x_4x_clip_image_exemplars_agg | 23.8 | 31.3 | model |
| lvis-base_in-l_r50_4x_4x_clip_multi_modal_avg | 26.5 | 32.8 | model |
| lvis-base_in-l_r50_4x_4x_clip_multi_modal_agg | 27.3 | 33.1 | model |

### Note

- The open-vocabulary LVIS setup trains on LVIS without rare class annotations and evaluates the rare classes as novel classes at test time.

- All models use CLIP embeddings as classifiers. This is why even the box-supervised models achieve non-zero mAP on novel classes.

- The models marked in-l additionally use the classes that overlap between ImageNet-21K and LVIS as image-labeled data.

- Models trained on in-l require the corresponding model trained without in-l (referenced via MODEL.WEIGHTS in their config files). Please train or download the model without in-l and place it under ${mm-ovod_ROOT}/output/.. before training the in-l model (check the config file); a sketch of this workflow follows below.
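A minimal sketch of that workflow for the GPT-3 descriptions variant (the directory and checkpoint names here are illustrative; the authoritative location is whatever MODEL.WEIGHTS in the in-l config specifies):

```
# First train or download lvis-base_r50_4x_clip_gpt3_descriptions (see above),
# then make sure its checkpoint sits where the in-l config expects it, e.g.
#   ${mm-ovod_ROOT}/output/lvis-base_r50_4x_clip_gpt3_descriptions/model_final.pth
# Then launch the in-l run, which initializes from that checkpoint:
python train_net_auto.py --num-gpus 4 --config-file configs/lvis-base_in-l_r50_4x_4x_clip_gpt3_descriptions.yaml
```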