Mask2Former

Masked-attention Mask Transformer for Universal Image Segmentation

Abstract

Image segmentation is about grouping pixels with different semantics, e.g., category or instance membership, where each choice of semantics defines a task. While only the semantics of each task differ, current research focuses on designing specialized architectures for each task. We present Masked-attention Mask Transformer (Mask2Former), a new architecture capable of addressing any image segmentation task (panoptic, instance or semantic). Its key components include masked attention, which extracts localized features by constraining cross-attention within predicted mask regions. In addition to reducing the research effort by at least three times, it outperforms the best specialized architectures by a significant margin on four popular datasets. Most notably, Mask2Former sets a new state-of-the-art for panoptic segmentation (57.8 PQ on COCO), instance segmentation (50.1 AP on COCO) and semantic segmentation (57.7 mIoU on ADE20K).
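
The masked-attention idea in the abstract can be illustrated with a small sketch. The pure-Python function below is not the Mask2Former implementation. It is a minimal single-head example, under stated assumptions, of cross-attention in which each query may only attend to positions inside its own mask. The function name `masked_attention`, the list-based inputs, and the fallback for an empty mask are all assumptions made for this illustration.

```python
import math


def masked_attention(queries, keys, values, mask):
    """Cross-attention where query i only attends to positions j with mask[i][j] True.

    queries: list of N vectors of length d
    keys, values: lists of P vectors, one per feature location
    mask: N x P booleans, e.g. a thresholded mask prediction per query
    """
    d = len(queries[0])
    outputs = []
    for q, allowed in zip(queries, mask):
        # Assumption for this sketch: a query with an empty mask falls back
        # to attending everywhere, so the softmax is always well defined.
        if not any(allowed):
            allowed = [True] * len(keys)
        # Scaled dot-product logits; disallowed positions get -inf.
        logits = [
            sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d) if ok else -math.inf
            for k, ok in zip(keys, allowed)
        ]
        # Numerically stable softmax; exp(-inf) evaluates to 0.0.
        peak = max(logits)
        weights = [math.exp(logit - peak) for logit in logits]
        total = sum(weights)
        weights = [w / total for w in weights]
        # Weighted sum of the value vectors.
        outputs.append([
            sum(w * v[c] for w, v in zip(weights, values))
            for c in range(len(values[0]))
        ])
    return outputs
```

Because masked-out positions receive zero weight, their values cannot influence the output of that query, which is what restricts each query to localized features.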

Introduction

Mask2Former requires the COCO and COCO-panoptic datasets for training and evaluation. Download and extract them into the COCO dataset path. The directory structure should look like this:

mmdetection
├── mmdet
├── tools
├── configs
├── data
│   ├── coco
│   │   ├── annotations
│   │   │   ├── instances_train2017.json
│   │   │   ├── instances_val2017.json
│   │   │   ├── panoptic_train2017.json
│   │   │   ├── panoptic_train2017
│   │   │   ├── panoptic_val2017.json
│   │   │   ├── panoptic_val2017
│   │   ├── train2017
│   │   ├── val2017
│   │   ├── test2017
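
Before starting a training run, it can help to confirm that the layout above is in place. The following is a minimal sketch of such a check, not part of MMDetection. The helper name `find_missing_entries` and the `EXPECTED_ENTRIES` list are hypothetical, and the script assumes it is run from the `mmdetection` directory with the data under `data/coco`.

```python
from pathlib import Path

# Paths relative to the COCO root, copied from the directory tree above.
EXPECTED_ENTRIES = [
    "annotations/instances_train2017.json",
    "annotations/instances_val2017.json",
    "annotations/panoptic_train2017.json",
    "annotations/panoptic_train2017",
    "annotations/panoptic_val2017.json",
    "annotations/panoptic_val2017",
    "train2017",
    "val2017",
    "test2017",
]


def find_missing_entries(coco_root):
    """Return the expected files or folders that are absent under coco_root."""
    root = Path(coco_root)
    return [entry for entry in EXPECTED_ENTRIES if not (root / entry).exists()]


if __name__ == "__main__":
    # Assumption: the current working directory is the mmdetection directory.
    missing = find_missing_entries("data/coco")
    if missing:
        print("Missing under data/coco:")
        for entry in missing:
            print("  " + entry)
    else:
        print("COCO layout looks complete.")
```

The check only tests that each path exists. It does not validate the contents of the annotation files or count the images.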

Results and Models

| Backbone | style | Pretrain | Lr schd | Mem (GB) | Inf time (fps) | PQ | box mAP | mask mAP | Config | Download |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| R-50 | pytorch | ImageNet-1K | 50e | 13.9 | - | 51.9 | 44.8 | 41.9 | config | model / log |
| R-101 | pytorch | ImageNet-1K | 50e | 16.1 | - | 52.4 | 45.3 | 42.4 | config | model / log |
| Swin-T | - | ImageNet-1K | 50e | 15.9 | - | 53.4 | 46.3 | 43.4 | config | model / log |
| Swin-S | - | ImageNet-1K | 50e | 19.1 | - | 54.5 | 47.8 | 44.5 | config | model / log |
| Swin-B | - | ImageNet-1K | 50e | 26.0 | - | 55.1 | 48.2 | 44.9 | config | model / log |
| Swin-B | - | ImageNet-21K | 50e | 25.8 | - | 56.3 | 50.0 | 46.3 | config | model / log |
| Swin-L | - | ImageNet-21K | 100e | 21.1 | - | 57.6 | 52.2 | 48.5 | config | model / log |

Citation

@article{cheng2021mask2former,
  title={Masked-attention Mask Transformer for Universal Image Segmentation},
  author={Bowen Cheng and Ishan Misra and Alexander G. Schwing and Alexander Kirillov and Rohit Girdhar},
  journal={arXiv},
  year={2021}
}