Merge remote-tracking branch 'origin/dev-1.x' into 1.x
mzr1996 committed Dec 6, 2022
2 parents a4ec279 + 12eca5b commit 458ac4c
Showing 45 changed files with 1,975 additions and 294 deletions.
11 changes: 6 additions & 5 deletions README.md
@@ -58,18 +58,18 @@ The `1.x` branch works with **PyTorch 1.6+**.

## What's new

v1.0.0rc4 was released on 06/12/2022.

- Upgrade the API for getting pre-defined models of MMClassification. See [#1236](https://github.com/open-mmlab/mmclassification/pull/1236) for more details, and the sketch after this list.
- Refactor the BEiT backbone and support v1/v2 inference. See [#1144](https://github.com/open-mmlab/mmclassification/pull/1144).
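A rough usage sketch of the upgraded API, under the assumption that it is exposed as `mmcls.list_models` and `mmcls.get_model`; the names and arguments here are assumptions, and [#1236](https://github.com/open-mmlab/mmclassification/pull/1236) is the authoritative reference.

```python
# Hypothetical sketch of the pre-defined model API; names/arguments are assumptions.
from mmcls import get_model, list_models

print(list_models('*beit*'))  # list pre-defined model names matching a pattern
model = get_model('beit-base-p16_8xb64_in1k', pretrained=True)  # build and load weights
model.eval()
```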

v1.0.0rc3 was released on 21/11/2022.

- Add the **Switch Recipe** hook. Now we can modify the training pipeline, mixup, and loss settings during training; see [#1101](https://github.com/open-mmlab/mmclassification/pull/1101).
- Add **TIMM and HuggingFace** wrappers. Now you can train and use models from TIMM/HuggingFace directly; see [#1102](https://github.com/open-mmlab/mmclassification/pull/1102) and the sketch after this list.
- Support **retrieval tasks**; see [#1055](https://github.com/open-mmlab/mmclassification/pull/1055).
- Reproduce **MobileOne** training accuracy; see [#1191](https://github.com/open-mmlab/mmclassification/pull/1191).
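As a rough illustration of the TIMM wrapper mentioned above, the config fragment below assumes the wrapper is registered as `TimmClassifier` and accepts a `model_name` key; treat the type name and keys as assumptions, and see [#1102](https://github.com/open-mmlab/mmclassification/pull/1102) for the actual interface.

```python
# Hypothetical config sketch for training a timm model through the wrapper;
# the registered type name and accepted keys are assumptions, not verified here.
model = dict(
    type='TimmClassifier',   # assumed registry name of the TIMM wrapper
    model_name='resnet50',   # any architecture name known to timm
    pretrained=True,         # load timm's pretrained weights
    loss=dict(type='CrossEntropyLoss', loss_weight=1.0),
)
```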

v1.0.0rc2 was released on 12/10/2022.

- Support the DeiT-3 backbone.
- Fix MMEngine version requirements.

This release introduced a brand-new, flexible training & test engine, which is still in progress. You are welcome to try it out following
[the documentation](https://mmclassification.readthedocs.io/en/1.x/).

@@ -152,6 +152,7 @@ Results and models are available in the [model zoo](https://mmclassification.rea
- [x] [MobileViT](https://github.com/open-mmlab/mmclassification/tree/1.x/configs/mobilevit)
- [x] [DaViT](https://github.com/open-mmlab/mmclassification/tree/1.x/configs/davit)
- [x] [RepLKNet](https://github.com/open-mmlab/mmclassification/tree/1.x/configs/replknet)
- [x] [BEiT](https://github.com/open-mmlab/mmclassification/tree/1.x/configs/beit) / [BEiT v2](https://github.com/open-mmlab/mmclassification/tree/1.x/configs/beitv2)

</details>

16 changes: 6 additions & 10 deletions README_zh-CN.md
@@ -57,23 +57,18 @@ MMClassification is an open-source image classification toolbox based on PyTorch, and is part of the [O

## Changelog

v1.0.0rc4 was released on 2022/12/06.

- Updated the main API to conveniently get pre-defined models in MMClassification. See [#1236](https://github.com/open-mmlab/mmclassification/pull/1236) for details.
- Refactored the BEiT backbone and supported inference of both v1 and v2 models.

v1.0.0rc3 was released on 2022/11/21.

- Added the **Switch Recipe Hook**. Now we can modify data augmentation, mixup settings, loss settings, etc. during training.
- Added **TIMM and HuggingFace** wrappers. Now we can directly train and use models from TIMM and HuggingFace.
- Supported retrieval tasks.
- Reproduced **MobileOne** training accuracy.

v1.0.0rc2 was released on 2022/10/12.

- Supported the DeiT-3 backbone.
- Fixed the MMEngine version requirements.

v1.0.0rc1 was released on 2022/9/30.

- Supported backbones such as MViT, EdgeNeXt, Swin Transformer V2, EfficientFormer, and MobileOne.
- Supported BEiT-style transformer layers.

v1.0.0rc0 was released on 2022/8/31.

This release introduces a brand-new, highly extensible training and test engine, which is still under development. You are welcome to try it out following the [documentation](https://mmclassification.readthedocs.io/zh_CN/1.x/).
@@ -158,6 +153,7 @@ mim install -e .
- [x] [MobileViT](https://github.com/open-mmlab/mmclassification/tree/1.x/configs/mobilevit)
- [x] [DaViT](https://github.com/open-mmlab/mmclassification/tree/1.x/configs/davit)
- [x] [RepLKNet](https://github.com/open-mmlab/mmclassification/tree/1.x/configs/replknet)
- [x] [BEiT](https://github.com/open-mmlab/mmclassification/tree/1.x/configs/beit) / [BEiT v2](https://github.com/open-mmlab/mmclassification/tree/1.x/configs/beitv2)

</details>

38 changes: 38 additions & 0 deletions configs/beit/README.md
@@ -0,0 +1,38 @@
# BEiT

> [BEiT: BERT Pre-Training of Image Transformers](https://arxiv.org/abs/2106.08254)
<!-- [ALGORITHM] -->

## Abstract

We introduce a self-supervised vision representation model BEiT, which stands for Bidirectional Encoder representation from Image Transformers. Following BERT developed in the natural language processing area, we propose a masked image modeling task to pretrain vision Transformers. Specifically, each image has two views in our pre-training, i.e., image patches (such as 16x16 pixels), and visual tokens (i.e., discrete tokens). We first "tokenize" the original image into visual tokens. Then we randomly mask some image patches and feed them into the backbone Transformer. The pre-training objective is to recover the original visual tokens based on the corrupted image patches. After pre-training BEiT, we directly fine-tune the model parameters on downstream tasks by appending task layers upon the pretrained encoder. Experimental results on image classification and semantic segmentation show that our model achieves competitive results with previous pre-training methods. For example, base-size BEiT achieves 83.2% top-1 accuracy on ImageNet-1K, significantly outperforming from-scratch DeiT training (81.8%) with the same setup. Moreover, large-size BEiT obtains 86.3% only using ImageNet-1K, even outperforming ViT-L with supervised pre-training on ImageNet-22K (85.2%).

<div align="center">
<img src="https://user-images.githubusercontent.com/36138628/203688351-adac7146-4e71-4ab6-8958-5cfe643a2dc5.png" width="70%"/>
</div>

## Results and models

### ImageNet-1k

| Model | Pretrain | Params(M) | Flops(G) | Top-1 (%) | Top-5 (%) | Config | Download |
| :---------: | :----------: | :-------: | :------: | :-------: | :-------: | :-------------------------------------: | :-----------------------------------------------------------------------------------------------------: |
| BEiT-base\* | ImageNet-21k | 86.53 | 17.58 | 85.28 | 97.59 | [config](./beit-base-p16_8xb64_in1k.py) | [model](https://download.openmmlab.com/mmclassification/v0/beit/beit-base_3rdparty_in1k_20221114-c0a4df23.pth) |

*Models with * are converted from the [official repo](https://github.com/microsoft/unilm/tree/master/beit). The config files of these models are only for inference.*

For the BEiT self-supervised learning algorithm, please refer to the [MMSelfSup page](https://github.com/open-mmlab/mmselfsup/tree/dev-1.x/configs/selfsup/beit) for more information.
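For quick inference with the converted checkpoint from the table above, here is a minimal sketch that mirrors the `init_model`/`inference_model` usage shown in other 1.x READMEs; the demo image path is only an example.

```python
# Minimal inference sketch using the converted BEiT checkpoint listed above.
from mmcls.apis import init_model, inference_model

checkpoint = 'https://download.openmmlab.com/mmclassification/v0/beit/beit-base_3rdparty_in1k_20221114-c0a4df23.pth'
model = init_model('configs/beit/beit-base-p16_8xb64_in1k.py', checkpoint)
result = inference_model(model, 'demo/demo.JPEG')  # any test image works here
print(result['pred_class'], result['pred_score'])
```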

## Citation

```bibtex
@article{beit,
title={{BEiT}: {BERT} Pre-Training of Image Transformers},
author={Hangbo Bao and Li Dong and Furu Wei},
year={2021},
eprint={2106.08254},
archivePrefix={arXiv},
primaryClass={cs.CV}
}
```
44 changes: 44 additions & 0 deletions configs/beit/beit-base-p16_8xb64_in1k.py
@@ -0,0 +1,44 @@
_base_ = [
    '../_base_/datasets/imagenet_bs64_swin_224.py',
    '../_base_/schedules/imagenet_bs1024_adamw_swin.py',
    '../_base_/default_runtime.py'
]

data_preprocessor = dict(
    num_classes=1000,
    # RGB format normalization parameters
    mean=[127.5, 127.5, 127.5],
    std=[127.5, 127.5, 127.5],
    # convert image from BGR to RGB
    to_rgb=True,
)

model = dict(
    type='ImageClassifier',
    backbone=dict(
        type='BEiT',
        arch='base',
        img_size=224,
        patch_size=16,
        # classify from the average of patch tokens rather than the cls token
        avg_token=True,
        output_cls_token=False,
        # BEiT uses per-layer relative position bias instead of absolute
        # position embeddings, and does not share the bias across layers
        use_abs_pos_emb=False,
        use_rel_pos_bias=True,
        use_shared_rel_pos_bias=False,
    ),
    neck=None,
    head=dict(
        type='LinearClsHead',
        num_classes=1000,
        in_channels=768,
        loss=dict(
            type='LabelSmoothLoss', label_smooth_val=0.1, mode='original'),
    ),
    init_cfg=[
        dict(type='TruncNormal', layer='Linear', std=.02),
        dict(type='Constant', layer='LayerNorm', val=1., bias=0.),
    ],
    train_cfg=dict(augments=[
        dict(type='Mixup', alpha=0.8),
        dict(type='CutMix', alpha=1.0)
    ]))
41 changes: 41 additions & 0 deletions configs/beit/metafile.yml
@@ -0,0 +1,41 @@
Collections:
  - Name: BEiT
    Metadata:
      Architecture:
        - Attention Dropout
        - Convolution
        - Dense Connections
        - Dropout
        - GELU
        - Layer Normalization
        - Multi-Head Attention
        - Scaled Dot-Product Attention
        - Tanh Activation
    Paper:
      URL: https://arxiv.org/abs/2106.08254
      Title: 'BEiT: BERT Pre-Training of Image Transformers'
    README: configs/beit/README.md
    Code:
      URL: https://github.com/open-mmlab/mmclassification/blob/dev-1.x/mmcls/models/backbones/beit.py
      Version: v1.0.0rc4

Models:
  - Name: beit-base_3rdparty_in1k
    In Collection: BEiT
    Metadata:
      FLOPs: 17581219584
      Parameters: 86530984
      Training Data:
        - ImageNet-21k
        - ImageNet-1k
    Results:
      - Dataset: ImageNet-1k
        Task: Image Classification
        Metrics:
          Top 1 Accuracy: 85.28
          Top 5 Accuracy: 97.59
    Weights: https://download.openmmlab.com/mmclassification/v0/beit/beit-base_3rdparty_in1k_20221114-c0a4df23.pth
    Converted From:
      Weights: https://conversationhub.blob.core.windows.net/beit-share-public/beit/beit_base_patch16_224_pt22k_ft22kto1k.pth
      Code: https://github.com/microsoft/unilm/tree/master/beit
    Config: configs/beit/beit-base-p16_8xb64_in1k.py
38 changes: 38 additions & 0 deletions configs/beitv2/README.md
@@ -0,0 +1,38 @@
# BEiT V2

> [BEiT v2: Masked Image Modeling with Vector-Quantized Visual Tokenizers](https://arxiv.org/abs/2208.06366)
<!-- [ALGORITHM] -->

## Abstract

Masked image modeling (MIM) has demonstrated impressive results in self-supervised representation learning by recovering corrupted image patches. However, most existing studies operate on low-level image pixels, which hinders the exploitation of high-level semantics for representation models. In this work, we propose to use a semantic-rich visual tokenizer as the reconstruction target for masked prediction, providing a systematic way to promote MIM from pixel-level to semantic-level. Specifically, we propose vector-quantized knowledge distillation to train the tokenizer, which discretizes a continuous semantic space to compact codes. We then pretrain vision Transformers by predicting the original visual tokens for the masked image patches. Furthermore, we introduce a patch aggregation strategy which associates discrete image patches to enhance global semantic representation. Experiments on image classification and semantic segmentation show that BEiT v2 outperforms all compared MIM methods. On ImageNet-1K (224 size), the base-size BEiT v2 achieves 85.5% top-1 accuracy for fine-tuning and 80.1% top-1 accuracy for linear probing. The large-size BEiT v2 obtains 87.3% top-1 accuracy for ImageNet-1K (224 size) fine-tuning, and 56.7% mIoU on ADE20K for semantic segmentation.

<div align="center">
<img src="https://user-images.githubusercontent.com/36138628/203912182-5967a520-d455-49ea-bc67-dcbd500d76bf.png" width="70%"/>
</div>

## Results and models

### ImageNet-1k

| Model | Pretrain | Params(M) | Flops(G) | Top-1 (%) | Top-5 (%) | Config | Download |
| :-----------: | :------------------------: | :-------: | :------: | :-------: | :-------: | :---------------------------------------: | :-----------------------------------------------------------------------------------: |
| BEiTv2-base\* | ImageNet-1k & ImageNet-21k | 86.53 | 17.58 | 86.47 | 97.99 | [config](./beitv2-base-p16_8xb64_in1k.py) | [model](https://download.openmmlab.com/mmclassification/v0/beit/beitv2-base_3rdparty_in1k_20221114-73e11905.pth) |

*Models with * are converted from the [official repo](https://github.com/microsoft/unilm/tree/master/beit2). The config files of these models are only for inference.*

For the BEiT v2 self-supervised learning algorithm, please refer to the [MMSelfSup page](https://github.com/open-mmlab/mmselfsup/tree/dev-1.x/configs/selfsup/beitv2) for more information.
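Similarly, the converted BEiT v2 checkpoint from the table above can be loaded for inference; a minimal sketch mirroring the classification-score usage shown elsewhere in this commit (the random input is only a shape check):

```python
# Minimal sketch: load the converted BEiT v2 checkpoint and get classification scores.
import torch
from mmcls.apis import init_model

checkpoint = 'https://download.openmmlab.com/mmclassification/v0/beit/beitv2-base_3rdparty_in1k_20221114-73e11905.pth'
model = init_model('configs/beitv2/beitv2-base-p16_8xb64_in1k.py', checkpoint)
inputs = torch.rand(1, 3, 224, 224).to(model.data_preprocessor.device)
scores = model(inputs)
print(scores.shape)  # expected: torch.Size([1, 1000])
```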

## Citation

```bibtex
@article{beitv2,
title={{BEiT v2}: Masked Image Modeling with Vector-Quantized Visual Tokenizers},
author={Zhiliang Peng and Li Dong and Hangbo Bao and Qixiang Ye and Furu Wei},
year={2022},
eprint={2208.06366},
archivePrefix={arXiv},
primaryClass={cs.CV}
}
```
35 changes: 35 additions & 0 deletions configs/beitv2/beitv2-base-p16_8xb64_in1k.py
@@ -0,0 +1,35 @@
_base_ = [
    '../_base_/datasets/imagenet_bs64_swin_224.py',
    '../_base_/schedules/imagenet_bs1024_adamw_swin.py',
    '../_base_/default_runtime.py'
]

model = dict(
    type='ImageClassifier',
    backbone=dict(
        type='BEiT',
        arch='base',
        img_size=224,
        patch_size=16,
        avg_token=True,
        output_cls_token=False,
        use_abs_pos_emb=False,
        use_rel_pos_bias=True,
        use_shared_rel_pos_bias=False,
    ),
    neck=None,
    head=dict(
        type='LinearClsHead',
        num_classes=1000,
        in_channels=768,
        loss=dict(
            type='LabelSmoothLoss', label_smooth_val=0.1, mode='original'),
    ),
    init_cfg=[
        dict(type='TruncNormal', layer='Linear', std=.02),
        dict(type='Constant', layer='LayerNorm', val=1., bias=0.),
    ],
    train_cfg=dict(augments=[
        dict(type='Mixup', alpha=0.8),
        dict(type='CutMix', alpha=1.0)
    ]))
41 changes: 41 additions & 0 deletions configs/beitv2/metafile.yml
@@ -0,0 +1,41 @@
Collections:
  - Name: BEiTv2
    Metadata:
      Architecture:
        - Attention Dropout
        - Convolution
        - Dense Connections
        - Dropout
        - GELU
        - Layer Normalization
        - Multi-Head Attention
        - Scaled Dot-Product Attention
        - Tanh Activation
    Paper:
      URL: https://arxiv.org/abs/2208.06366
      Title: 'BEiT v2: Masked Image Modeling with Vector-Quantized Visual Tokenizers'
    README: configs/beitv2/README.md
    Code:
      URL: https://github.com/open-mmlab/mmclassification/blob/dev-1.x/mmcls/models/backbones/beit.py
      Version: v1.0.0rc4

Models:
  - Name: beitv2-base_3rdparty_in1k
    In Collection: BEiTv2
    Metadata:
      FLOPs: 17581219584
      Parameters: 86530984
      Training Data:
        - ImageNet-21k
        - ImageNet-1k
    Results:
      - Dataset: ImageNet-1k
        Task: Image Classification
        Metrics:
          Top 1 Accuracy: 86.47
          Top 5 Accuracy: 97.99
    Weights: https://download.openmmlab.com/mmclassification/v0/beit/beitv2-base_3rdparty_in1k_20221114-73e11905.pth
    Converted From:
      Weights: https://conversationhub.blob.core.windows.net/beit-share-public/beitv2/beitv2_base_patch16_224_pt1k_ft21kto1k.pth
      Code: https://github.com/microsoft/unilm/tree/master/beit2
    Config: configs/beitv2/beitv2-base-p16_8xb64_in1k.py
76 changes: 72 additions & 4 deletions configs/mobilenet_v2/README.md
@@ -4,15 +4,83 @@
<!-- [ALGORITHM] -->

## Introduction

**MobileNet V2** was initially described in [the paper](https://arxiv.org/pdf/1801.04381.pdf), and it improves the state-of-the-art performance of mobile models on multiple tasks. MobileNetV2 is an improvement over V1. Its new ideas include the linear bottleneck and inverted residuals: it is based on an inverted residual structure where the input and output of the residual block are thin bottleneck layers. The intermediate expansion layer uses lightweight depthwise convolutions to filter features as a source of non-linearity. The authors of MobileNet V2 measured its performance on ImageNet classification, COCO object detection, and VOC image segmentation.

<div align=center>
<img src="https://user-images.githubusercontent.com/26739999/142563365-7a9ea577-8f79-4c21-a750-ebcaad9bcc2f.png" width="60%"/>
</div>
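To make the linear-bottleneck and inverted-residual ideas above concrete, here is a minimal PyTorch sketch of one inverted residual block; it is an illustration of the idea, not MMClassification's `MobileNetV2` implementation.

```python
import torch
import torch.nn as nn


class InvertedResidual(nn.Module):
    """Sketch of a MobileNetV2 inverted residual block:
    1x1 expand -> 3x3 depthwise -> 1x1 linear projection (no activation)."""

    def __init__(self, in_channels, out_channels, stride=1, expand_ratio=6):
        super().__init__()
        hidden = in_channels * expand_ratio
        # the skip connection is only used when the shapes match
        self.use_residual = stride == 1 and in_channels == out_channels
        self.block = nn.Sequential(
            # 1x1 pointwise expansion
            nn.Conv2d(in_channels, hidden, 1, bias=False),
            nn.BatchNorm2d(hidden),
            nn.ReLU6(inplace=True),
            # 3x3 depthwise convolution
            nn.Conv2d(hidden, hidden, 3, stride=stride, padding=1, groups=hidden, bias=False),
            nn.BatchNorm2d(hidden),
            nn.ReLU6(inplace=True),
            # 1x1 linear bottleneck projection (no non-linearity here)
            nn.Conv2d(hidden, out_channels, 1, bias=False),
            nn.BatchNorm2d(out_channels),
        )

    def forward(self, x):
        out = self.block(x)
        return x + out if self.use_residual else out


# quick shape check
x = torch.rand(1, 32, 56, 56)
print(InvertedResidual(32, 32)(x).shape)  # torch.Size([1, 32, 56, 56])
```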

## Abstract

<details>

<summary>Show the paper's abstract</summary>

<br>
In this paper we describe a new mobile architecture, MobileNetV2, that improves the state of the art performance of mobile models on multiple tasks and benchmarks as well as across a spectrum of different model sizes. We also describe efficient ways of applying these mobile models to object detection in a novel framework we call SSDLite. Additionally, we demonstrate how to build mobile semantic segmentation models through a reduced form of DeepLabv3 which we call Mobile DeepLabv3.

The MobileNetV2 architecture is based on an inverted residual structure where the input and output of the residual block are thin bottleneck layers opposite to traditional residual models which use expanded representations in the input an MobileNetV2 uses lightweight depthwise convolutions to filter features in the intermediate expansion layer. Additionally, we find that it is important to remove non-linearities in the narrow layers in order to maintain representational power. We demonstrate that this improves performance and provide an intuition that led to this design. Finally, our approach allows decoupling of the input/output domains from the expressiveness of the transformation, which provides a convenient framework for further analysis. We measure our performance on Imagenet classification, COCO object detection, VOC image segmentation. We evaluate the trade-offs between accuracy, and number of operations measured by multiply-adds (MAdd), as well as the number of parameters.
</br>

<div align=center>
<img src="https://user-images.githubusercontent.com/26739999/142563365-7a9ea577-8f79-4c21-a750-ebcaad9bcc2f.png" width="40%"/>
</div>
</details>

## How to use it?

<!-- [TABS-BEGIN] -->

**Predict image**

```python
>>> import torch
>>> from mmcls.apis import init_model, inference_model
>>>
>>> model = init_model('configs/mobilenet_v2/mobilenet-v2_8xb32_in1k.py', 'https://download.openmmlab.com/mmclassification/v0/mobilenet_v2/mobilenet_v2_batch256_imagenet_20200708-3b2dc3af.pth')
>>> predict = inference_model(model, 'demo/demo.JPEG')
>>> print(predict['pred_class'])
leatherback turtle, leatherback, leathery turtle, Dermochelys coriacea
>>> print(predict['pred_score'])
0.6252720952033997
```

**Use the model**

```python
>>> import torch
>>> from mmcls.apis import init_model
>>>
>>> model = init_model('configs/mobilenet_v2/mobilenet-v2_8xb32_in1k.py', 'https://download.openmmlab.com/mmclassification/v0/mobilenet_v2/mobilenet_v2_batch256_imagenet_20200708-3b2dc3af.pth')
>>> inputs = torch.rand(1, 3, 224, 224).to(model.data_preprocessor.device)
>>> # To get classification scores.
>>> out = model(inputs)
>>> print(out.shape)
torch.Size([1, 1000])
>>> # To extract features.
>>> outs = model.extract_feat(inputs)
>>> print(outs[0].shape)
torch.Size([1, 1280])
```

**Train/Test Command**

Place the ImageNet dataset in the `data/imagenet/` directory, or prepare datasets according to the [docs](https://mmclassification.readthedocs.io/en/1.x/user_guides/dataset_prepare.html#prepare-dataset).

Train:

```shell
python tools/train.py configs/mobilenet_v2/mobilenet-v2_8xb32_in1k.py
```

Test:

```shell
python tools/test.py configs/mobilenet_v2/mobilenet-v2_8xb32_in1k.py https://download.openmmlab.com/mmclassification/v0/mobilenet_v2/mobilenet_v2_batch256_imagenet_20200708-3b2dc3af.pth
```

<!-- [TABS-END] -->

For more configurable parameters, please refer to the [API](https://mmclassification.readthedocs.io/en/1.x/api/generated/mmcls.models.backbones.MobileNetV2.html#mmcls.models.backbones.MobileNetV2).

## Results and models

