zhangxiaosong18/hivit

HiViT (ICLR2023, notable-top-25%)

This is the official implementation of the paper "HiViT: A Simpler and More Efficient Design of Hierarchical Vision Transformer".

Results

| Model | Pretraining data | ImageNet-1K | COCO Det | ADE Seg |
|---|---|---|---|---|
| MAE-base | ImageNet-1K | 83.6 | 51.2 | 48.1 |
| SimMIM-base | ImageNet-1K | 84.0 | 52.3 | 52.8 |
| HiViT-base | ImageNet-1K | 84.6 | 53.3 | 52.8 |

Pre-trained Models

mae_hivit_base_1600ep.pth

mae_hivit_base_1600ep_ft100ep.pth
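
Before plugging one of these checkpoints into a model, it can help to inspect what the file contains. The following is a minimal sketch, not part of this repository: the helper names (`extract_state_dict`, `summarize`, `load_checkpoint`) are hypothetical, and it assumes the weights may be nested under a wrapper key such as `"model"` or `"state_dict"`, which is common for PyTorch checkpoints but not confirmed here for these files.

```python
def extract_state_dict(checkpoint, candidate_keys=("model", "state_dict")):
    """Return the weight mapping from a loaded checkpoint object.

    Assumption: weights may sit under a wrapper key such as "model".
    If no candidate key holds a dict, the checkpoint itself is returned.
    """
    if isinstance(checkpoint, dict):
        for key in candidate_keys:
            if isinstance(checkpoint.get(key), dict):
                return checkpoint[key]
    return checkpoint


def summarize(state_dict, limit=5):
    """Return (name, shape) pairs for the first `limit` entries."""
    items = list(state_dict.items())[:limit]
    return [(name, tuple(getattr(value, "shape", ()))) for name, value in items]


def load_checkpoint(path):
    """Load a .pth file on CPU and unwrap its weights (requires PyTorch)."""
    # torch is imported lazily so the helpers above work without it.
    import torch

    checkpoint = torch.load(path, map_location="cpu")
    return extract_state_dict(checkpoint)
```

For example, `summarize(load_checkpoint("mae_hivit_base_1600ep.pth"))` would list the first few parameter names and shapes, which can then be compared against the keys the target model expects before loading.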

Usage

1. Supervised learning on ImageNet-1K: See supervised/get_started.md for a quick start.

2. Self-supervised learning on ImageNet-1K: See self_supervised/get_started.md.

3. Object detection: See detection/get_started.md.

4. Semantic segmentation: See segmentation/get_started.md.

BibTeX

Please consider citing our paper in your publications if the project helps your research.

@inproceedings{zhanghivit,
  title={HiViT: A Simpler and More Efficient Design of Hierarchical Vision Transformer},
  author={Zhang, Xiaosong and Tian, Yunjie and Xie, Lingxi and Huang, Wei and Dai, Qi and Ye, Qixiang and Tian, Qi},
  booktitle={International Conference on Learning Representations},
  year={2023},
}