# Main Results of VMamba Series

## Classification on ImageNet-1K

| name | pretrain | resolution | acc@1 | #params | FLOPs | TP. | Train TP. | configs/logs/ckpts |
| :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: |
| Swin-T | ImageNet-1K | 224x224 | 81.2 | 28M | 4.5G | 1244 | 987 | -- |
| Swin-S | ImageNet-1K | 224x224 | 83.2 | 50M | 8.7G | 718 | 642 | -- |
| Swin-B | ImageNet-1K | 224x224 | 83.5 | 88M | 15.4G | 458 | 496 | -- |
| Vanilla-VMamba-T | ImageNet-1K | 224x224 | 82.2 | 23M | 4.5G/5.6G | 638 | 195 | config/log/ckpt |
| Vanilla-VMamba-S | ImageNet-1K | 224x224 | 83.5 | 44M | 9.1G/11.2G | 359 | 111 | config/log/ckpt |
| Vanilla-VMamba-B | ImageNet-1K | 224x224 | 83.7 | 76M | 15.2G/18.0G | 268 | 84 | config/log/ckpt |
| VMamba-T[s2l5] | ImageNet-1K | 224x224 | 82.5 | 31M | 4.9G | 1340 | 464 | config/log/ckpt |
| VMamba-S[s2l15] | ImageNet-1K | 224x224 | 83.6 | 50M | 8.7G | 877 | 314 | config/log/ckpt |
| VMamba-B[s2l15] | ImageNet-1K | 224x224 | 83.9 | 89M | 15.4G | 646 | 247 | config/log/ckpt |
| VMamba-T[s1l8] | ImageNet-1K | 224x224 | 82.6 | 30M | 4.9G | 1686 | 571 | config/log/ckpt |
| VMamba-S[s1l20] | ImageNet-1K | 224x224 | 83.3 | 49M | 8.6G | 1106 | 390 | config/log/ckpt |
| VMamba-B[s1l20] | ImageNet-1K | 224x224 | 83.8 | 87M | 15.2G | 827 | 313 | config/log/ckpt |
- Models in this subsection are trained from scratch with random or manual initialization. The hyper-parameters are inherited from Swin, except for `drop_path_rate` and EMA (exponential moving average). All models are trained with EMA except for Vanilla-VMamba-T.
- TP. (inference throughput, images/s) and Train TP. (training throughput, images/s) are measured on an A100 GPU paired with an AMD EPYC 7542 CPU, with batch size 128. Train TP. is measured with mixed resolution and excludes the time consumed by the optimizer. A minimal benchmarking sketch follows this list.
- FLOPs and parameters are now gathered with the classification head included (previous versions excluded it, so the numbers rise slightly). Where a row lists two FLOPs values (e.g. 4.5G/5.6G), they are the previous and the updated count, respectively.
- FLOPs are calculated with the algorithm @albertgu provides, which yields larger numbers than the previous calculation (based on the `selective_scan_ref` function, which ignores the hardware-aware algorithm).
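The throughput numbers above come from the repository's benchmark scripts; the snippet below is only a minimal sketch of the measurement described in the notes (single GPU, fixed batch size 128), not the exact script used for the table. The `inference_throughput` helper and its defaults are illustrative, not part of the repo's API.

```python
import time

import torch

@torch.no_grad()
def inference_throughput(model, batch_size=128, resolution=224,
                         warmup=10, iters=30, device="cuda"):
    """Measure images/s at a fixed batch size, mirroring the table's setup."""
    model = model.to(device).eval()
    x = torch.randn(batch_size, 3, resolution, resolution, device=device)
    for _ in range(warmup):      # warm up CUDA kernels before timing
        model(x)
    torch.cuda.synchronize()
    start = time.time()
    for _ in range(iters):
        model(x)
    torch.cuda.synchronize()
    return iters * batch_size / (time.time() - start)
```

Train TP. additionally times the forward and backward passes but, per the notes above, excludes the optimizer step.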

## Object Detection on COCO

| Backbone | #params | FLOPs | Detector | bbox AP | bbox AP50 | bbox AP75 | segm AP | segm AP50 | segm AP75 | configs/logs/ckpts |
| :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: |
| Swin-T | 48M | 267G | MaskRCNN@1x | 42.7 | 65.2 | 46.8 | 39.3 | 62.2 | 42.2 | -- |
| Swin-S | 69M | 354G | MaskRCNN@1x | 44.8 | 66.6 | 48.9 | 40.9 | 63.4 | 44.2 | -- |
| Swin-B | 107M | 496G | MaskRCNN@1x | 46.9 | -- | -- | 42.3 | -- | -- | -- |
| Vanilla-VMamba-T | 42M | 262G/286G | MaskRCNN@1x | 46.5 | 68.5 | 50.7 | 42.1 | 65.5 | 45.3 | config/log/ckpt |
| Vanilla-VMamba-S | 64M | 357G/400G | MaskRCNN@1x | 48.2 | 69.7 | 52.5 | 43.0 | 66.6 | 46.4 | config/log/ckpt |
| Vanilla-VMamba-B | 96M | 482G/540G | MaskRCNN@1x | 48.6 | 70.0 | 53.1 | 43.3 | 67.1 | 46.7 | config/log/ckpt |
| VMamba-T[s2l5] | 50M | 270G | MaskRCNN@1x | 47.4 | 69.5 | 52.0 | 42.7 | 66.3 | 46.0 | config/log/ckpt |
| VMamba-S[s2l15] | 70M | 384G | MaskRCNN@1x | 48.7 | 70.0 | 53.4 | 43.7 | 67.3 | 47.0 | config/log/ckpt |
| VMamba-B[s2l15] | 108M | 485G | MaskRCNN@1x | 49.2 | 71.4 | 54.0 | 44.1 | 68.3 | 47.7 | config/log/ckpt |
| VMamba-B[s2l15] | 108M | 485G | MaskRCNN@1x[bs8] | 49.2 | 70.9 | 53.9 | 43.9 | 67.7 | 47.6 | config/log/ckpt |
| VMamba-T[s1l8] | 50M | 271G | MaskRCNN@1x | 47.3 | 69.3 | 52.0 | 42.7 | 66.4 | 45.9 | config/log/ckpt |

| Backbone | #params | FLOPs | Detector | bbox AP | bbox AP50 | bbox AP75 | segm AP | segm AP50 | segm AP75 | configs/logs/ckpts |
| :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: |
| Swin-T | 48M | 267G | MaskRCNN@3x | 46.0 | 68.1 | 50.3 | 41.6 | 65.1 | 44.9 | -- |
| Swin-S | 69M | 354G | MaskRCNN@3x | 48.2 | 69.8 | 52.8 | 43.2 | 67.0 | 46.1 | -- |
| Vanilla-VMamba-T | 42M | 262G/286G | MaskRCNN@3x | 48.5 | 70.0 | 52.7 | 43.2 | 66.9 | 46.4 | config/log/ckpt |
| Vanilla-VMamba-S | 64M | 357G/400G | MaskRCNN@3x | 49.7 | 70.4 | 54.2 | 44.0 | 67.6 | 47.3 | config/log/ckpt |
| VMamba-T[s2l5] | 50M | 270G | MaskRCNN@3x | 48.9 | 70.6 | 53.6 | 43.7 | 67.7 | 46.8 | config/log/ckpt |
| VMamba-S[s2l15] | 70M | 384G | MaskRCNN@3x | 49.9 | 70.9 | 54.7 | 44.2 | 68.2 | 47.7 | config/log/ckpt |
| VMamba-T[s1l8] | 50M | 271G | MaskRCNN@3x | 48.8 | 70.4 | 53.5 | 43.7 | 67.4 | 47.0 | config/log/ckpt |
- Models in this subsection are initialized from the models trained in classification.
- FLOPs are now calculated with the algorithm @albertgu provides, which yields larger numbers than the previous calculation (based on the `selective_scan_ref` function, which ignores the hardware-aware algorithm); a sketch of this counting follows this list.
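For reference, the count @albertgu provides reduces to a closed-form expression over the scan's shapes. The helper below is a hedged sketch of that counting, not the repository's exact implementation; the 9·B·L·D·N core term and the optional skip/gating terms follow the commonly cited breakdown for the hardware-aware selective scan, and the function name and defaults are illustrative.

```python
def selective_scan_flops(B, L, D, N, with_D=True, with_Z=False):
    """Approximate FLOPs of one selective scan over a (B, D, L) input
    with state dimension N, per the hardware-aware counting."""
    flops = 9 * B * L * D * N  # core recurrence: discretization + scan + output
    if with_D:                 # skip connection u * D
        flops += B * D * L
    if with_Z:                 # extra output gating y * act(z)
        flops += B * D * L
    return flops

# e.g., one scan over a 56x56 feature map with D=96 channels and N=16 states:
# selective_scan_flops(B=1, L=56 * 56, D=96, N=16)
```

The older `selective_scan_ref`-based count omits some of the work the hardware-aware algorithm actually performs, which is why the updated numbers in the tables are larger.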

## Semantic Segmentation on ADE20K

| Backbone | Input | #params | FLOPs | Segmentor | mIoU (SS) | mIoU (MS) | configs/logs/logs(ms)/ckpts |
| :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: |
| Swin-T | 512x512 | 60M | 945G | UperNet@160k | 44.4 | 45.8 | -- |
| Swin-S | 512x512 | 81M | 1039G | UperNet@160k | 47.6 | 49.5 | -- |
| Swin-B | 512x512 | 121M | 1188G | UperNet@160k | 48.1 | 49.7 | -- |
| Vanilla-VMamba-T | 512x512 | 55M | 939G/964G | UperNet@160k | 47.3 | 48.3 | config/log/log(ms)/ckpt |
| Vanilla-VMamba-S | 512x512 | 76M | 1037G/1081G | UperNet@160k | 49.5 | 50.5 | config/log/log(ms)/ckpt |
| Vanilla-VMamba-B | 512x512 | 110M | 1167G/1226G | UperNet@160k | 50.0 | 51.3 | config/log/log(ms)/ckpt |
| VMamba-T[s2l5] | 512x512 | 62M | 948G | UperNet@160k | 48.3 | 48.6 | config/log/log(ms)/ckpt |
| VMamba-S[s2l15] | 512x512 | 82M | 1028G | UperNet@160k | 50.6 | 51.2 | config/log/log(ms)/ckpt |
| VMamba-B[s2l15] | 512x512 | 122M | 1170G | UperNet@160k | 51.0 | 51.6 | config/log/log(ms)/ckpt |
| VMamba-T[s1l8] | 512x512 | 62M | 949G | UperNet@160k | 47.9 | 48.8 | config/log/log(ms)/ckpt |
- Models in this subsection are initialized from the models trained in classification; a sketch of this initialization follows this list.
- FLOPs are now calculated with the algorithm @albertgu provides, which yields larger numbers than the previous calculation (based on the `selective_scan_ref` function, which ignores the hardware-aware algorithm).
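In practice, "initialized from the models trained in classification" means loading the ImageNet-1K checkpoint into the backbone before fine-tuning under mmdetection/mmsegmentation. The helper below is a generic sketch of that step; the function name, the checkpoint layout, and the `"model"` key are assumptions, and the repo's configs wire this up through their own pretrained-checkpoint fields rather than this helper.

```python
import torch

def load_classification_weights(backbone, ckpt_path):
    """Load ImageNet-pretrained weights into a downstream backbone,
    skipping tensors (e.g. the classification head) the backbone lacks."""
    state = torch.load(ckpt_path, map_location="cpu")
    state = state.get("model", state)  # unwrap a training checkpoint, if wrapped
    # strict=False ignores head-only keys and reports what was skipped
    missing, unexpected = backbone.load_state_dict(state, strict=False)
    print(f"missing keys: {len(missing)}, unexpected keys: {len(unexpected)}")
    return backbone
```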