- A summary of mainstream large multi-modal pre-trained models.
-
Emu3: Next-Token Prediction is All You Need, Emu3 Team, BAAI [Paper]
-
[arXiv:2409.18119] Multi-View and Multi-Scale Alignment for Contrastive Language-Image Pre-training in Mammography, Yuexi Du, John Onofrey, Nicha C. Dvornek [Paper]
-
[arXiv:2409.18111] E.T. Bench: Towards Open-Ended Event-Level Video-Language Understanding, Ye Liu, Zongyang Ma, Zhongang Qi, Yang Wu, Ying Shan, Chang Wen Chen [Paper]
-
EMOVA: Empowering Language Models to See, Hear and Speak with Vivid Emotions, Kai Chen, Yunhao Gou, Runhui Huang, Zhili Liu, Daxin Tan, Jing Xu, Chunwei Wang, Yi Zhu, Yihan Zeng, Kuo Yang, Dingdong Wang, Kun Xiang, Haoyuan Li, Haoli Bai, Jianhua Han, Xiaohui Li, Weike Jin, Nian Xie, Yu Zhang, James T. Kwok, Hengshuang Zhao, Xiaodan Liang, Dit-Yan Yeung, Xiao Chen, Zhenguo Li, Wei Zhang, Qun Liu, Lanqing Hong, Lu Hou, Hang Xu [Paper]
-
[arXiv:2409.17146] Molmo and PixMo: Open Weights and Open Data for State-of-the-Art Multimodal Models, Matt Deitke, Christopher Clark, Sangho Lee, Rohun Tripathi, Yue Yang, Jae Sung Park, Mohammadreza Salehi, Niklas Muennighoff, Kyle Lo, Luca Soldaini, Jiasen Lu, Taira Anderson, Erin Bransom, Kiana Ehsani, Huong Ngo, YenSung Chen, Ajay Patel, Mark Yatskar, Chris Callison-Burch, Andrew Head, Rose Hendrix, Favyen Bastani, Eli VanderBilt, Nathan Lambert, Yvonne Chou, Arnavi Chheda, Jenna Sparks, Sam Skjonsberg, Michael Schmitz, Aaron Sarnat, Byron Bischoff, Pete Walsh, Chris Newell, Piper Wolters, Tanmay Gupta, Kuo-Hao Zeng, Jon Borchardt, Dirk Groeneveld, Jen Dumas, Crystal Nam, Sophie Lebrecht, Caitlin Wittlif, Carissa Schoenick, Oscar Michel, Ranjay Krishna, Luca Weihs, Noah A. Smith, Hannaneh Hajishirzi, Ross Girshick, Ali Farhadi, Aniruddha Kembhavi [Paper] [Code]
-
[arXiv:2409.12568] InfiMM-WebMath-40B: Advancing Multimodal Pre-Training for Enhanced Mathematical Reasoning, Xiaotian Han, Yiren Jian, Xuefeng Hu, Haogeng Liu, Yiqi Wang, Qihang Fan, Yuang Ai, Huaibo Huang, Ran He, Zhenheng Yang, Quanzeng You [Paper]
-
General OCR Theory: Towards OCR-2.0 via a Unified End-to-end Model, Haoran Wei, Chenglong Liu, Jinyue Chen, Jia Wang, Lingyu Kong, Yanming Xu, Zheng Ge, Liang Zhao, Jianjian Sun, Yuang Peng, Chunrui Han, Xiangyu Zhang [Paper]
-
[arXiv:2408.16500] CogVLM2: Visual Language Models for Image and Video Understanding, Wenyi Hong, Weihan Wang, Ming Ding, Wenmeng Yu, Qingsong Lv, Yan Wang, Yean Cheng, Shiyu Huang, Junhui Ji, Zhao Xue, Lei Zhao, Zhuoyi Yang, Xiaotao Gu, Xiaohan Zhang, Guanyu Feng, Da Yin, Zihan Wang, Ji Qi, Xixuan Song, Peng Zhang, Debing Liu, Bin Xu, Juanzi Li, Yuxiao Dong, Jie Tang [Paper]
-
[arXiv:2408.14471] A Practitioner's Guide to Continual Multimodal Pretraining, Karsten Roth, Vishaal Udandarao, Sebastian Dziadzio, Ameya Prabhu, Mehdi Cherti, Oriol Vinyals, Olivier Hénaff, Samuel Albanie, Matthias Bethge, Zeynep Akata [Paper] [Code]
-
[arXiv:2408.08872] xGen-MM (BLIP-3): A Family of Open Large Multimodal Models, Le Xue, Manli Shu, Anas Awadalla, Jun Wang, An Yan, Senthil Purushwalkam, Honglu Zhou, Viraj Prabhu, Yutong Dai, Michael S Ryoo, Shrikant Kendre, Jieyu Zhang, Can Qin, Shu Zhang, Chia-Chih Chen, Ning Yu, Juntao Tan, Tulika Manoj Awalgaonkar, Shelby Heinecke, Huan Wang, Yejin Choi, Ludwig Schmidt, Zeyuan Chen, Silvio Savarese, Juan Carlos Niebles, Caiming Xiong, Ran Xu [Paper] [Project]
-
VITA: Towards Open-Source Interactive Omni Multimodal LLM, Chaoyou Fu, Haojia Lin, Zuwei Long, Yunhang Shen, Meng Zhao, Yifan Zhang, Xiong Wang, Di Yin, Long Ma, Xiawu Zheng, Ran He, Rongrong Ji, Yunsheng Wu, Caifeng Shan, Xing Sun [Paper]
-
[arXiv:2408.04840] mPLUG-Owl3: Towards Long Image-Sequence Understanding in Multi-Modal Large Language Models, Jiabo Ye, Haiyang Xu, Haowei Liu, Anwen Hu, Ming Yan, Qi Qian, Ji Zhang, Fei Huang, Jingren Zhou https://arxiv.org/abs/2408.04840
-
[arXiv:2408.02718] MMIU: Multimodal Multi-image Understanding for Evaluating Large Vision-Language Models, Fanqing Meng, Jin Wang, Chuanhao Li, Quanfeng Lu, Hao Tian, Jiaqi Liao, Xizhou Zhu, Jifeng Dai, Yu Qiao, Ping Luo, Kaipeng Zhang, Wenqi Shao [Paper]
-
[arXiv:2408.02865] VisionUnite: A Vision-Language Foundation Model for Ophthalmology Enhanced with Clinical Knowledge, Zihan Li, Diping Song, Zefeng Yang, Deming Wang, Fei Li, Xiulan Zhang, Paul E. Kinahan, Yu Qiao [Paper]
-
[arXiv:2408.03326] LLaVA-OneVision: Easy Visual Task Transfer, Bo Li, Yuanhan Zhang, Dong Guo, Renrui Zhang, Feng Li, Hao Zhang, Kaichen Zhang, Yanwei Li, Ziwei Liu, Chunyuan Li [Paper] [Code]
-
[arXiv:2407.14885] Falcon2-11B Technical Report, Quentin Malartic, Nilabhra Roy Chowdhury, Ruxandra Cojocaru, Mugariya Farooq, Giulia Campesan, Yasser Abdelaziz Dahou Djilali, Sanath Narayan, Ankit Singh, Maksim Velikanov, Basma El Amel Boussaha, Mohammed Al-Yafeai, Hamza Alobeidli, Leen Al Qadi, Mohamed El Amine Seddik, Kirill Fedyanin, Reda Alami, Hakim Hacid [Paper] [huggingface]
-
[CVPR 2024] Improved Baselines with Visual Instruction Tuning, Haotian Liu, Chunyuan Li, Yuheng Li, Yong Jae Lee [Paper] [Code]
-
[arXiv:2407.14177] EVLM: An Efficient Vision-Language Model for Visual Understanding, Kaibing Chen, Dong Shen, Hanwen Zhong, Huasong Zhong, Kui Xia, Di Xu, Wei Yuan, Yifei Hu, Bin Wen, Tianke Zhang, Changyi Liu, Dewen Fan, Huihui Xiao, Jiahong Wu, Fan Yang, Size Li, Di Zhang [Paper]
-
[arXiv:2407.07726] PaliGemma: A versatile 3B VLM for transfer, Lucas Beyer, Andreas Steiner, André Susano Pinto, Alexander Kolesnikov, Xiao Wang, Daniel Salz, Maxim Neumann, Ibrahim Alabdulmohsin, Michael Tschannen, Emanuele Bugliarello, Thomas Unterthiner, Daniel Keysers, Skanda Koppula, Fangyu Liu, Adam Grycner, Alexey Gritsenko, Neil Houlsby, Manoj Kumar, Keran Rong, Julian Eisenschlos, Rishabh Kabra, Matthias Bauer, Matko Bošnjak, Xi Chen, Matthias Minderer, Paul Voigtlaender, Ioana Bica, Ivana Balazevic, Joan Puigcerver, Pinelopi Papalampidi, Olivier Henaff, Xi Xiong, Radu Soricut, Jeremiah Harmsen, Xiaohua Zhai [Paper]
-
[arXiv:2407.03418] HEMM: Holistic Evaluation of Multimodal Foundation Models, Paul Pu Liang, Akshay Goindani, Talha Chafekar, Leena Mathur, Haofei Yu, Ruslan Salakhutdinov, Louis-Philippe Morency [Paper] [Code]
-
[arXiv:2406.11832] Unveiling Encoder-Free Vision-Language Models, Haiwen Diao, Yufeng Cui, Xiaotong Li, Yueze Wang, Huchuan Lu, Xinlong Wang [Paper]
-
Pegasus-v1 Technical Report, arXiv:2404.14687, Raehyuk Jung, Hyojun Go, Jaehyuk Yi, Jiho Jang, Daniel Kim, Jay Suh, Aiden Lee, Cooper Han, Jae Lee, Jeff Kim, Jin-Young Kim, Junwan Kim, Kyle Park, Lucas Lee, Mars Ha, Minjoon Seo, Abraham Jo, Ed Park, Hassan Kianinejad, SJ Kim, Tony Moon, Wade Jeong, Andrei Popescu, Esther Kim, EK Yoon, Genie Heo, Henry Choi, Jenna Kang, Kevin Han, Noah Seo, Sunny Nguyen, Ryan Won, Yeonhoo Park, Anthony Giuliani, Dave Chung, Hans Yoon, James Le, Jenny Ahn, June Lee, Maninder Saini, Meredith Sanders, Soyoung Lee, Sue Kim, Travis Couture [Paper]
-
MM1: Methods, Analysis & Insights from Multimodal LLM Pre-training, Brandon McKinzie, Zhe Gan, Jean-Philippe Fauconnier, Sam Dodge, Bowen Zhang, Philipp Dufter, Dhruti Shah, Xianzhi Du, Futang Peng, Floris Weers, Anton Belyi, Haotian Zhang, Karanjeet Singh, Doug Kang, Hongyu Hè, Max Schwarzer, Tom Gunter, Xiang Kong, Aonan Zhang, Jianyu Wang, Chong Wang, Nan Du, Tao Lei, Sam Wiseman, Mark Lee, Zirui Wang, Ruoming Pang, Peter Grasch, Alexander Toshev, Yinfei Yang [Paper]
-
EVA-CLIP-18B: Scaling CLIP to 18 Billion Parameters, Quan Sun, Jinsheng Wang, Qiying Yu, Yufeng Cui, Fan Zhang, Xiaosong Zhang, Xinlong Wang [Paper] [Code]
-
InternLM-XComposer2: Mastering Free-form Text-Image Composition and Comprehension in Vision-Language Large Models, Xiaoyi Dong, Pan Zhang, Yuhang Zang, Yuhang Cao, Bin Wang, Linke Ouyang, Xilin Wei, Songyang Zhang, Haodong Duan, Maosong Cao, Wenwei Zhang, Yining Li, Hang Yan, Yang Gao, Xinyue Zhang, Wei Li, Jingwen Li, Kai Chen, Conghui He, Xingcheng Zhang, Yu Qiao, Dahua Lin, Jiaqi Wang [Paper] [Code]
-
[arXiv:2310.07704] Ferret: Refer and Ground Anything Anywhere at Any Granularity, Haoxuan You, Haotian Zhang, Zhe Gan, Xianzhi Du, Bowen Zhang, Zirui Wang, Liangliang Cao, Shih-Fu Chang, Yinfei Yang [Paper] [Code]
-
[LLaVA] Visual Instruction Tuning, Haotian Liu, Chunyuan Li, Qingyang Wu, Yong Jae Lee, NeurIPS 2023 Oral [Paper] [Code]
-
PaLI-3 Vision Language Models: Smaller, Faster, Stronger, [Paper]
-
Fuyu-8B: A Multimodal Architecture for AI Agents, [https://www.adept.ai/blog/fuyu-8b]
-
OtterHD: A High-Resolution Multi-modality Model, Bo Li, Peiyuan Zhang, Jingkang Yang, Yuanhan Zhang, Fanyi Pu, Ziwei Liu [Paper] [Code]
-
Towards Medical Artificial General Intelligence via Knowledge-Enhanced Multimodal Pretraining, Bingqian Lin et al. [Paper]
-
CLIP^2: Contrastive Language-Image-Point Pretraining from Real-World Point Cloud Data, Yihan Zeng, Chenhan Jiang, Jiageng Mao, Jianhua Han, Chaoqiang Ye, Qingqiu Huang, Dit-Yan Yeung, Zhen Yang, Xiaodan Liang, Hang Xu [Paper]
-
PMC-CLIP: Contrastive Language-Image Pre-training using Biomedical Documents, Weixiong Lin, Ziheng Zhao, Xiaoman Zhang, Chaoyi Wu, Ya Zhang, Yanfeng Wang, and Weidi Xie, [Paper]
-
HiCLIP: Contrastive Language-Image Pretraining with Hierarchy-aware Attention, Shijie Geng, Jianbo Yuan, Yu Tian, Yuxiao Chen, Yongfeng Zhang, ICLR 2023 [Paper] [Code]
-
FAME-ViL: Multi-Tasking Vision-Language Model for Heterogeneous Fashion Tasks, Xiao Han, Xiatian Zhu, Licheng Yu, Li Zhang, Yi-Zhe Song, Tao Xiang [Paper] [Code]
-
Prismer: A Vision-Language Model with An Ensemble of Experts, Shikun Liu, Linxi Fan, Edward Johns, Zhiding Yu, Chaowei Xiao, Anima Anandkumar [Paper] [Code]
-
StrucTexTv2: Masked Visual-Textual Prediction for Document Image Pre-training, Yuechen Yu, Yulin Li, Chengquan Zhang, Xiaoqiang Zhang, Zengyuan Guo, Xiameng Qin, Kun Yao, Junyu Han, Errui Ding, Jingdong Wang [Paper] [Code]
-
Towards Generalisable Video Moment Retrieval: Visual-Dynamic Injection to Image-Text Pre-Training, Dezhao Luo, Jiabo Huang, Shaogang Gong, Hailin Jin, and Yang Liu [Paper]
-
RAMM: Retrieval-augmented Biomedical Visual Question Answering with Multi-modal Pre-training, Zheng Yuan, Qiao Jin, Chuanqi Tan, Zhengyun Zhao, Hongyi Yuan, Fei Huang, Songfang Huang [Paper]
-
"Language Is Not All You Need: Aligning Perception with Language Models." arXiv preprint arXiv:2302.14045 (2023). Huang, Shaohan, Li Dong, Wenhui Wang, Yaru Hao, Saksham Singhal, Shuming Ma, Tengchao Lv et al. [Paper] [Code]
-
Improving Medical Speech-to-Text Accuracy with Vision-Language Pre-training Model, Jaeyoung Huh, Sangjoon Park, Jeong Eun Lee, Jong Chul Ye [Paper]
-
Vid2Seq: Large-Scale Pretraining of a Visual Language Model for Dense Video Captioning, CVPR 2023, Antoine Yang, Arsha Nagrani, Paul Hongsuck Seo, Antoine Miech, Jordi Pont-Tuset, Ivan Laptev, Josef Sivic, Cordelia Schmid [Paper] [Project]
-
Knowledge-enhanced Visual-Language Pre-training on Chest Radiology Images, Xiaoman Zhang, Chaoyi Wu, Ya Zhang, Yanfeng Wang, Weidi Xie, [arXiv]
-
FLAVA: A Foundational Language And Vision Alignment Model, Amanpreet Singh, Ronghang Hu, Vedanuj Goswami, Guillaume Couairon, Wojciech Galuba, Marcus Rohrbach, Douwe Kiela, CVPR 2022 [Paper] [Project] [Code]
-
Position-guided Text Prompt for Vision-Language Pre-training, Alex Jinpeng Wang, Pan Zhou, Mike Zheng Shou, Shuicheng Yan [Paper] [Code]
-
MedKLIP: Medical Knowledge Enhanced Language-Image Pre-Training, Chaoyi Wu, Xiaoman Zhang, Ya Zhang, Yanfeng Wang, Weidi Xie [Paper] [Code]
-
Bidirectional Cross-Modal Knowledge Exploration for Video Recognition with Pre-trained Vision-Language Models, Wenhao Wu, Xiaohan Wang, Haipeng Luo, Jingdong Wang, Yi Yang, Wanli Ouyang [Paper]
-
HiTeA: Hierarchical Temporal-Aware Video-Language Pre-training, Qinghao Ye, Guohai Xu, Ming Yan, Haiyang Xu, Qi Qian, Ji Zhang, Fei Huang [Paper] [Model]
-
Million-scale Object Detection with Large Vision Model, Feng Lin, Wenze Hu, Yaowei Wang, Yonghong Tian, Guangming Lu, Fanglin Chen, Yong Xu, and Xiaoyu Wang [Paper] [Code]
-
VATLM: Visual-Audio-Text Pre-Training with Unified Masked Prediction for Speech Representation Learning, Qiushi Zhu, Long Zhou, Ziqiang Zhang, Shujie Liu, Binxing Jiao, Jie Zhang, Lirong Dai, Daxin Jiang, Jinyu Li, Furu Wei [Paper] [Code]
-
SIMLA: Single-Stream Multi-Level Alignment for Vision-Language Pretraining, ECCV 2022 (NEC Labs), Zaid Khan, Vijay Kumar, Xiang Yu, Samuel Schulter, Manmohan Chandraker, and Yun Fu [Paper] [Code] [Project]
-
VindLU: A Recipe for Effective Video-and-Language Pretraining, Feng Cheng, Xizi Wang, Jie Lei, David Crandall, Mohit Bansal, Gedas Bertasius [Paper] [Code]
-
CLIP Itself is a Strong Fine-tuner: Achieving 85.7% and 88.0% Top-1 Accuracy with ViT-B and ViT-L on ImageNet, Xiaoyi Dong, Jianmin Bao, Ting Zhang, Dongdong Chen, Shuyang Gu, Weiming Zhang, Lu Yuan, Dong Chen, Fang Wen, Nenghai Yu [Paper] [Code]
-
REVEAL: Retrieval-Augmented Visual-Language Pre-Training with Multi-Source Multimodal Knowledge Memory, Ziniu Hu, Ahmet Iscen, Chen Sun, Zirui Wang, Kai-Wei Chang, Yizhou Sun, Cordelia Schmid, David A. Ross, Alireza Fathi [Paper]
-
Scaling Multimodal Pre-Training via Cross-Modality Gradient Harmonization, Junru Wu et al. [Paper]
-
Chinese CLIP: Contrastive Vision-Language Pretraining in Chinese, An Yang et al. [Paper] [Code]
-
Generative Negative Text Replay for Continual Vision-Language Pretraining, [Paper]
-
GRIT-VLP: GRouped mIni-baTch sampling for Efficient Vision-Language Pre-training, ECCV 2022, [Paper] [Code]
-
Instruction-Following Agents with Jointly Pre-Trained Vision-Language Models, Hao Liu et al. [Paper] [Code]
-
FaD-VLP: Fashion Vision-and-Language Pre-training towards Unified Retrieval and Captioning, Suvir Mirchandani, et al. [Paper]
-
ERNIE-ViLG 2.0: Improving Text-to-Image Diffusion Model with Knowledge-Enhanced Mixture-of-Denoising-Experts, Zhida Feng et al. [Paper]
-
Learning by Hallucinating: Vision-Language Pre-training with Weak Supervision, [Paper]
-
MedCLIP: Contrastive Learning from Unpaired Medical Images and Text, Zifeng Wang, Zhenbang Wu, Dinesh Agarwal, Jimeng Sun [Paper] [Code]
-
Contrastive Language-Image Pre-Training with Knowledge Graphs, [Paper]
-
Vision-Language Pre-training: Basics, Recent Advances, and Future Trends, Zhe Gan, Linjie Li, Chunyuan Li, Lijuan Wang, Zicheng Liu, Jianfeng Gao [Paper] [Recent Advances in Vision-and-Language Pre-training In conjunction with CVPR 2022]
-
Non-Contrastive Learning Meets Language-Image Pre-Training, Jinghao Zhou, Li Dong, Zhe Gan, Lijuan Wang, Furu Wei [Paper]
-
Plausible May Not Be Faithful: Probing Object Hallucination in Vision-Language Pre-training, Wenliang Dai, Zihan Liu, Ziwei Ji, Dan Su, Pascale Fung [Paper]
-
Long-Form Video-Language Pre-Training with Multimodal Temporal Contrastive Learning, Yuchong Sun, Hongwei Xue, Ruihua Song, Bei Liu, Huan Yang, Jianlong Fu, [Paper] [Code]
-
MAP: Modality-Agnostic Uncertainty-Aware Vision-Language Pre-training Model, Yatai Ji, Junjie Wang, Yuan Gong, Lin Zhang, Yanru Zhu, Hongfa Wang, Jiaxing Zhang, Tetsuya Sakai, Yujiu Yang [Paper] [Code]
-
CLIP2Point: Transfer CLIP to Point Cloud Classification with Image-Depth Pre-training, Tianyu Huang, Bowen Dong, Yunhan Yang, Xiaoshui Huang, Rynson W.H. Lau, Wanli Ouyang, Wangmeng Zuo [Paper] [Code]
-
F-VLM: Open-Vocabulary Object Detection upon Frozen Vision and Language Models, Weicheng Kuo, Yin Cui, Xiuye Gu, AJ Piergiovanni, Anelia Angelova [Paper] [Code]
-
Medical Image Understanding with Pretrained Vision Language Models: A Comprehensive Study, Ziyuan Qin, Huahui Yi, Qicheng Lao, Kang Li [Paper]
-
ERNIE-ViL 2.0: Multi-view Contrastive Learning for Image-Text Pre-training, Bin Shan, Weichong Yin, Yu Sun, Hao Tian, Hua Wu, Haifeng Wang [Paper] [Code]
-
OmniVL: One Foundation Model for Image-Language and Video-Language Tasks, Junke Wang, Dongdong Chen, Zuxuan Wu, Chong Luo, Luowei Zhou, Yucheng Zhao, Yujia Xie, Ce Liu, Yu-Gang Jiang, Lu Yuan [Paper]
-
Multi-Modal Masked Autoencoders for Medical Vision-and-Language Pre-Training, Zhihong Chen, Yuhao Du, Jinpeng Hu, Yang Liu, Guanbin Li, Xiang Wan, and Tsung-Hui Chang, MICCAI 2022 [Paper] [Code]
-
Exploring Visual Interpretability for Contrastive Language-Image Pre-training, Yi Li, Hualiang Wang, Yiqun Duan, Hang Xu, Xiaomeng Li [Paper] [Code]
-
PaLI: A Jointly-Scaled Multilingual Language-Image Model, Xi Chen, Xiao Wang, Soravit Changpinyo, AJ Piergiovanni, Piotr Padlewski, Daniel Salz, Sebastian Goodman, Adam Grycner, Basil Mustafa, Lucas Beyer, Alexander Kolesnikov, Joan Puigcerver, Nan Ding, Keran Rong, Hassan Akbari, Gaurav Mishra, Linting Xue, Ashish Thapliyal, James Bradbury, Weicheng Kuo, Mojtaba Seyedhosseini, Chao Jia, Burcu Karagol Ayan, Carlos Riquelme, Andreas Steiner, Anelia Angelova, Xiaohua Zhai, Neil Houlsby, Radu Soricut [Paper]
-
CLIP-ViP: Adapting Pre-trained Image-Text Model to Video-Language Representation Alignment, arXiv:2209.06430, Hongwei Xue, Yuchong Sun, Bei Liu, Jianlong Fu, Ruihua Song, Houqiang Li, Jiebo Luo [Paper] [Code]
-
RLIP: Relational Language-Image Pre-training for Human-Object Interaction Detection, Hangjie Yuan et al. [Paper]
-
An Empirical Study of End-to-End Video-Language Transformers with Masked Visual Modeling, Tsu-Jui Fu, Linjie Li, Zhe Gan, Kevin Lin, William Yang Wang, Lijuan Wang, Zicheng Liu, [Paper]
-
Efficient Vision-Language Pretraining with Visual Concepts and Hierarchical Alignment, Mustafa Shukor, Guillaume Couairon, Matthieu Cord, [Paper] [Code]
-
Image as a Foreign Language: BEIT Pretraining for All Vision and Vision-Language Tasks, Wenhui Wang, Hangbo Bao, Li Dong, Johan Bjorck, Zhiliang Peng, Qiang Liu, Kriti Aggarwal, Owais Khan Mohammed, Saksham Singhal, Subhojit Som, et al., arXiv:2208.10442, 2022. [Paper] [Code]
-
Pix4Point: Image Pretrained Transformers for 3D Point Cloud Understanding, Guocheng Qian, Xingdi Zhang, Abdullah Hamdi, Bernard Ghanem [Paper] [Code]
-
VLMAE: Vision-Language Masked Autoencoder, Sunan He, Taian Guo, Tao Dai, Ruizhi Qiao, Chen Wu, Xiujun Shu, Bo Ren, arXiv:2208.09374 [Paper]
-
Li, Juncheng, et al. "Fine-Grained Semantically Aligned Vision-Language Pre-Training." arXiv preprint arXiv:2208.02515 (2022). [Paper]
-
GRIT-VLP: Grouped Mini-batch Sampling for Efficient Vision and Language Pre-training, Jaeseok Byun, Taebaek Hwang, Jianlong Fu, and Taesup Moon, arXiv:2208.04060 [Paper] [Code]
-
Wang, Tengfei, et al. "Pretraining is All You Need for Image-to-Image Translation." arXiv preprint arXiv:2205.12952 (2022). [Paper] [Code]
-
Wang, Jinpeng, et al. "Object-aware Video-language Pre-training for Retrieval." Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2022. [Paper] [Code]
-
See Finer, See More: Implicit Modality Alignment for Text-based Person Retrieval, Xiujun Shu, Wei Wen, Haoqian Wu, Keyu Chen, Yiran Song, Ruizhi Qiao, Bo Ren, Xiao Wang, The 2nd Workshop on Real-World Surveillance: Applications and Challenges, ECCVW-2022 [Paper] [Code]
-
Learning Visual Representation from Modality-Shared Contrastive Language-Image Pre-training, 2022 European Conference on Computer Vision (ECCV 2022), Haoxuan You, Luowei Zhou, Bin Xiao, Noel Codella, Yu Cheng, Ruochen Xu, Shih-Fu Chang, Lu Yuan. [Paper] [Code]
-
Zhao, Tiancheng, et al. "VL-CheckList: Evaluating Pre-trained Vision-Language Models with Objects, Attributes and Relations." arXiv preprint arXiv:2207.00221 (2022). [Paper] [Code]
-
DemoVLP: Revitalize Region Feature for Democratizing Video-Language Pre-training, Guanyu Cai, Yixiao Ge, Alex Jinpeng Wang, Rui Yan, Xudong Lin, Ying Shan, Lianghua He, Xiaohu Qie, Jianping Wu, Mike Zheng Shou [Paper] [Code]
-
Yan, Rui, et al. "Video-Text Pre-training with Learned Regions." arXiv preprint arXiv:2112.01194 (2021). [Paper] [Code]
-
Wang, Alex Jinpeng, et al. "All in one: Exploring unified video-language pre-training." arXiv preprint arXiv:2203.07303 (2022). [Paper] [Code]
-
Egocentric Video-Language Pretraining, Kevin Qinghong Lin, Alex Jinpeng Wang, Mattia Soldan, Michael Wray, Rui Yan, Eric Zhongcong Xu, Difei Gao, Rongcheng Tu, Wenzhe Zhao, Weijie Kong, Chengfei Cai, Hongfa Wang, Dima Damen, Bernard Ghanem, Wei Liu, Mike Zheng Shou, arXiv-2022 [Paper] [Code]
-
LocVTP: Video-Text Pre-training for Temporal Localization (ECCV 2022), Meng Cao, Tianyu Yang, Junwu Weng, Can Zhang, Jue Wang and Yuexian Zou. [Paper] [Code]
-
Gui, L., Wang, B., Huang, Q., et al. "KAT: A Knowledge Augmented Transformer for Vision-and-Language." arXiv preprint arXiv:2112.08614 (2021). [Paper] [Code]
NO. | Model | Publish | Modality | Architecture | Objective | Highlights | Code |
---|---|---|---|---|---|---|---|
64 | pyramidCLIP | arXiv-2022 | image-text | CNN+Trans | CS | Hierarchical image-text contrastive learning | - |
65 | VLC | arXiv-2022 | image-text | ViT | MIM, MLM, ITM | Built on top of MAE; does not require training on ImageNet | [Code] |
66 | VLCDoC | arXiv-2022 | image-text | Trans | CS | Contrastive Pre-Training for document classification | - |
67 | MVP | arXiv-2022 | image-text | ViT | MIM | Multimodality-guided visual pre-training leads to impressive gains | - |
68 | COTS | arXiv-2022 | image-text | Trans | CS, KLD, MVLM | Token- and task-level interaction are proposed to enhance cross-modal interaction | - |
69 | Flamingo | arXiv-2022 | image-text | NFNet | CS | An architecture for accepting arbitrarily interleaved visual data and text as input | [Code] |
70 | BLIP | arXiv-2022 | image-text | BERT | CS, MML, MLM | Propose a multimodal mixture of encoder-decoder and a captioning-and-filtering scheme | [Code] |
71 | TCL | CVPR-2022 | image-text | BERT | CMA, IMC, LMI, ITM, MLM | The first work to consider local structure information for multi-modality representation learning | [Code] |
72 | SCALE | CVPR-2022 | image, text, table, video, audio | BERT | MRP, MLM, MEM, MFP, MFP, MAM | A unified model to handle five modalities | [Code] |
73 | Clinical-BERT | AAAI-2022 | image-text | BERT | CD, MMM, MLM, IMM | The first work to learn domain knowledge during pre-training for the medical domain | - |
74 | ProbES | ACL-2022 | image-text | LSTM, ViLBERT | Ranking loss | Prompt-based learning for VLN based on CLIP | [Code] |
75 | VLP-MABSA | ACL-2022 | image-text | BERT | MLM, AOE, MRM, AOG, MSP | Task-specific VL-PTMs for multimodal aspect-based sentiment analysis | [Code] |
76 | R2D2 | arXiv-2022 | image-text | ViT, BERT | GCPR, FGR, MLM | A two-way distillation strategy is proposed, i.e., target- and feature-guided distillation | - |
77 | DeFILIP | arXiv-2022 | image-text | ViT, ResNet | CS | A benchmark for CLIP and its variants | [Code] |
78 | CoCa | arXiv-2022 | image-text | Trans | CS, ITG | Jointly pre-train image text model with contrastive loss and captioning loss | - |
79 | HiVLP | arXiv-2022 | image-text | Trans | LRM, HRL, VLM | Accelerate image-text retrieval via hierarchical retrieval | - |
80 | CLIP-Event | CVPR-2022 | image-text | Trans | CS | Consider event structural knowledge and prompts in the pre-training phase. | [Code] |
81 | AudioCLIP | ICASSP-2022 | image-text-audio | Trans | CS | Build a tri-modal PTM in the style of CLIP | [Code] |
82 | VL-BEiT | arXiv-2022 | image-text | Trans | MLM, MIM, MVLM | Pretrain on both monomodal and multimodal data using a shared Transformer | [Code] |
83 | MV-GPT | arXiv-2022 | image-text | BERT | MLM, LG | Pre-train both a multi-modal video encoder and a sentence decoder jointly. | - |
84 | MMKD | arXiv-2022 | image-text | BERT | ITM | Iteratively execute knowledge discovery and model pre-training for continuous learning | - |
85 | GLIPv2 | arXiv-2022 | image-text | Swin, BERT | PGL, CS, MLM | Serves both the localization and understanding tasks. | [Code] |
86 | LIMoE | arXiv-2022 | image-text | Trans | CS | Multi-modal pre-training with a sparse mixture-of-experts model | - |
87 | VLMixer | arXiv-2022 | image-text | Trans | MLM, CMCL, MTM | Implicit cross-modal alignment learning in unpaired VLP. | [Code] |
88 | ProtoCLIP | arXiv-2022 | image-text | Trans | CS | Combine the CLIP loss and prototypical supervisions for VLP. | [Code] |
89 | i-Code | arXiv-2022 | image-text-audio | Trans | MLM, MVM, MSM, CS | Maps different combinations of modalities (single-, dual-, and triple-modality) into a single representation space | - |
NO. | Model | Publish | Modality | Architecture | Objective | Highlights | Code |
---|---|---|---|---|---|---|---|
25 | XGPT | NLPCC-2021 | image-text | Trans | IC, MLM, IDA, MOR | Novel IDA pre-training; Share parameters between encoder and decoder | - |
26 | ERNIE-ViL | AAAI-2021 | image-text | Trans | MOC, AttP, RelP, MLM, MOR, MML | Use the knowledge obtained from scene graph | [Code] |
27 | KVL-BERT | KBS-2021 | image-text | BERT | MOC, MLM | Integrate commonsense knowledge for visual commonsense reasoning | - |
28 | VinVL | CVPR-2021 | image-text | Trans | MTL, 3-way CS | Verifying that visual feature matters in VLP, i.e., strong object detector brings better results | [Code] |
29 | VL-T5 | ICML-2021 | image-text | Trans | MLM, VQA, MML, VG, GC | Unified framework for VL via generating texts | [Code] |
30 | ViLT | ICML-2021 | image-text | Trans | MLM, MML | Use linear embedding only for Fast VL transformer | [Code] |
31 | ALIGN | ICML-2021 | image-text | EfficientNet, BERT | CS | Milestone for image-text pre-training using noisy data | - |
32 | Kaleido-BERT | CVPR-2021 | image-text | Trans | MLM, MML, AKPM | Use saliency detector to generate multi-grained patches | [Code] |
33 | MDETR | ICCV-2021 | image-text | CNN+Trans | STP, MML | An end-to-end text-modulated detection system | [Code] |
34 | SOHO | CVPR-2021 | image-text | CNN+Trans | MLM, MOR, MML | Use a dynamic-updated visual dictionary for vision-language alignment | [Code] |
35 | E2E-VLP | ACL-2021 | image-text | Trans | OBD, ITG | The first end-to-end pre-trained model for V+L understanding and generation | - |
36 | PIM | NeurIPS-2021 | image-text | Trans | MLM, MML, MOR | Propose an inter-modality flow metric to measure and reveal vision and language fusion | - |
37 | CLIP-ViL | arXiv-2021 | image-text | Trans | MLM, VQA, MML | Take the CLIP visual encoder as its visual backbone | [Code] |
38 | ALBEF | NeurIPS-2021 | image-text | Trans | CS, GR | Design a momentum model to address noisy data | [Code] |
39 | SimVLM | arXiv-2021 | image-text | Trans | PrefixLM | Simple VL model using single PrefixLM pre-training objective only | - |
40 | MURAL | arXiv-2021 | image-text | Trans | CS | Adopt multi-task contrastive learning objective (image-text, text-text) | - |
41 | VLMo | arXiv-2021 | image-text | Trans | MLM, MML, CS | Jointly learns visual-, text-encoder and a fusion encoder | [Code] |
42 | METER | CVPR-2022 | image-text | Trans | MLM, MOR, MOC, MML | An empirical study on VLP | [Code] |
43 | CLIP | ICML-2021 | image-text | ResNet, Trans | CS | Milestone for image-text pre-training using noisy data | [Code] |
44 | Frozen | ICCV-2021 | video/image-text | Trans | MML | Flexibly trained on both images and videos with captions jointly | [Code] |
45 | RegionLearner | arXiv-2021 | video-text | Trans | MML | Implicitly learning object region without position supervision | [Code] |
46 | DALL-E | ICML-2021 | image-text | Trans | ELB | Achieve high quality image generation without using any of the training labels | [Code] |
47 | BriVL | arXiv-2021 | image-text | Trans | InfoNCE | First large-scale Chinese multi-modal pre-training model | [Code] |
48 | M6 | arXiv-2021 | image-text | Trans | LM | The largest pretrained model in Chinese | - |
49 | CogView | NeurIPS-2021 | image-text | Trans | NLL | The first open-source large text-to-image transformer | [Code] |
50 | VATT | NeurIPS-2021 | Video, Audio, Text | Trans | NCE, MIL-NCE | Modality-specific or Modality-agnostic triplet modality pre-trained model | [Code] |
51 | OPT | arXiv-2021 | Image, Audio, Text | Trans | MLM, MVM, MoLM, MAM, DTR, DIR | The first pre-trained model that connects the three modalities of text, vision, and audio | - |
52 | Florence | arXiv-2021 | image-text | CoSwin | UniCL | Expand the representations from coarse-to-fine, static-to-dynamic, and RGB-to-MM | - |
53 | ROSITA | MM-2021 | image-text | Trans | SKM, MLM, MRM | Incorporates both cross- and intra-modal knowledge, and proposed SKM strategy | - |
54 | GilBERT | IR-2021 | image-text | BERT | MLM, MOR | Employ image-to-text captioning and text-to-image synthesizing in VLP | - |
55 | U-VisualBERT | NAACL-2021 | image-text | Trans, BERT | GR, MML | Unpaired image-text data for pre-training | [Code] |
56 | M3P | CVPR-2021 | image-text | BERT | xMLM, MC-MLM, MC-MRM | Multitask, Multilingual, Multimodal Pre-training | [Code] |
57 | NUWA | arXiv-2021 | image-text | Trans | T2I, T2V, V2V | A 3D transformer framework that can handle image, text, and video simultaneously | [Code] |
58 | GLIP | CVPR-2022 | image-text | BERT | CS | Unifying detection and grounding by reformulating object detection as phrase grounding | [Code] |
59 | RegionCLIP | CVPR-2022 | image-text | Trans | Distillation loss, CS | Learn region-level visual representations based on CLIP | [Code] |
60 | DeCLIP | ICLR-2022 | image-text | ViT | InfoNCE, SS, MVS, NNS | Learn generic visual features in a data-efficient way | [Code] |
61 | SLIP | arXiv-2021 | image-text | ViT | CS, InfoNCE | Combine the self-supervised learning and CLIP pre-training in a multi-task framework | [Code] |
62 | FILIP | arXiv-2021 | image-text | ViT | CS | Achieve finer-level alignment using the cross-modal late interaction scheme | - |
63 | SemVLP | arXiv-2021 | image-text | Trans | MLM, MOP, ITM, QA | Fuse the single- and two-stream architectures | - |
NO. | Model | Publish | Modality | Architecture | Objective | Highlights | Code |
---|---|---|---|---|---|---|---|
08 | Unicoder-VL | AAAI-2020 | image-text | Trans | GR, MML, MOC | Single transformer encoder for VLP | [Code] |
09 | VLP | AAAI-2020 | image-text | Trans | BiDT, Seq2seq | Unified encoder-decoder network architecture | [Code] |
10 | UNITER | ECCV-2020 | image-text | Trans | MRA, MML | Propose an OT-based Word-Region Alignment objective | [Code] |
11 | 12-IN-1 | CVPR-2020 | image-text | Trans | CS, GR | Training jointly on 12 different datasets in a multi-task learning manner | [Code] |
12 | VisDial-BERT | ECCV-2020 | image-text | Trans | MLM, NSP, MIR | Pre-training on image-text corpus and finetuning on visual dialog | [Code] |
13 | ImageBERT | arXiv-2020 | image-text | Trans | MOC, MLM, MML, MOR | Indicating that multi-stage pre-training works better | - |
14 | PREVALENT | CVPR-2020 | image-text | Trans | MLM, AP | Pre-training for vision and language navigation | [Code] |
15 | InterBERT | arXiv-2020 | image-text | Trans | MSM, MOC, ITM-hn | Finding that all-attention works better than co-attention for modal interaction | [Code] |
16 | PixelBERT | arXiv-2020 | image-text | CNN, Trans | MLM, MML | First to align vision and language at the pixel and text level | - |
17 | OSCAR | ECCV-2020 | image-text | Trans | CS, MLM | Use object tags as anchor points to align image regions with word embeddings | [Code] |
18 | FashionBERT | RDIR-2020 | image-text | BERT | MLM, MOR, MML | Use image patches for fashion domain instead of RoIs | [Code] |
19 | VILLA | NeurIPS-2020 | image-text | Trans | MLM, MOR, MML | Pre-training with adversarial learning | [Code] |
20 | UniVL | arXiv-2020 | video-text | Trans | MLM, MFM, MML, ITG | A unified model for multimodal understanding and generation | [Code] |
21 | HERO | EMNLP-2020 | video-text | Trans | MLM, MFM, VSM, FOM | Hierarchical Transformer-based model trained with newly proposed VSM and FOM | [Code] |
22 | MMFT-BERT | EMNLP-2020 | image-text | BERT | Classification | Adopt a multimodal fusion Transformer for modality fusion | [Code] |
23 | ActBERT | CVPR-2020 | image-text | Trans | CS, GR | Extract actions explicitly as one of the inputs | - |
24 | UNIMO | arXiv-2020 | image-text | Trans | CS | Adapt to single-, multi-modal understanding and generation tasks effectively | [Code] |
NO. | Model | Publish | Modality | Architecture | Objective | Highlights | Code |
---|---|---|---|---|---|---|---|
01 | VisualBERT | arXiv-2019 | image-text | Trans, BERT | GR, MML | A simple and strong baseline for VLP | [Code] |
02 | ViLBERT | NeurIPS-2019 | image-text | Trans | CS, GR | First to adopt co-attention for MM pre-training | [Code] |
03 | LXMERT | EMNLP-2019 | image-text | Trans | QA, MOR, MOC, MML, MLM | Propose a cross-modality encoder for vision-language pre-training | [Code] |
04 | B2T2 | EMNLP-2019 | image-text | ResNet, BERT | MML, GR | Embed bounding boxes into the text transformer in an early-fusion manner | [Code] |
05 | VL-BERT | ICLR-2019 | image-text | BERT | GR, MOC | MM PTMs and Faster R-CNN are jointly trained | [Code] |
06 | VideoBERT | ICCV-2019 | video-text | BERT | MLM | A simple model for video-text feature learning | [Code] |
07 | CBT | arXiv-2019 | video-text | Trans | NCE | Self-supervised contrastive bidirectional Transformer | - |