A curated list of awesome Multimodal studies.
Title | Venue | Date | Code | Supplement |
---|---|---|---|---|
Tarsier: Recipes for Training and Evaluating Large Video Description Models (Tarsier, DREAM-1K, by ByteDance) | arXiv | 2024-07-30 | - | |
Benchmarking Multi-Image Understanding in Vision and Language Models: Perception, Knowledge, Reasoning, and Multi-Hop Reasoning | arXiv | 2024-06-18 | - | |
LOVA3: Learning to Visual Question Answering, Asking and Assessment | arXiv | 2024-05-23 | - | |
MMT-Bench: A Comprehensive Multimodal Benchmark for Evaluating Large Vision-Language Models Towards Multitask AGI | arXiv | 2024-04-24 | - | |
BLINK: Multimodal Large Language Models Can See but Not Perceive | arXiv | 2024-04-18 | - | |
Ferret: Refer and Ground Anything Anywhere at Any Granularity (Ferret-Bench) | ICLR 2024 | 2023-10-11 | - | |
Aligning Large Multimodal Models with Factually Augmented RLHF (LLaVA-RLHF; MMHal-Bench, hallucination) | arXiv | 2023-09-25 | - | |
SEED-Bench: Benchmarking Multimodal LLMs with Generative Comprehension | CVPR 2024 | 2023-07-30 | - | |

Title | Venue | Date | Code | Supplement |
---|---|---|---|---|
LaDiC: Are Diffusion Models Really Inferior to Autoregressive Counterparts for Image-to-Text Generation? (LaDiC) | NAACL 2024 | 2024-04-16 | - | |