MMComposition

✨ Benchmarking the Compositionality for Pre-trained Vision-Language Models

🌐 Homepage | 🔬 Paper | 👩‍💻 Code

In the quest for advancing vision-language models (VLMs), recent developments such as GPT-4V, LLaVA, mPLUG, MiniGPT-4, and BLIP have shown impressive capabilities in complex reasoning tasks. However, these models still face challenges in understanding fine-grained multimodal compositional information, limiting their reliability and performance.

To address this, we introduce MMComposition, a novel benchmark specifically designed to evaluate the compositionality of VLMs comprehensively. MMComposition assesses VLMs across two main dimensions: vision-language (VL) compositional understanding and VL compositional reasoning. Unlike previous benchmarks that focus on single-choice questions or open-ended text generation, MMComposition provides a diverse set of tasks including single-choice questions, indefinite-choice questions, text generation, and text-image matching. This diversity ensures a thorough evaluation of the models' ability to understand and reason with compositional information across modalities.

Our findings reveal that even state-of-the-art models like GPT-4 struggle with nuanced compositional reasoning tasks. These insights highlight the need for further research to enhance VLMs' compositional abilities.

Our key contributions are:

- Proposing MMComposition, the first comprehensive benchmark for evaluating the compositionality of pre-trained VLMs.
- Providing a thorough experimental evaluation of state-of-the-art VLMs' compositionality.
- Benchmarking a set of well-known VLMs using the proposed MMComposition benchmark.

MMComposition aims to inspire advancements in VLM design and training, ultimately improving their performance in understanding and reasoning with complex multimodal information.
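To illustrate how the two choice-based task types above differ in evaluation, here is a minimal scoring sketch. The function names, answer format (option letters), and exact-set-match rule for indefinite-choice questions are illustrative assumptions, not the benchmark's official evaluation code:

```python
# Hypothetical scoring sketch for MMComposition-style questions.
# All names and scoring rules here are assumptions for illustration.

def score_single_choice(prediction: str, answer: str) -> float:
    """Single-choice: exactly one option is correct, so credit
    is an exact match on the chosen option letter."""
    return 1.0 if prediction.strip().upper() == answer.strip().upper() else 0.0

def score_indefinite_choice(predictions: set[str], answers: set[str]) -> float:
    """Indefinite-choice: any number of options may be correct,
    so full credit requires recovering the exact answer set."""
    norm = lambda opts: {o.strip().upper() for o in opts}
    return 1.0 if norm(predictions) == norm(answers) else 0.0

# Example usage with made-up items:
print(score_single_choice("b", "B"))                    # 1.0
print(score_indefinite_choice({"A", "C"}, {"c", "a"}))  # 1.0
print(score_indefinite_choice({"A"}, {"A", "C"}))       # 0.0
```

The exact-set rule makes indefinite-choice questions strictly harder than single-choice ones: a model must decide not only which options are plausible but also when to stop selecting.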

🏆 Leaderboard

[Leaderboard figure]

📉 Statistics

[Statistics figure]

✏️ Citation

@article{hua2024mmcomposition,
      title={MMComposition: Benchmarking the Compositionality for Pre-trained Vision-Language Models},
      author={Hua, Hang and Tang, Yunlong and Zeng, Ziyun and Cao, Liangliang and Yang, Zhengyuan and He, Hangfeng and Xu, Chenliang and Luo, Jiebo},
      journal={},
      year={2024}
}

Under construction...
