🌐 Homepage | 🔬 Paper | 👩‍💻 Code
In the quest to advance vision-language models (VLMs), recent systems such as GPT-4V, LLaVA, mPLUG, MiniGPT-4, and BLIP have shown impressive capabilities in complex reasoning tasks. However, these models still struggle to understand fine-grained multimodal compositional information, limiting their reliability and performance.

To address this, we introduce MMComposition, a novel benchmark designed to comprehensively evaluate the compositionality of VLMs. MMComposition assesses VLMs along two main dimensions: vision-language (VL) compositional understanding and VL compositional reasoning. Unlike previous benchmarks that focus on single-choice questions or open-ended text generation, MMComposition provides a diverse set of tasks: single-choice questions, indefinite-choice questions, text generation, and text-image matching. This diversity ensures a thorough evaluation of a model's ability to understand and reason with compositional information across modalities.

Our findings reveal that even state-of-the-art models like GPT-4 struggle with nuanced compositional reasoning tasks, highlighting the need for further research to enhance VLMs' compositional abilities.

Our key contributions are:

- Proposing MMComposition, the first comprehensive benchmark for evaluating the compositionality of pretrained VLMs.
- Providing a thorough experimental evaluation of the compositionality of state-of-the-art VLMs.
- Benchmarking a set of well-known VLMs on the proposed MMComposition benchmark.

MMComposition aims to inspire advances in VLM design and training, ultimately improving models' performance in understanding and reasoning with complex multimodal information.
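Because MMComposition mixes question formats, scoring differs by task type: single-choice items have one correct option, while indefinite-choice items are only correct when the predicted option set matches the answer set exactly. The sketch below illustrates this distinction; the item schema (`type`, `answer`, `prediction` fields) is a hypothetical example, not the benchmark's actual data format, and text generation and text-image matching would need their own metrics.

```python
def score_item(item: dict) -> float:
    """Return 1.0 if the prediction is correct, else 0.0.

    Hypothetical scoring sketch for the two choice-based task types;
    field names are illustrative assumptions.
    """
    task = item["type"]
    if task == "single_choice":
        # Exactly one correct option, e.g. "B".
        return float(item["prediction"] == item["answer"])
    if task == "indefinite_choice":
        # Any number of correct options; the full set must match,
        # so a partially correct selection scores 0.
        return float(set(item["prediction"]) == set(item["answer"]))
    # Text generation and text-image matching are scored separately.
    raise ValueError(f"unsupported task type: {task}")


def accuracy(items: list[dict]) -> float:
    """Mean score over a list of choice-based items."""
    return sum(score_item(it) for it in items) / len(items)
```

Requiring an exact set match for indefinite-choice questions is stricter than per-option accuracy: guessing all options or only the safest one no longer earns partial credit.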
If you find our work useful, please cite:

```bibtex
@article{hua2024mmcomposition,
  title={MMComposition: Benchmarking the Compositionality for Pre-trained Vision-Language Models},
  author={Hua, Hang and Tang, Yunlong and Zeng, Ziyun and Cao, Liangliang and Yang, Zhengyuan and He, Hangfeng and Xu, Chenliang and Luo, Jiebo},
  journal={},
  year={2024}
}
```