Vision Language Models Are Blind

by Pooyan Rahmanzadehgervi¹*, Logan Bolton¹*, Mohammad Reza Taesiri², Anh Totti Nguyen¹

*Equal contribution
¹Auburn University, ²University of Alberta

Website: https://vlmsareblind.github.io | arXiv: https://arxiv.org/abs/2407.06581 | Hugging Face Dataset

This repository contains the code and data for the paper Vision Language Models Are Blind. If you use BlindTest, please cite:

```bibtex
@article{vlms2024blind,
  title={Vision language models are blind},
  author={Rahmanzadehgervi, Pooyan and Bolton, Logan and Taesiri, Mohammad Reza and Nguyen, Anh Totti},
  journal={arXiv preprint arXiv:2407.06581},
  year={2024}
}
```

Abstract

While large language models with vision capabilities (VLMs), e.g., GPT-4o and Gemini 1.5 Pro, power various image-text applications and score high on many vision-understanding benchmarks, we find that they surprisingly still struggle with low-level vision tasks that are easy for humans. Specifically, on BlindTest, our suite of 7 very simple tasks, such as identifying (a) whether two circles overlap; (b) whether two lines intersect; (c) which letter is being circled in a word; and (d) how many circles are in an Olympic-like logo, four state-of-the-art VLMs are only 58.12% accurate on average. Claude 3.5 Sonnet performs best at 74.94% accuracy, but this is still far from the expected human accuracy of 100%. Across different image resolutions and line widths, VLMs consistently struggle with tasks that require precise spatial information and with recognizing geometric primitives that overlap or are close together. Code and data are available at: https://vlmsareblind.github.io

How to Reproduce Results

  1. Find the input images in the src/{task} directory, e.g., an image from the gpt-4o/incorrect folder.

  2. Locate the corresponding prompt in prompts.md, e.g., "Are the two circles touching each other? Answer with Yes/No."

  3. Feed the image and prompt to each model via its API with default settings or its official playground, NOT its web interface (e.g., use https://platform.openai.com/playground/chat for GPT-4o); see the sketch below.

  4. Compare your results with our paper, noting that outputs may vary because the default temperature is 1.

Important: Querying the models through their web interfaces (e.g., chatgpt.com) may yield results very different from those reported in our paper.
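
For step 3, here is a minimal sketch of a reproduction script, assuming the official openai Python SDK (v1+) with an OPENAI_API_KEY in the environment; the image path is hypothetical, and the prompt is the Two Circles prompt from prompts.md. Temperature is left at its default of 1, matching the paper's setting.

```python
import base64
from openai import OpenAI  # assumes the official openai>=1.0 SDK

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def ask(image_path: str, prompt: str) -> str:
    """Send one BlindTest image and prompt to GPT-4o with default settings."""
    with open(image_path, "rb") as f:
        b64 = base64.b64encode(f.read()).decode()
    resp = client.chat.completions.create(
        model="gpt-4o",
        # temperature is intentionally left at its default (1), as in the paper
        messages=[{
            "role": "user",
            "content": [
                {"type": "text", "text": prompt},
                {"type": "image_url",
                 "image_url": {"url": f"data:image/png;base64,{b64}"}},
            ],
        }],
    )
    return resp.choices[0].message.content

# Hypothetical path; substitute any image from the src/{task} folders.
print(ask("src/TouchingCircles/example.png",
          "Are the two circles touching each other? Answer with Yes/No."))
```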

Tasks in the BlindTest benchmark

  1. Task 1: Counting Line Intersections
  2. Task 2: Two Circles (see the sketch after this list)
  3. Task 3: Circled Letter
  4. Task 4: Counting Circles
  5. Task 5: Counting Nested Squares
  6. Task 6: Counting Rows and Columns
  7. Task 7: Following Color-Coded Paths
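
To make the tasks concrete, here is a minimal sketch that renders a Two Circles stimulus with matplotlib; the function name, canvas size, and output path are illustrative assumptions, not the paper's actual generation code.

```python
import matplotlib.pyplot as plt
from matplotlib.patches import Circle

def two_circles(distance: float, radius: float = 0.5,
                path: str = "two_circles.png") -> None:
    """Render two circles whose centers are `distance` apart."""
    fig, ax = plt.subplots(figsize=(4, 4), dpi=150)
    ax.add_patch(Circle((-distance / 2, 0), radius, fill=False, linewidth=2))
    ax.add_patch(Circle((distance / 2, 0), radius, fill=False, linewidth=2))
    ax.set_xlim(-2, 2)
    ax.set_ylim(-2, 2)
    ax.set_aspect("equal")
    ax.axis("off")
    fig.savefig(path, bbox_inches="tight")
    plt.close(fig)

# The circles overlap when distance < 2 * radius, touch when equal,
# and are separate otherwise -- the distinction the prompt asks about.
two_circles(distance=0.8)  # overlapping pair
```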

Benchmark Results

Mean Accuracy - All Tasks

(Figures: one chart of mean accuracy across all tasks, followed by per-task result charts for the seven BlindTest tasks.)
