Skip to content

Latest commit

 

History

History
18 lines (15 loc) · 2.75 KB

2312.1137.md

File metadata and controls

18 lines (15 loc) · 2.75 KB

Background

  • Background The paper discusses the application of large language models (LLMs) in solving mathematical problems. Specifically, it addresses the limitations of current multimodal large language models (MLLMs) in understanding geometric figures and relationships, which impairs their ability to solve problems that involve geometric information.
  • Existing Work Current MLLMs are usually trained with images and descriptions from the general domain which hampers their ability to accurately understand the semantics needed for geometric reasoning. Although there is potential in enhancing these models by including descriptions related to geometric information, the largest public geometric problem dataset available is quite small and lacks descriptions of geometric images. This limits models' understanding of basic geometric elements and their problem-solving abilities.

Core Contributions

  • Introduced a novel multimodal geometry dataset, Geo170K
    • Challenge 1: Geometric Figure Comprehension Existing MLLMs perform poorly on images labeled with geometric information. To tackle this challenge, the paper synthesizes a new multimodal geometry dataset, Geo170K, using existing datasets and leveraging text-only LLMs (e.g., ChatGPT). They incorporate unique geometric logic forms, representational uniqueness, and geometric scalability, among other features, creating a dataset that is 28 times larger than GeoQA+ and significantly expands the coverage of geometric problems.
    • Challenge 2: MLLMs Understanding and Problem Solving In response to existing model limitations in understanding geometric figures and problem-solving, the paper develops G-LLaVA, a robust MLLM based on the Geo170K dataset. This MLLM is capable of solving geometric problems, significantly surpassing existing SOTA models in performance. Notably, G-LLaVA-13B improves performance by 27.4 on the GPS mini-test split of MathVista, and with only 7B parameters, G-LLaVA surpasses the powerful GPT4-V.

Implementation and Deployment

G-LLaVA demonstrates exceptional problem-solving abilities in geometry on the Geo170K dataset. In particular, G-LLaVA-13B improves performance by 27.4 on the GPS mini-test split of MathVista compared to LLaVA-13B. Even with only 7B parameters, G-LLaVA can surpass the performance of the powerful GPT-4-V in geometry problem-solving tests. Furthermore, the paper promises that code and data resources will be provided in the future for broader research and development use.

Summary

This paper overcomes the limitations of multimodal large language models in solving geometric problems by constructing the Geo170K dataset and developing the G-LLaVA model based on it, achieving better performance than existing state-of-the-art models.