From 1e6aafd1b7ee897dfe505bf1f89626928ef5fa29 Mon Sep 17 00:00:00 2001
From: baichuanzhou
Date: Fri, 30 Aug 2024 12:53:58 +0800
Subject: [PATCH] update index

---
 index.html | 249 ++++++++++++++++++++++++++++++++++++-----------------
 1 file changed, 168 insertions(+), 81 deletions(-)

diff --git a/index.html b/index.html
index 4fb335f..f53be3d 100644
--- a/index.html
+++ b/index.html
@@ -189,60 +189,60 @@


Abstract

Recent evaluations of Large Multimodal Models (LMMs) have explored their capabilities in various domains, but only a few benchmarks specifically focus on urban environments. Moreover, existing urban benchmarks have been limited to evaluating LMMs on basic region-level urban tasks under a single view, leading to incomplete evaluations of LMMs' abilities in urban environments. To address these issues, we present UrBench, a comprehensive benchmark designed for evaluating LMMs in complex multi-view urban scenarios.

UrBench Overview

We propose UrBench, a multi-view benchmark designed to evaluate LMMs' performance in urban environments. Our benchmark includes 14 urban tasks that we categorize along several dimensions. These tasks encompass both region-level evaluations that assess LMMs' capabilities in urban planning and role-level evaluations that examine LMMs' responses to daily issues.

Comparisons with Existing Benchmarks

Compared to previous benchmarks, UrBench offers:
  • Region-level and role-level questions. UrBench contains diverse questions at both the region and role level, while previous benchmarks only offer limited task types such as counting and object recognition.

Detailed Statistics of UrBench

[Table: detailed statistics of UrBench]

Statistics & Characteristics of UrBench

UrBench introduces 14 diverse tasks in urban environments, covering multiple different views. While humans handle most of these tasks with ease, we find that LMMs still struggle.

Qualitative Results


Evaluation


Evaluation Results


UrBench poses significant challenges to current SoTA LMMs. We find that the best-performing closed-source model, GPT-4o, and the best open-source model, VILA-1.5-40B, achieve only 61.2% and 53.1% accuracy, respectively. Interestingly, our findings indicate that the primary limitation of these models lies in their ability to comprehend UrBench questions, not in their capacity to process multiple images, as the performance gap between multi-image models and their single-image counterparts, such as LLaVA-NeXT-Interleave and LLaVA-NeXT-8B in the table, is small. Overall, the challenging nature of our benchmark indicates that current LMMs' strong performance on general benchmarks does not generalize to multi-view urban scenarios.
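For context, the accuracies above are plain multiple-choice accuracies. Below is a minimal sketch of how such scores can be computed once each model response has been parsed to an option letter; the record layout is illustrative only and is not taken from the UrBench release.

    from collections import defaultdict

    def accuracy_by_task(records):
        """Micro-averaged overall accuracy plus per-task accuracy.

        Each record is a dict with illustrative keys:
          'task'       -- one of the benchmark's task names
          'answer'     -- gold option letter, e.g. 'B'
          'prediction' -- option letter parsed from the model output
        """
        correct, total = defaultdict(int), defaultdict(int)
        for r in records:
            total[r["task"]] += 1
            correct[r["task"]] += r["prediction"] == r["answer"]
        per_task = {t: correct[t] / total[t] for t in total}
        overall = sum(correct.values()) / sum(total.values())
        return overall, per_task

    # Toy usage with made-up records:
    demo = [
        {"task": "counting", "answer": "A", "prediction": "A"},
        {"task": "counting", "answer": "C", "prediction": "B"},
        {"task": "geo-localization", "answer": "D", "prediction": "D"},
    ]
    overall, per_task = accuracy_by_task(demo)
    print(f"overall: {overall:.1%}")  # overall: 66.7%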

Performances of LMMs and human experts on the UrBench test set.

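The multi-image versus single-image comparison above requires feeding multi-view questions to models that accept only one image. A common workaround is to tile the views into a single image before prompting; the sketch below is our own illustration of that idea, and UrBench's evaluation code may handle single-image models differently.

    from PIL import Image

    def stitch_views(paths, height=448):
        """Tile multi-view images side by side so that a single-image
        model can be queried on a multi-view question."""
        views = []
        for path in paths:
            im = Image.open(path).convert("RGB")
            width = round(im.width * height / im.height)  # keep aspect ratio
            views.append(im.resize((width, height)))
        canvas = Image.new("RGB", (sum(v.width for v in views), height))
        x = 0
        for v in views:
            canvas.paste(v, (x, 0))
            x += v.width
        return canvas

    # e.g. stitch_views(["street_view.jpg", "satellite_view.jpg"])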
Case Study

We present randomly selected samples from the 14 tasks of UrBench, with responses from GPT-4o, VILA-1.5-40B, and Claude-3.5-Sonnet attached.