VideoAgent (arXiv 2403.10517)
Background

  • Background The article addresses long-form video understanding, a significant challenge in computer vision that requires processing multi-modal information and reasoning effectively over long sequences. Current large language models (LLMs) excel at reasoning and handling long contexts but cannot process visual information, while visual language models (VLMs) struggle to model long visual inputs. Initial attempts to extend VLMs to long contexts fall short on video understanding benchmarks and handle long-form video content inefficiently.

  • Existing Work Existing models struggle to excel simultaneously at processing long sequences, handling multi-modal information, and reasoning effectively. Prior methods either sample frames uniformly or select them in a single pass, leaving much of the video content unused, or they retrieve frames using the original question as the query rather than rewriting it for more precise, fine-grained frame retrieval.

Core Contributions

  • Introduced VideoAgent, a system built around a large language model acting as an agent
    • Challenge 1: Reasoning and planning over long sequences. Long-form video understanding requires reasoning over long multi-modal sequences. VideoAgent imitates the human cognitive process of comprehending long videos: it iteratively selects and aggregates critical information to answer a query. The LLM acts as an agent that assesses the information gathered so far, decides whether more is needed, and uses CLIP to retrieve new frames containing that information, updating its state accordingly.
    • Challenge 2: Processing visual information. LLMs cannot process visual input directly. VideoAgent therefore employs VLMs to translate the visual content of newly retrieved frames into textual descriptions; together with CLIP, these models serve as tools that let the LLM understand visuals and retrieve information over a long context. A minimal sketch of this loop follows the list.
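
As a rough illustration of the iterative frame-selection loop described above, here is a minimal Python sketch. The `llm`, `vlm`, and `clip_retriever` interfaces, their method names, and the round/frame budgets are assumptions for illustration, not the paper's actual implementation.

```python
from dataclasses import dataclass, field

@dataclass
class State:
    """Textual state the LLM reasons over: the question plus per-frame captions."""
    question: str
    captions: dict = field(default_factory=dict)  # frame index -> text description

def video_agent(question, frames, llm, vlm, clip_retriever,
                n_initial=5, max_rounds=3, top_k=2):
    # 1. Seed the state with captions of a few uniformly sampled frames.
    state = State(question=question)
    step = max(1, len(frames) // n_initial)
    for i in range(0, len(frames), step):
        state.captions[i] = vlm.caption(frames[i])

    for _ in range(max_rounds):
        # 2. The LLM answers from the current textual state and self-assesses
        #    whether the gathered evidence is sufficient.
        answer, confident, missing_info = llm.answer_and_assess(question, state.captions)
        if confident:
            return answer
        # 3. Rewrite the missing information into retrieval queries and use
        #    CLIP to fetch new frames, then caption them to update the state.
        queries = llm.rewrite_queries(question, missing_info)
        for idx in clip_retriever.search(frames, queries, top_k=top_k):
            if idx not in state.captions:
                state.captions[idx] = vlm.caption(frames[idx])

    # 4. Once the round budget is exhausted, return the best available answer.
    return llm.answer_and_assess(question, state.captions)[0]
```

The design choice mirrored here is that the LLM, rather than a fixed sampling schedule, decides how many frames to inspect, which is what keeps the average number of processed frames low.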

Implementation and Deployment

Evaluated on the EgoSchema and NExT-QA long-form video understanding benchmarks, VideoAgent demonstrates strong effectiveness and efficiency. It achieves 54.1% and 71.3% accuracy, respectively, outperforming the concurrent state-of-the-art method LLoVi by 3.8% and 3.6%. Notably, VideoAgent uses only 8.4 frames on average, 20x fewer than LLoVi. Ablation studies highlight the importance of the iterative frame selection process, which adaptively searches for and aggregates relevant information based on video complexity. Case studies further show that VideoAgent generalizes to arbitrarily long videos, including those over an hour.

Summary

VideoAgent represents a substantial advance in long-form video understanding by mimicking the human cognitive process, emphasizing reasoning over long visual inputs rather than merely processing more frames. The work sets a new state of the art in long-form video understanding and offers insights for future research in this area.