2312.14385.md

Background

  • Background This paper traces the evolution of large-scale generative AI models from text (1D) to image (2D) and video (3D) generation, which introduces unique challenges for performance and efficiency. System design research for multi-modal text-to-image (TTI) and text-to-video (TTV) generative models remains scarce.

  • Existing Work Existing systems have been optimized for large language models (LLMs), for example through Flash Attention. However, TTI and TTV models have distinct properties, so these workloads may not benefit equally from LLM-oriented optimizations; in-depth analysis is needed to identify optimization opportunities specific to TTI/TTV.

Core Contributions

  • Introduced a comprehensive study of TTI/TTV system performance, organized around two challenges:
    • Challenge 1: System Performance Bottlenecks The paper conducts a systematic performance characterization of representative TTI/TTV models, revealing that even after state-of-the-art optimizations such as Flash Attention are applied, convolution accounts for up to 44% of execution time in Diffusion-based TTI models, while linear layers consume up to 49% in Transformer-based models. Optimizations designed for LLMs therefore do not transfer directly to TTI/TTV models, and a thorough characterization of these workloads is required to uncover new optimization opportunities (see the profiling sketch after this list).

    • Challenge 2: Variability in Sequence Length Unlike LLMs, TTI/TTV workloads see sequence length vary during inference, by up to 4x over the course of Diffusion model inference. This variability affects computational intensity and suggests opportunities for tailored system design (see the sequence-length sketch after this list).
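
Per-operator time breakdowns like the one behind Challenge 1 are typically gathered with a profiler. Below is a minimal sketch using PyTorch's built-in profiler on a stand-in model; the model, shapes, and CPU-only profiling are illustrative assumptions, not the paper's actual setup.

```python
import torch
from torch.profiler import profile, ProfilerActivity

# Stand-in for a Diffusion-style block mixing convolution and linear layers.
# A real characterization would profile a full TTI/TTV model on GPU.
model = torch.nn.Sequential(
    torch.nn.Conv2d(64, 64, kernel_size=3, padding=1),
    torch.nn.Flatten(start_dim=2),           # (B, C, H*W)
    torch.nn.Linear(32 * 32, 32 * 32),
).eval()

x = torch.randn(1, 64, 32, 32)

with profile(activities=[ProfilerActivity.CPU]) as prof:
    with torch.no_grad():
        model(x)

# Rank operators by self time to see which dominate execution -- the same
# kind of breakdown that surfaces convolution and linear layers as the
# bottlenecks once attention is already optimized.
print(prof.key_averages().table(sort_by="self_cpu_time_total", row_limit=10))
```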
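
For Challenge 2, a tiny worked example shows where the sequence-length variation comes from: a Diffusion UNet runs attention at several spatial resolutions, and the token count is the number of spatial positions H·W at each stage. The resolutions below are hypothetical, chosen only to illustrate the 4x change between adjacent stages.

```python
# Hypothetical UNet stages: (height, width) of the latent at each attention layer.
stages = [(64, 64), (32, 32), (16, 16)]

for h, w in stages:
    seq_len = h * w  # attention sequence length at this resolution
    print(f"{h}x{w} -> sequence length {seq_len}")

# 64x64 -> 4096, 32x32 -> 1024, 16x16 -> 256:
# each 2x downsampling step changes the sequence length by 4x.
```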

Implementation and Deployment

The paper quantitatively characterizes state-of-the-art TTI/TTV models, uncovering new system performance bottlenecks and comparing them against widely used language models such as LLaMA. It also analyzes the bottleneck posed by Temporal Attention, which is critical for temporally coherent frame generation in TTV models, in contrast to Spatial Attention. Temporal Attention was found to take roughly twice the execution time of Spatial Attention despite requiring about nine times fewer FLOPs, suggesting it is memory-bound rather than compute-bound. Moreover, the analysis shows the Temporal Attention bottleneck growing sharply as the frame count increases, since attention cost scales quadratically with sequence length, pointing to the need for future system optimizations. A sketch of the two attention variants follows.
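
A minimal sketch of the two attention patterns, assuming a (batch, frames, spatial_tokens, dim) activation layout; the layout, shapes, and single-head attention are illustrative assumptions, not the paper's implementation.

```python
import torch
import torch.nn.functional as F

def spatial_attention(x: torch.Tensor) -> torch.Tensor:
    """Self-attention across spatial tokens within each frame.

    FLOPs scale as frames * spatial_tokens^2 * dim.
    """
    b, f, s, d = x.shape
    q = x.reshape(b * f, s, d)  # fold frames into the batch dimension
    out = F.scaled_dot_product_attention(q, q, q)
    return out.reshape(b, f, s, d)


def temporal_attention(x: torch.Tensor) -> torch.Tensor:
    """Self-attention across frames at each spatial location.

    FLOPs scale as spatial_tokens * frames^2 * dim -- far fewer than spatial
    attention when frames << spatial_tokens, yet the permutes below imply
    data movement that can leave execution memory-bound.
    """
    b, f, s, d = x.shape
    q = x.permute(0, 2, 1, 3).reshape(b * s, f, d)  # fold space into batch
    out = F.scaled_dot_product_attention(q, q, q)
    return out.reshape(b, s, f, d).permute(0, 2, 1, 3)


# Example: 16 frames of 32x32 latent tokens with dim 64.
x = torch.randn(1, 16, 32 * 32, 64)
assert spatial_attention(x).shape == x.shape
assert temporal_attention(x).shape == x.shape
```

The frames^2 term in temporal attention is one way to read the steep scaling the paper reports as frame counts increase.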

Summary

The paper is the first to characterize system performance for models spanning text, image, and video generation, revealing system properties distinct from those of traditional LLMs. It also highlights challenges and opportunities where traditional optimizations may need rethinking for TTI/TTV models.