2311.1077.md

File metadata and controls

21 lines (16 loc) · 3.48 KB

Background

  • Background In large language models, feedforward layers account for the bulk of the parameters, yet not every neuron needs to be engaged for every input at inference time. As an accessible proof of this claim, the researchers propose UltraFastBERT, a variant of BERT that uses only 0.3% of its neurons during inference while performing comparably to similarly sized BERT models on downstream tasks.

  • Existing Work Prior work has accelerated the attention mechanism but has not substantially optimized the intermediate feedforward networks. Conventional feedforward layers rely on dense matrix multiplication (DMM), and performance on downstream tasks likewise indicates that not all of their neurons are essential for every inference.

Core Contributions

  • Introduced UltraFastBERT
    • Challenge 1: Enhancing the computational efficiency of feedforward networks Although feedforward layers hold the majority of the parameters in large language models, not all of their neurons need to participate in computation at inference time. UltraFastBERT replaces the traditional feedforward networks with fast feedforward networks (FFFs), which greatly improves computational efficiency: each layer activates only 12 of its 4095 neurons per inference via Conditional Matrix Multiplication (CMM), in which the dot product of the input with a subset of the neuron weights is conditioned on the outputs of the preceding dot products. This removes the main cost of traditional feedforward networks, where every input must be multiplied against every weight; a minimal sketch of the per-token computation appears after this list.

    • Challenge 2: Lack of efficient implementations and optimizations There is no native, efficient implementation of CMM, and no popular deep learning framework offers an interface through which it could be implemented other than as a high-level simulation. The researchers provide a set of CPU implementations based on the pointer-batched matrix multiplication routines of the BLAS library and, later in the paper, compare CPU and GPU implementations at various levels of optimization, noting that while notable acceleration is already clearly demonstrated, further gains remain possible. This also explains why the measured speedup over traditional DMM is 78x rather than the 341x that the level of neuron sparsity would theoretically allow.

    • ...
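
To make the CMM of Challenge 1 concrete, here is a minimal per-token sketch. It assumes the layer's neurons are stored as a heap-ordered full binary tree and uses a GELU activation; the names (`fff_forward`, `w_in`, `w_out`, `depth`) are illustrative, not taken from the paper's code.

```python
import numpy as np

def gelu(u):
    # tanh approximation of the GELU activation
    return 0.5 * u * (1.0 + np.tanh(np.sqrt(2.0 / np.pi) * (u + 0.044715 * u ** 3)))

def fff_forward(x, w_in, w_out, depth):
    """Per-token conditional matrix multiplication (CMM) sketch.

    x:     (d_model,)          token representation
    w_in:  (n_nodes, d_model)  input weight vector of each tree node (neuron)
    w_out: (n_nodes, d_model)  output weight vector of each tree node
    Only `depth` of the n_nodes = 2**depth - 1 neurons are touched per token.
    """
    y = np.zeros_like(x)
    node = 0                              # start at the root of the heap-ordered tree
    for _ in range(depth):
        u = w_in[node] @ x                # a single dot product, not a full matmul
        y += gelu(u) * w_out[node]        # this node's contribution to the output
        node = 2 * node + 1 + int(u > 0)  # descend left or right based on the sign of u
    return y
```

With depth = 12, w_in and w_out each hold 2^12 - 1 = 4095 rows, and a forward pass touches exactly 12 of them, which is the 0.3% neuron usage quoted above.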

Implementation and Deployment

According to the paper, UltraFastBERT underwent a standardized training procedure and was fine-tuned on multiple tasks of the GLUE benchmark (RTE, MRPC, SST, STS-B, MNLI, QQP, QNLI, and CoLA), with 5 epochs of fine-tuning and a learning rate of 4×10⁻⁵ for all tasks. The results show that UltraFastBERT performs on par with other BERT-like models. In addition, the paper provides a simple CPU implementation of CMM that runs 78x faster than the optimized traditional DMM implementation, as well as a PyTorch implementation that is 40x faster than the equivalent batched feedforward inference.
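
For illustration, a high-level, framework-only simulation of batched CMM might look like the sketch below. This is an assumption for exposition, not the authors' released PyTorch implementation, and the names `cmm_batched`, `W_in`, and `W_out` are made up; each row gathers one weight vector per tree level instead of multiplying by the full weight matrix.

```python
import torch

def cmm_batched(X, W_in, W_out, depth):
    """Batched CMM sketch: every row of X walks its own root-to-leaf path
    through a heap-ordered tree of neurons, touching `depth` weight vectors.

    X: (batch, d_model); W_in, W_out: (2**depth - 1, d_model)
    """
    nodes = torch.zeros(X.shape[0], dtype=torch.long)  # every row starts at the root
    Y = torch.zeros_like(X)
    for _ in range(depth):
        u = (X * W_in[nodes]).sum(dim=1)                # one dot product per row
        Y += torch.nn.functional.gelu(u).unsqueeze(1) * W_out[nodes]
        nodes = 2 * nodes + 1 + (u > 0).long()          # branch on the sign of u
    return Y

# Example: 4095 neurons with 12 used per row; d_model = 768 is an assumed hidden size.
d_model, depth = 768, 12
n_nodes = 2 ** depth - 1
X = torch.randn(32, d_model)
Y = cmm_batched(X, torch.randn(n_nodes, d_model), torch.randn(n_nodes, d_model), depth)
```

Even in this simulated form, the per-row work in each layer scales with the tree depth (12 dot products) rather than with the 4095-neuron layer width, which is the intuition behind the reported speedups.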

Summary

The paper introduces UltraFastBERT, a BERT variant that uses fast feedforward networks to sharply reduce the number of neurons engaged during inference, greatly improving computational efficiency. Although CMM lacks a native efficient implementation, the authors provide CPU code that substantially accelerates inference, and the model performs well on standard downstream tasks. This work demonstrates the substantial potential of conditional neural execution in language modeling.