Background

  • Background The paper addresses the sample efficiency of large language models (e.g., GPT and Llama) by changing how they are trained. Typically, these models are trained with a next-token prediction loss.

  • Existing Work Existing training methods predict only a single token at a time and do not exploit the ability to predict multiple future tokens at once, especially at the scale of large models and datasets.

Core Contributions

  • Introduced a training method based on multi-token prediction
    • Challenge 1: Enhance sample efficiency Current language models are trained predominantly with next-token prediction, which considers only one future token. The proposed method instead requires the model to predict the next n tokens at each position of the training corpus, using a shared trunk and n independent output heads (see the first sketch after this list). Multi-token prediction improves performance on downstream code and natural language tasks at no extra training time cost, with gains that grow as the model scales up and that help cultivate algorithmic reasoning capabilities.
    • Challenge 2: Improve inference speed Implementing greedy self-speculative decoding with heterogeneous batch sizes can increase inference speed by 3x for models trained with 4-token prediction (see the second sketch after this list). The paper also presents a memory-efficient implementation of multi-token prediction, improving inference efficiency without compromising training efficiency.
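
A minimal sketch of the shared-trunk, multi-head setup described above, written in PyTorch. The trunk here is a stand-in recurrent layer rather than the paper's transformer, and the layer sizes, module names, and uniform loss averaging are illustrative assumptions, not the paper's exact architecture.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MultiTokenPredictor(nn.Module):
    """Shared trunk with n independent output heads (hypothetical sketch)."""

    def __init__(self, vocab_size: int, d_model: int = 256, n_future: int = 4):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, d_model)
        # Stand-in for the shared transformer trunk (causal by construction).
        self.trunk = nn.LSTM(d_model, d_model, batch_first=True)
        # One independent output head per future offset 1..n_future.
        self.heads = nn.ModuleList(
            nn.Linear(d_model, vocab_size) for _ in range(n_future)
        )

    def forward(self, tokens):
        h, _ = self.trunk(self.embed(tokens))      # (B, T, d_model)
        return [head(h) for head in self.heads]    # n_future logit tensors

def multi_token_loss(logits_per_head, tokens):
    """Average cross-entropy over heads; head k targets the token k+1 steps ahead."""
    losses = []
    for k, logits in enumerate(logits_per_head):
        offset = k + 1
        pred = logits[:, :-offset, :]              # positions that have a valid target
        target = tokens[:, offset:]                # tokens shifted by k+1
        losses.append(F.cross_entropy(pred.reshape(-1, pred.size(-1)),
                                      target.reshape(-1)))
    return torch.stack(losses).mean()
```

At inference time only the first head is needed for ordinary next-token decoding; the extra heads come into play in the speculative setup sketched next.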
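
The extra heads can also act as a free draft model for greedy self-speculative decoding. The sketch below, again hypothetical, drafts n tokens from the heads and checks them with the next-token head in one additional forward pass; the function and variable names are assumptions, and in a real transformer the verification pass would reuse the KV cache instead of re-running from scratch.

```python
@torch.no_grad()
def self_speculative_step(model, tokens):
    """One greedy decode step: draft with the extra heads, verify with head 1.

    Assumes the MultiTokenPredictor interface above and batch size 1.
    """
    logits_per_head = model(tokens)
    # Greedy draft: head k proposes the token k+1 positions past the last input.
    drafts = torch.stack(
        [logits[:, -1, :].argmax(-1) for logits in logits_per_head], dim=1
    )                                              # (1, n_future)
    candidate = torch.cat([tokens, drafts], dim=1)
    # Single verification pass using the ordinary next-token head.
    verify = model(candidate)[0].argmax(-1)        # (1, T + n_future)
    T = tokens.size(1)
    accepted = 1                                   # head 1 is exact greedy decoding
    for j in range(1, drafts.size(1)):
        # The next-token head at position T+j-1 re-predicts drafted position T+j.
        if verify[0, T + j - 1] != drafts[0, j]:
            break                                  # first mismatch: reject the rest
        accepted += 1
    return candidate[:, : T + accepted]

# Illustrative usage: repeat speculative steps until enough tokens are produced.
# tokens = prompt_ids                              # (1, T) tensor of token ids (assumed)
# while tokens.size(1) < target_length:
#     tokens = self_speculative_step(model, tokens)
```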

Implementation and Deployment

The experiments demonstrate that multi-token prediction becomes more effective as model size increases, with training runs covering at least 91 billion tokens of code. At the same computational budget, large models gain substantially on the MBPP and HumanEval benchmarks when trained with multi-token prediction. Byte-level models also benefit markedly from multi-byte prediction, solving 67% more problems on MBPP pass@1. Training with 4 future tokens consistently outperforms the alternatives on HumanEval and MBPP across pass@1, pass@10, and pass@100. These findings suggest that multi-token prediction not only improves performance at larger scales but also significantly accelerates inference, particularly for byte-level prediction models.
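
For reference, the pass@k scores cited above are typically computed with the standard unbiased estimator of Chen et al. (2021): draw n samples per problem, count the c that pass the tests, and estimate the probability that at least one of k samples is correct. This is the generic metric definition, not code from the paper, and the numbers in the example are made up.

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased estimate of pass@k from n samples with c correct."""
    if n - c < k:
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

# Illustrative only: 200 samples per problem, 37 passing -> pass@1 = 0.185
print(pass_at_k(200, 37, 1))
```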

Summary

The paper proposes a novel method for training large language models by predicting multiple future tokens instead of a single one, improving sample efficiency, boosting performance on generative tasks, and speeding up inference. Experiments confirm the significant advantages of this approach for both the performance and the inference speed of large models.