Mooncake is the serving platform for Kimi, a leading LLM service provided by Moonshot AI.

kvcache-ai/Mooncake

Mooncake: A KVCache-centric Disaggregated Architecture for LLM Serving

📃 Technical Report

Mooncake is the serving platform for Kimi, a leading LLM service provided by Moonshot AI. This repository hosts its technical report and the open-sourced traces.

More will come - perhaps not very soon, but stay tuned!

🔥 Updates

  • July 9, 2024: We open-sourced the trace as a jsonl file!
  • June 27, 2024: We published a series of Chinese-language blog posts with further discussion on Zhihu: 1, 2, 3, 4.
  • June 26, 2024: Initial technical report release.

🎉 Overview

Mooncake features a KVCache-centric disaggregated architecture that separates the prefill and decoding clusters. It also leverages the underutilized CPU, DRAM, and SSD resources of the GPU cluster to implement a disaggregated cache of KVCache.

(Figure: Mooncake architecture)

The core of Mooncake is its KVCache-centric scheduler, which balances maximizing overall effective throughput against meeting latency-related Service Level Objectives (SLOs). Unlike traditional studies that assume all requests will be processed, Mooncake must cope with highly overloaded scenarios. To mitigate overload, we developed a prediction-based early rejection policy. Experiments show that Mooncake excels in long-context scenarios: compared to the baseline method, it achieves up to a 525% increase in throughput in certain simulated scenarios while adhering to SLOs. Under real workloads, Mooncake's architecture enables Kimi to handle 75% more requests.
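To make the idea of prediction-based early rejection concrete, here is a toy sketch in Python. It is not Mooncake's scheduler: the `LoadSnapshot` fields, the `admit` function, and the capacity model are all hypothetical, chosen only to illustrate rejecting a request up front when the predicted load at the time it would start decoding exceeds what the SLOs allow.

```python
from dataclasses import dataclass


@dataclass
class LoadSnapshot:
    """Hypothetical view of cluster load when a request arrives."""
    prefill_wait_s: float   # estimated queueing + prefill time for the new request
    decode_active: int      # requests currently decoding
    decode_finishing: int   # of those, expected to finish before the new prefill ends
    prefill_in_flight: int  # requests in prefill that will start decoding before it


def predicted_decode_active(s: LoadSnapshot) -> int:
    """Predict how many requests will be decoding when the new one arrives there."""
    return s.decode_active - s.decode_finishing + s.prefill_in_flight


def admit(s: LoadSnapshot, ttft_slo_s: float, decode_capacity: int) -> bool:
    """Accept only if both stages are predicted to stay within their limits.

    Rejecting before prefill avoids spending prefill compute on a request
    that the decoding stage would not be able to serve within its SLO.
    """
    if s.prefill_wait_s > ttft_slo_s:
        return False
    return predicted_decode_active(s) < decode_capacity


if __name__ == "__main__":
    snapshot = LoadSnapshot(prefill_wait_s=2.0, decode_active=10,
                            decode_finishing=4, prefill_in_flight=3)
    print(admit(snapshot, ttft_slo_s=5.0, decode_capacity=10))
```

The point of predicting, rather than checking the current decoding load, is that the decoding stage may look very different by the time prefill completes.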

📦 Open Source Trace

{
    "timestamp": 27482,
    "input_length": 6955,
    "output_length": 52,
    "hash_ids": [46, 47, 48, 49, 50, 51, 52, 53, 54, 55, 56, 57, 2353, 2354]
}
{
    "timestamp": 30535,
    "input_length": 6472,
    "output_length": 26,
    "hash_ids": [46, 47, 48, 49, 50, 51, 52, 53, 54, 55, 56, 57, 2366]
}

The above presents two samples from our trace dataset. Each record includes the request arrival time, the number of input tokens, the number of output tokens, and the remapped block hashes. To protect our customers' privacy, we applied several mechanisms to remove user-related information while preserving the dataset's utility for simulated evaluation. More details about the trace (e.g., its cache hit ratio of up to 50%) can be found in Section 4 of Version 3 of the paper.
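As a rough illustration of how the trace might be used in a simulation, the following Python sketch parses records and estimates a block-level cache hit ratio. It makes several assumptions not stated above: that the jsonl file holds one JSON object per line (the samples are pretty-printed here for readability), that an identical hash ID means the block's KVCache can be reused, and that the cache is unbounded. The function names `load_trace` and `block_hit_ratio` are made up for this example.

```python
import json


def load_trace(lines):
    """Parse JSONL records, one JSON object per non-empty line."""
    return [json.loads(line) for line in lines if line.strip()]


def block_hit_ratio(records):
    """Fraction of blocks whose hash ID appeared in an earlier request.

    Assumes an unbounded cache with no eviction, so this is an upper
    bound on what a real, capacity-limited cache could achieve.
    """
    seen = set()
    hits = total = 0
    for rec in sorted(records, key=lambda r: r["timestamp"]):
        for block_id in rec["hash_ids"]:
            total += 1
            if block_id in seen:
                hits += 1
        # Blocks become reusable only after the request has been processed.
        seen.update(rec["hash_ids"])
    return hits / total if total else 0.0


# The two sample records from above, written one per line.
sample_lines = [
    '{"timestamp": 27482, "input_length": 6955, "output_length": 52, '
    '"hash_ids": [46, 47, 48, 49, 50, 51, 52, 53, 54, 55, 56, 57, 2353, 2354]}',
    '{"timestamp": 30535, "input_length": 6472, "output_length": 26, '
    '"hash_ids": [46, 47, 48, 49, 50, 51, 52, 53, 54, 55, 56, 57, 2366]}',
]

records = load_trace(sample_lines)

if __name__ == "__main__":
    print(f"{block_hit_ratio(records):.3f}")
```

For the full trace, pass an open file handle to `load_trace` in place of `sample_lines`.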

📑 Citation

Please cite our paper if you find the paper or the trace useful:
@article{qin2024mooncake,
  title        = {Mooncake: A KVCache-centric Disaggregated Architecture for LLM Serving},
  author       = {Ruoyu Qin and Zheming Li and Weiran He and Mingxing Zhang and Yongwei Wu and Weimin Zheng and Xinran Xu},
  year         = {2024},
  url          = {https://arxiv.org/abs/2407.00079}
}
