Mask cache Performance Optimization for vllm (#939)
## Problem

The current implementation allocates a mask for every token during generation, which significantly impacts performance.

## Proposed Solution

To improve the performance, we can cache the mask on the device, as it depends on the allowed tokens from the FSM. Additionally, limiting the input to the hash function to the first 2k tokens results in a notable speedup.

## Discussion

While using only the first 2k tokens for the hash may introduce potential cache collisions, the likelihood of such collisions is very low.

## TODO

- [x] Provide measurements of the performance impact

---------

Co-authored-by: pgrundmann <pgrundmann@bht-berlin.de>
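
The caching idea described in the proposed solution can be sketched roughly as follows. This is a minimal illustration, not the code from this commit: the `MaskCache` class, its `get` method, the `misses` counter, and the choice of 2048 for "2k" are all hypothetical. It uses plain Python lists so it runs without dependencies; in practice the cached mask would presumably be a tensor kept on the device.

```python
import math
from typing import Dict, List, Sequence

# Assumption: "first 2k tokens" is read here as 2048; the real value may differ.
HASH_PREFIX_LEN = 2048


class MaskCache:
    """Hypothetical cache of logit masks keyed by a hash of the allowed tokens."""

    def __init__(self, vocab_size: int) -> None:
        self.vocab_size = vocab_size
        self._cache: Dict[int, List[float]] = {}
        self.misses = 0  # counts how many masks were actually built

    def get(self, allowed_tokens: Sequence[int]) -> List[float]:
        # Hash only a bounded prefix of the allowed tokens to keep hashing cheap.
        # Two different token lists sharing the same prefix would collide.
        key = hash(tuple(allowed_tokens[:HASH_PREFIX_LEN]))
        mask = self._cache.get(key)
        if mask is None:
            self.misses += 1
            # Build the mask once: -inf blocks a token, 0.0 leaves its logit as is.
            mask = [-math.inf] * self.vocab_size
            for token_id in allowed_tokens:
                mask[token_id] = 0.0
            self._cache[key] = mask
        return mask
```

Repeated calls with the same allowed tokens return the stored mask instead of allocating a new one. The trade-off noted in the discussion shows up in the key: lists that differ only beyond the hashed prefix would map to the same entry.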