# Guide on Hyperparameter Tuning

## Achieving Peak Throughput

Sustaining a large batch size is the most important factor for achieving high throughput.

When the server is running at full load, look for the following line in the log:

```
[gpu_id=0] #running-req: 233, #token: 370959, token usage: 0.82, gen throughput (token/s): 4594.01, #queue-req: 417
```

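When monitoring a long run, it can help to extract these metrics programmatically. Below is a minimal sketch of a log parser for lines shaped like the example above; it is a hypothetical helper, not part of SGLang, and assumes the exact field names shown in this guide's sample log line.

```python
import re

# Matches the metric fields in a decode-batch log line like the one shown above.
LOG_PATTERN = re.compile(
    r"#running-req: (?P<running_req>\d+), "
    r"#token: (?P<token>\d+), "
    r"token usage: (?P<token_usage>[\d.]+), "
    r"gen throughput \(token/s\): (?P<throughput>[\d.]+), "
    r"#queue-req: (?P<queue_req>\d+)"
)

def parse_decode_log(line: str) -> dict:
    """Return the scheduler metrics from one log line, or {} if none are found."""
    m = LOG_PATTERN.search(line)
    if m is None:
        return {}
    d = m.groupdict()
    return {
        "running_req": int(d["running_req"]),
        "token": int(d["token"]),
        "token_usage": float(d["token_usage"]),
        "throughput": float(d["throughput"]),
        "queue_req": int(d["queue_req"]),
    }

line = ("[gpu_id=0] #running-req: 233, #token: 370959, token usage: 0.82, "
        "gen throughput (token/s): 4594.01, #queue-req: 417")
metrics = parse_decode_log(line)
print(metrics["token_usage"], metrics["queue_req"])
```

Feeding the server's stdout through such a parser lets you plot `token usage` and `#queue-req` over time instead of eyeballing the log.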
### Tune Your Request Submission Speed
`#queue-req` indicates the number of requests waiting in the queue. If you frequently see `#queue-req == 0`, your client is bottlenecked by its request submission speed, and the server cannot form large batches.
A healthy range for `#queue-req` is `100 - 3000`.

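The rule of thumb above can be expressed as a small check. This is purely illustrative; the thresholds come from the `100 - 3000` range stated here, and the labels are mine, not SGLang's.

```python
def queue_health(queue_req: int, low: int = 100, high: int = 3000) -> str:
    """Classify a #queue-req reading against the healthy 100-3000 range."""
    if queue_req == 0:
        return "starved: increase client concurrency / submission speed"
    if queue_req < low:
        return "low: the server may not be able to form large batches"
    if queue_req > high:
        return "overloaded: consider backing off the client"
    return "healthy"

print(queue_health(417))  # the sample log line above falls in the healthy range
```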
### Tune `--schedule-conservativeness`
`token usage` indicates the KV cache memory utilization of the server. `token usage > 0.9` means good utilization.
If you frequently see `token usage < 0.9` together with `#queue-req > 0`, the server is too conservative about taking in new requests. You can decrease `--schedule-conservativeness` to a value like 0.5.
The server can become too conservative when users send many requests with a large `max_new_tokens` but the requests stop very early due to an EOS token or stop strings.

On the other hand, if you see `token usage` very high and frequently see warnings like
`decode out of memory happened, #retracted_reqs: 1, #new_token_ratio: 0.9998 -> 1.0000`, you can increase `--schedule-conservativeness` to a value like 1.3.

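The two adjustment rules above can be summarized as a decision helper. This is a sketch of the logic described in this section only; the function name and defaults are mine, and the suggested values 0.5 and 1.3 come straight from the text.

```python
def suggest_schedule_conservativeness(
    token_usage: float,
    queue_req: int,
    saw_retraction_warning: bool,
    current: float = 1.0,
) -> float:
    """Suggest a --schedule-conservativeness value from the rules in this guide."""
    if saw_retraction_warning and token_usage > 0.9:
        # KV cache is overcommitted: admit fewer new requests per step.
        return max(current, 1.3)
    if token_usage < 0.9 and queue_req > 0:
        # Memory is underused while work is waiting: admit requests more eagerly.
        return min(current, 0.5)
    return current  # metrics look healthy; keep the current setting

print(suggest_schedule_conservativeness(0.6, 200, False))
```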
### Tune `--dp-size` and `--tp-size`
Data parallelism gives better throughput than tensor parallelism. When there is enough GPU memory to hold a full copy of the model, favor data parallelism.

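As a sketch, a launch command preferring data parallelism might look like this. It assumes the standard `sglang.launch_server` entry point and the `--dp-size`/`--tp-size` flags named in this section; substitute your own model path and sizes.

```shell
# Hypothetical example: 4-way data parallelism, no tensor parallelism,
# assuming one full model copy fits on a single GPU.
python -m sglang.launch_server \
  --model-path meta-llama/Llama-2-7b-chat-hf \
  --dp-size 4 \
  --tp-size 1
```

If the model does not fit on one GPU, trade some `--dp-size` for `--tp-size` until it does.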
### (Minor) Tune `--schedule-heuristic`
If your workload has many shared prefixes, use the default `--schedule-heuristic lpm`. `lpm` stands for longest prefix match.
If your workload has no shared prefixes at all, you can try `--schedule-heuristic fcfs`. `fcfs` stands for first come, first served.

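To build intuition for why `lpm` helps with shared prefixes, here is a conceptual sketch, not SGLang's actual scheduler code: longest-prefix-match ordering serves the waiting requests that share the most prefix with what is already cached, so the cached KV entries get reused.

```python
def shared_prefix_len(a: str, b: str) -> int:
    """Length of the common prefix of two sequences."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n

def order_lpm(waiting: list[str], cached_prefix: str) -> list[str]:
    """Longest-prefix-match: prioritize requests that best reuse the cache."""
    return sorted(waiting,
                  key=lambda r: shared_prefix_len(r, cached_prefix),
                  reverse=True)

waiting = ["translate: hola", "summarize: doc A", "summarize: doc B"]
print(order_lpm(waiting, "summarize: doc A..."))
```

When no requests share a prefix, this ordering degenerates to an arbitrary one, which is why plain `fcfs` (submission order) is a reasonable choice for such workloads.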
### (Minor) Tune `--max-prefill-tokens`, `--mem-fraction-static`, `--max-running-requests`
If you see out-of-memory errors, you can decrease these values. Otherwise, the defaults should work well.