Currently, our implementation cannot accurately estimate GPU memory usage, so some memory is wasted. If you want to maximize memory utilization, you can try setting --vram-budget to a value larger than the physical VRAM size, yet small enough to avoid Out of Memory (OOM) errors under your workload.
For parallel inference on the same prompt, you can use examples/batched. For parallel processing of different prompts, the --cont-batching option in examples/server may help, but we do not recommend it: our tests show it can produce significantly incorrect results.
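A rough sketch of the two suggestions above. Only the --vram-budget and --cont-batching flags come from this reply; the binary names, model path, and the budget value are illustrative placeholders, not verified invocations:

```shell
# Over-provision the VRAM budget slightly beyond physical VRAM (here a 24 GB
# card with a 26 GB budget); lower it if you hit OOM under your workload.
./main -m ./model.gguf --vram-budget 26 -p "Once upon a time"

# Parallel processing of different prompts via the server with continuous
# batching (not recommended: results may be significantly incorrect).
./server -m ./model.gguf --cont-batching
```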
Hello, I have two pressing questions and would appreciate your help:
1. GPU VRAM usage appears to be capped. Is it possible to raise the VRAM limit to 24GB to improve GPU utilization?
2. How can I submit multiple prompts at once for parallel inference?