How to specify which GPU the model runs inference on? #352
NCCL_P2P_DISABLE=1 CUDA_VISIBLE_DEVICES=2,3 python -m vllm.entrypoints.api_server --tensor-parallel-size 2 --host 127.0.0.1
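The same idea applies when using the Python API instead of the server: CUDA_VISIBLE_DEVICES has to be set before CUDA is initialized, i.e., before vLLM (and therefore torch) is imported. A minimal sketch, where the model name is only a placeholder:

```python
import os

# The mask must be set before CUDA is initialized,
# i.e., before importing vllm (which imports torch).
os.environ["CUDA_VISIBLE_DEVICES"] = "2,3"

from vllm import LLM

# "facebook/opt-125m" is just an illustrative model choice.
llm = LLM(model="facebook/opt-125m", tensor_parallel_size=2)
outputs = llm.generate("Hello, my name is")
```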
I have a question about my servers. It seems that when cuda:0 is almost full, it still fails even though I pass CUDA_VISIBLE_DEVICES to select other GPUs?
Oh, I found that the ray::worker processes are still taking the first two GPUs even when I specify the other two.
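(A quick sanity check for whether the mask is actually visible to a process, as a sketch to run before loading the model:)

```python
import os
import torch

# If CUDA_VISIBLE_DEVICES="2,3" took effect, only two devices
# are visible, and they are remapped to indices 0 and 1.
print(os.environ.get("CUDA_VISIBLE_DEVICES"))
print(torch.cuda.device_count())  # expected: 2, not 4
```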
@MasKong Can you elaborate a bit on this?

llm = LLM(model_name, max_model_len=50, tensor_parallel_size=2)
output = llm.generate(text)

You can find the complete issue here
Closing in preference to #3012
Hello, I have 4 GPUs. When I set tensor_parallel_size to 2 and run the service, it takes CUDA:0 and CUDA:1. My question is: if I want to start two workers (i.e., two processes that each deploy the same model), how do I make my second process take CUDA:2 and CUDA:3?
Because right now, if I just start the service without any configuration, it OOMs.
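A minimal sketch of the two-worker setup, assuming each worker is an independent API server process pinned to its own GPU pair via CUDA_VISIBLE_DEVICES (the ports are arbitrary illustrative choices):

```python
import os
import subprocess

# Launch two independent vLLM API servers; each child process
# sees only its own pair of GPUs through CUDA_VISIBLE_DEVICES.
common = ["python", "-m", "vllm.entrypoints.api_server",
          "--tensor-parallel-size", "2", "--host", "127.0.0.1"]

subprocess.Popen(common + ["--port", "8000"],
                 env={**os.environ, "CUDA_VISIBLE_DEVICES": "0,1"})
subprocess.Popen(common + ["--port", "8001"],
                 env={**os.environ, "CUDA_VISIBLE_DEVICES": "2,3"})
```

Because each server is a separate process with its own device mask, the two model replicas cannot contend for the same GPUs, which avoids the OOM from both landing on CUDA:0 and CUDA:1.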