running speed slow on NVIDIA vGPU #45

Open

foricee opened this issue Nov 11, 2023 · 0 comments

foricee commented Nov 11, 2023

I tested the GPTQ-quantized qwen-7b model on a vGPU with half the performance of an A10.

Driver Version: 470.161.03
CUDA Version: 11.4

I have noticed that both context processing and decoding are particularly slow:

context (500 tokens) processing speed: 48 tokens/s
decode speed: 1.6 tokens/s
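
For anyone trying to reproduce these two numbers, here is a minimal sketch of how context processing and decoding could be timed separately. It is not the script used for the measurements above. `measure_throughput`, `prefill`, and `decode_step` are hypothetical names; the two callables are assumed to wrap whatever model calls you actually make. On CUDA, kernels launch asynchronously, so each callable should end with `torch.cuda.synchronize()` or the timings will be too optimistic.

```python
import time
from typing import Callable, Dict


def measure_throughput(
    prefill: Callable[[], None],
    decode_step: Callable[[], None],
    prompt_tokens: int,
    new_tokens: int,
    clock: Callable[[], float] = time.perf_counter,
) -> Dict[str, float]:
    """Time the prompt pass and the per-token decode loop separately.

    prefill and decode_step are hypothetical stand-ins for the real model
    calls; they are assumed to block until the work has finished.
    """
    start = clock()
    prefill()  # one pass over the whole prompt
    prefill_seconds = clock() - start

    start = clock()
    for _ in range(new_tokens):
        decode_step()  # one pass per generated token
    decode_seconds = clock() - start

    return {
        "prefill_tokens_per_s": prompt_tokens / prefill_seconds,
        "decode_tokens_per_s": new_tokens / decode_seconds,
    }


if __name__ == "__main__":
    # Stand-in workloads (short sleeps) so the sketch runs without a GPU.
    stats = measure_throughput(
        prefill=lambda: time.sleep(0.05),
        decode_step=lambda: time.sleep(0.01),
        prompt_tokens=500,
        new_tokens=5,
    )
    print(stats)
```

Reporting the two rates separately matters because the prompt pass processes all tokens in one batch, while decoding runs one small step per token and is usually more sensitive to per-call overhead.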

I then tested another model, https://huggingface.co/ClueAI/ChatYuan-large-v2, and its speed was within expectations. So my guess is that GPTQ does not work well on vGPU?

The code is nothing special; it looks like this:

from auto_gptq import AutoGPTQForCausalLM
model = AutoGPTQForCausalLM.from_quantized(quantized_model_dir, device="cuda:0")
...

foricee changed the title ~~running speed slow on NVIDIA vGPU A10(1/2)~~ running speed slow on NVIDIA vGPU
