running speed slow on NVIDIA vGPU #45

Open
foricee opened this issue Nov 11, 2023 · 0 comments

I tested the GPTQ-quantized qwen-7b model on a vGPU that has half the performance of an A10.

  • Driver Version:470.161.03
  • CUDA Version: 11.4

Both context processing and decoding are particularly slow:

  • context (500 tokens) processing speed: 48 tokens/s
  • decode speed: 1.6 tokens/s
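To put those numbers in wall-clock terms, here is a quick back-of-the-envelope conversion. The throughput figures are the ones reported above; the 100-token reply length is only an assumed example, not something that was measured.

```python
# Figures reported in this issue.
PROMPT_TOKENS = 500
PREFILL_TOKENS_PER_S = 48.0
DECODE_TOKENS_PER_S = 1.6

# Assumed reply length, for illustration only (not from the issue).
REPLY_TOKENS = 100


def seconds_for(tokens: int, tokens_per_s: float) -> float:
    """Convert a token count and a throughput into elapsed seconds."""
    return tokens / tokens_per_s


prefill_s = seconds_for(PROMPT_TOKENS, PREFILL_TOKENS_PER_S)
per_token_s = seconds_for(1, DECODE_TOKENS_PER_S)
reply_s = seconds_for(REPLY_TOKENS, DECODE_TOKENS_PER_S)

print(f"prompt processing: {prefill_s:.1f} s")
print(f"per decoded token: {per_token_s:.3f} s")
print(f"{REPLY_TOKENS}-token reply: {reply_s:.1f} s")
```

So the 500-token prompt alone takes a little over 10 seconds, and each generated token takes more than half a second.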

Then I tested other models, such as https://huggingface.co/ClueAI/ChatYuan-large-v2, and their speed was within expectations. So I guess that GPTQ does not work well on vGPU?

The code is nothing special; it looks like this:

from auto_gptq import AutoGPTQForCausalLM
model = AutoGPTQForCausalLM.from_quantized(quantized_model_dir, device="cuda:0")
...
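For anyone who wants to reproduce the two speeds, below is a rough sketch of how they can be separated. It is not the exact script used for the numbers above. `measure_speeds` and its `generate` callable are hypothetical names: `generate(n)` is assumed to run the model on a fixed prompt, produce `n` new tokens, and block until the GPU has finished (for example by calling `torch.cuda.synchronize()` before returning).

```python
import time
from typing import Callable, Dict


def measure_speeds(
    generate: Callable[[int], None],
    prompt_tokens: int,
    new_tokens: int,
    clock: Callable[[], float] = time.perf_counter,
) -> Dict[str, float]:
    """Estimate prefill and decode throughput in tokens per second.

    `generate(n)` is a hypothetical callable: it must process the same
    prompt each time, emit n new tokens, and return only when done.
    """
    if new_tokens < 2:
        raise ValueError("new_tokens must be at least 2")

    t0 = clock()
    generate(1)           # prompt processing plus a single decode step
    t1 = clock()
    generate(new_tokens)  # prompt processing plus new_tokens decode steps
    t2 = clock()

    # Approximation: the one-token run is treated as pure prompt processing.
    prefill_s = max(t1 - t0, 1e-9)
    # Subtract the one-token run to isolate the remaining decode steps.
    decode_s = max((t2 - t1) - prefill_s, 1e-9)

    return {
        "prefill_tokens_per_s": prompt_tokens / prefill_s,
        "decode_tokens_per_s": (new_tokens - 1) / decode_s,
    }
```

The `clock` parameter is only there so the function can be checked with fake timestamps; in a real run the default `time.perf_counter` is enough. A warm-up call before measuring is advisable, since the first call usually includes one-time setup cost.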
@foricee foricee changed the title running speed slow on NVIDIA vGPU A10(1/2) running speed slow on NVIDIA vGPU Nov 11, 2023