add support for Openbmb/MiniCPM #504
Conversation
I don't have permission to submit code. Please tell me how to get permission. OpenBMB is an open-source community in China, and MiniCPM is the community's main open-source language model. Thank you very much.
@LDLINGLINGLING thank you for contributing to AutoAWQ. I will review and merge it into the main branch once I have tested it. I have researched the MiniCPM models before and they are indeed great, some of the best work on small models coming out of China.
Thank you again.
Thanks for your contribution. I have tested the model and it works.
Hi @LDLINGLINGLING. AutoAWQ also supports multimodal models. I would love to support MiniCPM-V-2. If you have time and interest, I will certainly help you review any pull request to provide support for a quantized multimodal model.
Sorry for replying so late. We have been busy releasing MiniCPM-V 2.6 recently. I am now preparing to do AWQ quantization of MiniCPM-V 2.6. Can you give me an example? Thank you very much.
@LDLINGLINGLING You can find the LLaVA Next quantization example in the documentation: https://casper-hansen.github.io/AutoAWQ/examples/#vision-language-models
Can you give me an example of using the AWQ interface to quantize LLaVA Next? Does it need image and text data?
For LLaVA Next, we only quantize the text part of the model. I'm not sure if it's compatible with MiniCPM-V.
```python
from awq import AutoAWQForCausalLM
from transformers import AutoTokenizer

model_path = 'llava-hf/llama3-llava-next-8b-hf'
quant_path = 'llama3-llava-next-8b-awq'
quant_config = {"zero_point": True, "q_group_size": 128, "w_bit": 4, "version": "GEMM"}

# Load model
model = AutoAWQForCausalLM.from_pretrained(
    model_path, device_map="cuda", low_cpu_mem_usage=True
)
tokenizer = AutoTokenizer.from_pretrained(model_path, trust_remote_code=True)

# Quantize
model.quantize(tokenizer, quant_config=quant_config)

# Save quantized model
model.save_quantized(quant_path)
tokenizer.save_pretrained(quant_path)

print(f'Model is quantized and saved at "{quant_path}"')
```
Hi, I have quantized MiniCPM-V 2.6, but there is currently a problem: inference speed drops a lot. Is there any solution? Is there a big speed improvement after fusing layers?
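For context, AutoAWQ's `from_quantized` loader accepts a `fuse_layers` flag that swaps in fused attention/MLP modules, which usually recovers generation speed. A minimal sketch, assuming AutoAWQ is installed (the `quant_path` argument is a placeholder, not a path from this thread):

```python
def load_fused(quant_path: str):
    """Load an AWQ-quantized model with fused modules for faster inference.

    Sketch only: assumes the `awq` package is installed and `quant_path`
    points at a saved AWQ checkpoint.
    """
    from awq import AutoAWQForCausalLM  # imported lazily to keep the sketch self-contained

    # fuse_layers=True replaces attention and MLP blocks with fused kernels
    return AutoAWQForCausalLM.from_quantized(quant_path, fuse_layers=True)
```

Whether fusion helps a multimodal model like MiniCPM-V depends on whether its language tower matches one of the fused architectures.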
Hello, I am a staff member at OpenBMB responsible for the open-source community. This pull request adds support for our openbmb/MiniCPM models. The AWQ-quantized models are available at the following Hugging Face addresses:
MiniCPM2_2b_awq_int4
MiniCPM2_1b_awq_int4
I also ran a perplexity test of the above models on the WikiText test set; the results are as follows:
| Model | Type | GPU usage | Perplexity | Eval speed |
|---|---|---|---|---|
| awq_cpm_1b_4bit | AWQ | 1.54 GB | 8.867 | 5.84 it/s |
| MiniCPM-1B-sft-bf16 | pretrained | 3.24 GB | 8.576 | 15.25 it/s |
| minicpm_1b_4bit | GPTQ | 1.90 GB | 9.416 | 17.81 it/s |
| awq_cpm_2b_4bit | AWQ | 2.75 GB | 8.152 | 4.70 it/s |
| miniCPM-bf16 | pretrained | 5.93 GB | 7.981 | 9.18 it/s |
| minicpm_2b_4bit | GPTQ | 3.02 GB | 8.669 | 10.65 it/s |
If the above code meets your requirements, we look forward to it being merged into the master branch.