
[Build] How can I quantize the llama3 model activation to int4? #21334

Open
zhangyu68 opened this issue Jul 12, 2024 · 0 comments
Labels
build (build issues; typically submitted using template), model:transformer (issues related to a transformer model: BERT, GPT2, Hugging Face, Longformer, T5, etc.), quantization (issues related to quantization)

Comments

@zhangyu68

Describe the issue

I’m trying to quantize a model to int4, but the script below only provides weight-only quantization. Is there a way to quantize both the weights and the activations to int4?
https://github.com/microsoft/onnxruntime/blob/main/onnxruntime/python/tools/quantization/matmul_4bits_quantizer.py
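
For context, the linked matmul_4bits_quantizer.py performs weight-only, blockwise int4 quantization of MatMul weights; activations stay in their original floating-point precision. Below is a minimal sketch of how it can be driven from Python; the constructor arguments (block_size, is_symmetric) and the file paths are assumptions and may differ across onnxruntime versions.

```python
# Minimal sketch: weight-only int4 (blockwise) quantization with ONNX Runtime's
# MatMul4BitsQuantizer. Paths and constructor defaults are illustrative and may
# differ across onnxruntime versions.
import onnx
from onnxruntime.quantization.matmul_4bits_quantizer import MatMul4BitsQuantizer

model = onnx.load("llama3-8b.onnx")  # illustrative path
quantizer = MatMul4BitsQuantizer(
    model,
    block_size=32,      # blockwise group size for the int4 weights
    is_symmetric=True,  # symmetric per-block scales
)
quantizer.process()     # rewrites MatMul nodes to use int4 weights (MatMulNBits)
quantizer.model.save_model_to_file(
    "llama3-8b-int4.onnx",
    use_external_data_format=True,  # weights larger than 2 GB need external data
)
# Note: this tool quantizes weights only; activations are not touched.
```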

Thanks for your help!

Urgency

No response

Target platform

onnx

Build script

python -m onnxruntime.transformers.models.llama.convert_to_onnx -m /publicdata/huggingface.co/meta-llama/Meta-Llama-3-8B-Instruct/ --output llama3-8b-int4-gpu --precision int4 --execution_provider cuda --quantization_method blockwise --use_gqa

Error / output

Expected: the ability to quantize both the weights and the activations to int4, but the tool only supports weight-only quantization.
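
For reference, ONNX Runtime's general static quantization API (quantize_static) does quantize activations, but as far as I can tell only to int8, not int4. A minimal sketch of that path follows; the file paths and input names ("input_ids", "attention_mask") are assumptions, and real calibration prompts would be needed in practice.

```python
# Minimal sketch: static quantization with quantize_static, which quantizes
# activations as well as weights -- but only to int8, not int4.
import numpy as np
from onnxruntime.quantization import (
    CalibrationDataReader,
    QuantType,
    quantize_static,
)

class LlamaCalibrationReader(CalibrationDataReader):
    """Feeds a few representative inputs so activation ranges can be calibrated."""
    def __init__(self, samples):
        self._iter = iter(samples)  # each sample: dict of input name -> np.ndarray
    def get_next(self):
        return next(self._iter, None)

# Input names below are assumptions; use the actual input names of the exported
# ONNX graph and real prompts for meaningful calibration.
samples = [{
    "input_ids": np.ones((1, 16), dtype=np.int64),
    "attention_mask": np.ones((1, 16), dtype=np.int64),
}]

quantize_static(
    model_input="llama3-8b.onnx",         # illustrative path
    model_output="llama3-8b-int8.onnx",
    calibration_data_reader=LlamaCalibrationReader(samples),
    activation_type=QuantType.QInt8,      # int8 activations; int4 is not an option here
    weight_type=QuantType.QInt8,
)
```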

Visual Studio Version

No response

GCC / Compiler Version

No response

zhangyu68 added the build label on Jul 12, 2024
github-actions added the ep:CUDA, model:transformer, and quantization labels on Jul 12, 2024
sophies927 removed the ep:CUDA label on Jul 18, 2024