Latest news 🔥
- [coming soon] Support for more popular models and inference optimization techniques like continuous batching and speculative decoding
- [06/2024] Added support for GPT-J-6B
- [06/2024] Added support for Qwen2-1.5B-Instruct
- [06/2024] Added support for StarCoder2-15B
- [06/2024] Added support for Phi3-Mini-4K-Instruct
- [06/2024] Added support for Codestral-22B-v0.1
- [06/2024] Added support for Vicuna-v1.5
- [05/2024] Added support for Mixtral-8x7B & Mistral-7B-Instruct-v0.1
- [04/2024] Initial release of efficient transformers for seamless inference on pre-trained LLMs.
This library provides reimplemented LLM blocks that make the models functional and highly performant on Qualcomm Cloud AI 100. Several models can be transformed directly from their pre-trained original form into a deployment-ready optimized form. For other models, comprehensive documentation and how-tos describe the changes needed.
- Reimplemented blocks from Transformers which enable efficient on-device retention of intermediate states.
- Graph transformations to enable execution of key operations in lower precision
- Graph transformations to replace some operations with other mathematically equivalent operations
- Handling for underflows and overflows in lower precision
- Patcher modules to map weights of original model's operations to updated model's operations
- Exporter module to export the model source into an ONNX graph.
- Sample example applications and demo notebooks
- Unit test templates.
It is mandatory for each Pull Request to include tests such as:
- If the PR adds support for a model, the tests should include successful execution of the model, with the PR's changes applied, on PyTorch and ONNXRT. The exit criterion is the MSE between the outputs of the original model and the updated model.
- If the PR modifies any common utilities, it must include tests that execute the tests of all models included in the library.
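The MSE exit criterion described above can be sketched as follows. This is a minimal illustration in plain Python, not the library's test harness: the helper names and the tolerance value are hypothetical, since the source does not state a threshold.

```python
from typing import Sequence


def mean_squared_error(reference: Sequence[float], candidate: Sequence[float]) -> float:
    """Return the mean squared error between two equally sized flat outputs."""
    if len(reference) != len(candidate):
        raise ValueError("outputs must have the same number of elements")
    if not reference:
        raise ValueError("outputs must not be empty")
    return sum((r - c) ** 2 for r, c in zip(reference, candidate)) / len(reference)


# Hypothetical tolerance: the README does not specify a threshold.
MSE_TOLERANCE = 1e-5


def outputs_match(reference: Sequence[float], candidate: Sequence[float],
                  tolerance: float = MSE_TOLERANCE) -> bool:
    """Pass when the updated model's output stays close to the original's."""
    return mean_squared_error(reference, candidate) <= tolerance


# Toy values standing in for flattened logits of the original and updated model.
original_logits = [0.10, -1.25, 3.50]
updated_logits = [0.10, -1.25, 3.501]
print(outputs_match(original_logits, updated_logits))
```

In a real test, the two sequences would come from running the original and the transformed model on the same inputs, once on PyTorch and once on ONNXRT.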
- GPT2
- Llama-3-8b
- Llama-3-70b
- Llama-2-70b
- Llama-2-7b-chat-hf
- Llama-2-13b-chat-hf
- CodeLlama-7b-hf
- CodeLlama-13b-hf
- CodeLlama-34b-hf
- Salesforce/codegen25-7b-mono_P
- Salesforce/xgen-7b-8k-base
- MPT-7b
- Mistral-7B-Instruct-v0.1
- Mixtral-8x7B
- Vicuna-v0
- Vicuna-v1.3
- Vicuna-v1.5
- Qwen2-1.5B-Instruct
- StarCoder2-15B
- Phi3-Mini-4K-Instruct
- Codestral-22B-v0.1
- Falcon-40b
- GPT-J-6B
System Requirements:
- Supported Linux OS - Ubuntu, RHEL and AWS Linux
- Pre-requisites installed
- Cloud AI 100 Platform and Apps SDK installed
- Multi-device support enabled for model sharding
💡 Use bash terminal
📝 If using a ZSH terminal, the "device_group" value should be in single quotes, e.g. "--device_group '[0]'"
pip install -U pip
pip install git+https://github.com/quic/efficient-transformers
The QEfficient library was designed with one goal: to make onboarding model inference straightforward for any Transformer architecture while leveraging the complete power of the Cloud AI platform.
To achieve this, we provide 2 levels of APIs with different levels of abstraction.
- High-level APIs abstract away complex details, offering a simpler interface. They're ideal for quick development and prototyping. If you're new to a technology or want to minimize coding effort, high-level APIs are more user-friendly.
- Low-level APIs offer more granular control, ideal for when customization is necessary. These are particularly useful for users who are trying their own models that are not hosted on HF but are implemented based on Transformers.
In summary:
- Choose high-level APIs for quick development, simplicity, and ease of use.
- Opt for low-level APIs when you need fine-tuned control, optimization, or advanced customization.
High Level APIs | Sample use | Arguments |
---|---|---|
QEfficient.cloud.infer | click here | |
QEfficient.cloud.execute | click here | |
At least one of the arguments prompt or prompts_txt_file_path must be passed.
This is the single end-to-end Python API in the library. It takes the model card name as input, along with other compile arguments if necessary, and does everything in one go.
- Torch Download → Optimize for Cloud AI 100 → Export to ONNX → Verify (CPU) → Compile on Cloud AI 100 → Execute
- It skips the ONNX export or compile stage if an ONNX file or QPC is already found on the path
# Check out the options using the help menu
python -m QEfficient.cloud.infer --help
python -m QEfficient.cloud.infer --model_name gpt2 --batch_size 1 --prompt_len 32 --ctx_len 128 --mxfp6 --num_cores 16 --device_group [0] --prompt "My name is" --mos 1 --aic_enable_depth_first
# If executing with batch size > 1,
# either pass the input prompts in a single string, separated with the pipe (|) symbol. Example below:
python -m QEfficient.cloud.infer --model_name gpt2 --batch_size 3 --prompt_len 32 --ctx_len 128 --num_cores 16 --device_group [0] --prompt "My name is|The flat earth theory is the belief that|The sun rises from" --mxfp6 --mos 1 --aic_enable_depth_first
# Or pass the path of a txt file with input prompts. Example below; a sample txt file (prompts.txt) is present in the examples folder.
python -m QEfficient.cloud.infer --model_name gpt2 --batch_size 3 --prompt_len 32 --ctx_len 128 --num_cores 16 --device_group [0] --prompts_txt_file_path examples/prompts.txt --mxfp6 --mos 1 --aic_enable_depth_first
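When the batched command is launched from a script, the pipe-separated prompt string can be assembled programmatically. The sketch below is only an illustration: `build_infer_command` is a hypothetical helper, it uses only the flags shown in the commands above, and it assumes the batch size equals the number of prompts, as in the example.

```python
import shlex
from typing import List, Sequence


def build_infer_command(model_name: str, prompts: Sequence[str],
                        device_group: str = "[0]") -> List[str]:
    """Build an argument list for QEfficient.cloud.infer (hypothetical helper)."""
    if not prompts:
        raise ValueError("at least one prompt is required")
    if any("|" in p for p in prompts):
        raise ValueError("prompts must not contain the pipe separator")
    return [
        "python", "-m", "QEfficient.cloud.infer",
        "--model_name", model_name,
        # Assumption: one prompt per batch element, as in the example above.
        "--batch_size", str(len(prompts)),
        "--prompt_len", "32",
        "--ctx_len", "128",
        "--num_cores", "16",
        "--device_group", device_group,
        "--prompt", "|".join(prompts),
        "--mxfp6", "--mos", "1", "--aic_enable_depth_first",
    ]


command = build_infer_command(
    "gpt2",
    ["My name is", "The flat earth theory is the belief that", "The sun rises from"],
)
# shlex.join produces a shell-safe string and single-quotes the [0] value.
print(shlex.join(command))
# To actually run it on a machine with the SDK installed:
# import subprocess; subprocess.run(command, check=True)
```

Passing the list directly to `subprocess.run` bypasses the shell, so the ZSH quoting issue for "device_group" does not arise in that case.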
Once we have compiled the QPC, we can use the precompiled QPC in the execute API to run different prompts, as below:
python -m QEfficient.cloud.execute --model_name gpt2 --qpc_path qeff_models/gpt2/qpc_16cores_1BS_32PL_128CL_1devices_mxfp6/qpcs --prompt "Once upon a time in" --device_group [0]
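The QPC path in the command above appears to encode the compile settings (cores, batch size, prompt length, context length, device count, precision). The sketch below reconstructs that pattern purely from the example paths in this README; it is an inference, not a documented contract, and `guess_qpc_dir` is a hypothetical helper. How the folder is named when mxfp6 is not used is unknown, so the precision tag is left as a parameter.

```python
from pathlib import PurePosixPath


def guess_qpc_dir(model_name: str, num_cores: int, batch_size: int,
                  prompt_len: int, ctx_len: int, num_devices: int,
                  precision_tag: str = "mxfp6",
                  root: str = "qeff_models") -> str:
    """Guess the QPC directory from compile settings (pattern inferred from examples)."""
    folder = (
        f"qpc_{num_cores}cores_{batch_size}BS_{prompt_len}PL_"
        f"{ctx_len}CL_{num_devices}devices_{precision_tag}"
    )
    return str(PurePosixPath(root) / model_name / folder / "qpcs")


# Matches the path used with the execute API in the gpt2 example.
print(guess_qpc_dir("gpt2", num_cores=16, batch_size=1,
                    prompt_len=32, ctx_len=128, num_devices=1))
```

Prefer the path printed by the compile step itself when it is available; this helper is only useful for spotting which settings a given QPC folder was built with.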
We can also enable MQ based on the number of devices. The TS (tensor slicing) config is created on the fly from the "--device_group" input: with "--device_group [0,1]", a TS config for 2 devices is created and used for compilation; with "--device_group [0]", TS compilation is skipped and single-SoC execution is enabled.
python -m QEfficient.cloud.infer --model_name Salesforce/codegen-2B-mono --batch_size 1 --prompt_len 32 --ctx_len 128 --mxfp6 --num_cores 16 --device_group [0,1] --prompt "def fibonacci(n):" --mos 1 --aic_enable_depth_first
# Once qpc is saved, you can use the execute API to run for different prompts
python -m QEfficient.cloud.execute --model_name Salesforce/codegen-2B-mono --qpc_path qeff_models/Salesforce/codegen-2B-mono/qpc_16cores_1BS_32PL_128CL_2devices_mxfp6/qpcs --prompt "def binary_search(array: np.array, k: int):" --device_group [0,1]
# To disable MQ, pass a single SoC as below:
python -m QEfficient.cloud.infer --model_name gpt2 --batch_size 1 --prompt_len 32 --ctx_len 128 --mxfp6 --num_cores 16 --device_group [0] --prompt "My name is" --mos 1 --aic_enable_depth_first
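The relationship between "--device_group" and the execution mode described above can be sketched as a small decision helper. This is an illustration of the stated rule only (more than one device means tensor slicing, one device means single SoC); the function names are hypothetical and this is not the library's own parsing code.

```python
import ast
from typing import List


def parse_device_group(text: str) -> List[int]:
    """Parse a device group string such as "[0,1]" into a list of device ids."""
    value = ast.literal_eval(text)
    if (not isinstance(value, list) or not value
            or not all(isinstance(d, int) for d in value)):
        raise ValueError("device group must be a non-empty list of integers, e.g. [0]")
    return value


def uses_tensor_slicing(device_group: List[int]) -> bool:
    """More than one device implies a tensor-slicing (multi-device) compile."""
    return len(device_group) > 1


for raw in ("[0]", "[0,1]"):
    devices = parse_device_group(raw)
    mode = "tensor slicing" if uses_tensor_slicing(devices) else "single SoC"
    print(f"--device_group {raw}: {len(devices)} device(s), {mode}")
```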
High Level APIs | Single SoC | Tensor Slicing |
---|---|---|
QEfficient.cloud.infer | python -m QEfficient.cloud.infer --model_name | python -m QEfficient.cloud.infer --model_name |
QEfficient.cloud.execute | python -m QEfficient.cloud.execute --model_name | python -m QEfficient.cloud.execute --model_name |
📝 Replace
Low Level APIs | Sample use | Arguments |
---|---|---|
QEfficient.transform | click here | |
QEfficient.export | click here | |
QEfficient.compile | click here | |
QEfficient.cloud_ai_100_exec_kv | click here | |
In QEfficient.cloud_ai_100_exec_kv, at least one of the arguments prompt or prompts_txt_file_path must be passed.
Initialize QEfficient and transform the models. Check the list of supported architectures in the repo.
# Initialize the original Transformer model
import os
from QEfficient import QEFFAutoModelForCausalLM as AutoModelForCausalLM
# Please uncomment and use appropriate Cache Directory for transformers, in case you don't want to use default ~/.cache dir.
# os.environ["TRANSFORMERS_CACHE"] = "/local/mnt/workspace/hf_cache"
# ROOT_DIR = os.path.dirname(os.path.abspath(""))
# CACHE_DIR = os.path.join(ROOT_DIR, "tmp")  # You can use a different location for just one model by passing this as cache_dir in the API below.
# Model-Card name to be onboarded (This is HF Model Card name) : https://huggingface.co/gpt2-xl
model_name = "gpt2"  # Similarly, we can change the model name and generate the corresponding models, if support has been added to the lib.
qeff_model = AutoModelForCausalLM.from_pretrained(model_name)
print(f"{model_name} optimized for AI 100 \n", qeff_model)
Use the qualcomm_efficient_converter API to export the KV-transformed model to ONNX and verify it on Torch.
# We can now export the modified models to the ONNX framework
# This will generate a single ONNX model for both the Prefill and Decode variations, which are optimized for
# the Cloud AI 100 Platform.
# While generating the ONNX model, this will clip the overflow constants to fp16
# Verify the model on ONNX Runtime vs PyTorch
# Then generate the inputs and customio yaml file required for compilation.
# Compile the model for the provided compilation arguments
# Please use the platform SDK to check num_cores for your card.
generated_qpc_path = qeff_model.compile(
    num_cores=14,
    mxfp6=True,
    device_group=[0],
)
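The comment above about clipping overflow constants to fp16 can be illustrated with a small sketch. It shows only the general idea, clamping values to the finite fp16 range of ±65504, in plain Python; `clip_to_fp16_range` is a hypothetical helper and not the library's export code.

```python
from typing import Iterable, List

# Largest finite value representable in IEEE 754 half precision (fp16).
FP16_MAX = 65504.0


def clip_to_fp16_range(values: Iterable[float]) -> List[float]:
    """Clamp each value into [-FP16_MAX, FP16_MAX] so it does not overflow in fp16."""
    return [max(-FP16_MAX, min(FP16_MAX, v)) for v in values]


# Large constants (for example, attention-mask fill values) would overflow in fp16.
constants = [1.0, 1e9, float("-inf")]
print(clip_to_fp16_range(constants))
```

Values already inside the fp16 range pass through unchanged, so the clamp only affects constants that would otherwise become infinities in lower precision.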
Benchmark the model on Cloud AI 100: run the infer API to print tokens and tok/sec.
# Post compilation, we can print the latency stats for the KV models. We provide an API to print token and latency stats on AI 100.
# We need the compiled prefill and decode QPC to compute the tokens generated. This is based on the greedy sampling approach.
qeff_model.generate(prompts=["My name is"])
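For readers unfamiliar with the greedy sampling approach mentioned above: at every step, the token with the highest score is appended to the sequence. The sketch below shows that loop with a toy scoring function standing in for the model; `greedy_decode` and `toy_logits` are hypothetical names and do not reflect how the library runs the compiled QPC.

```python
from typing import Callable, List, Optional, Sequence


def greedy_decode(next_logits: Callable[[List[int]], Sequence[float]],
                  prompt_ids: Sequence[int], max_new_tokens: int,
                  eos_id: Optional[int] = None) -> List[int]:
    """Repeatedly append the highest-scoring token id (greedy sampling)."""
    ids = list(prompt_ids)
    for _ in range(max_new_tokens):
        logits = next_logits(ids)
        # argmax over the vocabulary: index of the largest logit
        next_id = max(range(len(logits)), key=logits.__getitem__)
        ids.append(next_id)
        if eos_id is not None and next_id == eos_id:
            break
    return ids


def toy_logits(ids: List[int]) -> List[float]:
    """Toy 4-token 'model' that always favors (last token + 1) mod 4."""
    target = (ids[-1] + 1) % 4
    return [1.0 if i == target else 0.0 for i in range(4)]


print(greedy_decode(toy_logits, [0], max_new_tokens=3))  # prints [0, 1, 2, 3]
```

Because the choice at each step is deterministic, the same prompt always yields the same output, which keeps benchmark runs comparable.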
End to End demo examples for various models are available in notebooks directory. Please check them out.
Watch this space for references to detailed steps, template examples and much more.
Note: More details are here: https://quic.github.io/cloud-ai-sdk-pages/latest/Getting-Started/Model-Architecture-Support/Large-Language-Models/llm/
Thanks to:
- Hugging Face Transformers for work in LLM GenAI modeling implementation
- The ONNX, PyTorch, and ONNX Runtime communities.
If you run into any problems with the code, please file GitHub issues directly in this repo.
This project welcomes contributions and suggestions. Please check the License. Integration with a CLA Bot is underway.