Qualcomm Transformers Library

Cloud AI 100

Latest news 🔥

[coming soon] Support for more popular models and inference optimization techniques like continuous batching and speculative decoding

[06/2024] Added support for GPT-J-6B

[06/2024] Added support for Qwen2-1.5B-Instruct
[06/2024] Added support for StarCoder2-15B
[06/2024] Added support for Phi3-Mini-4K-Instruct
[06/2024] Added support for Codestral-22B-v0.1
[06/2024] Added support for Vicuna-v1.5
[05/2024] Added support for Mixtral-8x7B & Mistral-7B-Instruct-v0.1.
[04/2024] Initial release of efficient transformers for seamless inference on pre-trained LLMs.

Train anywhere, Infer on Qualcomm Cloud AI with a Developer-centric Toolchain

This library provides reimplemented LLM blocks that make the models functional and highly performant on Qualcomm Cloud AI 100. Several models can be transformed directly from their pre-trained original form to a deployment-ready optimized form. For other models, comprehensive documentation and how-to guides describe the changes needed.

Typically for LLMs, the library provides:

Reimplemented blocks from Transformers which enable efficient on-device retention of intermediate states.
Graph transformations to enable execution of key operations in lower precision
Graph transformations to replace some operations with other mathematically equivalent operations
Handling for underflows and overflows in lower precision
Patcher modules to map weights of original model's operations to updated model's operations
Exporter module to export the model source into an ONNX graph.
Sample example applications and demo notebooks
Unit test templates.
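
As an illustration of the overflow handling listed above, the following minimal sketch clamps a constant to the finite float16 range before converting it. It uses only the Python standard library. The function name clip_to_fp16 is hypothetical and is not part of the library, which may handle overflows differently.

```python
import struct

FP16_MAX = 65504.0  # largest finite IEEE 754 half-precision value


def clip_to_fp16(value: float) -> float:
    """Clamp value to the finite float16 range, then round it to float16."""
    clipped = max(-FP16_MAX, min(FP16_MAX, value))
    # The 'e' format code packs an IEEE 754 binary16 (half-precision) value.
    return struct.unpack("<e", struct.pack("<e", clipped))[0]


if __name__ == "__main__":
    # Large mask-style constants would overflow float16 without clamping.
    for constant in (1.0e9, -3.4e38, 0.5):
        print(constant, "->", clip_to_fp16(constant))
```

Without the clamp, packing a value such as 1.0e9 as float16 raises an OverflowError, which is the kind of failure the clipping avoids.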

It is mandatory for each Pull Request to include tests such as:

If the PR adds support for a model, the tests should include successful execution of the model with the PR's changes on PyTorch and ONNX Runtime. The exit criterion is the MSE between the outputs of the original model and the updated model.
If the PR modifies any common utilities, it must include tests that execute the tests of all models included in the library.
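
The MSE criterion above could be checked along the following lines. This is a minimal sketch on plain Python lists; the helper names and the default tolerance are assumptions, since the source does not state a threshold or a test harness.

```python
from typing import Sequence


def mean_squared_error(reference: Sequence[float], candidate: Sequence[float]) -> float:
    """Mean squared error between two flat output vectors of equal length."""
    if len(reference) != len(candidate):
        raise ValueError("outputs must have the same number of elements")
    if not reference:
        raise ValueError("outputs must not be empty")
    return sum((r - c) ** 2 for r, c in zip(reference, candidate)) / len(reference)


def outputs_match(
    reference: Sequence[float],
    candidate: Sequence[float],
    tolerance: float = 1e-5,  # placeholder value, not taken from the library
) -> bool:
    """Return True when the MSE is within the chosen tolerance."""
    return mean_squared_error(reference, candidate) <= tolerance


if __name__ == "__main__":
    original_logits = [0.10, 0.70, 0.20]
    updated_logits = [0.10, 0.70, 0.20]
    print(outputs_match(original_logits, updated_logits))
```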

Validated Models

Models Coming Soon

Requirements

System Requirements:

Supported Linux OS - Ubuntu, RHEL and AWS Linux
Pre-requisites installed
Cloud AI 100 Platform and Apps SDK installed
Multi-device support enabled for model sharding

💡 Use bash terminal

📝 If you use a ZSH terminal, the device_group value should be in single quotes, e.g. --device_group '[0]'
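
The quoting is needed because ZSH treats unquoted square brackets as a glob pattern. When launching the command from a Python script, one way to sidestep shell differences is to build the argument list in Python and quote it with the standard shlex module. The sketch below only builds and prints the command; the helper name build_infer_command is hypothetical, and the flags are the ones listed in this README.

```python
import shlex
from typing import List


def build_infer_command(model_name: str, device_group: str, prompt: str) -> List[str]:
    """Build the argument list for the infer CLI without involving a shell."""
    return [
        "python", "-m", "QEfficient.cloud.infer",
        "--model_name", model_name,
        "--num_cores", "16",
        "--device_group", device_group,
        "--prompt", prompt,
    ]


if __name__ == "__main__":
    args = build_infer_command("gpt2", "[0]", "My name is")
    # shlex.join quotes every argument that a POSIX shell would otherwise interpret.
    print(shlex.join(args))
```

Passing the same list to subprocess.run without shell=True avoids shell globbing entirely, so no quoting is needed in that case.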

Installation

pip install -U pip
pip install git+https://github.com/quic/efficient-transformers

Quick Start Guide

The QEfficient library was designed with one goal: to make onboarding model inference straightforward for any Transformer architecture while leveraging the full power of the Cloud AI platform.

To achieve this, we have 2 levels of APIs, with different levels of abstraction.

High-level APIs abstract away complex details, offering a simpler interface. They're ideal for quick development and prototyping. If you're new to a technology or want to minimize coding effort, high-level APIs are more user-friendly.
Low-level APIs offer more granular control, ideal for when customization is necessary. They are particularly useful for users bringing their own models that are not hosted on Hugging Face but are implemented on top of Transformers.

In summary:

Choose high-level APIs for quick development, simplicity, and ease of use.
Opt for low-level APIs when you need fine-tuned control, optimization, or advanced customization.

Using High Level APIs

High Level APIs and their arguments

QEfficient.cloud.infer (sample use: see section 1 below)
model_name : Mandatory
num_cores : Mandatory
device_group : Mandatory
prompt : Optional
prompts_txt_file_path : Optional
aic_enable_depth_first : Optional
mos : Optional [Default=-1]
batch_size : Optional [Default=1]
prompt_len : Optional [Default=32]
ctx_len : Optional [Default=128]
generation_len : Optional [Default=None]
mxfp6 : Optional
mxint8 : Optional
local_model_dir : Optional [Path to custom model weights and config file]
cache_dir : Optional [Path to the directory used for saving HuggingFace cache, Default is "efficient-transformers/cache_dir".]
hf_token : Optional
verbose : Optional

QEfficient.cloud.execute (sample use: see section 2 below)
model_name : Mandatory
qpc_path : Mandatory
device_group : Mandatory
local_model_dir : Optional [Path to custom model weights and config file]
prompt : Optional
prompts_txt_file_path : Optional
generation_len : Optional [Default=None]
cache_dir : Optional [Path to the directory used for saving HuggingFace cache, Default is "efficient-transformers/cache_dir".]
hf_token : Optional

At least one of prompt or prompts_txt_file_path must be passed.

1. Use QEfficient.cloud.infer

This is the single end-to-end Python API in the library. It takes the model card name as input, along with other compile arguments if necessary, and runs the whole pipeline in one go:

Torch Download → Optimize for Cloud AI 100 → Export to ONNX → Verify (CPU) → Compile on Cloud AI 100 → Execute
It skips the ONNX export and compile stages if the ONNX file or the QPC is already found on the path.
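
The skip behaviour can be pictured as a check for existing artifacts before each stage. The sketch below is only an illustration of that idea: the function name and the exact checks are assumptions, and the library's own logic may differ.

```python
from pathlib import Path
from typing import List


def stages_to_run(onnx_path: Path, qpc_dir: Path) -> List[str]:
    """Return the pipeline stages still needed, given the artifacts on disk."""
    if qpc_dir.is_dir():
        # A compiled QPC exists, so export and compile are both skipped.
        return ["execute"]
    if onnx_path.is_file():
        # The ONNX export is already done; only compile and execute remain.
        return ["compile", "execute"]
    return ["download", "optimize", "export", "verify", "compile", "execute"]


if __name__ == "__main__":
    # Hypothetical locations, used only to demonstrate the call.
    print(stages_to_run(Path("model.onnx"), Path("qpcs")))
```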

# Check out the options using the help menu
python -m QEfficient.cloud.infer --help
python -m QEfficient.cloud.infer --model_name gpt2 --batch_size 1 --prompt_len 32 --ctx_len 128 --mxfp6 --num_cores 16 --device_group [0] --prompt "My name is" --mos 1 --aic_enable_depth_first  

# If executing for batch size > 1,

# either pass the input prompts in a single string, separated with the pipe (|) symbol. Example below:

python -m QEfficient.cloud.infer --model_name gpt2 --batch_size 3 --prompt_len 32 --ctx_len 128 --num_cores 16 --device_group [0] --prompt "My name is|The flat earth theory is the belief that|The sun rises from" --mxfp6 --mos 1 --aic_enable_depth_first

# or pass the path of a txt file with the input prompts. Example below; a sample txt file (prompts.txt) is present in the examples folder.

python -m QEfficient.cloud.infer --model_name gpt2 --batch_size 3 --prompt_len 32 --ctx_len 128 --num_cores 16 --device_group [0] --prompts_txt_file_path examples/prompts.txt --mxfp6 --mos 1 --aic_enable_depth_first
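
The two ways of passing prompts for a batch can be thought of as resolving to one list of prompts. The following sketch shows one possible resolution; it is not the library's implementation. The function name, the one-prompt-per-line file format, the precedence of the file over the string, and the strict length check against batch_size are all assumptions made for illustration.

```python
from pathlib import Path
from typing import List, Optional


def resolve_prompts(
    prompt: Optional[str],
    prompts_txt_file_path: Optional[str],
    batch_size: int,
) -> List[str]:
    """Turn CLI-style prompt input into a list with one prompt per batch slot."""
    if prompts_txt_file_path is not None:
        # Assumed file format: one prompt per non-empty line.
        lines = Path(prompts_txt_file_path).read_text().splitlines()
        prompts = [line.strip() for line in lines if line.strip()]
    elif prompt is not None:
        # Prompts in a single string are separated by the pipe symbol.
        prompts = [part.strip() for part in prompt.split("|")]
    else:
        raise ValueError("pass either prompt or prompts_txt_file_path")
    if len(prompts) != batch_size:
        raise ValueError(f"expected {batch_size} prompts, got {len(prompts)}")
    return prompts


if __name__ == "__main__":
    text = "My name is|The flat earth theory is the belief that|The sun rises from"
    print(resolve_prompts(text, None, batch_size=3))
```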

2. Use of QEfficient.cloud.execute

Once the QPC is compiled, the precompiled QPC can be used with the execute API to run different prompts, as shown below:

python -m QEfficient.cloud.execute --model_name gpt2 --qpc_path qeff_models/gpt2/qpc_16cores_1BS_32PL_128CL_1devices_mxfp6/qpcs --prompt "Once upon a time in" --device_group [0]
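
The QPC paths used in this README encode the compile arguments in the directory name. The sketch below reproduces that pattern as inferred from the example paths only. The helper is hypothetical, and the real naming may include other fields; in particular, omitting the suffix when mxfp6 is off is a guess.

```python
from pathlib import PurePosixPath


def example_qpc_path(
    model_name: str,
    num_cores: int,
    batch_size: int,
    prompt_len: int,
    ctx_len: int,
    num_devices: int,
    mxfp6: bool = True,
) -> str:
    """Build a QPC path following the pattern seen in the README examples."""
    name = (
        f"qpc_{num_cores}cores_{batch_size}BS_{prompt_len}PL_"
        f"{ctx_len}CL_{num_devices}devices"
    )
    if mxfp6:
        name += "_mxfp6"
    return str(PurePosixPath("qeff_models") / model_name / name / "qpcs")


if __name__ == "__main__":
    print(example_qpc_path("gpt2", 16, 1, 32, 128, 1))
```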

MQ can also be enabled based simply on the number of devices. Given the --device_group input, the tool creates the tensor slicing (TS) config on the fly. With --device_group [0,1] it creates a TS config for 2 devices and uses it for compilation; with --device_group [0], TS compilation is skipped and single-SoC execution is enabled.

python -m QEfficient.cloud.infer --model_name Salesforce/codegen-2B-mono --batch_size 1 --prompt_len 32 --ctx_len 128 --mxfp6 --num_cores 16 --device_group [0,1] --prompt "def fibonacci(n):" --mos 1 --aic_enable_depth_first  
 
# Once qpc is saved, you can use the execute API to run for different prompts
python -m QEfficient.cloud.execute --model_name Salesforce/codegen-2B-mono --qpc_path qeff_models/Salesforce/codegen-2B-mono/qpc_16cores_1BS_32PL_128CL_2devices_mxfp6/qpcs --prompt "def binary_search(array: np.array, k: int):" --device_group [0,1]
 
# To disable MQ, pass a single SoC as below:
python -m QEfficient.cloud.infer --model_name gpt2 --batch_size 1 --prompt_len 32 --ctx_len 128 --mxfp6 --num_cores 16 --device_group [0] --prompt "My name is" --mos 1 --aic_enable_depth_first
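
The rule illustrated by these commands (more than one device in the group enables tensor slicing, a single device disables it) can be sketched as follows. The helper names are hypothetical, and this is not necessarily how the CLI parses its arguments.

```python
import ast
from typing import List


def parse_device_group(text: str) -> List[int]:
    """Parse a string such as "[0,1]" into a list of device ids."""
    value = ast.literal_eval(text)
    if not isinstance(value, list) or not all(isinstance(d, int) for d in value):
        raise ValueError(f"expected a list of integers, got {text!r}")
    if not value:
        raise ValueError("device group must contain at least one device")
    return value


def uses_tensor_slicing(device_group: List[int]) -> bool:
    """More than one device means the model is sharded across devices."""
    return len(device_group) > 1


if __name__ == "__main__":
    for text in ("[0]", "[0,1]"):
        group = parse_device_group(text)
        print(text, "tensor slicing:", uses_tensor_slicing(group))
```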

High Level APIs: example commands for Single SoC and Tensor Slicing

QEfficient.cloud.infer, Single SoC:
python -m QEfficient.cloud.infer --model_name <model> --batch_size 1 --prompt_len 128 --ctx_len 1024 --num_cores 16 --device_group [0] --prompt "My name is" --mxfp6 --hf_token <xyz> --mos 1 --aic_enable_depth_first

QEfficient.cloud.infer, Tensor Slicing:
python -m QEfficient.cloud.infer --model_name <model> --batch_size 1 --prompt_len 128 --ctx_len 1024 --num_cores 16 --device_group [0,1,2,3] --prompt "My name is" --mxfp6 --hf_token <xyz> --mos 1 --aic_enable_depth_first

QEfficient.cloud.execute, Single SoC:
python -m QEfficient.cloud.execute --model_name <model> --device_group [0] --qpc_path <path> --prompt "My name is" --hf_token <xyz>

QEfficient.cloud.execute, Tensor Slicing:
python -m QEfficient.cloud.execute --model_name <model> --device_group [0,1,2,3] --qpc_path <path> --prompt "My name is" --hf_token <xyz>

📝 Replace <model>, <path> and <xyz> with the preferred model card name, QPC path and HF token respectively.

Using Low Level APIs

Low Level APIs and their arguments (sample use: see the numbered steps below)

QEfficient.transform
model : Mandatory
form_factor : Optional [Default="cloud"]

QEfficient.export
model_name : Mandatory
model_kv : Optional
local_model_dir : Optional [Path to custom model weights and config file]
tokenizer : Optional
cache_dir : Optional [Path to the directory used for saving HuggingFace cache, Default is "efficient-transformers/cache_dir".]
onnx_dir_path : Optional
hf_token : Optional
seq_length : Optional [Default=128]
kv : Optional [Default=True]
form_factor : Optional [Default="cloud"]

QEfficient.compile
onnx_path : Mandatory
qpc_path : Mandatory
num_cores : Mandatory
device_group : Mandatory
batch_size : Optional [Default=1]
prompt_len : Optional [Default=32]
ctx_len : Optional [Default=128]
aic_enable_depth_first : Optional [Default=False]
mos : Optional [Default=-1]
mxint8 : Optional [Default=False]
mxfp6 : Optional [Default=True]
custom_io_file_path : Optional [Default=None]

QEfficient.cloud_ai_100_exec_kv
tokenizer : Mandatory
qpc_path : Mandatory
prompt : Optional
prompts_txt_file_path : Optional
device_id : Optional [Default=[0]]
generation_len : Optional [Default=None]
enable_debug_logs : Optional [Default=False]
stream : Optional [Default=True]
write_io_dir : Optional
automation : Optional [Default=False]

In QEfficient.cloud_ai_100_exec_kv, at least one of prompt or prompts_txt_file_path must be passed.

1. Model download and Optimize for Cloud AI 100

Initialize QEfficient and transform the model. Check the list of supported architectures in the repo.

# Initiate the original Transformer model
import os

from QEfficient import QEFFAutoModelForCausalLM as AutoModelForCausalLM

# Please uncomment and use appropriate Cache Directory for transformers, in case you don't want to use default ~/.cache dir.
# os.environ["TRANSFORMERS_CACHE"] = "/local/mnt/workspace/hf_cache"

# ROOT_DIR = os.path.dirname(os.path.abspath(""))
# CACHE_DIR = os.path.join(ROOT_DIR, "tmp") #, you can use a different location for just one model by passing this param as cache_dir in below API.

# Model card name to be onboarded (this is the HF model card name), e.g. https://huggingface.co/gpt2-xl
model_name = "gpt2"  # Similarly, the model name can be changed to generate the corresponding model, provided support has been added to the library.

qeff_model = AutoModelForCausalLM.from_pretrained(model_name)
print(f"{model_name} optimized for AI 100 \n", qeff_model)

2. Export and Compile with one API

Use the qualcomm_efficient_converter API to export the KV-transformed model to ONNX and verify it against PyTorch. The compile call below runs the export and the compilation in one step.

# We can now export the modified model to the ONNX framework.
# This generates a single ONNX model for both the prefill and decode variations, optimized for
# the Cloud AI 100 platform.

# While generating the ONNX model, this clips the overflow constants to fp16
# and verifies the model on ONNX Runtime vs PyTorch.

# It then generates the inputs and the custom IO yaml file required for compilation
# and compiles the model for the provided compilation arguments.
# Please use the platform SDK to check num_cores for your card.

generated_qpc_path = qeff_model.compile(
    num_cores=14,
    mxfp6=True,
    device_group=[0],
)

3. Run Benchmark

Benchmark the model on Cloud AI 100: run the generate call below to print the generated tokens and tok/sec.

# Post compilation, we can print the latency stats for the KV models. We provide an API to print tokens and latency stats on AI 100.
# We need the compiled prefill and decode QPC to compute the tokens generated. This is based on the greedy sampling approach.

qeff_model.generate(prompts=["My name is"])
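
To make the greedy sampling and tok/sec figures concrete, here is a small self-contained sketch with a toy model in place of the compiled QPC. It is a generic illustration, not the library's generation loop; greedy_decode and toy_model are hypothetical names.

```python
import time
from typing import Callable, List, Optional, Sequence


def greedy_decode(
    next_logits: Callable[[List[int]], Sequence[float]],
    prompt_ids: Sequence[int],
    generation_len: int,
    eos_id: Optional[int] = None,
) -> List[int]:
    """Generate tokens by always picking the highest-scoring next token."""
    tokens = list(prompt_ids)
    generated: List[int] = []
    for _ in range(generation_len):
        logits = next_logits(tokens)
        # Greedy sampling: take the index of the largest logit.
        token = max(range(len(logits)), key=lambda i: logits[i])
        generated.append(token)
        tokens.append(token)
        if token == eos_id:
            break
    return generated


def toy_model(tokens: List[int]) -> List[float]:
    """Toy stand-in for a model: prefers (last token + 1) modulo the vocab size."""
    vocab_size = 5
    target = (tokens[-1] + 1) % vocab_size
    return [1.0 if i == target else 0.0 for i in range(vocab_size)]


if __name__ == "__main__":
    start = time.perf_counter()
    output = greedy_decode(toy_model, [0], generation_len=8)
    elapsed = max(time.perf_counter() - start, 1e-9)  # guard against a zero interval
    print(output, f"{len(output) / elapsed:.1f} tok/sec")
```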

End-to-end demo examples for various models are available in the notebooks directory. Please check them out.

Adding support for a new model

Watch this space for references to detailed steps, template examples and much more.

Details on KV Cache Optimization for Cloud AI 100

Note: More details are here: https://quic.github.io/cloud-ai-sdk-pages/latest/Getting-Started/Model-Architecture-Support/Large-Language-Models/llm/
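
As a rough picture of what retaining intermediate states means, the sketch below models a key/value cache with a fixed number of slots that is filled once during prefill and then extended by one position per decode step. It is a generic illustration with scalar stand-ins for tensors, not the library's or the SDK's implementation; the class and method names are hypothetical.

```python
from typing import List, Sequence


class FixedKVCache:
    """Toy key/value cache with a fixed number of slots (ctx_len)."""

    def __init__(self, ctx_len: int) -> None:
        self.ctx_len = ctx_len
        self.keys: List[float] = [0.0] * ctx_len
        self.values: List[float] = [0.0] * ctx_len
        self.length = 0  # number of slots filled so far

    def prefill(self, keys: Sequence[float], values: Sequence[float]) -> None:
        """Write the states of the whole prompt in one go."""
        if len(keys) != len(values):
            raise ValueError("keys and values must have the same length")
        if len(keys) > self.ctx_len:
            raise ValueError("prompt does not fit into the context length")
        count = len(keys)
        self.keys[:count] = list(keys)
        self.values[:count] = list(values)
        self.length = count

    def decode_step(self, key: float, value: float) -> None:
        """Append the state of one newly generated token."""
        if self.length >= self.ctx_len:
            raise IndexError("cache is full")
        self.keys[self.length] = key
        self.values[self.length] = value
        self.length += 1


if __name__ == "__main__":
    cache = FixedKVCache(ctx_len=4)
    cache.prefill([1.0, 2.0], [10.0, 20.0])
    cache.decode_step(3.0, 30.0)
    print(cache.keys, cache.values, cache.length)
```

Because the buffers never change size, earlier states are reused on every decode step instead of being recomputed.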

Acknowledgements

Thanks to:

Hugging Face Transformers for their work on LLM GenAI modeling implementations
The ONNX, PyTorch and ONNX Runtime communities.

Support

If you run into any problems with the code, please file GitHub issues directly in this repo.

Contributing

This project welcomes contributions and suggestions. Please check the License. Integration with a CLA Bot is underway.
