
This library empowers users to seamlessly port pretrained models and checkpoints from the HuggingFace (HF) hub (developed using the HF transformers library) into inference-ready formats that run efficiently on Qualcomm Cloud AI 100 accelerators.


Qualcomm Transformers Library


Latest news 🔥

  • [coming soon] Support for more popular models and inference optimization techniques like continuous batching and speculative decoding

Train anywhere, Infer on Qualcomm Cloud AI with a Developer-centric Toolchain

This library provides reimplemented blocks of LLMs that make the models functional and highly performant on Qualcomm Cloud AI 100. Several models can be directly transformed from their pre-trained original form into a deployment-ready, optimized form. For other models, comprehensive documentation describes the required changes and provides How-To guides.

Typically for LLMs, the library provides:

  1. Reimplemented blocks from Transformers that enable efficient on-device retention of intermediate states.
  2. Graph transformations to enable execution of key operations in lower precision.
  3. Graph transformations to replace some operations with other, mathematically equivalent operations.
  4. Handling for underflows and overflows in lower precision.
  5. Patcher modules to map the weights of the original model's operations to the updated model's operations.
  6. An exporter module to export the model source into an ONNX graph.
  7. Sample example applications and demo notebooks.
  8. Unit test templates.

It is mandatory for each Pull Request to include tests such as:

  1. If the PR adds support for a model, the tests should include successful execution of the model after the changes (those included as part of the PR) on PyTorch and ONNXRT. The exit criterion is the MSE between the outputs of the original and updated models (see the sketch after this list).
  2. If the PR modifies any common utilities, tests that run all models included in the library must be added.
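
For item 1, the MSE criterion can be checked with a small comparison along the following lines. This is a minimal sketch; the helper name, tolerance, and the assumption that both models accept the same inputs are illustrative, not the repository's actual test utilities.

# Minimal sketch of an MSE check between original and updated model outputs.
# Helper name, tolerance, and the assumption that both models accept the same
# inputs are illustrative, not the repository's actual test utilities.
import torch

def assert_outputs_close(original_model, updated_model, input_ids, tolerance=1e-2):
    original_model.eval()
    updated_model.eval()
    with torch.no_grad():
        ref_logits = original_model(input_ids=input_ids).logits
        new_logits = updated_model(input_ids=input_ids).logits
    mse = torch.mean((ref_logits.float() - new_logits.float()) ** 2).item()
    assert mse < tolerance, f"MSE {mse} exceeds tolerance {tolerance}"
    return mse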

Validated Models

Models Coming Soon

Requirements

System Requirements:

  1. Supported Linux OS - Ubuntu, RHEL and AWS Linux
  2. Pre-requisites installed
  3. Cloud AI 100 Platform and Apps SDK installed
  4. Multi-device support enabled for model sharding

💡 Use a bash terminal.

📝 If using a zsh terminal, then "device_group" should be in single quotes, e.g. "--device_group '[0]'".

Installation

pip install -U pip
pip install git+https://github.com/quic/efficient-transformers
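
As a quick sanity check (not an official verification step), you can confirm the package imports after installation:

# Verify the installation by importing the package; prints its install location
python -c "import QEfficient; print(QEfficient.__file__)"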

Quick Start Guide

The QEfficient library was designed with one goal: to make onboarding models for inference straightforward for any Transformer architecture, while leveraging the complete power of the Cloud AI platform.

To achieve this, we have 2 levels of APIs, with different levels of abstraction.

  1. High-level APIs abstract away complex details, offering a simpler interface. They're ideal for quick development and prototyping. If you're new to a technology or want to minimize coding effort, high-level APIs are more user-friendly.

  2. Low-level APIs offer more granular control, ideal when customization is necessary. These are particularly useful for users bringing their own models that are not hosted on HF but are implemented based on Transformers.

In summary:

  • Choose high-level APIs for quick development, simplicity, and ease of use.
  • Opt for low-level APIs when you need fine-tuned control, optimization, or advanced customization.

Using High Level APIs

High Level APIs and their arguments

QEfficient.cloud.infer
  • model_name : $\color{green} {Mandatory}$
  • num_cores : $\color{green} {Mandatory}$
  • device_group : $\color{green} {Mandatory}$
  • **prompt : Optional
  • **prompts_txt_file_path : Optional
  • aic_enable_depth_first : Optional
  • mos : Optional [Default=-1]
  • batch_size : Optional [Default=1]
  • prompt_len : Optional [Default=32]
  • ctx_len : Optional [Default=128]
  • generation_len : Optional [Default=None]
  • mxfp6 : Optional
  • mxint8 : Optional
  • local_model_dir : Optional [Path to custom model weights and config file]
  • cache_dir : Optional [Path to the directory used for saving HuggingFace cache, Default is "efficient-transformers/cache_dir"]
  • hf_token : Optional
  • verbose : Optional

QEfficient.cloud.execute
  • model_name : $\color{green} {Mandatory}$
  • qpc_path : $\color{green} {Mandatory}$
  • device_group : $\color{green} {Mandatory}$
  • local_model_dir : Optional [Path to custom model weights and config file]
  • **prompt : Optional
  • **prompts_txt_file_path : Optional
  • generation_len : Optional [Default=None]
  • cache_dir : Optional [Path to the directory used for saving HuggingFace cache, Default is "efficient-transformers/cache_dir"]
  • hf_token : Optional

** One of prompt or prompts_txt_file_path must be passed.

    1. Use QEfficient.cloud.infer

    This is the single end-to-end Python API in the library; it takes the model card name as input, along with other compile arguments if necessary, and does everything in one go.

    • Torch Download → Optimize for Cloud AI 100 → Export to ONNX → Verify (CPU) → Compile on Cloud AI 100 → Execute
    • It skips the ONNX export/compile stage if an ONNX file or QPC is already found at the expected path.
    # Check out the options using the help menu
    python -m QEfficient.cloud.infer --help
    python -m QEfficient.cloud.infer --model_name gpt2 --batch_size 1 --prompt_len 32 --ctx_len 128 --mxfp6 --num_cores 16 --device_group [0] --prompt "My name is" --mos 1 --aic_enable_depth_first

    # If executing for batch size > 1,

    # either pass the input prompts as a single string, separated by the pipe (|) symbol. Example below:

    python -m QEfficient.cloud.infer --model_name gpt2 --batch_size 3 --prompt_len 32 --ctx_len 128 --num_cores 16 --device_group [0] --prompt "My name is|The flat earth theory is the belief that|The sun rises from" --mxfp6 --mos 1 --aic_enable_depth_first

    # or pass the path of a txt file with input prompts. Example below; a sample txt file (prompts.txt) is present in the examples folder.

    python -m QEfficient.cloud.infer --model_name gpt2 --batch_size 3 --prompt_len 32 --ctx_len 128 --num_cores 16 --device_group [0] --prompts_txt_file_path examples/prompts.txt --mxfp6 --mos 1 --aic_enable_depth_first

    2. Use of QEfficient.cloud.execute

    Once the QPC has been compiled, we can reuse the precompiled QPC with the execute API to run different prompts, as shown below:

    python -m QEfficient.cloud.execute --model_name gpt2 --qpc_path qeff_models/gpt2/qpc_16cores_1BS_32PL_128CL_1devices_mxfp6/qpcs --prompt "Once upon a time in" --device_group [0]  

    We can also enable MQ (multi-device execution) simply based on the number of devices. Using the "--device_group" input, a tensor-slicing (TS) config is created on the fly: with "--device_group [0,1]" a TS config for 2 devices is created and used for compilation, while with "--device_group [0]" TS compilation is skipped and single-SoC execution is enabled.

    python -m QEfficient.cloud.infer --model_name Salesforce/codegen-2B-mono --batch_size 1 --prompt_len 32 --ctx_len 128 --mxfp6 --num_cores 16 --device_group [0,1] --prompt "def fibonacci(n):" --mos 1 --aic_enable_depth_first  
     
    # Once the QPC is saved, you can use the execute API to run it with different prompts
    python -m QEfficient.cloud.execute --model_name Salesforce/codegen-2B-mono --qpc_path qeff_models/Salesforce/codegen-2B-mono/qpc_16cores_1BS_32PL_128CL_2devices_mxfp6/qpcs --prompt "def binary_search(array: np.array, k: int):" --device_group [0,1]

    # To disable MQ, just pass a single SoC like below:
    python -m QEfficient.cloud.infer --model_name gpt2 --batch_size 1 --prompt_len 32 --ctx_len 128 --mxfp6 --num_cores 16 --device_group [0] --prompt "My name is" --mos 1 --aic_enable_depth_first
    High Level API examples: Single SoC vs Tensor Slicing

    QEfficient.cloud.infer
      Single SoC:     python -m QEfficient.cloud.infer --model_name $\color{green}{model}$ --batch_size 1 --prompt_len 128 --ctx_len 1024 --num_cores 16 --device_group [0] --prompt "My name is" --mxfp6 --hf_token $\color{green}{xyz}$ --mos 1 --aic_enable_depth_first
      Tensor Slicing: python -m QEfficient.cloud.infer --model_name $\color{green}{model}$ --batch_size 1 --prompt_len 128 --ctx_len 1024 --num_cores 16 --device_group [0,1,2,3] --prompt "My name is" --mxfp6 --hf_token $\color{green}{xyz}$ --mos 1 --aic_enable_depth_first

    QEfficient.cloud.execute
      Single SoC:     python -m QEfficient.cloud.execute --model_name $\color{green}{model}$ --device_group [0] --qpc_path $\color{green}{path}$ --prompt "My name is" --hf_token $\color{green}{xyz}$
      Tensor Slicing: python -m QEfficient.cloud.execute --model_name $\color{green}{model}$ --device_group [0,1,2,3] --qpc_path $\color{green}{path}$ --prompt "My name is" --hf_token $\color{green}{xyz}$

    📝 Replace $\color{green}{model}$, $\color{green}{path}$ and $\color{green}{xyz}$ with the preferred model card name, QPC path and HF token respectively.

Using Low Level APIs

Low Level APIs and their arguments

QEfficient.transform
  • model : $\color{green} {Mandatory}$
  • form_factor : Optional [Default="cloud"]

QEfficient.export
  • model_name : $\color{green} {Mandatory}$
  • model_kv : Optional
  • local_model_dir : Optional [Path to custom model weights and config file]
  • tokenizer : Optional
  • cache_dir : Optional [Path to the directory used for saving HuggingFace cache, Default is "efficient-transformers/cache_dir"]
  • onnx_dir_path : Optional
  • hf_token : Optional
  • seq_length : Optional [Default=128]
  • kv : Optional [Default=True]
  • form_factor : Optional [Default="cloud"]

QEfficient.compile
  • onnx_path : $\color{green} {Mandatory}$
  • qpc_path : $\color{green} {Mandatory}$
  • num_cores : $\color{green} {Mandatory}$
  • device_group : $\color{green} {Mandatory}$
  • batch_size : Optional [Default=1]
  • prompt_len : Optional [Default=32]
  • ctx_len : Optional [Default=128]
  • aic_enable_depth_first : Optional [Default=False]
  • mos : Optional [Default=-1]
  • mxint8 : Optional [Default=False]
  • mxfp6 : Optional [Default=True]
  • custom_io_file_path : Optional [Default=None]

QEfficient.cloud_ai_100_exec_kv
  • tokenizer : $\color{green} {Mandatory}$
  • qpc_path : $\color{green} {Mandatory}$
  • **prompt : Optional
  • **prompts_txt_file_path : Optional
  • device_id : Optional [Default=[0]]
  • generation_len : Optional [Default=None]
  • enable_debug_logs : Optional [Default=False]
  • stream : Optional [Default=True]
  • write_io_dir : Optional
  • automation : Optional [Default=False]

** In QEfficient.cloud_ai_100_exec_kv, at least one of prompt or prompts_txt_file_path must be passed.
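
Putting these pieces together, a typical low-level flow might look like the sketch below. The keyword names follow the argument tables above, but return values and other details (e.g. whether export returns the ONNX path) are assumptions, so treat it as an illustration rather than the canonical recipe; the step-by-step walkthrough that follows uses the higher-level QEFFAutoModelForCausalLM class instead.

# Illustrative sketch of the low-level flow. Keyword names follow the tables
# above; return values and some details are assumptions, not verified APIs.
from transformers import AutoModelForCausalLM, AutoTokenizer

import QEfficient

model_name = "gpt2"
model = AutoModelForCausalLM.from_pretrained(model_name)
tokenizer = AutoTokenizer.from_pretrained(model_name)

# 1. Swap in the Cloud AI 100 friendly (KV-retaining) blocks.
model = QEfficient.transform(model, form_factor="cloud")

# 2. Export the transformed model to ONNX (assumed here to return the ONNX path).
onnx_path = QEfficient.export(model_name=model_name, model_kv=model, tokenizer=tokenizer)

# 3. Compile the ONNX graph into a QPC for the target device(s).
qpc_path = QEfficient.compile(
    onnx_path=onnx_path,
    qpc_path="qeff_models/gpt2",
    num_cores=16,
    device_group=[0],
)

# 4. Run KV-cached generation on Cloud AI 100 with the precompiled QPC.
QEfficient.cloud_ai_100_exec_kv(tokenizer=tokenizer, qpc_path=qpc_path, prompt="My name is")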

    1. Model download and Optimize for Cloud AI 100

    Initialize QEfficient and transform the model. Check the list of supported architectures in the repo.

    # Initialize the original Transformers model
    import os

    from QEfficient import QEFFAutoModelForCausalLM as AutoModelForCausalLM

    # Uncomment and set an appropriate cache directory for transformers if you don't want to use the default ~/.cache dir.
    # os.environ["TRANSFORMERS_CACHE"] = "/local/mnt/workspace/hf_cache"

    # ROOT_DIR = os.path.dirname(os.path.abspath(""))
    # CACHE_DIR = os.path.join(ROOT_DIR, "tmp")  # You can use a different location for just one model by passing this as cache_dir in the API below.

    # Model card name to be onboarded (this is the HF model card name): https://huggingface.co/gpt2-xl
    model_name = "gpt2"  # Similarly, we can change the model name and generate the corresponding model, if support has been added in the lib.

    qeff_model = AutoModelForCausalLM.from_pretrained(model_name)
    print(f"{model_name} optimized for AI 100 \n", qeff_model)

    2. Export and Compile with one API

    Use the compile API to export the KV-transformed model to ONNX, verify it against Torch, and compile it for Cloud AI 100.

    # We can now export the modified model to the ONNX framework.
    # This will generate a single ONNX model for both prefill and decode variations, optimized for
    # the Cloud AI 100 platform.

    # While generating the ONNX model, overflow constants are clipped to fp16,
    # and the model is verified on ONNX Runtime vs PyTorch.

    # The inputs and the custom IO yaml file required for compilation are then generated,
    # and the model is compiled with the provided compilation arguments.
    # Please use the Platform SDK to check num_cores for your card.
    
    generated_qpc_path = qeff_model.compile(
        num_cores=14,
        mxfp6=True,
        device_group=[0],
    )
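
    If different sizes are needed, the prompt length, context length, and batch size can also be set at compile time. The sketch below assumes qeff_model.compile accepts the same optional arguments listed for QEfficient.compile in the table above; treat the extra keywords as an assumption.

    # Hypothetical variant with explicit sizes; assumes these keyword arguments
    # mirror the QEfficient.compile options listed in the table above.
    generated_qpc_path = qeff_model.compile(
        num_cores=14,
        device_group=[0],
        batch_size=1,
        prompt_len=32,
        ctx_len=128,
        mxfp6=True,
        aic_enable_depth_first=True,
    )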

    3. Run Benchmark

    Benchmark the model on Cloud AI 100 by running the generate API, which prints the generated tokens and tokens/sec.

    # Post compilation, we can print the latency stats for the KV models. We provide an API to print token and latency stats on AI 100.
    # The compiled prefill and decode QPCs are needed to compute the generated tokens; this is based on a greedy sampling approach.
    
    qeff_model.generate(prompts=["My name is"])

    End-to-end demo examples for various models are available in the notebooks directory. Please check them out.

    Adding support for a new model

    Watch this space for references to detailed steps, template examples and much more.

    Details on KV Cache Optimization for Cloud AI 100

    [Figure: KV cache optimization for Cloud AI 100]

    Note: More details are here: https://quic.github.io/cloud-ai-sdk-pages/latest/Getting-Started/Model-Architecture-Support/Large-Language-Models/llm/
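
    For intuition, the sketch below shows the generic prefill/decode KV-cache pattern with plain HF transformers on CPU/GPU: the prompt is processed once, and each decode step feeds only the newly generated token while reusing the cached keys/values. On Cloud AI 100 the retained states live on the device and are managed by the compiled QPC, so this snippet is a conceptual illustration only, not the library's implementation.

    # Generic prefill/decode KV-cache pattern with HF transformers (greedy decoding).
    # Conceptual illustration only; not the on-device Cloud AI 100 implementation.
    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained("gpt2")
    model = AutoModelForCausalLM.from_pretrained("gpt2").eval()

    inputs = tokenizer("My name is", return_tensors="pt")
    with torch.no_grad():
        # Prefill: run the whole prompt once and keep the key/value cache.
        out = model(**inputs, use_cache=True)
        past_key_values = out.past_key_values
        next_token = out.logits[:, -1].argmax(dim=-1, keepdim=True)

        generated = [next_token]
        for _ in range(7):
            # Decode: feed only the new token, reusing the cached keys/values.
            out = model(input_ids=next_token, past_key_values=past_key_values, use_cache=True)
            past_key_values = out.past_key_values
            next_token = out.logits[:, -1].argmax(dim=-1, keepdim=True)
            generated.append(next_token)

    print(tokenizer.decode(torch.cat(generated, dim=1)[0]))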

    Acknowledgements

    Thanks to:

    • HuggingFace Transformers for its work on LLM/GenAI modeling implementations
    • The ONNX, PyTorch, and ONNX Runtime communities.

    Support

    If you run into any problems with the code, please file GitHub issues directly against this repo.

    Contributing

    This project welcomes contributions and suggestions. Please check the License. Integration with a CLA Bot is underway.
