- Q-ViT: Accurate and Fully Quantized Low-bit Vision Transformer
- XTC: Extreme Compression for Pre-trained Transformers Made Simple and Efficient
- BiT: Robustly Binarized Multi-distilled Transformer
- LLM.int8(): 8-bit Matrix Multiplication for Transformers at Scale
- Understanding INT4 Quantization for Transformer Models: Latency Speedup, Composability, and Failure Cases
- OPTQ: Accurate Quantization for Generative Pre-trained Transformers
- QLoRA: Efficient Finetuning of Quantized LLMs
- SparseGPT: Massive Language Models Can Be Accurately Pruned in One-Shot
- FlashAttention: Fast and Memory-Efficient Exact Attention with IO-Awareness
- EfficientFormer: Vision Transformers at MobileNet Speed
- COMCAT: Towards Efficient Compression and Customization of Attention-Based Vision Models
- Token Merging: Your ViT But Faster
- LoRA: Low-Rank Adaptation of Large Language Models
- Less is More: Task-aware Layer-wise Distillation for Language Model Compression
- Megatron-LM: Training Multi-Billion Parameter Language Models Using Model Parallelism
- DeepSpeed Inference: Enabling Efficient Inference of Transformer Models at Unprecedented Scale