- Q-ViT: Accurate and Fully Quantized Low-bit Vision Transformer
- XTC: Extreme Compression for Pre-trained Transformers Made Simple and Efficient
- BiT: Robustly Binarized Multi-distilled Transformer
- LLM.int8(): 8-bit Matrix Multiplication for Transformers at Scale
- Understanding INT4 Quantization for Transformer Models: Latency Speedup, Composability, and Failure Cases
- OPTQ: Accurate Quantization for Generative Pre-trained Transformers
- QLoRA: Efficient Finetuning of Quantized LLMs
- SparseGPT: Massive Language Models Can Be Accurately Pruned in One-Shot
- FlashAttention: Fast and Memory-Efficient Exact Attention with IO-Awareness
- EfficientFormer: Vision Transformers at MobileNet Speed
- COMCAT: Towards Efficient Compression and Customization of Attention-Based Vision Models
- Token Merging: Your ViT But Faster
- LoRA: Low-Rank Adaptation of Large Language Models
- Less is More: Task-aware Layer-wise Distillation for Language Model Compression
- Megatron-LM: Training Multi-Billion Parameter Language Models Using Model Parallelism
- DeepSpeed Inference: Enabling Efficient Inference of Transformer Models at Unprecedented Scale