Large Pretraining Models

Quantization

  • Q-ViT: Accurate and Fully Quantized Low-bit Vision Transformer
  • XTC: Extreme Compression for Pre-trained Transformers Made Simple and Efficient
  • BiT: Robustly Binarized Multi-distilled Transformer
  • LLM.int8(): 8-bit Matrix Multiplication for Transformers at Scale
  • Understanding INT4 Quantization for Transformer Models: Latency Speedup, Composability, and Failure Cases
  • OPTQ: Accurate Quantization for Generative Pre-trained Transformers
  • QLoRA: Efficient Finetuning of Quantized LLMs
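
The papers above all build on low-bit weight representations. As a minimal, generic sketch (symmetric per-tensor int8 round-to-nearest in NumPy; not the calibration, outlier handling, or 4-bit formats used in LLM.int8(), OPTQ, or QLoRA):

```python
import numpy as np

def quantize_int8(w: np.ndarray):
    """Symmetric per-tensor int8 quantization: w ~= scale * q."""
    max_abs = np.abs(w).max()
    scale = max_abs / 127.0 if max_abs > 0 else 1.0
    q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize_int8(q: np.ndarray, scale: float) -> np.ndarray:
    return q.astype(np.float32) * scale

# Quantize a random weight matrix and measure the reconstruction error.
w = np.random.randn(256, 256).astype(np.float32)
q, scale = quantize_int8(w)
w_hat = dequantize_int8(q, scale)
print("max abs error:", np.abs(w - w_hat).max())
```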

Pruning

  • SparseGPT: Massive Language Models Can be Accurately Pruned in One-Shot
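
For orientation, a baseline magnitude-pruning sketch in NumPy; SparseGPT itself uses a Hessian-based one-shot weight reconstruction rather than this simple thresholding:

```python
import numpy as np

def magnitude_prune(w: np.ndarray, sparsity: float) -> np.ndarray:
    """Zero out the smallest-magnitude weights until `sparsity` fraction is zero."""
    k = int(sparsity * w.size)
    if k == 0:
        return w.copy()
    threshold = np.partition(np.abs(w).ravel(), k - 1)[k - 1]
    mask = np.abs(w) > threshold
    return w * mask

w = np.random.randn(512, 512).astype(np.float32)
w_sparse = magnitude_prune(w, sparsity=0.5)
print("achieved sparsity:", float((w_sparse == 0).mean()))
```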

Attention

  • FlashAttention: Fast and Memory-Efficient Exact Attention with IO-Awareness
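
Assuming this entry refers to FlashAttention-style IO-aware attention, here is a NumPy sketch of the underlying online-softmax idea (blocked over keys/values, never materializing the full score matrix); the actual paper implements this as fused, tiled GPU kernels:

```python
import numpy as np

def blocked_attention(q, k, v, block=64):
    """Attention over key/value blocks with a streaming softmax.
    Numerically equivalent to softmax(q k^T / sqrt(d)) @ v."""
    n, d = q.shape
    scale = 1.0 / np.sqrt(d)
    out = np.zeros_like(q)                  # running weighted sum of values
    row_max = np.full(n, -np.inf)           # running max of scores per query
    row_sum = np.zeros(n)                   # running softmax denominator

    for start in range(0, n, block):
        kb = k[start:start + block]
        vb = v[start:start + block]
        s = (q @ kb.T) * scale              # (n, block) scores for this block

        new_max = np.maximum(row_max, s.max(axis=1))
        correction = np.exp(row_max - new_max)   # rescale old accumulators
        p = np.exp(s - new_max[:, None])         # block softmax numerators

        out = out * correction[:, None] + p @ vb
        row_sum = row_sum * correction + p.sum(axis=1)
        row_max = new_max

    return out / row_sum[:, None]

# Check against the naive reference implementation.
rng = np.random.default_rng(0)
q, k, v = (rng.standard_normal((128, 32)) for _ in range(3))
s = (q @ k.T) / np.sqrt(32)
p = np.exp(s - s.max(axis=1, keepdims=True))
ref = (p / p.sum(axis=1, keepdims=True)) @ v
assert np.allclose(blocked_attention(q, k, v), ref, atol=1e-6)
```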

Architecture Optimization

  • EfficientFormer: Vision Transformers at MobileNet Speed

Compression

  • COMCAT: Towards Efficient Compression and Customization of Attention-Based Vision Models
  • Token Merging: Your ViT But Faster
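
A toy sketch of the token-merging idea: shrink the token sequence by averaging similar tokens. This greedy adjacent-pair variant is only an illustration, not ToMe's bipartite soft matching:

```python
import numpy as np

def merge_tokens(x: np.ndarray, r: int) -> np.ndarray:
    """Merge the r most-similar adjacent token pairs by averaging them."""
    for _ in range(r):
        xn = x / np.linalg.norm(x, axis=1, keepdims=True)
        sims = (xn[:-1] * xn[1:]).sum(axis=1)      # cosine similarity of neighbors
        i = int(np.argmax(sims))                   # most similar adjacent pair
        merged = (x[i] + x[i + 1]) / 2.0
        x = np.concatenate([x[:i], merged[None], x[i + 2:]], axis=0)
    return x

tokens = np.random.randn(197, 768).astype(np.float32)   # e.g. a ViT-B/16 sequence
print(merge_tokens(tokens, r=16).shape)                  # (181, 768)
```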

Low-Rank

  • LoRA: Low-Rank Adaptation of Large Language Models
  • QLoRA: Efficient Finetuning of Quantized LLMs
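
A minimal NumPy sketch of the LoRA idea: keep the pretrained weight frozen and learn a low-rank update scaled by alpha/r (QLoRA additionally stores the frozen base weight in 4-bit, which is not shown here):

```python
import numpy as np

class LoRALinear:
    """Frozen linear layer W plus a trainable low-rank update (alpha/r) * B @ A.
    Only A and B would be updated during fine-tuning; W stays fixed."""
    def __init__(self, w: np.ndarray, r: int = 8, alpha: float = 16.0):
        d_out, d_in = w.shape
        self.w = w                                   # frozen pretrained weight
        self.a = np.random.randn(r, d_in) * 0.01     # A: small random init
        self.b = np.zeros((d_out, r))                # B: zero init -> no change at start
        self.scaling = alpha / r

    def __call__(self, x: np.ndarray) -> np.ndarray:
        return x @ self.w.T + self.scaling * (x @ self.a.T) @ self.b.T

layer = LoRALinear(np.random.randn(512, 768), r=8, alpha=16.0)
y = layer(np.random.randn(4, 768))
print(y.shape)   # (4, 512)
```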

Distillation

  • Less is More: Task-aware Layer-wise Distillation for Language Model Compression
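
For reference, a plain logit-distillation loss in NumPy (temperature-softened KL against teacher outputs); the listed paper goes further with task-aware, layer-wise matching of hidden representations:

```python
import numpy as np

def softmax(z, t=1.0):
    z = z / t
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def kd_loss(student_logits, teacher_logits, t=2.0):
    """KL(teacher || student) on temperature-softened distributions,
    scaled by t^2 as in standard knowledge distillation."""
    p = softmax(teacher_logits, t)                         # soft teacher targets
    log_q = np.log(softmax(student_logits, t) + 1e-12)
    kl = (p * (np.log(p + 1e-12) - log_q)).sum(axis=-1)
    return float(kl.mean() * t * t)

teacher = np.random.randn(32, 1000)
student = np.random.randn(32, 1000)
print(kd_loss(student, teacher, t=2.0))
```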

System

  • Megatron-LM: Training Multi-Billion Parameter Language Models Using Model Parallelism
  • DeepSpeed Inference: Enabling Efficient Inference of Transformer Models at Unprecedented Scale
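
A single-process NumPy sketch of Megatron-style tensor parallelism for a transformer MLP: the first weight is split column-wise, the second row-wise, and the partial outputs are summed, which stands in for the all-reduce across GPUs:

```python
import numpy as np

def gelu(z):
    # tanh approximation of GeLU, applied locally on each shard
    return 0.5 * z * (1 + np.tanh(np.sqrt(2 / np.pi) * (z + 0.044715 * z**3)))

def parallel_mlp(x, w1, w2, n_dev=4):
    """Simulated tensor-parallel 2-layer MLP: column-split w1, row-split w2,
    then sum the per-device partial outputs (the all-reduce step)."""
    w1_shards = np.split(w1, n_dev, axis=1)      # column-parallel first layer
    w2_shards = np.split(w2, n_dev, axis=0)      # row-parallel second layer
    partials = [gelu(x @ w1_i) @ w2_i for w1_i, w2_i in zip(w1_shards, w2_shards)]
    return np.sum(partials, axis=0)              # "all-reduce" across devices

x = np.random.randn(8, 1024)
w1 = np.random.randn(1024, 4096)
w2 = np.random.randn(4096, 1024)
ref = gelu(x @ w1) @ w2                          # unsplit reference MLP
assert np.allclose(parallel_mlp(x, w1, w2), ref, atol=1e-6)
```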

LLM Family