# Auto-scaling Vision Transformers Without Training


## Contributions

A NAS-like unified framework that efficiently automates the search for both ViT backbone design and its scaling.

## Takeaways


## Quotes

> ViTs can tolerate coarse tokenization in early training stages.
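A minimal sketch of what coarse vs. fine tokenization means for token count, using a simple non-overlapping patchify (the helper name and sizes here are illustrative, not the paper's actual tokenizer):

```python
import numpy as np

def patchify(img, patch):
    # Split an (H, W, C) image into non-overlapping patch tokens,
    # one flattened vector per patch.
    H, W, C = img.shape
    assert H % patch == 0 and W % patch == 0
    tokens = img.reshape(H // patch, patch, W // patch, patch, C)
    tokens = tokens.transpose(0, 2, 1, 3, 4).reshape(-1, patch * patch * C)
    return tokens

img = np.random.rand(32, 32, 3)

coarse = patchify(img, patch=8)  # early training: 4x4 = 16 tokens
fine = patchify(img, patch=4)    # later training: 8x8 = 64 tokens
print(coarse.shape, fine.shape)  # (16, 192) (64, 48)
```

Training with coarse tokens early on cuts the quadratic attention cost, then switching to finer tokens recovers spatial detail.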

> Attention maps of ViTs gradually become similar in deeper layers, leading to identical feature maps and saturated performance.

> The NTK condition number is used to indicate the trainability of ViTs.
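A sketch of the NTK condition number as a trainability proxy: compute the empirical NTK of a tiny network on a few samples as J J^T (J = Jacobian of outputs w.r.t. parameters), then take the ratio of its largest to smallest eigenvalue. The toy MLP, finite-difference Jacobian, and sizes are my assumptions for a runnable illustration, not the paper's setup:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy two-layer MLP with weights flattened into one parameter vector
# (hypothetical sizes, for illustration only).
d_in, d_hidden, n_samples = 4, 8, 16
W1 = rng.normal(size=(d_hidden, d_in)) / np.sqrt(d_in)
W2 = rng.normal(size=(1, d_hidden)) / np.sqrt(d_hidden)
X = rng.normal(size=(n_samples, d_in))
params = np.concatenate([W1.ravel(), W2.ravel()])

def forward(p, x):
    # Scalar output of the MLP for one input x.
    w1 = p[: d_hidden * d_in].reshape(d_hidden, d_in)
    w2 = p[d_hidden * d_in:].reshape(1, d_hidden)
    return (w2 @ np.tanh(w1 @ x)).item()

def jacobian(p, X, eps=1e-5):
    # Finite-difference Jacobian of outputs w.r.t. parameters,
    # shape (n_samples, n_params).
    J = np.zeros((X.shape[0], p.size))
    for j in range(p.size):
        p_plus, p_minus = p.copy(), p.copy()
        p_plus[j] += eps
        p_minus[j] -= eps
        for i, x in enumerate(X):
            J[i, j] = (forward(p_plus, x) - forward(p_minus, x)) / (2 * eps)
    return J

J = jacobian(params, X)
ntk = J @ J.T                          # empirical NTK, (n_samples, n_samples)
eigs = np.linalg.eigvalsh(ntk)
kappa = eigs[-1] / eigs[0]             # condition number = lambda_max / lambda_min
print(f"NTK condition number: {kappa:.2f}")
```

A smaller condition number suggests more uniform convergence across training directions; a huge one flags a hard-to-train architecture, which is what makes it usable as a training-free ranking signal.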