A NAS-like unified framework that efficiently automates the search for ViT backbone design and scaling.
Quotes
ViTs can tolerate coarse tokenization in early training stages.
Attention maps of ViTs gradually become similar in deeper layers, leading to identical feature maps and saturated performance.
The NTK condition number is used to indicate the trainability of ViTs.
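
To make the last note concrete: the condition number of the neural tangent kernel (NTK) is the ratio of the largest to the smallest eigenvalue of the Gram matrix built from per-sample parameter gradients, and a smaller value is usually read as easier optimization. The following is a minimal sketch, assuming a toy two-layer tanh network in place of a ViT and a scalar output; the function name `ntk_condition_number`, the shapes, and the random inputs are all hypothetical choices made for illustration and are not taken from the source.

```python
import numpy as np

def ntk_condition_number(X, W1, w2):
    """Condition number of the empirical NTK of f(x) = w2 . tanh(W1 x).

    Toy stand-in for a ViT (assumption): the same recipe applies to any
    model once per-sample parameter gradients are available.
    """
    H = np.tanh(X @ W1.T)                # (n, h) hidden activations
    grad_w2 = H                          # df/dw2[k] = tanh(W1 x)[k]
    delta = w2 * (1.0 - H ** 2)          # (n, h) backprop through tanh
    # df/dW1[k, d] = w2[k] * (1 - H[k]^2) * x[d], flattened per sample
    grad_W1 = (delta[:, :, None] * X[:, None, :]).reshape(len(X), -1)
    J = np.concatenate([grad_W1, grad_w2], axis=1)   # (n, num_params)
    ntk = J @ J.T                        # (n, n) Gram matrix of gradients
    eig = np.linalg.eigvalsh(ntk)        # eigenvalues in ascending order
    return eig[-1] / eig[0]

# Hypothetical sizes: 8 samples, 16 input features, 64 hidden units.
rng = np.random.default_rng(0)
n, d, h = 8, 16, 64
X = rng.standard_normal((n, d))
W1 = rng.standard_normal((h, d)) / np.sqrt(d)
w2 = rng.standard_normal(h) / np.sqrt(h)

kappa = ntk_condition_number(X, W1, w2)
print(f"NTK condition number: {kappa:.2f}")
```

Because the kernel is evaluated at initialization, no training is involved; for a real ViT the Jacobian would come from automatic differentiation rather than the hand-written gradients used here.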