# SPT-LSA ViT: Training ViT on Small-Size Datasets

This is an unofficial PyTorch implementation of the paper Vision Transformer for Small-Size Datasets.

The provided configuration has been trained on CIFAR-10 and shows promising results.

The main components of the paper are:

The ViT architecture:


The Shifted Patch Tokenizer (which increases the locality inductive bias):

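Shifted Patch Tokenization can be sketched as follows: the input image is shifted diagonally by half a patch size in four directions, the shifted copies are concatenated with the original along the channel axis, and the result is patchified, normalized, and linearly projected. This is an illustrative sketch, not the repo's `models.py` code; the class name and hyperparameters are assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ShiftedPatchTokenizer(nn.Module):
    """Sketch of Shifted Patch Tokenization: four diagonal half-patch
    shifts concatenated channel-wise before patch embedding."""

    def __init__(self, in_channels=3, patch_size=4, dim=192):
        super().__init__()
        self.patch_size = patch_size
        # original image + 4 shifted copies, flattened per patch
        patch_dim = in_channels * 5 * patch_size ** 2
        self.norm = nn.LayerNorm(patch_dim)
        self.proj = nn.Linear(patch_dim, dim)

    def forward(self, x):
        s = self.patch_size // 2
        # diagonal shifts realized as pad-and-crop (negative pad crops),
        # so spatial size is preserved: (left, right, top, bottom)
        pads = [(s, -s, s, -s), (-s, s, s, -s), (s, -s, -s, s), (-s, s, -s, s)]
        shifted = [F.pad(x, p) for p in pads]
        x = torch.cat([x] + shifted, dim=1)  # (B, 5C, H, W)
        # non-overlapping patches: (B, 5C*p*p, N) -> (B, N, 5C*p*p)
        patches = F.unfold(x, self.patch_size, stride=self.patch_size).transpose(1, 2)
        return self.proj(self.norm(patches))  # (B, N, dim)
```

For a 32x32 CIFAR-10 image with `patch_size=4`, this yields 64 tokens of dimension `dim`.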

The Locality Self-Attention:

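Locality Self-Attention combines two changes to standard self-attention: the fixed 1/sqrt(d) scaling is replaced by a learnable temperature, and each token's similarity with itself is masked to -inf before the softmax, pushing attention toward other tokens. The sketch below assumes a single unbatched mask over all tokens; names and hyperparameters are illustrative, not the repo's API.

```python
import torch
import torch.nn as nn

class LocalitySelfAttention(nn.Module):
    """Sketch of Locality Self-Attention (LSA): learnable softmax
    temperature + diagonal (self-token) masking."""

    def __init__(self, dim=192, heads=8):
        super().__init__()
        self.heads = heads
        head_dim = dim // heads
        # learnable temperature, initialized to the usual 1/sqrt(d_head)
        self.temperature = nn.Parameter(torch.tensor(head_dim ** -0.5))
        self.to_qkv = nn.Linear(dim, dim * 3, bias=False)
        self.to_out = nn.Linear(dim, dim)

    def forward(self, x):
        b, n, d = x.shape
        qkv = self.to_qkv(x).reshape(b, n, 3, self.heads, d // self.heads)
        q, k, v = qkv.permute(2, 0, 3, 1, 4)  # each: (b, heads, n, head_dim)
        attn = (q @ k.transpose(-2, -1)) * self.temperature
        # mask each token's similarity with itself so attention is
        # redistributed to the other tokens
        diag = torch.eye(n, dtype=torch.bool, device=x.device)
        attn = attn.masked_fill(diag, float('-inf'))
        attn = attn.softmax(dim=-1)
        out = (attn @ v).transpose(1, 2).reshape(b, n, d)
        return self.to_out(out)
```

Note that rebuilding the boolean mask every forward pass is what the `register_buffer` item in the Todo list below would avoid.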

These components can be found in `models.py`.

## Todo

- Use `register_buffer` for the -inf mask in the Locality Self-Attention
- Use learning-rate warmup
- Visualize attention layers
- Track the scaling coefficient in attention using TensorBoard
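For the first item, a minimal sketch of what `register_buffer` would look like (the class and parameter names here are hypothetical, for illustration only): a buffer is not a trainable parameter, but it moves with `.to(device)` and is saved in the `state_dict`, so the mask no longer needs to be rebuilt on every forward pass.

```python
import torch
import torch.nn as nn

class MaskedAttentionStub(nn.Module):
    """Illustrative stub: pre-register the diagonal -inf mask as a buffer
    instead of recreating it in forward()."""

    def __init__(self, num_tokens):
        super().__init__()
        # non-trainable, device-aware, persisted in state_dict
        self.register_buffer('diag_mask', torch.eye(num_tokens, dtype=torch.bool))

    def forward(self, attn_logits):
        # attn_logits: (..., num_tokens, num_tokens); broadcast the mask
        return attn_logits.masked_fill(self.diag_mask, float('-inf'))
```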