
Initial Vision Transformer architecture with MAE decoder #37

Merged: 14 commits merged into main from model/init-vit on Nov 21, 2023

Conversation

@weiji14 weiji14 (Contributor) commented Nov 16, 2023

What I am changing

  • Initializing the neural network architecture, with a Vision Transformer (ViT) B/32 backbone and Masked Autoencoder (MAE) decoder

How I did it

  • ViT backbone/decoder architecture is from HuggingFace transformers
    • Loosely using the ViT B/32 model, but with 12 channels instead of 3.
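For illustration, a minimal sketch of how a ViT B/32-style MAE with 12 input channels can be set up with transformers; the image_size and mask_ratio values below are assumptions for demonstration, not necessarily what this PR uses:

```python
import torch
from transformers import ViTMAEConfig, ViTMAEForPreTraining

# ViT-B-like defaults (hidden_size=768, 12 layers, 12 heads) with a 32-pixel
# patch size and 12 input channels. image_size=256 and mask_ratio=0.75 are
# illustrative assumptions.
config = ViTMAEConfig(
    image_size=256,
    patch_size=32,
    num_channels=12,
    mask_ratio=0.75,
)
model = ViTMAEForPreTraining(config)

# Forward pass on a dummy mini-batch; the reconstruction loss on the masked
# patches is returned directly by the model.
outputs = model(pixel_values=torch.randn(2, 12, 256, 256))
print(outputs.loss)
```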

TODO:

  • Install transformers dependency
  • Initialize model architecture backbone/decoder layers
  • Set up training_step and forward pass
  • Add unit tests
  • Document model architecture in src/README.md

How you can test it

  • Run `python trainer.py fit --trainer.max_epochs=10` locally
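For context, trainer.py is assumed here to be a LightningCLI entry point roughly like the sketch below; the actual file in the repository may differ.

```python
# trainer.py (sketch): a LightningCLI wrapper so that
# `python trainer.py fit --trainer.max_epochs=10` works from the command line.
from lightning.pytorch.cli import LightningCLI


def main():
    # The LightningModule and DataModule are assumed to be supplied via
    # command-line arguments or a config file.
    LightningCLI()


if __name__ == "__main__":
    main()
```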

Related Issues

Working towards #3

References:

  • He, K., Chen, X., Xie, S., Li, Y., Dollar, P., & Girshick, R. (2022). Masked Autoencoders Are Scalable Vision Learners. 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 15979–15988. https://doi.org/10.1109/CVPR52688.2022.01553

Somehow using the `--with-cuda=11.8` flag in conda-lock didn't work as expected to get the CUDA-built PyTorch instead of the CPU version. Temporarily downgrading from PyTorch 2.1 to 2.0 and CUDA 11.8 to 11.2, to make it possible to install torchvision=0.15.2 from conda-forge later.
Initializing the neural network architecture layers, specifically a Vision Transformer (ViT) B/32 backbone and a Masked Autoencoder (MAE) decoder. Using Lightly for the MAE setup, with the ViT backbone from torchvision. Setup is mostly adapted from https://github.com/lightly-ai/lightly/blob/v1.4.21/examples/pytorch_lightning/mae.py
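Roughly, that setup looks like the sketch below; the lightly calls are assumed from the v1.4.21 example linked above, and the decoder plus MSE reconstruction loss pieces are omitted here:

```python
import torchvision
from lightly.models.modules import masked_autoencoder

# ViT B/32 backbone from torchvision, wrapped into lightly's MAE backbone.
vit = torchvision.models.vit_b_32(weights=None)
backbone = masked_autoencoder.MAEBackbone.from_vit(vit)
# A masked_autoencoder.MAEDecoder and an MSE reconstruction loss complete the
# setup, following the linked example.
```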
@weiji14 weiji14 added the model-architecture label Nov 16, 2023
@weiji14 weiji14 added this to the v0 Release milestone Nov 16, 2023
@weiji14 weiji14 self-assigned this Nov 16, 2023
Changing from lightly/torchvision's ViTMAE implementation to HuggingFace transformers' ViTMAE. This allows us to configure the number of input channels to a number other than 3 (e.g. 12). However, transformers' ViTMAE is an all-in-one class rather than an Encoder/Decoder split (though there's a way to access either once the class is instantiated). Allowed for configuring the masking_ratio instead of the decoder_dim size, and removed the MSE loss because it is already implemented in the ViTMAE class.
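For example, the encoder and decoder can still be reached on the instantiated model (a sketch, reusing the illustrative config values from above; mask_ratio defaults to 0.75):

```python
from transformers import ViTMAEConfig, ViTMAEForPreTraining

config = ViTMAEConfig(num_channels=12, image_size=256, patch_size=32, mask_ratio=0.75)
model = ViTMAEForPreTraining(config)

encoder = model.vit      # ViTMAEModel, the encoder
decoder = model.decoder  # ViTMAEDecoder
```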
Run input images through the encoder and decoder, and compute the pixel reconstruction loss from training the Masked Autoencoder.
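A sketch of such a training step in the LightningModule (named MAELitModule before the later rename); the self.model attribute, learning rate, and logged metric name are assumptions:

```python
import lightning.pytorch as pl
import torch
from transformers import ViTMAEConfig, ViTMAEForPreTraining


class MAELitModule(pl.LightningModule):
    def __init__(self, lr: float = 1e-4):
        super().__init__()
        # Illustrative config values, as in the sketches above.
        self.model = ViTMAEForPreTraining(
            ViTMAEConfig(num_channels=12, image_size=256, patch_size=32)
        )
        self.lr = lr

    def training_step(self, batch: torch.Tensor, batch_idx: int) -> torch.Tensor:
        # Encoder + decoder forward pass; the pixel reconstruction (MSE) loss on
        # the masked patches is computed inside ViTMAEForPreTraining.
        outputs = self.model(pixel_values=batch)
        self.log("train/loss", outputs.loss)
        return outputs.loss

    def configure_optimizers(self) -> torch.optim.Optimizer:
        return torch.optim.AdamW(self.parameters(), lr=self.lr)
```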
Ensure that running one training step on a mini-batch works. Created a random torch Dataset that generates tensors of shape (12, 256, 256) until there is real data to train on.
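A sketch of such a placeholder dataset (the class name and length are illustrative):

```python
import torch
from torch.utils.data import DataLoader, Dataset


class RandomDataset(Dataset):
    """Yields random tensors shaped like the expected (12, 256, 256) datacube chips."""

    def __init__(self, length: int = 1024):
        self.length = length

    def __len__(self) -> int:
        return self.length

    def __getitem__(self, idx: int) -> torch.Tensor:
        return torch.randn(12, 256, 256)


dataloader = DataLoader(RandomDataset(), batch_size=32)
```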
No need to pin to CUDA 11.2 since not using torchvision anymore. Patches 06535cd
The datacube has 13 channels, namely 10 from Sentinel-2's 10m and 20m resolution bands, 2 from Sentinel-1's VV and VH, and 1 from the Copernicus DEM.
Use a variable self.B instead of hardcoding 32 as the batch_size in the assert statements checking the tensor shape, so that the last mini-batch with a size less than 32 can be seen by the model.
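Illustratively (a sketch; the helper name is only for demonstration, and the expected shape follows the random dataset above):

```python
import torch


def assert_batch_shape(batch: torch.Tensor) -> int:
    # Derive the batch size from the mini-batch itself instead of hardcoding 32,
    # so the last, smaller mini-batch still passes the shape check.
    B = batch.shape[0]
    assert batch.shape == (B, 12, 256, 256)
    return B
```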
@weiji14 weiji14 marked this pull request as ready for review November 21, 2023 05:16
@srmsoumya srmsoumya (Collaborator) left a comment

The implementation looks good to me & we have enough options to modify in the MAE & ViT backbone.
Let us use this model for the current sprint; next week we need to add options to:

  1. Add embeddings for time, lat/lon, channels & position
  2. Implement different masking strategies, such as random masking and grouped channel/time masking
  3. Add support for different backbones such as Swin or FlexiViT

Rename MAELitModule to ViTLitModule, and model.py to model_vit.py, since we might be trying out different neural network model architectures later.
@weiji14 weiji14 (Contributor, Author) commented Nov 21, 2023

Thanks @srm, I see you're starting the spatiotemporal embedding work on GeoViT at #47, and we can work on the masking strategy and different model backbones later too. I've renamed model.py to model_vit.py in case we want to have other model_*.py files. Will merge this into the main branch now.

@weiji14 weiji14 merged commit 17f4698 into main Nov 21, 2023
2 checks passed
@weiji14 weiji14 deleted the model/init-vit branch November 21, 2023 22:46