06_Models in Gym

Daniel Buscombe edited this page Feb 24, 2023 · 2 revisions

Models

The Transformer family of models

There is currently one model of this family included in Gym: the Segformer, using the smallest pre-trained backbone, mit-b0. This is for transfer learning tasks based on a non-CNN architecture, which might be especially useful on tasks not well suited to training from scratch with a UNet or Res-UNet. An additional advantage is that it can accept input images of any shape (within memory constraints).

The UNet family of models

There are currently five models included in Gym: two UNets, two Residual UNets, and the Satellite UNet. There are two options for the UNet architecture in this repository: a simple version and a highly configurable version.

A UNet today generally refers to a family of models identified by the following characteristics: a) fully convolutional (no fully connected layers); b) four convolutional 'blocks' consisting of convolutional layers and Batch Normalization layers connected by ReLU activations, then, optionally, Dropout layers; c) a symmetrical U shape (hence the 'U' in the name) with skip connections. Specific UNet implementations often differ in a) the number of filters, b) stride length (i.e., feature extraction specifics), and c) the use of (and type of) dropout.
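The block structure described above can be sketched in Keras. This is a minimal illustration, not Gym's actual implementation: the function name `conv_block`, the two-convolution depth, and the filter count are assumptions made for the example.

```python
import tensorflow as tf
from tensorflow.keras import layers

def conv_block(x, filters, dropout_rate=0.0):
    # One UNet-style block (illustrative sketch): two 3x3 convolutions,
    # each followed by Batch Normalization and a ReLU activation,
    # with an optional Dropout layer at the end.
    for _ in range(2):
        x = layers.Conv2D(filters, 3, padding="same")(x)
        x = layers.BatchNormalization()(x)
        x = layers.Activation("relu")(x)
    if dropout_rate > 0:
        x = layers.Dropout(dropout_rate)(x)
    return x

# Fully convolutional, so the spatial input shape can be left unspecified
inputs = tf.keras.Input(shape=(None, None, 3))
outputs = conv_block(inputs, filters=64, dropout_rate=0.1)
model = tf.keras.Model(inputs, outputs)
```

Because no fully connected layers are involved, the block accepts any spatial input size, which is why the `Input` shape can be left as `(None, None, 3)`.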

Batch Normalization involves normalizing the input or output of the activation functions in a hidden layer by the mean and standard deviation of the input batch. It can be a powerful regularization technique for natural imagery affected by large variations in natural light (image brightness and contrast), making UNets more stable by protecting against outlier weights, thereby reducing overfitting, and enabling higher learning rates, which in turn promotes faster training. The normalization, especially for small batch sizes and when combined with subsequent fully connected layers with relatively few output neurons, forces the model to generalize because the data is slightly different each time, dictated by the distribution of values in the batch. Its effects are therefore contingent in part on the size of the batch. It is thus possibly the most generally useful regularization technique for models of imagery, especially in high batch variance scenarios.
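The normalization step described above can be sketched in a few lines of NumPy. This is a simplified illustration of the training-time computation: inference-time moving averages and learnable per-feature gamma/beta vectors are omitted, and the function name `batch_norm` is ours, not Gym's.

```python
import numpy as np

def batch_norm(x, gamma=1.0, beta=0.0, eps=1e-5):
    # Normalize each feature by the mean and standard deviation of the
    # current batch (axis 0), then apply a learnable scale (gamma) and
    # shift (beta). eps guards against division by zero.
    mean = x.mean(axis=0)
    var = x.var(axis=0)
    x_hat = (x - mean) / np.sqrt(var + eps)
    return gamma * x_hat + beta

rng = np.random.default_rng(0)
x = rng.normal(loc=5.0, scale=3.0, size=(8, 4))  # batch of 8, 4 features
y = batch_norm(x)  # each feature now has ~zero mean and ~unit variance
```

Note that the output statistics depend on whichever samples happen to share the batch, which is the source of the batch-size-dependent regularization effect discussed above.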

The use of image augmentation and Batch Normalization layers already imposes model regularization; however, further regularization might optionally be achieved by means of Dropout layers. Dropout is a form of regularization useful in training neural networks. Dropout regularization works by removing (actually, temporarily zeroing out) a random selection of a fixed number of the units in a network layer for a single gradient step. The more neurons are dropped, the stronger the regularization. The configuration file may be used to specify the Dropout type, rate, and the locations within the model where Dropout is applied. There are two types of Dropout supported by our software: a) standard, and b) spatial. Standard Dropout involves randomly removing individual neurons in all feature maps, whereas Spatial Dropout involves dropping entire feature maps instead of individual elements. Dropout rate is the proportion of randomly selected neurons to drop. The rate may be fixed, or may increase or decrease with each successive convolutional block. The user has the option to specify Dropout on the encoder portion of the network, the decoder portion, or both.
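The distinction between the two Dropout types can be illustrated directly in NumPy. This is a hedged sketch using inverted-dropout scaling (survivors are scaled by 1/(1-rate) at training time); the function names are ours, not Gym's.

```python
import numpy as np

def standard_dropout(x, rate, rng):
    # Standard Dropout: zero individual elements independently with
    # probability `rate`, scaling survivors by 1/(1-rate).
    mask = rng.random(x.shape) >= rate
    return x * mask / (1.0 - rate)

def spatial_dropout(x, rate, rng):
    # Spatial Dropout: zero entire feature maps. For x of shape (H, W, C),
    # draw one Bernoulli sample per channel and broadcast it over H and W,
    # so a dropped channel is zeroed everywhere at once.
    mask = rng.random((1, 1, x.shape[-1])) >= rate
    return x * mask / (1.0 - rate)
```

With a rate of 0.5 applied to an all-ones tensor, standard dropout produces a salt-and-pepper pattern of zeros and 2s, whereas spatial dropout leaves each channel either entirely zeroed or entirely scaled.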

The UNet framework consists of two parts, the encoder and the decoder. The encoder receives the input image and applies a series of convolutional and batch normalization layers, and optionally Dropout layers, followed by pooling layers that reduce the spatial size and condense features. Four banks of convolutional filters, each with twice as many filters as the previous, progressively downsample the inputs as features are extracted through pooling. The last set of features (or so-called bottleneck) is a very low-dimensional feature representation of the input imagery. The decoder progressively upsamples the bottleneck into a label image using four banks of convolutional filters, each with half as many filters as the previous, as features are extracted through transpose convolutions and concatenation. The sets of features from each of the four levels in the encoder-decoder structure are concatenated, which allows learning different features at different levels and leads to spatially well-resolved outputs. The final classification layer maps the output of the previous layer to a single 2D output based on a sigmoid activation function.
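The doubling and halving of filters and spatial sizes described above amounts to simple shape bookkeeping, which can be traced in plain Python. The 256 x 256 input and 64 base filters below are illustrative assumptions, not Gym defaults.

```python
def unet_shape_walk(n=256, base_filters=64, levels=4):
    # Trace tensor shapes through a UNet: each encoder level halves the
    # spatial size (2x2 pooling) and doubles the filter count; the decoder
    # mirrors the encoder shapes back up via transpose conv + concatenation.
    encoder = []
    size, filters = n, base_filters
    for _ in range(levels):
        encoder.append((size, size, filters))
        size //= 2       # spatial size halves after pooling
        filters *= 2     # filter count doubles at the next level
    bottleneck = (size, size, filters)
    decoder = encoder[::-1]  # mirrored shapes on the way back up
    return encoder, bottleneck, decoder

enc, bottleneck, dec = unet_shape_walk()
# enc: (256,256,64) -> (128,128,128) -> (64,64,256) -> (32,32,512)
# bottleneck: (16,16,1024)
```

The mirrored decoder shapes are what make the level-by-level concatenation of encoder and decoder features possible.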

1. UNet model

The UNet model is a fully convolutional neural network that is used for binary segmentation, i.e., foreground and background pixel-wise classification. It is easily adapted to multiclass segmentation workflows by representing each class as a binary mask, creating a stack of binary masks, one for each potential class (so-called one-hot encoded label data). A UNet is symmetrical (hence the U in the name) and uses concatenation instead of addition to merge feature maps.

The fully convolutional model framework consists of two parts, the encoder and the decoder. The encoder receives the N x N x M (M = 1, 3, or 4 in this implementation) input image and applies a series of convolutional layers and pooling layers to reduce the spatial size and condense features. Six banks of convolutional filters, each with twice as many filters as the previous, progressively downsample the inputs as features are extracted through pooling. The last set of features (or so-called bottleneck) is a very low-dimensional feature representation of the input imagery. The decoder progressively upsamples the bottleneck into an N x N x 1 label image using six banks of convolutional filters, each with half as many filters as the previous, as features are extracted through transpose convolutions and concatenation. A transposed convolution convolves a dilated version of the input tensor, consisting of interleaved zeroed rows and columns between each pair of adjacent rows and columns in the input tensor, in order to upscale the output. The sets of features from each of the six levels in the encoder-decoder structure are concatenated, which allows learning different features at different levels and leads to spatially well-resolved outputs. The final classification layer maps the output of the previous layer to a single 2D output based on a sigmoid activation function.
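The zero-interleaving view of a transposed convolution described above can be implemented directly in NumPy. This is a single-channel, stride-2 sketch with a naive loop and a fixed (unlearned) kernel, purely for illustration; real layers use learned multi-channel weights.

```python
import numpy as np

def transpose_conv2d(x, kernel, stride=2):
    # Step 1: dilate the input by interleaving zeroed rows and columns
    # between each pair of adjacent rows and columns.
    h, w = x.shape
    dilated = np.zeros((h * stride - (stride - 1), w * stride - (stride - 1)))
    dilated[::stride, ::stride] = x
    # Step 2: apply an ordinary 'full' convolution to the dilated tensor,
    # which upscales the output relative to the original input.
    kh, kw = kernel.shape
    padded = np.pad(dilated, ((kh - 1,), (kw - 1,)))
    out = np.zeros((padded.shape[0] - kh + 1, padded.shape[1] - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(padded[i:i + kh, j:j + kw] * kernel)
    return out

y = transpose_conv2d(np.ones((2, 2)), np.ones((3, 3)))
```

A 2 x 2 input with a 3 x 3 kernel and stride 2 yields a 5 x 5 output, showing how the operation trades channel depth for spatial resolution on the decoder path.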

2. Residual UNet model

UNet with residual (or lateral/skip) connections.

The difference between our Res-UNet and the original UNet is the use of three residual-convolutional encoding and decoding layers instead of the six regular convolutional encoding and decoding layers. Residual or 'skip' connections have been shown in numerous contexts to facilitate information flow, which is why we have halved the number of convolutional layers but can still achieve good accuracy on the segmentation tasks. The skip connections essentially add the outputs of the regular convolutional block (a sequence of convolutions and ReLU activations) to the inputs, so the model learns to map feature representations in context to the inputs that created those representations.
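The add-the-inputs idea can be sketched with plain matrix multiplies standing in for convolutions. This is an illustrative simplification (the function names are ours, and real residual blocks use convolutions and often a projection when shapes differ).

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

def conv_block(x, w1, w2):
    # Stand-in for a sequence of convolutions + ReLU activations
    # (plain matrix multiplies here, for illustration only).
    return relu(relu(x @ w1) @ w2)

def residual_block(x, w1, w2):
    # Skip connection: add the block's output to its input, so the block
    # learns a residual correction rather than a full transformation.
    return conv_block(x, w1, w2) + x
```

One consequence worth noting: if the block's weights contribute nothing (all zeros), the residual block reduces to the identity, which is part of why these connections ease information flow through deep networks.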

There are two options with the Res-Unet architecture in this repository: a simple version and a highly configurable version.

3. Satellite UNet model

The model was proposed in this Kaggle competition but to our knowledge has not been used in the scientific literature. It implements an interesting idea that is worthy of further exploration, hence its inclusion in the corpus of models provided by Gym. A typical UNet architecture involves increasing the number of feature maps (channels) with each max pooling operation. However, the 'satellite unet' uses a constant number of feature maps throughout the network. According to the architecture's author, Arkadiusz Nowaczynski:

This choice was motivated by two observations. Firstly, we can allow the network to lose some information after the downsampling layer because the model has access to low level features in the upsampling path. Secondly, in satellite images there is no concept of depth or high-level 3D objects to understand, so a large number of feature maps in higher layers may not be critical for good performance. We developed separate models for each class, because it was easier to fine tune them individually for better performance and to overcome imbalanced data problems.