
07_Train an image segmentation model


This section describes retraining a model using the provided datasets.

Usage notes

⚠️ 🚷 Note: training models from scratch on a GPU requires an NVIDIA GPU with more than 6GB of memory

⚠️ 🚷 Also note: by default, mixed precision is used for training models. We have noticed that this allows a larger BATCH_SIZE than you could otherwise use on a given GPU, and it also slightly speeds up model convergence. If this causes problems, or you wish to train models with full floating-point precision, comment out the following lines in train_model.py:

from tensorflow.keras import mixed_precision
mixed_precision.set_global_policy('mixed_float16')
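
If you would rather toggle precision than delete code, a minimal sketch is below (USE_MIXED_PRECISION is a hypothetical flag for illustration, not an option in the shipped config):

from tensorflow.keras import mixed_precision

USE_MIXED_PRECISION = True  # hypothetical toggle; set False for full float32 precision

if USE_MIXED_PRECISION:
    # float16 compute with float32 variables: permits a larger BATCH_SIZE per GPU
    mixed_precision.set_global_policy('mixed_float16')
else:
    # full floating point precision throughout
    mixed_precision.set_global_policy('float32')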

Instructions

The following instructions assume you have already run make_nd_datasets.py to create .npz format files that contain all your data for model training and validation.
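
If you want to sanity-check those .npz files before training, here is a minimal sketch (the path is illustrative, and the array key names depend on how make_nd_datasets.py wrote the files):

import numpy as np

# illustrative path; point this at one of your own .npz files
with np.load('/Users/Someone/my_segmentation_zoo_datasets/example.npz') as data:
    # print each stored array's name, shape, and dtype
    for key in data.files:
        print(key, data[key].shape, data[key].dtype)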

  1. Make sure the dataset and the config file are in the appropriate directories, e.g. in a folder like /Users/Someone/my_segmentation_zoo_datasets.

  2. Now navigate to the directory containing the code (cd /segmentation_zoo/unet) and train the model with:

python train_model.py

You will be prompted via a GUI to provide the config file, images, and labels. The program will then write some example training and validation samples to a sample/ folder in the directory with the data.

Then the model will begin training. You will see output similar to ...

Creating and compiling model ...
.....................................
Training model ...

Epoch 00001: LearningRateScheduler reducing learning rate to 1e-07.
Epoch 1/200
2021-03-03 11:47:03.934177: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudnn.so.7
2021-03-03 11:47:04.670713: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcublas.so.10
75/75 [==============================] - 55s 733ms/step - loss: 0.5727 - mean_iou: 0.1747 - dice_coef: 0.4273 - val_loss: 0.5493 - val_mean_iou: 3.3207e-05 - val_dice_coef: 0.4507 - lr: 1.0000e-07

Epoch 00002: LearningRateScheduler reducing learning rate to 5.95e-07.
Epoch 2/200
75/75 [==============================] - 56s 745ms/step - loss: 0.5632 - mean_iou: 0.1840 - dice_coef: 0.4368 - val_loss: 0.6005 - val_mean_iou: 9.7821e-05 - val_dice_coef: 0.3995 - lr: 5.9500e-07

Epoch 00003: LearningRateScheduler reducing learning rate to 1.0900000000000002e-06.
Epoch 3/200
75/75 [==============================] - 56s 751ms/step - loss: 0.5403 - mean_iou: 0.2212 - dice_coef: 0.4597 - val_loss: 0.6413 - val_mean_iou: 8.7021e-04 - val_dice_coef: 0.3587 - lr: 1.0900e-06

(etc)
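
The learning rate values in the log follow a linear warm-up ramp, and later in training decay exponentially back down (see the epoch 126 line below). A sketch of a schedule consistent with the logged values, assuming illustrative parameters (START_LR=1e-7, MAX_LR=1e-5, a 20-epoch ramp, 0.9 decay; the actual implementation and values live in the training script and config file):

def lrfn(epoch, start_lr=1e-7, min_lr=1e-7, max_lr=1e-5,
         rampup_epochs=20, sustain_epochs=0, exp_decay=0.9):
    # linear ramp from start_lr up to max_lr ...
    if epoch < rampup_epochs:
        return (max_lr - start_lr) / rampup_epochs * epoch + start_lr
    # ... optionally hold at max_lr ...
    if epoch < rampup_epochs + sustain_epochs:
        return max_lr
    # ... then decay exponentially toward min_lr
    return (max_lr - min_lr) * exp_decay ** (epoch - rampup_epochs - sustain_epochs) + min_lr

# epoch index 0 -> 1e-07, 1 -> 5.95e-07, 2 -> 1.09e-06, 125 -> ~1.0016e-07,
# matching the LearningRateScheduler messages above and below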

The above is for MAX_EPOCHS=200 and BATCH_SIZE=4 (i.e. the defaults, set or changed in the config file weights/sentinel2_coast_watermask/watermask_oblique_2class_batch_4.json), with the default learning rate parameters.
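
For orientation, the relevant part of such a config file might look like the sketch below. MAX_EPOCHS and BATCH_SIZE are the fields discussed above; the learning-rate field names are illustrative and should be checked against the shipped config:

{
  "MAX_EPOCHS": 200,
  "BATCH_SIZE": 4,
  "START_LR": 1e-07,
  "MIN_LR": 1e-07,
  "MAX_LR": 1e-05,
  "RAMPUP_EPOCHS": 20,
  "SUSTAIN_EPOCHS": 0,
  "EXP_DECAY": 0.9
}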

When model training is complete, the model is evaluated on the validation set. You'll see output like this printed to screen:

Epoch 00126: LearningRateScheduler reducing learning rate to 1.001552739802484e-07.
Epoch 126/200
75/75 [==============================] - 56s 750ms/step - loss: 0.1206 - mean_iou: 0.8015 - dice_coef: 0.8794 - val_loss: 0.1222 - val_mean_iou: 0.7998 - val_dice_coef: 0.8778 - lr: 1.0016e-07
.....................................
Evaluating model ...
225/225 [==============================] - 26s 117ms/step - loss: 0.1229 - mean_iou: 0.7988 - dice_coef: 0.8771
loss=0.1229, Mean IOU=0.7988, Mean Dice=0.8771
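
For reference, the Dice coefficient and mean IOU reported here are standard overlap metrics; note that in the logs the loss is exactly 1 - dice_coef, i.e. the model appears to be trained with a Dice loss. A minimal NumPy sketch for the binary case (not the repo's exact implementation, which operates on one-hot tensors inside TensorFlow):

import numpy as np

def iou(y_true, y_pred):
    # intersection over union of two binary masks
    inter = np.logical_and(y_true, y_pred).sum()
    union = np.logical_or(y_true, y_pred).sum()
    return inter / union

def dice(y_true, y_pred):
    # Dice = 2|A ∩ B| / (|A| + |B|); equivalently 2*IOU / (1 + IOU)
    inter = np.logical_and(y_true, y_pred).sum()
    return 2 * inter / (y_true.sum() + y_pred.sum())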

When evaluating models using these metrics, it is important to remember that a supervised approach trains a network end-to-end in a discriminative way, explicitly mapping classes to image features and optimizing the network to extract the features that predict those classes. It is fully supervised, to the level where there is no room for ambiguity: every pixel falls into one class or another, despite the obvious continuum that exists between classes, and between the features that covary with them. The confidence metric we get is therefore as much a reflection of the model's feature extraction and classification process (summarized collectively by its learned parameters, or weights) as it is of the input data itself.