05_Create a model ready dataset

Daniel Buscombe edited this page Feb 24, 2023 · 1 revision

Here at Doodleverse HQ we advocate training models on the augmented data encoded in the datasets, so that the original data can serve as a hold-out (test) set. This is ideal because, although the validation dataset (drawn from the augmented data) is not used to adjust model weights, it does influence model training by triggering early stopping when validation loss stops improving. Testing on an untransformed set is therefore a further check on, and reassurance of, model performance and the evaluation metrics.

How this works

Run python make_nd_dataset.py.

  1. You will be asked to navigate to and select your config file

  2. You will be asked to navigate to and select a directory to store the files that are the outputs of the program

  3. You will be asked to navigate to and select your directory of images

  4. Finally, you will be asked to navigate to and select your label images (one label image per image)

First, your imagery and labels will be resized. The program will create new folders called resized_*, where * is the name of the image or label folder. You do not need to modify these folders.
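
The naming rule for the resized folders can be illustrated with a short sketch. The helper `resized_dir_name` and the example paths are hypothetical and are not taken from make_nd_dataset.py; the sketch only shows how the `resized_` prefix combines with the name of the folder you selected.

```python
from pathlib import Path


def resized_dir_name(source_dir):
    """Return the folder name 'resized_<name>' for a selected folder.

    Hypothetical helper for illustration only; where the program actually
    places the folder is not covered here.
    """
    return "resized_" + Path(source_dir).name


# Example paths are assumptions, not paths the program requires.
image_folder_name = resized_dir_name("my_data/images")
label_folder_name = resized_dir_name("my_data/labels")
print(image_folder_name, label_folder_name)
```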

Next, your imagery is augmented according to the augmentation parameters you specified in your config file.

For each of your image and label pairs, multiple augmented copies will be saved to disk. Each augmented image-label pair is saved as a .npz file, along with the name of the original image that the augmented copy is based on (useful for tracking down which images were used for training and validation).
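
As a rough illustration of what storing an image-label pair together with its source file name in one .npz file can look like, here is a minimal NumPy sketch. The key names (`image`, `label`, `source`), the output file name pattern, and the helper functions are assumptions made for this example; the files written by make_nd_dataset.py may use different keys and names.

```python
import os
import tempfile

import numpy as np


def save_pair(out_dir, index, image, label, source_name):
    """Write one augmented image-label pair to a compressed .npz file.

    The key names and the file name pattern are assumptions for this sketch.
    """
    path = os.path.join(out_dir, f"augmented_{index:05d}.npz")
    np.savez_compressed(
        path, image=image, label=label, source=np.array(source_name)
    )
    return path


def load_pair(path):
    """Read the image, the label and the original file name back."""
    with np.load(path) as data:
        return data["image"], data["label"], data["source"].item()


with tempfile.TemporaryDirectory() as tmp:
    image = np.zeros((4, 4, 3), dtype=np.uint8)  # dummy RGB image
    label = np.zeros((4, 4, 2), dtype=np.uint8)  # dummy two-class label
    path = save_pair(tmp, 0, image, label, "original_0001.jpg")
    loaded_image, loaded_label, loaded_source = load_pair(path)
```

Keeping the source name inside each file means any training or validation sample can be traced back to its original image without a separate lookup table.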

It is this directory of npz files that you point the model to during model training. Labels are stored one-hot encoded. This is not memory efficient, but it compresses very well and facilitates label smoothing as an optional pre-processing step for model training (not yet implemented).
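
To make the one-hot storage format concrete, the sketch below converts an integer label image into a one-hot array and applies the standard label-smoothing formula to it. Both functions are illustrative assumptions: they are not the program's own code, and since label smoothing is not yet implemented in the program, `smooth_labels` only shows why a one-hot layout makes that step straightforward.

```python
import numpy as np


def one_hot(label, num_classes):
    """Convert an integer label image (H, W) to one-hot (H, W, num_classes)."""
    return np.eye(num_classes, dtype=np.uint8)[label]


def smooth_labels(one_hot_label, alpha=0.1):
    """Standard label smoothing: move a fraction alpha of each pixel's
    probability mass from the true class to a uniform spread over all classes."""
    num_classes = one_hot_label.shape[-1]
    return one_hot_label.astype(np.float32) * (1.0 - alpha) + alpha / num_classes


label = np.array([[0, 1], [2, 1]])        # tiny 2 x 2 label image, 3 classes
encoded = one_hot(label, num_classes=3)   # shape (2, 2, 3), one 1 per pixel
smoothed = smooth_labels(encoded, alpha=0.1)
```

Because every pixel holds a single 1 among zeros, the encoded array is highly repetitive, which is why it compresses well despite its larger in-memory size.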