image_classification

This repository provides an end-to-end pipeline for tensorflow-based image classification using neural network.

Requirements

pandas
numpy
pillow
sklearn
tensorflow
matplotlib
tensorflow_hub
Jupyter Notebook with IPython - for displaying purposes

install via pip install pandas numpy pillow sklearn tensorflow matplotlib tensorflow_hub jupyter notebook

Fork project and set it up to work on local laptop

Fork/Clone the repository to your local machine into a folder like image_classification, go to that folder and open classify_images.ipynb. This Notebook provides an example of the whole pipeline.
To use the pipeline, you'll definitely have to change the first line path = "D:/datasets/horse-or-human" to the corresponding dataset you have on your own machine like path = "path-to-dataset".
Pipeline supports two types of folder structures to work with:

    1. already 'manually' prepared

        /<dataset-name>
        ---/train/
        ---/test/
        ---/labels.csv

where as train contains all training pictures (directly in train-folder, no subfolders), test contains all test pictures (directly in test-folder, no subfolders) and labels.csv contains the information of train-pictures labels. Contains id-column (that contains the name of the training picture file) and label (corresponding label of the file)

    2. 'Tensorflow'-alike folder structure

        /<dataset-name>
        ---/train
        ---/---/<label1>
        ---/---/<label2>
        ---/---/...
        ---/---/<labelx>
        ---/validation
        ---/---/<label1>
        ---/---/<label2>
        ---/---/...
        ---/---/<labelx>
        ---/test

where as only one of test- or validation-folder must be available --> Pipeline is able to handle all three folder structures (given test AND validation, given test and NO validation and given NO test but validation). Pipeline will than change the folder structure to be exact the same as in 1.

IMPORTANT: Folder structure doesn't need to provide a validation set. This set will automatically split the training data into training and validation set using sklearn's train_test_split-function.

How to use it / possible modifications

Pipeline provides various options to use it. It is built very modular, so you can insert and use all the scripts in script-folder independently. Going through them step-by-step

import_data.py

Manages the import of training data and labels, splitting it into training and validation sets.

init class

inits the class

required parameter

    path:           - path to the dataset (as described above) - String

optional parameter

    NUM_IMAGES:     - number of images that shall be used to train the model. Helps during testing process to avoid the model from taking too much time during training - Integer - default = -1 --> all

import_raw_data()

refactors the required folder structure and imports the data from the files

required parameter

    none

optional parameter

    describe:       - whether or not the Pipeline shall give a short description of the given labels and show the distribution of given labels - Boolean - default = False
    plot:           - whether or not showing an example of the training data. Will show the first image from the train-folder - Boolean - default = False

returns

    none

get_raw_traindata()

Loads the filenames of the training data and splits it into X_train, X_val, y_train and y_test. Loads labels and gives a list of all unique labels that exist.

required parameter

    none

optional parameter

    none

returns

    none

preprocessing.py

Preprocesses the training and validation (or testing) datasets, gives them back as data batches to be used during training

init class

inits the class

required parameter

   unique_labels:   - list of all unique labels that exist. Should be the list that is given by `import_data.unique_labels` which is created using `import_data.get_raw_traindata()` - numpy.array

optional parameter

    IMG_SIZE:       - size that the images must have to be processed by the model (is defined by the size that the desired model from `tensorflow_hub` uses) - Integer - default = 224
    BATCH_SIZE:     - size of data batches that shall be processed during training - Integer - default = 32

create_data_batches()

Creates and returns data batches from given data. Optionally describes and displays an example

required parameter

    X:              - X data for batch (like X_train, X_val or X_test) - pandas.DataFrame

optional parameter

    y:              - y data for batch (like y_train, y_val). Has to be provided if training or validation batches shall be returned. Only unused for y_test because no y_test available - pandas.DataFrame - default = None
    valid_data:     - whether or not validation batch shall be returned - Boolean - default = False
    test_data:      - whether or not test batch shall be returned - Boolean - default = False
    describe:       - whether or not the Pipeline shall give a short description of the determined data batch - Boolean - default = False
    plot:           - whether or not showing an example of the determined data batch. Will show the first 25 images from the batch in one plot - Boolean - default = False

returns

    data_batch:     - data batch of provided data - tensorflow.data.Dataset.batch

model.py

Creates, trains, loads and saves the classification model

init class

inits the class

required parameter

   unique_labels:   - list of all unique labels that exist. Should be the list that is given by `import_data.unique_labels` which is created using `import_data.get_raw_traindata()` - numpy.array
   path:            - path where logs and model shall be saved (can be same as <path-to-dataset>) - String

optional parameter

    IMG_SIZE:       - size that the images must have to be processed by the model (is defined by the size that the desired model from `tensorflow_hub` uses) - Integer - default = 224
    MODEL_URL:      - url to the desired pretrained model from <a href="https://www.tensorflow.org/hub" target="_blank">`TensorFlow-Hub`</a> - String - default = "https://tfhub.dev/google/imagenet/mobilenet_v2_130_224/classification/4"

train_model()

creates, trains, evaluates and saves the model. An own created model can be passed if desired.

required parameter

    train_data:     - data to train model on (given by preprocessing.py) - tensorflow.data.Dataset.batch
    val_data:       - data for model validation (given by preprocessing.py) - tensorflow.data.Dataset.batch

optional parameter

    NUM_EPOCHS:     - Maximum number of training epochs. Model will stop (thanks to early stopping callback) to train if no new progress is made, to avoid overfitting - Integer - default = 100
    model_name:     - name of file where model will be stored (path of model is constructed by - in init - given path and a new folder of current date + given model_name as suffix) - String - default = "" (will result in filename of %H%M%S.h5)
    model:          - own, custom made model can be used instead of new one - tensorflow.keras.Sequential - default = None (which means that Pipeline will create new)
    describe:       - whether or not to describe the how the model looks like - Boolean - default = False

returns

    model:          - the trained and evaluated model - tensorflow.keras.Sequential

predict.py

Predicts and shows validation data. Can only handle data which has labels (like validation or training data batches)

init class

inits the class

required parameter

    model:          - model that shall predict the data - tensorflow.keras.Sequential
    data:           - data batches to be predicted (most likely the validation dataset, given by preprocessing.py) - tensorflow.data.Dataset.batch 
    unique_labels:  - list of all unique labels that exist. Should be the list that is given by `import_data.unique_labels` which is created using `import_data.

optional parameter

    none

check_predictions()

Predicts labels based on data and evaluates whether right or wrong. Plots result

required parameter

    none

optional parameter

    num_rows:       - number of rows that will be plotted (multiplied with num_cols results in the number of predictions that are plotted) - Integer - default = 3
    num_cols:       - number of columns that will be plotted (multiplied with num_rows results in the number of predictions that are plotted) - Integer - default = 2
    i_multiplier:   - number of images that are skipped before starting to plot (0 means, the first x images in the data batch are used, 10 means that the images 10 to 10+x will be used) - Integer - default = 0

returns

    none

Name		Name	Last commit message	Last commit date
Latest commit History 19 Commits
assets		assets
scripts		scripts
.gitignore		.gitignore
README.md		README.md
classify_images.ipynb		classify_images.ipynb

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

image_classification

Requirements

Fork project and set it up to work on local laptop

How to use it / possible modifications

import_data.py

init class

import_raw_data()

get_raw_traindata()

preprocessing.py

init class

create_data_batches()

model.py

init class

train_model()

predict.py

init class

check_predictions()

What's next?

Further informations:

About

Releases

Packages

Languages

papstchaka/image_classification

Folders and files

Latest commit

History

Repository files navigation

image_classification

Requirements

Fork project and set it up to work on local laptop

How to use it / possible modifications

import_data.py

init class

import_raw_data()

get_raw_traindata()

preprocessing.py

init class

create_data_batches()

model.py

init class

train_model()

predict.py

init class

check_predictions()

What's next?

Further informations:

About

Resources

Stars

Watchers

Forks

Releases

Packages 0

Languages

Packages