sookinoby/generative-models

Generative Models using Apache MXNet

In our previous notebooks, we used a deep learning technique called a Convolutional Neural Network (CNN) to classify text and images. A CNN is an example of a Discriminative Model, which learns a decision boundary to classify a given input signal (data).

Deep learning has recently been used to create even more powerful and useful models called Generative Models. A Generative Model doesn’t just learn a decision boundary; it models the underlying distribution of the values. Using this insight, a generative model can generate new data as well as classify a given input. Here are some examples of Generative Models:

  1. Predicting the probability of a word or character given the previous word or character. Almost all of us have a predictive keyboard on our smartphone, which suggests upcoming words for super-fast typing. Generative models allow us to build advanced predictive systems similar to SwiftKey.

  2. Producing a new song, combining two genres of songs to create an entirely different song, or synthesizing new images from existing images.

  3. Upgrading images to a higher resolution to remove fuzziness, improve image quality, restore photos, and much more.

In general, Generative Models can be used on any form of data to learn the underlying distribution, generate new data, and augment existing data.

In this tutorial, we are going to build Generative Models using the Apache MXNet Gluon API. We start with the first application listed above, predicting the next character in an incoming stream, and then implement a Generative Adversarial Network (GAN) that generates new images from existing images.

We will also talk about the following topics:

  • The difference between Generative and Discriminative models
  • The building blocks of a Recurrent Neural Network (RNN)
  • Implementation of an unrolled version of RNN to understand its relationship with Feed Forward Neural Network.
  • Simple example of Generative Adversarial Network (GAN)

You need a basic understanding of Recurrent Neural Networks (RNNs), activation functions, gradient descent, backpropagation, and NumPy to follow this tutorial.

By the end of the notebook, you will be able to:

  1. Understand Generative Models
  2. Know the limitations of a Feed Forward Neural Network
  3. Understand the idea behind RNN and LSTM
  4. Install MXNet with Gluon API
  5. Prepare datasets to train the Neural Network
  6. Implement a basic RNN using Feed Forward Neural Network
  7. Implement an RNN Model to auto-generate text using Gluon API *
  8. Implement a Generative Adversarial Network (GAN)

*Note: Although an RNN model is used to generate text, it is not actually a 'Generative Model' in the strict sense. This [pdf document](https://arxiv.org/pdf/1703.01898.pdf) clearly illustrates the difference between a generative model and a discriminative model for text classification.

First, we will discuss the idea behind Generative Models and then cover the limitations of Feed Forward Neural Networks. Next, we will implement a basic RNN using a Feed Forward Neural Network, which gives good insight into how an RNN works. Then we design a powerful RNN with LSTM and GRU layers using the MXNet Gluon API. Next, we implement a Generative Adversarial Network (GAN) that can generate new images from existing images. By the end of the tutorial, you will be able to implement other generative models using the Gluon API. We will roughly be following the structure of this report.

How Generative Models Go Further Than Discriminative Models

Let’s understand the power of Generative Models using a trivial example.

The following table depicts the heights of ten humans and Martians.

Martian (height in centimetre) - 250,260,270,300,220,260,280,290,300,310
Human (height in centimetre) - 160,170,180,190,175,140,180,210,140,200

The heights of human beings follow a normal distribution, showing up as a bell-shaped curve on the graph. Martians tend to be much taller than humans but also have a normal distribution. So let's input the heights of humans and Martians into both Discriminative and Generative models.

If we train a Discriminative Model, it will only learn a decision boundary. The model misclassifies just one human, so the accuracy is quite good overall. However, the model doesn’t learn the underlying distribution of the data, so it is not suitable for building the powerful applications listed at the beginning of this article. [Figure: decision boundary between human and Martian heights]

In contrast, a generative model will learn the underlying distribution (a lower-dimensional representation) for Martians (mean = 274, sample std ≈ 27.6) and Humans (mean = 174.5, sample std ≈ 23.1). [Figure: fitted height distributions for Humans and Martians]

Suppose we have a normal distribution for Martians (mean = 274, std ≈ 27.6). We can produce new data by generating a random number between 0 and 1 (uniform distribution) and then passing it through the inverse cumulative distribution function of the Martians' normal distribution to get a value, say 275 cm.

Using the underlying distribution, we can generate new Martians and Humans, or a new interbreed species (humars). There are infinitely many ways to generate data, because we can manipulate the underlying distribution. We can also use this model to classify Martians and Humans, just like the discriminative model. For a concrete understanding of generative vs discriminative models, please check this.
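
As a minimal sketch of this idea (not code from the notebook), the snippet below fits one normal distribution per species to the heights listed above using Python's standard `statistics.NormalDist`, generates new heights by pushing a uniform random number through the inverse CDF, and classifies a height by comparing the two fitted densities. The helper names `generate` and `classify`, the fixed seed, and the equal-prior assumption are illustrative choices.

```python
import random
from statistics import NormalDist

martian_heights = [250, 260, 270, 300, 220, 260, 280, 290, 300, 310]
human_heights = [160, 170, 180, 190, 175, 140, 180, 210, 140, 200]

# "Training": estimate the mean and sample standard deviation per class.
martian = NormalDist.from_samples(martian_heights)
human = NormalDist.from_samples(human_heights)

def generate(dist, rng):
    """Draw u ~ Uniform(0, 1) and map it through the inverse CDF of dist."""
    u = rng.random()
    while u == 0.0:  # inv_cdf is only defined for 0 < u < 1
        u = rng.random()
    return dist.inv_cdf(u)

def classify(height):
    """Pick the class whose fitted density is higher (equal priors assumed)."""
    return "martian" if martian.pdf(height) > human.pdf(height) else "human"

rng = random.Random(0)  # fixed seed so the run is repeatable
new_martians = [round(generate(martian, rng)) for _ in range(5)]

print("Martian fit:", martian.mean, round(martian.stdev, 2))
print("Human fit:", human.mean, round(human.stdev, 2))
print("Generated Martian heights:", new_martians)
print(classify(165), classify(285))
```

The same fitted objects serve both purposes: sampling produces new data, and comparing densities gives a classifier, which is exactly what a purely discriminative boundary cannot offer.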

Examples of Discriminative models: Logistic Regression, Support Vector Machine, etc. Examples of Generative models: Hidden Markov Model, Naive Bayes Classifier, etc.

Generative vs Discriminative Models in neural network

Let’s say you want to train two models called “m-dis” and “m-gen-partial” to find the difference between a dog and a cat.

An “m-dis” will have a softmax layer at the end (final layer), which does binary classification. All the other layers (hidden layers) try to learn a representation of the input (cat/dog) that reduces the loss at the final layer. The hidden layers may* learn a rule like:
If the eyes are blue and it has brown stripes, then it is a cat; otherwise it is a dog. Such a rule ignores other important features like the shape of the body, height, etc.

On the other hand, “m-gen-partial” is trained to learn a lower-dimensional representation (distribution) that can represent the input image of a cat/dog. The final layer is not a softmax layer used for classification. The hidden layers can learn the general features of a cat/dog (shape, colour, height, etc). Moreover, the dataset needs no labelling, as we are only training to extract features that represent the input data. We can then tweak the “m-gen-partial” model to classify a cat/dog by adding a softmax classifier at the end and training with a few labelled examples of cats and dogs. We can also generate new data by adding a decoder network to the “m-gen-partial” model. Adding a decoder network is not trivial; we explain this in the “GAN model” section.

    * In a deep neural network, the hidden layers of a discriminative model actually learn general features; only the last layer is specific to classification.

The Need For Hidden State (memory)

Although Feed Forward Neural Networks, including Convolutional Neural Networks, have shown great accuracy in classifying sentences and text, they cannot store long-term dependencies in memory (hidden state). For example, whenever an average American thinks about KFC chicken, her brain immediately thinks of it as "hot" and "crispy". This is because our brains can remember the context of a conversation and retrieve it whenever it is needed. A Feed Forward Neural Network can’t interpret the context. A CNN can learn temporal context only locally, within a group of neighbors the size of its convolution kernels. So it cannot model sequential data (data with a definite ordering, like the structure of a language). An abstract view of a feed-forward neural network is shown below.
[Figure: abstract view of a feed-forward neural network]

An RNN is more versatile: its cells accept weighted input and produce both a weighted output (WO) and a weighted hidden state (WH). The hidden state acts as the memory that stores context. If an RNN represents a person talking on the phone, the weighted output is the words spoken, and the weighted hidden state is the context in which the person utters the words. [Figure: RNN cell with output and hidden state]

The yellow arrows are the hidden state, and the red arrows are the output.

A simple example can help us understand long term dependencies.

<html>
<head>
<title>
RNN, Here I come.
 </title>
 </head> <body>HTML is amazing, but I should not forget the end tag.</body>
 </html>

Let’s say we are building a predictive text editor, which helps users auto-complete the current word by using the words in the current document and perhaps the users' prior typing habits. The model should remember long-term dependencies like the need for the start tag <html> and the end tag </html>. A CNN has no provision to remember long-term context like this. On the other hand, an RNN can remember the context using its internal "memory," just as a person might think “Hey, I saw an <html> tag, then a <title> tag, so I need to close the <title> tag before closing the <html> tag.”

The intuition behind RNNs

Suppose we have to predict the 4th character in a stream of text, given the first three characters. To do that, we can design a simple Feed Forward Neural Network as in the following figure. [Figure: feed-forward network predicting the 4th character from the first three]

This is basically a Feed Forward Network in which the weights WI (green arrows) and WH (yellow arrows) are shared between some of the layers. It is an unrolled version of a Vanilla RNN, generally referred to as a many-to-one RNN because multiple inputs (3 characters, in this case) are used to predict one character. The RNN can be written with MXNet as follows:

from mxnet.gluon import Block, nn

class UnRolledRNN_Model(Block):
    # This is the initialisation of the unrolled RNN
    def __init__(self, vocab_size, num_embed, num_hidden, **kwargs):
        super(UnRolledRNN_Model, self).__init__(**kwargs)
        self.num_embed = num_embed
        self.vocab_size = vocab_size

        # Use name_scope to give child Blocks appropriate names.
        # It also allows sharing parameters between blocks recursively.
        with self.name_scope():
            self.encoder = nn.Embedding(self.vocab_size, self.num_embed)
            self.dense1 = nn.Dense(num_hidden, activation='relu', flatten=True)
            self.dense2 = nn.Dense(num_hidden, activation='relu', flatten=True)
            self.dense3 = nn.Dense(vocab_size, flatten=True)

    # This is the forward pass of the neural network
    def forward(self, inputs):
        emd = self.encoder(inputs)
        # inputs has shape (batch_size, 3), one column per character,
        # so we extract the 0th, 1st and 2nd character from each batch
        character1 = emd[:, 0, :]
        character2 = emd[:, 1, :]
        character3 = emd[:, 2, :]
        c1_hidden = self.dense1(character1) # green arrow in diagram for character 1 (WI)
        c2_hidden = self.dense1(character2) # green arrow in diagram for character 2 (WI)
        c3_hidden = self.dense1(character3) # green arrow in diagram for character 3 (WI)
        c1_hidden_2 = self.dense2(c1_hidden)  # yellow arrow in diagram (WH)
        addition_result = c2_hidden + c1_hidden_2 # Total c1 + c2
        addition_hidden = self.dense2(addition_result) # yellow arrow in diagram (WH)
        addition_result_2 = addition_hidden + c3_hidden # Total c1 + c2 + c3
        final_output = self.dense3(addition_result_2)   # The red arrow in diagram (WO)
        return final_output

Basically, this neural network has an embedding layer (emb) that is applied to each of the 3 characters, followed by 3 dense layers: dense1 (with weights WI), which takes the embedded input; dense2 (with weights WH), an intermediate layer; and dense3 (with weights WO), which produces the output. We also use MXNet array addition to combine the inputs.

In addition to the many-to-one RNN, there are other types of RNN for such memory-based applications, including the popular sequence-to-sequence RNN: ![Alt text](images/loss.png?raw=true "Sequence to Sequence model")

Here N inputs (3 characters) are mapped onto N outputs. This helps the model train faster because we measure the loss (the difference between the predicted value and the actual output) at each time step. Instead of one loss at the end, we get loss1, loss2, and so on, which gives better feedback (backpropagation) when training our model.

We use Binary Cross Entropy Loss in our model.

This model can be folded back and succinctly represented like this:
Alt text

The above representation also makes the math behind the model easy to understand:

hidden_state_at_t = (WI x input + WH x previous_hidden_state)
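
To make the recurrence concrete, here is a small pure-Python sketch of it. It is illustrative only: the weight values are made up, the inputs are one-hot vectors for three characters, and the `tanh` non-linearity is an assumption on top of the formula above (the unrolled model earlier uses `relu` inside its dense layers).

```python
import math

def matvec(W, x):
    """Multiply matrix W (a list of rows) by vector x."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]

def rnn_step(WI, WH, x, h_prev):
    """One time step: h_t = tanh(WI x + WH h_prev). The tanh is an assumption."""
    pre = [a + b for a, b in zip(matvec(WI, x), matvec(WH, h_prev))]
    return [math.tanh(p) for p in pre]

def run_rnn(WI, WH, inputs):
    """Apply the same WI and WH at every step, starting from a zero hidden state."""
    h = [0.0] * len(WH)
    for x in inputs:
        h = rnn_step(WI, WH, x, h)
    return h

# Hypothetical weights: 3 input features (one-hot characters), 2 hidden units.
WI = [[0.5, -0.3, 0.1], [0.2, 0.4, -0.6]]
WH = [[0.1, 0.2], [-0.3, 0.4]]
a, b, c = [1, 0, 0], [0, 1, 0], [0, 0, 1]

h_abc = run_rnn(WI, WH, [a, b, c])
h_cba = run_rnn(WI, WH, [c, b, a])
print(h_abc)
print(h_cba)
```

The two final hidden states differ even though both sequences contain the same characters, which is the sense in which the hidden state carries information about order.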

There are some limitations with the Vanilla RNN. For example, let’s say a long document contains the sentences "I was born in France during the world war ….." and "So I can speak French." A Vanilla RNN cannot connect the context of being "born in France" with "I can speak French" if the two sentences are far apart (temporally distant) in the document.

An RNN doesn’t provide the capability (at least in practice) to forget the irrelevant context in between the phrases. It gives more importance to the most recent hidden state, because it cannot give preference to an arbitrary (t-k) hidden state, where t is the current time step and k is a number greater than 0. This is because training an RNN on a long sequence of words can cause the gradient to vanish (when the gradients are small) or to explode (when the gradients are large) during backpropagation. Basically, backpropagation multiplies the gradients along the computational graph in the reverse direction, so small or large factors compound over many time steps. A detailed explanation of the problems with RNNs is given here.

Long Short-Term Memory (LSTM)

To address the problems with the Vanilla RNN, two German researchers, Sepp Hochreiter and Juergen Schmidhuber, proposed Long Short-Term Memory (LSTM, a more complex RNN unit) as a solution to the vanishing/exploding gradient problem. Beautifully illustrated, simpler explanations of LSTM can be found here and here. In an abstract sense, we can think of an LSTM unit as a small neural network that decides how much information it needs to preserve (memory) from the previous time step.

Implementing an LSTM

Now we can try creating our own simple character predictor.

Preparing your environment

If you're working in the AWS Cloud, you can save yourself a lot of installation work by using an Amazon Machine Image, pre-configured for deep learning. If you have done this, skip steps 1-5 below.

If you are using a Conda environment, remember to install pip inside conda by typing 'conda install pip' after you activate an environment. This will save you a lot of problems down the road.

Here's how to get set up:

  1. Install Anaconda, a package manager. It is easier to install Python libraries using Anaconda.
  2. Install scikit-learn, a general-purpose machine learning library. We'll use this to pre-process our data. You can install it with 'conda install scikit-learn'.
  3. Grab the Jupyter Notebook, with 'conda install jupyter notebook'.
  4. Get MXNet, an open source deep learning library. The Python notebook was tested on version 0.12.0 of MXNet, and you can install it using pip as follows: pip install mxnet==0.12.0
  5. Run these commands inside an activated Anaconda environment. To activate an environment named mxnet, type ‘source activate mxnet’.

The consolidated list of commands is given below.

conda install pip
pip install opencv-python
conda install scikit-learn
conda install jupyter notebook
pip install mxnet==0.12.0
  6. You can download the MXNet notebook for this part of the tutorial here, where we've created and run all this code, and play with it! Adjust the hyperparameters and experiment with different approaches to neural network architecture.

Preparing the Data Set

We will use a work of Friedrich Nietzsche as our dataset. You can download the data set here. You are free to use any other dataset, such as your own chat history, or you can download some datasets from this site.

The dataset nietzsche.txt consists of 600901 characters, out of which 86 are unique. We need to convert the entire text to a sequence of numbers.

chars = sorted(list(set(text)))
#maps character to unique index e.g. {a:1,b:2....}
char_indices = dict((c, i) for i, c in enumerate(chars))
#maps indices to characters (1:a,2:b ....)
indices_char = dict((i, c) for i, c in enumerate(chars))
#convert the entire text into sequence
idx = [char_indices[c] for c in text]
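
As a quick sanity check of this mapping, the following sketch applies the same steps to a short sample string (chosen here only for illustration) and decodes the indices back into text.

```python
# A short stand-in for the full dataset text.
text = "i_love_mxnet"

chars = sorted(set(text))
# Character -> index and index -> character lookups.
char_indices = {c: i for i, c in enumerate(chars)}
indices_char = {i: c for i, c in enumerate(chars)}

# Encode the text as a list of integers, then decode it again.
idx = [char_indices[c] for c in text]
decoded = "".join(indices_char[i] for i in idx)

print(chars)
print(idx)
print(decoded)
```

Because the two dictionaries are inverses of each other, decoding always recovers the original text, which is what lets us turn the network's predicted indices back into characters later.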

Preparing dataset for Unrolled RNN

Our goal is to convert the data set to a series of inputs and outputs. Each sequence of three characters from the input stream will be stored as the three input characters to our model, with the next character being the output we are trying to train our model to predict. For instance, we would translate the string "I_love_mxnet" into the following set of inputs and outputs. [Figure: inputs and outputs derived from "I_love_mxnet"]

The code to do the conversion follows.

#Input for neural network(our basic rnn has 3 inputs, n samples)
cs=3
c1_dat = [idx[i] for i in range(0, len(idx)-1-cs, cs)]
c2_dat = [idx[i+1] for i in range(0, len(idx)-1-cs, cs)]
c3_dat = [idx[i+2] for i in range(0, len(idx)-1-cs, cs)]
#The output of rnn network (single vector)
c4_dat = [idx[i+3] for i in range(0, len(idx)-1-cs, cs)]
#Stacking the inputs to form 3 input features
x1 = np.stack(c1_dat[:-2])
x2 = np.stack(c2_dat[:-2])
x3 = np.stack(c3_dat[:-2])

# Concatenate to form the input training set
col_concat = np.array([x1,x2,x3])
t_col_concat = col_concat.T
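
The windowing can be sketched on the example string itself. The helper `make_windows` below is hypothetical (it is not part of the notebook) and works on characters rather than indices for readability; like the snippet above, it steps through the text in non-overlapping strides of `cs` characters.

```python
def make_windows(sequence, cs=3):
    """Split sequence into (cs inputs, next item) pairs, stepping cs items at a time."""
    inputs, targets = [], []
    for i in range(0, len(sequence) - cs, cs):
        inputs.append(sequence[i:i + cs])   # the cs input characters
        targets.append(sequence[i + cs])    # the character to predict
    return inputs, targets

inputs, targets = make_windows(list("I_love_mxnet"), cs=3)
for window, target in zip(inputs, targets):
    print("".join(window), "->", target)
```

Using a stride of 1 instead of `cs` would produce overlapping windows and therefore more training examples from the same text; that is a design choice rather than something the notebook prescribes.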

We also batchify the training set in batches of 32, so each training instance is of shape 32 X 3. Batchifying the input helps us train the model faster.

#Set the batch size as 32, so input is of form 32 X 3
#output is 32 X 1
batch_size = 32
def get_batch(source,label_data, i,batch_size=32):
    bb_size = min(batch_size, source.shape[0] - 1 - i)
    data = source[i : i + bb_size]
    target = label_data[i: i + bb_size]
    #print(target.shape)
    return data, target.reshape((-1,))

Preparing the dataset for gluon RNN

This is very similar to preparing the dataset for the unrolled RNN, except for the shape of the input. The dataset should be arranged in the shape (number of examples X batch_size). For example, let us consider the sample dataset below and batch it:

[Figure: sample input sequence arranged into batches]

In the above image, the input sequence is converted to a batch size of 3. By transforming it this way, we lose the temporal relationship between 'O' and 'V', 'M' and 'T'; but we can train our model faster in batches. It is also easy to generate input sequences of arbitrary length. During our training, we use an input sequence length of 15. This is a hyperparameter and may require fine tuning for the best output.
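
A minimal sketch of this batching step, assuming the usual approach of cutting the sequence into `batch_size` contiguous chunks and laying them side by side as columns. The function name `batchify` and the sample string are illustrative, not taken from the notebook.

```python
def batchify(sequence, batch_size):
    """Arrange sequence into rows of shape (number of examples, batch_size)."""
    n = len(sequence) // batch_size           # examples per column
    sequence = sequence[:n * batch_size]      # drop the remainder that doesn't fit
    # Each column holds one contiguous chunk of the original sequence.
    columns = [sequence[b * n:(b + 1) * n] for b in range(batch_size)]
    # Transpose so that each row contains one item from every column.
    return [list(row) for row in zip(*columns)]

rows = batchify(list("I_LOVE_MXNET"), batch_size=3)
for row in rows:
    print(row)
```

Reading down a column preserves the original order, while the link between the end of one column and the start of the next is lost, which is the trade-off described above.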

Designing RNN in Gluon

Next, we define a class that allows us to create the two RNN models that we have chosen for our example: [GRU (Gated Recurrent Unit)](https://mxnet.incubator.apache.org/api/python/gluon.html#mxnet.gluon.rnn.GRU) and LSTM. GRU is a simpler version of LSTM and often performs comparably. You can find a comparison study here. The models are created with the following Python snippet:

import mxnet as mx
from mxnet import gluon
from mxnet.gluon import nn, rnn

# Class to create model objects.
class GluonRNNModel(gluon.Block):
    """A model with an encoder, recurrent layer, and a decoder."""

    def __init__(self, mode, vocab_size, num_embed, num_hidden,
                 num_layers, dropout=0.5, **kwargs):
        super(GluonRNNModel, self).__init__(**kwargs)
        with self.name_scope():
            self.drop = nn.Dropout(dropout)
            self.encoder = nn.Embedding(vocab_size, num_embed,
                                        weight_initializer = mx.init.Uniform(0.1))

            if mode == 'lstm':
                self.rnn = rnn.LSTM(num_hidden, num_layers, dropout=dropout,
                                    input_size=num_embed)
            elif mode == 'gru':
                self.rnn = rnn.GRU(num_hidden, num_layers, dropout=dropout,
                                   input_size=num_embed)
            else:
                self.rnn = rnn.RNN(num_hidden, num_layers, activation='relu', dropout=dropout,
                                   input_size=num_embed)
            self.decoder = nn.Dense(vocab_size, in_units = num_hidden)
            self.num_hidden = num_hidden
   
    # Define the forward pass of the neural network
    def forward(self, inputs, hidden):
        emb = self.drop(self.encoder(inputs))
        output, hidden = self.rnn(emb, hidden)
        output = self.drop(output)
        decoded = self.decoder(output.reshape((-1, self.num_hidden)))
        return decoded, hidden
    #Initial state of network
    def begin_state(self, *args, **kwargs):
        return self.rnn.begin_state(*args, **kwargs)

The constructor of the class creates the neural units that will be used in our forward pass. The constructor is parameterized by the type of RNN layer (LSTM, GRU or Vanilla RNN) to use. The forward pass method will be called when training the model to generate the loss associated with the training data.

The forward pass function starts by passing the input characters through an embedding layer. You can look at our previous blog post for more details on embedding. The output of the embedding layer is provided as input to the RNN. The RNN returns an output as well as the hidden state. Dropout layers are applied to prevent overfitting, so that the model doesn’t simply memorize the input-output mapping. The output produced by the RNN is passed to a decoder (dense unit), which predicts the next character and is also used to compute the loss during the training phase.

We also have a “begin state” function that initialises the initial hidden state of the model.

Training the neural network

After defining the network, we have to train it so that it learns.

def trainGluonRNN(epochs, train_data, seq=seq_length):
    for epoch in range(epochs):
        total_L = 0.0
        hidden = model.begin_state(func = mx.nd.zeros, batch_size = batch_size, ctx = context)
        for ibatch, i in enumerate(range(0, train_data.shape[0] - 1, seq)):
            # Note: this get_batch takes (source, i, seq), unlike the unrolled-RNN helper above
            data, target = get_batch(train_data, i, seq)
            # Detach the hidden state so gradients stop at the start of this sequence
            # (detach is a helper that is not shown here)
            hidden = detach(hidden)
            with autograd.record():
                output, hidden = model(data, hidden)
                L = loss(output, target) # this is total loss associated with seq_length
            L.backward()

            grads = [p.grad(context) for p in model.collect_params().values()]
            # Here gradient is for the whole batch.
            # So we multiply max_norm by batch_size and seq_length to balance it.
            gluon.utils.clip_global_norm(grads, clip * seq * batch_size)

            trainer.step(batch_size)
            total_L += mx.nd.sum(L).asscalar()

Each epoch starts by initializing the hidden state to zero. While training each batch, we detach the hidden state from the computational graph so that we don’t backpropagate the gradient beyond the sequence length (15 in our case). If we don’t detach the hidden state, the gradient is propagated all the way back to the beginning of the hidden state (t=0). After detaching, we calculate the loss and use the backward function to backpropagate the loss in order to fine tune the weights. We also clip the global norm of the gradients, scaling the clipping threshold by the sequence length and batch size because the gradient is accumulated over the whole batch.
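
To illustrate what clipping by global norm means, here is a framework-free sketch of the general technique. It is not the implementation behind `gluon.utils.clip_global_norm`; the function below and its toy gradients are assumptions made for illustration.

```python
import math

def clip_by_global_norm(grads, max_norm):
    """Rescale all gradients together so their joint L2 norm is at most max_norm."""
    total_norm = math.sqrt(sum(g * g for grad in grads for g in grad))
    if total_norm > max_norm:
        scale = max_norm / total_norm
        grads = [[g * scale for g in grad] for grad in grads]
    return grads, total_norm

# Two toy "parameter gradients" whose joint norm is sqrt(9 + 16 + 144) = 13.
grads = [[3.0, 4.0], [12.0]]
clipped, norm_before = clip_by_global_norm(grads, max_norm=6.5)
print(norm_before)
print(clipped)
```

Because every gradient is multiplied by the same factor, the direction of the update is preserved and only its size is limited, which is what keeps exploding gradients from destabilizing training.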

Text generation

After training for 200 epochs, we can generate random text. The weights of the trained model are available here. You can download the model parameters and load them using the model.load_params function.

To generate text, we initialize the hidden state.

 hidden = model.begin_state(func = mx.nd.zeros, batch_size = batch_size, ctx=context)

Remember, we don't have to reset the hidden state as we don’t backpropagate the loss (fine tune the weights).

Then, we reshape the input sequence vector to a shape that the RNN model accepts.

 sample_input = mx.nd.array(np.array([idx[0:seq_length]]).T
                                ,ctx=context)

Then we take the argmax of the output produced by the network to get the index of the predicted character 'c'.

output,hidden = model(sample_input,hidden)
index = mx.nd.argmax(output, axis=1)
index = index.asnumpy()
count = count + 1

Then we append the predicted character 'c' to the generated string and slide the input window forward: we drop the first character of the input string and append 'c' at the end.

new_string = new_string + indices_char[index[-1]]
input_string = input_string[1:] + indices_char[index[-1]]

Putting these steps together gives the complete function:

# a nietzsche like text generator
import sys
def generate_random_text(model,input_string,seq_length,batch_size,sentence_length_to_generate):
    count = 0
    new_string = ''
    cp_input_string = input_string
    hidden = model.begin_state(func = mx.nd.zeros, batch_size = batch_size, ctx=context)
    while count < sentence_length_to_generate:
        idx = [char_indices[c] for c in input_string]
        if(len(input_string) != seq_length):
            print(len(input_string))
            raise ValueError('there was a error in the input ')
        sample_input = mx.nd.array(np.array([idx[0:seq_length]]).T
                                ,ctx=context)
        output,hidden = model(sample_input,hidden)
        index = mx.nd.argmax(output, axis=1)
        index = index.asnumpy()
        count = count + 1
        new_string = new_string + indices_char[index[-1]]
        input_string = input_string[1:] + indices_char[index[-1]]
    print(cp_input_string + new_string)
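
The sliding-window logic of this loop can be tried without a trained network. In the sketch below, `toy_predict` is a stand-in for the model (it simply cycles through "abc") and `generate` is a hypothetical helper; only the append-and-slide pattern mirrors the function above.

```python
def generate(predict_next, seed, length):
    """Greedy generation: predict one character, append it, slide the window."""
    window, generated = seed, ""
    for _ in range(length):
        c = predict_next(window)      # stand-in for model + argmax
        generated += c                # append the predicted character
        window = window[1:] + c       # drop the oldest character, keep the length fixed
    return seed + generated

CYCLE = "abc"

def toy_predict(window):
    """Toy 'model': the next character follows the last one in the cycle a -> b -> c -> a."""
    return CYCLE[(CYCLE.index(window[-1]) + 1) % len(CYCLE)]

result = generate(toy_predict, "abc", 5)
print(result)
```

The window always has the same length as the seed, which is why the real function raises an error when the input string and `seq_length` disagree.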

If you look at the generated text, you will notice that the model has learnt to open and close quotations (""). The text has a definite structure and looks similar to Nietzsche's writing.

Next, we will take a look at generative models for images, and GANs in particular.

Generative Adversarial Network (GAN)

A Generative Adversarial Network is a neural network model based on a zero-sum game from game theory. It typically consists of two different neural networks, called the Discriminator and the Generator, where each network tries to outperform the other. Let us consider an example to understand GANs.

Let’s assume that there is a bank (discriminator) that uses machine learning to detect whether a given currency note is real or fake. A fraudster (generator) builds a machine learning model to counterfeit currency notes by looking at real notes, and deposits the counterfeits in the bank. The bank tries to identify the deposited notes as fake. [Figure: bank (discriminator) and fraudster (generator)]

If the bank tells the fraudster why it classified these notes as fake, he can improve his model based on those reasons. After multiple iterations, the bank can no longer tell the difference between the “real” and the “fake” currency. This is the idea behind GANs. So now let's implement a simple GAN.

I encourage you to download the notebook. You are welcome to adjust the hyperparameters and experiment with different approaches to neural network architecture.

Preparing the DataSet

We use a library called Brine to download our dataset. Brine has many data sets, so we can choose the data set that we want to download. To install Brine and download our data set, do the following:

  1. pip install brine-io
  2. brine install jayleicn/anime-faces

For this tutorial, I am using the Anime-faces dataset, which contains over 100,000 anime images collected from the Internet.

Once the dataset is downloaded, you can load it using the following code:

# brine for loading anime-faces dataset
import brine
anime_train = brine.load_dataset('jayleicn/anime-faces')

We also need to normalize the pixel values of each image to the range [-1, 1] and reshape each image from (width X height X channels) to (channels X width X height), because the latter format is what MXNet expects. The transform function reshapes the input image into the shape expected by the MXNet model.

def transform(data, target_wd, target_ht):
    # resize to target_wd * target_ht
    data = mx.image.imresize(data, target_wd, target_ht)
    # transpose from (target_wd, target_ht, 3)
    # to (3, target_wd, target_ht)
    data = nd.transpose(data, (2,0,1))
    # normalize to [-1, 1]
    data = data.astype(np.float32)/127.5 - 1
    return data.reshape((1,) + data.shape)
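
The two operations inside transform can be illustrated on a tiny image using plain Python lists. This is only a sketch of the arithmetic and of the axis reordering; the helper names and the 1 x 2 pixel image are made up for the example.

```python
def normalize(pixel):
    """Map a pixel value from [0, 255] to [-1, 1]."""
    return pixel / 127.5 - 1.0

def channels_last_to_first(image):
    """Reorder nested lists from (rows, cols, channels) to (channels, rows, cols)."""
    rows, cols, channels = len(image), len(image[0]), len(image[0][0])
    return [[[image[r][c][ch] for c in range(cols)] for r in range(rows)]
            for ch in range(channels)]

# One row, two pixels, three channels each.
image = [[[0, 127.5, 255], [255, 255, 255]]]

planes = channels_last_to_first(image)
normalized = [[[normalize(p) for p in row] for row in plane] for plane in planes]

print(planes)
print(normalized)
```

After the reordering, each of the three outer lists is one colour plane, and the normalized values lie in the same [-1, 1] range that the generator's tanh output produces.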

The getImageList function reads the images from the training_folder and returns them as a list, which is then transformed into an MXNet array.

# Read images, call the transform function, attach it to list
def getImageList(base_path,training_folder):
    img_list = []
    for train in training_folder:
        fname = base_path + train.image
        img_arr = mx.image.imread(fname)
        img_arr = transform(img_arr, target_wd, target_ht)
        img_list.append(img_arr)
    return img_list

base_path = 'brine_datasets/jayleicn/anime-faces/images/'
img_list = getImageList(base_path, training_fold)



Designing the network

We now need to design two separate networks, the discriminator and the generator. The generator takes a random vector of shape (batch size X N), where N is an integer, and converts it to an image of shape (batch size X channels X width X height).

It uses [transpose convolutions](http://deeplearning.net/software/theano_versions/dev/tutorial/conv_arithmetic.html#no-zero-padding-unit-strides-transposed) to upscale the input vectors.
This is very similar to how a decoder unit in an [autoencoder](https://en.wikipedia.org/wiki/Autoencoder) maps a lower-dimensional vector into a higher-dimensional vector representation. You can choose to design your own generator network; the only thing you need to be careful about is the input and output shapes. The input to the generator network should be of low dimension (we use a 1 X 150 vector, latent_z_size) and the output should have the expected number of channels (3, for color images), width and height (3 x width x height). Here’s the snippet of a generator network.


```python
# Simple generator. You can use any model of your choice (VGG, AlexNet, etc.),
# but ensure that it upscales the latent variable (random vector) to a
# 64 x 64 x 3 channel image, the output image we want the generative model to produce.
# netG is assumed to be a container block such as nn.Sequential(), created beforehand.
with netG.name_scope():
    # input is random_z (batch size X 150 X 1 X 1), going into a transposed convolution
    netG.add(nn.Conv2DTranspose(ngf * 8, 4, 1, 0))
    netG.add(nn.BatchNorm())
    netG.add(nn.Activation('relu'))
    # output size. (ngf*8) x 4 x 4
    netG.add(nn.Conv2DTranspose(ngf * 4, 4, 2, 1))
    netG.add(nn.BatchNorm())
    netG.add(nn.Activation('relu'))
    # output size. (ngf*4) x 8 x 8
    netG.add(nn.Conv2DTranspose(ngf * 2, 4, 2, 1))
    netG.add(nn.BatchNorm())
    netG.add(nn.Activation('relu'))
    # output size. (ngf*2) x 16 x 16
    netG.add(nn.Conv2DTranspose(ngf, 4, 2, 1))
    netG.add(nn.BatchNorm())
    netG.add(nn.Activation('relu'))
    # output size. (ngf) x 32 x 32
    netG.add(nn.Conv2DTranspose(nc, 4, 2, 1))
    # use tanh: we need an output between -1 and 1, not 0 and 1.
    # Remember the input image is normalised to [-1, 1], so the output should be too.
    netG.add(nn.Activation('tanh'))
    # output size. (nc) x 64 x 64
```

Our discriminator is a binary image classification network that maps the image of shape (batch size X channels X width x height) into a lower-dimension vector of shape (batch size X 1). This is similar to an encoder that converts a higher-dimension image representation into a lower-dimension one. Again, you can use any model that does binary classification with reasonable accuracy.

Here’s the snippet of the discriminator network:

```python
with netD.name_scope():
    # input is (nc) x 64 x 64
    netD.add(nn.Conv2D(ndf, 4, 2, 1))
    netD.add(nn.LeakyReLU(0.2))
    # output size. (ndf) x 32 x 32
    netD.add(nn.Conv2D(ndf * 2, 4, 2, 1))
    netD.add(nn.BatchNorm())
    netD.add(nn.LeakyReLU(0.2))
    # output size. (ndf*2) x 16 x 16
    netD.add(nn.Conv2D(ndf * 4, 4, 2, 1))
    netD.add(nn.BatchNorm())
    netD.add(nn.LeakyReLU(0.2))
    # output size. (ndf*4) x 8 x 8
    netD.add(nn.Conv2D(ndf * 8, 4, 2, 1))
    netD.add(nn.BatchNorm())
    netD.add(nn.LeakyReLU(0.2))
    # output size. (ndf*8) x 4 x 4
    netD.add(nn.Conv2D(1, 4, 1, 0))
    # output size. 1 x 1 x 1 (a single score per image)
```
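
The discriminator shrinks the image in the opposite direction. As a rough check, assuming no dilation, a regular convolution produces a spatial size of floor((input + 2 x padding - kernel) / stride) + 1. The helper name conv_out is hypothetical and used only for this illustration:

```python
def conv_out(size, kernel, stride, pad):
    """Spatial size after a 2-D convolution (no dilation)."""
    return (size + 2 * pad - kernel) // stride + 1

# (kernel, stride, padding) of the five Conv2D layers in the discriminator
discriminator_layers = [(4, 2, 1), (4, 2, 1), (4, 2, 1), (4, 2, 1), (4, 1, 0)]

size = 64  # the input image is 64 x 64
discriminator_sizes = []
for kernel, stride, pad in discriminator_layers:
    size = conv_out(size, kernel, stride, pad)
    discriminator_sizes.append(size)

print(discriminator_sizes)  # [32, 16, 8, 4, 1]
```

The final 1 x 1 map with a single channel is the one score per image that is later reshaped to (batch size x 1).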

Training the GAN network

Training a GAN is simple in concept but not entirely straightforward, because two networks are trained against each other. The following diagram illustrates the training process.

[Diagram: GAN training process]

The real images are given a label of 1, and the fake images are given a label of 0.

```python
# real_label holds the labels (1) assigned to real images
real_label = nd.ones((batch_size,), ctx=ctx)
# fake_label holds the labels (0) assigned to fake images
fake_label = nd.zeros((batch_size,), ctx=ctx)
```

Training the discriminator

A real image is now passed to the discriminator, to determine if it is real or fake, and the loss associated with the prediction is calculated as errD_real.

```python
# train with real images
output = netD(data).reshape((-1, 1))
# the loss measures how far the prediction is from the real label (1)
errD_real = loss(output, real_label)
```
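
The definition of loss is not shown in this section. Assuming it is a sigmoid binary cross-entropy applied to the discriminator's raw score (a common choice for GANs, but an assumption here), the following sketch shows how errD_real behaves for a single image. The function name sigmoid_bce and the example scores are ours:

```python
import math

def sigmoid_bce(logit, label):
    """Binary cross-entropy on a raw score (logit), in a numerically stable form."""
    return max(logit, 0.0) - logit * label + math.log1p(math.exp(-abs(logit)))

real_label = 1.0
# The discriminator is confident that the real image is real: small loss
confident = sigmoid_bce(4.0, real_label)
# The discriminator believes the real image is fake: large loss
mistaken = sigmoid_bce(-4.0, real_label)

print(round(confident, 3), round(mistaken, 3))  # 0.018 4.018
```

The same function with a label of 0 gives the behaviour described for fake images: the loss is large when the score for a fake image is high.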

In the next step, a random noise vector, random_z, is passed to the generator network to produce a fake image. This image is then passed to the discriminator, which classifies it as real (1) or fake (0), producing a loss, errD_fake. errD_fake is high if the discriminator wrongly classifies the fake image (label 0) as a real image (label 1). errD_fake is backpropagated to train the discriminator to classify the fake image as fake (label 0), which improves the discriminator’s accuracy.

```python
# train with a fake image and see what the discriminator predicts
# create the fake image
fake = netG(random_z)
# pass it to the discriminator; detach() keeps this loss from sending gradients into the generator
output = netD(fake.detach()).reshape((-1, 1))
errD_fake = loss(output, fake_label)
```

The total error is back propagated to tune the weights of the discriminator.

```python
# compute the total error over the fake images and the real images
# (assumes the forward passes above ran inside autograd.record())
errD = errD_real + errD_fake
# improve the discriminator's skill by backpropagating the error
errD.backward()
```
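
To see what backpropagating errD achieves, here is a toy sketch in plain Python rather than MXNet. It assumes a one-parameter-pair discriminator (score = w * x + b), a sigmoid binary cross-entropy loss, and hand-picked 1-D "images"; all names and values are hypothetical and chosen only for illustration:

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def bce(logit, label):
    """Sigmoid binary cross-entropy on a raw score."""
    p = sigmoid(logit)
    return -(label * math.log(p) + (1.0 - label) * math.log(1.0 - p))

x_real, x_fake = 1.0, -1.0   # toy real and fake samples
w, b = 0.0, 0.0              # toy discriminator parameters
lr = 0.5                     # learning rate (assumed)

errD_history = []
for step in range(3):
    s_real, s_fake = w * x_real + b, w * x_fake + b
    errD = bce(s_real, 1.0) + bce(s_fake, 0.0)   # errD_real + errD_fake
    errD_history.append(errD)
    # d(bce)/d(score) = sigmoid(score) - label, then the chain rule to w and b
    g_real, g_fake = sigmoid(s_real) - 1.0, sigmoid(s_fake) - 0.0
    w -= lr * (g_real * x_real + g_fake * x_fake)
    b -= lr * (g_real + g_fake)

print([round(e, 3) for e in errD_history])
```

Each step lowers errD, which means the toy discriminator scores the real sample higher and the fake sample lower than before.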

Training the generator

The random noise vector (random_z) used for training the discriminator is used again to generate a fake image. We then pass the fake image to the discriminator network to obtain the classification output, and the loss is calculated against the real label. The loss is high if the discriminator recognizes the generated image as fake (label 0), that is, if the generator fails to produce an image that tricks the discriminator into classifying it as real (label 1). The loss is then used to update the generator network.

```python
fake = netG(random_z)
output = netD(fake).reshape((-1, 1))
errG = loss(output, real_label)
errG.backward()
```
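
The generator update can be illustrated with a similar toy sketch in plain Python. It assumes a frozen linear discriminator (score = w_d * x + b_d), a one-parameter generator (fake = a * z), and a sigmoid binary cross-entropy loss; every name and value below is hypothetical:

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def bce(logit, label):
    """Sigmoid binary cross-entropy on a raw score."""
    p = sigmoid(logit)
    return -(label * math.log(p) + (1.0 - label) * math.log(1.0 - p))

w_d, b_d = 1.0, 0.0   # frozen toy discriminator
a = -1.0              # toy generator parameter
z = 1.0               # fixed noise value
lr = 1.0              # learning rate (assumed)

errG_history = []
for step in range(3):
    fake = a * z
    score = w_d * fake + b_d
    errG_history.append(bce(score, 1.0))   # the fake is scored against the real label
    # chain rule through the discriminator down to the generator parameter
    grad_a = (sigmoid(score) - 1.0) * w_d * z
    a -= lr * grad_a                        # only the generator parameter changes

print([round(e, 3) for e in errG_history])
```

The gradient flows through the discriminator, but only the generator parameter is updated, so errG falls as the fake sample moves toward what the discriminator scores as real.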

Generating new fake images

The model weights are available here. You can download the model parameters and load them using the model.load_params function. We can then use the generator network to create new fake images by feeding it a random vector of 150 dimensions (latent_z_size).

```python
# Let's generate some random images
num_image = 8
for i in range(num_image):
    # random input for generating images
    latent_z = mx.nd.random_normal(0, 1, shape=(1, latent_z_size, 1, 1), ctx=ctx)
    # feed the freshly sampled vector to the generator
    img = netG(latent_z)
    plt.subplot(2, 4, i + 1)
    visualize(img[0])
plt.show()
```
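
The visualize helper is not shown in this section. Because the generator ends in tanh, its output lies between -1 and 1, so any such helper has to rescale the values before display. A minimal sketch of that rescaling for a single value, assuming 8-bit pixel intensities (the name to_uint8 is ours):

```python
def to_uint8(value):
    """Map a tanh output in [-1, 1] to an integer pixel intensity in [0, 255]."""
    clipped = max(-1.0, min(1.0, value))      # guard against values slightly out of range
    return int(round((clipped + 1.0) * 127.5))

darkest = to_uint8(-1.0)    # maps to 0
brightest = to_uint8(1.0)   # maps to 255
print(darkest, brightest)   # 0 255
```

This is the inverse of normalising the training images to the range -1 to 1.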

Conclusion

Generative models open up new opportunities for deep learning. This article has explored some well-known generative models for text and image data. We learned the basics of RNNs and how an RNN can be constructed from a feed-forward neural network. We also used LSTM, GRU, and vanilla RNN models to generate text in the style of Friedrich Nietzsche. Finally, we learned about GAN models and generated images similar to the input data (anime characters).
