Setting learning_phase to 0 leads to extremely low accuracy #7177

datumbox · 2017-06-29T12:29:52Z

Loading a persisted model and setting learning_phase=0 reduces the accuracy from 100% to 50% in a binary classification problem.

The script below contains a simple example that reproduces the problem on Keras 2.0.5 and TensorFlow 1.2 (Python 2.7, Ubuntu 14.04, Nvidia Quadro K2200 GPU). I use an extremely small dataset and I intentionally overfit the model.

Snippet:

import tensorflow as tf
import numpy as np
from keras.applications.resnet50 import ResNet50
from keras.preprocessing import image
from keras.models import Model, load_model
from keras.layers import Dense, Flatten
from keras import backend as K


epochs = 5
input_shape = (224, 224, 3)
batch_size = 32
seed = 42
dataset_path = './data/cifar2tiny' # same train/test dataset to overfit it

np.random.seed(seed)
tf.set_random_seed(seed)


K.set_learning_phase(1)
base_model = ResNet50(weights='imagenet', include_top=False, input_shape=input_shape)

x = base_model.output
x = Flatten(name='flatten')(x)
predictions = Dense(2, activation='softmax', name='predictions')(x)
model = Model(inputs=base_model.input, outputs=predictions)

for layer in model.layers[0:141]:
    layer.trainable = False

model.compile(optimizer='rmsprop', loss='categorical_crossentropy', metrics=['accuracy'])

train_gen = image.ImageDataGenerator().flow_from_directory(dataset_path, target_size=input_shape[:2], batch_size=batch_size, class_mode='categorical', shuffle=True, seed=seed)
test_gen = image.ImageDataGenerator().flow_from_directory(dataset_path, target_size=input_shape[:2], batch_size=batch_size, class_mode='categorical', shuffle=True, seed=seed)
train_steps = train_gen.samples//batch_size
test_steps = test_gen.samples//batch_size

model.fit_generator(train_gen, train_steps, epochs=epochs, validation_data=test_gen, validation_steps=test_steps)

test_gen.reset()
print('Before Save:', model.evaluate_generator(test_gen, test_steps)) # Accuracy close to 100%
model.save('/tmp/tmpModel')

K.clear_session()
K.set_learning_phase(0)
test_gen.reset()
model = load_model('/tmp/tmpModel')
print('After Load - learning_phase=0:', model.evaluate_generator(test_gen, test_steps)) # Accuracy close to 50%

K.clear_session()
K.set_learning_phase(1)
test_gen.reset()
model = load_model('/tmp/tmpModel')
print('After Load - learning_phase=1:', model.evaluate_generator(test_gen, test_steps)) # Accuracy close to 100%

Output:

Found 100 images belonging to 2 classes.
Found 100 images belonging to 2 classes.
Epoch 1/5
3/3 [==============================] - 6s - loss: 1.7213 - acc: 0.5833 - val_loss: 0.9526 - val_acc: 0.8438
Epoch 2/5
3/3 [==============================] - 3s - loss: 1.0977 - acc: 0.7642 - val_loss: 0.3841 - val_acc: 0.9559
Epoch 3/5
3/3 [==============================] - 3s - loss: 0.5245 - acc: 0.8649 - val_loss: 0.1062 - val_acc: 0.9706
Epoch 4/5
3/3 [==============================] - 3s - loss: 0.0637 - acc: 0.9885 - val_loss: 0.1152 - val_acc: 0.9853
Epoch 5/5
3/3 [==============================] - 4s - loss: 0.1557 - acc: 0.9688 - val_loss: 0.0218 - val_acc: 0.9896
('Before Save:', [0.023946404457092285, 1.0])
('After Load - learning_phase=0:', [7.8911511103312177, 0.51041666666666663])
('After Load - learning_phase=1:', [0.028959342899421852, 0.98958333333333337])

As you can see above, I explicitly save the model, clear the session and load it again. This is important for reproducing the problem. I don't believe that there is an issue with the persistence mechanism of Keras, as the weights before and after load_model() seem to be the same.

Since ResNet50 does not contain any Dropout layer, I believe the problem is caused by the BatchNormalization layers. As far as I can see in the Keras source, during training we use the sample mean/variance of the mini-batch, while during testing we use the rolling mean/variance.
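To make the train/test difference concrete, here is a minimal NumPy sketch of a batch-normalization step. It is not the Keras implementation; the function name `batch_norm` and the toy data are assumptions made for illustration, while the defaults (momentum 0.99, epsilon 1e-3) follow the Keras layer defaults.

```python
import numpy as np


def batch_norm(x, gamma, beta, moving_mean, moving_var,
               training, momentum=0.99, eps=1e-3):
    """Illustrative batch norm over axis 0 (hypothetical helper, not Keras code).

    Returns (output, new_moving_mean, new_moving_var).
    """
    if training:
        # Training: normalize with the statistics of the current mini-batch
        mean = x.mean(axis=0)
        var = x.var(axis=0)
        # ...and nudge the stored moving statistics towards them
        moving_mean = momentum * moving_mean + (1.0 - momentum) * mean
        moving_var = momentum * moving_var + (1.0 - momentum) * var
    else:
        # Inference: normalize with the stored moving statistics only
        mean, var = moving_mean, moving_var
    out = gamma * (x - mean) / np.sqrt(var + eps) + beta
    return out, moving_mean, moving_var


rng = np.random.RandomState(0)
# Toy activations whose statistics differ a lot from the stored ones
x = rng.normal(loc=5.0, scale=2.0, size=(32, 4))
gamma, beta = np.ones(4), np.zeros(4)
moving_mean, moving_var = np.zeros(4), np.ones(4)

train_out, _, _ = batch_norm(x, gamma, beta, moving_mean, moving_var, training=True)
test_out, _, _ = batch_norm(x, gamma, beta, moving_mean, moving_var, training=False)

print('training mode, per-feature mean:', train_out.mean(axis=0))
print('inference mode, per-feature mean:', test_out.mean(axis=0))
```

In training mode the output is centred regardless of the stored statistics; in inference mode it is shifted whenever the stored statistics do not match the data, so the layers that follow see very different inputs.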

Any thoughts from Keras contributors? I'm happy to provide more info or investigate further.

@fchollet Could you provide any hint/pointers where to look next?



jorgecarleitao · 2017-06-30T08:31:12Z

I did a simple test to check whether BatchNormalization was the culprit, and did not find evidence of it. Specifically,

from keras.models import load_model, Sequential
from keras.layers import Dense, Flatten, Conv2D, MaxPooling2D, BatchNormalization
from keras.datasets import mnist
from keras import backend as K
import keras

import os


# input image dimensions
img_rows, img_cols = 28, 28
num_classes = 10

# the data, shuffled and split between train and test sets
(x_train, y_train), (x_test, y_test) = mnist.load_data()

if K.image_data_format() == 'channels_first':
    x_train = x_train.reshape(x_train.shape[0], 1, img_rows, img_cols)
    x_test = x_test.reshape(x_test.shape[0], 1, img_rows, img_cols)
    input_shape = (1, img_rows, img_cols)
else:
    x_train = x_train.reshape(x_train.shape[0], img_rows, img_cols, 1)
    x_test = x_test.reshape(x_test.shape[0], img_rows, img_cols, 1)
    input_shape = (img_rows, img_cols, 1)


x_train = x_train.astype('float32')
x_test = x_test.astype('float32')
x_train /= 255
x_test /= 255

# convert class vectors to binary class matrices
y_train = keras.utils.to_categorical(y_train, num_classes)
y_test = keras.utils.to_categorical(y_test, num_classes)


def get_trained_model(file_path):
    if os.path.exists(file_path):
        return load_model(file_path)

    batch_size = 128
    epochs = 12

    model = Sequential()
    model.add(Conv2D(16, kernel_size=(3, 3),
                     activation='relu',
                     input_shape=input_shape))
    model.add(Conv2D(4, (3, 3), activation='relu'))
    model.add(MaxPooling2D(pool_size=(2, 2)))
    # Use batch normalization instead of Dropout to test it
    model.add(BatchNormalization())
    model.add(Flatten())
    model.add(Dense(16, activation='relu'))
    model.add(BatchNormalization())
    model.add(Dense(num_classes, activation='softmax'))

    model.compile(loss=keras.losses.categorical_crossentropy,
                  optimizer=keras.optimizers.Adadelta(),
                  metrics=['accuracy'])

    model.fit(x_train, y_train,
              batch_size=batch_size,
              epochs=epochs,
              verbose=2,
              validation_data=(x_test, y_test))

    model.save(file_path)
    return model


model = get_trained_model('test.h5')


print('Before Load:', model.evaluate(x_test, y_test, verbose=0))


K.clear_session()
K.set_learning_phase(0)
model = load_model('test.h5')
print('After Load - learning_phase=0:', model.evaluate(x_test, y_test, verbose=0))


K.clear_session()
K.set_learning_phase(1)
model = load_model('test.h5')
print('After Load - learning_phase=1:', model.evaluate(x_test, y_test, verbose=0))

I get (TensorFlow, Keras 2.0.5)

Before Load: [0.056945249618217349, 0.98070000000000002]
After Load - learning_phase=0: [0.056945249618217349, 0.98070000000000002]
After Load - learning_phase=1: [0.056797642182186248, 0.98170000000000002]

datumbox · 2017-06-30T14:39:35Z

@jorgecarleitao I think it is important to do fine-tuning on the model (freeze some of the layers) in order to reproduce the results. Note that the way I do fine-tuning here comes from the documentation. Could you add a couple more layers to your example and freeze part of the network? Alternatively you can run my snippet.

I kept digging into this and I noticed that when you freeze part of the model, the moving mean and variance of the frozen BatchNormalization layers keep updating their values (beta and gamma do not, though). Perhaps @fchollet can clarify whether this is a bug or intended behaviour. This might actually be responsible for the problem.

It is worth noting that if this is indeed a bug, it can heavily affect people who deploy their models to a live environment.
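One way to check this kind of observation is to snapshot a layer's weights before and after training and see which arrays moved. The sketch below uses stand-in NumPy arrays rather than a real model; the helper name `changed_arrays` is hypothetical, and the array order [gamma, beta, moving_mean, moving_variance] is assumed to match what a BatchNormalization layer with default settings returns from get_weights().

```python
import numpy as np


def changed_arrays(before, after, atol=0.0):
    """Return the indices of arrays that differ between two weight snapshots."""
    return [i for i, (b, a) in enumerate(zip(before, after))
            if not np.allclose(b, a, rtol=0.0, atol=atol)]


# Stand-in snapshots; with a real model they would come from
# layer.get_weights() taken before and after fitting.
# Assumed order: [gamma, beta, moving_mean, moving_variance]
before = [np.ones(3), np.zeros(3), np.zeros(3), np.ones(3)]
after = [np.ones(3), np.zeros(3), np.full(3, 0.05), np.full(3, 1.2)]

print(changed_arrays(before, after))  # [2, 3]: only the moving statistics changed
```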

undo76 · 2017-07-25T09:11:03Z

Maybe related to #4762

As explained in my comment, I think it should be possible to freeze the BatchNormalization parameters when trainable = False.

datumbox · 2017-07-25T16:01:54Z

@undo76 Indeed I observe the same effect, i.e. the rolling mean and variance of BatchNormalization are being updated even when the layer is frozen. Still, it is not clear to me whether this is related to the low accuracy bug that I report here.

@fchollet Any thoughts?

datumbox · 2017-07-25T18:11:33Z

@jorgecarleitao I actually added more layers to your snippet and performed fine-tuning but I can't make it break. I am still uncertain whether this is caused by the BN layer.

datumbox · 2017-07-26T13:08:32Z

OK, I believe I know what the problem is. It is not a bug, but a side effect of the way we estimate the moving averages in BatchNormalization.

The mean and variance of the training data that I use are different from those of the dataset used to train ResNet50 (the effect is amplified by the fact I don't subtract the average pixel & flip the channel order but you can actually get the same result even if you do). Because the momentum of BatchNormalization has a default value of 0.99, the 5 epochs used here (3 steps each, so only 15 updates) are not enough for the moving mean and variance to converge to the correct values. This is not obvious during training, when learning_phase is 1, because BN uses the mean/variance of the batch. However, when we set learning_phase to 0, the incorrect mean/variance values which are learned during training significantly affect the accuracy. The reason @jorgecarleitao's snippet does not reproduce the problem is that he trains the model from scratch rather than using pre-trained weights.
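The convergence argument can be checked with a small scalar simulation of the moving-average update, moving = momentum * moving + (1 - momentum) * batch_value. This is only a sketch: the function name and the start/target values 0.0 and 1.0 are arbitrary assumptions, and the update counts are derived from the snippets in this thread (5 epochs x 3 steps = 15, and 250 epochs x 6 steps = 1500 for 100 images with a batch size of 16).

```python
def moving_average_after(n_updates, momentum, start, batch_value):
    """Apply the exponential moving-average update n_updates times."""
    moving = start
    for _ in range(n_updates):
        moving = momentum * moving + (1.0 - momentum) * batch_value
    return moving


start, target = 0.0, 1.0  # arbitrary: stored statistic vs. statistic of the new data
for momentum in (0.99, 0.5):
    for n_updates in (15, 1500):
        value = moving_average_after(n_updates, momentum, start, target)
        print('momentum=%.2f updates=%4d moving=%.4f' % (momentum, n_updates, value))
```

With momentum 0.99 and 15 updates the moving value has covered only about 14% of the distance to the target (1 - 0.99^15), whereas either 1500 updates or a momentum of 0.5 brings it essentially all the way.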

There are two ways to demonstrate that this is the root of the problem:

1. More iterations

In my original snippet, reduce the batch size from 32 to 16 (to perform more updates per epoch) and increase the number of epochs from 5 to 250. This way the moving mean and variance will converge to the correct values.

Output:

('Before Save:', [1.1920930376163597e-07, 1.0])
('After Load - learning_phase=0:', [1.6205018008956054e-07, 1.0])
('After Load - learning_phase=1:', [5.8301013202329466e-07, 1.0])

2. Change the momentum of BatchNormalization

Keep the number of iterations fixed but change the momentum of the BatchNormalization layers so that the rolling mean and variance are updated more aggressively (not recommended for production models). Note that changing the momentum field after the model has been initialised will not have any effect (the TensorFlow graph has already been constructed using this value), so we use a hacky patch to demonstrate the case.

In my original snippet, add the following patch between loading the base_model and defining the new layers:

# ....
base_model = ResNet50(weights='imagenet', include_top=False, input_shape=input_shape)

# PATCH MOMENTUM - START
import json
conf = json.loads(base_model.to_json())
for l in conf['config']['layers']:
    if l['class_name'] == 'BatchNormalization':
        l['config']['momentum'] = 0.5


m = Model.from_config(conf['config'])
for l in base_model.layers:
    m.get_layer(l.name).set_weights(l.get_weights())

base_model = m
# PATCH MOMENTUM - END

x = base_model.output
# ....

Output:

('Before Save:', [0.025331034635504086, 1.0])
('After Load - learning_phase=0:', [0.01937261379013459, 1.0])
('After Load - learning_phase=1:', [0.14217917621135712, 0.98958333333333337])

Hope this helps others that face similar issues.

ghost · 2019-02-19T09:58:41Z

Hi @datumbox,
What should we do about this problem? Is it necessary to set learning_phase to 1? Apparently this problem is solved in Keras versions above 2.1.0, right? If so, is it no longer necessary to set learning_phase in the code when fine-tuning the network?
Also, this statement is ambiguous to me: "(the effect is amplified by the fact I don't subtract the average pixel & flip the channel order but you can actually get the same result even if you do)". What do you mean?
On the one hand, you say the effect is amplified by the fact that you don't subtract the average pixel and flip the channel order. I understand this to mean that if we don't apply the subtraction and the flip, we get a different result. Is that right?
On the other hand, you say you get the same result even if you do apply them.
Please explain in more detail.
Thanks,

datumbox closed this as completed Jul 26, 2017

datumbox mentioned this issue Apr 20, 2018

Change BN layer to use moving mean/var if frozen #9965

Closed

lsdefine mentioned this issue Jun 29, 2020

why get same output with different input? lsdefine/attention-is-all-you-need-keras#29

Open
