Got different accuracy between history and evaluate #10014

Closed
z888888861 opened this issue Apr 23, 2018 · 27 comments

Comments

z888888861 commented Apr 23, 2018

I fit a model as follows: history = model.fit(x_train, y_train, epochs=50, verbose=1, validation_data=(x_val, y_val))

and got this output:
Epoch 48/50
49/49 [==============================] - 0s 3ms/step - loss: 0.0228 - acc: 0.9796 - val_loss: 3.3064 - val_acc: 0.6923
Epoch 49/50
49/49 [==============================] - 0s 3ms/step - loss: 0.0186 - acc: 1.0000 - val_loss: 3.3164 - val_acc: 0.6923
Epoch 50/50
49/49 [==============================] - 0s 2ms/step - loss: 0.0150 - acc: 1.0000 - val_loss: 3.3186 - val_acc: 0.6923

However, when I evaluate the model on the training set with model.evaluate(x_train, y_train),

I get this: [4.552013397216797, 0.44897958636283875]

I have no idea how this happens. Thank you.
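(For context, a minimal self-contained sketch, not the original poster's code: the per-epoch acc stored in history is a running average computed while the weights are still changing and with layers such as Dropout in training mode, whereas model.evaluate does a single pass in inference mode, so the two numbers can differ even on the same data. The tiny model and random data below are purely illustrative.)

import numpy as np
from keras.models import Sequential
from keras.layers import Dense, Dropout

# toy data, purely for illustration
x_train = np.random.rand(64, 10)
y_train = (np.random.rand(64) > 0.5).astype("float32")

model = Sequential([
    Dense(32, activation="relu", input_shape=(10,)),
    Dropout(0.5),                       # active during fit, disabled during evaluate
    Dense(1, activation="sigmoid"),
])
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])

history = model.fit(x_train, y_train, epochs=5, verbose=0)
acc_key = "acc" if "acc" in history.history else "accuracy"   # key name depends on Keras version
print("last epoch accuracy from history:", history.history[acc_key][-1])
print("evaluate on the same data:       ", model.evaluate(x_train, y_train, verbose=0)[1])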

SpecKROELLchen commented Apr 23, 2018

I have the same issue, which is why I am also posting it here.

First I tested my model after training in a separate session.
I checked whether I had read in the data and labels in the wrong order, or
whether the preprocessing was different, but everything seemed fine.
Then I thought I had an issue similar to #4875.

But then I implemented testing directly after training, also on the training dataset,
which should give the exact same result as the training accuracy, shouldn't it?!

I train using fit_generator
(I know the model is massively overfitting here, but that should not be an issue since this is just a test):
Epoch 16/17
72/73 [============================>.] - ETA: 0s - loss: 0.1138 - acc: 0.9648 - weighted_acc: 0.9648Epoch 00016: val_acc did not improve
73/73 [==============================] - 20s 274ms/step - loss: 0.1139 - acc: 0.9646 - weighted_acc: 0.9646 - val_loss: 0.9137 - val_acc: 0.6249 - val_weighted_acc: 0.6249
Epoch 17/17
72/73 [============================>.] - ETA: 0s - loss: 0.1059 - acc: 0.9661 - weighted_acc: 0.9661
73/73 [==============================] - 20s 280ms/step - loss: 0.1053 - acc: 0.9664 - weighted_acc: 0.9664 - val_loss: 0.5450 - val_acc: 0.7273 - val_weighted_acc: 0.7273
879/879 [==============================] - 6s 6ms/step

If I then use model.evaluate with the same batch_size, the result is:
Test_accuracy: 63.663%

My guess is that the model trains correctly, but that something goes wrong when the model or its weights are stored/updated at the end of training.

Update: sorry, I forgot to mention that I use Python 3.6 and Keras 1.2.4, and that to save the model I use keras.callbacks.ModelCheckpoint with the following setup:
ModelCheckpoint(Create_callbackfolder, monitor='val_acc', verbose=1, save_best_only=True)

Any help is appreciated.

@datumbox
Contributor

@z888888861 Do you use Batch Normalization layers? Are you fine-tuning the network (trainable=False for some of the layers)?

If not, there is a very high chance you are overfitting the network.
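(For anyone reading along, a minimal sketch of the fine-tuning setup being asked about, i.e. a pre-trained base with layers frozen via trainable=False; the base network, input size and class count here are only placeholders, not anything from this thread.)

from keras.applications.resnet50 import ResNet50
from keras.layers import GlobalAveragePooling2D, Dense
from keras.models import Model

base = ResNet50(include_top=False, weights="imagenet", input_shape=(224, 224, 3))
for layer in base.layers:
    layer.trainable = False                        # freeze the pre-trained layers

x = GlobalAveragePooling2D()(base.output)
predictions = Dense(10, activation="softmax")(x)   # 10 classes, placeholder
model = Model(inputs=base.input, outputs=predictions)
model.compile(optimizer="adam", loss="categorical_crossentropy", metrics=["accuracy"])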

SpecKROELLchen commented Apr 24, 2018

@datumbox "If not, there is a very high chance you are overfitting the network."
How should overfitting be the problem when he is testing the training data? Maybe I miss something here.

btw. i upgrades tf to version 1.0.7 with cuda 9.0 and the problem remains.
Does anyone has an idea how to fix this problem?

@datumbox
Contributor

@SpecKROELLchen I did not notice he was testing on training data. Are you using BN layers and fine-tuning? If so, you might be affected by what is currently being discussed in #9965.

SpecKROELLchen commented Apr 25, 2018

@datumbox Sorry for the late reply, and thanks for your response. I don't use BN layers, but I do use fine-tuning.
I also checked your link, but did not understand how my problem is related to that one.
Basically, the code snippet looks like this:

READ IN AND STUFF...

# imports assumed for this snippet
from keras.applications import resnet50
from keras.layers import GlobalAveragePooling2D, Dropout, Dense
from keras.models import Model
from keras.callbacks import ModelCheckpoint

pmodel = resnet50.ResNet50(include_top=False, input_tensor=custom_input, weights="imagenet",
                           classes=20)
x = pmodel.output
x = GlobalAveragePooling2D()(x)
#x = Dropout(drop_out_rate)(x)
predictions = Dense(num_classes, activation='sigmoid')(x)
pmodel = Model(inputs=custom_input, outputs=predictions)
pmodel.compile(optimizer="Adam", loss="binary_crossentropy", metrics=["accuracy"])
callback = ModelCheckpoint(Folder, monitor="val_acc", verbose=1, save_best_only=True)

history = pmodel.fit_generator(TRAIN_DATA, validation_data=VAL_DATA, shuffle=True,
                               callbacks=[callback])

# note: if TRAIN_DATA is a generator, evaluate_generator would be needed here instead
score = pmodel.evaluate(TRAIN_DATA, verbose=1, batch_size=12)
print(pmodel.metrics_names[1], score[1] * 100)

And the test accuracy ON TRAINING DATA is always somewhere between 50-70%, while
the training accuracy is around 95%.

I still have not solved this problem, so any help is appreciated.
@z888888861 since you first posted and did not respond later on, did you maybe solve your problem?

@datumbox
Contributor

@SpecKROELLchen You are using BN layers; ResNet50 is full of them. Unfortunately, you are also affected by how Keras implements this type of layer. Check the discussion on the PR for potential workarounds.
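(A quick way to confirm the BN layers are there, plus one workaround that comes up in that kind of discussion: leaving the BN layers trainable when freezing the base so their statistics can adapt to the new data. Whether this helps depends on the Keras version, so treat it as something to verify rather than a guaranteed fix.)

from keras.applications.resnet50 import ResNet50
from keras.layers import BatchNormalization

base = ResNet50(include_top=False, weights="imagenet", input_shape=(224, 224, 3))
bn_layers = [l for l in base.layers if isinstance(l, BatchNormalization)]
print("BatchNormalization layers in ResNet50:", len(bn_layers))

# freeze everything except the BN layers, so their moving statistics are not
# stuck at the ImageNet values while the rest of the base stays fixed
for layer in base.layers:
    layer.trainable = not isinstance(layer, BatchNormalization)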

@SpecKROELLchen

@datumbox Ah, thank you. I thought you meant the layers behind the base model. Thanks, I will keep an eye on that.

SpecKROELLchen commented Apr 27, 2018

Sorry, I have to post here again. I think my problem might still be a little bit different, or a mix of the BN problem and another one.
I tested VGG16, and the training AND validation accuracy barely changed;
it stayed between 68-69%.
Then I tested the Xception model (which should also have BN), but it massively overfitted:
train_acc = 98%, while val_acc = 70%.
No matter what I tried (increasing dropout, an l2 kernel regularizer, reducing the learning rate of the Adam optimizer), I could not get the overfitting under control.
On ResNet50, I have high train_acc AND val_acc (around 95%), while in real use or test mode the results are still poor.
I have tried switching every parameter but do not get good results.
Any help is appreciated.

SpecKROELLchen commented Apr 28, 2018

Okay,
I still have not solved my problem, so I will describe it in a bit more detail.
First step: I reduced my problem to a single-class binary classification, as I have done multiple times before.

1.) Read in the data, padding the images when resizing them
to the desired target_shape.

2.) Do an 80-20 train-test split.

READ IN DATA AND LABELS...

# IMPORTS (added so the snippet is self-contained)
from keras.preprocessing.image import ImageDataGenerator
from keras.applications import vgg19
from keras.layers import Input, GlobalAveragePooling2D, Dropout, Dense
from keras.models import Model
from keras import regularizers
from keras.callbacks import ModelCheckpoint, TensorBoard

# PREPROCESS the read-in data
im_dat_gen = ImageDataGenerator(rescale=1. / 255,
                                rotation_range=40,
                                width_shift_range=0.2,
                                height_shift_range=0.2,
                                shear_range=0.2,
                                zoom_range=0.2,
                                fill_mode='nearest')

train_gen = im_dat_gen.flow(X_train, Y_train, batch_size=12)  # img_data_generator
val_gen = im_dat_gen.flow(X_val, Y_val)

Custom_input = Input(shape=(255, 255, 3))
base_model = vgg19.VGG19(include_top=False, input_tensor=Custom_input, weights="imagenet", classes=num_classes)
x = base_model.output
x = GlobalAveragePooling2D()(x)
x = Dropout(drop_out_rate)(x)
x = Dense(fully_connected_size, activation='relu', kernel_regularizer=regularizers.l2(kernel_regu))(x)  # ,input_dim=7 * 7 * 512
x = Dropout(drop_out_rate)(x)
# 1 CLASS BINARY PROBLEM
predictions = Dense(1, activation='sigmoid')(x)
model = Model(inputs=Custom_input, outputs=predictions)

model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
# TB CALLBACKS
ckpt_callback = ModelCheckpoint(callbackfolder, monitor='val_acc', verbose=1, save_best_only=True)
tb_callback = TensorBoard("MyModel.h5", write_graph=True,
                          write_images=True)
cb_list = [ckpt_callback, tb_callback]
# FIT MODEL
history = model.fit_generator(train_gen, validation_data=val_gen, steps_per_epoch=steps_per_epoch, shuffle=True, epochs=num_epoch,
                              callbacks=cb_list, validation_steps=validation_steps)
# TEST MODEL on training data (just to see if the weights etc. are saved correctly)
# NOTE: X_train here is the raw array; unlike train_gen's output it is not rescaled by 1/255,
# which by itself can make this evaluation differ from the reported training accuracy
score = model.evaluate(X_train, Y_train, verbose=1, batch_size=batch_size)
print("%s: %.3f%%" % (model.metrics_names[1], score[1] * 100))

Comments:
I tested everything I can imagine: changing the activation, not using dropout, not freezing,
training only the last n layers.
But the problem from my post above this one still exists:
the model does not seem to train at all.
Meanwhile, if I use ResNet50 with the exact same code (just with a different preprocessing),
my model trains and validation looks very nice, but the test on training data fails completely
(probably due to the BN issue).
I can assure you that the labels and images are read in correctly
and that the training data does not contain much noise.
I really need help :D

ucohen commented Oct 21, 2018

I have the same issue. Any updates on this?

@mblouin02

Same issue here. When training using fit_generator, I get a training and validation accuracy that are both much higher than the ones I get when I evaluate the model manually, on training and testing data.

@tbagnoli

Same problem here. I'm training with fit_generator, using separate generators for the training and validation sets:

history = classifier.fit_generator(train_generator,
    steps_per_epoch=train_batches,
    validation_data=val_generator,
    validation_steps=val_batches,
    epochs=60, verbose = 0)

The loss and accuracy stored in history for the training and validation sets have simply nothing to do with the values I get from, e.g., using

scores = classifier.evaluate(X_val, Y_val, verbose=0)
print('validation loss:', scores[0])
print('validation accuracy:', scores[1])

@adityapatadia

Try downgrading to Keras 2.1.6 and see if you still face such issues. I was able to solve it by downgrading.
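(For reference, the downgrade is just pinning the package version, e.g. with pip:)

pip install keras==2.1.6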

@shivam2298

I am also facing a problem similar to this one. I am using ResNet as my base model.
After 150 epochs my training accuracy is 60%, but when I use model.evaluate() on the training set I get 10%. Is this caused by fit_generator or by ResNet50?

@hrosspet

Try downgrading to Keras 2.1.6 and see if you still face such issues. I was able to solve it by downgrading.

Thanks @adityapatadia! This solved my problem, too.

@Issuenate

I suggest you guys save the model/weights, load them back, and test; that might avoid the problem.
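(A minimal sketch of that suggestion, assuming a trained model and validation arrays from the earlier snippets; the file name is just an example.)

from keras.models import load_model

model.save("my_model.h5")                      # save architecture + weights + optimizer state
restored = load_model("my_model.h5")           # reload, e.g. in a fresh session
print(restored.evaluate(x_val, y_val, verbose=0))   # evaluate the reloaded model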

midneet commented May 13, 2019

I also have this problem: when I save the model/weights and load them to test, the problem still occurs. I am also applying fine-tuning with fit_generator; the training-data accuracy from fit_generator and from evaluate are greatly different.

@amintavakol

Loading the saved model and evaluating it on the test set gives me 16%, while
the reported validation accuracy after the last epoch (during training) is 85%.
I'm using fit_generator and real-time data augmentation.
Keras version: 2.2.4, with tensorflow backend: 1.13.1
python: 2.7.16

@Issuenate

Trying the following might help (a sketch follows the list):

  1. Build your network again in your test phase with the same code you used to build it for training.
  2. Load the weights from the trained model/weights file.
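(A minimal sketch of steps 1-2, assuming the architecture-building code is wrapped in a hypothetical build_model() helper and that ModelCheckpoint wrote weights to a hypothetical weights_best.h5 file.)

# 1. rebuild the exact same architecture used during training
model = build_model()                              # hypothetical helper
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])

# 2. load only the weights saved by ModelCheckpoint, then evaluate
model.load_weights("weights_best.h5")              # hypothetical path
print(model.evaluate(x_val, y_val, verbose=0))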

@amintavakol

What is the difference between what you suggested and the load_model method?
Do you mean to set_weights for each layer separately?

@Issuenate

If there are custom layers/functions in your model and they are not defined properly, you might not load the model correctly, which can lead to terrible test accuracy.
I used the "load_weights" method and it avoids those problems.

It is just my suggestion, you can try it though. Or check whether you load your custom layers/functions correctly.

@tinamautd

Hello, I think I have a similar issue. When I use Keras's fit() to train the model and pass the training data as the validation data, I get different training accuracy and validation accuracy during training. Have you found a solution for the different training/test accuracy?

RaviBansal7717 commented Aug 4, 2020

I have found a fix for this issue.
I encountered it while using ImageDataGenerators, and my model accuracies on both the training and validation sets were far lower when using model.evaluate() than the values returned in the model history.
The fix is to set shuffle=False while creating your validation generator; then your accuracy will match on the validation set.
For the training set it may not match, as we generally keep shuffle=True for training.
Below is an example of how to create a validation DataGenerator for reproducible results (set shuffle=False):

validation_datagen=ImageDataGenerator(rescale=1./255)
validation_generator=validation_datagen.flow_from_directory(
    validation_directory,
    target_size=target_size,
    batch_size=validation_batch_size,
    class_mode=class_mode,
    shuffle=False
)
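With shuffle=False the generator yields samples in a fixed order, so history's val_acc can be compared directly against an evaluation over the same generator; a short sketch (evaluate_generator on the Keras versions discussed in this thread, while newer versions also accept the generator in model.evaluate):

scores = model.evaluate_generator(validation_generator)
print("validation loss:", scores[0])
print("validation accuracy:", scores[1])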

@malraharsh

Thank you very much @RaviBansal7717.

@lostdatum

@RaviBansal7717 Nothing made sense, the world was turning gray, you are a lifesaver! Why on earth would shuffle=True be the default, though?

xiluo67 commented Mar 26, 2021

I still have the problem that the training accuracy in history is 100%, but when I use model.evaluate(Train_image, Train_label) it gives me 86%, even though I already turned off shuffling, regularization, and dropout, and set the batch size equal to the whole dataset size during training. I really have no idea what went wrong.

@hollemantv

I have a slightly different issue. Whether my training accuracy is 70, 80, or 98%, I routinely run model.evaluate(X_test, y_test) after training and get an accuracy score around 0.05. I've seen it happen with and without BN and dropout layers. I'm using Keras 2.4.3, TF 2.5.0, and Python 3.9.5. Suggestions very much appreciated.
