Senseless predictions on the 20_newsgroups dataset #152

Open
dfesenko opened this issue Aug 7, 2018 · 0 comments
dfesenko commented Aug 7, 2018

I have an issue trying to perform text classification on the 20_newsgroups dataset, loaded from sklearn. Only 6 newsgroups were selected in this case, so I have only 6 labels.
I got very low accuracy on the test dataset, and then I noticed that Magpie predicts the same label for all inputs; only the confidence scores differ. When I vary the number of epochs and the vector dimensions, the model starts to predict 2-3 different labels, but the performance is still very low (around 15% accuracy). What can be wrong here? A model that predicts the same output for any input is senseless.
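
For reference, a quick way to confirm this degenerate behavior is to tally the top-1 predictions over the test files. This is a minimal sketch, assuming the trained magpie instance and the data_six folder described below, and that predict_from_file returns (label, confidence) pairs sorted by confidence (as the top-3 slice further down also assumes):

import os
from collections import Counter

# Tally the top predicted label across every text file in the folder
top_labels = Counter()
for filename in os.listdir('data_six'):
    if filename.endswith('.txt'):
        predictions = magpie.predict_from_file(os.path.join('data_six', filename))
        top_labels[predictions[0][0]] += 1
print(top_labels)  # a collapsed model shows one dominant label here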

I have texts in a variable X and labels in a variable y. Then I create a folder data_six where I place each text and each label in separate .txt and .lab files using this code:

counter = 1
for i in range(len(X)):
    # Keep only documents belonging to the six selected newsgroups
    if y[i] in codes_to_leave:
        name_text = "data_six/" + str(counter) + ".txt"
        name_label = "data_six/" + str(counter) + ".lab"
        with open(name_text, 'w') as f1:
            f1.write(X[i])
        with open(name_label, 'w') as f2:
            f2.write(str(y[i]))
        counter += 1
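
For completeness, X, y, and codes_to_leave come from sklearn's 20 newsgroups loader. A minimal sketch of how they could be built (the exact loading call and the name-to-code mapping are assumptions, not part of the original snippet):

from sklearn.datasets import fetch_20newsgroups

# Fetch the raw training texts and their integer labels
newsgroups = fetch_20newsgroups(subset='train')
X, y = newsgroups.data, newsgroups.target

# Map the six selected newsgroup names to their integer label codes
selected = ['comp.sys.mac.hardware', 'misc.forsale', 'rec.sport.hockey',
            'sci.med', 'soc.religion.christian', 'talk.politics.mideast']
codes_to_leave = [newsgroups.target_names.index(name) for name in selected]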

Then I train word2vec and the model:

from magpie import Magpie

magpie = Magpie()
# Train word2vec embeddings and fit the scaler on the prepared corpus
magpie.train_word2vec('data_six', vec_dim=300)
magpie.fit_scaler('data_six')
labels = ['comp.sys.mac.hardware', 'misc.forsale', 'rec.sport.hockey',
          'sci.med', 'soc.religion.christian', 'talk.politics.mideast']
magpie.train('data_six', labels, test_ratio=0.2, epochs=10)

These are the outputs from the training process:

Train on 4691 samples, validate on 1173 samples
Epoch 1/10
4691/4691 [==============================] - 59s 13ms/step - loss: 0.0382 - top_k_categorical_accuracy: 0.8397 - val_loss: 3.2556e-06 - val_top_k_categorical_accuracy: 0.7101
Epoch 2/10
4691/4691 [==============================] - 58s 12ms/step - loss: 8.8344e-06 - top_k_categorical_accuracy: 0.7883 - val_loss: 3.1788e-06 - val_top_k_categorical_accuracy: 0.7153
Epoch 3/10
4691/4691 [==============================] - 58s 12ms/step - loss: 8.3641e-06 - top_k_categorical_accuracy: 0.7870 - val_loss: 3.1237e-06 - val_top_k_categorical_accuracy: 0.7306
Epoch 4/10
4691/4691 [==============================] - 58s 12ms/step - loss: 8.2958e-06 - top_k_categorical_accuracy: 0.7990 - val_loss: 3.0603e-06 - val_top_k_categorical_accuracy: 0.7442
Epoch 5/10
4691/4691 [==============================] - 58s 12ms/step - loss: 8.1809e-06 - top_k_categorical_accuracy: 0.8017 - val_loss: 2.9892e-06 - val_top_k_categorical_accuracy: 0.7621
Epoch 6/10
4691/4691 [==============================] - 59s 13ms/step - loss: 7.8731e-06 - top_k_categorical_accuracy: 0.8128 - val_loss: 2.9141e-06 - val_top_k_categorical_accuracy: 0.7724
Epoch 7/10
4691/4691 [==============================] - 58s 12ms/step - loss: 7.5711e-06 - top_k_categorical_accuracy: 0.8075 - val_loss: 2.8374e-06 - val_top_k_categorical_accuracy: 0.7877
Epoch 8/10
4691/4691 [==============================] - 59s 12ms/step - loss: 7.7605e-06 - top_k_categorical_accuracy: 0.7996 - val_loss: 2.7545e-06 - val_top_k_categorical_accuracy: 0.7988
Epoch 9/10
4691/4691 [==============================] - 58s 12ms/step - loss: 7.2885e-06 - top_k_categorical_accuracy: 0.8220 - val_loss: 2.6719e-06 - val_top_k_categorical_accuracy: 0.8107
Epoch 10/10
4691/4691 [==============================] - 62s 13ms/step - loss: 6.9731e-06 - top_k_categorical_accuracy: 0.8148 - val_loss: 2.5892e-06 - val_top_k_categorical_accuracy: 0.8252

Then I use the following function for making predictions and measuring accuracy:

import os

def predict_and_evaluate(data_folder):
    filenames = os.listdir(data_folder)
    count_true = 0
    count_true_in_3 = 0
    count_all = 0
    for filename in filenames:
        if filename.endswith('.txt'):
            count_all += 1
            # Predictions are (label, confidence) pairs, highest confidence first
            prediction_list = magpie.predict_from_file(os.path.join(data_folder, filename))
            first_prediction = max(prediction_list, key=lambda x: x[1])
            prediction_name = first_prediction[0]
            prediction_code = label_dict[prediction_name]
            print(prediction_code)
            top3_preds = [p[0] for p in prediction_list[:3]]
            top3_codes = [label_dict[name] for name in top3_preds]
            # Read the ground-truth label stored next to the text file
            with open(os.path.join(data_folder, filename[:-3] + 'lab'), 'r') as f:
                y_true = int(f.read())
            if y_true == prediction_code:
                count_true += 1
            if y_true in top3_codes:
                count_true_in_3 += 1
    accuracy = float(count_true) / float(count_all)
    accuracy_top_3 = float(count_true_in_3) / float(count_all)
    return accuracy, accuracy_top_3
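
With label_dict mapping category names back to integer codes (an assumption on my side; it could be built from newsgroups.target_names as in the loading sketch above), the function is then run on the same folder:

# Hypothetical construction of label_dict; reuses newsgroups and labels from above
label_dict = {name: newsgroups.target_names.index(name) for name in labels}

accuracy, accuracy_top_3 = predict_and_evaluate('data_six')
print(accuracy, accuracy_top_3)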

As a result, all inputs get "misc.forsale" or "rec.sport.hockey" as the top prediction (these categories receive the highest probabilities for any input). When I change the number of epochs and/or the vector dimensions, other categories such as soc.religion.christian may come out instead, but the pattern is the same: the same prediction for any input.

Can somebody please tell me what may be the reason for such weird behavior?
