Inaccurate predictions, where am I going wrong? #154

Open

denissellu opened this issue Aug 14, 2018 · 5 comments

@denissellu

Hello,

First off, thank you for making a great tool. I love how simple it is to use.

Quick background: the multi-label problem I'm working on is very straightforward. I have a vast corpus of documentation from software libraries (200k documents) tagged with keywords, with 92k unique keywords. After training with scikit-learn, I got the F1 score to about 0.3, and out of tons of predictions it would get some right. So I decided to try Keras to improve that result, and that is how I stumbled across this library.

The problem is that after training I'm getting really inaccurate predictions; the val_loss I get is around 1.7306e-06 or lower. Sometimes, when I increase vec_dim, I do get an improved val_loss.

My questions are: where am I going wrong? Is my code incomplete, or am I missing something?

Is my vec_dim too small? I've tried increasing it to about 10000, but it took a long time to train and the results were still inaccurate.

Any general tips or advice would be greatly appreciated!

My code is below:

from magpie import Magpie

magpie = Magpie()

# Read the label vocabulary, stripping the trailing newlines that readlines() leaves behind
labels = [line.strip() for line in open('data/magpie-ready-data.labels', 'r') if line.strip()]

# Train word2vec embeddings on the corpus
magpie.init_word_vectors('./data/magpie-ready-data', vec_dim=1000)

# Train the classifier
magpie.train('./data/magpie-ready-data', labels, test_ratio=0.2, epochs=10)

# Predict labels for a single document
magpie.predict_from_file('data-sample/magpie-ready-data-sample/1000806.txt')

Thank you for your time.

@jstypka (Collaborator) commented Aug 14, 2018

@denissellu it's hard to say, because the accuracy depends heavily on your data and its quality. One thing I can say is that 1000 vector dimensions is way too many; you should be fine with 100 or even 50. The accuracy might be slightly reduced or improved, but it shouldn't matter too much.

The other issue is the massive space of potential labels. If you have 92k unique keywords, there are 2^92000 possible label combinations and only one is correct. It's very hard to learn that kind of pattern, and it's very easy to get a terrible score. Also, 200k training samples isn't much for a problem space this large: in order to learn a specific keyword combination, the network needs to see several examples of documents that fit that pattern. In your case there are 2^92000 keyword combinations (roughly 10^27695, far more than the number of atoms in the universe) and only 200k samples to learn them from. So it's hard.
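To put that scale in perspective, a quick back-of-the-envelope check (plain Python, nothing Magpie-specific):

import math

n_keywords = 92_000   # unique keywords in the corpus
n_samples = 200_000   # training documents

# Every keyword is either assigned or not, so the label power set has 2^92000 members.
digits = n_keywords * math.log10(2)
print(f"2^{n_keywords} has about {digits:.0f} decimal digits")      # ~27695
print(f"...and there are only {n_samples} samples to cover that space")
# For comparison, the observable universe has on the order of 10^80 atoms.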

I'd suggest looking at the keyword count distribution, picking the top 100 keywords with the highest coverage, and training/testing on those. If that works, you can gradually increase the pool to get more coverage, until the performance drops to an unacceptable level. Make sure to filter the unused labels out of both the train and test set, so they don't bias the evaluation metrics. A rough sketch of that filtering is below.
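The sketch assumes the paired .txt/.lab layout that Magpie reads (one label per line in each .lab file); the directory path is just the one from the snippet above:

import os
from collections import Counter

data_dir = './data/magpie-ready-data'   # directory with paired .txt/.lab files

# Count how often each keyword appears across all .lab files
counts = Counter()
for name in os.listdir(data_dir):
    if name.endswith('.lab'):
        with open(os.path.join(data_dir, name)) as f:
            counts.update(line.strip() for line in f if line.strip())

# Keep only the 100 most frequent keywords
top_labels = [label for label, _ in counts.most_common(100)]
keep = set(top_labels)

# Rewrite every .lab file (train and test) so only the kept labels remain;
# samples that end up with no labels at all are best removed entirely
for name in os.listdir(data_dir):
    if name.endswith('.lab'):
        path = os.path.join(data_dir, name)
        with open(path) as f:
            kept = [line.strip() for line in f if line.strip() in keep]
        with open(path, 'w') as f:
            f.write('\n'.join(kept))

You would then pass top_labels (instead of the full 92k vocabulary) to magpie.train().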

Hope that helps!

@dfesenko

I opened a similar issue some time ago and nobody answered me there. It was about inaccurate predictions on the 20 Newsgroups dataset. With other similar libraries I was able to get my models above 90% accuracy (by preprocessing the text and tuning hyperparameters), but Magpie still shows very low accuracy. My code is very similar to the code provided by the author of this issue. I also selected only 6 categories to work with (6 labels), so the problem of a massive set of possible labels should be eliminated. What else could be wrong? Other libraries show good performance on this dataset. How many training examples should I have to be able to train a classifier for 6 possible labels in Magpie?

@jstypka (Collaborator) commented Aug 22, 2018

It's difficult to say, but my rule of thumb would be that a couple of thousand samples should be enough. You also need to make sure there is enough text to train the word2vec vectors on, so the texts should be reasonably long and well formatted.

What does a typical training sample of your dataset look like, and how many samples do you have, @dfesenko?

@dfesenko

I have around 9500 training examples. A typical training sample looks like an email. Actually, this is the standard, well-known 20 Newsgroups dataset, which you can download from the sklearn datasets. Some of the emails are very short.
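For reference, a sketch of how such a subset could be exported into the .txt/.lab layout Magpie expects (worth double-checking against the README); the six categories below are only an example pick, not necessarily the ones actually used here:

import os
from sklearn.datasets import fetch_20newsgroups

# Hypothetical choice of 6 categories; substitute whichever 6 were actually used
categories = ['sci.space', 'sci.med', 'rec.autos',
              'rec.sport.hockey', 'comp.graphics', 'talk.politics.misc']
newsgroups = fetch_20newsgroups(subset='train', categories=categories,
                                remove=('headers', 'footers', 'quotes'))

out_dir = './data/20news-magpie'
os.makedirs(out_dir, exist_ok=True)

# One .txt file with the text and one .lab file with the single label per sample
for i, (text, target) in enumerate(zip(newsgroups.data, newsgroups.target)):
    with open(os.path.join(out_dir, f'{i}.txt'), 'w') as f:
        f.write(text)
    with open(os.path.join(out_dir, f'{i}.lab'), 'w') as f:
        f.write(newsgroups.target_names[target])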

@jstypka (Collaborator) commented Aug 23, 2018

That sounds fine. Would you mind pasting one here as raw text (with the whitespace and punctuation of the original)? Did you check the distribution of the labels in the dataset? (Are all the samples concentrated on one label, while the other labels are underrepresented?)
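A quick way to check that distribution, assuming the same sklearn loader as in the sketch above (the category list is again just an example):

from collections import Counter
from sklearn.datasets import fetch_20newsgroups

categories = ['sci.space', 'sci.med', 'rec.autos',
              'rec.sport.hockey', 'comp.graphics', 'talk.politics.misc']
newsgroups = fetch_20newsgroups(subset='train', categories=categories)

# Samples per label; a heavily skewed distribution would hurt the minority classes
distribution = Counter(newsgroups.target_names[t] for t in newsgroups.target)
for label, count in distribution.most_common():
    print(f'{label}: {count}')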
