
Predictions are horribly wrong #145

Open
davidniki02 opened this issue Jun 28, 2018 · 4 comments

Comments

@davidniki02

I have trained magpie on a news dataset. I have 9 labels for my data.

I trained the model and tested the following text using magpie.predict_from_text():

Más de 690 mil casos de inmigrantes esperan ser resueltos por tribunales de Inmigración WASHINGTON— La Administración Trump ha convertido las protecciones de menores en sinónimo de “lagunas legales” que el Congreso debe eliminar pero mientras tanto, sobre el terreno, tampoco ha mejorado el atasco de más de 692,000 casos pendientes en los tribunales de Inmigración, según expertos.

(English: "More than 690 thousand immigrant cases await resolution by Immigration courts. WASHINGTON— The Trump Administration has turned protections for minors into a synonym for 'legal loopholes' that Congress must eliminate, but meanwhile, on the ground, it has not improved the backlog of more than 692,000 pending cases in the Immigration courts either, according to experts.")

While I don't have ANY Spanish documents in my training samples, magpie returns a 90% chance that this text belongs to one of my labels! It even predicts similar results for 3 other categories, all of them irrelevant. I even tried to see if there are any words that are causing this, but could not find any.

What could be wrong here? I trained on 400-500 documents per category and tried both 30 and 50 epochs (no change in results).

@jstypka
Collaborator

jstypka commented Jul 2, 2018

Well, if you didn't feed it any Spanish text before, the network will return random results. For the network to build representations for words (in any language), they need to appear in the training set at least N times (N=5 by default). Otherwise Magpie simply has no idea what is being fed into it and might be triggered by random noise such as "Washington" or "Trump" in your case.

The rule is - you should test/predict on the same type of data as you train.
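To illustrate the cutoff described above: here is a minimal sketch of how a min-occurrence vocabulary filter works in principle (function names and the `min_count` parameter are my own for illustration; this is not Magpie's actual API, though the N=5 default matches what's described above). Words below the threshold never get a learned representation and are effectively invisible to the model:

```python
from collections import Counter

def build_vocab(documents, min_count=5):
    """Keep only words that appear at least `min_count` times across
    the training corpus; everything below the cutoff becomes
    out-of-vocabulary (OOV) and gets no learned representation."""
    counts = Counter(word for doc in documents for word in doc.split())
    return {word for word, n in counts.items() if n >= min_count}

# "trump" appears 5 times and makes the cutoff;
# "rare" appears once and is dropped, so at prediction time
# it contributes nothing the network can interpret.
docs = ["trump visits washington"] * 5 + ["rare word appears once"]
vocab = build_vocab(docs)
```

This is why an entirely Spanish input degenerates to noise: nearly every token falls outside the learned vocabulary, so the network's output is driven by whatever few known tokens (like "Washington" or "Trump") happen to survive.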

@davidniki02
Author

The thing that worries me is the high confidence - 95% in some cases. If it does not recognize the words, should it not at least be careful about its predictions?
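One pragmatic workaround for this concern (my own suggestion, not a Magpie feature): measure what fraction of the input tokens are out-of-vocabulary before trusting the prediction, and flag the result as unreliable above some threshold. A minimal sketch:

```python
def oov_ratio(text, vocab):
    """Fraction of tokens in `text` that the model has never seen.
    A ratio near 1.0 means the prediction is essentially noise."""
    words = text.lower().split()
    if not words:
        return 1.0
    return sum(w not in vocab for w in words) / len(words)

# Hypothetical learned vocabulary from an English news corpus.
vocab = {"trump", "visits", "washington"}

# Every token of the Spanish snippet is unknown, so the ratio is 1.0
# and the 90-95% confidence score should be discarded, not trusted.
ratio = oov_ratio("Más de 690 mil casos", vocab)
```

The high confidence itself is expected behaviour for a softmax/sigmoid output layer: it always produces a score distribution, even for garbage input, so calibration on unfamiliar data has to be handled outside the network.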

@shashi-netra

I have the same issue; the results are equally poor even when I test on part of the training corpus.
