How to deal with the problem of label imbalance?? #178

Open
JiaWenqi opened this issue Mar 14, 2019 · 1 comment

Comments

@JiaWenqi

My training set has 100,000 documents and 1,000 tags, but I found that the tags follow a long-tail distribution. Some tags appear in fewer than 10 documents, while others are included in basically every document. How should I deal with this?

@jstypka
Collaborator

jstypka commented Mar 14, 2019

Magpie will likely learn to almost never recommend the classes from the long tail and will frequently default to the most common class. If that's not the behaviour you want, you might want to repartition your dataset so that it has a more balanced class distribution.
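
One possible way to do that repartitioning is sketched below in plain Python. It is not part of Magpie and does not use its API. It assumes the corpus is already in memory as a list of `(text, tags)` pairs, and the function names `tag_counts` and `rebalance` as well as the `min_docs` and `max_docs` thresholds are hypothetical. The sketch drops tags that appear in too few documents to be learned, then keeps a document only while at least one of its tags is still below a cap.

```python
import random
from collections import Counter


def tag_counts(docs):
    """Return how many documents each tag appears in."""
    return Counter(tag for _, tags in docs for tag in set(tags))


def rebalance(docs, min_docs=10, max_docs=1000, seed=0):
    """Return a subset of docs with a flatter tag distribution.

    docs is assumed to be a list of (text, tags) pairs. The thresholds
    min_docs and max_docs are illustrative and should be tuned.
    """
    counts = tag_counts(docs)
    # Tags seen in fewer than min_docs documents are dropped entirely.
    frequent_enough = {t for t, n in counts.items() if n >= min_docs}

    # Shuffle a copy so the subset does not depend on the input order.
    shuffled = list(docs)
    random.Random(seed).shuffle(shuffled)

    kept, kept_counts = [], Counter()
    for text, tags in shuffled:
        # Deduplicate tags (preserving order) and strip the rare ones.
        tags = [t for t in dict.fromkeys(tags) if t in frequent_enough]
        # Keep the document only if some tag still needs more examples.
        # A document left with no tags is skipped.
        if any(kept_counts[t] < max_docs for t in tags):
            kept.append((text, tags))
            kept_counts.update(tags)
    return kept
```

Because a document can carry several tags, the cap is soft: a very common tag can still end up above `max_docs` when it co-occurs with tags that are below the cap. Comparing `tag_counts(docs)` with `tag_counts(rebalance(docs))` shows how much flatter the distribution became. Loading the documents and writing the subset back in the format Magpie expects is left out here.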
