
Classification for small datasets #5

bratao opened this issue Aug 5, 2016 · 8 comments


bratao commented Aug 5, 2016

Hello,
First of all thank you for this awesome contribution to the scientific world.

I performed some tests with binary classification on a corpus of 5,000 samples and the result was not good (0.65). Even with a BoW Naive Bayes classifier I could get higher scores.

I tried to play with some parameters like epoch and minCount, and it improved the results only very slightly.

In the fastText Hacker News thread, a developer seems to be aware of this issue:

Thanks for pointing this out. We designed this library on large datasets, and some static variables may not be well tuned for smaller ones. For example, the learning rate is only updated every 10k words. We are fixing that now; could you please tell us which dataset you were testing on? We would like to see if we have solved this.

Is there something inherent to the algorithm that makes it perform well only with big datasets? Are there any variables we can tune for this use case?


gojomo commented Aug 7, 2016

Oddly enough, it seems the 'verbose' parameter may affect how often the learning rate is updated; see:

if (tokenCount > args.verbose) {

So perhaps try a small value there.
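
For example, on a checkout from before the rename noted below, something like this should force more frequent updates (the train.txt/model names are placeholders):

# Sketch for early builds where the training loop compared tokenCount
# against args.verbose: a smaller value means the learning rate is
# re-decayed more often.
./fasttext supervised -input train.txt -output model -verbose 1000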

@alexbeletsky

I have the same question. My dataset is about 5x bigger, but I still get quite poor results (P@1 = 0.37). It could also be related to the quality of my dataset, though it would be interesting to know the answer.

@gojomo very interesting about verbose... is this an issue?


xiamx commented Aug 9, 2016

I think the latest source renamed -verbose to the more descriptive -lrUpdateRate.
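
If in doubt about which name a given checkout uses, running the subcommand without arguments prints its usage and the supported flags (assuming a locally built binary):

./fasttext supervised
# prints the full argument list; look for -verbose vs. -lrUpdateRate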


havardl commented Mar 15, 2017

I have quite a small binary dataset, with around 400 texts for each of the two classes (<900 in total). I was able to increase precision and recall from around 0.53 to 0.64 by playing around with the different parameters. The one that had the most effect was -lrUpdateRate, with a setting of 150000-200000. -bucket needed to be above 100000, but increasing it beyond that had little effect.

Any ideas as to why fastText performs so poorly on this sample? Running plain Naive Bayes on the same sample gives between 0.86 and 0.89 accuracy with different normalization methods.
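
For reference, the settings described above correspond to an invocation like this (input/output names are placeholders):

./fasttext supervised -input train.txt -output model -lrUpdateRate 150000 -bucket 100000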

@lukewendling

Same problem; I'll add a use case to move the discussion along:

I want to use FT to classify questions from users in a chatbot app. Input is like "I want to sign up", "How do I get a login", "How do I get started?". The chatbot will eventually be able to classify many types of user input, but until I've collected actual questions from users, I want to seed the "signup" class of questions with a small number (<100) of inputs that I make up, so that my app knows "this is a signup request".

Problem:
With default settings for the FT trainer on very few (but closely related, e.g. all containing the word 'signup') observations, the predictions are not helpful: with 3 classes over 100 examples, I get probabilities of ~33% no matter what the input is, including gibberish input ("abc123").

Perhaps FT is inherently a bad choice for tiny datasets and early stage deployment like this. It's such a great tool for larger datasets that I was hoping to get it integrated into the app early on.
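
The flat distribution is easy to reproduce with predict-prob; the model and label names below are hypothetical:

echo "abc123" | ./fasttext predict-prob model.bin - 3
# __label__signup 0.34 __label__login 0.33 __label__other 0.33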

@cpuhrsch

Hello @bratao ,

Thank you for your post. You might find more support for this kind of issue on one of our community boards. In particular, the Facebook group has a lot of members, many of whom are ML experts keen on discussing applications of this library.

Specifically, there is a

Facebook group
Stack Overflow tag
Google group

If you do decide to move this to one of our other community boards, please consider closing this issue.

Thanks,
Christian


matanox commented Mar 3, 2018

Have you used pre-trained embeddings when training your classifier? You should typically get good results for this amount of supervised training data if you did.
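
For instance, with the -pretrainedVectors option (the vectors file is a placeholder; -dim must match the dimension of the supplied vectors, e.g. 300 for the published wiki vectors):

./fasttext supervised -input train.txt -output model -pretrainedVectors wiki.en.vec -dim 300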

@CharlesCCC

Is there any update on this issue?
I'm also experiencing this problem with a small dataset; the precision doesn't even get close to 60%.

I saw people suggesting parameter tweaks. I tried most of the suggestions, but they didn't help much; the value fluctuates between 50% and 53%.


To improve the performance of fastText on small datasets, the learning rate should be increased (for example, use -lr 0.5), as well as the number of epochs (for example, use -epoch 20). You can also decrease the number of buckets (for example, use -bucket 100000) to reduce the model size. A good starting point is something like:
./fasttext supervised -input TRAIN.txt -output MODEL -dim 10 -lr 0.5 -epoch 20 -bucket 100000
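
The effect of these settings can then be measured with the built-in test command, which prints precision and recall at k for a held-out set (file names are placeholders):

./fasttext test MODEL.bin TEST.txt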
