
Classification for small datasets #5

bratao opened this issue Aug 5, 2016 · 8 comments


bratao commented Aug 5, 2016

Hello,
First of all thank you for this awesome contribution to the scientific world.

I performed some tests with binary classification on a corpus of 5,000 samples and the result was not good (0.65). Even with a BoW Naive Bayes classifier I could get higher scores.

I tried to play with some parameters like epoch and minCount, and it improved the results only very slightly.

In the fastText Hacker News thread, a developer seems to be aware of this issue:

Thanks for pointing this out. We designed this library on large datasets, and some static variables may not be well tuned for smaller ones. For example, the learning rate is only updated every 10k words. We are fixing that now; could you please tell us which dataset you were testing on? We would like to see if we have solved this.

Is there something inherent to the algorithm that makes it perform well only with big datasets? Are there any variables we can tune for this use case?


gojomo commented Aug 7, 2016

Oddly enough, it seems the 'verbose' parameter may affect how often the learning rate is updated; see:

if (tokenCount > args.verbose) {

So perhaps try a small value there.
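
For example, on a checkout from before the rename noted below, something like this should force more frequent updates (the train.txt/model names are placeholders):

# Sketch for early builds where the training loop compared tokenCount
# against args.verbose: a smaller value means the learning rate is
# re-decayed more often.
./fasttext supervised -input train.txt -output model -verbose 1000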

@alexbeletsky

I have the same question. My dataset is about 5x bigger, but I still get quite poor results (P@1 = 0.37). It could also be related to the quality of my dataset, though it would be interesting to know the answer.

@gojomo very interesting about verbose... is this an issue?


xiamx commented Aug 9, 2016

I think the latest source renamed -verbose to the more descriptive -lrUpdateRate.
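
If in doubt about which name a given checkout uses, running the subcommand without arguments prints its usage and the supported flags (assuming a locally built binary):

./fasttext supervised
# prints the full argument list; look for -verbose vs. -lrUpdateRate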


havardl commented Mar 15, 2017

I have quite a small binary dataset, with around 400 texts for each of the two classes (<900 in total). I was able to increase precision and recall from around 0.53 to 0.64 by playing around with the different parameters. The one that had the most effect was -lrUpdateRate, with a setting of 150000-200000. -bucket needed to be above 100000, but increasing it beyond that had little effect.

Any ideas as to why fastText performs so poorly on this sample? Running plain Naive Bayes on the same sample gives between 0.86 and 0.89 accuracy with different normalization methods.
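
For reference, the settings described above correspond to an invocation like this (input/output names are placeholders):

./fasttext supervised -input train.txt -output model -lrUpdateRate 150000 -bucket 100000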

@lukewendling

Same problem; I'll add a use case to move the discussion along:

I want to use FT to classify questions from users in a chatbot app. Input is like "I want to sign up", "How do I get a login", "How do I get started?". The chatbot will eventually be able to classify many types of user input, but until I've collected actual questions from users, I want to seed the "signup" class of questions with a small number (<100) of inputs that I make up, so that my app knows "this is a signup request".

Problem:
With default settings for the FT trainer on very few (but closely related, e.g. all containing the word 'signup') observations, the predictions are not helpful: with 3 classes over 100 examples, I get probabilities of ~33% no matter what the input is, including gibberish input ("abc123").

Perhaps FT is inherently a bad choice for tiny datasets and early stage deployment like this. It's such a great tool for larger datasets that I was hoping to get it integrated into the app early on.
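
The flat distribution is easy to reproduce with predict-prob; the model and label names below are hypothetical:

echo "abc123" | ./fasttext predict-prob model.bin - 3
# __label__signup 0.34 __label__login 0.33 __label__other 0.33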

@cpuhrsch

Hello @bratao ,

Thank you for your post. You might find more support for this kind of issue on one of our community boards. In particular, the Facebook group has a lot of members, many of whom are ML experts keen on discussing applications of this library.

Specifically, there is a

Facebook group
Stack Overflow tag
Google group

If you do decide to move this to one of our other community boards, please consider closing this issue.

Thanks,
Christian


matanox commented Mar 3, 2018

Have you used pre-trained embeddings when training your classifier? You should typically get good results for this amount of supervised training data if you did.
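
For instance, with the -pretrainedVectors option (the vectors file is a placeholder; -dim must match the dimension of the supplied vectors, e.g. 300 for the published wiki vectors):

./fasttext supervised -input train.txt -output model -pretrainedVectors wiki.en.vec -dim 300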

@CharlesCCC

Is there any update on this issue?
I'm also experiencing this problem with a small dataset; the precision doesn't even get close to 60%.

I saw people suggesting parameter tweaks. I tried most of the suggestions, but they didn't help much; the value fluctuates between 50% and 53%.


To improve the performance of fastText on small datasets, the learning rate should be increased (for example, use -lr 0.5), as well as the number of epochs (for example, use -epoch 20). You can also decrease the number of buckets (for example, use -bucket 100000) to reduce the model size. A good starting point is something like:
./fasttext supervised -input TRAIN.txt -output MODEL -dim 10 -lr 0.5 -epoch 20 -bucket 100000
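
The effect of these settings can then be measured with the built-in test command, which prints precision and recall at k for a held-out set (file names are placeholders):

./fasttext test MODEL.bin TEST.txt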
