Classification for small datasets #5
Oddly enough, it seems the `verbose` parameter may affect how often the learning rate is updated; see Line 236 in cd5726e. So perhaps try a small value there.
I have the same question. My dataset is about 5x bigger, but I still get quite poor results. @gojomo very interesting about
I think the latest source changed
I have quite a small binary dataset, with around 400 texts per class (<900 in total). I was able to increase precision and recall from around 0.53 to 0.64 by playing around with the different parameters. The one that had the most effect was -lrUpdateRate, at a setting of 150000 to 200000. -bucket needed to be above 100000, but beyond that had little effect. Any ideas why fastText performs so poorly on this sample? Running a plain Naive Bayes on the same sample gives between 0.86 and 0.89 accuracy with different normalization methods.
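The sweep described above can be sketched with the official fastText Python bindings. This is only a sketch, assuming `pip install fasttext` and two hypothetical files, `train.txt` and `valid.txt`, in the usual `__label__X text ...` format; the parameter values are the ones mentioned in this thread, not recommendations:

```python
import fasttext  # official fastText Python bindings (assumed installed)

# Try several -lrUpdateRate values on a small binary dataset and keep the
# best validation precision, as the comment above suggests.
best = None
for lr_update_rate in (100, 50000, 150000, 200000):
    model = fasttext.train_supervised(
        input="train.txt",       # hypothetical training file
        epoch=25,                # more passes help small corpora
        lr=0.5,
        wordNgrams=2,            # bigrams often matter for short texts
        minCount=1,              # keep rare words; the corpus is tiny
        bucket=200000,           # hash buckets for n-gram features
        lrUpdateRate=lr_update_rate,
    )
    n, precision, recall = model.test("valid.txt")  # hypothetical held-out file
    if best is None or precision > best[0]:
        best = (precision, lr_update_rate)

print("best precision %.3f at lrUpdateRate=%d" % best)
```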
Same problem; I'll add a use case to move the discussion along. I want to use FT to classify questions from users in a chatbot app. Input is like "I want to sign up", "How do I get a login", "How do I get started?". The chatbot will eventually be able to classify many types of user input, but until I've collected actual questions from users, I want to seed the "signup" class with a small number (<100) of inputs that I make up, so that my app knows "this is a signup request". Problem: perhaps FT is inherently a bad choice for tiny datasets and early-stage deployment like this. It's such a great tool for larger datasets that I was hoping to integrate it into the app early on.
Hello @bratao,

Thank you for your post. You might find more support for this kind of issue on one of our community boards. In particular, the Facebook group has a lot of members, many of whom are ML experts and keen to discuss applications of this library. If you do decide to move this to one of our other community boards, please consider closing this issue.

Thanks
Have you used a pre-trained embedding when training your classifier? You should typically get good results for this size of supervised training data if you did.
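One way to follow this suggestion with the Python bindings is to initialize the classifier from published word vectors instead of random weights. A minimal sketch, assuming `pip install fasttext`; the file names are placeholders, and `dim` must match the dimension of the `.vec` file:

```python
import fasttext  # official fastText Python bindings (assumed installed)

# Warm-start the supervised model's input embeddings from pre-trained
# vectors, which usually helps when the labeled corpus is small.
model = fasttext.train_supervised(
    input="train.txt",                  # hypothetical labeled file
    pretrainedVectors="wiki.en.vec",    # hypothetical pre-trained vectors
    dim=300,                            # must equal the .vec dimension
    epoch=25,
    wordNgrams=2,
)
```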
Do we have any update on this issue? I saw people suggesting parameter tweaks; I tried most of them, but it didn't help much. Accuracy fluctuates between 50% and 53%.
Hello,
First of all, thank you for this awesome contribution to the scientific world.
I performed some tests with binary classification on a corpus of 5,000 samples and the result was not good (0.65). Even with a BoW Naive Bayes classifier I could get higher scores.
I tried to play with some parameters like epoch and minCount, and it improved the results only very slightly.
In the fastText Hacker News thread, a developer seems to be aware of this issue.
Is there something inherent to the algorithm that makes it perform well only on big datasets? Are there any parameters we can tune for this use case?
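For reference, the BoW Naive Bayes baseline that several commenters say beats fastText on tiny corpora is simple enough to sketch in pure Python. This is a hedged, stdlib-only illustration of multinomial Naive Bayes with Laplace smoothing, not the exact setup anyone in this thread used; the example sentences are made up to mirror the chatbot use case above:

```python
import math
from collections import Counter, defaultdict

def train_nb(docs, labels, alpha=1.0):
    """Train a multinomial Naive Bayes model on bag-of-words counts."""
    vocab = set()
    word_counts = defaultdict(Counter)   # label -> word -> count
    label_counts = Counter(labels)
    for doc, label in zip(docs, labels):
        words = doc.lower().split()
        vocab.update(words)
        word_counts[label].update(words)
    model = {"vocab": vocab, "alpha": alpha, "labels": {}}
    for label, count in label_counts.items():
        model["labels"][label] = {
            "log_prior": math.log(count / len(docs)),
            "word_counts": word_counts[label],
            "total": sum(word_counts[label].values()),
        }
    return model

def predict_nb(model, doc):
    """Return the label with the highest smoothed log-probability."""
    vocab_size = len(model["vocab"])
    alpha = model["alpha"]
    best_label, best_score = None, float("-inf")
    for label, stats in model["labels"].items():
        score = stats["log_prior"]
        for word in doc.lower().split():
            count = stats["word_counts"][word]  # Counter yields 0 if absent
            score += math.log(
                (count + alpha) / (stats["total"] + alpha * vocab_size)
            )
        if score > best_score:
            best_label, best_score = label, score
    return best_label

# Toy usage with made-up chatbot-style inputs:
docs = ["i want to sign up", "how do i get a login",
        "what is the weather", "will it rain today"]
labels = ["signup", "signup", "weather", "weather"]
model = train_nb(docs, labels)
print(predict_nb(model, "how do i sign up"))  # prints "signup"
```

With a few hundred labeled texts per class, a baseline like this needs no learning-rate or bucket tuning, which may be part of why it fares better here than fastText out of the box.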