Partial learn #15
Hi @Slavenin, you can split up your training set and train on it sequentially. However, I'm pretty sure that for more advanced training, such as hyperparameter search, this approach might not be applicable. Maybe @sergioburdisso could elaborate a bit on that 😇?
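A minimal sketch of the "split up and train sequentially" idea. The `SS3` usage here is only commented out and assumed: whether repeated `fit()` calls accumulate counts across chunks should be verified against the pyss3 version in use.

```python
# Sketch: split a large training set into chunks and train on each chunk
# in turn, so the whole dataset never has to sit in memory at once.
# Swap in `from pyss3 import SS3` and check that calling fit() repeatedly
# accumulates word counts in your version (assumption, not verified here).

def chunks(xs, ys, size):
    """Yield successive (documents, labels) chunks of at most `size` items."""
    for i in range(0, len(xs), size):
        yield xs[i:i + size], ys[i:i + size]

# Toy stand-in data; in practice these would be read lazily from disk.
x_train = ["doc one", "doc two", "doc three", "doc four", "doc five"]
y_train = ["a", "b", "a", "b", "a"]

# clf = SS3()
for x_chunk, y_chunk in chunks(x_train, y_train, size=2):
    pass  # clf.fit(x_chunk, y_chunk)  # each call sees only one chunk
```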
Hi @Slavenin! That's weird, what type of labels are you working with? It would be nice if we could replicate this error locally so that we can fix it. Is it just a problem with the …? You can also use the ….
Of course, you can use a loop to implement the above code; I wrote it that way just to make the explanation simpler.

As pointed out by @angrymeir, when working with a big dataset it is better to perform hyperparameter optimization on a sub-sample of the dataset, for instance by using sklearn's stratified k-fold function and then working with just a single fold (subset) to optimize the model. (Note that we use "stratified" here to make sure at least one sample of each category is included in each split; in fact, it will try to fit the same number of samples of each category into each training subset/split/fold.)

Nevertheless, optimizing the current source code to be robust with respect to the size of the dataset, and especially to the number of categories, is on the TODO list, for instance by using NumPy data structures (I have some work done in this regard, but there is still work left to do).

(Thanks, @angrymeir, for your valuable help, you rock buddy! 💪)
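The sub-sampling step described above could look like this: take a single fold from sklearn's `StratifiedKFold`, which preserves the per-category proportions, and optimize hyperparameters on that subset only. The toy data and fold count are illustrative; stratification requires every category to have at least `n_splits` samples.

```python
# Sketch: use one stratified fold (~1/10 of the data here) as a
# representative subset for hyperparameter optimization.
import numpy as np
from sklearn.model_selection import StratifiedKFold

x = np.array(["doc %d" % i for i in range(20)])  # toy documents
y = np.array(["a", "b"] * 10)                    # toy labels, 2 categories

skf = StratifiedKFold(n_splits=10, shuffle=True, random_state=42)
_, subset_idx = next(iter(skf.split(x, y)))  # indices of a single fold
x_sub, y_sub = x[subset_idx], y[subset_idx]
# ...run the (expensive) hyperparameter search on (x_sub, y_sub) only...
```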
Ah @sergioburdisso, that makes sense! @Slavenin, if you can't provide the dataset, it might give some insights to see the output of the "Vocab. Size per Category". I could imagine that you have a sample ….
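For a quick self-check outside the tool, vocabulary size per category can be approximated by counting unique whitespace-separated tokens per label. This is a hypothetical stand-alone helper, not PySS3's actual "Vocab. Size per Category" output; a category with a vocabulary of one or two tokens usually signals a degenerate class.

```python
# Rough per-category vocabulary sizes from (label, text) pairs.
from collections import defaultdict

docs = [("a", "one two three"), ("a", "two four"), ("b", "five")]

vocab = defaultdict(set)
for label, text in docs:
    vocab[label].update(text.split())  # unique tokens per category

vocab_size = {label: len(words) for label, words in vocab.items()}
```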
File names are the category IDs, simply numbers.
You're right, but I do not understand how to fix that. I have a category with only one record.
As far as I understand it, there are two issues: …

I think @sergioburdisso can answer that way more competently :)
Quick fix of default compatibility with foreign languages (#15).
Hi @Slavenin!
Yes, the model works independently of the language being used; however, the default preprocessing function ignores characters outside the "standard" ones (…).
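To illustrate the kind of issue described above (this is not PySS3's actual preprocessing function): a tokenizer restricted to ASCII letters silently drops every word written in a non-Latin script, whereas the `\w` word-character class with Unicode semantics keeps them.

```python
# ASCII-only tokenization vs. Unicode word characters.
import re

text = "привет world"  # Cyrillic word + Latin word

ascii_tokens = re.findall(r"[a-zA-Z]+", text)        # "standard" chars only
word_tokens = re.findall(r"\w+", text, re.UNICODE)   # all word characters
```

With the ASCII-only pattern the Cyrillic word disappears entirely, which for a purely non-Latin dataset would leave the model with an empty vocabulary.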
I've also just made a tiny update to the source code of the preprocessing function so that it considers all valid "word" characters (…). Let us know if this solved your problem, and do not hesitate to re-open this issue if needed.

Regarding the size of the dataset, I would like to point out two things: …
PS: I'm really sorry for the delay; I'm currently on vacation 😎 in the countryside 🐔, with very limited Internet access (and, more importantly, very limited electrical power xD). Take care, guys! 💪
Hi!
I have a dataset of 900k records with 800 categories, but I cannot train my model because 16 GB of RAM is not enough.
How can I train my model in parts?