TF-IDF: change to scikit-learn #1069

Open
ajdapretnar opened this issue Jul 5, 2024 · 0 comments

@ajdapretnar
Collaborator

Orange uses the following formula for IDF: math.log10(number_of_docs/number_of_docs_with_word). With this formula, a word that appears in every document gets an IDF of log10(1) = 0, so its TF-IDF value is 0 in all documents, and subsequent preprocessors then remove it. To avoid this, one can use Smooth IDF, which uses math.log10(1 + number_of_docs/number_of_docs_with_word).
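A minimal sketch of the two formulas quoted above, using only the standard library. The function names `orange_idf` and `orange_smooth_idf` are made up for illustration and are not Orange's actual API; the bodies simply restate the expressions from this issue.

```python
import math


def orange_idf(number_of_docs, number_of_docs_with_word):
    # Plain IDF as quoted in this issue (hypothetical helper name).
    return math.log10(number_of_docs / number_of_docs_with_word)


def orange_smooth_idf(number_of_docs, number_of_docs_with_word):
    # Smooth IDF as quoted in this issue (hypothetical helper name).
    return math.log10(1 + number_of_docs / number_of_docs_with_word)


number_of_docs = 4

# A word present in all 4 documents: plain IDF collapses to zero,
# so every TF-IDF value for that word is zero as well.
print(orange_idf(number_of_docs, 4))         # 0.0

# The smooth variant stays positive for the same word (log10(2)).
print(orange_smooth_idf(number_of_docs, 4))

# A rarer word (1 of 4 documents) gets a non-zero weight either way.
print(orange_idf(number_of_docs, 1))
print(orange_smooth_idf(number_of_docs, 1))
```

The all-zero column produced by the plain formula is what later preprocessors treat as uninformative and drop.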

Why is this a problem? Orange's formulas are not the same as those in scikit-learn:
a) scikit-learn's IDF (smooth_idf=False) is math.log(number_of_docs / number_of_docs_with_word) + 1
b) scikit-learn's smooth IDF (smooth_idf=True) is math.log((number_of_docs + 1) / (number_of_docs_with_word + 1)) + 1
c) scikit-learn uses the natural log, while we use log10 (not a big issue, since the two logarithms differ only by a constant factor of ln(10), but still)
d) Orange's TF is not normalized by document length when computing TF-IDF, although such normalization is standard.
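To make the differences concrete, here is a small standard-library sketch that puts Orange's formulas (as quoted in this issue) next to the IDF formulas given in scikit-learn's TfidfTransformer documentation: ln(n / df) + 1 without smoothing and ln((1 + n) / (1 + df)) + 1 with smoothing. All function names are hypothetical, and nothing here calls Orange or scikit-learn; it only evaluates the formulas.

```python
import math


def orange_idf(n_docs, doc_freq, smooth=False):
    # Formulas as quoted in this issue (hypothetical helper).
    if smooth:
        return math.log10(1 + n_docs / doc_freq)
    return math.log10(n_docs / doc_freq)


def sklearn_style_idf(n_docs, doc_freq, smooth=True):
    # Formulas as documented for scikit-learn's TfidfTransformer
    # (re-implemented here by hand, not imported from scikit-learn).
    if smooth:
        return math.log((1 + n_docs) / (1 + doc_freq)) + 1
    return math.log(n_docs / doc_freq) + 1


n_docs = 10
for doc_freq in (1, 5, 10):
    print(
        doc_freq,
        round(orange_idf(n_docs, doc_freq), 4),
        round(orange_idf(n_docs, doc_freq, smooth=True), 4),
        round(sklearn_style_idf(n_docs, doc_freq, smooth=False), 4),
        round(sklearn_style_idf(n_docs, doc_freq, smooth=True), 4),
    )

# A word in every document: 0.0 in Orange's plain IDF,
# but 1.0 in both scikit-learn variants because of the trailing "+ 1".
print(orange_idf(10, 10))                        # 0.0
print(sklearn_style_idf(10, 10, smooth=False))   # 1.0

# Point c): the log base only rescales values by ln(10).
ratio = math.log(n_docs / 2) / math.log10(n_docs / 2)
print(math.isclose(ratio, math.log(10)))         # True
```

Note that the trailing "+ 1" means the two libraries differ by more than a constant factor, so existing Orange results cannot be converted by rescaling alone.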

We should probably switch to scikit-learn's implementation here. This would, of course, change the computed values and therefore affect teaching materials.
