TF-IDF: change to scikit-learn #1069

ajdapretnar · 2024-07-05T08:51:43Z

Orange uses the following formula for IDF: math.log10(number_of_docs/number_of_docs_with_word). With this formula, a word that appears in every document gets an IDF of 0, since log10(1) = 0, so all of its TF-IDF values are 0 and subsequent preprocessors remove it. To avoid this, one can use Smooth IDF, which uses math.log10(1 + number_of_docs/number_of_docs_with_word).
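To make the zeroing concrete, here is a minimal sketch of the two formulas quoted above. The function names idf and smooth_idf are made up for illustration and are not Orange's actual API; only the formulas come from this issue.

```python
import math

def idf(number_of_docs, number_of_docs_with_word):
    # Plain IDF as quoted in this issue (hypothetical helper, not Orange's API).
    return math.log10(number_of_docs / number_of_docs_with_word)

def smooth_idf(number_of_docs, number_of_docs_with_word):
    # Smooth IDF as quoted in this issue (hypothetical helper).
    return math.log10(1 + number_of_docs / number_of_docs_with_word)

number_of_docs = 4

# A word present in all 4 documents: plain IDF collapses to zero.
print(idf(number_of_docs, 4))                   # 0.0
# The same word keeps a positive weight under Smooth IDF.
print(round(smooth_idf(number_of_docs, 4), 3))  # 0.301
# A word present in only 1 document is unaffected by the zeroing problem.
print(round(idf(number_of_docs, 1), 3))         # 0.602
```

Since TF-IDF is TF multiplied by IDF, an IDF of 0 makes the whole column 0 regardless of the term counts.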

Why is this a problem? It differs from scikit-learn:
a) With smooth_idf=False, scikit-learn computes IDF as math.log(number_of_docs / number_of_docs_with_word) + 1.
b) With smooth_idf=True (the default), it computes math.log((number_of_docs + 1) / (number_of_docs_with_word + 1)) + 1.
c) scikit-learn uses the natural log, while we use log10 (not a big issue, as changing the base only multiplies the logarithm by a constant, but still).
d) TF is not normalized by document length when computing TF-IDF, although such normalization is standard.
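For a side-by-side look, the sketch below compares the Orange formula from this issue with the IDF formulas given in the scikit-learn documentation for TfidfTransformer: ln(n / df) + 1 without smoothing and ln((1 + n) / (1 + df)) + 1 with smooth_idf=True. It uses only the math module rather than calling scikit-learn, and orange_idf and sklearn_idf are hypothetical names, so treat it as an illustration of the formulas and not of either library's code.

```python
import math

def orange_idf(n, df):
    # Orange's plain IDF as described in this issue (hypothetical helper).
    return math.log10(n / df)

def sklearn_idf(n, df, smooth=True):
    # IDF as documented for scikit-learn's TfidfTransformer (hypothetical helper).
    if smooth:
        return math.log((1 + n) / (1 + df)) + 1
    return math.log(n / df) + 1

n = 4  # number of documents
for df in (1, 2, 4):  # number of documents containing the word
    print(df,
          round(orange_idf(n, df), 3),
          round(sklearn_idf(n, df, smooth=False), 3),
          round(sklearn_idf(n, df), 3))

# Changing the log base only rescales the logarithm: ln(x) == log10(x) * ln(10).
print(math.isclose(math.log(4), orange_idf(4, 1) * math.log(10)))  # True
```

In the documented scikit-learn formulas, the added 1 keeps a word that occurs in every document at an IDF of 1 instead of 0, so it is not dropped.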

We should probably use the scikit-learn implementation here. This would, of course, affect teaching materials.
