Implement filter_extremes #169

Open
wants to merge 27 commits into
base: master


henrifroese
Collaborator

@henrifroese henrifroese commented Aug 26, 2020

This PR adds a new function, hero.filter_extremes(s: TokenSeries, max_words=None, min_df=1, max_df=1.0), which removes from all documents the words whose document frequency is below min_df or above max_df. Additionally, only the max_words most frequent words are kept. The name follows gensim's similar function here.

Excerpt from docstring to explain functionality:

Decrease the size of your documents by
filtering out words by their frequency.

It is often useful to reduce the size of your dataset
by dropping words in order to
reduce noise and improve performance.
This function removes all words/tokens from
all documents where the
document frequency (=number of documents a term appears in) is

  • below min_df
  • above max_df.

When min_df or max_df is an integer, then document frequency
is the absolute number of documents that a term
appears in. When it's a float, it is the
proportion of documents a term appears in.

Additionally, only max_words many words are kept.

Parameters

max_words : int, default to None
The maximum number of words/tokens that
are kept, according to term frequency descending.
If None, will consider all features.

min_df : int or float, default to 1
Remove words that have a document frequency
lower than min_df. A float is interpreted as a
proportion of documents, an integer as an absolute count.

max_df : int or float, default to 1.0
Remove words that have a document frequency
higher than max_df. A float is interpreted as a
proportion of documents, an integer as an absolute count.
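
The min_df / max_df semantics described above can be illustrated with a minimal pure-Python sketch. This is not the PR's implementation (which builds on the DocumentTermDF work); the helper names resolve_threshold and filter_by_document_frequency are hypothetical, and it assumes that a term whose document frequency equals a threshold is kept.

```python
from collections import Counter
from typing import List, Union

Number = Union[int, float]


def resolve_threshold(value: Number, n_docs: int) -> float:
    """Convert a min_df/max_df value into an absolute document count.

    Assumption: a float is a proportion of n_docs, an int is an absolute count.
    """
    if isinstance(value, float):
        return value * n_docs
    return value


def filter_by_document_frequency(
    docs: List[List[str]], min_df: Number = 1, max_df: Number = 1.0
) -> List[List[str]]:
    """Drop tokens whose document frequency is below min_df or above max_df."""
    n_docs = len(docs)
    # Document frequency: count each term once per document it appears in.
    df = Counter(term for doc in docs for term in set(doc))
    low = resolve_threshold(min_df, n_docs)
    high = resolve_threshold(max_df, n_docs)
    keep = {term for term, count in df.items() if low <= count <= high}
    return [[term for term in doc if term in keep] for doc in docs]


docs = [["a", "b", "a"], ["a", "c"], ["a", "b", "d"]]
# "a" appears in 3 documents, "b" in 2, "c" and "d" in 1 each.
print(filter_by_document_frequency(docs, min_df=2))    # integer: absolute count
print(filter_by_document_frequency(docs, max_df=0.7))  # float: proportion
```

With min_df=2 only "a" and "b" survive; with max_df=0.7 the term "a", which appears in every document, is dropped.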

Example

>>> import texthero as hero
>>> import pandas as pd
>>> s = pd.Series(
...     [
...         "Here one two one one one go there",
...         "two go one one one two two two is important",
...     ]
... )
>>> s.pipe(hero.tokenize).pipe(hero.filter_extremes, 3)
0              [one, two, one, one, one, go]
1    [two, go, one, one, one, two, two, two]
dtype: object
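
To make the max_words step in the example easier to follow, here is a rough sketch that reproduces the output above with plain Python lists. It is an illustration only, not the code in this PR: keep_most_frequent is a hypothetical helper, str.split stands in for hero.tokenize (equivalent here because the sentences contain no punctuation), and ties in term frequency are broken by first occurrence, which the docstring does not specify.

```python
from collections import Counter
from typing import List, Optional


def keep_most_frequent(
    docs: List[List[str]], max_words: Optional[int] = None
) -> List[List[str]]:
    """Keep only the max_words terms with the highest overall term frequency."""
    if max_words is None:
        # No limit: every term is kept.
        return [list(doc) for doc in docs]
    # Term frequency: total number of occurrences across all documents.
    tf = Counter(term for doc in docs for term in doc)
    keep = {term for term, _ in tf.most_common(max_words)}
    return [[term for term in doc if term in keep] for doc in docs]


docs = [
    "Here one two one one one go there".split(),
    "two go one one one two two two is important".split(),
]
result = keep_most_frequent(docs, max_words=3)
print(result)
```

Across both documents "one" occurs 7 times, "two" 5 times and "go" twice, while every other token occurs once, so with max_words=3 exactly those three terms remain.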

Note: the diff only shows this many changed lines because this PR builds upon the DocumentTermDF work (see #156).

@vercel vercel bot temporarily deployed to Preview August 26, 2020 15:37 Inactive
Black just rolled out V20.8b1. This creates errors with our ./tests.sh -> switch back
@henrifroese
Collaborator Author

henrifroese commented Aug 29, 2020

Note: Black (our formatter) just rolled out V20.8b1 3 days ago. This creates errors with our ./tests.sh in preprocessing because of whitespace. Will investigate this further but atm we set the black version in .travis.yml and setup.cfg to the last working version (19.10b1).

EDIT: found the issue, see the issue opened at Black here

@henrifroese henrifroese added the enhancement New feature or request label Sep 6, 2020
@jbesomi
Owner

jbesomi commented Sep 8, 2020

Thanks. Will review once the previous PRs are merged

@jbesomi jbesomi marked this pull request as draft September 14, 2020 13:34
@jbesomi
Owner

jbesomi commented Sep 14, 2020

Waiting for #162 to be merged; the conflicts will then need to be resolved (which will also simplify the code).

@mk2510
Collaborator

mk2510 commented Sep 22, 2020

We have now merged all changes from master into this branch, and it is ready for review and to be merged 🐙 :octocat: 🥇

3 participants