Ngram counts are sometimes wrong #476

Closed
firemuzzy opened this issue May 8, 2018 · 3 comments

Comments

@firemuzzy

I have noticed that for some phrases the ngram counts are greater than the number of occurrences in the text.

Here is an example: https://runkit.com/firemuzzy/5af1d67875914d001263570c

If you look at the phrase "free for", the provided text has 2 occurrences, but nlp-compromise returns a count of 3. Coincidentally, the word "free" occurs 3 times, so I wonder if something is going wrong in the normalization.

At the same time, the phrase "just not user friendly" occurs only once, but nlp-compromise reports a count of 2. That discrepancy completely puzzles me.

Am I missing something about how ngrams work?

@spencermountain
Owner

hey Michael, nice find.
I'll take a look at fixing this this week.
thanks

@spencermountain
Owner

ha! oh wow, it's the contraction - "It's free for".
it's creating the gram "[is] free for".
this is a great bug, will fix it today.
cheers
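
For readers following along, here is a minimal sketch of how a contraction could inflate an n-gram count in this way. It does not use the library's API or reflect its actual internals: `tokenize`, `countNgrams`, the `implicit` field, and the sample sentence are all hypothetical, made up for illustration. The assumption is that "it's" expands into a visible token plus a hidden "is" token with empty display text, so a 3-gram containing the hidden token prints as "free for" and is counted alongside the real 2-grams.

```javascript
// Hypothetical tokenizer: a contraction becomes a visible token plus a
// hidden token whose display text is empty (assumed, not the real library).
function tokenize(text) {
  const tokens = [];
  for (const word of text.toLowerCase().split(/\s+/).filter(Boolean)) {
    const clean = word.replace(/[.,!?]/g, '');
    if (clean === "it's") {
      tokens.push({ text: "it's", implicit: 'it' });
      tokens.push({ text: '', implicit: 'is' }); // hidden "[is]"
    } else {
      tokens.push({ text: clean, implicit: null });
    }
  }
  return tokens;
}

// Count every n-gram of size 1..maxSize, keyed by its display text.
// With skipHidden = false, windows that include a hidden token collapse
// onto the key of a shorter gram, which inflates that gram's count.
function countNgrams(tokens, maxSize, skipHidden) {
  const list = skipHidden ? tokens.filter(t => t.text !== '') : tokens;
  const counts = {};
  for (let size = 1; size <= maxSize; size++) {
    for (let i = 0; i + size <= list.length; i++) {
      const key = list
        .slice(i, i + size)
        .map(t => t.text)
        .filter(Boolean)
        .join(' ');
      if (key) counts[key] = (counts[key] || 0) + 1;
    }
  }
  return counts;
}

// Made-up sample text: "free for" really occurs twice.
const sample = "It's free for everyone. The app is free for students.";
const buggy = countNgrams(tokenize(sample), 3, false);
const fixed = countNgrams(tokenize(sample), 3, true);

console.log(buggy['free for']); // 3 (two real 2-grams plus "[is] free for")
console.log(fixed['free for']); // 2
```

Dropping hidden tokens before windowing is only one possible remedy; the fix that shipped in 11.8.0 may work differently.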

spencermountain added a commit that referenced this issue May 11, 2018
@spencermountain spencermountain mentioned this issue May 11, 2018
Merged
@spencermountain
Owner

fixed in 11.8.0
thanks!
