Strange behavior of tokenize(.., only_ci=True) #10

Open
lumpidu opened this issue Jan 13, 2021 · 1 comment

Comments

@lumpidu

lumpidu commented Jan 13, 2021

The following snippet gives inconsistent results:

from reynir_correct import tokenize

texts = ["Skúta", "300 ára gömul írsk skúta fundin við Suður-Noreg"]
for text in texts:
    # Print every non-empty token with its error code and description
    for tok in tokenize(text, only_ci=True):
        if tok.txt:
            print(f"{tok.txt:12} {tok.error_code:8} {tok.error_description}")

Output:

Skúta                 
300                   
ára                   
gömul                 
írsk                  
skúta        U001     Óþekkt orð: 'skúta'
fundin                
við                   
Suður-Noreg

The correctly spelled word skúta is flagged as unknown (U001) when it appears in lowercase inside the sentence, but not when it stands alone as the capitalized word Skúta. Calling tokenize() without any options works as expected.
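
To narrow this down, the two modes can be compared directly. The following is only a minimal sketch: the helper names flagged_words and only_ci_discrepancies are hypothetical and not part of reynir_correct, and it assumes that tokens expose txt and error_code as in the snippet above.

```python
from typing import Any, Callable, Iterable, List, Set


def flagged_words(tokens: Iterable[Any]) -> Set[str]:
    """Collect the text of every token that carries a non-empty error code."""
    return {t.txt for t in tokens if t.txt and getattr(t, "error_code", "")}


def only_ci_discrepancies(
    tokenize_fn: Callable[..., Iterable[Any]], text: str
) -> List[str]:
    """Return words flagged with only_ci=True but not with default options.

    tokenize_fn is assumed to behave like reynir_correct.tokenize and to
    accept the only_ci keyword argument.
    """
    default_flags = flagged_words(tokenize_fn(text))
    ci_flags = flagged_words(tokenize_fn(text, only_ci=True))
    return sorted(ci_flags - default_flags)
```

Calling only_ci_discrepancies(tokenize, "300 ára gömul írsk skúta fundin við Suður-Noreg") with tokenize imported from reynir_correct should list skúta if the behavior above reproduces, and an empty list would mean both modes agree.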

It's also not clear from the documentation what exactly the option only_ci does.

@vthorsteinsson
Member

only_ci is supposed to instruct the checker to look for context-independent errors only. It is more of an internal switch used to separately measure the performance of the spell checker on context-independent errors. We may well remove it in a future release, so I don't recommend relying on it. That said, the behavior you point out is of course incorrect.
