replaced_words is not correct #103

Closed
xcTorres opened this issue Nov 30, 2021 · 4 comments · Fixed by #104
Labels
bug Something isn't working

Comments

@xcTorres

address_str = "Perum GPS Griya Permata Sejahtera gang Guyub No 17 Ngumpak Dalem Dander Bojonegoro"

suggestions = sym_spell.lookup_compound(address_str, max_edit_distance=1, ignore_non_words=True, transfer_casing=False)

for sug in suggestions:
    print(sug)

# "perum GPS griya permata sejahtera gang muyub no 17 ngumpakdalem dander bojonegoro, 11, 0" 

We can see that "Ngumpak Dalem" is changed to "ngumpakdalem". But when I print replaced_words, I get the following:

for k, v in sym_spell.replaced_words.items():
    print(f"origin: {k}, modify: {v.term}, edit_distance: {v.distance}")

origin: guyub, modify: muyub, edit_distance: 1
origin: ngumpak, modify: n tumpak, edit_distance: 2

The entry "origin: ngumpak, modify: n tumpak, edit_distance: 2" does not seem to be as expected.

@mammothb mammothb added the bug Something isn't working label Nov 30, 2021
@mammothb
Owner

I believe this is because I missed updating replaced_words when a combination of 2 terms is the best match. I have pushed a fix to this branch. Could you please test and see if that fixes the problem for you?

I have tried it on my side with the following code:

import pkg_resources

from symspellpy import SymSpell

sym_spell = SymSpell(max_dictionary_edit_distance=2, prefix_length=7)
dictionary_path = pkg_resources.resource_filename(
    "symspellpy", "frequency_dictionary_en_82_765.txt"
)
bigram_path = pkg_resources.resource_filename(
    "symspellpy", "frequency_bigramdictionary_en_243_342.txt"
)
sym_spell.load_dictionary(dictionary_path, term_index=0, count_index=1)
sym_spell.load_bigram_dictionary(bigram_path, term_index=0, count_index=2)

input_term = (
    "whereis th elove GPS hehad dated forImuch of thepast who "
    "couqdn'tread in sixtgrade and 16 microstru cture him"
)
suggestions = sym_spell.lookup_compound(
    input_term, max_edit_distance=1, ignore_non_words=True
)
for suggestion in suggestions:
    print(suggestion)

for k, v in sym_spell.replaced_words.items():
    print(f"origin: {k}, modify: {v.term}, edit_distance: {v.distance}")

and managed to get the following output:

where is the love GPS he had dated for much of the past who couldn't read in six grade and 16 microstructure him, 9, 0
<omitted>
origin: microstru, modify: microstructure, edit_distance: 1

so the fix seems to address the issue.

@xcTorres
Author

xcTorres commented Dec 1, 2021

Thanks. It works. Could I ask one more question? Is there a way to get the start and end index of the original word?

@mammothb
Owner

mammothb commented Dec 1, 2021

Unfortunately, there's no way to do that in symspellpy right now; you'll have to implement some custom post-processing functions in your project for that.
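For illustration, here is a minimal sketch of what such post-processing might look like. It is not part of symspellpy: `find_original_spans` is a hypothetical helper, and it assumes the keys of `replaced_words` are the lowercased original tokens, as the output earlier in this thread suggests. It searches the original input string for each key, case-insensitively, and reports character offsets.

```python
import re


def find_original_spans(text, replaced_keys):
    """Return (key, start, end) for every whole-word occurrence of each key.

    Hypothetical helper, not a symspellpy API. `end` is exclusive, so
    text[start:end] is the original word as it was typed.
    """
    spans = []
    for key in replaced_keys:
        # Match the key only when it is not embedded in a longer word.
        pattern = re.compile(rf"(?<!\w){re.escape(key)}(?!\w)", re.IGNORECASE)
        for match in pattern.finditer(text):
            spans.append((key, match.start(), match.end()))
    # Order the spans by their position in the input string.
    return sorted(spans, key=lambda span: span[1])


address_str = (
    "Perum GPS Griya Permata Sejahtera gang Guyub No 17 "
    "Ngumpak Dalem Dander Bojonegoro"
)
# Assumed stand-in for sym_spell.replaced_words.keys() after lookup_compound.
replaced_keys = ["guyub", "ngumpak"]

spans = find_original_spans(address_str, replaced_keys)
print(spans)  # [('guyub', 39, 44), ('ngumpak', 51, 58)]
```

In a real project, `replaced_keys` would be `sym_spell.replaced_words.keys()`. Note that a word occurring more than once in the input yields one span per occurrence, so this sketch cannot tell which occurrence was actually corrected.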

@xcTorres
Author

xcTorres commented Dec 1, 2021

Thanks for your reply, and thanks for the package.

@mammothb mammothb closed this as completed Dec 1, 2021
2 participants