Spell-Correct: Probability of all corrected words are the same #90

Closed
ming-o-0 opened this issue Apr 14, 2018 · 3 comments

@ming-o-0

In pythainlp/pythainlp/spell/pn.py, you say the code is forked from http://norvig.com/spell-correct.html. As far as I understand, you use the same implementation as in the link.

In your code, you import "WORDS" from a dictionary, whereas the link above uses a corpus (big.txt). This makes the probabilities of all corrected words the same, because every word appears only once. The idea behind the code is to choose the most frequent word in the corpus.

The fix is to build "WORDS" from a large corpus instead.
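
To illustrate the point, here is a minimal sketch in the style of Norvig's `P(word)`. The helper name `make_prob`, the three-word dictionary, and the six-token corpus are made up for illustration; they are not PyThaiNLP code or data.

```python
from collections import Counter

def make_prob(words):
    """Return a Norvig-style P(word) function for a Counter of word counts."""
    total = sum(words.values())

    def prob(word):
        # Counter returns 0 for unseen words, so their probability is 0.0.
        return words[word] / total

    return prob

# Hypothetical data: a dictionary lists each word exactly once ...
dictionary_words = ["กิน", "ข้าว", "กัน"]
# ... while a corpus repeats words according to how often they are used.
corpus_tokens = ["กัน", "กิน", "กัน", "ข้าว", "กัน", "กิน"]

prob_dict = make_prob(Counter(dictionary_words))
prob_corpus = make_prob(Counter(corpus_tokens))

dict_probs = {w: prob_dict(w) for w in dictionary_words}
corpus_probs = {w: prob_corpus(w) for w in dictionary_words}
```

With the dictionary every known word gets the same value, so ranking candidates by probability cannot prefer one over another; with the corpus the values differ and the ranking becomes meaningful.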

@wannaphong wannaphong added the enhancement enhance functionalities label Apr 15, 2018
@bact
Member

bact commented Oct 25, 2018

Observation

Confirmed @MingPawat's observation. Below is the result from PyThaiNLP 1.7.0.1:

>>> from pythainlp.spell import pn
>>> pn.prob("กิน")
1.9348347651110595e-05
>>> pn.prob("ข้าว")
1.9348347651110595e-05
>>> pn.prob("กัน")
1.9348347651110595e-05
>>> pn.prob("กงง")
0.0
>>> pn.prob("ภาษาไท")
0.0

All words that are included in the dictionary have a probability of 1.9348347651110595e-05;
everything else is 0.0.
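
This is what Norvig's formula gives when every entry has a count of 1: P(word) = 1 / N, where N is the number of entries. Inverting the reported value suggests N is 51,684. That figure is inferred from the printed probability, not read from thaiword.txt itself.

```python
# Probability reported above for every in-dictionary word.
p = 1.9348347651110595e-05

# If each word has count 1, P(word) = 1 / N, so N = 1 / P(word).
implied_size = round(1 / p)
print(implied_size)
```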

I tried using word frequencies from the Thai National Corpus (TNC) instead
(from pythainlp.corpus import tnc; these are precomputed counts, not the actual corpus text),
by replacing

WORDS = Counter(thaiword.get_data())

with

WORDS = Counter(dict(tnc.get_word_frequency_all()))

Here's the result:

>>> pn.prob("กิน")
0.0006138412282452856
>>> pn.prob("ข้าว")
0.00026716049573969757
>>> pn.prob("กัน")
0.003979265980548341
>>> pn.prob("กงง")
0.0
>>> pn.prob("ภาษาไท")
0.0

Difference in spelling check end result

Original spell checker (using thaiword.txt):

>>> pythainlp.spell("เหลีนม")
['เหลิม', 'เหลียน', 'เหลือม', 'เหลน', 'เหลียน', 'เลียม', 'เหลียว', 'เหนียม', 'เหลี่ยม', 'เหลียน', 'เลียม', 'เหลียน', 'เหลน', 'เหลิม', 'เหลี่ยม', 'เหลียว', 'เหลือม', 'เหนียม', 'เหลียน', 'เหลียน', 'เหลี่ยม', 'เหลี่ยม', 'เหลิม', 'เหลิม']
>>> pythainlp.spell("เหลียม")
['เหลียว', 'เหนียม', 'เหลี่ยม', 'เหลียน', 'เลียม']

Modified spell checker (using TNC word frequency):

>>> pythainlp.spell("เหลีนม")
['เหลียม']
>>> pythainlp.spell("เหลียม")
['เหลียม']

This is mainly because thaiword.txt does not contain the word "เหลียม", but TNC does.
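
This is consistent with how Norvig's original `candidates` function works: a word already in the vocabulary is returned as-is, and edits are only tried otherwise. Below is a minimal sketch of that logic, assuming toy vocabularies with made-up counts and a reduced alphabet (all hypothetical). It omits Norvig's two-edit step for brevity and is not PyThaiNLP's actual code.

```python
from collections import Counter

LETTERS = "เหลียมวน่"  # reduced alphabet, just enough for this example

def edits1(word):
    """All strings one delete, transpose, replace, or insert away from word."""
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletes = [L + R[1:] for L, R in splits if R]
    transposes = [L + R[1] + R[0] + R[2:] for L, R in splits if len(R) > 1]
    replaces = [L + c + R[1:] for L, R in splits if R for c in LETTERS]
    inserts = [L + c + R for L, R in splits for c in LETTERS]
    return set(deletes + transposes + replaces + inserts)

def known(words, vocab):
    """Subset of words that appear in the vocabulary."""
    return {w for w in words if w in vocab}

def candidates(word, vocab):
    # A known word short-circuits the search and is its own only candidate.
    return known([word], vocab) or known(edits1(word), vocab) or {word}

# Hypothetical counts: one vocabulary has "เหลียม", the other does not.
with_word = Counter({"เหลียม": 5, "เหลี่ยม": 40, "เหลียว": 30})
without_word = Counter({"เหลี่ยม": 40, "เหลียว": 30})

found_directly = candidates("เหลียม", with_word)
found_by_edits = candidates("เหลียม", without_word)
typo_fixed = candidates("เหลีนม", with_word)
```

When the vocabulary lacks "เหลียม", the function falls back to one-edit neighbours such as "เหลี่ยม" and "เหลียว"; when it contains the word, only the word itself comes back.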

Other tests with TNC:

>>> pythainlp.spell("กกฎาคม")
['กรกฎาคม']
>>> pythainlp.spell("อนุญาติ")
['อนุญาต']
>>> pythainlp.spell("กิเลย")
['กิเลน', 'กิเลส']
>>> pythainlp.spell("สัตค์")
['สัตว์', 'สัตย์', 'สัตร์', 'สัตถ์']

From a quick human judgement (mine), the order of suggestions looks reasonable.

Possible problem with "real world" examples

The problem with using counts from a corpus (like TNC) is that if the corpus itself contains a misspelled word, the spell checker may suggest that misspelling. This needs to be looked into as well.
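
One possible mitigation, offered only as an assumption and not something proposed or tested in this thread, is to drop words below a minimum count before building "WORDS", on the guess that misspellings are rarer than correct forms. The helper `filter_rare`, the threshold, and the counts below are all hypothetical.

```python
from collections import Counter

def filter_rare(word_counts, min_freq):
    """Keep only words seen at least min_freq times."""
    return Counter({w: c for w, c in word_counts.items() if c >= min_freq})

# Hypothetical counts: the misspelling occurs in the corpus, but rarely.
raw_counts = Counter({"อนุญาต": 120, "อนุญาติ": 3, "กรกฎาคม": 80})

# The threshold of 5 is arbitrary and would need tuning on real data.
WORDS = filter_rare(raw_counts, min_freq=5)
```

A threshold like this also removes rare but correct words, so it trades coverage for precision.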

@bact
Member

bact commented Oct 25, 2018

@MingPawat I have opened pull request #137 to fix this based on your suggestion. If you have time, please review whether it works correctly. Thank you.

@bact
Member

bact commented Oct 29, 2018

Fixed with #137
