Spell-Correct: Probability of all corrected words are the same #90

ming-o-0 · 2018-04-14T12:07:20Z

In pythainlp/pythainlp/spell/pn.py, you say the code is forked from http://norvig.com/spell-correct.html. As far as I understand, it uses the same implementation as the one at that link.

However, your code imports "WORDS" from a dictionary, whereas the code at the link above builds it from a corpus (big.txt). As a result, every corrected word gets the same probability, because each word appears only once in a dictionary. The idea behind this code is to choose the most frequent word in the corpus.

To fix this, just change "WORDS" to use a large corpus.
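To illustrate the point, here is a minimal sketch with made-up data. The names make_prob, dictionary_words and corpus_tokens are hypothetical and are not part of PyThaiNLP. It only shows that counting a list of unique words gives every word the same probability, while counting tokens from running text does not.

```python
from collections import Counter

def make_prob(counts):
    """Return a function P(word) = count(word) / total count."""
    total = sum(counts.values())

    def prob(word):
        # Counter returns 0 for unseen words, so unknown words get 0.0
        return counts[word] / total

    return prob

# Hypothetical data: a dictionary lists each word once,
# while a (pre-tokenized) corpus repeats frequent words.
dictionary_words = ["กิน", "ข้าว", "กัน"]
corpus_tokens = ["กัน", "กิน", "กัน", "ข้าว", "กัน", "กิน"]

prob_dict = make_prob(Counter(dictionary_words))
prob_corpus = make_prob(Counter(corpus_tokens))

for w in ["กิน", "ข้าว", "กัน", "กงง"]:
    print(w, prob_dict(w), prob_corpus(w))
```

With the dictionary, all known words tie, so picking the candidate with the highest probability cannot rank them.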

bact · 2018-10-25T14:11:07Z

Observation

Confirmed @MingPawat's observation. Below is the result from PyThaiNLP 1.7.0.1:

>>> from pythainlp.spell import pn
>>> pn.prob("กิน")
1.9348347651110595e-05
>>> pn.prob("ข้าว")
1.9348347651110595e-05
>>> pn.prob("กัน")
1.9348347651110595e-05
>>> pn.prob("กงง")
0.0
>>> pn.prob("ภาษาไท")
0.0

All words that are included in the dictionary have a probability of 1.9348347651110595e-05;
everything else gets 0.0.
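As a rough sanity check, and assuming WORDS holds exactly one count per dictionary entry (so that the probability is 1/N), the constant above can be inverted to estimate how many entries that would imply. This is only arithmetic on the reported value, not a statement about the actual file.

```python
# Reported probability for every in-dictionary word (from the output above)
p = 1.9348347651110595e-05

# If each word is counted once, P(word) = 1 / N, so N is about 1 / p
implied_entries = round(1 / p)
print(implied_entries)
```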

I tried using word frequencies from the Thai National Corpus instead
(from pythainlp.corpus import tnc; these are already word counts, not the actual corpus text),
by replacing

WORDS = Counter(thaiword.get_data())

with

WORDS = Counter(dict(tnc.get_word_frequency_all()))

Here's the result:

>>> pn.prob("กิน")
0.0006138412282452856
>>> pn.prob("ข้าว")
0.00026716049573969757
>>> pn.prob("กัน")
0.003979265980548341
>>> pn.prob("กงง")
0.0
>>> pn.prob("ภาษาไท")
0.0
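The replacement can be sketched without PyThaiNLP, assuming the frequency source yields (word, count) pairs, as the dict(...) call above suggests. The pairs and counts below are invented for illustration. The sketch also shows why the dict(...) wrapper matters: passing the pairs straight to Counter would count the tuples themselves.

```python
from collections import Counter

# Hypothetical (word, count) pairs standing in for corpus frequencies
word_freqs = [("กัน", 50), ("กิน", 8), ("ข้าว", 3)]

# dict(...) turns the pairs into a word -> count mapping, which Counter keeps as-is
WORDS = Counter(dict(word_freqs))
total = sum(WORDS.values())

def prob(word):
    """Relative frequency of a word; 0.0 for words not in WORDS."""
    return WORDS[word] / total

# Pitfall: without dict(...), each (word, count) tuple is counted once
pair_counts = Counter(word_freqs)

print(prob("กัน"), prob("กิน"), prob("ข้าว"), prob("กงง"))
```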

Difference in spelling check end result

Original spell checker (using thaiword.txt):

>>> pythainlp.spell("เหลีนม")
['เหลิม', 'เหลียน', 'เหลือม', 'เหลน', 'เหลียน', 'เลียม', 'เหลียว', 'เหนียม', 'เหลี่ยม', 'เหลียน', 'เลียม', 'เหลียน', 'เหลน', 'เหลิม', 'เหลี่ยม', 'เหลียว', 'เหลือม', 'เหนียม', 'เหลียน', 'เหลียน', 'เหลี่ยม', 'เหลี่ยม', 'เหลิม', 'เหลิม']
>>> pythainlp.spell("เหลียม")
['เหลียว', 'เหนียม', 'เหลี่ยม', 'เหลียน', 'เลียม']

Modified spell checker (using TNC word frequency):

>>> pythainlp.spell("เหลีนม")
['เหลียม']
>>> pythainlp.spell("เหลียม")
['เหลียม']

This is mainly because thaiword.txt does not contain the word "เหลียม", but TNC does.
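The reason a known word comes back unchanged is the candidate order in Norvig's approach: the word itself is preferred if it is in the vocabulary, and edits are only considered otherwise. Below is a simplified sketch of that logic (one-edit candidates only). The alphabet is derived from the vocabulary, which is an assumption of this sketch and not necessarily what pn.py does.

```python
def edits1(word, alphabet):
    """All strings one edit (delete, transpose, replace, insert) away from word."""
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletes = [L + R[1:] for L, R in splits if R]
    transposes = [L + R[1] + R[0] + R[2:] for L, R in splits if len(R) > 1]
    replaces = [L + c + R[1:] for L, R in splits if R for c in alphabet]
    inserts = [L + c + R for L, R in splits for c in alphabet]
    return set(deletes + transposes + replaces + inserts)

def candidates(word, vocab):
    """Prefer the word itself if known, then known words one edit away."""
    alphabet = {ch for w in vocab for ch in w}  # assumption: letters seen in vocab

    def known(words):
        return {w for w in words if w in vocab}

    return known([word]) or known(edits1(word, alphabet)) or {word}

# Vocabulary without the word: only neighbours are returned
vocab_small = {"เหลียว", "เหนียม", "เหลี่ยม", "เหลียน", "เลียม"}
print(candidates("เหลียม", vocab_small))

# Vocabulary with the word: the word itself wins
print(candidates("เหลียม", vocab_small | {"เหลียม"}))
```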

Other tests with TNC:

>>> pythainlp.spell("กกฎาคม")
['กรกฎาคม']
>>> pythainlp.spell("อนุญาติ")
['อนุญาต']
>>> pythainlp.spell("กิเลย")
['กิเลน', 'กิเลส']
>>> pythainlp.spell("สัตค์")
['สัตว์', 'สัตย์', 'สัตร์', 'สัตถ์']

From a quick human (my own) judgement, the order of the suggestions looks reasonable.

Possible problem with "real world" examples

The problem with using text from a corpus (like TNC) is that, if the corpus contains a misspelled word, the spell checker may suggest that misspelled word. This needs to be investigated as well.
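One possible mitigation, not something proposed in this thread, is to drop words whose corpus count falls below a threshold, on the assumption that typos are rare. The function name drop_rare_words, the min_freq parameter and the counts below are all hypothetical. Note that a common misspelling may still be frequent enough to survive such a filter.

```python
from collections import Counter

def drop_rare_words(counts, min_freq=2):
    """Keep only words seen at least min_freq times."""
    return Counter({w: c for w, c in counts.items() if c >= min_freq})

# Invented counts: the misspelling "อนุญาติ" appears once in this toy corpus
counts = Counter({"อนุญาต": 120, "อนุญาติ": 1, "กรกฎาคม": 45})
filtered = drop_rare_words(counts, min_freq=2)

print(sorted(filtered))
```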

bact · 2018-10-25T20:22:30Z

@MingPawat I have opened pull request #137 to fix this based on your suggestion. If you have time, please review whether it works correctly. Thank you.

bact · 2018-10-29T08:26:42Z

Fixed with #137

wannaphong added the enhancement (enhance functionalities) label Apr 15, 2018

bact mentioned this issue Oct 25, 2018

Update Peter Norvig's spell checker to suggest words based on probability #137

Merged

bact closed this as completed Oct 29, 2018

wannaphong mentioned this issue Nov 3, 2018

List PyThaiNLP 2.0 #118

Closed
