Update Peter Norvig's spell checker to suggest words based on probability #137

bact · 2018-10-25T14:16:53Z

Enhancement based on the suggestion in issue "Spell-Correct: Probability of all corrected words are the same" #90.
Uses word frequencies from the Thai National Corpus (with some filtering).

Pulls (word, frequency) pairs from the Thai National Corpus (TNC) to build the word list used for computing probabilities.

"Misspelled" words in TNC and how to handle them

Because TNC is very large, it contains words that may not be in common use or that do not follow dictionary spelling.

One possible approach (1) is to compare the TNC word list against a list of words we are confident are spelled correctly, and keep only the words that appear in both lists. This reduces the number of words available, though (precision would likely go up, but recall would go down).
Another approach (2) is to use the Thai Textbook Corpus (TTC) instead, whose vocabulary is more formal.
The approach currently used (3) is to take TNC as extracted and filter out some words, keeping only those likely to be useful for spelling correction:
- Frequency less than or equal to 1: dropped
- Longer than 40 characters: dropped
- Contains digits: dropped
- Contains characters other than Thai characters and the dot (.): dropped
- Starts with a dot: dropped
With approach (3), a rough visual check shows that words with frequency 2 include both misspelled and correctly spelled words, but overall quality seems acceptable. The expectation is that, among candidates at the same edit distance, the correctly spelled word will have the higher frequency, so this should not cause much trouble.
- However, if the input has an exact match in the corpus, spell() returns that single value immediately, regardless of edit distance. An example is "กัต" (appears 115 times in TNC).

Sample words and frequencies found in TNC:
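
The filtering rules of approach (3) can be sketched roughly as below. This is an illustration only, not the code merged in this PR: the function name `keep`, its parameters, and the use of the Unicode Thai block (U+0E00 to U+0E7F) as the definition of "Thai characters" are all assumptions.

```python
import re

# Assumption: "Thai characters" = Unicode Thai block U+0E00..U+0E7F.
# The PR's actual character list may differ.
_THAI_OR_DOT = re.compile(r"[\u0e00-\u0e7f.]+")


def keep(word, freq, min_freq=2, max_len=40):
    """Return True if a (word, freq) pair passes all five filter rules."""
    if freq < min_freq:                      # frequency <= 1
        return False
    if len(word) > max_len:                  # longer than 40 characters
        return False
    if any(ch.isdigit() for ch in word):     # Arabic or Thai digits
        return False
    if not _THAI_OR_DOT.fullmatch(word):     # anything besides Thai and "."
        return False
    if word.startswith("."):                 # leading dot
        return False
    return True


# A few entries copied from the TNC sample listing in this description.
sample = [("๒๕๔๘", 452), ("ยุกติธรรม", 2), ("Directors", 2),
          ("66-70-73", 2), ("ปากขม", 2)]
kept = [word for word, freq in sample if keep(word, freq)]
print(kept)
```

Note that `str.isdigit()` is true for Thai digits such as ๒ as well as for 0-9, so a single check covers both.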

๒๕๔๘ 452
๑๕๐ 441
๒๕๔๗ 372
_______________________________ 48
วิทยาศาสตร์เทคโนโลยีและสิ่งแวดล้อม 10
งงงงงงงงงงงงงงงงงงงงงง 2
ม่อต้อ 2
ตลกหัวเราะ 2
เทรอซ์ 2
ยุกติธรรม 2
โหรสพ 2
ยยยยยยยยยยยยยยย 2
ปากขม 2
เจริญสมณธรรม 2
ลองเพลย์ 2
เตตียม 2
นั่งทางใน 2
ลิคเตนสไตน์ 2
มวยหมู่ 2
ข้าวนก 2
กะบ่อนกะแบ่น 2
เสิรฟ์ 2
66-70-73 2
Directors 2
.สำหรับชาวไทยมุสลิมในสี่จังหวัดภาคใต้ซึ่งยิ่งด้อยการศึกษาเป็นส่วนมากไม่ใช่เป็นเรื่องง่าย 1
.ได้ชี้แจงกับกลุ่มการเมืองว่าช่วงเวลา 1
.ต.ท.ทักษิณและให้การต้อนรับรัฐบาลของตนถึงสองครั้งสองคราทำไมการเดินทางของ 1


wannaphong · 2018-10-26T14:15:14Z

pythainlp/spell/pn.py

+_WORDS_TOTAL = sum(_WORDS.values())
+
+
+def _prob(word, n=_WORDS_TOTAL):


Will this function be kept?

bact · 2018-10-27

_prob() is needed as the sort key in correction(), so it has to stay.

wannaphong · 2018-10-28T15:04:56Z

Can I go ahead and merge? @bact

bact · 2018-10-29T01:02:27Z

@wannaphongcom If the review looks OK, go ahead and merge.

I added a NorvigSpellChecker class so that users can supply their own dictionary.
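
The idea of a checker object built from a user-supplied dictionary might look roughly like the sketch below. The real NorvigSpellChecker signature is not shown in this thread, so the class name `ToySpellChecker`, its constructor arguments, and the `(word, freq)` signature of `_no_filter()` are assumptions made for illustration.

```python
from collections import Counter


def _no_filter(word, freq):
    """Accept every entry. The name appears in the commits; the signature is assumed."""
    return True


class ToySpellChecker:
    """Hypothetical stand-in, not the actual NorvigSpellChecker class."""

    def __init__(self, custom_dict, dict_filter=None):
        # custom_dict: iterable of (word, frequency) pairs.
        if dict_filter is None:
            dict_filter = _no_filter
        self._words = Counter(
            {word: freq for word, freq in custom_dict if dict_filter(word, freq)}
        )
        self._total = sum(self._words.values())

    def prob(self, word):
        """Relative frequency of `word` within this checker's own dictionary."""
        return self._words[word] / self._total if self._total else 0.0

    def known(self, words):
        """Subset of `words` present in the dictionary."""
        return [w for w in words if w in self._words]


checker = ToySpellChecker([("แมว", 3), ("หมา", 1)])
print(checker.prob("แมว"))
```

Keeping the word list and its total inside the object, rather than in module-level globals, is what lets each checker carry its own dictionary.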


bact · 2018-10-29T05:11:46Z

Thank you :)

bact added 5 commits October 24, 2018 14:06

Merge pull request #6 from PyThaiNLP/dev

4b108ec

Update documents

Merge pull request #7 from PyThaiNLP/dev

ba6ea25

Update from PyThaiNLP origin

Update Peter Norvig's spell checker to be able to suggest words based…

16d9fcd

… on probability (as suggested in issue #90 ) - use word frequencies from Thai National Corpus

remove import future

5c957c5

minor sort of imports

6f40f7e

wannaphong approved these changes Oct 25, 2018


bact added 2 commits October 25, 2018 23:27

More docstring for Peter Norvig's spell checker

ae6e251

should return list not set

2433a69

wannaphong added this to the 1.8 milestone Oct 25, 2018

bact added 4 commits October 26, 2018 01:27

Merge pull request #8 from PyThaiNLP/dev

55278bd

Update from PyThaiNLP/pythainlp

Filter out non-Thai words and low frequency words from word frequency…

5721e75

… list for spell checker

make Thai characters list a constant outside function _edits1()

83c5187

Adjust word frequency filter

08278b1

bact mentioned this pull request Oct 25, 2018

Spell-Correct: Probability of all corrected words are the same #90

Closed

bact added 2 commits October 26, 2018 11:26

Trying to reduce cognitive complex in functions, as suggested by Code…

75ab30d

… Climate

Stick with the previous _keep() code, less cognitive complexity

a476471

wannaphong reviewed Oct 26, 2018


bact added 3 commits October 27, 2018 15:04

check empty string case in correction()

32cc4fe

Sorted spelling candidates by probability of word occurrence

794ae9b

_edits2() should return a set, to remove duplicated candidates

4c8ada5

bact mentioned this pull request Oct 27, 2018

Would like to add words to pythainlp.spell (อยากเพิ่มคำใน pythainlp.spell ครับ) #119

Closed

bact added 3 commits October 29, 2018 11:27

Add ability to use custom dictionary, by creating a spell checker obj…

6bba431

…ect, based on NorvigSpellChecker class

Add None option for dict_filter, using _no_filter() function.

5e94b14

Update dict_filter condition

0f315b9

wannaphong approved these changes Oct 29, 2018


wannaphong merged commit c148699 into PyThaiNLP:dev Oct 29, 2018

wannaphong mentioned this pull request Nov 3, 2018

List PyThaiNLP 2.0 #118

Closed
