Improve: [newmm tokenizer] Change regular expression of "non-thai-characters" #856

konbraphat51 · 2023-11-01T09:03:44Z

What does this change

Makes newmm tokenization more accurate by recognizing more characters as non-Thai.

What was wrong

#855
Non-Thai symbols were sometimes not recognized as non-Thai:
"(คนไม่เอา)" -> ['(คน', 'ไม่', 'เอา', ')']
"กม/ชม" -> ['กม', '/ชม']
"สีหน้า(รถ)" -> ['สีหน้า', '(รถ)']

How this fixes it

Fixed how "non-Thai characters" are recognized.
The examples above are all improved.

Your checklist for this pull request

Passed code styles and structures
Passed code linting checks and unit test

Before: non-Thai characters were enumerated directly by rules.
After: they are simply defined as "anything except Thai characters".
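
To illustrate the before/after difference described above, here is a minimal sketch using Python's re module. The ENUMERATED pattern and the leading_non_thai helper are hypothetical names made up for this example and are not taken from newmm.py; only the idea of matching anything outside the Thai block (U+0E00 to U+0E7F) comes from this PR.

```python
import re

# Hypothetical "before" rule: list the non-Thai characters explicitly.
# (Illustrative only; this is NOT the pattern that was in newmm.py.)
ENUMERATED = re.compile(r"[a-zA-Z0-9 ]+")

# "After" idea from this PR: anything outside the Thai Unicode block.
NON_THAI = re.compile(r"[^\u0E00-\u0E7F]+")


def leading_non_thai(pattern, text):
    """Return the non-Thai prefix that `pattern` recognizes at the start of `text`."""
    match = pattern.match(text)
    return match.group(0) if match else ""


for sample in ["(คนไม่เอา)", "/ชม", "abc คน"]:
    print(
        repr(sample),
        repr(leading_non_thai(ENUMERATED, sample)),
        repr(leading_non_thai(NON_THAI, sample)),
    )
# '(คนไม่เอา)' '' '('
# '/ชม' '' '/'
# 'abc คน' 'abc ' 'abc '
```

An enumerated rule misses any symbol nobody thought to list, such as "(" or "/" here, while the negated character class catches them by construction.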

konbraphat51 · 2023-11-01T09:37:00Z

['👍👍👍 # ana', 'มาก', 'xxrep', 'xxwrep', 'น้อย', '.1146'] != ['xxwrep', '👍', '#', 'ana', 'มาก', 'xxrep', 'xxwrep', 'น้อย', '.', '1146']

It seems that this change makes the tokenization finer-grained than the test case expects.
What do the maintainers here think about this?

wannaphong · 2023-11-01T09:38:25Z

Can you update the pull request? 9df5a4a

konbraphat51 · 2023-11-01T09:39:27Z

Greetings PR check

Resource not accessible by integration

I don't understand this. Is this error common in this project?

Update thai2fit tokenizer

konbraphat51 · 2023-11-01T09:42:59Z

Updated

konbraphat51 · 2023-11-01T09:57:28Z

It seems that 9df5a4a is causing a unit-test error.

Can it be ignored?

wannaphong · 2023-11-01T10:00:46Z

Yes, it can be ignored. It's an issue with the self-hosted setup, but I don't have time to set it up again. The unit tests run by GitHub look good: https://github.com/PyThaiNLP/pythainlp/actions/runs/6718024461/job/18256943528

coveralls · 2023-11-01T10:06:17Z

coverage: 0.0% (-87.0%) from 86.983% when pulling a5b6db4 on konbraphat51:dev into 9df5a4a on PyThaiNLP:dev.

wannaphong · 2023-11-01T10:21:16Z

I added some rules to fix the error: konbraphat51#2

Fixed regex

To make further maintenance easier

pep8speaks · 2023-11-01T11:50:31Z

Hello @konbraphat51! Thanks for updating this PR. We checked the lines you've touched for PEP 8 issues, and found:

In the file pythainlp/tokenize/newmm.py:

Line 42:1: E122 continuation line missing indentation or outdented

Comment last updated at 2023-11-01 13:08:42 UTC

konbraphat51 · 2023-11-01T12:04:20Z

I merged and modified @wannaphong's PR. Please check.

wannaphong · 2023-11-01T12:40:11Z

OK. It looks like ['เวลา', ' ', '12:12pm ', 'มี', 'โปรโมชั่น', ' ', '11.11'] is wrong; ['เวลา', ' ', '12:12pm', ' ', 'มี', 'โปรโมชั่น', ' ', '11.11'] is correct.

wannaphong · 2023-11-01T12:49:45Z

I fixed it.

konbraphat51 · 2023-11-01T12:57:30Z

In my case, fb3e7bb showed
['เวลา', ' ', '12', ':12pm ', 'มี', 'โปรโมชั่น', ' ', '11.11']

The last [^\u0E00-\u0E7F]+ scoops up all the non-Thai characters until it hits a Thai character.
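
The behaviour described above can be reproduced in isolation. This is only a sketch of the catch-all alternative on its own, not the full newmm pattern; NON_THAI_RUN is a name chosen for this example.

```python
import re

# The catch-all alternative quoted above, tested on its own.
NON_THAI_RUN = re.compile(r"[^\u0E00-\u0E7F]+")

text = "เวลา 12:12pm มีโปรโมชั่น 11.11"

# Each run extends greedily until the next Thai character (or end of string),
# so surrounding spaces are absorbed into the same match.
runs = NON_THAI_RUN.findall(text)
print(runs)
# [' 12:12pm ', ' 11.11']

# If an earlier rule has already consumed "12", the catch-all takes the rest,
# including the trailing space, which matches the ':12pm ' token reported above.
rest = NON_THAI_RUN.match(":12pm มี").group(0)
print(repr(rest))
# ':12pm '
```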

wannaphong · 2023-11-01T13:00:29Z

3d889f7 showed

>>> from pythainlp.tokenize import word_tokenize
>>> word_tokenize("เวลา 12:12pm มีโปรโมชั่น 11.11")
['เวลา', ' ', '12:12pm', ' ', 'มี', 'โปรโมชั่น', ' ', '11.11']
>>> word_tokenize("กม/ชม")
['กม', '/', 'ชม']
>>> word_tokenize("(คนไม่เอา)")
['(', 'คน', 'ไม่', 'เอา', ')']
>>> word_tokenize("สีหน้า(รถ)")
['สีหน้า', '(', 'รถ', ')']
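
One way to get this kind of split is to order the regex alternatives so that specific rules (numbers, Latin words, whitespace) are tried before a single-character catch-all. The following is a rough sketch under that assumption; TOKEN and rough_split are hypothetical, the pattern is not the one in newmm.py, and Thai runs are left unsegmented because no dictionary is used here.

```python
import re

# Hypothetical ordered alternation; NOT the actual newmm pattern.
TOKEN = re.compile(
    r"""
    \d+(?:[:.,]\d+)*[a-zA-Z]*   # numbers with separators and optional suffix: 12:12pm, 11.11
    |[a-zA-Z]+                  # Latin words
    |[ \t\r\n]+                 # whitespace run, kept as its own token
    |[\u0E00-\u0E7F]+           # Thai run (a real tokenizer segments this with a dictionary)
    |[^\u0E00-\u0E7F]           # any other single non-Thai character, e.g. "(" or "/"
    """,
    re.VERBOSE,
)


def rough_split(text):
    """Split text into coarse tokens using the ordered alternatives above."""
    return TOKEN.findall(text)


print(rough_split("กม/ชม"))
# ['กม', '/', 'ชม']
print(rough_split("สีหน้า(รถ)"))
# ['สีหน้า', '(', 'รถ', ')']
print(rough_split("เวลา 12:12pm มีโปรโมชั่น 11.11"))
# ['เวลา', ' ', '12:12pm', ' ', 'มีโปรโมชั่น', ' ', '11.11']
```

Because the catch-all matches only one character and comes last, it cannot swallow digits or whitespace that an earlier alternative should claim, so number grouping such as 12:12pm and 11.11 stays intact.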

konbraphat51 · 2023-11-01T13:03:31Z

Oh sorry, I was testing with the segment() function and didn't notice there was a post-processing step.
3d889f7 seems good to me :)

Intention for ` \t\r\n`

sonarcloud · 2023-11-01T13:09:05Z

Kudos, SonarCloud Quality Gate passed!

0 Bugs
0 Vulnerabilities
0 Security Hotspots
0 Code Smells

No Coverage information
No Duplication information

konbraphat51 · 2023-11-01T13:09:47Z

Added comments to make further maintenance easier.

wannaphong

Awesome 💯

bact

Looks fine. Doesn't break the number grouping.

Improve: Change regular expression of "non-thai-characters"

a8dfcfc

Before: non-Thai characters were enumerated directly by rules.
After: they are simply defined as "anything except Thai characters".

wannaphong linked an issue Nov 1, 2023 that may be closed by this pull request

Question: newmm tokenizer, why not just thai characters? #855

Closed

bact added the bug bugs in the library label Nov 1, 2023

bact added this to the 4.1 milestone Nov 1, 2023

bact added this to To do in PyThaiNLP Nov 1, 2023

Merge pull request #1 from PyThaiNLP/dev

a5b6db4

Update thai2fit tokenizer

Update newmm.py

801518c

wannaphong and others added 5 commits November 1, 2023 17:26

Update newmm.py

24f8a4a

Merge pull request #2 from wannaphong/dev

29d0929

Fixed regex

Fix: exclude Thai characters

ceac763

Improve: comment regex intention

d94a225

To make further maintenance easier

Refac: Make regex easier to read

f5fa497

Fix: fix to PEP 8 style

30d60f6

Refac: unify position of |

fb3e7bb

wannaphong added 2 commits November 1, 2023 19:52

Update other non-Thai characters in newmm

02e9cb5

Update newmm.py

3d889f7

Refac: comment for addition

2e2f0cf

Intention for ` \t\r\n`

wannaphong approved these changes Nov 2, 2023

wannaphong requested a review from bact November 3, 2023 17:21

bact approved these changes Nov 4, 2023

bact moved this from To do to Done in PyThaiNLP Nov 4, 2023

bact moved this from Done to In progress in PyThaiNLP Nov 4, 2023

wannaphong merged commit de28159 into PyThaiNLP:dev Nov 5, 2023
12 of 14 checks passed

PyThaiNLP automation moved this from In progress to Done Nov 5, 2023

bact mentioned this pull request Nov 13, 2023

PyThaiNLP 5.0 Change Log #788

Closed

wannaphong mentioned this pull request Nov 26, 2023

Update match non-Thai tokens PyThaiNLP/nlpo3#63

Open
