Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add Thai word list from ICU BreakIterator dictionary #879

Merged
merged 17 commits into from
Dec 6, 2023

Conversation

pavaris-pm
Copy link
Contributor

@pavaris-pm pavaris-pm commented Dec 5, 2023

What does this changes

@wannaphong @bact from issue #877 since ICU are included to almost all web browser, i've added ICU dictionary to PyThaiNLP where file of ICU dictionary are named as icubrk_th.txt and their python file to load the corpus are named as thai_icu.py krub.

Will resolve #877

Your checklist for this pull request

🚨Please review the guidelines for contributing to this repository.

  • Passed code styles and structures
  • Passed code linting checks and unit test

@pep8speaks
Copy link

pep8speaks commented Dec 5, 2023

Hello @pavaris-pm! Thanks for updating this PR. We checked the lines you've touched for PEP 8 issues, and found:

There are currently no PEP 8 issues detected in this Pull Request. Cheers! 🍻

Comment last updated at 2023-12-06 10:17:06 UTC

@wannaphong
Copy link
Member

Hello! Thank you for your pull request. Can you add filter the word that start with #?

SPDX-License-Identifier: Unicode-DFS-2016
@pavaris-pm
Copy link
Contributor Author

pavaris-pm commented Dec 5, 2023

Hello! Thank you for your pull request. Can you add filter the word that start with #?

Sure. Did you mean add a parameters for user to control whether to return a corpus with the text starts with # or not right? by True if you want a returned corpus including words starts with #, and returned the corpus with filtered out word starts with # (no word start with # in corpus) otherwise.

@wannaphong
Copy link
Member

Hello! Thank you for your pull request. Can you add filter the word that start with #?

Sure. Did you mean add a parameters for user to control whether to return a corpus with the text starts with # or not right? by True if you want a returned corpus including words starts with #, and returned the corpus with filtered out word starts with # (no word start with # in corpus) otherwise.

Yes 👍

@bact bact added enhancement enhance functionalities corpus corpus/dataset-related issues labels Dec 5, 2023
@bact bact added this to In progress in PyThaiNLP Dec 5, 2023
Copy link
Member

@bact bact left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Once license info has moved to corpus_license.md AND the comment lines are properly discarded, I can merge this.

pythainlp/corpus/thai_icu.py Outdated Show resolved Hide resolved
@bact bact added this to the 5.0 milestone Dec 5, 2023
@bact bact changed the title add Thai ICU Dict into PyThaiNLP corpus Add Thai ICU wordbreak dictionary to PyThaiNLP corpus Dec 5, 2023
@bact
Copy link
Member

bact commented Dec 5, 2023

Hello! Thank you for your pull request. Can you add filter the word that start with #?

Sure. Did you mean add a parameters for user to control whether to return a corpus with the text starts with # or not right? by True if you want a returned corpus including words starts with #, and returned the corpus with filtered out word starts with # (no word start with # in corpus) otherwise.

I think we can do this in get_corpus().

Maybe add the boolean parameter discard_comments to get_corpus()?
The default is probably False.

Or, we can utilize the existing Python standard library shlex for this. shlex will ignore comment lines when it gets its input.

https://docs.python.org/3/library/shlex.html

@pavaris-pm
Copy link
Contributor Author

pavaris-pm commented Dec 5, 2023

Hello! Thank you for your pull request. Can you add filter the word that start with #?

Sure. Did you mean add a parameters for user to control whether to return a corpus with the text starts with # or not right? by True if you want a returned corpus including words starts with #, and returned the corpus with filtered out word starts with # (no word start with # in corpus) otherwise.

I think we can do this in get_corpus().

Maybe add the boolean parameter discard_comments to get_corpus()? The default is probably False.

Or, we can utilize the existing Python standard library shlex for this. shlex will ignore comment lines when it gets its input.

https://docs.python.org/3/library/shlex.html

@bact @wannaphong i already add comment filtering by adding a new parameters named discard_comments where the default value is set to be False. You can review the code from the latest commit krub

Copy link
Contributor Author

@pavaris-pm pavaris-pm left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

thanks for help me sorting it alphabetically krub 👍🏻

@pavaris-pm
Copy link
Contributor Author

Hello! Thank you for your pull request. Can you add filter the word that start with #?

Sure. Did you mean add a parameters for user to control whether to return a corpus with the text starts with # or not right? by True if you want a returned corpus including words starts with #, and returned the corpus with filtered out word starts with # (no word start with # in corpus) otherwise.

I think we can do this in get_corpus().

Maybe add the boolean parameter discard_comments to get_corpus()? The default is probably False.

Or, we can utilize the existing Python standard library shlex for this. shlex will ignore comment lines when it gets its input.

https://docs.python.org/3/library/shlex.html

@bact @wannaphong I've made some experiment to test the discard_comments parameters and fix some bugs from it. Now it works perfectly. feel free to review from now on krub. It's done 💯

@pavaris-pm pavaris-pm requested a review from bact December 5, 2023 15:02
Copy link
Member

@wannaphong wannaphong left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It look great for me.

Also rename `thai_icu()` to `thai_icu_words()` to make it more explicit and consistent with others, like: `thai_orst_words()`
Change thai_icu to thai_icu_words
Copy link

sonarcloud bot commented Dec 6, 2023

Kudos, SonarCloud Quality Gate passed!    Quality Gate passed

Bug A 0 Bugs
Vulnerability A 0 Vulnerabilities
Security Hotspot A 0 Security Hotspots
Code Smell A 0 Code Smells

No Coverage information No Coverage information
0.0% 0.0% Duplication

def get_corpus(filename: str, as_is: bool = False) -> Union[frozenset, list]:
def get_corpus(filename: str,
as_is: bool = False,
comments: bool = True
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have changed this to comments instead of discard_comments (as I suggested earlier) to avoid double negation.

The semantic now is:

  • if comments = True, then keep comments
  • if comments = False, then discard comments

"""
path = path_pythainlp_corpus(filename)
lines = []
with open(path, "r", encoding="utf-8-sig") as fh:
lines = fh.read().splitlines()

if not comments:
# take only text before character '#'
lines = [line.split("#", 1)[0] for line in lines]
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This will allowed the comment to be at any position of the line.

Copy link
Member

@bact bact left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Approved.

Few modification to get_corpus() to make the code more generic.

I have changed the module/function name to

  • corpus.icu instead of corpus.thai_icu - to make the module name more generic
  • thai_icu_words instead of thai_icu - to make the function name inline with thai_words and thai_orst_words

So when import, it will be like:

from pythainlp.corpus.icu import thai_icu_words

Note: I will also after this rename the wikipedia (#869) and volubilis (#870) corpora as well, to make them more consistent:

So instead of having:

from pythainlp.corpus.volubilis import volubilis
from pythainlp.corpus.wikipedia_titles import wikipedia_titles

we should have:

from pythainlp.corpus.volubilis import thai_volubilis_words
from pythainlp.corpus.wikipedia import thai_wikipedia_titles

@bact bact merged commit 297aadc into PyThaiNLP:dev Dec 6, 2023
10 of 14 checks passed
PyThaiNLP automation moved this from In progress to Done Dec 6, 2023
@bact
Copy link
Member

bact commented Dec 6, 2023

Merged thank you.

@bact bact changed the title Add Thai ICU wordbreak dictionary to PyThaiNLP corpus Add Thai word list from ICU BreakIterator dictionary Dec 15, 2023
@bact bact mentioned this pull request Dec 15, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
corpus corpus/dataset-related issues enhancement enhance functionalities
Projects
PyThaiNLP
  
Done
Development

Successfully merging this pull request may close these issues.

Add ICU wordbreak dictionary (Thai)
4 participants