Add Thai word list from Thai Wikipedia titles #869

Merged
merged 20 commits into from
Dec 1, 2023


konbraphat51
Contributor

@konbraphat51 konbraphat51 commented Nov 30, 2023

What does this change

Add an optional corpus of Wikipedia titles.

  • thai_wikipedia_titles() function
  • license description
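
For illustration, here is a minimal sketch of how a title-list loader of this kind might work: it reads a newline-delimited word list into an immutable set. The name `load_title_list` and the assumption that the file holds one title per line are hypothetical; this is not the code added by this pull request.

```python
from pathlib import Path
from typing import FrozenSet


def load_title_list(path: Path) -> FrozenSet[str]:
    """Read a newline-delimited word list into an immutable set.

    Hypothetical helper: assumes one title per line, UTF-8 encoded.
    """
    # "utf-8-sig" transparently drops a byte-order mark if one is present.
    with path.open(encoding="utf-8-sig") as fh:
        # Strip surrounding whitespace and skip blank lines;
        # duplicates collapse because the result is a set.
        return frozenset(line.strip() for line in fh if line.strip())
```

Calling `load_title_list(Path("wikipedia_titles.txt"))` would then allow fast membership checks such as `"ประเทศไทย" in titles`. A frozenset is chosen here so callers cannot accidentally mutate the shared word list.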

Fixes #858

Your checklist for this pull request

🚨Please review the guidelines for contributing to this repository.

  • Passed code styles and structures
  • Passed code linting checks and unit test

@konbraphat51 konbraphat51 changed the title Corpus wiki Add wikipedia titles corpus Nov 30, 2023
@coveralls

coveralls commented Nov 30, 2023

Coverage Status

coverage: 86.133% (+0.05%) from 86.085%
when pulling 0b52e14 on konbraphat51:corpus_wiki
into de098f3 on PyThaiNLP:dev.

@konbraphat51 konbraphat51 marked this pull request as ready for review November 30, 2023 04:10
@konbraphat51
Contributor Author

Review please

@bact
Member

bact commented Nov 30, 2023

Looks good to me.

@bact bact requested review from wannaphong and bact November 30, 2023 15:32
@bact bact added the corpus (corpus/dataset-related issues) and enhancement (enhance functionalities) labels Nov 30, 2023
Member

@bact bact left a comment

Code, docs, license, test: all look good.

Once the order of the module imports is sorted out, I can approve this.

tests/test_corpus.py (outdated, resolved)
@bact bact added this to In progress in PyThaiNLP Nov 30, 2023
## Wikipedia Titles
The corpus of Wikipedia titles (wikipedia_titles.txt) was processed by konbraphat51 (https://github.com/konbraphat51/Thai_Dictionary_Cleaner/tree/main).

The original data is thwiki-latest-all-titles.gz from https://dumps.wikimedia.org/thwiki/latest/, which Wikipedia.org has created.
Member

Would you mind adding the date of the Wikipedia data here, please?
That is, the date on which you downloaded the data for your preparation.

Contributor Author

Yeah that's definitely required. Added!

@konbraphat51
Contributor Author

Okay, please review again.

Member

@wannaphong wannaphong left a comment

It looks great!

Member

@bact bact left a comment

Thank you.

@konbraphat51
Contributor Author

OK, I have resolved the conflict!

For standard licenses, like Creative Commons, just link to the license URL. No need to put the license text inside this file.
- sort imports
- clean up test structure

sonarcloud bot commented Dec 1, 2023

Kudos, SonarCloud Quality Gate passed!

  • Bugs: 0 (rating A)
  • Vulnerabilities: 0 (rating A)
  • Security Hotspots: 0 (rating A)
  • Code Smells: 1 (rating A)
  • Coverage: no coverage information
  • Duplication: 0.0%

@bact bact merged commit f877567 into PyThaiNLP:dev Dec 1, 2023
9 of 14 checks passed
PyThaiNLP automation moved this from In progress to Done Dec 1, 2023
@bact
Member

bact commented Dec 1, 2023

Merged. Thank you.

@bact bact changed the title Add wikipedia titles corpus Add Thai word list from Thai Wikipedia titles Dec 15, 2023
@bact bact mentioned this pull request Dec 15, 2023
Labels
corpus (corpus/dataset-related issues), enhancement (enhance functionalities)
Projects
PyThaiNLP
  
Done
Development

Successfully merging this pull request may close these issues.

[Suggestion] Add a large dictionary data
4 participants