
PyThaiNLP v5.0.0-beta1

Pre-release
@wannaphong wannaphong released this 05 Feb 05:37
· 46 commits to dev since this release

Schedule

  • First Beta release: 5 February 2024
  • Production release: 10 February 2024

See 5.0 Milestone.

What is new?

License information

  • Use SPDX license identifier at the header of source code #876

Deprecation and other API changes

  • Change default NER to thainer-v2 5e97e7c
  • Move pythainlp.util.is_native_thai to pythainlp.morpheme.is_native_thai 524759a
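Code that has to run against both the old and the new module layout can absorb the is_native_thai move with a small import shim. The sketch below is only an illustration and not part of PyThaiNLP: resolve_is_native_thai is a hypothetical helper name, and it returns None when neither location can be imported (for example, when PyThaiNLP is not installed).

```python
import importlib


def resolve_is_native_thai():
    """Look up is_native_thai in its new module first, then the old one.

    Hypothetical helper for illustration; returns None if not found.
    """
    # New location (5.0) first, then the pre-5.0 location.
    for module_name in ("pythainlp.morpheme", "pythainlp.util"):
        try:
            module = importlib.import_module(module_name)
        except ImportError:
            # Module (or PyThaiNLP itself) is unavailable; try the next one.
            continue
        func = getattr(module, "is_native_thai", None)
        if func is not None:
            return func
    return None


is_native_thai = resolve_is_native_thai()
```

Projects that only target 5.0 or later can skip the shim and import directly from pythainlp.morpheme.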

Dependency

New API

Improve

  • Update code comments and clean up code by @BLKSerene in #845
  • Improve the documentation by fixing typos and adding necessary details: explanations of the code, missing information about models, and examples. By @Saharshjain78 in #850
  • Fix tests of khavee functions by @BLKSerene in #854
  • Update GitHub Actions versions by @bact in #878
  • Fix ruff args in workflow by @bact in #880
  • Revise ruff args in workflow by @bact in #881
  • Fix coref return type and add fallback by @bact in #883
  • Fix wrong/incompatible types, code readability by @bact in #884
  • Bump protobuf from 3.20 to 3.20.2 in #885
  • Add license info to /tests and README_TH.md by @bact in #886
  • phayathaibert, khavee, parse: Code clean up by @bact in #889
  • ruff: docstring-code-format = true by @bact in #892

Tokenizer

  • Add wtpsplit engine to sentence_tokenize #804
  • New paragraph_tokenize function to split Thai text into paragraphs #804
  • Add paragraph_threshold parameter to the paragraph_tokenize() function by @pavaris-pm in #806
  • Add 🪿 Han-solo by @wannaphong in #830
  • Fix newmm to better handle non-Thai characters in tokens #856 by @konbraphat51
  • Fix incorrect passing of flags to re.split by @hauntsaninja in #832
  • Add syllable_tokenize by @wannaphong in #834
  • Add wanchanberta_thai_grammarly by @wannaphong in #836
  • Add extra segmentation style for paragraph_tokenize function by @pavaris-pm in #844
  • Improve: [newmm tokenizer] Change regular expression of "non-thai-characters" by @konbraphat51 in #856
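As a rough sketch of how the new paragraph_tokenize function and its paragraph_threshold parameter might be called: the import path pythainlp.tokenize, the keyword form of the argument, and the value 0.5 are assumptions for illustration, and split_paragraphs is a hypothetical wrapper. The real tokenizer loads a model on first use, so the wrapper accepts a stand-in callable for experimentation.

```python
def split_paragraphs(text, threshold=0.5, tokenizer=None):
    """Split text into paragraphs via paragraph_tokenize (hypothetical wrapper).

    threshold is forwarded as paragraph_threshold; 0.5 is an illustrative
    value, not a documented default.
    """
    if tokenizer is None:
        # Assumed import path; requires PyThaiNLP 5.0+ and its model dependencies.
        from pythainlp.tokenize import paragraph_tokenize as tokenizer
    return tokenizer(text, paragraph_threshold=threshold)


# Stand-in tokenizer so the wrapper can be tried without downloading a model:
# it simply splits on blank lines and ignores the threshold.
def blank_line_tokenizer(text, paragraph_threshold):
    return [part for part in text.split("\n\n") if part]


paragraphs = split_paragraphs("ย่อหน้าแรก\n\nย่อหน้าที่สอง", tokenizer=blank_line_tokenizer)
```

Passing tokenizer=None (the default) routes the call to PyThaiNLP itself; the shape of its return value is not specified here.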

Tag

Chat

Translate

Transliterate

Corpus

  • Add pythainlp.corpus.thai_orst_words(): Thai word list from the Royal Society of Thailand (ORST) #810 by @wannaphong
  • Add pythainlp.corpus.thai_wikipedia_titles(): Thai word list (nouns and noun phrases) from Thai Wikipedia titles #869 by @konbraphat51
  • Add pythainlp.corpus.thai_volubilis_words(): Thai word list from the Volubilis dictionary #870 by @konbraphat51
  • Add pythainlp.corpus.thai_icu_words(): Thai word list from the ICU BreakIterator dictionary #879 by @pavaris-pm
  • Rename Volubilis/Wikipedia corpus function names for consistency / Fix types by @bact in #882
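The new corpus functions listed above could be combined into a single vocabulary along these lines. This is a minimal sketch that assumes each function takes no arguments and returns an iterable of strings; load_word_lists and combined_vocabulary are hypothetical helper names, not PyThaiNLP APIs.

```python
def load_word_lists(loaders=None):
    """Return {name: set of words} for each word-list loader.

    loaders maps a name to a zero-argument callable. When omitted, the four
    corpus functions named in these release notes are used (assumes
    PyThaiNLP 5.0+ is installed).
    """
    if loaders is None:
        from pythainlp import corpus

        names = (
            "thai_orst_words",
            "thai_wikipedia_titles",
            "thai_volubilis_words",
            "thai_icu_words",
        )
        loaders = {name: getattr(corpus, name) for name in names}
    return {name: set(loader()) for name, loader in loaders.items()}


def combined_vocabulary(word_lists):
    """Union all word sets into one vocabulary."""
    vocabulary = set()
    for words in word_lists.values():
        vocabulary |= words
    return vocabulary


# Example with stand-in loaders, so it runs without PyThaiNLP installed.
demo = load_word_lists({"a": lambda: ["แมว", "หมา"], "b": lambda: ["หมา", "นก"]})
demo_vocab = combined_vocabulary(demo)
```

Calling load_word_lists() with no arguments uses the real corpus functions, which may download data on first use.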

Util

New Contributors