PyThaiNLP/docker-thai-tokenizers

Thai Word Tokenizers

This repository collects almost all publicly available Thai tokenisers. Having them in one place lets us try each algorithm with ease via Docker.

Technically, each project (called a vendor) has its own Docker image with an entry script and auxiliary scripts. These scripts provide a unified interface, so every algorithm can be run in the same way.

Vendors

Vendor                                | Alias     | Available Methods
PyThaiNLP                             | pythainlp | newmm, longest
DeepCut                               | deepcut   | deepcut
CutKum                                | cutkum    | cutkum
Sertis                                | sertis    | sertis
Thai Language Toolkit                 | tltk      | mm, ngram, colloc
Smart Word Analysis for Thai (SWATH)  | swath     | max, long
Chrome's v8BreakIterator              | chrome    | v8breakiterator

Please see Usages for more details.

Setup

  • Pull the Docker images you need. Please check Docker Hub for the available images.
    $ docker pull pythainlp/word-tokenizers:<vendor-alias>
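
To fetch several images at once, the pull command can be wrapped in a loop. The following is a minimal sketch, not part of this repository: image_for, pull_all and the DRY_RUN variable are hypothetical names, and the alias list is copied from the Vendors table on the assumption that each alias has a matching tag on Docker Hub (please verify there first).

```shell
# Vendor aliases copied from the Vendors table (assumed to match image tags).
VENDORS="pythainlp deepcut cutkum sertis tltk swath chrome"

# Hypothetical helper: map a vendor alias to its image name.
image_for() {
  printf 'pythainlp/word-tokenizers:%s\n' "$1"
}

# Hypothetical helper: pull every image, or only print the commands
# when DRY_RUN=1 is set. Stops at the first failed pull.
pull_all() {
  for vendor in $VENDORS; do
    if [ "${DRY_RUN:-0}" = "1" ]; then
      echo "docker pull $(image_for "$vendor")"
    else
      docker pull "$(image_for "$vendor")" || return 1
    fi
  done
}
```

Running `DRY_RUN=1 pull_all` first shows the commands without contacting Docker Hub; a plain `pull_all` then performs the pulls.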
    

Usages

  1. Put text files that you want to tokenise into ./data.
  2. Run the following command:
$ ./scripts/tokenise.sh <vendor-alias>:<method> <filename>

Please check the Vendors section for the vendors and methods included here.

Example

Let's say you want to tokenise text in ./data/example.text using PyThaiNLP's newmm algorithm. You can use the following command:

$ cat ./data/example.text
อันนี้คือตัวอย่าง

$ ./scripts/tokenise.sh pythainlp:newmm example.text
# Please be aware that you don't need to have ./data in front of the filename.
# Command Output
Tokenising example.text using vendor=pythainlp and method=newmm
CMD: docker run -v /Users/heytitle/projects/tokenisers-for-thai/data:/data  thai-tokeniser:pythainlp newmm example.text
100%|██████████| 1/1 [00:00<00:00, 151.70it/s]
Tokenising /data/example.text with newmm
Tokenised text is written to /data/example_tokenised-pythainlp-newmm.text

$ cat ./data/example_tokenised-pythainlp-newmm.text
อันนี้|คือ|ตัวอย่าง
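
The example above suggests two conventions that are handy when scripting around the tokeniser: the output file is named after the input file, vendor and method, and tokens are separated by "|". The sketch below relies on both. output_name and split_tokens are hypothetical helpers, not scripts shipped with this repository, and the naming pattern is inferred from this single example, assuming a filename with one extension.

```shell
# Hypothetical helper: reproduce the naming pattern seen in the example,
# <stem>_tokenised-<vendor>-<method>.<extension>.
# Assumes the filename contains an extension.
output_name() {
  file=$1
  vendor=$2
  method=$3
  stem=${file%.*}   # everything before the last dot
  ext=${file##*.}   # everything after the last dot
  printf '%s_tokenised-%s-%s.%s\n' "$stem" "$vendor" "$method" "$ext"
}

# Hypothetical helper: read "|"-delimited text on stdin and
# print one token per line.
split_tokens() {
  tr '|' '\n'
}
```

For instance, `split_tokens < "./data/$(output_name example.text pythainlp newmm)"` would list the tokens of the file produced above, one per line.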

Development

Architecture

TBD.

Build a new Docker image for a vendor

$ ./scripts/build <vendor>

Push a new Docker image to Docker Hub

$ ./scripts/push <vendor>

Acknowledgements
