
Manage extra requires + merge g2p and romanization to one transliterate module #153

Merged: 14 commits, Nov 9, 2018
5 changes: 2 additions & 3 deletions .travis.yml
@@ -3,12 +3,11 @@

language: python
python:
- "3.4"
- "3.5"
- "3.6"
# command to install dependencies, e.g. pip install -r requirements.txt --use-mirrors
install:
- pip install -r requirements-travis.txt
- pip install -r requirements.txt
- pip install .[icu,ner,pos,tokenize,transliterate]
- pip install coveralls

os:
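The CI now installs the package with its optional extras instead of from a requirements file. For orientation, here is a minimal sketch of how such `extras_require` groups might be declared in `setup.py`; the group-to-package mapping below is an assumption for illustration, not taken from this diff.

```python
# setup.py (illustrative sketch -- the real dependency lists may differ)
from setuptools import setup, find_packages

setup(
    name="pythainlp",
    version="1.8.0",
    packages=find_packages(),
    extras_require={
        "icu": ["pyicu"],              # pip install pythainlp[icu]
        "ner": ["sklearn-crfsuite"],   # used by pythainlp.ner
        "pos": ["artagger"],           # assumed mapping
        "tokenize": ["deepcut"],       # used by pythainlp.tokenize.deepcut
        "transliterate": ["pyicu"],    # assumed mapping; the "pyicu" engine needs PyICU
    },
)
```

With groups declared like this, `pip install .[icu,ner,pos,tokenize,transliterate]` pulls in every optional dependency the test suite exercises in one step.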
4 changes: 2 additions & 2 deletions README-pypi.md
@@ -1,6 +1,6 @@
![PyThaiNLP Logo](https://avatars0.githubusercontent.com/u/32934255?s=200&v=4)

# PyThaiNLP 1.7
# PyThaiNLP 1.8.0

[![Codacy Badge](https://api.codacy.com/project/badge/Grade/cb946260c87a4cc5905ca608704406f7)](https://www.codacy.com/app/pythainlp/pythainlp_2?utm_source=github.com&utm_medium=referral&utm_content=PyThaiNLP/pythainlp&utm_campaign=Badge_Grade)[![pypi](https://img.shields.io/pypi/v/pythainlp.svg)](https://pypi.python.org/pypi/pythainlp)
[![Build Status](https://travis-ci.org/PyThaiNLP/pythainlp.svg?branch=develop)](https://travis-ci.org/PyThaiNLP/pythainlp)
@@ -14,7 +14,7 @@ PyThaiNLP features include Thai word and subword segmentations, soundex, romaniz

## What's new in version 1.7?

- Deprecate Python 2 support
- Deprecate Python 2 support. (Python 2 compatibility code will be completely dropped in PyThaiNLP 1.8)
- Refactor pythainlp.tokenize.pyicu for readability
- Add Thai NER model to pythainlp.ner
- thai2vec v0.2 - larger vocab, benchmarking results on Wongnai dataset
4 changes: 2 additions & 2 deletions README.md
@@ -21,7 +21,7 @@ Python 2 users can still use PyThaiNLP 1.6.
## Capabilities

- Thai word segmentation (```word_tokenize```), including subword segmentation based on Thai Character Cluster (```tcc```) and ETCC (```etcc```)
- Thai romanization (```romanize```)
- Thai romanization and transliteration (```romanize```, ```transliterate```)
- Thai part-of-speech taggers (```pos_tag```)
- Read out number to Thai words (```bahttext```, ```num_to_thaiword```)
- Thai collation (sort by dictionary order) (```collate```)
@@ -85,7 +85,7 @@ PyThaiNLP เป็นไลบารีภาษาไพทอนเพื่
## Capabilities

- Thai word segmentation (```word_tokenize```), with support for Thai Character Clusters (```tcc```) and ETCC (```etcc```)
- Thai romanization to Latin script (```romanize```)
- Thai romanization to Latin script and phonetic transliteration (```romanize```, ```transliterate```)
- Thai part-of-speech tagging (```pos_tag```)
- Reading numbers out as Thai text (```bahttext```, ```num_to_thaiword```)
- Thai dictionary-order collation (```collate```)
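For quick reference, a minimal usage sketch of the two renamed entry points; the `'maew'` output follows the example in the project docs further down, and the `transliterate` output depends on the installed ICU backend:

```python
from pythainlp.transliterate import romanize, transliterate

print(romanize("แมว"))                       # 'maew', per the docs example below
print(transliterate("แมว", engine="pyicu"))  # phonetic form; requires the [icu] extra
```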
7 changes: 1 addition & 6 deletions appveyor.yml
@@ -2,11 +2,6 @@ build: off

environment:
matrix:
- PYTHON: "C:/Python34"
PYTHON_VERSION: "3.4"
PYTHON_ARCH: "32"
PYICU_WHEEL: "https://get.openlp.org/win-sdk/PyICU-1.9.5-cp34-cp34m-win32.whl"

- PYTHON: "C:/Python36"
PYTHON_VERSION: "3.6"
PYTHON_ARCH: "32"
@@ -37,7 +32,7 @@ install:
# - "set ICU_VERSION=62"
- "%PYTHON%/python.exe -m pip install --upgrade pip"
- "%PYTHON%/python.exe -m pip install %PYICU_WHEEL%"
- "%PYTHON%/python.exe -m pip install -e ."
- "%PYTHON%/python.exe -m pip install -e .[icu,ner,pos,tokenize,transliterate]"

test_script:
- "%PYTHON%/python.exe -m pip --version"
8 changes: 4 additions & 4 deletions docs/api/romanization.rst
@@ -1,10 +1,10 @@
.. currentmodule:: pythainlp.romanization

pythainlp.romanization
pythainlp.transliterate
====================================
The :class:`pythainlp.romanization` turns thai text into a romanized one (put simply, spelled with English).
The :class:`pythainlp.transliterate` turns Thai text into a romanized one (put simply, spelled with English).

.. autofunction:: romanization
.. currentmodule:: pythainlp.romanization.thai2rom
.. autofunction:: transliterate
.. currentmodule:: pythainlp.transliterate.thai2rom
.. autoclass:: thai2rom
:members: romanize
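The reST above documents a `thai2rom` class with a `romanize` member under the new `pythainlp.transliterate.thai2rom` module path. A hedged sketch of how that class-based interface would be used; the no-argument constructor is an assumption, since the diff does not show it:

```python
from pythainlp.transliterate.thai2rom import thai2rom

# Instantiate the romanizer and call its documented member.
r = thai2rom()           # constructor arguments, if any, are not shown in this diff
print(r.romanize("แมว"))
```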
2 changes: 1 addition & 1 deletion docs/conf.py
@@ -29,7 +29,7 @@
# The short X.Y version
version = ''
# The full version, including alpha/beta/rc tags
release = '1.7'
release = '1.8.0'


# -- General configuration ---------------------------------------------------
8 changes: 5 additions & 3 deletions docs/pythainlp-dev-thai.md
@@ -256,12 +256,13 @@ lentext คือ จำนวนคำขั้นต่ำที่ต้อ

Returns a dict

### romanization
### transliteration

```python
from pythainlp.romanization import romanize
from pythainlp.transliterate import romanize, transliterate

romanize(str, engine="royin")
transliterate(str, engine="pyicu")
```

The following engines are available:
@@ -275,9 +276,10 @@
**Example**

```python
from pythainlp.romanization import romanize
from pythainlp.transliterate import romanize, transliterate

romanize("แมว") # 'maew'
transliterate("นก")
```

### spell
5 changes: 0 additions & 5 deletions examples/romanization.py

This file was deleted.

6 changes: 6 additions & 0 deletions examples/transliterate.py
@@ -0,0 +1,6 @@
# -*- coding: utf-8 -*-

from pythainlp.transliterate import romanize, transliterate

print(romanize("แมว"))
print(transliterate("แมว"))
4 changes: 2 additions & 2 deletions pythainlp/__init__.py
@@ -1,6 +1,6 @@
# -*- coding: utf-8 -*-

__version__ = 1.7
__version__ = 1.8

thai_alphabets = "กขฃคฅฆงจฉชซฌญฎฏฐฑฒณดตถทธนบปผฝพฟภมยรลวศษสหฬอฮ" # 44 chars
thai_vowels = "ฤฦะ\u0e31าำ\u0e34\u0e35\u0e36\u0e37\u0e38\u0e39เแโใไ\u0e45\u0e47" # 19
@@ -24,7 +24,7 @@

from pythainlp.collation import collate
from pythainlp.date import now
from pythainlp.romanization import romanize
from pythainlp.transliterate import romanize, transliterate
from pythainlp.sentiment import sentiment
from pythainlp.soundex import soundex
from pythainlp.spell import spell
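Because `__init__.py` re-exports the new functions, both remain importable from the package root; a quick sketch:

```python
import pythainlp

print(pythainlp.__version__)           # 1.8 (note: stored as a number, not a string)
print(pythainlp.romanize("แมว"))       # re-exported from pythainlp.transliterate
print(pythainlp.transliterate("แมว"))
```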
2 changes: 1 addition & 1 deletion pythainlp/corpus/tnc.py
@@ -40,7 +40,7 @@ def word_freq(word, domain="all"):

r = requests.post(url, data=data)

pat = re.compile('TOTAL</font>(?s).*?#ffffff">(.*?)</font>')
pat = re.compile(r'TOTAL</font>(?s).*?#ffffff">(.*?)</font>')
match = pat.search(r.text)

n = 0
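The only change here is the `r` prefix on the pattern. A small, unrelated illustration of what raw strings buy you in regexes: backslashes reach the regex engine untouched instead of being interpreted as Python string escapes first.

```python
import re

print(re.search(r"\d+", "word 42").group())  # '42' -- \d survives as a regex escape
print(len("\t"), len(r"\t"))                 # 1 2  -- "\t" is a tab, r"\t" is two characters
```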
21 changes: 0 additions & 21 deletions pythainlp/g2p/__init__.py

This file was deleted.

13 changes: 1 addition & 12 deletions pythainlp/ner/__init__.py
@@ -4,23 +4,12 @@
"""
__all__ = ["ThaiNameRecognizer"]

import sklearn_crfsuite
from pythainlp.corpus import download, get_file, thai_stopwords
from pythainlp.tag import pos_tag
from pythainlp.tokenize import word_tokenize
from pythainlp.util import is_thaiword

try:
import sklearn_crfsuite
except ImportError:
from pythainlp.tools import install_package

install_package("sklearn-crfsuite")
try:
import sklearn_crfsuite
except ImportError:
raise ImportError("ImportError: Try 'pip install sklearn-crfsuite'")


_WORD_TOKENIZER = "newmm"  # word tokenizer
_STOPWORDS = thai_stopwords()

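With the install-on-import fallback removed, `sklearn_crfsuite` is now a plain import that the `ner` extra is expected to satisfy. If a friendlier failure message is wanted, a caller-side guard along these lines would do (a sketch, not part of this PR):

```python
try:
    import sklearn_crfsuite  # provided by: pip install pythainlp[ner]
except ImportError as exc:
    raise ImportError(
        "sklearn-crfsuite is required for pythainlp.ner; "
        "install it with 'pip install pythainlp[ner]'"
    ) from exc
```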
26 changes: 0 additions & 26 deletions pythainlp/romanization/__init__.py

This file was deleted.

19 changes: 0 additions & 19 deletions pythainlp/romanization/pyicu.py

This file was deleted.

37 changes: 6 additions & 31 deletions pythainlp/sentiment/ulmfit_sent.py
@@ -5,40 +5,15 @@
"""
from collections import defaultdict

import dill as pickle
import numpy as np
import torch
from pythainlp.corpus import download, get_file
from pythainlp.tokenize import word_tokenize
from torch import LongTensor
from torch.autograd import Variable

try:
import numpy as np
import dill as pickle
except ImportError:
from pythainlp.tools import install_package

install_package("numpy")
install_package("dill")
try:
import numpy as np
import dill as pickle
except ImportError:
raise ImportError("ImportError: Try 'pip install numpy dill'")

try:
import torch
from torch import LongTensor
from torch.autograd import Variable
except ImportError:
print("PyTorch required. See https://pytorch.org/.")

# try:
# from fastai.text import multiBatchRNN
# except ImportError:
# print(
# """
# fastai required for multiBatchRNN.
# Run 'pip install https://github.com/fastai/fastai/archive/master.zip'
# """
# )

# from fastai.text import multiBatchRNN

MODEL_NAME = "sent_model"
ITOS_NAME = "itos_sent"
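The sentiment module now imports dill, numpy, and torch at module load instead of installing them on the fly. Calling it is unchanged; a minimal sketch using the top-level helper re-exported in `pythainlp/__init__.py` (default engine and return format are not shown in this diff):

```python
from pythainlp.sentiment import sentiment

# "อร่อยมาก" means "very delicious"; the label returned depends on the
# default engine and model, which this diff does not show.
print(sentiment("อร่อยมาก"))
```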
14 changes: 1 addition & 13 deletions pythainlp/tag/__init__.py
@@ -25,19 +25,7 @@ def pos_tag(words, engine="unigram", corpus="orchid"):
elif engine == "artagger":

def _tag(text, corpus=None):
try:
from artagger import Tagger
except ImportError:
from pythainlp.tools import install_package

install_package(_ARTAGGER_URL)
try:
from artagger import Tagger
except ImportError:
raise ImportError(
"ImportError: Try 'pip install " + _ARTAGGER_URL + "'"
)

from artagger import Tagger
words = Tagger().tag(" ".join(text))

return [(word.word, word.tag) for word in words]
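With the install-on-import fallback gone, the artagger branch simply imports `Tagger` and assumes the package was installed beforehand (its pip URL is kept in `_ARTAGGER_URL`). A usage sketch based on the `pos_tag` signature shown in the hunk header:

```python
from pythainlp.tag import pos_tag
from pythainlp.tokenize import word_tokenize

words = word_tokenize("ฉันรักภาษาไทย")    # "I love the Thai language"
print(pos_tag(words))                      # defaults per the signature: unigram tagger, orchid corpus
print(pos_tag(words, engine="artagger"))   # requires artagger to be installed separately
```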
12 changes: 1 addition & 11 deletions pythainlp/tokenize/deepcut.py
@@ -3,17 +3,7 @@
Wrapper for deepcut Thai word segmentation
"""

try:
import deepcut
except ImportError:
"""In case deepcut is not yet installed on the system"""
from pythainlp.tools import install_package

install_package("deepcut")
try:
import deepcut
except ImportError:
raise ImportError("ImportError: Try 'pip install deepcut'")
import deepcut


def segment(text):
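deepcut likewise becomes a hard import, expected to be satisfied by the `tokenize` extra. A hedged usage sketch; the `engine="deepcut"` name follows PyThaiNLP's tokenizer convention and is assumed rather than shown in this diff:

```python
from pythainlp.tokenize import word_tokenize

# Assumes deepcut is installed, e.g. via: pip install pythainlp[tokenize]
print(word_tokenize("ตัดคำภาษาไทย", engine="deepcut"))
```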
13 changes: 2 additions & 11 deletions pythainlp/tokenize/pyicu.py
@@ -4,20 +4,11 @@
"""
import re

try:
import icu
except ImportError:
from pythainlp.tools import install_package

install_package("pyicu")
try:
import icu
except ImportError:
raise ImportError("ImportError: Try 'pip install pyicu'")
from icu import BreakIterator, Locale


def _gen_words(text):
bd = icu.BreakIterator.createWordInstance(icu.Locale("th"))
bd = BreakIterator.createWordInstance(Locale("th"))
bd.setText(text)
p = bd.first()
for q in bd:
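For completeness, a self-contained sketch of the ICU word-break logic shown above; it runs outside PyThaiNLP as long as PyICU is installed (for example via the `icu` extra):

```python
from icu import BreakIterator, Locale


def icu_words(text):
    """Yield word fragments using ICU's Thai break iterator."""
    bd = BreakIterator.createWordInstance(Locale("th"))
    bd.setText(text)
    p = bd.first()
    for q in bd:  # iterating yields successive break positions
        yield text[p:q]
        p = q


print(list(icu_words("ฉันรักภาษาไทย")))
```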