Skip to content

✨ Split text by languages (e.g. 你喜欢看アニメ吗 -> 你喜欢看 | アニメ | 吗) for NLP tasks (e.g. parse, TTS). Powered by fasttext and langua

License

Notifications You must be signed in to change notification settings

DoodleBears/split-lang

Repository files navigation

VisActor Logo VisActor Logo

split-lang

English | 中文简体 | 日本語

Split text by languages through concatenating over split substrings based on their language, powered by

splitting: budoux and rule-base splitting

language detection: fast-langdetect and wordfreq


PyPI version Downloads Downloads

Open In Colab

License GitHub Repo stars wakatime

1. 💡How it works

Stage 1: rule-based split (separate character, punctuation and digit)

  • hello, how are you -> hello | , | how are you

Stage 2: over-split text to substrings by budoux for Chinese mix with Japanese, (space) for not scripta continua

  • 你喜欢看アニメ吗 -> | 喜欢 | | アニメ |
  • 昨天見た映画はとても感動的でした -> 昨天 | 見た | 映画 | | とても | 感動 | | | した
  • how are you -> how | are | you

Stage 3: concatenate substrings based on their languages using fast-langdetect, wordfreq and regex (rule-based)

  • | 喜欢 | | アニメ | -> 你喜欢看 | アニメ |
  • 昨天 | 見た | 映画 | | とても | 感動 | | | した -> 昨天 | 見た映画はとても感動的でした
  • how | are | you -> how are you

2. 🪨Motivation

  • TTS (Text-To-Speech) model often fails on multi-language speech generation, there are two ways to do:
    • Train a model can pronounce multiple languages
    • (This Package) Separate sentence based on language first, then use different language models
  • Existed models in NLP toolkit (e.g. SpaCy, jieba) is usually helpful for dealing with text in ONE language for each model. Which means multi-language texts need pre-process, like texts below:
你喜欢看アニメ吗?
Vielen Dank merci beaucoup for your help.
你最近好吗、最近どうですか?요즘 어떻게 지내요?sky is clear and sunny。

3. 📕Usage

3.1. 🚀Installation

You can install the package using pip:

pip install split-lang

3.2. Basic

3.2.1. split_by_lang

Open In Colab

from split_lang import LangSplitter
lang_splitter = LangSplitter()
text = "你喜欢看アニメ吗"

substr = lang_splitter.split_by_lang(
    text=text,
)
for index, item in enumerate(substr):
    print(f"{index}|{item.lang}:{item.text}")
0|zh:你喜欢看
1|ja:アニメ
2|zh:吗
from split_lang import LangSplitter
lang_splitter = LangSplitter(merge_across_punctuation=True)
import time
texts = [
    "你喜欢看アニメ吗?我也喜欢看",
    "Please star this project on GitHub, Thanks you. I love you请加星这个项目,谢谢你。我爱你この項目をスターしてください、ありがとうございます!愛してる",
]
time1 = time.time()
for text in texts:
    substr = lang_splitter.split_by_lang(
        text=text,
    )
    for index, item in enumerate(substr):
        print(f"{index}|{item.lang}:{item.text}")
    print("----------------------")
time2 = time.time()
print(time2 - time1)
0|zh:你喜欢看
1|ja:アニメ
2|zh:吗?我也喜欢看
----------------------
0|en:Please star this project on GitHub, Thanks you. I love you
1|zh:请加星这个项目,谢谢你。我爱你
2|ja:この項目をスターしてください、ありがとうございます!愛してる
----------------------
0.007998466491699219

3.2.2. merge_across_digit

lang_splitter.merge_across_digit = False
texts = [
    "衬衫的价格是9.15便士",
]
for text in texts:
    substr = lang_splitter.split_by_lang(
        text=text,
    )
    for index, item in enumerate(substr):
        print(f"{index}|{item.lang}:{item.text}")
0|zh:衬衫的价格是
1|digit:9.15
2|zh:便士

3.3. Advanced

3.3.1. usage of lang_map and default_lang (for your languages)

Important

Add lang code for your usecase if other languages are needed. See Support Language

  • default lang_map looks like below
    • if langua-py or fasttext or any other language detector detect the language that is NOT included in lang_map will be set to default_lang
    • if you set default_lang or value of key:value in lang_map to x, this substring will be merged to the near substring
      • zh | x | jp -> zh | jp (x been merged to one side)
      • In example below, zh-tw is set to x because character in zh and jp sometimes been detected as Traditional Chinese
  • default default_lang is x
DEFAULT_LANG_MAP = {
    "zh": "zh",
    "yue": "zh",  # 粤语
    "wuu": "zh",  # 吴语
    "zh-cn": "zh",
    "zh-tw": "x",
    "ko": "ko",
    "ja": "ja",
    "de": "de",
    "fr": "fr",
    "en": "en",
    "hr": "en",
}
DEFAULT_LANG = "x"

4. Acknowledgement

5. ✨Star History

Star History Chart