
Speed Comparison

Shunsuke Kanda edited this page Mar 1, 2023 · 5 revisions

This page reports the tokenization speed of Vibrato in comparison with other tokenizers and morphological analyzers.

Experimental setup

Competitors

We compare Vibrato 0.5.0 with MeCab and its reimplementations:

For Vibrato and MeCab, we evaluate two system dictionaries: IPADIC 2.7.0 and UniDic 3.1.1. For Lindera, we evaluate its IPADIC and UniDic versions. sudachi.rs is evaluated with SudachiDict-core.

Further, we evaluate two compact versions of the Vibrato UniDic model (distributed on our release page):

  • raw-connector: unidic-cwj-3_1_1+compact
  • dual-connector: unidic-cwj-3_1_1+compact-dual

We also compare pointwise prediction-based tokenizers:

For Vaporetto and KyTea, we use the compact SVM model based on BCCWJ and UniDic, downloaded from the KyTea Models page.

Methodology

We tokenize all sentences in I Am a Cat (by Soseki Natsume), which is available at Aozora Bunko, and report the elapsed time averaged over 100 runs.

  • Number of sentences: 2,346
  • Average number of characters per sentence: 158.8
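
The measurement loop described above can be sketched roughly as follows. This is not the actual benchmark code: `benchmark`, the `tokenize` parameter, and the whitespace-splitting stand-in tokenizer are hypothetical names chosen for illustration, and using the sample standard deviation is an assumption, since the page does not say which estimator was used.

```python
import statistics
import time
from typing import Callable, Iterable, List, Tuple


def benchmark(
    tokenize: Callable[[str], List[str]],
    sentences: Iterable[str],
    runs: int = 100,
) -> Tuple[float, float]:
    """Return (mean, standard deviation) of the per-run elapsed time in ms."""
    sentences = list(sentences)  # materialize once so every run sees the same input
    elapsed_ms = []
    for _ in range(runs):
        start = time.perf_counter()
        for sentence in sentences:
            tokenize(sentence)
        elapsed_ms.append((time.perf_counter() - start) * 1000.0)
    # Sample standard deviation (assumption); undefined for a single run.
    std = statistics.stdev(elapsed_ms) if runs > 1 else 0.0
    return statistics.mean(elapsed_ms), std


if __name__ == "__main__":
    # Stand-in corpus and tokenizer; a real run would load the novel
    # and call the tokenizer under test instead of str.split.
    corpus = ["吾輩 は 猫 で ある 。", "名前 は まだ 無い 。"]
    mean_ms, std_ms = benchmark(str.split, corpus, runs=100)
    print(f"{mean_ms:.3f} ms (STD {std_ms:.3f})")
```

Timing the whole corpus per run, rather than each sentence separately, keeps the timer overhead out of the measurement.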

The benchmark code can be found here.

Machine

The machine used has the following specifications:

  • CPU: Intel Core i9-12900K (16 cores, 3.2GHz-5.2GHz, 30MB L3 cache)
  • RAM: 64GB (2×32GB, DDR5)
  • OS: Ubuntu 22.04

Experimental results

| Library (dict) | Elapsed time [ms] | STD |
| --- | --- | --- |
| Vibrato 0.5.0 (ipadic-mecab 2.7.0) | 42 | 1.24 |
| Vibrato 0.5.0 (unidic-cwj 3.1.1) | 75 | 1.71 |
| Vibrato 0.5.0 (unidic-cwj 3.1.1, raw-connector) | 1364 | 5.14 |
| Vibrato 0.5.0 (unidic-cwj 3.1.1, dual-connector) | 170 | 2.50 |
| MeCab 2020-09-14 (ipadic-mecab 2.7.0) | 87 | 1.24 |
| MeCab 2020-09-14 (unidic-cwj 3.1.1) | 179 | 2.88 |
| Lindera 0.23.0 (ipadic) | 97 | 1.13 |
| Lindera 0.23.0 (unidic) | 156 | 2.11 |
| sudachi.rs 0.6.4-a1 (core, 20210802) | 220 | 4.74 |
| KyTea 2020-04-03 (jp-0.4.7-5) | 169 | 2.83 |
| Vaporetto 0.6.1 (jp-0.4.7-5) | 21 | 0.51 |
| rust-tinysegmenter 0.1.1 | 166 | 1.69 |
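
To make the elapsed times easier to compare, they can be converted into throughput and relative speedups. The following is a minimal sketch that uses only figures reported on this page; the helper names `chars_per_second` and `speedup` are hypothetical, and the total character count is approximated as the number of sentences times the reported per-sentence figure.

```python
# Corpus statistics from the methodology section.
NUM_SENTENCES = 2346
CHARS_PER_SENTENCE = 158.8

# Elapsed times [ms] copied from the result table (subset).
ELAPSED_MS = {
    "Vibrato (ipadic-mecab 2.7.0)": 42,
    "Vibrato (unidic-cwj 3.1.1)": 75,
    "MeCab (ipadic-mecab 2.7.0)": 87,
    "MeCab (unidic-cwj 3.1.1)": 179,
}


def chars_per_second(elapsed_ms: float) -> float:
    """Approximate throughput in characters per second for one corpus pass."""
    total_chars = NUM_SENTENCES * CHARS_PER_SENTENCE
    return total_chars / (elapsed_ms / 1000.0)


def speedup(baseline_ms: float, target_ms: float) -> float:
    """How many times faster the target is than the baseline."""
    return baseline_ms / target_ms


if __name__ == "__main__":
    for name, ms in ELAPSED_MS.items():
        print(f"{name}: {chars_per_second(ms) / 1e6:.2f} M chars/s")
    ipadic = speedup(ELAPSED_MS["MeCab (ipadic-mecab 2.7.0)"],
                     ELAPSED_MS["Vibrato (ipadic-mecab 2.7.0)"])
    unidic = speedup(ELAPSED_MS["MeCab (unidic-cwj 3.1.1)"],
                     ELAPSED_MS["Vibrato (unidic-cwj 3.1.1)"])
    print(f"Vibrato vs MeCab: {ipadic:.2f}x (IPADIC), {unidic:.2f}x (UniDic)")
```

With the table's numbers, Vibrato comes out roughly 2.1 times faster than MeCab with IPADIC and roughly 2.4 times faster with UniDic.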

Note that the Vibrato UniDic models differ in size, as shown below: the compact models are smaller but slower. You can therefore choose the model whose time-space tradeoff best suits your use case.

| Library (dict) | Model size [MB] |
| --- | --- |
| Vibrato 0.5.0 (unidic-cwj 3.1.1) | 717 |
| Vibrato 0.5.0 (unidic-cwj 3.1.1, raw-connector) | 252 |
| Vibrato 0.5.0 (unidic-cwj 3.1.1, dual-connector) | 300 |
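
The time-space tradeoff can be made concrete by combining the model sizes with the elapsed times from the result table. The sketch below is illustrative only: the key `"default"` is a hypothetical label for the non-compact UniDic model, and `relative_to_default` and `fastest_within` are hypothetical helpers, not part of Vibrato.

```python
from typing import Optional, Tuple

# Sizes [MB] and elapsed times [ms] for the Vibrato UniDic models,
# copied from the two tables on this page.
MODELS = {
    "default": {"size_mb": 717, "elapsed_ms": 75},
    "raw-connector": {"size_mb": 252, "elapsed_ms": 1364},
    "dual-connector": {"size_mb": 300, "elapsed_ms": 170},
}


def relative_to_default(name: str) -> Tuple[float, float]:
    """Return (size ratio, time ratio) of a model against the default model."""
    base = MODELS["default"]
    model = MODELS[name]
    return (model["size_mb"] / base["size_mb"],
            model["elapsed_ms"] / base["elapsed_ms"])


def fastest_within(budget_mb: float) -> Optional[str]:
    """Pick the fastest model whose size fits the given budget, if any."""
    candidates = [
        (spec["elapsed_ms"], name)
        for name, spec in MODELS.items()
        if spec["size_mb"] <= budget_mb
    ]
    return min(candidates)[1] if candidates else None


if __name__ == "__main__":
    for name in ("raw-connector", "dual-connector"):
        size_ratio, time_ratio = relative_to_default(name)
        print(f"{name}: {size_ratio:.0%} of the size, {time_ratio:.1f}x the time")
    print(fastest_within(400))
```

By these figures, the dual-connector model is about 42% of the default model's size at roughly 2.3 times the elapsed time, while the raw-connector model is about 35% of the size at roughly 18 times the elapsed time.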