Thai BEST2010 Corpus Cooker

best2010_cooker

A tool for extracting segmented words from Thai segmented BEST2010 corpus.

Data formats

sl: sentence line
wl: word line

Usage

usage: cooker.py [-h] [--quiet] --input_data INPUT_DATA
                 [--output_data OUTPUT_DATA]
                 [--input_data_format INPUT_DATA_FORMAT]
                 [--output_data_format OUTPUT_DATA_FORMAT]
                 [--sentence_len_threshold SENTENCE_LEN_THRESHOLD] [--denoise]
                 [--gen_common_vocab] [--gen_ne_vocab] [--gen_ab_vocab]
                 [--exclude_empty_line] [--gen_div_data]

optional arguments:
  -h, --help            show this help message and exit
  --quiet, -q           Do not report on screen
  --input_data INPUT_DATA, -i INPUT_DATA
                        File path to input data
  --output_data OUTPUT_DATA, -o OUTPUT_DATA
                        File path to output data
  --input_data_format INPUT_DATA_FORMAT, -f INPUT_DATA_FORMAT
                        Choose format of input data among from 'txt' (Default:
                        txt)
  --output_data_format OUTPUT_DATA_FORMAT
                        Choose format of output data among from 'wl' and 'sl'
                        (Default: sl)
  --sentence_len_threshold SENTENCE_LEN_THRESHOLD
                        Sentence length threshold. Sentences whose length are
                        lower than the threshold are ignored (Default: 1)
  --denoise             Specify to deeply clean the input data (original data
                        denoising) see README.md for more details
  --gen_common_vocab    Specify to generate common vocabulary
  --gen_ne_vocab        Specify to generate NE vocabulary
  --gen_ab_vocab        Specify to generate abbreviation (AB) vocabulary
  --exclude_empty_line  Specify to exclude empty line
  --gen_div_data        Specify to generate division data (train, valid, test)
                        see README.md for more details

Example outputs

Start time: 20210609_1919

### arguments
# quiet=False
# input_data=data/best2010-sample.txt
# output_data=cooked
# input_data_format=txt
# output_data_format=sl
# sentence_len_threshold=1
# denoise=True
# gen_common_vocab=True
# gen_ne_vocab=True
# gen_ab_vocab=True
# exclude_empty_line=False
# gen_div_data=True

save cooked data: cooked/20210609_1919/cooked_best2010_20210609_1919.sl
save cooked train data: cooked/20210609_1919/cooked_best2010_20210609_1919.train.sl
save cooked valid data: cooked/20210609_1919/cooked_best2010_20210609_1919.valid.sl
save cooked test data: cooked/20210609_1919/cooked_best2010_20210609_1919.test.sl
save cooked train common vocab data: cooked/20210609_1919/cooked_best2010_20210609_1919.train.vocab
save cooked valid common vocab data: cooked/20210609_1919/cooked_best2010_20210609_1919.valid.vocab
save cooked test common vocab data: cooked/20210609_1919/cooked_best2010_20210609_1919.test.vocab
save cooked train NE data: cooked/20210609_1919/cooked_best2010_20210609_1919.train.ne.vocab
save cooked valid NE data: cooked/20210609_1919/cooked_best2010_20210609_1919.valid.ne.vocab
save cooked test NE data: cooked/20210609_1919/cooked_best2010_20210609_1919.test.ne.vocab
save cooked train AB data: cooked/20210609_1919/cooked_best2010_20210609_1919.train.ab.vocab
save cooked valid AB  data: cooked/20210609_1919/cooked_best2010_20210609_1919.valid.ab.vocab
save cooked test AB data: cooked/20210609_1919/cooked_best2010_20210609_1919.test.ab.vocab
save cooked common vocab data: cooked/20210609_1919/cooked_best2010_20210609_1919.vocab
save cooked NE data: cooked/20210609_1919/cooked_best2010_20210609_1919.ne.vocab
save cooked abbreviation data: cooked/20210609_1919/cooked_best2010_20210609_1919.ab.vocab
### report
# [PRE] line: 50 ...
# [PRE] sent: 50 ...
# [POST] sent: 50 ...
# [POST] train-div: 44 ...
# [POST] valid-div: 4 ...
# [POST] test-div: 2 ...
# [POST] common vocab: 642 ...
# [POST] ne: 7 ...
# [POST] abbreviation: 1 ...
# [POST] train-vocab: 586 ...
# [POST] valid-vocab: 191 ...
# [POST] test-vocab: 89 ...
# [POST] train-ne: 7 ...
# [POST] train-ab: 1 ...
# [POST] word: 4020 ...
# [POST] char: 15977 ...
# [POST] words/sent: min=2 max=177 avg=80.4
# [POST] chars/sent: min=6 max=708 avg=319.54
# [POST] chars/word: min=1 max=20 avg=3.9743781094527364
Elapsed time: 0.109500 sec.

Data divisions (shuffled)

Merged from article, encyclopedia, news, and novel domains, respectively. See src/divs.py for train, validation, and test indices referred to the original data.

Denoise (noisy data based on the original data)

13 cases (34 lines) were found to be noise data (incorrect annotation). Each number refers to the line number in the original data

<NE>w1</NE → <NE>w1</NE>
- 5558 5559 5562 5913 6547 6567 6717 6737
<NE>w1>w2</NE> → <NE>w1 w2</NE>
- 5603
<NE>NE>w1</NE> → <NE>w1</NE>
- 8153 80697 81086 81108 82631
<NE>w1</NE</NE> → <NE>w1</NE>
- 22548 80856 118267
<NE>w1/NE></NE> → <NE>w1</NE>
- 31526 79952 80137
<NE>w1<w2</NE> → <NE>w1w2</NE>
- 13049 45835
AB>w1</AB> → <AB>w1</AB>
- 54419
<NE>w1></NE> → <NE>w1</NE>
- 56893 91142 91144 91145
<NE>w1</NE>w2</NE> → <NE>w1w2</NE>
- 80145
<NE>AB>w1w2</NE> → <NE><AB>w1</AB>w2</NE>
- 82274 84065 89645
<NE>w1</NE</NE>> → <NE>w1</NE>
- 95647
<NEw1</NE> → <NE>w1</NE>
- 132798
<NE>MERGE>w1</NE> → <NE>w1</NE>
- 144226

Name		Name	Last commit message	Last commit date
Latest commit History 14 Commits
cooked		cooked
data		data
sample_scripts		sample_scripts
src		src
.gitignore		.gitignore
LICENSE		LICENSE
README.md		README.md

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

Thai BEST2010 Corpus Cooker

best2010_cooker

Data formats

Usage

Example outputs

Data divisions (shuffled)

Denoise (noisy data based on the original data)

About

Releases

Packages

Languages

License

tchayintr/best2010_cooker

Folders and files

Latest commit

History

Repository files navigation

Thai BEST2010 Corpus Cooker

best2010_cooker

Data formats

Usage

Example outputs

Data divisions (shuffled)

Denoise (noisy data based on the original data)

About

Resources

License

Stars

Watchers

Forks

Releases

Packages 0

Languages

Packages