corpus structure and format


jidasheng edited this page Dec 6, 2019 · 2 revisions

corpus structure

corpus_dir
    vocab.json
    tags.json
    dataset.txt

all files are UTF-8 encoded
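
As a rough illustration of the layout above, the following sketch writes the three files with explicit UTF-8 encoding into a temporary directory. The helper name `write_corpus` and the tiny sample data are assumptions made for this example; they are not part of bi_lstm_crf.

```python
import json
import tempfile
from pathlib import Path


def write_corpus(corpus_dir, vocab, tags, lines):
    """Hypothetical helper: create corpus_dir with the three expected files."""
    corpus_dir = Path(corpus_dir)
    corpus_dir.mkdir(parents=True, exist_ok=True)
    # ensure_ascii=False stores the characters themselves instead of \uXXXX escapes
    (corpus_dir / "vocab.json").write_text(
        json.dumps(vocab, ensure_ascii=False), encoding="utf-8")
    (corpus_dir / "tags.json").write_text(
        json.dumps(tags, ensure_ascii=False), encoding="utf-8")
    # one sample per line in dataset.txt
    (corpus_dir / "dataset.txt").write_text(
        "\n".join(lines) + "\n", encoding="utf-8")
    return corpus_dir


with tempfile.TemporaryDirectory() as tmp:
    corpus = write_corpus(
        Path(tmp) / "corpus_dir",
        vocab=["市", "领", "导"],           # illustrative sample data
        tags=["B", "I"],
        lines=['市领导\t["B", "B", "I"]'],
    )
    created = sorted(p.name for p in corpus.iterdir())
    vocab_back = json.loads((corpus / "vocab.json").read_text(encoding="utf-8"))
```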

vocab.json
- a list of unique CHARs or WORDs that define the vocabulary
- chars/words that are not in the vocabulary will be replaced by UNKNOWN
- examples
  - CHAR-based: ["市", "领", "导", "到", "成", "都", ...]
  - WORD-based: ["市", "领导", "到", "成都", ...]
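
One possible way to derive such a list from raw sentences is sketched below. The function `build_vocab` is a hypothetical helper for illustration, not something the library is known to provide; it assumes a CHAR-based sentence is a plain string and a WORD-based sentence is a list of strings.

```python
import json


def build_vocab(sentences):
    """Hypothetical helper: unique tokens in first-seen order.

    Iterating a string yields chars (CHAR-based);
    iterating a list yields words (WORD-based).
    """
    seen = {}
    for sentence in sentences:
        for token in sentence:
            seen.setdefault(token, None)  # dicts preserve insertion order
    return list(seen)


char_vocab = build_vocab(["市领导到成都"])
word_vocab = build_vocab([["市", "领导", "到", "成都"]])

# this string is what would be saved as vocab.json (UTF-8)
vocab_json = json.dumps(char_vocab, ensure_ascii=False)
```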
tags.json
- a list of tags
  - any tag set can be used, in any order, with no constraints
  - the only thing to keep in mind is the begin_tags argument when predicting sequences
```
from bi_lstm_crf.app import WordsTagger

# model_dir is the directory of a trained model ("xxx" is a placeholder)
model = WordsTagger(model_dir="xxx")
# begin_tags is used for converting the predicted tags to sequences
tags, sequences = model(["市领导到成都..."], begin_tags="BS")

print(tags)
# [["B", "B", "I", "B", "B-LOC", "I-LOC", "I-LOC", "I-LOC", "I-LOC", "B", "I", "B", "I"]]

print(sequences)
# [['市', '领导', '到', ('成都', 'LOC'), ...]]
```
    - the begin_tags argument is used for converting the tags to sequences
    - most of the time the default value "BS" is right, but:
      - when you are using the BMEWO format (B = Begin, M = Middle, E = End, W = Word, O = Outside), begin_tags should be set to "BW"
- examples
  - WORD SEGMENTATION: ["B", "I"]
  - NER: ["O", "B-ORG", "I-ORG", ...]
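
To make the role of begin_tags more concrete, here is a minimal sketch of how tags could be grouped into sequences. It is not the library's implementation: `tags_to_spans` is a hypothetical function, and its rules (a tag whose first letter is in begin_tags opens a new sequence, "O" tokens are never merged, the label is whatever follows the "-") are assumptions chosen to reproduce the output format shown above.

```python
def tags_to_spans(tokens, tags, begin_tags="BS"):
    """Hypothetical sketch: group tokens into sequences based on their tags."""
    spans = []  # each item is [text, label]
    prev = None
    for token, tag in zip(tokens, tags):
        _, _, label = tag.partition("-")  # "B-LOC" -> label "LOC", "B" -> ""
        starts_new = (
            not spans
            or tag[0] in begin_tags        # e.g. "B" or "S" with the default
            or tag == "O" or prev == "O"   # assumption: "O" tokens stand alone
        )
        if starts_new:
            spans.append([token, label])
        else:
            spans[-1][0] += token
        prev = tag
    # labeled sequences become (text, label) tuples, unlabeled ones plain strings
    return [(text, label) if label else text for text, label in spans]


bio = tags_to_spans("市领导到成都", ["B", "B", "I", "B", "B-LOC", "I-LOC"])
# BMEWO-style tags need "W" to count as a begin tag
bmewo_ok = tags_to_spans("成都到市", ["B", "E", "W", "W"], begin_tags="BW")
bmewo_default = tags_to_spans("成都到市", ["B", "E", "W", "W"])
```

With the default "BS", the "W" tags in the last call are not treated as begins, so everything is merged into a single sequence; passing "BW" splits the words as intended.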

dataset.txt

format
```
[sentence][\tab][tags]
...
```
- [sentence] and [tags] are separated by a tab character
- the [sentence] should be a string or a list of strings
- for CHAR-based, a string is enough to represent a sentence
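
A line in this format could be read back roughly as follows. `parse_line` is a hypothetical helper, and two points are assumptions of this sketch rather than documented behavior: a sentence field that parses as a JSON list is treated as WORD-based, and each char/word is expected to have exactly one tag.

```python
import json


def parse_line(line):
    """Hypothetical helper: split one dataset.txt line into (sentence, tags)."""
    sentence_field, tags_field = line.rstrip("\n").rsplit("\t", 1)
    tags = json.loads(tags_field)
    try:
        parsed = json.loads(sentence_field)
    except json.JSONDecodeError:
        parsed = None
    # assumption: a JSON list means WORD-based, anything else is a plain string
    sentence = parsed if isinstance(parsed, list) else sentence_field
    # assumption: one tag per char (string) or per word (list)
    if len(sentence) != len(tags):
        raise ValueError("sentence and tags differ in length")
    return sentence, tags


char_sample = parse_line('市领导\t["B", "B", "I"]\n')
word_sample = parse_line('["市", "领导"]\t["B", "B"]\n')
```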

examples

CHAR-based

市领导到成都...    ["B", "B", "I", "B", "B", "I", ...]
...

WORD-based

["市", "领导", "到", "成都", ...]   ["B", "B", "I", "B", "B", "I", ...]
...
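
Lines like the ones above could be produced with a small formatter such as the sketch below. `format_line` is a hypothetical name, and the NER tags in the WORD-based sample are illustrative data, not taken from a real corpus.

```python
import json


def format_line(sentence, tags):
    """Hypothetical helper: build one dataset.txt line (without the newline)."""
    if isinstance(sentence, str):
        sentence_field = sentence  # CHAR-based: the raw string is enough
    else:
        sentence_field = json.dumps(list(sentence), ensure_ascii=False)
    return sentence_field + "\t" + json.dumps(tags, ensure_ascii=False)


char_line = format_line("市领导", ["B", "B", "I"])
word_line = format_line(["市", "领导", "到", "成都"], ["O", "O", "O", "B-LOC"])
```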
