corpus structure and format

jidasheng edited this page Dec 6, 2019 · 2 revisions
  • corpus structure
    corpus_dir
        vocab.json
        tags.json
        dataset.txt
    
    • all files are UTF-8 encoded
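The layout above can be read with a few lines of Python. This is only a minimal sketch, not the library's own loader: the helper name `load_corpus_files` is hypothetical, and it assumes nothing beyond the three file names listed above and UTF-8 encoding.

```python
import json
from pathlib import Path


def load_corpus_files(corpus_dir):
    """Read vocab.json, tags.json and dataset.txt from corpus_dir (hypothetical helper)."""
    corpus_dir = Path(corpus_dir)
    # All three files are expected to be UTF-8 encoded.
    vocab = json.loads((corpus_dir / "vocab.json").read_text(encoding="utf-8"))
    tags = json.loads((corpus_dir / "tags.json").read_text(encoding="utf-8"))
    raw = (corpus_dir / "dataset.txt").read_text(encoding="utf-8")
    # Keep one entry per non-empty line; parsing each line is left to the caller.
    lines = [line for line in raw.split("\n") if line.strip()]
    return vocab, tags, lines
```

Passing `encoding="utf-8"` explicitly avoids depending on the platform default encoding.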
  • vocab.json
    • a list of unique CHARs or WORDs that define the vocabulary
    • chars/words that are not in the vocabulary will be replaced by UNKNOWN
    • examples
      • CHAR-based: ["市", "领", "导", "到", "成", "都", ...]
      • WORD-based: ["市", "领导", "到", "成都", ...]
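As an illustration of the two points above, the following sketch builds a vocabulary list from raw sentences and shows what replacing out-of-vocabulary items looks like. The helpers `build_vocab` and `replace_unknown` are hypothetical, and the literal placeholder string `"UNKNOWN"` is an assumption; the library may use a different internal token.

```python
import json


def build_vocab(sentences):
    """Collect unique chars/words in first-seen order (hypothetical helper)."""
    seen = {}
    for sentence in sentences:
        # Iterating a string yields chars; iterating a list yields words.
        for token in sentence:
            seen.setdefault(token, None)
    return list(seen)


def replace_unknown(sentence, vocab, unknown="UNKNOWN"):
    """Replace items missing from vocab; the placeholder value is an assumption."""
    known = set(vocab)
    return [token if token in known else unknown for token in sentence]


char_vocab = build_vocab(["市领导到成都"])
word_vocab = build_vocab([["市", "领导", "到", "成都"]])

# ensure_ascii=False keeps the characters readable in vocab.json.
vocab_json = json.dumps(char_vocab, ensure_ascii=False)
```

The same `build_vocab` call covers both cases because the only difference between CHAR-based and WORD-based input is whether a sentence is a string or a list of strings.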
  • tags.json
    • a list of tags
      • the tags can be any tag set, in any order, with no constraints
      • the only thing to watch is the begin_tags argument when predicting sequences:
        from bi_lstm_crf.app import WordsTagger
        
        model = WordsTagger(model_dir="xxx")
        tags, sequences = model(["市领导到成都..."], begin_tags="BS") 
        
        print(tags)  
        # [["B", "B", "I", "B", "B-LOC", "I-LOC", "I-LOC", "I-LOC", "I-LOC", "B", "I", "B", "I"]]
        
        print(sequences)
        # [['市', '领导', '到', ('成都', 'LOC'), ...]]
        • the begin_tags argument is used to convert the tags to sequences: it lists the tag letters that mark the start of a sequence
        • most of the time the default value "BS" is right, but:
          • when you are using the BMEWO format (B(Begin), M(Middle), E(End), W(Word), O(Outside)), begin_tags should be set to "BW"
    • examples
      • WORD SEGMENTATION: ["B", "I"]
      • NER: ["O", "B-ORG", "I-ORG", ...]
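To make the role of begin_tags more concrete, here is a rough sketch of how tags could be grouped into sequences. It is not the library's implementation: `tags_to_sequences` is a hypothetical function, and treating every "O" tag as its own sequence is an assumption made for the illustration.

```python
def tags_to_sequences(tokens, tags, begin_tags="BS"):
    """Group tokens into sequences, starting a new one at each begin tag.

    Hypothetical sketch: a tag starts a new sequence when its first letter is in
    begin_tags, or when it is "O" (an assumption); otherwise the token is
    appended to the previous sequence.
    """
    sequences = []
    for token, tag in zip(tokens, tags):
        # "B-LOC" -> label "LOC"; a bare "B" has an empty label.
        _, _, label = tag.partition("-")
        if tag[0] in begin_tags or tag[0] == "O" or not sequences:
            sequences.append([token, label])
        else:
            sequences[-1][0] += token
    # Labeled sequences become (text, label) tuples, unlabeled ones stay strings.
    return [(text, label) if label else text for text, label in sequences]


tokens = list("市领导到成都")
bi_result = tags_to_sequences(tokens, ["B", "B", "I", "B", "B-LOC", "I-LOC"])
bmewo_tags = ["W", "B", "E", "W", "B", "E"]
bmewo_right = tags_to_sequences(tokens, bmewo_tags, begin_tags="BW")
bmewo_wrong = tags_to_sequences(tokens, bmewo_tags, begin_tags="BS")
```

With BMEWO tags and the default "BS", the single-word tag "W" is not recognized as a begin tag, so in this sketch "到" is merged into the preceding word; setting begin_tags to "BW" keeps it separate.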
  • dataset.txt
    • format
      [sentence][\tab][tags]
      ...
      
      • [\tab] stands for a single tab character
      • the [sentence] should be a string or a list of strings
      • for CHAR-based, a string is enough to represent a sentence
    • examples
      • CHAR-based
        市领导到成都...    ["B", "B", "I", "B", "B", "I", ...]
        ...
        
      • WORD-based
        ["市", "领导", "到", "成都", ...]   ["B", "B", "I", "B", "B", "I", ...]
        ...
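The line format described above can be written and read back roughly as follows. This is a sketch under stated assumptions rather than the library's parser: `format_line` and `parse_line` are hypothetical helpers, a leading "[" is assumed to mark a WORD-based (JSON list) sentence, and the check for one tag per char/word is also an assumption.

```python
import json


def format_line(sentence, tags):
    """Serialize one example as: sentence, a tab, then the tags as a JSON list."""
    if not isinstance(sentence, str):
        # WORD-based: the sentence itself is written as a JSON list of strings.
        sentence = json.dumps(list(sentence), ensure_ascii=False)
    return sentence + "\t" + json.dumps(list(tags), ensure_ascii=False)


def parse_line(line):
    """Split one dataset.txt line back into (sentence, tags)."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) != 2:
        raise ValueError("expected exactly one tab, got {}".format(len(fields) - 1))
    sentence, tags = fields
    # Assumption: a leading "[" means a JSON list of words. A CHAR-based
    # sentence that itself starts with "[" would be misread by this sketch.
    if sentence.startswith("["):
        sentence = json.loads(sentence)
    tags = json.loads(tags)
    # Assumption: one tag per char (CHAR-based) or per word (WORD-based).
    if len(sentence) != len(tags):
        raise ValueError("sentence length and tags length differ")
    return sentence, tags


char_line = format_line("市领导到成都", ["B", "B", "I", "B", "B", "I"])
word_line = format_line(["市", "领导", "到", "成都"], ["O", "O", "O", "B-LOC"])
```

Writing with `ensure_ascii=False` keeps the file human-readable, which matches the examples shown above.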
        