Han Transformers

This project provides pre-trained models for ancient Chinese NLP tasks, including language modeling, word segmentation, and part-of-speech tagging.

Our paper has been accepted to ROCLING 2022! Please check out our paper.

Dependencies

  • transformers ≤ 4.15.0
  • pytorch

Models

We have uploaded our models to the Hugging Face Hub.

Training Corpus

The copyright of the datasets belongs to the Institute of Linguistics, Academia Sinica.

Usage

Installation

pip install transformers==4.15.0
pip install torch==1.10.2

Inference

  • Pre-trained Language Model

    You can use ckiplab/bert-base-han-chinese directly with a pipeline for masked language modeling.

    from transformers import pipeline
    
    # Initialize 
    unmasker = pipeline('fill-mask', model='ckiplab/bert-base-han-chinese')
    
    # Input text with [MASK]
    unmasker("黎[MASK]於變時雍。")
    
    # output
    [{'sequence': '黎 民 於 變 時 雍 。',
    'score': 0.14885780215263367,
    'token': 3696,
    'token_str': '民'},
    {'sequence': '黎 庶 於 變 時 雍 。',
    'score': 0.0859643816947937,
    'token': 2433,
    'token_str': '庶'},
    {'sequence': '黎 氏 於 變 時 雍 。',
    'score': 0.027848130092024803,
    'token': 3694,
    'token_str': '氏'},
    {'sequence': '黎 人 於 變 時 雍 。',
    'score': 0.023678112775087357,
    'token': 782,
    'token_str': '人'},
    {'sequence': '黎 生 於 變 時 雍 。',
    'score': 0.018718384206295013,
    'token': 4495,
    'token_str': '生'}]
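
    Under the hood, the pipeline reads the masked-LM head's logits at the [MASK] position. The short sketch below reproduces the top-five predictions manually; it is our illustration using the standard AutoModelForMaskedLM interface, not part of the original examples.

    import torch
    from transformers import AutoTokenizer, AutoModelForMaskedLM

    # Sketch (our addition): score candidates for the [MASK] position directly
    # from the masked-LM head instead of going through the pipeline.
    tokenizer = AutoTokenizer.from_pretrained("ckiplab/bert-base-han-chinese")
    mlm = AutoModelForMaskedLM.from_pretrained("ckiplab/bert-base-han-chinese")

    inputs = tokenizer("黎[MASK]於變時雍。", return_tensors="pt")
    mask_pos = (inputs.input_ids == tokenizer.mask_token_id).nonzero()[0, 1]

    with torch.no_grad():
        logits = mlm(**inputs).logits

    probs = logits[0, mask_pos].softmax(dim=-1)
    top = probs.topk(5)
    for score, token_id in zip(top.values, top.indices):
        print(tokenizer.decode(token_id), float(score))   # e.g. 民 0.1488...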

    You can use ckiplab/bert-base-han-chinese to get the features of a given text in PyTorch.

    from transformers import AutoTokenizer, AutoModel
    
    # Initialize tokenzier and model
    tokenizer = AutoTokenizer.from_pretrained("ckiplab/bert-base-han-chinese")
    model = AutoModel.from_pretrained("ckiplab/bert-base-han-chinese")
    
    # Input text
    text = "黎民於變時雍。"
    encoded_input = tokenizer(text, return_tensors='pt')
    output = model(**encoded_input)
    
    # get encoded token vectors
    output.last_hidden_state    # torch.Tensor with Size([1, 9, 768])
    
    # get encoded sentence vector
    output.pooler_output        # torch.Tensor with Size([1, 768])
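
    As an alternative to pooler_output, a sentence vector can also be obtained by mean-pooling the token features. The pooling rule below is our illustration, not something prescribed by the project; it reuses encoded_input and output from the snippet above.

    # Sketch (our addition): mean-pool the token vectors into one sentence
    # vector, ignoring padded positions via the attention mask.
    mask = encoded_input["attention_mask"].unsqueeze(-1).float()   # [1, 9, 1]
    sentence_vector = (output.last_hidden_state * mask).sum(1) / mask.sum(1)
    sentence_vector.shape    # torch.Size([1, 768])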
  • Word Segmentation (WS)

    In WS, ckiplab/bert-base-han-chinese-ws divides the written text into meaningful units (words). The task is formulated as labeling each character as either the beginning of a word (B) or inside a word (I). A small sketch that groups these labels back into words follows the example output below.

    from transformers import pipeline
    
    # Initialize
    classifier = pipeline("token-classification", model="ckiplab/bert-base-han-chinese-ws")
    
    # Input text
    classifier("帝堯曰放勳")
    
    # output
    [{'entity': 'B',
    'score': 0.9999793,
    'index': 1,
    'word': '帝',
    'start': 0,
    'end': 1},
    {'entity': 'I',
    'score': 0.9915047,
    'index': 2,
    'word': '堯',
    'start': 1,
    'end': 2},
    {'entity': 'B',
    'score': 0.99992275,
    'index': 3,
    'word': '曰',
    'start': 2,
    'end': 3},
    {'entity': 'B',
    'score': 0.99905187,
    'index': 4,
    'word': '放',
    'start': 3,
    'end': 4},
    {'entity': 'I',
    'score': 0.96299917,
    'index': 5,
    'word': '勳',
    'start': 4,
    'end': 5}]
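
    As mentioned above, the B and I labels can be grouped back into words. The helper below, group_words, is our illustration (not part of the released package) and assumes the output format shown in the example.

    # Sketch (our addition): merge the per-character B/I labels into words.
    def group_words(tokens):
        words = []
        for t in tokens:
            if t["entity"] == "B" or not words:
                words.append(t["word"])      # B starts a new word
            else:
                words[-1] += t["word"]       # I extends the current word
        return words

    group_words(classifier("帝堯曰放勳"))
    # expected: ['帝堯', '曰', '放勳']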
  • Part-of-Speech (PoS) Tagging

    In PoS tagging, ckiplab/bert-base-han-chinese-pos recognizes parts of speech in a given text. The task is formulated as labeling each character with a part-of-speech tag. A sketch that combines the WS and PoS taggers into word-level tags follows this list.

    from transformers import pipeline
    
    # Initialize
    classifier = pipeline("token-classification", model="ckiplab/bert-base-han-chinese-pos")
    
    # Input text
    classifier("帝堯曰放勳")
    
    # output
    [{'entity': 'NB1',
    'score': 0.99410427,
    'index': 1,
    'word': '帝',
    'start': 0,
    'end': 1},
    {'entity': 'NB1',
    'score': 0.98874336,
    'index': 2,
    'word': '堯',
    'start': 1,
    'end': 2},
    {'entity': 'VG',
    'score': 0.97059363,
    'index': 3,
    'word': '曰',
    'start': 2,
    'end': 3},
    {'entity': 'NB1',
    'score': 0.9864504,
    'index': 4,
    'word': '放',
    'start': 3,
    'end': 4},
    {'entity': 'NB1',
    'score': 0.9543974,
    'index': 5,
    'word': '勳',
    'start': 4,
    'end': 5}]
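
As noted above, the WS and PoS taggers can be combined into word-level tags. The sketch below is our illustration, not an official utility: it segments the text with the WS model and keeps, for each word, the tag that the PoS model predicts for the word's first character (this first-character aggregation rule is an assumption of the sketch).

from transformers import pipeline

# Sketch (our addition): word-level PoS tags from the two per-character taggers.
ws = pipeline("token-classification", model="ckiplab/bert-base-han-chinese-ws")
pos = pipeline("token-classification", model="ckiplab/bert-base-han-chinese-pos")

def tag_words(text):
    words, tags = [], []
    for ws_tok, pos_tok in zip(ws(text), pos(text)):
        if ws_tok["entity"] == "B" or not words:
            words.append(ws_tok["word"])
            tags.append(pos_tok["entity"])   # tag of the word's first character
        else:
            words[-1] += ws_tok["word"]
    return list(zip(words, tags))

print(tag_words("帝堯曰放勳"))
# expected: [('帝堯', 'NB1'), ('曰', 'VG'), ('放勳', 'NB1')]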

Model Performance

Pre-trained Language Model, Perplexity ↓

The columns 上古, 中古, 近代, and 現代 are the eras of the MLM testing data.

| Language Model | MLM Training Data | 上古 | 中古 | 近代 | 現代 |
| --- | --- | --- | --- | --- | --- |
| ckiplab/bert-base-han-chinese | 上古 | 24.7588 | 87.8176 | 297.1111 | 60.3993 |
| | 中古 | 67.861 | 70.6244 | 133.0536 | 23.0125 |
| | 近代 | 69.1364 | 77.4154 | 46.8308 | 20.4289 |
| | 現代 | 118.8596 | 163.6896 | 146.5959 | 4.6143 |
| | Merge | 31.1807 | 61.2381 | 49.0672 | 4.5017 |
| ckiplab/bert-base-chinese | - | 233.6394 | 405.9008 | 278.7069 | 8.8521 |

Word Segmentation (WS), F1 score (%) ↑

The columns 上古, 中古, 近代, and 現代 are the eras of the testing data.

| WS Model | Training Data | 上古 | 中古 | 近代 | 現代 |
| --- | --- | --- | --- | --- | --- |
| ckiplab/bert-base-han-chinese-ws | 上古 | 97.6090 | 88.5734 | 83.2877 | 70.3772 |
| | 中古 | 92.6402 | 92.6538 | 89.4803 | 78.3827 |
| | 近代 | 90.8651 | 92.1861 | 94.6495 | 81.2143 |
| | 現代 | 87.0234 | 83.5810 | 84.9370 | 96.9446 |
| | Merge | 97.4537 | 91.9990 | 94.0970 | 96.7314 |
| ckiplab/bert-base-chinese-ws | - | 86.5698 | 82.9115 | 84.3213 | 98.1325 |

Part-of-Speech (POS) Tagging, F1 score (%) ↑

The columns 上古, 中古, 近代, and 現代 are the eras of the testing data.

| POS Model | Training Data | 上古 | 中古 | 近代 | 現代 |
| --- | --- | --- | --- | --- | --- |
| ckiplab/bert-base-han-chinese-pos | 上古 | 91.2945 | - | - | - |
| | 中古 | 7.3662 | 80.4896 | 11.3371 | 10.2577 |
| | 近代 | 6.4794 | 14.3653 | 88.6580 | 0.5316 |
| | 現代 | 11.9895 | 11.0775 | 0.4033 | 93.2813 |
| | Merge | 88.8772 | 42.4369 | 86.9093 | 92.9012 |

License

Copyright (c) 2022 CKIP Lab under the GPL-3.0 License.

Citation

Please cite our paper if you use Han-Transformers in your work:

@inproceedings{lin-ma-2022-hantrans,
    title = "{H}an{T}rans: An Empirical Study on Cross-Era Transferability of {C}hinese Pre-trained Language Model",
    author = "Lin, Chin-Tung  and  Ma, Wei-Yun",
    booktitle = "Proceedings of the 34th Conference on Computational Linguistics and Speech Processing (ROCLING 2022)",
    year = "2022",
    address = "Taipei, Taiwan",
    publisher = "The Association for Computational Linguistics and Chinese Language Processing (ACLCLP)",
    url = "https://aclanthology.org/2022.rocling-1.21",
    pages = "164--173",
}