
Preprocess the data

usage: preprocess.py [-h] --corpus_path CORPUS_PATH
                     [--dataset_path DATASET_PATH]
                     [--tokenizer {bert,bpe,char,space,xlmroberta}]
                     [--vocab_path VOCAB_PATH] [--merges_path MERGES_PATH]
                     [--spm_model_path SPM_MODEL_PATH]
                     [--tgt_tokenizer {bert,bpe,char,space,xlmroberta}]
                     [--tgt_vocab_path TGT_VOCAB_PATH]
                     [--tgt_merges_path TGT_MERGES_PATH]
                     [--tgt_spm_model_path TGT_SPM_MODEL_PATH]
                     [--processes_num PROCESSES_NUM]
                     [--data_processor {bert,lm,mlm,bilm,albert,mt,t5,cls,prefixlm,gsg,bart}]
                     [--docs_buffer_size DOCS_BUFFER_SIZE]
                     [--seq_length SEQ_LENGTH]
                     [--tgt_seq_length TGT_SEQ_LENGTH]
                     [--dup_factor DUP_FACTOR]
                     [--short_seq_prob SHORT_SEQ_PROB] [--full_sentences]
                     [--seed SEED] [--dynamic_masking] [--whole_word_masking]
                     [--span_masking] [--span_geo_prob SPAN_GEO_PROB]
                     [--span_max_length SPAN_MAX_LENGTH]
                     [--sentence_selection_strategy {lead,random}]

Users have to preprocess the corpus before pre-training. An example of pre-processing on a single machine:

python3 preprocess.py --corpus_path corpora/book_review_bert.txt --vocab_path models/google_zh_vocab.txt \
                      --dataset_path dataset.pt --processes_num 8 --dynamic_masking --data_processor bert

The output of the pre-processing stage is dataset.pt (--dataset_path), which is the input of pretrain.py. If multiple machines are available, users can run preprocess.py on one machine and copy dataset.pt to the other machines.
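
For example, assuming SSH access to the other machines (the hostname and path below are placeholders), dataset.pt can be copied with scp:

scp dataset.pt user@other-machine:/path/to/UER-py/dataset.pt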

We need to specify the format of the dataset.pt generated in the pre-processing stage (--data_processor), since different pre-training models require different data formats in the pre-training stage. Currently, UER-py supports formats for a wide range of pre-training models, for example:

  • lm: language model
  • mlm: masked language model
  • cls: classification
  • bilm: bi-directional language model
  • bert: masked language model + next sentence prediction
  • albert: masked language model + sentence order prediction
  • prefixlm: prefix language model

Notice that the corpus (--corpus_path) must be in a format that matches the specified --data_processor. More use cases can be found in Pretraining model examples.
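
For instance, a plain language model uses the lm processor instead of bert. The command below is an illustrative sketch; corpora/book_review.txt stands for any corpus whose format matches the lm processor, and the remaining options follow the BERT example above:

python3 preprocess.py --corpus_path corpora/book_review.txt --vocab_path models/google_zh_vocab.txt \
                      --dataset_path dataset.pt --processes_num 8 --data_processor lm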

--processes_num n denotes that n processes are used for pre-processing. More processes speed up the pre-processing stage but consume more memory.
--dup_factor specifies how many times each instance is duplicated (when static masking is used). Static masking is used in BERT; the masked words are determined in the pre-processing stage.
--dynamic_masking denotes that words are masked during the pre-training stage, which is used in RoBERTa. Dynamic masking performs better and produces a smaller output file (--dataset_path), since instances do not have to be duplicated.
--full_sentences allows a sample to include content from multiple documents, which is used in RoBERTa.
--span_masking denotes masking consecutive words, which is used in SpanBERT. If dynamic masking is used, --span_masking should be specified in the pre-training stage; otherwise, it should be specified in the pre-processing stage.
--docs_buffer_size specifies the buffer size in memory during the pre-processing stage.
The sequence length is specified in the pre-processing stage by --seq_length. The default value is 128. When doing incremental pre-training upon an existing pre-trained model, --seq_length should be smaller than the maximum sequence length the pre-trained model supports (--max_seq_length).
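
As an illustration, several of the options above can be combined in a single RoBERTa-style run (dynamic masking, samples spanning document boundaries, a longer sequence length). The corpus path and values below are only examples:

python3 preprocess.py --corpus_path corpora/book_review.txt --vocab_path models/google_zh_vocab.txt \
                      --dataset_path dataset.pt --processes_num 8 --seq_length 512 \
                      --dynamic_masking --full_sentences --data_processor mlm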

The vocabulary and tokenizer are also specified in the pre-processing stage. More details are discussed in the Tokenization and vocabulary section.
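
For example, the usage above shows that a SentencePiece model can be supplied via --spm_model_path instead of a vocabulary file. A sketch, assuming a SentencePiece model exists at the placeholder path models/example_spm.model:

python3 preprocess.py --corpus_path corpora/book_review_bert.txt --spm_model_path models/example_spm.model \
                      --dataset_path dataset.pt --processes_num 8 --dynamic_masking --data_processor bert

See the Tokenization and vocabulary section for the exact combinations of --tokenizer, --vocab_path, --merges_path, and --spm_model_path.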
