Modelzoo

With the help of UER, we have pre-trained models with different properties (for example, models based on different corpora, encoders, and targets). All pre-trained weights introduced in this section are in UER format and can be loaded by UER directly. More pre-trained weights will be released in the near future. Unless otherwise noted, Chinese pre-trained models use the BERT tokenizer and models/google_zh_vocab.txt as the vocabulary (which is used in the original BERT project), and models/bert/base_config.json as the configuration file by default. Commonly used vocabulary and configuration files are included in the models/ folder, so users do not need to download them. In addition, we use scripts/convert_xxx_from_uer_to_huggingface.py to convert pre-trained weights into the format supported by Huggingface Transformers and upload them to the Huggingface model hub (uer). In the rest of this section, we provide download links for the pre-trained weights and show how to use them. Notice that, due to space constraints, more details about each pre-trained weight are given on its Huggingface model hub page; we provide the corresponding link when we introduce the weight.
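
For example, a RoBERTa weight can be converted with the corresponding conversion script before it is uploaded. The command below is only a sketch: the input file name is a placeholder, and depending on the UER version an additional flag specifying the pre-training target may be required.

python3 scripts/convert_bert_from_uer_to_huggingface.py --input_model_path models/cluecorpussmall_roberta_base_seq512_model.bin \
                                                        --output_model_path pytorch_model.bin \
                                                        --layers_num 12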

Chinese RoBERTa Pre-trained Weights

This is the set of 24 Chinese RoBERTa weights. CLUECorpusSmall is used as the training corpus. Configuration files are in the models/bert/ folder. We only provide configuration files for the Tiny, Mini, Small, Medium, Base, and Large models. To load a model of another size, modify emb_size, feedforward_size, hidden_size, heads_num, and layers_num in the configuration file. Notice that emb_size = hidden_size, feedforward_size = 4 * hidden_size, and heads_num = hidden_size / 64. More details of these pre-trained weights are discussed here.
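
For example, to load the L=10/H=256 weight in the table below, the derived values are feedforward_size = 4 * 256 = 1024 and heads_num = 256 / 64 = 4. A partial configuration sketch showing only these fields (all other fields stay the same as in an existing file such as models/bert/mini_config.json):

{
    "emb_size": 256,
    "feedforward_size": 1024,
    "hidden_size": 256,
    "heads_num": 4,
    "layers_num": 10
}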

The pre-trained Chinese weight links of different layers (L) and hidden sizes (H):

|      | H=128        | H=256        | H=512          | H=768         |
| ---- | ------------ | ------------ | -------------- | ------------- |
| L=2  | 2/128 (Tiny) | 2/256        | 2/512          | 2/768         |
| L=4  | 4/128        | 4/256 (Mini) | 4/512 (Small)  | 4/768         |
| L=6  | 6/128        | 6/256        | 6/512          | 6/768         |
| L=8  | 8/128        | 8/256        | 8/512 (Medium) | 8/768         |
| L=10 | 10/128       | 10/256       | 10/512         | 10/768        |
| L=12 | 12/128       | 12/256       | 12/512         | 12/768 (Base) |

Take the Tiny weight as an example: we download the Tiny weight through the above link and put it in the models/ folder. We can either conduct further pre-training upon it:

python3 preprocess.py --corpus_path corpora/book_review.txt --vocab_path models/google_zh_vocab.txt \
                      --dataset_path dataset.pt --processes_num 8 --data_processor mlm

python3 pretrain.py --dataset_path dataset.pt --pretrained_model_path models/cluecorpussmall_roberta_tiny_seq512_model.bin \
                    --vocab_path models/google_zh_vocab.txt --config_path models/bert/tiny_config.json \
                    --output_model_path models/output_model.bin \
                    --world_size 8 --gpu_ranks 0 1 2 3 4 5 6 7 \
                    --total_steps 5000 --save_checkpoint_steps 2500 --batch_size 64 \
                    --data_processor mlm --target mlm

or use it on a downstream classification dataset:

python3 finetune/run_classifier.py --pretrained_model_path models/cluecorpussmall_roberta_tiny_seq512_model.bin \
                                   --vocab_path models/google_zh_vocab.txt --config_path models/bert/tiny_config.json \
                                   --train_path datasets/douban_book_review/train.tsv \
                                   --dev_path datasets/douban_book_review/dev.tsv \
                                   --test_path datasets/douban_book_review/test.tsv \
                                   --learning_rate 3e-4 --epochs_num 8 --batch_size 64

In the fine-tuning stage, pre-trained models of different sizes usually require different hyper-parameters. An example of using grid search to find the best hyper-parameters:

python3 finetune/run_classifier_grid.py --pretrained_model_path models/cluecorpussmall_roberta_tiny_seq512_model.bin \
                                        --vocab_path models/google_zh_vocab.txt \
                                        --config_path models/bert/tiny_config.json \
                                        --train_path datasets/douban_book_review/train.tsv \
                                        --dev_path datasets/douban_book_review/dev.tsv \
                                        --learning_rate_list 3e-5 1e-4 3e-4 --epochs_num_list 3 5 8 --batch_size_list 32 64

We can reproduce the experimental results reported here through the above grid search script.

Chinese word-based RoBERTa Pre-trained Weights

This is the set of 5 Chinese word-based RoBERTa weights. CLUECorpusSmall is used as the training corpus. Configuration files are in the models/bert/ folder. Google sentencepiece is used as the tokenizer, with models/cluecorpussmall_spm.model as the sentencepiece model. Most Chinese pre-trained weights are character-based. Compared with character-based models, word-based models are faster (because of shorter sequence lengths) and perform better according to our experimental results. More details of these pre-trained weights are discussed here.

The pre-trained Chinese weight links of different sizes:

| Link              |
| ----------------- |
| L=2/H=128 (Tiny)  |
| L=4/H=256 (Mini)  |
| L=4/H=512 (Small) |
| L=8/H=512 (Medium) |
| L=12/H=768 (Base) |

Take the word-based Tiny weight as an example: we download the word-based Tiny weight through the above link and put it in the models/ folder. We can either conduct further pre-training upon it:

python3 preprocess.py --corpus_path corpora/book_review.txt --spm_model_path models/cluecorpussmall_spm.model \
                      --dataset_path dataset.pt --processes_num 8 --data_processor mlm

python3 pretrain.py --dataset_path dataset.pt --pretrained_model_path models/cluecorpussmall_word_roberta_tiny_seq512_model.bin \
                    --spm_model_path models/cluecorpussmall_spm.model --config_path models/bert/tiny_config.json \
                    --output_model_path models/output_model.bin \
                    --world_size 8 --gpu_ranks 0 1 2 3 4 5 6 7 \
                    --total_steps 5000 --save_checkpoint_steps 2500 --batch_size 64 \
                    --data_processor mlm --target mlm

or use it on a downstream classification dataset:

python3 finetune/run_classifier.py --pretrained_model_path models/cluecorpussmall_word_roberta_tiny_seq512_model.bin \
                                   --spm_model_path models/cluecorpussmall_spm.model \
                                   --config_path models/bert/tiny_config.json \
                                   --train_path datasets/douban_book_review/train.tsv \
                                   --dev_path datasets/douban_book_review/dev.tsv \
                                   --test_path datasets/douban_book_review/test.tsv \
                                   --learning_rate 3e-4 --epochs_num 8 --batch_size 64

An example of using grid search to find the best hyper-parameters for the word-based model:

python3 finetune/run_classifier_grid.py --pretrained_model_path models/cluecorpussmall_word_roberta_tiny_seq512_model.bin \
                                        --spm_model_path models/cluecorpussmall_spm.model \
                                        --config_path models/bert/tiny_config.json \
                                        --train_path datasets/douban_book_review/train.tsv \
                                        --dev_path datasets/douban_book_review/dev.tsv \
                                        --learning_rate_list 3e-5 1e-4 3e-4 --epochs_num_list 3 5 8 --batch_size_list 32 64

We can reproduce the experimental results reported here through the above grid search script.

Chinese GPT-2 Pre-trained Weights

This is the set of Chinese GPT-2 pre-trained weights. Configuration files are in the models/gpt2/ folder.

The link and detailed description (Huggingface model hub) of different pre-trained GPT-2 weights:

| Model link | Description link |
| ---------- | ---------------- |
| CLUECorpusSmall GPT-2 | https://huggingface.co/uer/gpt2-chinese-cluecorpussmall |
| CLUECorpusSmall GPT-2-distil | https://huggingface.co/uer/gpt2-distil-chinese-cluecorpussmall |
| Poem GPT-2 | https://huggingface.co/uer/gpt2-chinese-poem |
| Couplet GPT-2 | https://huggingface.co/uer/gpt2-chinese-couplet |
| Lyric GPT-2 | https://huggingface.co/uer/gpt2-chinese-lyric |
| Ancient GPT-2 | https://huggingface.co/uer/gpt2-chinese-ancient |

Notice that extended vocabularies (models/google_zh_poem_vocab.txt and models/google_zh_ancient_vocab.txt) are used in the Poem and Ancient GPT-2 models. The CLUECorpusSmall GPT-2-distil model uses the models/gpt2/distil_config.json configuration file; models/gpt2/config.json is used for the other weights.

Take the CLUECorpusSmall GPT-2-distil weight as an example: we download the weight through the above link and put it in the models/ folder. We can either conduct further pre-training upon it:

python3 preprocess.py --corpus_path corpora/book_review.txt \
                      --vocab_path models/google_zh_vocab.txt \
                      --dataset_path dataset.pt --processes_num 8 \
                      --seq_length 128 --data_processor lm 

python3 pretrain.py --dataset_path dataset.pt \
                    --pretrained_model_path models/cluecorpussmall_gpt2_distil_seq1024_model.bin \
                    --vocab_path models/google_zh_vocab.txt \
                    --config_path models/gpt2/distil_config.json \
                    --output_model_path models/book_review_gpt2_model.bin \
                    --world_size 8 --gpu_ranks 0 1 2 3 4 5 6 7 \
                    --total_steps 10000 --save_checkpoint_steps 5000 --report_steps 1000 \
                    --learning_rate 5e-5 --batch_size 64

or use it on a downstream classification dataset:

python3 finetune/run_classifier.py --pretrained_model_path models/cluecorpussmall_gpt2_distil_seq1024_model.bin \
                                   --vocab_path models/google_zh_vocab.txt \
                                   --config_path models/gpt2/distil_config.json \
                                   --train_path datasets/douban_book_review/train.tsv \
                                   --dev_path datasets/douban_book_review/dev.tsv \
                                   --test_path datasets/douban_book_review/test.tsv \
                                   --learning_rate 3e-5 --epochs_num 8 --batch_size 64

The GPT-2 model can be used for text generation. First of all, we create story_beginning.txt and enter the beginning of the text. Then we use scripts/generate_lm.py to do text generation:

python3 scripts/generate_lm.py --load_model_path models/cluecorpussmall_gpt2_distil_seq1024_model.bin \
                               --vocab_path models/google_zh_vocab.txt \
                               --config_path models/gpt2/distil_config.json \
                               --test_path story_beginning.txt --prediction_path story_full.txt \
                               --seq_length 128

Chinese ALBERT Pre-trained Weights

This is the set of Chinese ALBERT pre-trained weights. Configuration files are in the models/albert/ folder.

The link and detailed description (Huggingface model hub) of different pre-trained ALBERT weights:

| Model link | Description link |
| ---------- | ---------------- |
| CLUECorpusSmall ALBERT-base | https://huggingface.co/uer/albert-base-chinese-cluecorpussmall |
| CLUECorpusSmall ALBERT-large | https://huggingface.co/uer/albert-large-chinese-cluecorpussmall |

Take the CLUECorpusSmall ALBERT-base weight as an example: we download the weight through the above link and put it in the models/ folder. An example of using ALBERT-base on a downstream dataset:

python3 finetune/run_classifier.py --pretrained_model_path models/cluecorpussmall_albert_base_seq512_model.bin \
                                   --vocab_path models/google_zh_vocab.txt --config_path models/albert/base_config.json \
                                   --train_path datasets/douban_book_review/train.tsv \
                                   --dev_path datasets/douban_book_review/dev.tsv \
                                   --test_path datasets/douban_book_review/test.tsv \
                                   --learning_rate 2e-5 --epochs_num 3 --batch_size 64

python3 inference/run_classifier_infer.py --load_model_path models/finetuned_model.bin \
                                          --vocab_path models/google_zh_vocab.txt \
                                          --config_path models/albert/base_config.json \
                                          --test_path datasets/douban_book_review/test_nolabel.tsv \
                                          --prediction_path datasets/douban_book_review/prediction.tsv \
                                          --labels_num 2

Chinese T5 Pre-trained Weights

This is the set of Chinese T5 pre-trained weights. Configuration files are in the models/t5/ folder.

The link and detailed description (Huggingface model hub) of different pre-trained T5 weights:

| Model link | Description link |
| ---------- | ---------------- |
| CLUECorpusSmall T5-small | https://huggingface.co/uer/t5-small-chinese-cluecorpussmall |
| CLUECorpusSmall T5-base | https://huggingface.co/uer/t5-base-chinese-cluecorpussmall |

Take the CLUECorpusSmall T5-small weight as an example: we download the weight through the above link and put it in the models/ folder. We can conduct further pre-training upon it:

python3 preprocess.py --corpus_path corpora/book_review.txt \
                      --vocab_path models/google_zh_with_sentinel_vocab.txt \
                      --dataset_path dataset.pt \
                      --processes_num 8 --seq_length 128 \
                      --dynamic_masking --data_processor t5

python3 pretrain.py --dataset_path dataset.pt \
                    --pretrained_model_path models/cluecorpussmall_t5_small_seq512_model.bin \
                    --vocab_path models/google_zh_with_sentinel_vocab.txt \
                    --config_path models/t5/small_config.json \
                    --output_model_path models/book_review_t5_model.bin \
                    --world_size 8 --gpu_ranks 0 1 2 3 4 5 6 7 \
                    --total_steps 10000 --save_checkpoint_steps 5000 --report_steps 1000 \
                    --learning_rate 5e-4 --batch_size 64 \
                    --span_masking --span_geo_prob 0.3 --span_max_length 5

or use it on a downstream dataset:

python3 finetune/run_text2text.py --pretrained_model_path models/cluecorpussmall_t5_small_seq512_model.bin \
                                  --vocab_path models/google_zh_with_sentinel_vocab.txt \
                                  --config_path models/t5/small_config.json \
                                  --train_path datasets/tnews_text2text/train.tsv \
                                  --dev_path datasets/tnews_text2text/dev.tsv \
                                  --seq_length 128 --tgt_seq_length 8 --learning_rate 3e-4 --epochs_num 3 --batch_size 32

python3 inference/run_text2text_infer.py --load_model_path models/finetuned_model.bin \
                                         --vocab_path models/google_zh_with_sentinel_vocab.txt \
                                         --config_path models/t5/small_config.json \
                                         --test_path datasets/tnews_text2text/test_nolabel.tsv \
                                         --prediction_path datasets/tnews_text2text/prediction.tsv \
                                         --seq_length 128 --tgt_seq_length 8 --batch_size 32

Users can download the tnews dataset in text2text format from here.

Chinese T5-v1_1 Pre-trained Weights

This is the set of Chinese T5-v1_1 pre-trained weights. Configuration files are in the models/t5-v1_1/ folder.

The link and detailed description (Huggingface model hub) of different pre-trained T5-v1_1 weights:

| Model link | Description link |
| ---------- | ---------------- |
| CLUECorpusSmall T5-v1_1-small | https://huggingface.co/uer/t5-v1_1-small-chinese-cluecorpussmall |
| CLUECorpusSmall T5-v1_1-base | https://huggingface.co/uer/t5-v1_1-base-chinese-cluecorpussmall |

Take the CLUECorpusSmall T5-v1_1-small weight as an example: we download the weight through the above link and put it in the models/ folder. We can conduct further pre-training upon it:

python3 preprocess.py --corpus_path corpora/book_review.txt \
                      --vocab_path models/google_zh_with_sentinel_vocab.txt \
                      --dataset_path dataset.pt \
                      --processes_num 8 --seq_length 128 \
                      --dynamic_masking --data_processor t5

python3 pretrain.py --dataset_path dataset.pt \
                    --pretrained_model_path models/cluecorpussmall_t5-v1_1_small_seq512_model.bin \
                    --vocab_path models/google_zh_with_sentinel_vocab.txt \
                    --config_path models/t5-v1_1/small_config.json \
                    --output_model_path models/book_review_t5-v1_1_model.bin \
                    --world_size 8 --gpu_ranks 0 1 2 3 4 5 6 7 \
                    --total_steps 10000 --save_checkpoint_steps 5000 --report_steps 1000 \
                    --learning_rate 5e-4 --batch_size 64 \
                    --span_masking --span_geo_prob 0.3 --span_max_length 5

or use it on a downstream dataset:

python3 finetune/run_text2text.py --pretrained_model_path models/cluecorpussmall_t5-v1_1_small_seq512_model.bin \
                                  --vocab_path models/google_zh_with_sentinel_vocab.txt \
                                  --config_path models/t5-v1_1/small_config.json \
                                  --train_path datasets/tnews_text2text/train.tsv \
                                  --dev_path datasets/tnews_text2text/dev.tsv \
                                  --seq_length 128 --tgt_seq_length 8 --learning_rate 3e-4 --epochs_num 3 --batch_size 32

python3 inference/run_text2text_infer.py --load_model_path models/finetuned_model.bin \
                                         --vocab_path models/google_zh_with_sentinel_vocab.txt \
                                         --config_path models/t5-v1_1/small_config.json \
                                         --test_path datasets/tnews_text2text/test_nolabel.tsv \
                                         --prediction_path datasets/tnews_text2text/prediction.tsv \
                                         --seq_length 128 --tgt_seq_length 8 --batch_size 32

PEGASUS Pre-trained Weights

These are the PEGASUS pre-trained weights. Configuration files are in the models/pegasus/ folder.

The link and detailed description (Huggingface model hub) of PEGASUS weights:

| Model link | Description link |
| ---------- | ---------------- |
| CLUECorpusSmall PEGASUS-base | https://huggingface.co/uer/pegasus-base-chinese-cluecorpussmall |

BART Pre-trained Weights

These are the BART pre-trained weights. Configuration files are in the models/bart/ folder.

The link and detailed description (Huggingface model hub) of BART weights:

| Model link | Description link |
| ---------- | ---------------- |
| CLUECorpusSmall BART-base | https://huggingface.co/uer/bart-base-chinese-cluecorpussmall |
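
The PEGASUS and BART weights are sequence-to-sequence models, so in principle they can be fine-tuned with the same text-to-text pipeline shown in the T5 sections above. The command below is only a sketch for the BART-base weight: the weight file name and configuration path are assumptions and should be checked against the downloaded files and the description pages.

python3 finetune/run_text2text.py --pretrained_model_path models/cluecorpussmall_bart_base_seq512_model.bin \
                                  --vocab_path models/google_zh_vocab.txt \
                                  --config_path models/bart/base_config.json \
                                  --train_path datasets/tnews_text2text/train.tsv \
                                  --dev_path datasets/tnews_text2text/dev.tsv \
                                  --seq_length 128 --tgt_seq_length 8 --learning_rate 3e-5 --epochs_num 3 --batch_size 32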

Fine-tuned Chinese RoBERTa Weights

This is the set of fine-tuned Chinese RoBERTa weights. All of them use the models/bert/base_config.json configuration file.

The link and detailed description (Huggingface model hub) of different fine-tuned RoBERTa weights:

| Model link | Description link |
| ---------- | ---------------- |
| JD full sentiment classification | https://huggingface.co/uer/roberta-base-finetuned-jd-full-chinese |
| JD binary sentiment classification | https://huggingface.co/uer/roberta-base-finetuned-jd-binary-chinese |
| Dianping sentiment classification | https://huggingface.co/uer/roberta-base-finetuned-dianping-chinese |
| Ifeng news topic classification | https://huggingface.co/uer/roberta-base-finetuned-ifeng-chinese |
| Chinanews news topic classification | https://huggingface.co/uer/roberta-base-finetuned-chinanews-chinese |
| CLUENER2020 NER | https://huggingface.co/uer/roberta-base-finetuned-cluener2020-chinese |
| Extractive QA | https://huggingface.co/uer/roberta-base-chinese-extractive-qa |

One can load these weights for further pre-training, fine-tuning, and inference.
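
For example, a downloaded classification weight in UER format can be used for inference in the same way as the examples above. The sketch below assumes the JD binary sentiment weight has been saved as models/jd_binary_roberta_base_model.bin (a placeholder name) and that the input file follows the classification inference format:

python3 inference/run_classifier_infer.py --load_model_path models/jd_binary_roberta_base_model.bin \
                                          --vocab_path models/google_zh_vocab.txt \
                                          --config_path models/bert/base_config.json \
                                          --test_path datasets/douban_book_review/test_nolabel.tsv \
                                          --prediction_path prediction.tsv \
                                          --labels_num 2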

Chinese Pre-trained Weights Besides Transformer

This is the set of pre-trained weights based on encoders other than the Transformer.

The link and detailed description of different pre-trained weights:

| Model link | Configuration file | Model details | Training details |
| ---------- | ------------------ | ------------- | ---------------- |
| CLUECorpusSmall LSTM language model | models/rnn_config.json | --embedding word --remove_embedding_layernorm --encoder lstm --target lm | steps: 500000; learning rate: 1e-3; batch size: 64*8 (the number of GPUs); sequence length: 256 |
| CLUECorpusSmall GRU language model | models/rnn_config.json | --embedding word --remove_embedding_layernorm --encoder gru --target lm | steps: 500000; learning rate: 1e-3; batch size: 64*8 (the number of GPUs); sequence length: 256 |
| CLUECorpusSmall GatedCNN language model | models/gatedcnn_9_config.json | --embedding word --remove_embedding_layernorm --encoder gatedcnn --target lm | steps: 500000; learning rate: 1e-4; batch size: 64*8 (the number of GPUs); sequence length: 256 |
| CLUECorpusSmall ELMo | models/birnn_config.json | --embedding word --remove_embedding_layernorm --encoder bilstm --target bilm | steps: 500000; learning rate: 5e-4; batch size: 64*8 (the number of GPUs); sequence length: 256 |
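
These weights are used with the command-line options listed in the table rather than the default Transformer settings. Below is a sketch of fine-tuning the LSTM language model on a classification dataset; the weight file name is a placeholder, and the exact set of options (e.g. --pooling) may differ between UER versions:

python3 finetune/run_classifier.py --pretrained_model_path models/cluecorpussmall_lstm_lm_model.bin \
                                   --vocab_path models/google_zh_vocab.txt \
                                   --config_path models/rnn_config.json \
                                   --embedding word --remove_embedding_layernorm --encoder lstm --pooling mean \
                                   --train_path datasets/douban_book_review/train.tsv \
                                   --dev_path datasets/douban_book_review/dev.tsv \
                                   --test_path datasets/douban_book_review/test.tsv \
                                   --learning_rate 1e-3 --epochs_num 5 --batch_size 64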

Chinese Pre-trained Weights from Other Organizations

| Model link | Description | Description link |
| ---------- | ----------- | ---------------- |
| Google Chinese BERT-Base | Configuration file: models/bert/base_config.json; Vocabulary: models/google_zh_vocab.txt; Tokenizer: BertTokenizer | https://github.com/google-research/bert |
| Google Chinese ALBERT-Base | Configuration file: models/albert/base_config.json; Vocabulary: models/google_zh_vocab.txt; Tokenizer: BertTokenizer | https://github.com/google-research/albert |
| Google Chinese ALBERT-Large | Configuration file: models/albert/large_config.json; Vocabulary: models/google_zh_vocab.txt; Tokenizer: BertTokenizer | https://github.com/google-research/albert |
| Google Chinese ALBERT-Xlarge | Configuration file: models/albert/xlarge_config.json; Vocabulary: models/google_zh_vocab.txt; Tokenizer: BertTokenizer | https://github.com/google-research/albert |
| Google Chinese ALBERT-Xxlarge | Configuration file: models/albert/xxlarge_config.json; Vocabulary: models/google_zh_vocab.txt; Tokenizer: BertTokenizer | https://github.com/google-research/albert |
| HFL Chinese BERT-wwm | Configuration file: models/bert/base_config.json; Vocabulary: models/google_zh_vocab.txt; Tokenizer: BertTokenizer | https://github.com/ymcui/Chinese-BERT-wwm |
| HFL Chinese BERT-wwm-ext | Configuration file: models/bert/base_config.json; Vocabulary: models/google_zh_vocab.txt; Tokenizer: BertTokenizer | https://github.com/ymcui/Chinese-BERT-wwm |
| HFL Chinese RoBERTa-wwm-ext | Configuration file: models/bert/base_config.json; Vocabulary: models/google_zh_vocab.txt; Tokenizer: BertTokenizer | https://github.com/ymcui/Chinese-BERT-wwm |
| HFL Chinese RoBERTa-wwm-large-ext | Configuration file: models/bert/large_config.json; Vocabulary: models/google_zh_vocab.txt; Tokenizer: BertTokenizer | https://github.com/ymcui/Chinese-BERT-wwm |
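
Weights from other organizations need to be converted into UER format before UER can load them. UER provides conversion scripts in the scripts/ folder for this purpose. The command below is a sketch assuming a Huggingface-format Chinese BERT-Base checkpoint has been downloaded as huggingface_model.bin; the exact script name and flags may vary across UER versions.

python3 scripts/convert_bert_from_huggingface_to_uer.py --input_model_path huggingface_model.bin \
                                                        --output_model_path models/google_zh_bert_base_model.bin \
                                                        --layers_num 12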

More Pre-trained Weights

Models pre-trained by UER:

| Pre-trained model | Link | Description |
| ----------------- | ---- | ----------- |
| Wikizh(word-based)+BertEncoder+BertTarget | Model: https://share.weiyun.com/5s4HVMi Vocab: https://share.weiyun.com/5NWYbYn | Word-based BERT model pre-trained on Wikizh. Training steps: 500,000 |
| RenMinRiBao+BertEncoder+BertTarget | https://share.weiyun.com/5JWVjSE | The training corpus is news data from People's Daily (1946-2017). |
| Webqa2019+BertEncoder+BertTarget | https://share.weiyun.com/5HYbmBh | The training corpus is WebQA, which is suitable for datasets related to social media, e.g. LCQMC and XNLI. Training steps: 500,000 |
| Weibo+BertEncoder+BertTarget | https://share.weiyun.com/5ZDZi4A | The training corpus is Weibo. |
| Weibo+BertEncoder(large)+MlmTarget | https://share.weiyun.com/CFKyMkp3 | The training corpus is Weibo. The configuration file is bert_large_config.json. |
| Reviews+BertEncoder+MlmTarget | https://share.weiyun.com/tBgaSx77 | The training corpus is reviews. |
| Reviews+BertEncoder(large)+MlmTarget | https://share.weiyun.com/hn7kp9bs | The training corpus is reviews. The configuration file is bert_large_config.json. |
| MixedCorpus+BertEncoder(xlarge)+MlmTarget | https://share.weiyun.com/J9rj9WRB | Pre-trained on a mixed large Chinese corpus. The configuration file is bert_xlarge_config.json. |
| MixedCorpus+BertEncoder(xlarge)+BertTarget(WWM) | https://share.weiyun.com/UsI0OSeR | Pre-trained on a mixed large Chinese corpus. The configuration file is bert_xlarge_config.json. |
| MixedCorpus+BertEncoder(large)+MlmTarget | https://share.weiyun.com/5G90sMJ | Pre-trained on a mixed large Chinese corpus. The configuration file is bert_large_config.json. |
| MixedCorpus+BertEncoder(base)+BertTarget | https://share.weiyun.com/5QOzPqq | Pre-trained on a mixed large Chinese corpus. The configuration file is bert_base_config.json. |
| MixedCorpus+BertEncoder(small)+BertTarget | https://share.weiyun.com/fhcUanfy | Pre-trained on a mixed large Chinese corpus. The configuration file is bert_small_config.json. |
| MixedCorpus+BertEncoder(tiny)+BertTarget | https://share.weiyun.com/yXx0lfUg | Pre-trained on a mixed large Chinese corpus. The configuration file is bert_tiny_config.json. |
| MixedCorpus+GptEncoder+LmTarget | https://share.weiyun.com/51nTP8V | Pre-trained on a mixed large Chinese corpus. Training steps: 500,000 (with sequence length of 128) + 100,000 (with sequence length of 512) |
| Reviews+LstmEncoder+LmTarget | https://share.weiyun.com/57dZhqo | The training corpus is Amazon reviews + JDbinary reviews + Dianping reviews (11.4M reviews in total). The language model target is used. It is suitable for datasets related to reviews and achieves over 5 percent improvement on some review datasets compared with random initialization. Set hidden_size in models/rnn_config.json to 512 before using it. Training steps: 200,000; sequence length: 128 |
| (MixedCorpus & Amazon reviews)+LstmEncoder+(LmTarget & ClsTarget) | https://share.weiyun.com/5B671Ik | Firstly pre-trained on a mixed large Chinese corpus with the LM target, then pre-trained on Amazon reviews with the LM and CLS targets. It is suitable for datasets related to reviews and can achieve results comparable with BERT on some review datasets. Training steps: 500,000 + 100,000; sequence length: 128 |
| IfengNews+BertEncoder+BertTarget | https://share.weiyun.com/5HVcUWO | The training corpus is news data from the Ifeng website. We use news titles to predict news abstracts. Training steps: 100,000; sequence length: 128 |
| jdbinary+BertEncoder+ClsTarget | https://share.weiyun.com/596k2bu | The training corpus is review data from JD (Jingdong). The CLS target is used for pre-training. It is suitable for datasets related to shopping reviews. Training steps: 50,000; sequence length: 128 |
| jdfull+BertEncoder+MlmTarget | https://share.weiyun.com/5L6EkUF | The training corpus is review data from JD (Jingdong). The MLM target is used for pre-training. Training steps: 50,000; sequence length: 128 |
| Amazonreview+BertEncoder+ClsTarget | https://share.weiyun.com/5XuxtFA | The training corpus is review data from Amazon (including book reviews, movie reviews, etc.). The classification target is used for pre-training. It is suitable for datasets related to reviews, e.g. accuracy on the Douban book review dataset is improved from 87.6 to 88.5 (compared with Google BERT). Training steps: 20,000; sequence length: 128 |
| XNLI+BertEncoder+ClsTarget | https://share.weiyun.com/5oXPugA | Infersent with BertEncoder |

MixedCorpus contains baidubaike, Wikizh, WebQA, RenMinRiBao, literature, and reviews.