Pretraining data

zhezhaoa edited this page Aug 25, 2023 · 7 revisions

CLUECorpusSmall

CLUECorpusSmall consists of news, web, wiki, and comment corpora. The original data and a detailed description can be found here.

Corpus                         Link
CLUECorpusSmall                https://share.weiyun.com/sC6PMhxx
CLUECorpusSmall (BERT format)  https://share.weiyun.com/9SPPGUOK
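
For orientation, here is a minimal sketch of reading a corpus in the BERT format. It assumes the common convention of one sentence per line with a blank line between documents; this layout is an assumption and is not stated on this page. The function name `read_bert_format` is hypothetical.

```python
from typing import Iterable, Iterator, List


def read_bert_format(lines: Iterable[str]) -> Iterator[List[str]]:
    """Yield documents as lists of sentences.

    Assumption (not confirmed by this page): one sentence per line,
    with a blank line separating consecutive documents.
    """
    document: List[str] = []
    for line in lines:
        sentence = line.strip()
        if sentence:
            document.append(sentence)
        elif document:
            # A blank line closes the current document.
            yield document
            document = []
    if document:
        # Emit the last document when the input has no trailing blank line.
        yield document
```

Any iterable of lines works as input, for example a file opened with `open(path, encoding="utf-8")`, so the corpus can be streamed without loading it into memory.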

News Commentary v13 (ZH-EN)

News Commentary v13 is a parallel Chinese-English corpus and can be downloaded here.

Corpus                             Link
news-Commentary-v13-en-zh          https://share.weiyun.com/PLMxw6ae
news-Commentary-v13-zh-en          https://share.weiyun.com/5rMwRhDi
news-Commentary-v13-en-zh_sampled  https://share.weiyun.com/1KTxq3Dc
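
A rough sketch of loading the parallel data follows. It assumes each line holds one source sentence and one target sentence separated by a tab. That layout is an assumption, not something this page confirms, so inspect the downloaded files before relying on it. The function name `read_parallel_pairs` is hypothetical.

```python
from typing import Iterable, Iterator, Tuple


def read_parallel_pairs(lines: Iterable[str]) -> Iterator[Tuple[str, str]]:
    """Yield (source, target) sentence pairs.

    Assumption (not confirmed by this page): each line is
    "source<TAB>target". Lines that do not split into exactly
    two non-empty fields are skipped.
    """
    for line in lines:
        parts = line.rstrip("\n").split("\t")
        if len(parts) != 2:
            # Skip blank or malformed lines instead of guessing.
            continue
        source, target = parts[0].strip(), parts[1].strip()
        if source and target:
            yield source, target
```

Skipping malformed lines keeps the sketch short; a stricter loader might count or log them.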