# Corpus Counts

| Subcorpus | docs |
| --- | ---: |
| edeposcorpus.split00.docs.token | 297,478 |
| flashcorpus.split00.docs.token | 273,794 |
| news_2.split00.docs.token | 3,093,543 |
| news_2.split01.docs.token | 2,976,780 |
| news_2.split02.docs.token | 3,100,877 |
| news_2.split03.docs.token | 2,934,816 |
| news_2.split04.docs.token | 2,811,136 |
| news_2.split05.docs.token | 2,671,540 |
| news_2.split06.docs.token | 2,167,264 |
| news_2.split07.docs.token | 2,823,473 |
| news_2.split08.docs.token | 281,078 |
| news_2.split09.docs.token | 2,979,623 |
| news_2.split10.docs.token | 3,105,495 |
| news_2.split11.docs.token | 2,745,166 |
| news_2.split12.docs.token | 3,105,369 |
| news_2.split13.docs.token | 2,888,914 |
| news_2.split14.docs.token | 3,044,163 |
| news_2.split15.docs.token | 2,243,778 |
| news_2.split16.docs.token | 2,652,094 |
| news_2.split17.docs.token | 3,073,893 |
| news_2.split18.docs.token | 2,926,715 |
| nok.split00.docs.token | 1,622,992 |
| offentligt.split00.docs.token | 3,306,624 |
| offentligt.split01.docs.token | 2,198,892 |
| oscar.split00.docs.token | 859,790 |
| oscar.split01.docs.token | 800,000 |
| oscar.split02.docs.token | 881,928 |
| oscar.split03.docs.token | 873,169 |
| oscar.split04.docs.token | 892,183 |
| oscar.split05.docs.token | 882,298 |
| oscar.split06.docs.token | 882,881 |
| oscar.split07.docs.token | 888,328 |
| oscar.split08.docs.token | 904,712 |
| oscar.split09.docs.token | 893,941 |
| oscar.split10.docs.token | 865,024 |
| oscar.split11.docs.token | 903,509 |
| oscar.split12.docs.token | 415,618 |
| runeberg.split00.docs.token | 902,768 |
| tweets.split00.docs.token | 10,442,046 |
| wiki.split00.docs.token | 3,421,795 |
| total | 85,035,487 |
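
Since each subcorpus is spread over several split files, it can help to roll the per-file counts up to one number per subcorpus. The following is a minimal sketch, assuming the file names follow the `<subcorpus>.splitNN.docs.token` pattern seen in the table; the function name `docs_per_subcorpus` and the sample list are hypothetical and only use a few rows copied from the table.

```python
from collections import defaultdict


def docs_per_subcorpus(rows):
    """Sum document counts over all splits of each subcorpus.

    `rows` is an iterable of (file_name, doc_count) pairs, where file_name
    looks like "<subcorpus>.splitNN.docs.token" (assumed naming pattern).
    """
    totals = defaultdict(int)
    for file_name, doc_count in rows:
        # Everything before ".split" is treated as the subcorpus name.
        subcorpus = file_name.split(".split", 1)[0]
        totals[subcorpus] += doc_count
    return dict(totals)


# A few rows copied from the table above (not the full table).
sample_rows = [
    ("edeposcorpus.split00.docs.token", 297_478),
    ("nok.split00.docs.token", 1_622_992),
    ("offentligt.split00.docs.token", 3_306_624),
    ("offentligt.split01.docs.token", 2_198_892),
]

per_subcorpus = docs_per_subcorpus(sample_rows)
print(per_subcorpus["offentligt"])  # 5505516
```

Feeding in every row of the table instead of the sample would give the per-subcorpus breakdown behind the overall total.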

Total tokens in corpus: 15,151,843,671

## Too long docs

| Subcorpus | too long docs (> 1022 tokens) |
| --- | ---: |
| edeposcorpus.split00.docs.token | 76 |
| flashcorpus.split00.docs.token | 135 |
| news_2.split00.docs.token | 668 |
| news_2.split01.docs.token | 547 |
| news_2.split02.docs.token | 117 |
| news_2.split03.docs.token | 742 |
| news_2.split04.docs.token | 571 |
| news_2.split05.docs.token | 805 |
| news_2.split06.docs.token | 2,269 |
| news_2.split07.docs.token | 557 |
| news_2.split08.docs.token | 35 |
| news_2.split09.docs.token | 283 |
| news_2.split10.docs.token | 443 |
| news_2.split11.docs.token | 3,459 |
| news_2.split12.docs.token | 218 |
| news_2.split13.docs.token | 151 |
| news_2.split14.docs.token | 171 |
| news_2.split15.docs.token | 1,461 |
| news_2.split16.docs.token | 1,232 |
| news_2.split17.docs.token | 314 |
| news_2.split18.docs.token | 100 |
| nok.split00.docs.token | 122 |
| offentligt.split00.docs.token | 8 |
| offentligt.split01.docs.token | 1 |
| oscar.split00.docs.token | 87,546 |
| oscar.split01.docs.token | 80,124 |
| oscar.split02.docs.token | 86,818 |
| oscar.split03.docs.token | 88,286 |
| oscar.split04.docs.token | 86,809 |
| oscar.split05.docs.token | 86,787 |
| oscar.split06.docs.token | 86,828 |
| oscar.split07.docs.token | 86,707 |
| oscar.split08.docs.token | 87,348 |
| oscar.split09.docs.token | 87,151 |
| oscar.split10.docs.token | 85,381 |
| oscar.split11.docs.token | 88,898 |
| oscar.split12.docs.token | 39,836 |
| runeberg.split00.docs.token | 2,738 |
| tweets.split00.docs.token | 0 |
| wiki.split00.docs.token | 24,200 |
| total | 1,119,942 |
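
Counts like the ones above could be produced with a single pass over each file. The sketch below is only an illustration under a loud assumption: that a `.docs.token` file holds one document per line with whitespace-separated tokens. The actual file layout is not stated here, and the name `count_too_long` is hypothetical.

```python
MAX_TOKENS = 1022  # threshold taken from the table header


def count_too_long(docs, max_tokens=MAX_TOKENS):
    """Return (n_docs, n_too_long, tokens_total, tokens_in_too_long).

    `docs` is any iterable of strings, one document per item
    (ASSUMPTION: tokens are separated by whitespace).
    """
    n_docs = n_too_long = tokens_total = tokens_too_long = 0
    for doc in docs:
        n_tokens = len(doc.split())
        n_docs += 1
        tokens_total += n_tokens
        # A document is "too long" only when it strictly exceeds the limit.
        if n_tokens > max_tokens:
            n_too_long += 1
            tokens_too_long += n_tokens
    return n_docs, n_too_long, tokens_total, tokens_too_long


# Tiny in-memory example: one short doc, one exactly at the limit, one over it.
example_docs = [
    "a b c",
    " ".join(["tok"] * 1022),
    " ".join(["tok"] * 1023),
]
stats = count_too_long(example_docs)
print(stats)  # (3, 1, 2048, 1023)
```

Under the same assumption, an open file handle can be passed directly, for example `with open(path, encoding="utf-8") as f: count_too_long(f)`, so the file is streamed rather than loaded into memory.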

Total tokens in too long docs: 2,999,450,123

Tokens left if too long docs are dropped instead of split: 12,152,393,548
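
The remaining-token figure is the corpus total minus the tokens in too long docs. As a quick check of that arithmetic, and to show how large the affected share is, here is a short sketch using only the totals reported above (the variable names are illustrative):

```python
# Totals copied from the tables and summary lines above.
total_docs = 85_035_487
too_long_docs = 1_119_942
total_tokens = 15_151_843_671
too_long_tokens = 2_999_450_123

# Tokens that remain when too long docs are dropped rather than split.
tokens_left = total_tokens - too_long_tokens

# Share of documents and of tokens that the too long docs account for.
doc_share = too_long_docs / total_docs
token_share = too_long_tokens / total_tokens

print(f"{tokens_left:,}")  # 12,152,393,548
print(f"{doc_share:.1%} of docs, {token_share:.1%} of tokens")
```

In other words, roughly 1.3% of the documents hold about 19.8% of the tokens, which is why the choice between dropping and splitting matters.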