HisDoc1B Dataset

The HisDoc1B dataset comprises 40,281 books, over 3 million document images, and over 1 billion characters across 30,615 character categories. To the best of our knowledge, HisDoc1B is the largest dataset in the field, exceeding existing datasets by more than 200 times in the number of document images and characters (see the table below). It is also the only dataset with complete book-level annotations and punctuation annotations.

Dataset	#Books	#Document images	#Characters	#Character categories	Text punctuation
MTHv1[1]	-	1,500	521,370	4,058	×
MTHv2[2]	-	3,199	1,081,678	6,733	×
IC19 HDRC[3]	-	11,715	2,482,994	8,353	×
M5HisDoc[4]	-	8,000	4,367,360	16,151	×
CASIA-AHCDB[5]	-	-	2,276,740	10,350	×
HisDoc1B (Ours)	40,281	3,163,330 (270×)	1,082,544,808 (248×)	30,615 (1.9×)	✓

Table 1: Comparison of HisDoc1B with existing Chinese historical document datasets. The highest and second highest values within each column are denoted by bold and underline, respectively.
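
The multipliers in the HisDoc1B row can be reproduced from the table itself: each one appears to be the HisDoc1B value divided by the largest value among the other datasets in that column. A minimal Python check, with the figures copied from Table 1 (the variable names are illustrative only):

```python
# Figures copied from Table 1: (HisDoc1B value, largest value among prior datasets).
columns = {
    "document images": (3_163_330, 11_715),        # prior max: IC19 HDRC
    "characters": (1_082_544_808, 4_367_360),      # prior max: M5HisDoc
    "character categories": (30_615, 16_151),      # prior max: M5HisDoc
}

# Ratio of HisDoc1B to the largest prior dataset, per column.
ratios = {name: ours / prior for name, (ours, prior) in columns.items()}

for name, ratio in ratios.items():
    print(f"{name}: {ratio:.1f}x")
```

Rounded to the precision used in the table, these ratios match the 270×, 248×, and 1.9× annotations.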

Download

OneDrive: https://1drv.ms/u/s!ApQfSeOP7LDTdPghMv281sKYsq0?e=fIuK65
BaiduYun: https://pan.baidu.com/s/1CQnfmHwh6hGigyvHNlmPCQ?pwd=aziq

Directory Format

The dataset is organized in the following directory format:

HisDoc1B
├── books
│   ├── xxx.pdf/.djvu
│   └── ...
├── annos
│   ├── xxx.json
│   └── ...
├── readme.md
├── book2im.py
└── read_anno.py
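
As a starting point for working with this layout, the sketch below pairs each file in books with an annotation file in annos. It assumes that a book and its annotation share the same file stem (for example xxx.pdf and xxx.json), which the layout suggests but does not state. The function name `pair_books_with_annos` is hypothetical and is not part of the bundled book2im.py or read_anno.py scripts.

```python
from pathlib import Path


def pair_books_with_annos(root):
    """Return (book_path, anno_path) pairs found under a HisDoc1B root.

    Assumption: a book and its annotation share the same file stem.
    """
    root = Path(root)
    # Index annotation files by stem so each book can be looked up directly.
    annos = {p.stem: p for p in (root / "annos").glob("*.json")}
    pairs = []
    for book in sorted((root / "books").iterdir()):
        if book.suffix.lower() not in {".pdf", ".djvu"}:
            continue  # skip anything that is not a book file
        anno = annos.get(book.stem)
        if anno is not None:
            pairs.append((book, anno))
    return pairs
```

Calling `pair_books_with_annos("HisDoc1B")` returns a list of path pairs; books without a matching annotation file are skipped rather than reported, so a stricter variant may be preferable when checking a download for completeness.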

Inference code to generate the dataset

Download the Docker image:
OneDrive: https://1drv.ms/u/s!ApQfSeOP7LDTc0VErUM9NBjQLls?e=gVTcqw
BaiduYun: https://pan.baidu.com/s/1OV_RjZ8pf9QJlCrMDFvztA?pwd=wdlm

Load the image and start a container:

docker load -i hisdoc1b-docker.tar
docker run -it --gpus all --shm-size="2g" hisdoc1b-image:20240716

The code is located in /root/HisDoc1B_codes/ inside the container.

Contact

For any questions about the dataset, please email the authors at yongxin_shi@foxmail.com.
