MegaHan97K Dataset

We introduce MegaHan97K, a mega-category, large-scale dataset covering 97,455 Chinese character categories.
MegaHan97K covers at least six times as many categories as existing datasets and is the largest in overall volume.
MegaHan97K is the first dataset to support the latest Chinese GB18030-2022 standard, giving it the most comprehensive character coverage and compatibility with modern Chinese text-processing systems.
MegaHan97K contains four distinct subsets: handwritten, mouse-written, historical, and synthetic. Each subset contains more character categories than existing datasets, giving MegaHan97K clear advantages in scale and diversity.
MegaHan97K mitigates the long-tail distribution problem by providing a balanced and sufficient number of samples for each category, which supports robust training and validation of Chinese character recognition (CCR) models.
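
To make the balance claim concrete, the following is a minimal sketch of how one might measure class balance for any labelled sample list. The function name `class_balance` and the toy labels are illustrative assumptions, not part of the MegaHan97K code; the real labels would come from the dataset loader.

```python
from collections import Counter


def class_balance(labels):
    """Return (num_categories, min_count, max_count, max/min ratio).

    A ratio close to 1.0 indicates a balanced label distribution;
    a large ratio indicates a long tail.
    """
    counts = Counter(labels)
    if not counts:
        raise ValueError("labels is empty")
    smallest = min(counts.values())
    largest = max(counts.values())
    return len(counts), smallest, largest, largest / smallest


# Toy labels standing in for per-sample character categories (hypothetical data).
toy_labels = ["一", "一", "一", "丁", "丁", "七"]
print(class_balance(toy_labels))
```

Running the same function on the labels of a training subset gives a quick check of how far the most and least frequent categories differ.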

🔥 Download

Setting	Dataset	Status
General CCR	GoogleDrive / BaiduYun (extraction code: mbns)	Released
Zero-Shot CCR	GoogleDrive / BaiduYun (extraction code: 6pxi)	Released
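
In the zero-shot setting, the character categories seen at test time are disjoint from those used for training. The sketch below illustrates that idea with a generic category split. It is not the split shipped in the Zero-Shot CCR download, and `split_categories`, `filter_samples`, and `unseen_fraction` are hypothetical names chosen for illustration.

```python
import random


def split_categories(categories, unseen_fraction=0.2, seed=0):
    """Split category labels into disjoint 'seen' and 'unseen' sets."""
    cats = sorted(set(categories))          # deterministic starting order
    rng = random.Random(seed)               # local RNG, reproducible
    rng.shuffle(cats)
    n_unseen = int(len(cats) * unseen_fraction)
    return set(cats[n_unseen:]), set(cats[:n_unseen])


def filter_samples(samples, allowed):
    """Keep only (data, label) pairs whose label is in `allowed`."""
    return [(x, y) for x, y in samples if y in allowed]


# Hypothetical category list; real categories would come from the dataset.
seen, unseen = split_categories(list("一丁七万丈三上下不与"))
print(len(seen), len(unseen))
```

Training samples would then be filtered with `seen` and test samples with `unseen`, so that no test category appears during training.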

🛠️ Usage

Clone this repo:

git clone https://github.com/SCUT-DLVCLab/MegaHan97K.git

Execute the following command to obtain example samples from the MegaHan97K dataset.

python MegaHan_Dataloader.py

Note: To access the entire dataset, please email the first author listed in the paper to obtain the decryption password.

To use the entire dataset, first download it, set data_root in MegaHan_Dataloader.py to the dataset location, and then run:

python MegaHan_Dataloader.py
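
Before running the loader on the full dataset, it can help to confirm that the path assigned to data_root actually exists. The helper below is a small sketch for that check; `summarize_data_root` is a hypothetical name and is not part of MegaHan_Dataloader.py, and it makes no assumption about the dataset's internal folder layout.

```python
from pathlib import Path


def summarize_data_root(data_root):
    """Return the sorted names of the top-level entries under data_root.

    Raises FileNotFoundError if the path is missing or is not a directory.
    """
    root = Path(data_root)
    if not root.is_dir():
        raise FileNotFoundError(f"data_root is not a directory: {root}")
    return sorted(p.name for p in root.iterdir())


# Example: inspect the current directory (replace "." with your data_root).
print(summarize_data_root("."))
```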

🌄 Gallery

Illustration of the handwritten-original data in MegaHan97K
Illustration of the handwritten-augmented data in MegaHan97K
Illustration of the M⁵HisDoc data in MegaHan97K
Illustration of the Kangxi dictionary data in MegaHan97K
Illustration of the mouse-written-original data in MegaHan97K
Illustration of the mouse-written-augmented data in MegaHan97K
Illustration of the synthetic data in MegaHan97K

💙 Acknowledgement

License

MegaHan97K should be used and distributed under the Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International (CC BY-NC-ND 4.0) License for non-commercial research purposes.

Copyright

This repository can only be used for non-commercial research purposes.
For commercial use, please contact Prof. Lianwen Jin (eelwjin@scut.edu.cn).
Copyright 2024, Deep Learning and Vision Computing Lab (DLVC-Lab), South China University of Technology.
