- We introduce MegaHan97K, a mega-category, large-scale dataset covering 97,455 Chinese character categories, the largest number of categories to date.
- MegaHan97K covers 97,455 categories of Chinese characters, at least six times more categories than existing datasets, and has the largest data volume.
- MegaHan97K is the first dataset to support the latest Chinese GB18030-2022 standard, ensuring the most comprehensive coverage and compatibility with modern Chinese processing systems.
- MegaHan97K contains four distinct subsets: handwritten, mouse-written, historical, and synthetic. Each subset covers more character categories than existing datasets, giving it clear advantages in scale and diversity.
- MegaHan97K effectively mitigates long-tail distribution issues by providing a balanced and sufficient number of samples for each category, ensuring robust training and validation of CCR models.
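The long-tail point above can be made concrete with a per-category sample count. The following is a minimal sketch, not part of the MegaHan97K code: `balance_summary` is a hypothetical helper, and the labels are toy values rather than statistics from the dataset. It reports the ratio between the largest and smallest category, which is 1.0 for a perfectly balanced label set and grows as the distribution becomes more long-tailed.

```python
from collections import Counter


def balance_summary(labels):
    """Summarize how evenly samples are spread across categories.

    `labels` is any iterable with one category label per sample.
    This helper is hypothetical and only illustrates the idea.
    """
    counts = Counter(labels)
    sizes = sorted(counts.values())
    smallest, largest = sizes[0], sizes[-1]
    return {
        "categories": len(counts),
        "min_per_category": smallest,
        "max_per_category": largest,
        # 1.0 means perfectly balanced; larger means more long-tailed.
        "imbalance_ratio": largest / smallest,
    }


# Toy labels only; these are NOT MegaHan97K statistics.
long_tailed = ["一"] * 90 + ["丁"] * 9 + ["𠀀"] * 1
balanced = ["一"] * 30 + ["丁"] * 30 + ["𠀀"] * 30

print(balance_summary(long_tailed)["imbalance_ratio"])  # 90.0
print(balance_summary(balanced)["imbalance_ratio"])     # 1.0
```

Running the same summary over the labels yielded by a data loader is one way to check how balanced a given subset is before training.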
Setting | Dataset | Status |
---|---|---|
General CCR | GoogleDrive / BaiduYun:mbns | Released |
Zero-Shot CCR | GoogleDrive / BaiduYun:6pxi | Released |
- Clone this repo:
git clone https://github.com/SCUT-DLVCLab/MegaHan97K.git
- Execute the following command to obtain example samples from the MegaHan97K dataset.
python MegaHan_Dataloader.py
Note: To access the entire dataset, please email the first author listed in the paper to obtain the decryption password.
- To use the entire dataset, first download it, update data_root in the MegaHan_Dataloader.py script, and then execute:
python MegaHan_Dataloader.py
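Before running the loader on the full dataset, it can help to confirm that data_root points at the downloaded and decrypted data. The sketch below is an assumption-laden illustration, not code from this repository: `check_data_root` is a hypothetical helper, it assumes data_root is a directory path, and it makes no claim about how the dataset is organized inside that directory.

```python
import tempfile
from pathlib import Path


def check_data_root(data_root):
    """Return the sorted top-level entry names under data_root.

    Hypothetical helper: raises if the path is not a directory or is empty.
    It does not inspect or assume the dataset's internal layout.
    """
    root = Path(data_root).expanduser()
    if not root.is_dir():
        raise FileNotFoundError(f"data_root is not a directory: {root}")
    entries = sorted(p.name for p in root.iterdir())
    if not entries:
        raise ValueError(f"data_root is empty: {root}")
    return entries


# Demo on a throwaway directory; replace `tmp` with the real download location.
with tempfile.TemporaryDirectory() as tmp:
    (Path(tmp) / "placeholder_subset").mkdir()  # stand-in name, not a real subset
    demo_entries = check_data_root(tmp)

print(demo_entries)
```

A failed check usually means the path is mistyped or the archive has not been extracted yet.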
- Illustration of the handwritten-augmented data in MegaHan97K
- Illustration of the mouse-written-original data in MegaHan97K
- Illustration of the mouse-written-augmented data in MegaHan97K
MegaHan97K should be used and distributed under the Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International (CC BY-NC-ND 4.0) License for non-commercial research purposes.
- This repository can only be used for non-commercial research purposes.
- For commercial use, please contact Prof. Lianwen Jin (eelwjin@scut.edu.cn).
- Copyright 2024, Deep Learning and Vision Computing Lab (DLVC-Lab), South China University of Technology.