MegaHan97K: A Large-Scale Dataset for Mega-Category Chinese Character Recognition with over 97K Categories

SCUT-DLVCLab/MegaHan97K

MegaHan97K Dataset

  • We introduce MegaHan97K, a mega-category, large-scale dataset for Chinese character recognition (CCR) covering 97,455 character categories.
  • With 97,455 categories, MegaHan97K covers at least six times as many categories as existing datasets and is also the largest in data volume.
  • MegaHan97K is the first dataset to support the latest Chinese GB18030-2022 standard, ensuring comprehensive coverage and compatibility with modern Chinese text-processing systems.
  • MegaHan97K contains four distinct subsets: handwritten, mouse-written, historical, and synthetic. Each subset covers more character categories than existing datasets, giving it advantages in both scale and diversity.
  • MegaHan97K mitigates the long-tail distribution problem by providing a balanced and sufficient number of samples for each category, which supports robust training and validation of CCR models.
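
To make the long-tail point concrete, the sketch below shows one simple way to quantify class balance in any labelled character set: count the samples per category and compare the largest and smallest counts. The helper name `class_balance` and the toy labels are assumptions for illustration only; they are not part of the MegaHan97K code or data.

```python
from collections import Counter


def class_balance(labels):
    """Return (min_count, max_count, imbalance_ratio) for a list of class labels.

    An imbalance ratio of 1.0 means every category has the same number of samples;
    large ratios indicate a long-tailed distribution.
    """
    counts = Counter(labels)
    if not counts:
        raise ValueError("labels is empty")
    smallest = min(counts.values())
    largest = max(counts.values())
    return smallest, largest, largest / smallest


# Toy labels for illustration only (not taken from MegaHan97K).
long_tailed = ["一"] * 8 + ["丁"] * 2 + ["𠀀"] * 1
balanced = ["一"] * 4 + ["丁"] * 4 + ["𠀀"] * 4

print(class_balance(long_tailed))  # (1, 8, 8.0)
print(class_balance(balanced))     # (4, 4, 1.0)
```

A ratio close to 1.0 is what a balanced dataset would show; a rare-character category with a single sample next to a common one with thousands would produce a very large ratio.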

[Figure: overview]

[Figure: example]

🔥 Download

Setting         Dataset                         Status
General CCR     GoogleDrive / BaiduYun:mbns     Released
Zero-Shot CCR   GoogleDrive / BaiduYun:6pxi     Released

🛠️ Usage

  • Clone this repo:
git clone https://github.com/SCUT-DLVCLab/MegaHan97K.git
  • Execute the following command to obtain example samples from the MegaHan97K dataset.
python MegaHan_Dataloader.py

Note: To access the entire dataset, please email the first author listed in the paper to obtain the decryption password.

  • To access the entire dataset, first download it, update data_root in the MegaHan_Dataloader.py script, and then execute:
python MegaHan_Dataloader.py
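
Before editing data_root, it can help to confirm that the directory you extracted actually contains files. The following is a minimal sketch under stated assumptions: the helper `summarize_data_root` is hypothetical and not part of this repository, and it assumes nothing about the dataset's internal layout or file formats beyond "a directory with files in it". It simply counts files by extension.

```python
import tempfile
from collections import Counter
from pathlib import Path


def summarize_data_root(data_root):
    """Count files under data_root (recursively), grouped by lowercase extension."""
    root = Path(data_root)
    if not root.is_dir():
        raise FileNotFoundError(f"data_root is not a directory: {root}")
    return Counter(p.suffix.lower() for p in root.rglob("*") if p.is_file())


# Demo on a throwaway directory. With the real dataset, pass the same path
# that you set as data_root in MegaHan_Dataloader.py.
with tempfile.TemporaryDirectory() as tmp:
    (Path(tmp) / "sub").mkdir()
    (Path(tmp) / "sub" / "a.PNG").touch()
    (Path(tmp) / "b.png").touch()
    demo_counts = summarize_data_root(tmp)
    print(dict(demo_counts))
```

If the call raises FileNotFoundError or returns an empty counter, the path is likely wrong or the archive has not been decrypted and extracted yet.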

🌄 Gallery

  • Illustration of the handwritten-original data in MegaHan97K

  • Illustration of the handwritten-augmented data in MegaHan97K

  • Illustration of the M5HisDoc data in MegaHan97K

  • Illustration of the Kangxi dictionary data in MegaHan97K

  • Illustration of the mouse-written-original data in MegaHan97K

  • Illustration of the mouse-written-augmented data in MegaHan97K

  • Illustration of the synthetic data in MegaHan97K

💙 Acknowledgement

License

MegaHan97K should be used and distributed under the Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International (CC BY-NC-ND 4.0) License, for non-commercial research purposes only.

Copyright
