PAN_OCR (custom for Thai National ID Card)

This program is adapted from pan_ocr by Karan Purohit, which I modified and adjusted to work with the Thai National ID Card.

To understand how it was built, please read his blog here.

The program uses Darknet to identify specific text sections of the ID card, crops those sections, and passes them to Tesseract, the OCR engine, to extract text in Thai and English. The results are then saved to a csv file for further use.
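The crop-to-CSV part of that flow can be sketched as below. This is a minimal illustration, not the code in pan.py: the function names, the field labels, and the `tha+eng` language setting are assumptions, and the OCR step is passed in as a parameter so the sketch can run without Tesseract installed.

```python
import csv
import io
from typing import Callable, Dict, Iterable, List, Tuple


def default_ocr(image) -> str:
    """OCR one cropped image with Tesseract (Thai + English)."""
    # Imported lazily so the rest of the sketch works without pytesseract.
    import pytesseract

    return pytesseract.image_to_string(image, lang="tha+eng")


def extract_fields(
    crops: Iterable[Tuple[str, object]],
    ocr: Callable[[object], str] = default_ocr,
) -> Dict[str, str]:
    """Run OCR on each (label, cropped image) pair and return one row."""
    return {label: ocr(image).strip() for label, image in crops}


def rows_to_csv(rows: List[Dict[str, str]], fieldnames: List[str]) -> str:
    """Serialize the extracted rows as CSV text with a header line."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=fieldnames)
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()
```

With real data, `crops` would hold the label names from the detector paired with the cropped images, and the returned CSV text would be written to a file in the output folder.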

To label images for the training data, I used VoTT to label each section of the image, then exported the boxes in YOLO format for use in training.
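For reference, a YOLO label file stores one box per line as `class x_center y_center width height`, with the four coordinates normalized to the image size. The sketch below converts such a line to pixel coordinates, which is also what a cropping step needs. The function name is hypothetical and this is not taken from the project code.

```python
from typing import Tuple


def yolo_to_pixel_box(
    line: str, img_width: int, img_height: int
) -> Tuple[int, int, int, int, int]:
    """Convert one YOLO label line to (class_id, left, top, right, bottom)."""
    class_id, x_c, y_c, w, h = line.split()
    # Scale the normalized center and size up to pixels.
    x_c, w = float(x_c) * img_width, float(w) * img_width
    y_c, h = float(y_c) * img_height, float(h) * img_height
    # Convert center/size to corners and clamp to the image bounds.
    left = max(0, round(x_c - w / 2))
    top = max(0, round(y_c - h / 2))
    right = min(img_width, round(x_c + w / 2))
    bottom = min(img_height, round(y_c + h / 2))
    return int(class_id), left, top, right, bottom
```

For example, the line `0 0.5 0.5 0.5 0.25` on a 1000 x 800 image describes a box 500 pixels wide and 200 pixels high, centered in the image.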

Darknet runs a YOLOv4 object detection model with custom weights, obtained by training on the labeled ID card data starting from pre-trained weights, so that it recognizes the specific sections. The training data totals 69 images: the Albumentations library was used to produce transformed images and increase the training set from the original 23 images.
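A box-aware augmentation setup with Albumentations might look like the sketch below. The README does not say which transforms were used or how many copies were made per image, so the transforms, their parameters, and the two-copies-per-original reading of "23 to 69" are all assumptions made for illustration.

```python
def augmented_total(originals: int, copies_per_image: int) -> int:
    """Size of the training set when each original is kept and augmented."""
    return originals * (1 + copies_per_image)


def build_augmenter():
    """Build a pipeline that transforms the image and its YOLO boxes together."""
    # Imported lazily so the arithmetic helper works without the library.
    import albumentations as A

    return A.Compose(
        [
            # Assumed transforms: mild changes that keep the card text readable.
            A.RandomBrightnessContrast(p=0.5),
            A.Rotate(limit=5, p=0.5),
            A.GaussianBlur(p=0.3),
        ],
        # Boxes are given in YOLO format; labels travel in "class_labels".
        bbox_params=A.BboxParams(format="yolo", label_fields=["class_labels"]),
    )
```

Calling `build_augmenter()(image=image, bboxes=boxes, class_labels=labels)` returns a dictionary with the transformed `image`, `bboxes`, and `class_labels`. Under the assumed reading, `augmented_total(23, 2)` gives 69.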

Tesseract and Pytesseract must be installed to run this OCR.
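A quick pre-flight check can confirm that the language data for both Thai and English is present before running. This is a sketch using only the standard library: the function name is hypothetical, and the location of the tessdata folder depends on how Tesseract was installed.

```python
from pathlib import Path
from typing import Iterable, List


def missing_traineddata(
    tessdata_dir: str, languages: Iterable[str] = ("eng", "tha")
) -> List[str]:
    """Return the language codes whose .traineddata file is not in the folder."""
    folder = Path(tessdata_dir)
    return [
        lang
        for lang in languages
        if not (folder / f"{lang}.traineddata").is_file()
    ]
```

An empty list means both eng.traineddata and tha.traineddata were found in the given folder.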

Object detection marks each specific area with a bounding box and a label name.

The cropped images are generated in the cropimgs folder.

The result is generated in the output folder as a csv file.

Main components and Command

darknet.exe
data folder for Darknet
custom YOLO weights file obtained by training on the data with Darknet
Tesseract.exe and its components, such as eng.traineddata and tha.traineddata
idcards folder containing the target Thai National ID Card images for text extraction
cropimgs folder containing the cropped images generated by the program
output folder containing the output csv file

To run the OCR, use this command.

pan.py -d -t

Limitation

The output file has a last column that flags the columns whose values are not in the right format, which may be caused by the object detection not fully detecting the text area.
The training data for text area detection is small, so the model cannot fully detect the text sections in some cases. The area detection model could be improved by enlarging the training set with high-quality and varied images.
Misspellings in the extracted text, however, may not be flagged by any algorithm, since they occur when the model is unable to recognize the text correctly.
Tesseract may not extract text accurately under some image conditions, such as low quality, glare or reflection, and a skewed card angle. Also, the Tesseract model might not fully recognize Thai text. Taking the ID card image properly helps text recognition, and the accuracy of the Tesseract model could be improved by training it on specific Thai font data.
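As an illustration of the kind of format check that can feed the last column, the sketch below validates an extracted ID number with the standard 13-digit Thai national ID checksum. The README does not list the checks the program actually performs, so this is an example of one possible rule rather than the program's logic, and the function name is hypothetical.

```python
def is_valid_thai_id(text: str) -> bool:
    """Check the length and check digit of a Thai national ID number."""
    # OCR output often contains spaces or hyphens between digit groups.
    digits = text.replace(" ", "").replace("-", "")
    if len(digits) != 13 or not digits.isdecimal():
        return False
    # The first 12 digits are weighted 13 down to 2.
    weighted_sum = sum(int(d) * (13 - i) for i, d in enumerate(digits[:12]))
    check_digit = (11 - weighted_sum % 11) % 10
    return check_digit == int(digits[12])
```

A number that fails this check was most likely misread or only partly detected, so its row could be flagged for manual review. Free-text fields such as names have no comparable rule, which matches the limitation noted above about misspellings.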
