captcha_break

Captcha breaker for a very old government website, built to automate a data entry task.

This repo contains the captcha generator used to create a labelled dataset. We tried to create captchas similar to the ones on the website, which used strikethrough text of variable length, rendered in red in a Times font.
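As a rough illustration of what such a generator involves, here is a minimal sketch. It is not the code from the generator notebook: the character set, the font file name `Times New Roman.ttf`, the canvas sizing, and both function names are assumptions made for the example. Only the 5 to 9 length range, the red colour, and the strikethrough come from this README.

```python
import random
import string

# Assumption: the README does not state the character set used by the website.
CHARSET = string.ascii_lowercase + string.digits
MIN_LEN, MAX_LEN = 5, 9  # variable length range described in this README


def random_label(rng: random.Random) -> str:
    """Return a random captcha string of 5 to 9 characters."""
    length = rng.randint(MIN_LEN, MAX_LEN)
    return "".join(rng.choice(CHARSET) for _ in range(length))


def render_captcha(text: str, font_path: str = "Times New Roman.ttf", font_size: int = 36):
    """Draw `text` in red with a horizontal strikethrough; return a PIL image.

    Raises OSError if the font file (an assumed path) cannot be found.
    """
    # Pillow is imported here so label generation works without it installed.
    from PIL import Image, ImageDraw, ImageFont

    font = ImageFont.truetype(font_path, font_size)
    # Generous fixed-ratio canvas: one font_size of width per character.
    width, height = font_size * len(text), 2 * font_size
    image = Image.new("RGB", (width, height), "white")
    draw = ImageDraw.Draw(image)
    draw.text((font_size // 2, font_size // 2), text, fill="red", font=font)
    # Strikethrough: a red horizontal line near the vertical middle of the text.
    draw.line([(0, height // 2), (width, height // 2)], fill="red", width=2)
    return image
```

Saving each image under a file name derived from its label, for example `render_captcha(label).save(label + ".png")`, gives a labelled dataset of any size.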

The challenges were:

The captcha text is struck through, so using Pytesseract directly was not possible.
The captchas vary in length from 5 to 9 characters, so most of the repos and datasets available on GitHub could not be used.

Given these issues, recognition based on contours or per-character text boxes was not possible, because the strikethrough joins the characters and prevents separate boxes for each one. OpenCV seemed the best way forward. We developed an OpenCV pipeline on the generated captchas, where removing the lines looked good, and then tried to extend it to our real captcha dataset. There, however, because of the morphological operations involved, Tesseract could no longer recognise the resulting images, so the approach did not transfer smoothly.

We did not want to use deep learning here, as it was not required and OpenCV seemed more sensible. Since we knew our captchas really well, we used simple Python code. These are the steps we followed:

We removed the strikethrough line with a simple Python script and generated line-free versions of the images in our real captcha dataset.
We converted each image to greyscale and then binarised it with Otsu thresholding, as preprocessing for better OCR results from Tesseract:

    import cv2
    gray1 = cv2.cvtColor(image, cv2.COLOR_BGR2GRAY)
    gray2 = cv2.threshold(gray1, 0, 255, cv2.THRESH_BINARY | cv2.THRESH_OTSU)[1]

We ran Tesseract OCR on both preprocessed images and collected the outputs:

    import pytesseract
    text = []
    text.append(pytesseract.image_to_string(gray1))
    text.append(pytesseract.image_to_string(gray2))
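
The line removal step has no code in this README, so here is one possible way to do it, shown as a sketch only. It is not the repository's actual script. It assumes the strikethrough is horizontal and covers most of the image width, and it works on a binary grid (1 for ink, 0 for background) such as one derived from a thresholded image. The function name `remove_strikethrough` and the `min_fill` threshold are hypothetical.

```python
def remove_strikethrough(ink, min_fill=0.8):
    """Clear rows that look like a horizontal line, keeping crossing strokes.

    ink: list of rows, each a list of 0/1 values (1 = ink).
    min_fill: assumed fraction of a row that must be ink to count as the line.
    Returns a new grid; the input is not modified.
    """
    if not ink:
        return []
    height, width = len(ink), len(ink[0])
    # Rows that are almost entirely ink are treated as part of the line.
    line_rows = {y for y, row in enumerate(ink) if width and sum(row) / width >= min_fill}
    cleaned = [row[:] for row in ink]
    for y in line_rows:
        # Find the nearest rows above and below that are not part of the line.
        above = y - 1
        while above in line_rows:
            above -= 1
        below = y + 1
        while below in line_rows:
            below += 1
        for x in range(width):
            # Keep a pixel only if ink continues both above and below the line,
            # which suggests a character stroke crossing it.
            crosses = above >= 0 and below < height and ink[above][x] and ink[below][x]
            cleaned[y][x] = 1 if crosses else 0
    return cleaned
```

Under this scheme, any horizontal stroke that lies on the line itself is erased together with the line, since it cannot be told apart from it.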

Taken together, the results of these two OCR passes recovered almost all of the captchas. The only remaining issue is with the letter e: after line removal, e looks similar to c. Even so, the results are good enough.
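
One way to measure this kind of cumulative result against the labelled captchas is sketched below. The function names are hypothetical and no figures from our runs are implied. The sketch counts a captcha as solved when either OCR pass matches its label, and it flags predictions whose only errors are an e read as c.

```python
def normalise(text):
    """Remove all whitespace, including the trailing newline OCR output often has."""
    return "".join(text.split())


def cumulative_accuracy(labels, gray_preds, binary_preds):
    """Fraction of captchas where at least one of the two passes matches the label."""
    if not labels:
        raise ValueError("labels must not be empty")
    hits = sum(
        label in (normalise(a), normalise(b))
        for label, a, b in zip(labels, gray_preds, binary_preds)
    )
    return hits / len(labels)


def differs_only_by_e_c(label, pred):
    """True if pred is wrong only in places where the label has e and pred has c."""
    pred = normalise(pred)
    return (
        label != pred
        and len(label) == len(pred)
        and all(l == p or (l == "e" and p == "c") for l, p in zip(label, pred))
    )
```

Counting how many failures satisfy `differs_only_by_e_c` shows how much of the remaining error comes from the e and c confusion alone.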

Going forward, we will add more approaches for this dataset, including a CNN+CTC based approach and a contour based approach that starts from the processed images we feed to Tesseract.
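
CTC is a natural fit for variable-length captchas because it does not need the characters to be segmented first. As an illustration of the decoding side only, here is a minimal greedy CTC decoder. It is a sketch under assumptions rather than planned code: it assumes the network emits one score vector per image column, with index 0 reserved for the CTC blank, and the names `BLANK` and `ctc_greedy_decode` are hypothetical.

```python
BLANK = 0  # assumed index of the CTC blank symbol


def ctc_greedy_decode(frame_scores, alphabet):
    """Greedy CTC decoding: best class per frame, merge repeats, drop blanks.

    frame_scores: list of per-frame score lists, each of length len(alphabet) + 1.
    alphabet: string whose character i corresponds to class index i + 1.
    """
    # Pick the highest-scoring class index for every frame.
    best = [max(range(len(frame)), key=frame.__getitem__) for frame in frame_scores]
    decoded = []
    previous = BLANK
    for index in best:
        # Emit a character only when it is not blank and not a repeat of the
        # previous frame; a blank between two equal characters keeps both.
        if index != BLANK and index != previous:
            decoded.append(alphabet[index - 1])
        previous = index
    return "".join(decoded)
```

Because repeats are merged and blanks dropped, the output length adapts to the captcha, which is how one model can cover lengths 5 to 9.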

The code shared here consists of the captcha generator Google Colab notebook and a second Colab notebook that produces the captcha results for our scenario. We are also sharing the captchas that we crawled from the target website, including a labelled dataset of 500 captchas; labelling was the most painful part of the process. For experimentation, you can generate any number of captchas with the captcha generator Colab notebook.

Files in this repository:
CAPTCHA_recognition_data
.DS_Store
0de39a03b.jpg
CNN_Captchas.ipynb
README.md
Text_strikethrough_clean.ipynb
captcha_generator.ipynb
line_removal.py
real_captchas2.ipynb

navneetkrc/captcha_break
