
CANDEV-2022 Hackathon Challenge. Problem description: Extracting financial data from non-machine readable PDFs. Team-E.G.L.S. Challenge group: Statistics Canada.


ShawkhIbneRashid/candev_egls


candev_statcan_egls

The task is to extract tables from a set of scanned PDFs. We addressed the problem in two ways. In both, we first converted the PDFs into images and extracted the text from those images. In the first approach, we counted the horizontal lines in each image, because images that contain tables have a higher number of horizontal lines. We then kept the extracted text of an image only if its line count exceeded a threshold, indicating that the image contains a table. The JSON files for this approach are stored in the line_count folder. In the second approach, we counted the numerical values in each image, motivated by the fact that images with tables contain a higher number of numerical values. The JSON files for the numeric value count approach are stored in the num_count folder.
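The two heuristics above can be sketched in plain Python. This is only an illustration under stated assumptions, not the code in this repository: the function names, the number pattern, the `min_run_fraction` parameter, and the thresholds are all hypothetical, and the image is assumed to be already binarized into rows of 0 (background) and 1 (ink).

```python
import re

# Assumed pattern for a "numerical value": integers, decimals, comma-grouped
# amounts, with optional sign, currency symbol, percent sign, or parentheses.
NUMBER_RE = re.compile(r"^\(?-?\$?\d[\d,]*(\.\d+)?%?\)?$")


def count_numeric_tokens(text):
    """Count whitespace-separated tokens that look like numbers."""
    return sum(1 for tok in text.split() if NUMBER_RE.match(tok))


def count_horizontal_lines(pixels, min_run_fraction=0.5):
    """Count horizontal lines in a binarized image.

    pixels is a list of rows, each a list of 0 (background) or 1 (ink).
    A row belongs to a line when its longest run of ink covers at least
    min_run_fraction of the row width. Adjacent such rows count as one line.
    """
    lines = 0
    in_line = False
    for row in pixels:
        longest = run = 0
        for px in row:
            run = run + 1 if px else 0
            longest = max(longest, run)
        is_line_row = bool(row) and longest >= min_run_fraction * len(row)
        if is_line_row and not in_line:
            lines += 1  # a new line starts on this row
        in_line = is_line_row
    return lines


def looks_like_table(count, threshold):
    """Treat the page as containing a table when the count exceeds the threshold."""
    return count > threshold


# Tiny synthetic page: one 2-pixel-thick rule, one short dash, one thinner rule.
page = [
    [0] * 10,
    [1] * 10,
    [1] * 10,
    [0] * 10,
    [1, 1, 0, 0, 0, 0, 0, 0, 0, 0],
    [0] + [1] * 8 + [0],
    [0] * 10,
]
ocr_text = "Revenue 1,234.50 2021 Total (45) n/a"

line_count = count_horizontal_lines(page)
num_count = count_numeric_tokens(ocr_text)
```

Suitable thresholds depend on the scans and would need to be tuned on pages known to contain tables; the values used with `looks_like_table` here are placeholders.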

We have also implemented a user interface through which a user can upload an image. The table is extracted as a JSON file and displayed as a table on an HTML page. We also post-processed the extracted text to remove garbage words. To run the program, first install the required Python packages, then run the command flask run from the directory that contains app.py.
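The README does not say how garbage words are identified, so the following is only one possible sketch of such a post-processing step. The function name, the `min_alnum_ratio` parameter, and both filtering rules are assumptions chosen for illustration: a token is dropped when fewer than half of its characters are letters or digits, or when it repeats the same letter four or more times in a row.

```python
import re

# Assumed rule: the same letter repeated 4+ times is OCR noise (e.g. "llllll").
# Digits are excluded so that amounts such as 10000 are kept.
REPEATED_LETTER_RE = re.compile(r"([A-Za-z])\1{3,}")


def remove_garbage_words(text, min_alnum_ratio=0.5):
    """Drop tokens that look like OCR noise and return the cleaned text."""
    kept = []
    for tok in text.split():
        alnum = sum(ch.isalnum() for ch in tok)
        if alnum / len(tok) < min_alnum_ratio:
            continue  # mostly punctuation or symbols
        if REPEATED_LETTER_RE.search(tok):
            continue  # long run of a single letter
        kept.append(tok)
    return " ".join(kept)


cleaned = remove_garbage_words("Total ~|_; 1,234.50 llllll revenue 10000 (45)")
```

The ratio of 0.5 is deliberately permissive so that financial tokens such as (45) or $5 survive the filter.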
