This project performs data capture (OCR recognition, NLP extraction), data cleansing (data proofreading), and data integration (merging the different materials into a single system) on three types of data: invoices, contracts, and bank statements.

shuoyuwang/Data-Acquisition-and-Preprocessing

Data collection

1. Register an application for the Baidu Intelligent Cloud OCR recognition engine and obtain its API Key and Secret Key.

image image

2. Write access_token.py to obtain the access_token value from the API Key and Secret Key.

image
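The token exchange in step 2 can be sketched as follows. This is a minimal illustration, not the repository's access_token.py: the function names are hypothetical, and it assumes Baidu's OAuth endpoint with the `client_credentials` grant, where the API Key is sent as `client_id` and the Secret Key as `client_secret`.

```python
import json
import urllib.parse
import urllib.request

TOKEN_URL = "https://aip.baidubce.com/oauth/2.0/token"


def build_token_url(api_key: str, secret_key: str) -> str:
    """Build the token request URL for the client-credentials grant."""
    query = urllib.parse.urlencode({
        "grant_type": "client_credentials",
        "client_id": api_key,         # API Key from the Baidu console
        "client_secret": secret_key,  # Secret Key from the Baidu console
    })
    return f"{TOKEN_URL}?{query}"


def fetch_access_token(api_key: str, secret_key: str) -> str:
    """POST to the token endpoint and return the access_token string."""
    request = urllib.request.Request(
        build_token_url(api_key, secret_key), data=b"", method="POST"
    )
    with urllib.request.urlopen(request, timeout=10) as response:
        payload = json.load(response)
    if "access_token" not in payload:
        raise RuntimeError(f"token request failed: {payload}")
    return payload["access_token"]
```

Keeping the URL construction separate from the network call makes it possible to check the request without sending the keys anywhere.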

3. Use the access_token to call the VAT invoice recognition API in Baidu's text recognition service, and write ocr_invoice.py to extract the invoice contents.

image
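A request for the invoice step might be assembled as below. This is a sketch under assumptions rather than the contents of ocr_invoice.py: it assumes the `vat_invoice` endpoint accepts a form-encoded, base64-encoded `image` field with the token in the query string, and that the response carries a `words_result` mapping. The helper names and the field names used in the usage example (`InvoiceNum`, `InvoiceDate`) are illustrative.

```python
import base64
import urllib.parse

VAT_INVOICE_URL = "https://aip.baidubce.com/rest/2.0/ocr/v1/vat_invoice"


def build_invoice_request(image_bytes: bytes, access_token: str):
    """Return (url, body, headers) for a VAT invoice recognition call."""
    query = urllib.parse.urlencode({"access_token": access_token})
    url = f"{VAT_INVOICE_URL}?{query}"
    # The image travels as a base64 string inside a form-encoded body.
    encoded_image = base64.b64encode(image_bytes).decode("ascii")
    body = urllib.parse.urlencode({"image": encoded_image}).encode("ascii")
    headers = {"Content-Type": "application/x-www-form-urlencoded"}
    return url, body, headers


def pick_invoice_fields(response: dict, wanted: list) -> dict:
    """Keep only the requested keys from the response's words_result mapping."""
    result = response.get("words_result", {})
    return {name: result.get(name, "") for name in wanted}
```

For example, `pick_invoice_fields(parsed_json, ["InvoiceNum", "InvoiceDate"])` returns a dictionary with those two keys, using an empty string for any field the response lacks.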

4. Use the access_token to call the Baidu General Text Recognition (High Precision Edition) API, and write ocr_contract.py to extract the contract. Baidu OCR has no API dedicated to contracts, so General Text Recognition is used instead. Unlike the invoice, the contract is a PDF file, so the request parameter must be changed to "pdf_file" for the PDF document to be read.

image
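The parameter change described in step 4 could look like the sketch below. It assumes the `accurate_basic` endpoint, a base64-encoded `pdf_file` field in place of `image`, an accompanying `pdf_file_num` page selector, and a response whose `words_result` is a list of objects with a `words` key. The helper names are hypothetical and not taken from ocr_contract.py.

```python
import base64
import urllib.parse

ACCURATE_BASIC_URL = "https://aip.baidubce.com/rest/2.0/ocr/v1/accurate_basic"


def build_pdf_body(pdf_bytes: bytes, page: int = 1) -> bytes:
    """Form-encode a PDF for recognition; pdf_file replaces the image field."""
    return urllib.parse.urlencode({
        "pdf_file": base64.b64encode(pdf_bytes).decode("ascii"),
        "pdf_file_num": page,  # assumed: 1-based page number to recognize
    }).encode("ascii")


def join_words(response: dict) -> str:
    """Concatenate the recognized lines into one text block."""
    return "\n".join(item["words"] for item in response.get("words_result", []))
```

Because one call covers a single page under these assumptions, a multi-page contract would need one request per page, with the `join_words` outputs appended in order.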

5. Use the access_token to call the Baidu General Text Recognition (High Precision Edition) API, and write ocr_bills.py to extract the bank statements.

image

6. Extract the standard fields from the bank statement txt file using NLP techniques.

image image image
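The README does not say which NLP technique is used for step 6. As a stand-in, the sketch below uses plain regular expressions to pull a date, a summary, and an amount out of one statement line. The line layout and the field names are assumptions made for illustration only.

```python
import re

# Hypothetical layout: "2021-03-05 Payment received ACME Ltd 12,500.00"
LINE_PATTERN = re.compile(
    r"(?P<date>\d{4}-\d{2}-\d{2})\s+"   # ISO-style transaction date
    r"(?P<summary>.+?)\s+"              # free-text description
    r"(?P<amount>-?[\d,]+\.\d{2})$"     # signed amount with two decimals
)


def extract_fields(line: str):
    """Return a dict of fields for one statement line, or None if it does not match."""
    match = LINE_PATTERN.search(line.strip())
    if match is None:
        return None
    return {
        "date": match.group("date"),
        "summary": match.group("summary"),
        "amount": float(match.group("amount").replace(",", "")),
    }
```

Lines that do not fit the pattern return `None`, which makes it easy to collect them separately for manual review.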

7. Save the NLP extraction results to a txt file for the subsequent data preprocessing steps.

image

Data preprocessing

1. Run oce_accuracy.py to re-extract the blurry or imprecisely recognized parts of the text with higher accuracy.

image

Result:

image
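How oce_accuracy.py decides which text is imprecise is not described here. One possible approach, shown purely as an assumption, is to request per-line confidence from the OCR call and flag lines whose average confidence falls below a threshold. The response shape (`words_result` items carrying a `probability.average` value) and the threshold are hypothetical choices for this sketch.

```python
def low_confidence_lines(response: dict, threshold: float = 0.9) -> list:
    """Return the recognized lines whose average confidence is below the threshold."""
    flagged = []
    for item in response.get("words_result", []):
        # Lines without a probability entry are skipped rather than guessed at.
        average = item.get("probability", {}).get("average")
        if average is not None and average < threshold:
            flagged.append(item["words"])
    return flagged
```

The flagged lines are the candidates for a second recognition pass or manual proofreading.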

2. Enter the final data, as identified from the NLP results and the oce_accuracy output, into separate Excel sheets.

image image image

Data integration

1. Based on the Excel file, write Creat_mysql.py to create the database, create 7 tables, and load the data from Excel into the database.

image image
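The create-and-load pattern from step 1 is sketched below. To keep the example runnable without a database server, it uses Python's built-in `sqlite3` in place of MySQL. The single `invoices` table and its columns are hypothetical, and the rows stand in for records read from the Excel file.

```python
import sqlite3

# Hypothetical schema; the project itself creates 7 tables in MySQL.
SCHEMA = """
CREATE TABLE IF NOT EXISTS invoices (
    invoice_num  TEXT PRIMARY KEY,
    invoice_date TEXT NOT NULL,
    total_amount REAL NOT NULL
)
"""


def load_rows(connection: sqlite3.Connection, rows: list) -> int:
    """Create the table if needed, insert the rows, and return the row count."""
    connection.execute(SCHEMA)
    with connection:  # commits on success, rolls back on error
        connection.executemany(
            "INSERT OR REPLACE INTO invoices "
            "VALUES (:invoice_num, :invoice_date, :total_amount)",
            rows,
        )
    return connection.execute("SELECT COUNT(*) FROM invoices").fetchone()[0]
```

Each row is a dictionary keyed by column name, so the same function works whichever library is used to read the spreadsheet.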

2. Write a program that performs calculations on the stored data:

(a) Total revenue and total net profit of the firm for each year;

image
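A yearly aggregation of this kind can be expressed with `GROUP BY`. The sketch below again uses `sqlite3` as a stand-in for MySQL, and the `finance` table with `year`, `revenue`, and `net_profit` columns is an assumed layout, not the project's actual schema.

```python
import sqlite3


def yearly_totals(connection: sqlite3.Connection) -> list:
    """Return (year, total_revenue, total_net_profit) tuples ordered by year."""
    query = """
        SELECT year,
               SUM(revenue)    AS total_revenue,
               SUM(net_profit) AS total_net_profit
        FROM finance
        GROUP BY year
        ORDER BY year
    """
    return connection.execute(query).fetchall()
```

The same statement should run unchanged on MySQL, since it relies only on standard `SUM` and `GROUP BY`.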

(b) Support for custom requirements.

Part of the SQL statement in the program is left open, so users can write their own SQL statements to implement whatever functions they need.

image
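One way to expose such an open SQL slot is a small helper that runs a user-supplied statement. The sketch below is an assumption about how this could be done, not the project's code. It uses `sqlite3` in place of MySQL and accepts only `SELECT` statements so that a custom query cannot modify the stored data.

```python
import sqlite3


def run_custom_query(connection: sqlite3.Connection, sql: str, params=()) -> list:
    """Execute a user-written SELECT statement and return all rows."""
    if not sql.lstrip().lower().startswith("select"):
        raise ValueError("only SELECT statements are accepted")
    # Values are passed separately so they are bound, not pasted into the SQL.
    return connection.execute(sql, params).fetchall()
```

A call such as `run_custom_query(conn, "SELECT * FROM finance WHERE year = ?", (2021,))` would return the matching rows, where `finance` is a hypothetical table name.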
