shahan007/SGCO-Scraper

SGCO-Scraper-Oct8

Internship project to automate work for the marketing and sales teams.
The scraper collects the types, categories and details of Singapore companies that are listed on the web.
The scraped and cleaned data is located in DataOutput: data.json, data.xlsx, extra_cleaned_data.xlsx, data_sheets.xlsx

Note

The ScrapeSgCo package is the scraper.
The scraping code resides in the ScrapeSgCo package; to run it, you only need to execute main.py, located in the base directory.
main.py writes the cleaned scraped data to data.json in the DataOutput directory.
To convert data.json to an Excel file, execute convert_to_excel.py, located in excel_util_scripts.


How to run?

Clone the repo

$ git clone https://github.com/shahan007/SGCO-Scraper-Oct8
$ cd SGCO-Scraper-Oct8

Setting up the environment (venv/Scripts/activate is the Windows path; on Linux or macOS use venv/bin/activate)

$ python -m venv venv
$ source venv/Scripts/activate
(venv) $ pip install -r requirements.txt

Run the Scraper

(venv) $ python main.py



Optional (convert data.json to an Excel file, for those who prefer working in Excel)

(venv) $ python ./excel_util_scripts/convert_to_excel.py
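
For readers curious about what such a conversion involves, here is a minimal sketch. It is not the code in convert_to_excel.py; it assumes data.json holds a JSON array of flat objects (one per company), and the function names are hypothetical. Writing the .xlsx file assumes pandas and openpyxl are installed.

```python
import json
from pathlib import Path


def load_records(json_path):
    """Load scraped records; assumes the file holds a list of flat objects."""
    with open(json_path, encoding="utf-8") as fh:
        records = json.load(fh)
    if not isinstance(records, list):
        raise ValueError("expected a JSON array of company records")
    return records


def to_table(records):
    """Return (header, rows), using the union of keys in first-seen order."""
    header = []
    for record in records:
        for key in record:
            if key not in header:
                header.append(key)
    # Records that lack a field get an empty cell in that column.
    rows = [[record.get(key, "") for key in header] for record in records]
    return header, rows


def write_excel(header, rows, xlsx_path):
    """Write the table to an .xlsx file (assumes pandas and openpyxl)."""
    import pandas as pd  # imported lazily so the helpers above work without it

    pd.DataFrame(rows, columns=header).to_excel(xlsx_path, index=False)


def main():
    # Paths follow the README: input and output both live in DataOutput.
    out_dir = Path("DataOutput")
    header, rows = to_table(load_records(out_dir / "data.json"))
    write_excel(header, rows, out_dir / "data.xlsx")
```

Calling main() from the base directory would read DataOutput/data.json and write DataOutput/data.xlsx under these assumptions.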

Optional (further clean the generated Excel file)

Cleans data.xlsx further so the data is easier to use. Prerequisite: data.xlsx must already exist, produced by running convert_to_excel.py.

(venv) $ python ./excel_util_scripts/xtra_clean_excel.py

Optional (split the cleaned Excel file into sheets by the WebCategory field)

(venv) $ python ./excel_util_scripts/cat_to_sheet.py
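
The idea of one sheet per WebCategory value can be sketched as follows. This is an illustration only, not the code in cat_to_sheet.py: the helper names are hypothetical, records are assumed to be flat dictionaries, and writing the workbook assumes pandas and openpyxl are installed. Excel limits sheet titles to 31 characters and forbids a few characters, so category names are sanitised first.

```python
import re
from collections import defaultdict

# Characters Excel does not allow in a sheet title.
INVALID_SHEET_CHARS = re.compile(r"[\[\]:*?/\\]")


def sheet_name(category):
    """Turn a category into a valid Excel sheet title (at most 31 chars)."""
    cleaned = INVALID_SHEET_CHARS.sub("_", str(category)).strip()
    return cleaned[:31] or "Uncategorised"


def group_by_category(records, field="WebCategory"):
    """Group records by sheet title derived from the given field."""
    groups = defaultdict(list)
    for record in records:
        groups[sheet_name(record.get(field, ""))].append(record)
    return dict(groups)


def write_sheets(groups, xlsx_path):
    """Write each group to its own sheet (assumes pandas and openpyxl)."""
    import pandas as pd  # imported lazily; only needed for the Excel output

    with pd.ExcelWriter(xlsx_path) as writer:
        for name, rows in groups.items():
            pd.DataFrame(rows).to_excel(writer, sheet_name=name, index=False)
```

Note that two long category names sharing the same first 31 characters would land on the same sheet in this sketch; a real script may need to disambiguate them.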

License

This project is licensed under the GNU General Public License v3.0 - see the LICENSE.md file for details.