JobsDB Job Scraper (Part I)

Click here to see Part II.

This is a customizable web scraper for full-time jobs on JobsDB.com HK with in-depth analysis.

This project is split into two parts: the Job Scraper (this part) and the Data Analysis, which uses the data scraped in the first part.

Part I - Job Scraper

The program in this part scrapes full-time jobs found under a particular keyword on JobsDB.com HK and stores the scraped data in a CSV file.

How to Use

Setting up the Environment

1. Clone the repository to your local machine. One way to do this is to pick a location on your local machine where you want the repository to be cloned (e.g., the desktop) and type git clone https://github.com/heiinhei911/job-insights.git into the terminal
2. Change your current directory (cd) to the location of the cloned repository on your local machine (e.g., cd Desktop/job-insights)
3. (Optional) Create a virtual environment for the cloned repository (e.g., venv, conda)
4. (Inside the virtual environment, if you created one in step 3) Type pip install -r requirements.txt. This installs all the packages and modules the program needs to run properly
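
After installing, a quick check that the main libraries can be imported may save a failed run later. The following is a minimal sketch, not part of the repository: the module names are assumed from the "Libraries/Frameworks Used" section, and requirements.txt remains the authoritative list.

```python
from importlib.util import find_spec

# Import names assumed from the "Libraries/Frameworks Used" section;
# the authoritative list is whatever requirements.txt specifies.
ASSUMED_MODULES = ["bs4", "selenium", "pandas", "sklearn", "spacy"]


def missing_modules(module_names):
    """Return the module names that cannot be found in the current environment."""
    return [name for name in module_names if find_spec(name) is None]


if __name__ == "__main__":
    missing = missing_modules(ASSUMED_MODULES)
    if missing:
        print("Missing modules:", ", ".join(missing))
    else:
        print("All assumed modules are importable.")
```

If anything is reported missing, re-running pip install -r requirements.txt inside the same environment is the first thing to try.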

Using the Program

1. Run the program by running python job-scraper.py in the terminal
2. Enter the keywords that you would like to search for (e.g., business analyst)
3. Enter the number of pages that you would like to search (either type a number to search a set number of pages, or type 'all' to search all pages)
4. Now we wait! (This step could take quite a while depending on factors such as the number of job postings being scraped, your internet speed, the specifications of your machine, etc.)
5. Once all the pages have been processed, all the data will be saved in a [the keyword you entered in step 2] + [today's date]/ folder under the jobs/ directory
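
To make the two inputs and the output location concrete, here is a hedged sketch of how the page-count answer and the output folder name could be handled. The function names, the underscore separator, and the ISO date format are assumptions for illustration; the actual script may name the folder differently.

```python
from datetime import date
from pathlib import Path


def parse_page_limit(raw):
    """Return None for 'all' (no limit) or a positive int for a set number of pages."""
    text = raw.strip().lower()
    if text == "all":
        return None
    pages = int(text)  # raises ValueError for non-numeric input
    if pages < 1:
        raise ValueError("number of pages must be at least 1")
    return pages


def build_output_dir(keyword, today, root="jobs"):
    """Build <root>/<keyword>_<date>; the separator and date format are assumptions."""
    slug = "-".join(keyword.lower().split())
    return Path(root) / f"{slug}_{today.isoformat()}"


if __name__ == "__main__":
    print(parse_page_limit("all"))
    print(build_output_dir("business analyst", date.today()))
```

Returning None for 'all' keeps the "no limit" case distinct from any real page count, so the calling loop can simply stop when there are no more pages.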

Output Files

After the scraping has been completed, you will find two files under the newly created folder: output.csv and stats.xlsx.

The output CSV contains all the scraped data related to your keywords. For each job, this includes the Title, Company, Years of Experience Required, Location, and Salary, among other fields.
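
As an illustration of working with the CSV afterwards, the sketch below counts postings per company using only the standard library. The exact header names in output.csv are an assumption based on the field names listed above, and the sample rows are made up so that the snippet runs on its own.

```python
import csv
import io
from collections import Counter


def count_by_column(csv_file, column):
    """Count how many rows share each value of `column` in an open CSV file."""
    reader = csv.DictReader(csv_file)
    return Counter(row[column] for row in reader)


# Stand-in for output.csv; the header names and rows are made up for illustration.
sample = io.StringIO(
    "Title,Company,Location,Salary\n"
    "Business Analyst,Acme Ltd,Central,\n"
    "Data Analyst,Acme Ltd,Kwun Tong,\n"
    "Business Analyst,Example Co,Central,\n"
)
jobs_per_company = count_by_column(sample, "Company")
print(jobs_per_company.most_common())
```

For the real file, replace the sample with an opened file object for output.csv in the newly created folder (opened with newline="" as the csv module recommends).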

The stats Excel file contains two sheets:

the first sheet (named "Common_Word_Freq") lists the 50 most frequently occurring common words across all the scraped job descriptions, along with their frequency counts
the second sheet (named "Org_Word_Freq") lists the most frequently occurring words across all the scraped job descriptions that are tagged with the "ORG" named entity in spaCy (in other words, words commonly associated with companies, agencies, institutions, etc.), along with their frequency counts

(The frequency counts in both sheets are arranged in descending order)
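
The idea behind the two sheets can be sketched as follows. This is not the project's implementation: collections.Counter with a tiny stop-word list stands in for whatever the script does with Scikit-learn, and the function names are hypothetical. The second function only relies on the doc.ents, ent.label_, and ent.text attributes that spaCy documents expose.

```python
import re
from collections import Counter

# Tiny illustrative stop-word list; a real run would use a much fuller one.
STOP_WORDS = {"a", "an", "and", "the", "to", "of", "in", "for", "with", "is", "are"}


def top_common_words(descriptions, n=50):
    """Return the n most frequent non-stop-words as (word, count), highest count first."""
    counts = Counter()
    for text in descriptions:
        words = re.findall(r"[a-z]+", text.lower())
        counts.update(w for w in words if w not in STOP_WORDS)
    return counts.most_common(n)


def top_org_entities(docs, n=None):
    """Count entity texts labelled ORG across spaCy Doc-like objects, highest count first."""
    counts = Counter(
        ent.text for doc in docs for ent in doc.ents if ent.label_ == "ORG"
    )
    return counts.most_common(n)
```

With spaCy and an English model installed (the model name en_core_web_sm is an assumption), the documents could be produced with nlp = spacy.load("en_core_web_sm") followed by docs = nlp.pipe(descriptions). Counter.most_common already returns counts in descending order, matching the ordering described for both sheets.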

Libraries/Frameworks Used

Part I - Job Scraper: Python, Beautiful Soup, Selenium, Pandas, Scikit-learn, spaCy

Credits

This project and its data are intended for educational purposes only.

