JobsDB Job Scraper (Part I)

Click here to see Part II.

This is a customizable web scraper for full-time jobs on JobsDB.com HK, paired with an in-depth analysis of the scraped data.

This project is split into two parts: the Job Scraper (this repository) and the Data Analysis, which uses the data scraped in the first part.

Part I - Job Scraper

The program in this part scrapes full-time jobs found under a particular keyword on JobsDB.com HK and stores the scraped data in a CSV file.
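
At a high level, the flow is: load each results page, parse the job cards, and append rows to a CSV. Below is a minimal sketch of that flow using the stack listed under "Libraries/Frameworks Used" (Selenium, Beautiful Soup, pandas); the search-URL pattern and CSS selectors are hypothetical placeholders, not JobsDB's actual markup or the program's real code.

```python
# Minimal sketch of the scrape-and-save flow. The URL pattern and the
# CSS selectors below are hypothetical; job-scraper.py's real logic differs.
import pandas as pd
from bs4 import BeautifulSoup
from selenium import webdriver

driver = webdriver.Chrome()
rows = []
for page in range(1, 4):  # e.g., the first three result pages
    driver.get(f"https://hk.jobsdb.com/hk/search-jobs/business-analyst/{page}")
    soup = BeautifulSoup(driver.page_source, "html.parser")
    for card in soup.select("article.job-card"):  # hypothetical job-card selector
        title = card.select_one("h3.title")       # hypothetical field selectors
        company = card.select_one("span.company")
        rows.append({
            "Title": title.get_text(strip=True) if title else "",
            "Company": company.get_text(strip=True) if company else "",
        })
driver.quit()

pd.DataFrame(rows).to_csv("output.csv", index=False)
```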

How to Use

Setting up the Environment

  1. Clone the repository to your local machine. One way to do this is to pick a location where you want the repository to be cloned (e.g., the desktop) and type git clone https://github.com/heiinhei911/job-insights.git into the terminal
  2. Change your current directory (cd) to the location of the cloned repository on your local machine (e.g., cd Desktop/job-insights)
  3. (Optional) Create a virtual environment for the cloned repository (e.g., venv, conda)
  4. (Inside the virtual environment, if you created one in step 3) Run pip install -r requirements.txt. This installs all the necessary packages and modules so that the program can run properly

Using the Program

  1. Run the program by typing python job-scraper.py in the terminal

  2. Enter the keywords that you would like to search for (e.g., business analyst)

    (Screenshot: job title input)

  3. Enter the number of pages to search (type a number to search that many pages, or type 'all' to search every page)

    (Screenshot: number of pages input)

  4. Now we wait!

    (This step can take quite a while depending on factors such as the number of job postings being scraped, your internet speed, the specifications of your machine, etc.)

  5. Once all the pages have been processed, all the data is saved in a [the keyword you entered in step 2] + [today's date]/ folder under the jobs/ directory (the naming scheme is sketched after this list)

    (Screenshot: scraping completed)
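
As a rough sketch of the two behaviours described above (the 'all' option in step 3 and the output-folder naming in step 5), assuming helpers like these; the actual names and date format in job-scraper.py may differ:

```python
import os
from datetime import date
from typing import Optional

def parse_page_count(raw: str) -> Optional[int]:
    """Interpret the step-3 input: a number of pages, or None for 'all' (hypothetical helper)."""
    raw = raw.strip().lower()
    return None if raw == "all" else int(raw)

def make_output_folder(keyword: str) -> str:
    """Create the "[keyword] + [today's date]/" folder under jobs/ (step 5)."""
    folder = os.path.join("jobs", f"{keyword} {date.today().isoformat()}")
    os.makedirs(folder, exist_ok=True)
    return folder

print(parse_page_count("all"))                 # None -> scrape every page
print(make_output_folder("business analyst"))  # e.g., jobs/business analyst 2024-05-01 (date format assumed)
```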

Output Files

After the scraping has been completed, you will find two files under the newly created folder: output.csv and stats.xlsx.

(Screenshot: output files)

The output CSV contains all the scraped data related to your keywords, including the Title, Company, Years of Experience Required, Location, Salary, etc. of each job.
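
Since output.csv is a regular CSV, it can be explored further with pandas, for example (the path and the exact column headers here are illustrative; match them to your own run):

```python
import pandas as pd

# Hypothetical path from a previous run of the scraper
df = pd.read_csv("jobs/business analyst 2024-05-01/output.csv")
print(df[["Title", "Company", "Location", "Salary"]].head())
```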

The stats Excel file contains two sheets:

  • the first sheet (named "Common_Word_Freq") lists the 50 most frequently occurring common words from all the scraped job descriptions, along with their frequency counts

  • the second sheet (named "Org_Word_Freq") summarizes the most frequently occurring words from all scraped job descriptions that spaCy tags with the "ORG" named-entity label (in other words, words commonly associated with companies, agencies, institutions, etc.), along with their frequency counts

    (The frequency counts in both sheets are arranged in descending order)
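
For orientation, here is a minimal sketch of how two such sheets could be produced with scikit-learn (common-word counts) and spaCy (ORG entities); the model name, preprocessing, and output layout are assumptions, not necessarily what job-scraper.py does:

```python
# Sketch only: model name, preprocessing, and output layout are assumptions.
from collections import Counter

import pandas as pd
import spacy
from sklearn.feature_extraction.text import CountVectorizer

descriptions = [
    "Example Corp is hiring a business analyst with SQL experience.",
    "Join Acme Ltd as a junior data analyst.",
]  # in practice, the scraped job descriptions

# "Common_Word_Freq": the 50 most frequent words (English stop words removed)
vec = CountVectorizer(stop_words="english")
counts = vec.fit_transform(descriptions).sum(axis=0).A1
common = (pd.Series(counts, index=vec.get_feature_names_out())
            .sort_values(ascending=False)
            .head(50))

# "Org_Word_Freq": words spaCy tags with the "ORG" named-entity label
nlp = spacy.load("en_core_web_sm")  # assumed English model
org_counts = Counter(
    ent.text
    for doc in nlp.pipe(descriptions)
    for ent in doc.ents
    if ent.label_ == "ORG"
)
org = pd.Series(org_counts).sort_values(ascending=False)

with pd.ExcelWriter("stats.xlsx") as writer:
    common.to_excel(writer, sheet_name="Common_Word_Freq")
    org.to_excel(writer, sheet_name="Org_Word_Freq")
```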

Libraries/Frameworks Used

Part I - Job Scraper: Python, Beautiful Soup, Selenium, pandas, scikit-learn, spaCy

Credits

This project and its data are intended for educational purposes only.

All data comes from JobsDB.com HK. All rights reserved by JobsDB.com HK.
