occupationcoder-international

A tool to assign standard occupational classification codes to job descriptions

This repository is a development of the code included in the Python package occupationcoder, with the original codebase by Jyldyz Djumalieva, Arthur Turrell, David Copple, James Thurgood, and Bradley Speigner.

This updated version of occupationcoder adds functionality to code job descriptions to the International Standard Classification of Occupations 2008 (ISCO), while retaining the original functionality for the UK 3-digit Standard Occupational Classification (SOC) coding scheme.

In addition, this update includes functionality (build_dict.py) to create custom dictionaries for other coding schemes, provided the scheme is supplied in a suitable input format. We provide a .ipynb notebook that demonstrates how to use this. We cannot guarantee the effectiveness of this method with other coding schemes, because schemes differ in layout and in the level of detail available.

DISCLAIMER

The code contained within this repository is provided 'as is'. We stress that:

Any use of this code is entirely at the risk of the user, and users are fully responsible for checking whether the codebase is suitable for their use case, as well as the quality and accuracy of any outputs generated.
The dictionaries included in this repository are provided as examples only and should not be considered as official versions of any occupation coding scheme: it is the sole responsibility of the user of this codebase to check whether the dictionaries used are correct and suitable for their use case.
(Co-)authors of this codebase at the Office for National Statistics Data Science Campus do not commit to responding to requests for additional features or long-term maintenance of the codebase.
This approach to occupation coding has only been tested with data written in English; we cannot guarantee it will work for other languages.

Please see CONTRIBUTING.md for further details on expected maintenance, bugs etc.

Using `occupationcoder`

Unlike the original package, this version assumes that inputs are provided as a .csv file rather than as single job descriptions. To code a single record, supply an input file containing just that record. Information on how to format inputs, how to run the tool, and what outputs to expect is provided below.

Getting started

After cloning the repository locally, we suggest setting up a new virtual environment to house the necessary packages (e.g. python -m venv .coder-env, then activate it as appropriate for your OS).

To install dependencies and set up the package locally, run the following in a command line interface, in the base directory of this repository (i.e. where this README.md is located):

pip install -r requirements.txt
python setup.py install

Input format

Expected data input is as per the tests/test_vacancies.csv file. The file has three columns, with headers and content as follows:

job_title: Specific title of the job to code. occupationcoder will use this to attempt an exact match against any specific job titles listed in the target scheme. This is the only field that is treated separately and used for an attempt at an exact match.
job_description: Expected to be an extended description of the given job, including e.g. tasks or further context. E.g. where job_title is "dentist", job_description might be "providing dental care to patients".
job_sector: industrial/sectoral description for the given job. E.g. "medical".
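
As an illustration of the three-column layout described above, the following minimal sketch writes one record with Python's standard csv module. The helper name write_vacancies is hypothetical and not part of occupationcoder; only the column headers and the example values come from this README.

```python
import csv
import io

# Column headers taken from the input format described above.
FIELDNAMES = ["job_title", "job_description", "job_sector"]


def write_vacancies(rows, handle):
    """Write job records to an open text handle as CSV with the expected headers.

    Hypothetical helper for illustration; not part of occupationcoder.
    """
    writer = csv.DictWriter(handle, fieldnames=FIELDNAMES)
    writer.writeheader()
    for row in rows:
        writer.writerow(row)


rows = [
    {
        "job_title": "dentist",
        "job_description": "providing dental care to patients",
        "job_sector": "medical",
    },
]

# Write to an in-memory buffer so the result can be inspected directly.
buffer = io.StringIO()
write_vacancies(rows, buffer)
csv_text = buffer.getvalue()
```

To produce a file instead, open it with open("my_input_file.csv", "w", newline="") and pass the handle to write_vacancies.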

For each row in the input data set, occupationcoder will attempt to find an exact match for job_title in the given coding scheme. If one is found, the appropriate code from the coding scheme will be returned. If no exact match is found, the three best "fuzzy" matches using TF-IDF will be returned, using information combined from job_title, job_description and job_sector.
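
The exact-match-then-fuzzy logic described above can be sketched in plain Python. This is a simplified illustration, not the package's implementation: the tokenisation (lowercase, whitespace split), the smoothed IDF formula, the function names, and the codes "A1" to "D4" with their descriptions are all assumptions invented for the example and do not correspond to real SOC or ISCO entries.

```python
import math
from collections import Counter


def tokenize(text):
    # Simplifying assumption: lowercase and split on whitespace only.
    return text.lower().split()


def tfidf_vectors(docs):
    """Return one {term: weight} dict per document (raw TF times smoothed IDF)."""
    tokenized = [tokenize(doc) for doc in docs]
    n_docs = len(tokenized)
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))
    idf = {t: math.log((1 + n_docs) / (1 + c)) + 1 for t, c in doc_freq.items()}
    return [
        {term: tf * idf[term] for term, tf in Counter(tokens).items()}
        for tokens in tokenized
    ]


def cosine(a, b):
    """Cosine similarity between two sparse {term: weight} vectors."""
    dot = sum(weight * b.get(term, 0.0) for term, weight in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def code_job(title, description, sector, titles_to_codes, code_descriptions, top_n=3):
    """Return [(code, None)] on an exact title match, else the top_n (code, score) pairs."""
    key = title.strip().lower()
    if key in titles_to_codes:
        return [(titles_to_codes[key], None)]  # no score for exact matches
    codes = list(code_descriptions)
    query = " ".join([title, description, sector])
    vectors = tfidf_vectors([code_descriptions[c] for c in codes] + [query])
    query_vec = vectors[-1]
    scored = [(c, cosine(query_vec, v)) for c, v in zip(codes, vectors[:-1])]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_n]


# Invented toy scheme, for illustration only.
TITLES = {"dentist": "A1"}
DESCRIPTIONS = {
    "A1": "dental care teeth patients",
    "B2": "teaching pupils school lessons",
    "C3": "driving lorry goods delivery",
    "D4": "nursing care hospital patients",
}

exact = code_job("dentist", "providing dental care to patients", "medical", TITLES, DESCRIPTIONS)
fuzzy = code_job("school teacher", "teaching lessons to pupils", "education", TITLES, DESCRIPTIONS)
```

Here exact holds a single code with no score, while fuzzy holds three ranked candidates, mirroring the two cases described above.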

Running in the command line

To code the example input file from the command line, use the following (note that by not supplying a value for the scheme argument, this codes to the default scheme, which is SOC):

python occupationcoder/coder.py --in_file="tests/test_vacancies.csv"

Adjust the value given for the in_file argument to code a different input file.

To code to the ISCO coding scheme instead, use the scheme parameter:

python occupationcoder/coder.py --in_file="tests/test_vacancies.csv" --scheme="isco"

Note that the scheme argument looks for a directory with the same name under occupationcoder/dictionaries. Out of the box, we provide the dictionaries for the SOC scheme as used by the original package, and we have added corresponding ISCO dictionaries.

The dictionaries included in this repository are provided as examples only and should not be considered as official versions of any occupation coding scheme: it is the sole responsibility of the user of this codebase to check whether the dictionaries used are correct and suitable for their use case.

By default, coder.py provides "long" output, i.e. including TF-IDF scores for each fuzzy prediction (note that these are not given if an exact match is found). To suppress outputting scores, set the output argument to "single" instead of "multi":

python occupationcoder/coder.py --in_file="tests/test_vacancies.csv" --scheme="isco" --output="single"
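
If the tool needs to be driven from a Python script rather than typed by hand, the commands above can be assembled programmatically. The sketch below only builds the argument list; build_coder_command is a hypothetical helper, and the only flags it uses are in_file, scheme and output as shown in this README.

```python
import sys


def build_coder_command(in_file, scheme=None, output=None):
    """Assemble the argument list for coder.py from the options described above.

    Hypothetical helper for illustration; not part of occupationcoder.
    Omitting scheme or output leaves the tool's own defaults in place.
    """
    command = [sys.executable, "occupationcoder/coder.py", f"--in_file={in_file}"]
    if scheme is not None:
        command.append(f"--scheme={scheme}")
    if output is not None:
        command.append(f"--output={output}")
    return command


command = build_coder_command("tests/test_vacancies.csv", scheme="isco", output="single")
```

The resulting list can be passed to subprocess.run(command, check=True), run from the base directory of the repository so that the relative paths resolve.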

For a full description of all available arguments in coder.py:

python occupationcoder/coder.py --help

Creating custom or bespoke dictionaries from coding schemes

The current repository provides example dictionaries to allow coding to the SOC and ISCO schemes. Although these work and can be used to code occupations (as described above, using the scheme argument for coder.py), they should be considered examples only, and it is the responsibility of the user to check that the codes used are correct and suitable for their use case.

We have provided code to create bespoke dictionaries from coding schemes (provided the latter are presented in a suitable format). The Python code for this can be found in build_dict.py; its use is illustrated in the Jupyter notebook building_custom_dictionaries.ipynb. Any use of this, again, is at the user's own risk.

When placed in a subdirectory of the dictionaries folder, custom dictionaries (formatted as .json files) should be accessible by using that subdirectory's name as the value for the scheme parameter of coder.py (e.g. coder.py --in_file="my_input_file.csv" --scheme="my_custom_scheme").
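
Before running coder.py with a custom scheme, it can help to check which subdirectories actually contain dictionaries. The sketch below lists subdirectories holding at least one .json file. The helper available_schemes and the file name example.json are hypothetical; the specific file names that coder.py expects inside each scheme directory are not stated here, so this check is only a first sanity test.

```python
import tempfile
from pathlib import Path


def available_schemes(dictionaries_dir):
    """Return the names of subdirectories that contain at least one .json file.

    Hypothetical helper for illustration; not part of occupationcoder.
    """
    root = Path(dictionaries_dir)
    return sorted(
        path.name
        for path in root.iterdir()
        if path.is_dir() and any(path.glob("*.json"))
    )


# Demonstration on a throwaway directory rather than the real repository.
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    (root / "my_custom_scheme").mkdir()
    (root / "my_custom_scheme" / "example.json").write_text("{}")
    (root / "empty_scheme").mkdir()  # no .json files, so it is not listed
    found = available_schemes(root)
```

In the repository itself, the directory to inspect would be occupationcoder/dictionaries.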

Pre-requisites

All required packages are specified in requirements.txt.

Testing

Assuming setup.py has been run as above, to run the tests in your virtual environment, use

python -m unittest

in the top level occupationcoder directory. Look in test_occupationcoder.py for what is run and for examples of use. The output appears in the processed_jobs.csv file in the outputs folder.

Credits

As above, this is a development based on occupationcoder authored by Jyldyz Djumalieva, Arthur Turrell, David Copple, James Thurgood, Bradley Speigner and Martin Wood.
