SIC-SOC-LLM

Overview

This app/package has been created by the Data Science Campus as a proof of concept to evaluate Large Language Models (LLM) potential to assist with classification coding. It uses the LangChain library to perform Retrieval Augmented Generation (RAG) based on the provided classification index. A special case of Standard Industrial Classification (SIC) coding has been used as the primary test case, see method explanation. An example deployment using Streamlit allows for interactive exploration of the model's capabilities.

Data sources

Examples of simplified SIC, Standard Occupational Classification (SOC) and Classification of Individual Consumption According to Purpose (COICOP) are included in the example_data folder. These condensed indices are flattened subsets of more detailed indices officially published online, such as the UK SIC 2007, UK SOC 2020, and COICOP 2018 (pdf).

⚠️ Warning: The example data is provided for demonstration purposes only. No guarrantee is given for its accuracy or up to date status.

In this project, we focused on the SIC. A flexible representation of this hierarchical index (including metadata) has been implemented within the data_models submodule, enabling enhanced context for RAG/LLM. This representation can be used independently for other SIC coding tasks or easily extended to accommodate different classification indices.

The SIC index hierarchy object is built using three data sources provided by ONS:

Published UK SIC summary of structure worksheet (xlsx) - location needs to be specified in config
UK SIC2007 indexes with addendum December 2022 (xlsx) - location needs to be specified in config
SIC resource file by ONSdigital/dp-classification-tools (js) - included inside the package

Installation

1. Virtual environment

It is recommended that you install the project with its required dependencies in a virtual environment. When the virtual environment is activated, any subsequent Python commands will use the Python interpreter and libraries specific to that isolated environment. This ensures that the project uses the correct versions of the dependencies specified in its requirements.

Create and activate a new virtual environment on Linux/OS X:

python3.10 -m venv .venv
source .venv/bin/activate

2. Requirements

Update pip and install requirements:

python -m pip install --upgrade pip
python -m pip install -e ".[dev]"

The -e flag installs the project in "editable" mode, which means that any changes made to the project code will be reflected immediately without the need to reinstall. The ".[dev]" part specifies that both the regular requirements and the development requirements should be installed.

3. LLM authentication:

The package provides code to use popular LLMs, access to the LLMs is a perquisite for use. Depending on your choice, keys/credentials may need to be added, for example:

Include a personal OpenAI API in .env as

OPENAI_API_KEY="<your key>"

Authenticate for Vertex AI:

gcloud config set project "<PROJECT_ID>"
gcloud auth application-default login

Usage

Examples of how to use the sic-soc-llm package can be found in Tutorials and References.

Configuration

The sic-soc-llm package uses a configuration file in TOML format to specify the paths to the data files and the names of the models to use. An example configuration file is provided in sic_soc_llm_config.toml and is read by the get_config function. The following fields are required:

Field	Type	Default value
[lookups]
sic_structure	str	"data/sic-index/publisheduksicsummaryofstructureworksheet.xlsx"
sic_index	str	"data/sic-index/uksic2007indexeswithaddendumdecember2022.xlsx"
sic_condensed	str	"sic_2d_condensed.txt"
soc_condensed	str	"soc_4d_condensed.txt"
coicop_condensed	str	"coicop_5d_condensed.txt"
[llm]
db_dir	str	"data/sic-index/db"
embedding_model_name	str	"all-MiniLM-L6-v2"
llm_model_name	str	"gemini-pro"

Make sure to update the file paths and model names according to your specific setup. While the condensed indexes (.txt) are included in the package, the .xlsx files need to be downloaded from the ONS website (mentioned above) and placed in the specified locations.

Run and deploy Streamlit app

To run the Streamlit app, use the following command:

streamlit run app/Welcome.py --server.port 8500

The app will be available at http://localhost:8500/.

Example commands used to build and deploy the app as a GCP Cloud Run service are provided in cloud_deploy.sh (which references Dockerfile and app.yaml). The Dockerfile contains a set of instructions for building a Docker image. It specifies the base image to use, the files and directories to include, the dependencies and the commands to run. The app.yaml file is used to specify the configuration of the Cloud Run service, including the container image to deploy, the service name, and the port to expose.

Development and testing

1. Pre-commit actions

This repository contains a configuration of pre-commit hooks. If approaching this project as a developer, you are encouraged to install and enable pre-commits by running the following in your shell:

pip install pre-commit
pre-commit install

2. Unit tests

To run the unit tests, use the following command:

python -m pytest

3. Building documentation and webpage:

Build (Quatro markdown) reference files from docstrings:

cd docs
python -m quartodoc build

Render webpage from Quarto markdowns in docs dir (including reference files):

quarto render

License

The code, unless otherwise stated, is released under the MIT Licence. The documentation for this work is subject to © 2024 Crown Copyright (Office for National Statistics) and is available under the terms of the Open Government 3.0 licence.

Data Science Campus

At the Data Science Campus we apply data science, and build skills, for public good across the UK and internationally. Get in touch with the Campus at datasciencecampus@ons.gov.uk.

Name		Name	Last commit message	Last commit date
Latest commit History 6 Commits
.github/workflows		.github/workflows
app		app
docs		docs
src/sic_soc_llm		src/sic_soc_llm
tests		tests
.flake8		.flake8
.gcloudignore		.gcloudignore
.gitignore		.gitignore
.pre-commit-config.yaml		.pre-commit-config.yaml
Dockerfile		Dockerfile
LICENSE		LICENSE
README.md		README.md
app.yaml		app.yaml
cloud_deploy.sh		cloud_deploy.sh
codecov.yml		codecov.yml
pyproject.toml		pyproject.toml

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

SIC-SOC-LLM

Overview

Data sources

Installation

1. Virtual environment

2. Requirements

3. LLM authentication:

Usage

Configuration

Run and deploy Streamlit app

Development and testing

1. Pre-commit actions

2. Unit tests

3. Building documentation and webpage:

License

Data Science Campus

About

Releases

Packages

Contributors 2

Languages

License

datasciencecampus/sic-soc-llm

Folders and files

Latest commit

History

Repository files navigation

SIC-SOC-LLM

Overview

Data sources

Installation

1. Virtual environment

2. Requirements

3. LLM authentication:

Usage

Configuration

Run and deploy Streamlit app

Development and testing

1. Pre-commit actions

2. Unit tests

3. Building documentation and webpage:

License

Data Science Campus

About

Resources

License

Stars

Watchers

Forks

Releases

Packages 0

Contributors 2

Languages

Packages