crahal/NHSSpend

📊 NHSSpend: Tools and data for NHS procurement 📈

Introduction

This is a library to scrape and reconcile all payments made by a hierarchy of NHS institutions over time (c. 2010 to 2020). It is the last of three projects on public procurement data (the first two of which were centgovspend and TSRC-NCVO-CSDP). Code for an interactive dashboard, written with the help of Ian M. Knowles, is found at src/dashboard. Links to open-access (OSF) versions of the two headline academic papers which use this dataset ("The Role of Non-Profits in Public Health Service Provision: Evidence from 25,338 heterogeneous procurement datasets" with John Mohan and "Is outsourcing healthcare services to the private sector associated with higher mortality rates? An observational analysis of privatisation in England's NHS, 2013-2020" by Ben Goodair and Aaron Reeves) can be found here and here. A full, build-passing notebook for the first of these two papers can be found here. If you would like to collaborate on related work, please don't hesitate to get in touch! Two spin-off repositories, specifically for pdf-parsing and institutional data curation, can be found here and here respectively.

Pre-reqs

NHSSpend tries to minimize the number of pre-requisite installations outside of the standard library, and we recommend an Anaconda installation to provide a comprehensive set of basic tools. However, a couple are necessary due to the magnitude of the undertaking. These include a range of modules found in the requirements.txt file (generated by pipreqs). The pdfparser is based on a version of the pdftableparser library, and the Charity Commission data is extracted using the charity-commission-extract library from NCVO. The Elasticsearch functionality is a custom implementation.

Data Origination

The data originates from one of two lists of recognised NHS institutions (Trusts and CCGs) and the main NHS England data provision page. These lists are used to create mappings to websites and to track the status of the data (data/data_support/ccg_list.xlsx and data/data_support/trust_list.xlsx), with a number of different parameters fed into the scraper (src/NHSscraper.py). The data curation exercise stopped as of April 2020 in order to focus on the analysis of the data, with the compressed datasets found in the data/merged/* subdirectory of this repository. This is also partly due to the Covid-19 pandemic and the restructuring of Clinical Commissioning Groups more generally (where 18 mergers took the number of CCGs from 191 to 136). However, please do raise issues on here if you think any of those institutions are mislabelled or outdated. If you want to update this list (and the subsequent scrapers), please do raise an issue or get in touch (this is a constant work in progress until there is a centrally convened resource provided by the Government Data Service).

The procurement data itself is provided under an Open Government Licence (OGL). Guidance for publishing spend over £25,000 is published by HM Treasury.
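As a rough illustration of what the £25,000 publication threshold means for a payments file, the sketch below filters rows by amount. It is not taken from this repository: the function names, the `supplier` and `amount` keys, and the example rows are all hypothetical, and the strict "greater than" comparison is one reading of "over £25,000".

```python
from decimal import Decimal

# Threshold taken from the guidance mentioned above (spend over £25,000).
THRESHOLD_GBP = Decimal("25000")


def parse_amount(raw: str) -> Decimal:
    """Convert a string such as '£31,250.00' into a Decimal."""
    cleaned = raw.replace("£", "").replace(",", "").strip()
    return Decimal(cleaned)


def over_threshold(payments, threshold=THRESHOLD_GBP):
    """Keep payments whose amount is strictly above the threshold."""
    return [p for p in payments if parse_amount(p["amount"]) > threshold]


# Made-up rows purely for illustration; not drawn from the dataset.
example_payments = [
    {"supplier": "Example Supplier A", "amount": "£31,250.00"},
    {"supplier": "Example Supplier B", "amount": "24,999.99"},
    {"supplier": "Example Supplier C", "amount": "25000"},
]

kept = over_threshold(example_payments)
```

Using `Decimal` rather than `float` avoids rounding surprises when amounts sit right at the threshold.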

Reconciliation

The es_configure.md file describes the reconciliation approach. These reconciliations are then manually verified and merged back into the procurement data.
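To give a feel for what reconciling supplier names against a register involves, here is a minimal sketch using only the standard library. It is a stand-in, not the approach described in es_configure.md (which is built on Elasticsearch): the `normalise` and `best_match` helpers, the list of stripped suffixes, and the 0.85 cutoff are all assumptions made for illustration.

```python
import difflib
import re


def normalise(name: str) -> str:
    """Lower-case, replace punctuation with spaces, drop common company suffixes."""
    name = name.lower()
    name = re.sub(r"[^a-z0-9 ]", " ", name)
    name = re.sub(r"\b(ltd|limited|plc|llp)\b", " ", name)
    return " ".join(name.split())


def best_match(supplier, register, cutoff=0.85):
    """Return (register_name, score) for the closest match, or None.

    The cutoff is an arbitrary illustrative value, not one used by the project.
    """
    target = normalise(supplier)
    best = None
    for candidate in register:
        score = difflib.SequenceMatcher(None, target, normalise(candidate)).ratio()
        if score >= cutoff and (best is None or score > best[1]):
            best = (candidate, score)
    return best


# Hypothetical register entries and supplier strings.
register = ["Acme Healthcare Limited", "Other Org"]
match = best_match("ACME HEALTHCARE LTD.", register)
no_match = best_match("Zzz Unrelated", register)
```

Any automated match of this kind would still need the manual verification step described above before being merged back into the procurement data.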

Clean, Reconciled Data

It is possible that you are reading this because you are most interested in a copy of the output data! The scraped, parsed, cleaned and reconciled data can be found at NHSSpend/data/data_final. Please see the readme.md in that subdirectory for information on each of the fields.

Structure

Repo structure is based on the tree utility.

├ readme.md
├ es_configure.md
├ requirements.txt
├ src
│ └ analysis
│ │ ├ charity_analysis_notebook.ipynb
│ │ ├ general_analysis_functions.py
│ │ ├ helper_functions.py
│ │ ├ charity_analysis_functions.py
│ ├ scrape_and_parse_ccgs.py
│ ├ scrape_and_parse_trusts.py
│ ├ scraping_tools.py
│ ├ generate_output.py
│ ├ ingest_everything.py
│ ├ merge_and_evaluate_tools.py
│ ├ NHSSpend.py
│ ├ parsing_tools.py
│ ├ pdf_table_parser.py
│ ├ preconciliation.py
├ dashboard
├ data
│ └ data_support/*
│ └ data_cc/*
│ └ data_ch/*
│ └ data_dashboard/*
│ └ data_final/*
│ └ data_masteringest/*
│ └ data_merge/*
│ └ data_nhsccgs/*
│ └ data_nhsdigital/*
│ └ data_nhsengland/*
│ └ data_nhstrusts/*
│ └ data_reconciled/*
│ └ data_shapefiles/*
│ └ data_summary/*
├ papers
│ └ corporate_networks
│ └ figures
│ └ tables
│ └ third_sector
├ logging
│ │ ├ nhsspend.log
│ └ eval_logs
├ tokens

Acknowledgements

The authors are grateful for comments on earlier versions of the work from Mark Exworthy, David Stuckler, Martin Mckee, Lucy Reynolds and James Rees. Technical research assistance was provided by Ian Knowles. This work originates from a scoping and prototyping exercise funded by the ESRC (grant numbers ES/M010392/1 and latterly ES/X000524/1), with majority funding latterly and gratefully acknowledged from the British Academy and the Leverhulme Trust (Grant RC-2018-003), the Leverhulme Centre for Demographic Science (LCDS), and Nuffield College. Insightful comments were gratefully received from participants at the International Conference for Administrative Data Research, the Economic Insights team at the Office for National Statistics, the Spatial Unit at the Department for Levelling Up, the Government Data Science Community Meetup, two editors, and two anonymous referees. Additional thanks are due to Max Hattersly, Ben Goodair and Yu Pei for all of their work on data verification.

Licensing

This code is made available under a GNU GENERAL PUBLIC LICENSE 3.0.

Last updated: 2024-07-28