You can’t handle the (dirty) truth: Data-centric insights improve pseudo-labeling

seedatnabeel/DIPS

arXiv License: MIT

This repository contains the implementation of DIPS, a data-centric method to improve pseudo-labeling under imperfect or noisy 'labeled' data, from the paper "You can’t handle the (dirty) truth: Data-centric insights improve pseudo-labeling".

DIPS improves a variety of state-of-the-art pseudo-labeling algorithms (semi-supervised learning algorithms) via data-centric insights.

For more details, please read our DMLR paper: You can’t handle the (dirty) truth: Data-centric insights improve pseudo-labeling.

Installation

  1. Clone the repository.
  2. (a) Create a new virtual environment with Python 3.10, e.g.:
    virtualenv dips_env
  2. (b) Or create a new conda environment with Python 3.10, e.g.:
    conda create -n dips_env python=3.10
  3. With the venv or conda env activated, run the following command from the repository directory to install the minimum requirements to run DIPS:
    pip install -r requirements.txt
  4. Link the environment to the kernel:
    python -m ipykernel install --user --name=dips_env
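
After completing the steps above, it can help to confirm that the active interpreter matches what the instructions expect. The snippet below is a minimal sketch, not part of the DIPS codebase: the helper name `check_environment` is hypothetical, and it only checks the Python version (3.10, as stated above) and that `ipykernel` is importable.

```python
import importlib.util
import sys


def check_environment(required_version=(3, 10), required_modules=("ipykernel",)):
    """Return a list of problems found; an empty list means all checks passed.

    Hypothetical helper for illustration only; it is not shipped with DIPS.
    """
    problems = []

    # Compare only major.minor, e.g. (3, 10).
    current = tuple(sys.version_info[:2])
    if current != tuple(required_version):
        problems.append(
            f"Python {required_version[0]}.{required_version[1]} expected, "
            f"found {current[0]}.{current[1]}"
        )

    # find_spec returns None when a top-level module cannot be located.
    for name in required_modules:
        if importlib.util.find_spec(name) is None:
            problems.append(f"module not importable: {name}")

    return problems


if __name__ == "__main__":
    issues = check_environment()
    print("Environment looks OK" if not issues else "\n".join(issues))
```

Run it with the `dips_env` environment activated; any reported problem usually means the wrong interpreter is active or `pip install -r requirements.txt` has not been run yet.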

Logging

Outputs from scripts can be logged to Weights and Biases (wandb). A wandb account is required, and your WANDB_API_KEY and entity need to be set in the provided wandb.yaml file.
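
Before launching a long run, one might want to check that wandb.yaml has actually been filled in. The following is a rough sketch under stated assumptions: it assumes the file holds flat, top-level `key: value` lines, and the key spelling `entity` is a guess (check the file in the repository for the real names). The helpers `read_flat_yaml` and `missing_settings` are hypothetical and not part of DIPS or wandb.

```python
from pathlib import Path


def read_flat_yaml(path):
    """Parse simple top-level `key: value` lines into a dict.

    This is deliberately not a full YAML parser: nested structures are
    ignored, and anything after a '#' is treated as a comment.
    """
    values = {}
    for raw in Path(path).read_text().splitlines():
        line = raw.split("#", 1)[0].strip()  # drop comments and whitespace
        if not line or ":" not in line:
            continue
        key, value = line.split(":", 1)
        # Remove surrounding whitespace and optional quotes from the value.
        values[key.strip()] = value.strip().strip("'\"")
    return values


def missing_settings(path, required_keys):
    """Return the required keys that are absent or empty in the file."""
    values = read_flat_yaml(path)
    return [key for key in required_keys if not values.get(key)]


if __name__ == "__main__":
    # Key names are assumptions; adjust them to match the provided wandb.yaml.
    if Path("wandb.yaml").exists():
        print(missing_settings("wandb.yaml", ["WANDB_API_KEY", "entity"]))
```

An empty list means both values are present; any listed key still needs to be filled in before the scripts can log to wandb.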

Getting started with DIPS

To get started with DIPS, try the tutorial.ipynb notebook in the root folder.

Scripts

To run the tabular experiments, run the bash scripts in the scripts folder; results are logged to wandb. For example:

 bash run_real_tabular.sh

Notebooks

To run the notebook experiments, run any of the Jupyter notebooks (.ipynb) in the notebooks folder.

Computer Vision

Details on running DIPS for computer vision tasks (such as FixMatch) can be found in the fixmatch folder, which also contains the requirements specific to these experiments.

Citing

If you use this code, please cite the associated paper:

@article{dips2024,
title={You can't handle the (dirty) truth: Data-centric insights improve pseudo-labeling},
author={Nabeel Seedat and Nicolas Huynh and Fergus Imrie and Mihaela van der Schaar},
journal={Journal of Data-centric Machine Learning Research},
year={2024},
}
