Clean/remove duplicate images with fastdup #53

Open
mrdbourke opened this issue Jan 23, 2023 · 3 comments

Comments

@mrdbourke
Owner

mrdbourke commented Jan 23, 2023

Make a script to clean and remove duplicate images with fastdup - https://github.com/visual-layer/fastdup

  • This should scale well: the fastdup maintainers tested it across ImageNet21k (millions of images) and it finished in ~3 hours
  • Could run this script periodically to clean images whenever new images are downloaded
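
A rough sketch of the "decide what to delete" half of such a script, assuming the duplicate finder hands back pairs of similar files. The pair format and the file names below are made up for illustration (this is not fastdup's actual output), and the "keep the alphabetically first file" rule is just one possible choice:

```python
from collections import defaultdict


def group_duplicates(pairs):
    """Group files linked by duplicate pairs into clusters (union-find)."""
    parent = {}

    def find(item):
        parent.setdefault(item, item)
        while parent[item] != item:
            parent[item] = parent[parent[item]]  # path halving
            item = parent[item]
        return item

    for file_a, file_b in pairs:
        root_a, root_b = find(file_a), find(file_b)
        if root_a != root_b:
            parent[root_b] = root_a

    clusters = defaultdict(list)
    for item in parent:
        clusters[find(item)].append(item)
    return [sorted(members) for members in clusters.values()]


def files_to_remove(pairs):
    """Keep the alphabetically first file of each cluster, return the rest."""
    to_remove = []
    for members in group_duplicates(pairs):
        to_remove.extend(members[1:])
    return sorted(to_remove)


# Hypothetical pairs, as a similarity tool might report them.
pairs = [
    ("img_001.jpg", "img_057.jpg"),
    ("img_057.jpg", "img_102.jpg"),
    ("img_200.jpg", "img_201.jpg"),
]
print(files_to_remove(pairs))
# ['img_057.jpg', 'img_102.jpg', 'img_201.jpg']
```

Clustering first matters because pairs can chain (A~B, B~C): deleting one file per pair could remove every copy or leave two behind, whereas one keeper per cluster leaves exactly one.
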
@mrdbourke
Owner Author

mrdbourke commented Jan 24, 2023

Did this in a notebook and removed 695/25000 (or thereabouts) images. Saw a slight reduction in performance, but this was expected since there's now less data leakage between the train & test sets. See the evaluation run: https://wandb.ai/mrdbourke/test_wandb_artifacts_by_reference/runs/714m0crl
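
For reference, a minimal sketch of how the train/test leakage could be checked directly, assuming the duplicate groups and the split file lists are already available. The function name, inputs and paths are hypothetical and not taken from the notebook:

```python
def find_leaking_groups(duplicate_groups, train_files, test_files):
    """Return duplicate groups that have members in both train and test."""
    train_files, test_files = set(train_files), set(test_files)
    leaks = []
    for group in duplicate_groups:
        in_train = sorted(f for f in group if f in train_files)
        in_test = sorted(f for f in group if f in test_files)
        if in_train and in_test:  # the same image sits on both sides of the split
            leaks.append({"train": in_train, "test": in_test})
    return leaks


# Hypothetical example data.
duplicate_groups = [
    ["train/a.jpg", "test/a_copy.jpg"],   # leaks across the split
    ["train/b.jpg", "train/b_copy.jpg"],  # duplicates, but only within train
]
train_files = ["train/a.jpg", "train/b.jpg", "train/b_copy.jpg"]
test_files = ["test/a_copy.jpg"]

print(find_leaking_groups(duplicate_groups, train_files, test_files))
# [{'train': ['train/a.jpg'], 'test': ['test/a_copy.jpg']}]
```

Only groups spanning both splits inflate the test metrics; duplicates within train alone just skew the training distribution.
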

@mrdbourke
Owner Author

Original notes (from #50) -

  • Found a library that helps with image deduplication via hashing: https://github.com/idealo/imagededup
    Removing duplicates will help make the model more robust and prevent data from leaking from the train → test set (which would give misleadingly good metrics)
  • Created a small notebook for this (07_remove_duplicates.ipynb) and it seems to work very well: it found ~500/24500 images were duplicates in a few minutes, and very few of the flagged samples weren't actually duplicates (based on a series of quick random plots)
  • Could integrate this workflow to run over all the images every so often (or whenever new data is added to the dataset).
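
To illustrate why hashing catches near-identical images, here is a toy average-hash sketch in plain Python. It is not imagededup's implementation: it assumes each image has already been decoded and shrunk to a tiny grayscale grid, and the file names and `max_distance` threshold are made up for the example:

```python
from itertools import combinations


def average_hash(pixels):
    """Hash a small grayscale grid (rows of 0-255 ints) into a bit string."""
    flat = [value for row in pixels for value in row]
    mean = sum(flat) / len(flat)
    # One bit per pixel: 1 if brighter than the mean, else 0.
    return "".join("1" if value > mean else "0" for value in flat)


def hamming_distance(hash_a, hash_b):
    """Count the positions where two equal-length hashes differ."""
    return sum(a != b for a, b in zip(hash_a, hash_b))


def find_duplicate_pairs(grids, max_distance=0):
    """Return name pairs whose hashes differ by at most max_distance bits."""
    hashes = {name: average_hash(grid) for name, grid in grids.items()}
    pairs = []
    for name_a, name_b in combinations(sorted(hashes), 2):
        if hamming_distance(hashes[name_a], hashes[name_b]) <= max_distance:
            pairs.append((name_a, name_b))
    return pairs


# Hypothetical 2x2 "images" (real hashes use larger grids, e.g. 8x8).
grids = {
    "a.jpg": [[10, 200], [220, 30]],
    "a_brighter.jpg": [[20, 210], [230, 40]],  # same picture, slightly brighter
    "b.jpg": [[200, 10], [30, 220]],
}
print(find_duplicate_pairs(grids))
# [('a.jpg', 'a_brighter.jpg')]
```

Because each bit only records whether a pixel is above the image's own mean, a uniform brightness shift leaves the hash unchanged, which a byte-level file hash would not survive.
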

@mrdbourke
Owner Author

Next step: turn the notebook version of this into a script.
