Remove duplicate images from dataset (train and test) #50

Closed
mrdbourke opened this issue Jan 10, 2023 · 1 comment

mrdbourke commented Jan 10, 2023

  • Found a library that helps with image deduplication via hashing: https://github.com/idealo/imagededup
  • Removing duplicates should make the model more robust and prevent data leaking from the train set into the test set (which would inflate the test metrics)
  • Created a small notebook for this (07_remove_duplicates.ipynb) and it seems to work very well: in a few minutes it flagged ~500/24500 images as duplicates, and a series of quick random plots showed that few of the flagged samples weren't actually duplicates
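
A minimal sketch of the train → test leakage check described above. This is not the notebook's code: it swaps in exact SHA-256 content hashing from the Python standard library, which only catches byte-identical files, whereas a perceptual-hashing library such as imagededup can also catch resized or re-encoded copies. The function names, the file suffixes, and the `train/` and `test/` directory layout are assumptions for illustration.

```python
import hashlib
from collections import defaultdict
from pathlib import Path

# Assumed image file types; adjust to whatever the dataset actually contains.
IMAGE_SUFFIXES = {".jpg", ".jpeg", ".png"}


def hash_file(path: Path) -> str:
    """Return the SHA-256 hex digest of a file's raw bytes."""
    return hashlib.sha256(path.read_bytes()).hexdigest()


def index_images(root: Path) -> dict[str, list[Path]]:
    """Map content hash -> every image path under root with that content."""
    index = defaultdict(list)
    for path in sorted(root.rglob("*")):
        if path.is_file() and path.suffix.lower() in IMAGE_SUFFIXES:
            index[hash_file(path)].append(path)
    return dict(index)


def find_leaks(train_dir: Path, test_dir: Path) -> list[tuple[Path, Path]]:
    """Return (train_path, test_path) pairs whose bytes are identical."""
    train_index = index_images(train_dir)
    test_index = index_images(test_dir)
    leaks = []
    for digest, test_paths in test_index.items():
        for train_path in train_index.get(digest, []):
            for test_path in test_paths:
                leaks.append((train_path, test_path))
    return leaks
```

Calling `find_leaks(Path("train"), Path("test"))` would list every test image that also appears byte-for-byte in the train split; which copy to delete is left as a manual decision.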

Could integrate this workflow to run over all the images every so often (or whenever new data is added to the dataset).
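
One way the "whenever new data is added" variant could look, as a hedged sketch: keep a small JSON index of content hashes and compare only the incoming files against it, so the whole dataset does not need rehashing each time. The index file, the function names, and the use of exact SHA-256 hashing (instead of perceptual hashing) are all assumptions, not part of the existing workflow.

```python
import hashlib
import json
from pathlib import Path


def load_index(index_path: Path) -> dict[str, str]:
    """Load the hash -> first-seen-path index, or start empty."""
    if index_path.exists():
        return json.loads(index_path.read_text())
    return {}


def check_new_images(new_paths, index_path: Path):
    """Split new_paths into (unique, duplicates) and update the index.

    duplicates holds (new_path, already_indexed_path) pairs.
    """
    index = load_index(index_path)
    unique, duplicates = [], []
    for path in new_paths:
        path = Path(path)
        digest = hashlib.sha256(path.read_bytes()).hexdigest()
        if digest in index:
            duplicates.append((path, Path(index[digest])))
        else:
            index[digest] = str(path)
            unique.append(path)
    # Persist the index so the next batch is checked against this one too.
    index_path.write_text(json.dumps(index, indent=2))
    return unique, duplicates
```

Only the files in `unique` would then be copied into the dataset; the pairs in `duplicates` can be plotted side by side for a quick visual check before discarding.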

@mrdbourke
Owner Author

Moved to #53
