dnth/mafat-fastdup-blogpost

Data Insights from the MAFAT Satellite Vision Challenge

mafat

In this repository, I use a free tool called fastdup to gain insights from the labeled and unlabeled data of the MAFAT Satellite Vision Challenge.

fastdup is a free tool for managing, cleaning, and curating visual data. It is fast (it runs on your CPU) and scalable: it can handle up to 400M images on a single CPU machine.

The main features of fastdup include -

  • Finding duplicates.
  • Finding anomalies.
  • Clustering similar images.

In this repository, I ran fastdup on both the labeled and unlabeled data and documented my findings.

At a high level, fastdup found the following potential issues in the labeled dataset (1457 images) -

  • A total of 12 fully identical images (d>0.990), which are 0.27 %.
  • A total of 25 nearly identical images (d>0.980), which are 0.57 %.
  • A total of 559 above threshold images (d>0.900), which are 12.79 %.
  • A total of 145 outlier images (d<0.050), which are 3.32 %.

At a high level, fastdup found the following potential issues in the unlabeled dataset (8258 images) -

  • A total of 914 fully identical images (d>0.990), which are 3.69 %.
  • A total of 466 nearly identical images (d>0.980), which are 1.88 %.
  • A total of 7393 above threshold images (d>0.900), which are 29.84 %.
  • A total of 825 outlier images (d<0.050), which are 3.33 %.
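The percentages in both lists are easier to read with the arithmetic spelled out. The short check below uses only the counts and percentages quoted above and tests which multiple of the image count reproduces them. On these numbers the match is a denominator of three times the image count (4371 and 24774), not the image count itself. What that denominator represents is not stated here; one plausible reading, which is an assumption, is that it counts rows in fastdup's similarity output rather than images.

```python
# Counts and percentages exactly as quoted in the two summaries above.
REPORTED = {
    "labeled": (1457, [(12, 0.27), (25, 0.57), (559, 12.79), (145, 3.32)]),
    "unlabeled": (8258, [(914, 3.69), (466, 1.88), (7393, 29.84), (825, 3.33)]),
}


def matching_multiplier(n_images, rows, candidates=(1, 2, 3, 4, 5)):
    """Return k such that count / (k * n_images) reproduces every
    reported percentage to two decimals, or None if no candidate fits."""
    for k in candidates:
        denom = k * n_images
        if all(round(100 * count / denom, 2) == pct for count, pct in rows):
            return k
    return None


for name, (n_images, rows) in REPORTED.items():
    k = matching_multiplier(n_images, rows)
    print(f"{name}: percentages match a denominator of {k} x {n_images}")
```

If you want the share of images rather than the share of that larger denominator, divide each count by the image count directly.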

💭 So what?

As you can see, not all images are useful for training a model.

  • Duplicate images do not provide additional information. They take up disk space and prolong training time, so they can be discarded.
  • Overly dark, bright, or blurry images without any objects also provide no value.
  • For the clusters and outliers, I'll leave it for you to decide if they are useful to train a model.

Curating a dataset goes a long way in making sure a model works.

In my opinion, these are low-hanging fruit that can be addressed to ensure the dataset is reasonably "clean" before training any model.

If you're interested in exploring the dataset yourself, read on.

Happy hacking.

📂 Folder Structure

  • dataset/ - Stores the image dataset downloaded from the MAFAT official webpage. Sign up and download the data into this folder.

  • fastdup_report/ - Stores the reports from fastdup.

  • fastdup_train.ipynb - Notebook to analyze the labeled training images.

  • fastdup_unlabeled.ipynb - Notebook to analyze the unlabeled images.

👯‍♀️ Duplicates

fastdup is extremely fast and robust at finding duplicate images.

In the unlabeled dataset, I found 927 fully identical images, which is 3.74 % of the unlabeled data. See the notebook here.

duplicates
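To make the similarity cutoffs concrete, here is a minimal sketch of how a list of image pairs with scores can be bucketed and turned into a list of files to discard. It is not fastdup's implementation. The `from`, `to`, and `distance` column names and the file names are assumptions for illustration, and the cutoffs are the ones quoted in the summary at the top (higher scores mean more similar, as the d>0.990 cutoff for identical images implies).

```python
import csv
import io

# Hypothetical pair list; column names and file names are assumptions.
SAMPLE = """from,to,distance
a.png,b.png,0.995
b.png,a.png,0.995
c.png,d.png,0.985
e.png,f.png,0.930
g.png,h.png,0.400
"""


def bucket(score):
    """Label a similarity score using the cutoffs quoted in this README."""
    if score > 0.99:
        return "fully identical"
    if score > 0.98:
        return "nearly identical"
    if score > 0.90:
        return "above threshold"
    return "distinct"


def files_to_discard(rows, cutoff=0.99):
    """Greedy pass: for each pair above the cutoff, discard the second file
    unless one side of the pair has already been discarded."""
    drop = set()
    for row in rows:
        if float(row["distance"]) <= cutoff:
            continue
        first, second = row["from"], row["to"]
        if first in drop or second in drop:
            continue  # this pair is already resolved
        drop.add(second)
    return sorted(drop)


rows = list(csv.DictReader(io.StringIO(SAMPLE)))
for row in rows:
    print(row["from"], row["to"], bucket(float(row["distance"])))
print("discard:", files_to_discard(rows))
```

The greedy pass is deliberately conservative: it keeps at least one file from every pair it sees, at the cost of possibly keeping a few extra near-copies in longer chains.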

🧩 Components

I also used fastdup to find similar-looking images (clusters).

As shown below, many similar-looking images are clustered together. These clusters may or may not provide insights.

components
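One common way to form clusters like these is to treat images as nodes, link any pair whose similarity exceeds a threshold, and take the connected components of the resulting graph. The sketch below shows that idea with a small union-find. The pairs, file names, and the 0.90 threshold are illustrative assumptions, and this is not claimed to be how fastdup builds its components internally.

```python
from collections import defaultdict


def connected_components(pairs, threshold=0.90):
    """Group images linked by pairs whose similarity exceeds `threshold`."""
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for a, b, score in pairs:
        root_a, root_b = find(a), find(b)
        if score > threshold and root_a != root_b:
            parent[root_b] = root_a

    groups = defaultdict(list)
    for node in parent:
        groups[find(node)].append(node)
    # Keep only real clusters (two or more images), sorted for stable output.
    return sorted(sorted(g) for g in groups.values() if len(g) > 1)


# Hypothetical similarity pairs: (image, image, score).
pairs = [
    ("a.png", "b.png", 0.95),
    ("b.png", "c.png", 0.93),
    ("d.png", "e.png", 0.97),
    ("e.png", "f.png", 0.50),
]
clusters = connected_components(pairs)
print(clusters)
```

Note that `a.png` and `c.png` end up in the same cluster without being compared directly, which is why components can contain images that look less alike than any single pair suggests.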

🎸 Outliers

fastdup can also be used to find anomalies in the dataset. The following gallery shows images that are "different" (measured using cosine distance) compared to the rest in the unlabeled dataset.

outliers
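As a rough illustration of what "different" means here, the sketch below scores each image by its highest cosine similarity to any other image and flags those that fall under a cutoff. The three-dimensional vectors stand in for real image embeddings and are made up. The 0.05 cutoff mirrors the d<0.050 figure from the summary, read here as a similarity cutoff, which is an assumption. fastdup's own feature extraction and nearest-neighbour search are not reproduced.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(u, v))
    norm = math.sqrt(sum(x * x for x in u)) * math.sqrt(sum(y * y for y in v))
    return dot / norm


def find_outliers(embeddings, cutoff=0.05):
    """Return names whose best match among the other images is below `cutoff`."""
    outliers = []
    for name, vec in embeddings.items():
        best = max(
            cosine_similarity(vec, other)
            for other_name, other in embeddings.items()
            if other_name != name
        )
        if best < cutoff:
            outliers.append(name)
    return outliers


# Toy embeddings; real ones would come from a feature extractor.
embeddings = {
    "a.png": (1.0, 0.0, 0.0),
    "b.png": (0.9, 0.1, 0.0),
    "c.png": (1.0, 0.05, 0.0),
    "z.png": (0.0, 0.0, 1.0),
}
outliers = find_outliers(embeddings)
print(outliers)
```

This brute-force version compares every image with every other one, which is fine for a toy example but far too slow for thousands of images.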

📎 Blur

The following gallery shows the images sorted by blurriness (from most to least blurry).

blur
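A common proxy for blurriness is the variance of the Laplacian: sharp images have strong edges and a high variance, blurry ones a low variance. The sketch below applies that idea to tiny grayscale grids in pure Python. It illustrates the general technique only and does not claim to be the metric fastdup computes; the image names and pixel values are made up.

```python
from statistics import pvariance


def laplacian_variance(img):
    """Variance of the 4-neighbour Laplacian over the interior pixels."""
    height, width = len(img), len(img[0])
    responses = [
        img[y - 1][x] + img[y + 1][x] + img[y][x - 1] + img[y][x + 1]
        - 4 * img[y][x]
        for y in range(1, height - 1)
        for x in range(1, width - 1)
    ]
    return pvariance(responses)


def checkerboard(lo, hi, size=4):
    """Small test pattern: the contrast between lo and hi sets edge strength."""
    return [[hi if (x + y) % 2 else lo for x in range(size)] for y in range(size)]


# Hypothetical images as 2D lists of grayscale values (0-255).
images = {
    "sharp.png": checkerboard(0, 255),
    "soft.png": checkerboard(120, 130),
    "flat.png": [[128] * 4 for _ in range(4)],
}

# Lowest variance first, i.e. most blurry first.
most_blurry_first = sorted(images, key=lambda name: laplacian_variance(images[name]))
print(most_blurry_first)
```

On real images the same computation is usually done with an image library rather than nested lists; the ordering logic stays the same.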

📙 Bright

The following gallery shows the images sorted according to brightness (brightest at the top).

bright

🪔 Dark

The following gallery shows the images sorted according to darkness (darkest at the top).

dark
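Both the bright and the dark gallery amount to ordering images by an intensity statistic. Here is a minimal sketch, assuming mean grayscale pixel value as that statistic (the exact statistic fastdup uses is not stated here). The image names, pixel values, and the 20 and 235 cutoffs for "overly dark" and "overly bright" are hypothetical.

```python
from statistics import mean


def mean_intensity(img):
    """Average grayscale value (0-255) over all pixels of a 2D image."""
    return mean(pixel for row in img for pixel in row)


# Hypothetical images as 2D lists of grayscale values.
images = {
    "night.png": [[5, 10], [0, 8]],
    "noon.png": [[250, 255], [240, 248]],
    "dusk.png": [[90, 120], [100, 110]],
}

scores = {name: mean_intensity(img) for name, img in images.items()}
brightest_first = sorted(scores, key=scores.get, reverse=True)
darkest_first = sorted(scores, key=scores.get)

# Illustrative cutoffs for flagging extremes; tune them on real data.
too_dark = [name for name, s in scores.items() if s < 20]
too_bright = [name for name, s in scores.items() if s > 235]

print("brightest first:", brightest_first)
print("darkest first:", darkest_first)
print("too dark:", too_dark, "too bright:", too_bright)
```

Flagged images are candidates for review rather than automatic removal, since a dark or bright image can still contain useful objects.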

📞 Questions? Connect with me

If you have any questions or feedback, please don't hesitate to reach out to me. I'm active on the following platforms.

dnth

❤️ Support Me

I am thrilled to share my work with you and I hope you find it useful.

If you do, please consider supporting my efforts by making a donation and/or sharing this repository on your social media.

Your support will help me to continue developing and maintaining this project, as well as create new ones.

Buy Me A Coffee