Small question about recommended usage #11

Tyrannas · 2022-11-18T13:18:21Z

Hello, first of all, thank you for your work. Your libraries are amazing!

I didn't know how to contact you properly, and an issue is probably the wrong way to do it, so feel free to close it without answering if you feel like it.

Anyway, I just wanted to ask you a question. I need to group similar images: the images can be faces, screenshots, drawings, memes, etc., so very different kinds. Some of them are variations of each other, either small (light crop, lighting changes...) or big (bigger crop, added text, etc.), and I'm trying to find a way to group them together. Until now I was using your other library (undouble), which was working fine, but sometimes the grouping functionality excluded images that were really close: when I computed the ahashes manually, these images all had the same ahash, but they were not grouped by undouble.group, which is odd.

So I started trying clustimage and I'm a bit overwhelmed: there seem to be so many functionalities, ways of computing features and distances, evaluating the clusters, etc.

I've read your Medium article on clustimage, which helps a bit, and I know you say one should choose the parameters according to the research question, but I'm no data scientist and I'm a bit lost. My plan right now would be to write a script that iterates over all the possible clustimage parameters and computes a score based on the image grouping that I've made manually. But I think there must be a smarter way to proceed.

So in other words, my question is: do you recommend any particular set of methods and parameters to group variations of images that can be of very different types?

Thank you in advance and have a good day!

erdogant · 2022-11-20T20:50:34Z

Thanks!

If the images are really “dirty”, it may take some iterations to get them into the right shape. One way to do that is to first cluster the images (as you describe), then manually check out the outliers and see whether they have something in common. If so, add it as a pre-processing step.

There is another point that got my attention. Can you show an example where “images all had the same ahash but they were not grouped”? This is not the expected behavior. Better to fix this first.
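For readers unfamiliar with the term, here is a minimal sketch of what "having the same ahash" means. It assumes the image has already been converted to grayscale and resized to a small grid, and it is a plain-Python illustration of the general average-hash idea, not undouble's implementation. The function names and the tiny 2x2 grids are made up for the example.

```python
def average_hash(pixels):
    """Return the aHash bits of an already-resized grayscale grid (list of rows)."""
    flat = [value for row in pixels for value in row]
    mean = sum(flat) / len(flat)
    # One bit per pixel: 1 if the pixel is brighter than the mean, else 0.
    return tuple(1 if value > mean else 0 for value in flat)


def hamming_distance(bits_a, bits_b):
    """Count the positions where two equally long hashes differ."""
    return sum(a != b for a, b in zip(bits_a, bits_b))


# Hypothetical 2x2 grids standing in for resized images.
original = [[10, 200], [220, 30]]
brighter = [[40, 230], [250, 60]]   # same picture, uniformly brighter
different = [[200, 10], [30, 220]]  # inverted layout

hash_original = average_hash(original)
hash_brighter = average_hash(brighter)
hash_different = average_hash(different)

print(hamming_distance(hash_original, hash_brighter))
print(hamming_distance(hash_original, hash_different))
```

A uniform lighting change shifts the mean together with the pixels, so the bits stay the same. That is why two images with identical hashes would normally be expected to land in the same group.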

Regarding finding the best set of parameters, you may want to proceed with a supervised approach rather than an unsupervised one. In other words, you can create your labels with clustimage or undouble and then use those labels in a supervised approach.
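If a manually made grouping is available, one possible way to compare parameter settings is to score each resulting grouping against it. The sketch below uses a pair-counting F1 score in plain Python. The file names, the setting names and the `candidates` dictionary are hypothetical, and in practice each candidate labelling would come from one run of clustimage or undouble with a given configuration.

```python
from itertools import combinations


def same_group_pairs(labels):
    """Return the set of item pairs that share a group label."""
    return {
        (a, b)
        for a, b in combinations(sorted(labels), 2)
        if labels[a] == labels[b]
    }


def pairwise_f1(manual, predicted):
    """Score a predicted grouping against a manual one (1.0 = identical pairs)."""
    true_pairs = same_group_pairs(manual)
    found_pairs = same_group_pairs(predicted)
    hits = len(true_pairs & found_pairs)
    if hits == 0:
        return 0.0
    precision = hits / len(found_pairs)
    recall = hits / len(true_pairs)
    return 2 * precision * recall / (precision + recall)


# Hypothetical manual grouping: file name -> group label.
manual = {"a.png": 0, "b.png": 0, "c.png": 1, "d.png": 1}

# Hypothetical groupings, one per parameter setting that was tried.
candidates = {
    "loose": {"a.png": 0, "b.png": 0, "c.png": 0, "d.png": 0},
    "exact": {"a.png": 5, "b.png": 5, "c.png": 7, "d.png": 7},
    "split": {"a.png": 0, "b.png": 1, "c.png": 2, "d.png": 3},
}

scores = {name: pairwise_f1(manual, labels) for name, labels in candidates.items()}
best = max(scores, key=scores.get)
print(scores, best)
```

Counting pairs rather than comparing labels directly means the label values themselves do not have to match, only which images end up together.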

Tyrannas · 2022-11-21T18:44:07Z

Thank you for your very interesting answer!
Concerning the weird behaviour, it's actually even weirder: when I group this set of images within a small dataset, they appear, but not when I group them within a bigger dataset.

Basically my code looks like this:

import os
import shutil

from undouble import Undouble

output_path = "output"  # placeholder: directory that receives the copies
os.makedirs(output_path, exist_ok=True)

model = Undouble()

# model.import_data(r"small\data\set")
model.import_data(r"big\data\set")
model.compute_hash(method='ahash', hash_size=8)
model.group(threshold=10)

# Copy grouped images to the output directory with the group id in their name
groups = model.results['select_pathnames']
for group_id, group in enumerate(groups):
    for img in group:
        img_name = os.path.basename(img)
        shutil.copy(img, os.path.join(output_path, f"{group_id}_{img_name}"))

When using the small dataset, my 10 images appear in the output directory, and they have the same group id, but when I use the big dataset, these images don't show up in the output directory. So it's kind of hard to just send you the images, since it depends on the whole dataset.
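Since the problem depends on the whole dataset, one way to narrow it down to a shareable example might be to list the images that have identical hashes but do not end up in the same group. The sketch below is plain Python and does not call undouble. It assumes a `hashes` mapping (path to hash string) computed separately and a `groups` list shaped like the list of path lists used above. The helper name and the sample values are hypothetical.

```python
from collections import defaultdict


def find_split_duplicates(hashes, groups):
    """Return lists of paths that share a hash but are not all in one group."""
    # Map every grouped path to the index of its group.
    group_of = {}
    for group_id, group in enumerate(groups):
        for path in group:
            group_of[path] = group_id

    # Collect the paths that share each hash value.
    by_hash = defaultdict(list)
    for path, value in hashes.items():
        by_hash[value].append(path)

    problems = []
    for paths in by_hash.values():
        if len(paths) < 2:
            continue
        # None marks a path that is missing from every group.
        ids = {group_of.get(path) for path in paths}
        if len(ids) > 1 or None in ids:
            problems.append(sorted(paths))
    return problems


# Hypothetical data: c.png and d.png share a hash, but d.png was not grouped.
hashes = {
    "a.png": "ff00",
    "b.png": "ff00",
    "c.png": "0f0f",
    "d.png": "0f0f",
    "e.png": "1234",
}
groups = [["a.png", "b.png"], ["c.png"]]

problems = find_split_duplicates(hashes, groups)
print(problems)
```

The reported paths, plus a handful of unrelated images, could then serve as a small dataset for checking whether the behaviour still reproduces.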
