Small question about recommended usage #11

Open
Tyrannas opened this issue Nov 18, 2022 · 2 comments

Comments

@Tyrannas

Hello, first of all thank you for your work, your libraries are amazing!

I didn't know how to contact you properly, and an issue is probably the wrong way to do it, so feel free to close it without answering if you feel like it.

Anyway, I just wanted to ask you a question. I need to group similar images. The images can be faces, screenshots, drawings, memes, etc., so very different kinds. Some of them differ by small variations (light crop, lighting, ...) and others by big ones (bigger crop, added text, etc.), and I'm trying to find a way to group them. Until now I was using your other library (undouble), which was working fine, but sometimes the grouping functionality excluded images that were really close: when I computed the ahashes manually, these images all had the same ahash, but they were not grouped by undouble.group, which is odd.

So I started trying to use clustimage and I'm a bit overwhelmed: there seem to be so many functionalities, ways of computing features, distances, evaluating the clusters, etc.

I've read your Medium article on clustimage, which helps a bit, and I know you're saying one should choose the parameters according to the research question, but I'm no data scientist and I'm a bit lost. My take right now would be to write a script that iterates over all the possible parameters of clustimage and computes a score based on the image grouping that I've made manually. But I think there must be a smarter way to proceed.
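For readers who want to try the scoring idea described above, here is a minimal sketch in plain Python. It is not clustimage's API: `cluster_fn` is a hypothetical stand-in for whatever function runs the clustering with a given set of parameters and returns a `{image_name: group_label}` dict, and the pairwise F1 score is just one reasonable way to compare a predicted grouping with a manual one. The toy data at the bottom is invented purely for illustration.

```python
from itertools import combinations, product


def pair_f1(predicted, manual):
    """Pairwise F1 between two groupings given as {image_name: group_label}."""
    names = sorted(set(predicted) & set(manual))
    tp = fp = fn = 0
    for a, b in combinations(names, 2):
        same_pred = predicted[a] == predicted[b]
        same_true = manual[a] == manual[b]
        if same_pred and same_true:
            tp += 1  # pair grouped together in both
        elif same_pred:
            fp += 1  # grouped together only in the prediction
        elif same_true:
            fn += 1  # grouped together only in the manual grouping
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def grid_search(cluster_fn, param_grid, manual):
    """Try every parameter combination; return (best_score, best_params)."""
    keys = sorted(param_grid)
    best = (-1.0, None)
    for values in product(*(param_grid[k] for k in keys)):
        params = dict(zip(keys, values))
        score = pair_f1(cluster_fn(**params), manual)
        if score > best[0]:
            best = (score, params)
    return best


# Toy data (hypothetical): manual groups and a 1-D "feature" per image.
MANUAL = {"a.png": 0, "b.png": 0, "c.png": 1, "d.png": 1}
TOY_VALUES = {"a.png": 0, "b.png": 1, "c.png": 10, "d.png": 11}


def toy_cluster(threshold):
    """Stand-in for a real clustering call: bucket images by value."""
    return {name: value // threshold for name, value in TOY_VALUES.items()}


best_score, best_params = grid_search(toy_cluster, {"threshold": [1, 5, 20]}, MANUAL)
print(best_score, best_params)
```

The exhaustive loop grows quickly with the number of parameters, so in practice it may be worth restricting the grid to the few settings that matter most for the images at hand.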

So in other words, my question is: do you recommend any particular set of methods and parameters to group variations of images that can be of very different types?

Thank you in advance and have a good day!

@erdogant
Owner

Thanks!

If the images are really “dirty”, it may take some iterations to get them in the right shape. One way to do that is to first cluster the images (as you describe), then manually check the outliers and see whether they have something in common. If so, add it as a pre-processing step.

Another point caught my attention. Can you show an example where “images all had the same ahash but they were not grouped”? This is not the expected behavior, so it is better to fix this first.
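To make the expectation concrete, the following is a rough sketch of what an average hash and a Hamming distance look like. It is written in plain Python on tiny hand-made grayscale grids and is not how undouble computes its hashes internally (a real ahash first resizes the image, for example to 8x8). The point is only that identical hashes have distance 0, so they should fall within any non-negative grouping threshold.

```python
def average_hash(gray):
    """Return average-hash bits for a small grayscale grid (list of rows)."""
    pixels = [p for row in gray for p in row]
    mean = sum(pixels) / len(pixels)
    # A bit is 1 when the pixel is brighter than the mean of the grid.
    return tuple(1 if p > mean else 0 for p in pixels)


def hamming(bits_a, bits_b):
    """Number of positions at which two equal-length hashes differ."""
    if len(bits_a) != len(bits_b):
        raise ValueError("hashes must have the same length")
    return sum(a != b for a, b in zip(bits_a, bits_b))


# Hypothetical 2x2 "images": `brighter` is a uniformly brighter copy of
# `original`, while `different` has the bright and dark pixels swapped.
original = [[10, 200], [220, 30]]
brighter = [[40, 230], [250, 60]]
different = [[200, 10], [30, 220]]

print(hamming(average_hash(original), average_hash(brighter)))
print(hamming(average_hash(original), average_hash(different)))
```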

Regarding finding the best set of parameters, you may want to proceed with a supervised approach rather than an unsupervised one. In other words, you can create your labels with clustimage or undouble and then use those labels in a supervised approach.
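As one possible reading of this suggestion, the sketch below uses labels that were produced earlier (for instance by a clustering run that was then corrected by hand) to label a new image with a 1-nearest-neighbour rule over hash bits. The hashes, the labels, and the choice of 1-nearest-neighbour are all assumptions made for illustration; any supervised classifier could take this place.

```python
def hamming(bits_a, bits_b):
    """Number of positions at which two equal-length hashes differ."""
    return sum(a != b for a, b in zip(bits_a, bits_b))


def predict_label(query, labeled):
    """Return the label of the labeled hash closest to `query`.

    `labeled` is a list of (hash_bits, label) pairs.
    """
    _, best_label = min(labeled, key=lambda item: hamming(query, item[0]))
    return best_label


# Hypothetical labeled hashes, e.g. taken from an earlier grouping run.
labeled = [
    ((0, 1, 1, 0), "group_0"),
    ((1, 0, 0, 1), "group_1"),
    ((0, 1, 1, 1), "group_0"),
]

print(predict_label((0, 1, 0, 0), labeled))
print(predict_label((1, 0, 0, 0), labeled))
```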

@Tyrannas
Author

Thank you for your very interesting answer!
Concerning the weird behaviour, it's actually even weirder: when I group this set of images within a small dataset, they appear in the results, but not when I group them within a bigger dataset.

Basically my code looks like this:

import os
import shutil

from undouble import Undouble

model = Undouble()

# Raw strings keep the Windows backslashes literal.
# model.import_data(r"small\data\set")
model.import_data(r"big\data\set")
model.compute_hash(method='ahash', hash_size=8)
model.group(threshold=10)

# Copy grouped images to the output directory with the group id in their name.
# output_path is the destination directory (its definition is not shown in this post).
groups = model.results['select_pathnames']
for group_id, group in enumerate(groups):
    for img in group:
        img_name = os.path.basename(img)
        shutil.copy(img, os.path.join(output_path, f"{group_id}_{img_name}"))

When using the small dataset, my 10 images appear in the output directory, and they have the same group id, but when I use the big dataset, these images don't show up in the output directory. So it's kind of hard to just send you the images, since it depends on the whole dataset.
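One way to narrow this down without sharing the whole dataset might be to check directly in which group, if any, each expected image lands. The helper below is a sketch that assumes `groups` is a list of lists of file paths, as `model.results['select_pathnames']` is used in the code above. The function name and the example paths are hypothetical.

```python
def locate_in_groups(groups, expected_names):
    """Map each expected file name to the index of its group, or None."""
    found = {name: None for name in expected_names}
    for group_id, group in enumerate(groups):
        for path in group:
            # Normalise Windows separators so this also works on POSIX systems.
            name = str(path).replace("\\", "/").rsplit("/", 1)[-1]
            if name in found:
                found[name] = group_id
    return found


# Hypothetical results from a small and a big run.
groups_small = [["small\\a.png", "small\\b.png"]]
groups_big = [["big\\x.png", "big\\y.png"], ["big\\a.png", "big\\z.png"]]

print(locate_in_groups(groups_small, ["a.png", "b.png"]))
print(locate_in_groups(groups_big, ["a.png", "b.png"]))
```

Running this on both the small and the big dataset shows whether the images are split across groups or missing from the grouping result altogether, which helps when building a smaller reproducible example.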
