Duplicate Images #9

Open
ansariyusuf opened this issue Oct 9, 2020 · 8 comments

Comments

@ansariyusuf

ansariyusuf commented Oct 9, 2020

I am trying to create a food dataset. However, when I try to scrape from Bing using this library, I am getting a lot of duplicate images. Please assist.

Thank you

@NickT5

NickT5 commented Oct 18, 2020

My first attempt at filtering out duplicates would be to subtract two possibly duplicated images pixel by pixel and check whether the difference is close to zero.
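
A minimal sketch of that subtraction idea, assuming both images have already been decoded into flat sequences of pixel values (decoding is not shown). The names `mean_abs_diff` and `is_near_duplicate` and the default `tolerance` are hypothetical and not part of this library:

```python
def mean_abs_diff(pixels_a, pixels_b):
    """Average absolute per-value difference between two decoded images."""
    if len(pixels_a) != len(pixels_b):
        raise ValueError("images must have the same number of pixel values")
    if not pixels_a:
        return 0.0
    total = sum(abs(a - b) for a, b in zip(pixels_a, pixels_b))
    return total / len(pixels_a)


def is_near_duplicate(pixels_a, pixels_b, tolerance=2.0):
    """Treat two images as duplicates when their difference is close to zero."""
    if len(pixels_a) != len(pixels_b):
        return False  # different sizes cannot be compared by subtraction
    return mean_abs_diff(pixels_a, pixels_b) <= tolerance
```

Comparing every pair this way grows quadratically with the number of images, so it is probably best suited to a cleanup pass over a small set.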

@atsbomb

atsbomb commented Dec 6, 2020

I'm getting the same. I downloaded 10000 pictures and 9789 of them were duplicates. Is this the nature of Bing image search, or is it specific to this downloader?

@jane-cz

jane-cz commented Jan 24, 2021

When I scrape 100 photos, after the first 85 to 90 images, they start to repeat, and the rest are all duplicates.
When I scrape 500 photos, 370 of them are duplicates :(
Other than this it works great, so I really hope this issue can get fixed.

@AbhiDhariwal

AbhiDhariwal commented Feb 3, 2021

Yes, I faced the same issue. It comes from how the downloader is programmed: Bing has no "next page", so instead of first=pagecounter, set first to the total number of URLs visited so far.
I also added a check that ignores a URL if it has already been visited.
I will also open a pull request with the code, or you can visit https://github.com/AbhiDhariwal/bing_image_downloader
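
A rough sketch of the two changes described in this comment, not the code from the fork. It assumes a hypothetical callable `fetch_page(first)` that returns the image links Bing gives for result offset `first`; `collect_unique_links` is also a made-up name. Using the count of distinct URLs as the offset is one reading of "total URLs visited":

```python
def collect_unique_links(fetch_page, limit):
    """Collect up to `limit` distinct links from an offset-based search."""
    seen = set()
    ordered = []
    while len(ordered) < limit:
        added = 0
        # Offset by the number of distinct URLs so far, not a page counter.
        for link in fetch_page(len(ordered)):
            if link in seen:
                continue  # ignore a URL that was already visited
            seen.add(link)
            ordered.append(link)
            added += 1
            if len(ordered) == limit:
                break
        if added == 0:
            break  # nothing new at this offset, so stop instead of looping
    return ordered
```

Stopping when an offset yields no new links means the function can return fewer than `limit` results, which seems preferable to searching indefinitely.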

@shoppel

shoppel commented Apr 1, 2021

I successfully avoided duplicate images with the following code, but now it searches forever. So yes,
maybe we need a "next" button to get more images.

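```python
import imghdr  # Note: imghdr was removed from the standard library in Python 3.13.
import urllib.request

# In __init__: raw bytes of every image saved so far.
self.duplicates = set()

def save_image(self, link, file_path):
    request = urllib.request.Request(link, None, self.headers)
    image = urllib.request.urlopen(request, timeout=self.timeout).read()

    # Reject non-images and byte-for-byte repeats. A bare `raise` with no
    # active exception fails with RuntimeError, so raise explicitly.
    if not imghdr.what(None, image) or image in self.duplicates:
        print('[Error]Invalid image, not saving {}\n'.format(link))
        raise ValueError('Invalid or duplicate image: {}'.format(link))
    self.duplicates.add(image)

    with open(file_path, 'wb') as f:
        f.write(image)
```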
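
The set in the snippet above keeps the full bytes of every saved image in memory. One possible variant, sketched here under the assumption that only byte-identical copies need to be caught, stores a SHA-256 digest per image instead. `SeenImages` and `is_new` are hypothetical names, not part of this library:

```python
import hashlib


class SeenImages:
    """Remember SHA-256 digests of saved images rather than the raw bytes."""

    def __init__(self):
        self._digests = set()

    def is_new(self, image_bytes):
        """Return True the first time these bytes are seen, False afterwards."""
        digest = hashlib.sha256(image_bytes).digest()
        if digest in self._digests:
            return False
        self._digests.add(digest)
        return True
```

A `save_image` method could call `is_new(image)` in place of the `image in self.duplicates` check. Re-encoded or resized copies of the same picture would still slip through, since their bytes differ.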

@sid7631
Contributor

sid7631 commented Sep 18, 2021

See PR #20, which removes duplicates.

@annabaringer

Bumping this issue. The fix above looks like it works, and it would be great if it were merged. Thanks!

@sid7631
Contributor

sid7631 commented Mar 14, 2022

Please close this issue
