[datasets] Extend the range of public datasets supported in docTR #587

Closed
7 tasks done
Tracked by #791
fg-mindee opened this issue Nov 5, 2021 · 23 comments · Fixed by #785
Labels
help wanted Extra attention is needed module: datasets Related to doctr.datasets type: enhancement Improvement
Comments

@fg-mindee
Contributor

fg-mindee commented Nov 5, 2021

@fg-mindee fg-mindee added type: enhancement Improvement module: datasets Related to doctr.datasets labels Nov 5, 2021
@fg-mindee fg-mindee added this to the 1.0.0 milestone Nov 5, 2021
@felixdittrich92
Contributor

felixdittrich92 commented Nov 5, 2021

@fg-mindee
Handwritten (IAM dataset):
https://fki.tic.heia-fr.ch/databases/iam-handwriting-database
It was also used for TrOCR fine-tuning.

PS: it contains only English handwriting!
I would prefer a generator at this point, wdyt?

@felixdittrich92
Contributor

felixdittrich92 commented Nov 5, 2021

TextOCR dataset
https://textvqa.org/textocr/dataset/
COCOText-v2 dataset
https://bgshih.github.io/cocotext/

@felixdittrich92
Contributor

felixdittrich92 commented Nov 7, 2021

@fg-mindee
I will start with the IIIT-5K dataset, can you assign me to this 😃?
For COCOText-v2 I currently have a working version, but I think we have to discuss this before starting (it is currently implemented with the COCOText API, which has a BSD license 🦖).

@fg-mindee
Contributor Author

Sure, you can go ahead and I see that you already opened a PR :)

@felixdittrich92
Contributor

Chars74K dataset
http://www.ee.surrey.ac.uk/CVSSP/demos/chars74k/

@felixdittrich92
Contributor

@fg-mindee
I'm not able to get the IAM dataset (I registered, but nothing happened afterwards), so I would say let's change this to the IMGUR5K dataset. I will prepare this (it's only a generator script), but we would need to upload it (IAM would be the same).
https://github.com/facebookresearch/IMGUR5K-Handwriting-Dataset

@fg-mindee
Contributor Author

If one dataset cannot be downloaded directly, we'll add instructions where to get it and change the constructor. But apart from some exceptions, we won't reupload public datasets 👍

@felixdittrich92
Contributor

@fg-mindee
Would it be the same if a dataset needs multiple downloads (one for the annotation file, one for the images, ...), or do you have any other suggestion? :)

@fg-mindee
Contributor Author

@fg-mindee Would it be the same if a dataset needs multiple downloads (one for the annotation file, one for the images, ...), or do you have any other suggestion? :)

If the dataset is public, either:

  • all of those are in a downloadable zip --> we download and extract it
  • there are many URLs to download --> we need to store the list of URLs and download all of them in parallel
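
For the second case, a minimal sketch of a parallel download could look like the following. This is not docTR code: `download_all`, `fetch_bytes` and their parameters are hypothetical names chosen for illustration, and the sketch assumes each URL ends in a distinct file name.

```python
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path
from typing import Callable, Dict, List
from urllib.parse import urlparse
from urllib.request import urlopen


def fetch_bytes(url: str) -> bytes:
    """Default fetcher: read the whole response body into memory."""
    with urlopen(url) as response:
        return response.read()


def download_all(
    urls: List[str],
    dest_dir: str,
    fetch: Callable[[str], bytes] = fetch_bytes,
    max_workers: int = 8,
) -> Dict[str, Path]:
    """Download every URL into dest_dir, using a thread pool for the I/O."""
    root = Path(dest_dir)
    root.mkdir(parents=True, exist_ok=True)

    def _download(url: str) -> Path:
        # Name the local file after the last path component of the URL
        target = root / Path(urlparse(url).path).name
        if not target.exists():  # skip files fetched by an earlier run
            target.write_bytes(fetch(url))
        return target

    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # pool.map preserves the order of the input URLs
        paths = list(pool.map(_download, urls))
    return dict(zip(urls, paths))
```

Threads are used here because the work is network-bound; the `fetch` argument is injectable so the function can be exercised without network access.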

@felixdittrich92
Contributor

felixdittrich92 commented Nov 23, 2021

@fg-mindee

  • The IAM dataset is not possible (I have requested access, but nothing has happened since then).

  • Same for IMGUR5K: without uploading it, we cannot provide this dataset (it's a generator script which downloads the images from google images and compares their checksums against a provided list).

Wdyt, can we add a function download_from_drive to also support Google Drive downloads?
For example, the TotalText download is a Drive link, and providing a version of IMGUR5K would then also be no problem :)

For ICDAR2003, which task would we provide?
Character recognition or word recognition? Both have an English-only vocab.
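
The checksum comparison mentioned above for IMGUR5K can be sketched roughly as follows. The function names and the `{file name: hex digest}` mapping are assumptions made for illustration, and MD5 is assumed as the hash; the actual script in the IMGUR5K repository may differ.

```python
import hashlib
from pathlib import Path
from typing import Dict, List


def file_md5(path: Path, chunk_size: int = 1 << 20) -> str:
    """Hash a file in chunks so large images are not loaded at once."""
    digest = hashlib.md5()
    with path.open("rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def find_corrupted(img_dir: str, expected: Dict[str, str]) -> List[str]:
    """Return the names of files that are missing or whose hash differs.

    `expected` is a hypothetical mapping {file name: hex MD5 digest}.
    """
    root = Path(img_dir)
    bad = []
    for name, md5 in expected.items():
        path = root / name
        if not path.is_file() or file_md5(path) != md5.lower():
            bad.append(name)
    return bad
```

Files reported by `find_corrupted` could then be re-downloaded or dropped from the annotation list.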

@felixdittrich92
Contributor

ICDAR 2019 Robust Reading Challenge on Multi-lingual scene text detection and recognition
https://rrc.cvc.uab.es/?ch=15&com=downloads (need account)

@fg-mindee
Contributor Author

@felixdittrich92 sorry for the late reply!

  • about the selection of datasets, we won't support dozens of datasets in the end, so it's alright if we can't have all of them 👍
  • from what I remember, downloading from Google Drive requires a pretty big dependency, so we won't go into this :/
  • for ICDAR-like datasets where registration is needed, we can easily use the second type of dataset, just like for our private datasets: the download arg doesn't exist, and you need to provide the path to where you downloaded the dataset. (Torchvision does the same, see https://pytorch.org/vision/stable/datasets.html#imagenet)
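
To illustrate the last point, here is a minimal sketch of a dataset without a download argument. `LocalFolderDataset` is a hypothetical class, not the docTR API, and the label file is assumed to be a JSON mapping from image file name to transcription.

```python
import json
from pathlib import Path
from typing import List, Tuple


class LocalFolderDataset:
    """Hypothetical dataset that never downloads: the user supplies the paths."""

    def __init__(self, img_folder: str, label_path: str) -> None:
        self.root = Path(img_folder)
        labels_file = Path(label_path)
        if not self.root.is_dir() or not labels_file.is_file():
            # Point the user to the manual download step instead of fetching
            raise FileNotFoundError(
                "Dataset not found. Download it manually (registration may be "
                "required) and pass the local paths to the constructor."
            )
        # Assumed label format: {"image_name.png": "transcription", ...}
        with labels_file.open(encoding="utf-8") as f:
            labels = json.load(f)
        self.data: List[Tuple[str, str]] = []
        for name, text in labels.items():
            if not (self.root / name).is_file():
                raise FileNotFoundError(f"Missing image: {self.root / name}")
            self.data.append((name, text))

    def __len__(self) -> int:
        return len(self.data)

    def __getitem__(self, idx: int) -> Tuple[Path, str]:
        name, text = self.data[idx]
        return self.root / name, text
```

Usage would be along the lines of `LocalFolderDataset("/path/to/images", "/path/to/labels.json")`, with both paths pointing at files the user obtained after registering.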

@felixdittrich92
Contributor

@fg-mindee
A Google Drive downloader would also be a single function, which only needs Python built-ins (it does not work with full URLs, you only need the file_id from the URL) ... but it's OK not to support it :)
I will investigate a bit to see which datasets are really a good fit and come back with this.
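
As a rough sketch of the file_id part only, using nothing but the standard library: the two URL shapes handled below (`/file/d/<id>/...` and `?id=<id>`) are assumptions about common Google Drive share links, and `extract_drive_file_id` is a hypothetical helper name. The download step itself is left out.

```python
import re
from typing import Optional
from urllib.parse import parse_qs, urlparse


def extract_drive_file_id(url: str) -> Optional[str]:
    """Return the file_id embedded in a share link, or None if not found."""
    parsed = urlparse(url)
    # Assumed shape 1: https://drive.google.com/file/d/<file_id>/view
    match = re.search(r"/file/d/([\w-]+)", parsed.path)
    if match:
        return match.group(1)
    # Assumed shape 2: https://drive.google.com/open?id=<file_id>
    ids = parse_qs(parsed.query).get("id")
    return ids[0] if ids else None
```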

@felixdittrich92
Contributor

felixdittrich92 commented Nov 24, 2021

@fg-mindee

Additional maybe ?

Replace:
IAM (no access and no response to the request) with IMGUR5K (I am currently not sure how we can provide this in a safe way)
ICDAR13 with ICDAR19

I think the IMGUR5K dataset and IC19 would bring the most improvement 🤗
The first for handwritten localization and recognition, and the second for multilingual localization and recognition.

wdyt, and can we update the list above? 😄

EDIT: for IMGUR5K, would it be possible to upload a zip file which contains the annotations/img_urls and hashes? (9.6 MB)
Then I can implement the generation/loading from these files.

@felixdittrich92
Contributor

@fg-mindee
OK, I think that (after the open PRs) it would be enough for the moment!?
Only IMGUR5K would be nice as a last one, in order to be able to cover handwriting as well.
wdyt?

We should then also revise the boxes for the records so that they all have relative coordinates (but I would open an extra PR once this one is complete) 😄
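
A minimal sketch of what converting boxes to relative coordinates could mean, assuming absolute boxes are stored as (xmin, ymin, xmax, ymax) in pixels. The `to_relative` helper and the box layout are assumptions for illustration, not the format docTR settled on.

```python
from typing import List, Tuple

Box = Tuple[float, float, float, float]  # assumed layout: (xmin, ymin, xmax, ymax)


def to_relative(boxes: List[Box], img_width: int, img_height: int) -> List[Box]:
    """Scale pixel coordinates into the [0, 1] range of the image size."""
    if img_width <= 0 or img_height <= 0:
        raise ValueError("Image dimensions must be positive")

    def _clip(value: float) -> float:
        # Annotations sometimes spill slightly outside the image
        return min(max(value, 0.0), 1.0)

    return [
        (
            _clip(xmin / img_width),
            _clip(ymin / img_height),
            _clip(xmax / img_width),
            _clip(ymax / img_height),
        )
        for xmin, ymin, xmax, ymax in boxes
    ]
```

Relative coordinates stay valid when the image is resized, which is the main reason to prefer them across datasets with different image sizes.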

@fg-mindee
Contributor Author

Hello @felixdittrich92 👋

Yes, apart from IMGUR5K, which I'll have to take a closer look at, I think we're good for a while now 👌
Regarding localization annotation format, we were also planning to unify all of those, but it will be a topic for another issue :)

@fg-mindee
Contributor Author

fg-mindee commented Nov 29, 2021

About IC19 replacing IC13, most research papers evaluate their perf on IC03, IC13 and IC15.
While I agree it's nice to have multilingual, as of now, the library hasn't been tested to work with characters that might be out of most popular encodings. So perhaps we should stick with either IC13 or IC15 for now 🤔
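
One way to gauge how far a multilingual dataset falls outside a given vocabulary is to collect the characters its labels use that the vocabulary does not cover. The sketch below is only an illustration: `out_of_vocab_chars` and `LATIN_VOCAB` are hypothetical names, and the vocab is a stand-in rather than one of docTR's vocabularies.

```python
import string
from typing import Iterable, Set

# Stand-in vocabulary for illustration only
LATIN_VOCAB = string.ascii_letters + string.digits + string.punctuation + " "


def out_of_vocab_chars(labels: Iterable[str], vocab: str = LATIN_VOCAB) -> Set[str]:
    """Collect every character used in the labels that the vocab lacks."""
    allowed = set(vocab)
    return {char for label in labels for char in label if char not in allowed}
```

An empty result would suggest the dataset can be used with the given vocabulary as-is; a large one points to labels that would need filtering or a wider vocab.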

@fg-mindee
Contributor Author

Checking the ref, Imgur5K word images look like a nice final addition for this round :)

@felixdittrich92
Contributor

@fg-mindee
Yes, of course, but I think this needs a bit more discussion :)
IMGUR5K is not a "real" dataset, it's only a generator script to prepare a dataset 😅
https://github.com/facebookresearch/IMGUR5K-Handwriting-Dataset/blob/main/download_imgur5k.py

We can definitely do this, but I would prefer to discuss it before any implementation.

@fg-mindee
Contributor Author

Mmmh it looks more like the script to simply download the dataset 😅
It parses the list of image URLs that is in the repo, and downloads them! (cf. https://github.com/facebookresearch/IMGUR5K-Handwriting-Dataset/blob/main/download_imgur5k.py#L95)

And it's a real-world handwritten image dataset, not a synthetic one to the best of my understanding!

@felixdittrich92
Contributor

@fg-mindee
Yes, you misunderstood this: by "real" I meant something in a compressed format 😅
like other datasets

The repo provides some "lists" with URLs, hashes and labels .. of course we can provide something as "live creation" but:

Let's discuss this briefly tomorrow in Slack :)

@felixdittrich92
Contributor

@fg-mindee
Please replace IAM with IMGUR5K: https://github.com/facebookresearch/IMGUR5K-Handwriting-Dataset

Thanks :)

@fg-mindee fg-mindee added the help wanted Extra attention is needed label Dec 24, 2021
@fg-mindee fg-mindee mentioned this issue Jan 10, 2022
85 tasks
@fg-mindee fg-mindee linked a pull request Jan 10, 2022 that will close this issue
1 task
@fg-mindee fg-mindee modified the milestones: 1.0.0, 0.6.0 Jan 10, 2022