[datasets] Extend the range of public datasets supported in docTR #587

fg-mindee · 2021-11-05T12:04:51Z

Currently, we support FUNSD, CORD and SROIE, but we should look at extending the range of supported datasets. Among others, we could include handwritten and in-the-wild datasets.

Here is a list of datasets you can usually find in OCR-related benchmarks:

Of course, the list goes on

felixdittrich92 · 2021-11-05T13:39:03Z

@fg-mindee
handwritten (IAM dataset):
https://fki.tic.heia-fr.ch/databases/iam-handwriting-database
It was also used for TrOCR fine-tuning.

PS: it contains only English handwriting!
I would prefer a generator at this point, wdyt?

felixdittrich92 · 2021-11-05T21:27:33Z

TextOCR dataset
https://textvqa.org/textocr/dataset/
COCOText-v2 dataset
https://bgshih.github.io/cocotext/

felixdittrich92 · 2021-11-07T10:08:53Z

@fg-mindee
I will start with the IIIT-5K dataset, can you assign me to this 😃?
For COCOText-v2 I currently have a working version, but I think we have to discuss this before starting (it is currently implemented with the COCOText API, which has a BSD license 🦖).

fg-mindee · 2021-11-08T10:24:52Z

Sure, you can go ahead and I see that you already opened a PR :)

felixdittrich92 · 2021-11-08T21:16:11Z

Chars74K dataset
http://www.ee.surrey.ac.uk/CVSSP/demos/chars74k/

felixdittrich92 · 2021-11-10T12:07:48Z

TotalText dataset
https://github.com/cs-chan/Total-Text-Dataset/tree/master/Dataset

felixdittrich92 · 2021-11-11T07:15:11Z

@fg-mindee
I'm not able to get the IAM dataset (I have registered, but nothing has happened since), so I would suggest switching to the IMGUR5K dataset. I will prepare this (it's only a generator script), but we would need to upload it (IAM would be the same).
https://github.com/facebookresearch/IMGUR5K-Handwriting-Dataset

fg-mindee · 2021-11-12T11:55:26Z

If a dataset cannot be downloaded directly, we'll add instructions on where to get it and change the constructor. But apart from some exceptions, we won't reupload public datasets 👍

felixdittrich92 · 2021-11-12T22:07:17Z

@fg-mindee
Same approach if a dataset needs multiple downloads (1 for the annotation file, 1 for the images, ...), or any other suggestion? :)

fg-mindee · 2021-11-15T11:06:29Z

> @fg-mindee Same way if a dataset needs multiple downloads (1 for anno file, 1 for images, ..) or any other suggestion? :)

If the dataset is public, either:

all of those are in a downloadable zip --> we download it and extract it
there are many URLs to download --> we need to store the list of URLs and download all of them in parallel
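
A minimal sketch of the second case, assuming a plain thread pool from the standard library is enough. The names `download_all` and `_fetch`, and the rule of naming each file after the last URL component, are hypothetical and not part of docTR:

```python
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path
from typing import Callable, Dict, List
from urllib.request import urlretrieve


def _fetch(url: str, dest: Path) -> None:
    # Plain standard-library download; any HTTP client could be swapped in
    urlretrieve(url, str(dest))


def download_all(
    urls: List[str],
    cache_dir: str,
    fetch: Callable[[str, Path], None] = _fetch,
    max_workers: int = 4,
) -> Dict[str, Path]:
    """Download every URL into cache_dir concurrently and return url -> local path."""
    root = Path(cache_dir)
    root.mkdir(parents=True, exist_ok=True)
    # Hypothetical naming rule: keep the last path component of the URL
    targets = {url: root / url.rstrip("/").rsplit("/", 1)[-1] for url in urls}

    def _job(url: str) -> None:
        dest = targets[url]
        if not dest.exists():  # skip files that are already cached
            fetch(url, dest)

    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # list() forces evaluation so that any download error is raised here
        list(pool.map(_job, urls))
    return targets
```

Passing the `fetch` callable as a parameter keeps the sketch testable without network access. A real integration would probably also verify a checksum per file.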

felixdittrich92 · 2021-11-23T12:44:36Z

@fg-mindee

The IAM dataset is not possible (I have requested access, but nothing has happened since then).
Same for IMGUR5K: without uploading it we cannot provide this dataset (it's a generator script which downloads the images from Imgur and compares their checksums against a provided list).
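
To illustrate the checksum comparison mentioned above, here is a rough sketch. It assumes MD5 hashes and a simple `{file name: hash}` mapping; both are assumptions for illustration, and `file_md5` and `find_invalid_files` are hypothetical helpers, not functions from the IMGUR5K repository:

```python
import hashlib
from pathlib import Path
from typing import Dict, List


def file_md5(path: Path, chunk_size: int = 1 << 20) -> str:
    """Hash a file in chunks so large images do not need to fit in memory."""
    digest = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def find_invalid_files(folder: str, expected: Dict[str, str]) -> List[str]:
    """Return the names that are missing or whose hash differs from the expected one."""
    invalid = []
    for name, checksum in expected.items():
        path = Path(folder) / name
        if not path.is_file() or file_md5(path) != checksum:
            invalid.append(name)
    return invalid
```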

Wdyt, can we add a function download_from_drive to also support Google Drive downloads?
For example, the TotalText download is a Drive link, and providing a version of IMGUR5K would then also be no problem :)

For ICDAR2003, which task would we provide?
Character recognition or word recognition? Both have an English-only vocab.

felixdittrich92 · 2021-11-23T12:52:36Z

ICDAR 2019 Robust Reading Challenge on Multi-lingual scene text detection and recognition
https://rrc.cvc.uab.es/?ch=15&com=downloads (need account)

fg-mindee · 2021-11-23T21:59:11Z

@felixdittrich92 sorry for the late reply!

About the selection of datasets: we won't support dozens of datasets in the end, so it's alright if we can't have all of them 👍
From what I remember, downloading from Google Drive is a pretty big dependency, so we won't go into this :/
For ICDAR-like datasets where registration is needed, we can easily, just like with our private dataset, use the second type of dataset: the download arg doesn't exist, and you need to provide the path to where you downloaded the dataset. (Torchvision does the same: https://pytorch.org/vision/stable/datasets.html#imagenet)
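
A minimal sketch of that second type of dataset, assuming the user has already downloaded the files. The class name `RegisteredAccessDataset` and the tab-separated label format are hypothetical and only serve the illustration; this is not the docTR implementation:

```python
from pathlib import Path
from typing import List, Tuple


class RegisteredAccessDataset:
    """Hypothetical dataset without a `download` argument: the files must already be on disk."""

    def __init__(self, img_folder: str, label_path: str) -> None:
        self.root = Path(img_folder)
        labels = Path(label_path)
        if not self.root.is_dir() or not labels.is_file():
            raise FileNotFoundError(
                f"unable to locate {img_folder} or {label_path}; "
                "download the dataset manually first"
            )
        # Assumed label format: one "<image name>\t<transcription>" pair per line
        self.data: List[Tuple[str, str]] = []
        for line in labels.read_text(encoding="utf-8").splitlines():
            if not line.strip():
                continue
            name, text = line.split("\t", 1)
            self.data.append((name, text))

    def __len__(self) -> int:
        return len(self.data)

    def __getitem__(self, idx: int) -> Tuple[Path, str]:
        name, text = self.data[idx]
        # Returns the image path; decoding the image is left to the caller
        return self.root / name, text
```

Failing early with a clear error message tells the user where the manual download step is expected, which is the main difference from the auto-downloading datasets.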

felixdittrich92 · 2021-11-24T09:56:26Z

@fg-mindee
A Google Drive downloader would also be a single function that needs only Python builtins (it does not work with full URLs, you only need the file_id from the URL) ... but it's OK not to support it :)
I will investigate a bit, check which datasets are really a good fit, and come back with this.
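
For reference, a rough standard-library sketch of the file_id idea. The two share-link shapes handled here and the `uc?export=download` URL are assumptions based on commonly seen Drive links, not something confirmed in this thread; large files may need an extra confirmation step that is not covered, and both function names are hypothetical:

```python
from typing import Optional
from urllib.parse import parse_qs, urlencode, urlparse


def extract_drive_file_id(url: str) -> Optional[str]:
    """Pull the file_id out of a Drive share link, or return None if none is found."""
    parsed = urlparse(url)
    # Shape 1 (assumed): .../open?id=<file_id> or .../uc?id=<file_id>
    query_id = parse_qs(parsed.query).get("id")
    if query_id:
        return query_id[0]
    # Shape 2 (assumed): .../file/d/<file_id>/view
    parts = [p for p in parsed.path.split("/") if p]
    if "d" in parts and parts.index("d") + 1 < len(parts):
        return parts[parts.index("d") + 1]
    return None


def direct_download_url(file_id: str) -> str:
    # Assumed endpoint; it may change and is not guaranteed by any API contract
    return "https://drive.google.com/uc?" + urlencode({"export": "download", "id": file_id})
```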

felixdittrich92 · 2021-11-24T14:57:47Z

@fg-mindee

Additional, maybe?

IC03 (http://www.iapr-tc11.org/mediawiki/index.php?title=ICDAR_2003_Robust_Reading_Competitions)
Chars74K (http://www.ee.surrey.ac.uk/CVSSP/demos/chars74k/)
for char recognition training

Replace:
IAM (no access and no response to the request) with IMGUR5K (I am currently not sure how we can provide this in a safe way)
ICDAR13 with ICDAR19

I think the IMGUR5K dataset and IC19 would bring the most improvement 🤗
The first for handwritten localization and recognition, the second for multilingual localization and recognition.

wdyt, and can we update the list above? 😄

EDIT: for IMGUR5K, would it be possible to upload a zip file which contains the annotations, image URLs and hashes (9.6 MB)?
Then I can implement the generation/loading from these files.

felixdittrich92 · 2021-11-26T15:11:56Z

@fg-mindee
OK, I think (after the open PRs) it would be enough for the moment!?
Only IMGUR5K would be nice as a last one, in order to cover handwriting as well.
wdyt?

We should then also revise the boxes of the samples so that they all use relative coordinates (but I would open an extra PR once this one is complete) 😄
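
As a small illustration of the relative-coordinate idea, assuming boxes are stored as absolute `(xmin, ymin, xmax, ymax)` pixel values (the function name and the box layout are assumptions, not the docTR format):

```python
from typing import List, Sequence, Tuple

Box = Tuple[float, float, float, float]


def to_relative(boxes: Sequence[Box], img_width: int, img_height: int) -> List[Box]:
    """Convert absolute (xmin, ymin, xmax, ymax) pixel boxes to values in [0, 1]."""
    if img_width <= 0 or img_height <= 0:
        raise ValueError("image dimensions must be positive")

    def _clip(value: float) -> float:
        # Annotations sometimes spill slightly outside the image
        return min(max(value, 0.0), 1.0)

    return [
        (
            _clip(xmin / img_width),
            _clip(ymin / img_height),
            _clip(xmax / img_width),
            _clip(ymax / img_height),
        )
        for xmin, ymin, xmax, ymax in boxes
    ]
```

Relative boxes stay valid when the image is resized, which is why a single convention across datasets is convenient.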

fg-mindee · 2021-11-29T22:43:36Z

Hello @felixdittrich92 👋

Yes, apart from IMGUR5K, which I'll have to take a closer look at, I think we're good for a while now 👌
Regarding localization annotation format, we were also planning to unify all of those, but it will be a topic for another issue :)

fg-mindee · 2021-11-29T22:57:36Z

About IC19 replacing IC13, most research papers evaluate their perf on IC03, IC13 and IC15.
While I agree it's nice to have multilingual support, as of now the library hasn't been tested with characters that might fall outside the most popular encodings. So perhaps we should stick with either IC13 or IC15 for now 🤔
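
One way to gauge this before adding a multilingual dataset is to list the label characters that a given vocab does not cover. This is only a sketch: the ASCII-based `VOCAB` and the helper name are made up for illustration and are not the docTR vocabs:

```python
import string
from typing import Iterable, List, Set

# Hypothetical vocab: printable ASCII letters, digits, punctuation and space
VOCAB: Set[str] = set(string.ascii_letters + string.digits + string.punctuation + " ")


def out_of_vocab_chars(labels: Iterable[str], vocab: Set[str] = VOCAB) -> List[str]:
    """Collect every character that appears in the labels but not in the vocab."""
    unknown: Set[str] = set()
    for label in labels:
        unknown.update(ch for ch in label if ch not in vocab)
    return sorted(unknown)
```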

fg-mindee · 2021-12-14T14:31:09Z

Checking the ref, Imgur5K word images look like a nice final addition for this round :)

felixdittrich92 · 2021-12-14T14:38:40Z

@fg-mindee
Yes, of course, but I think this needs a bit more discussion :)
IMGUR5K is not a "real" dataset, it's only a generator script to prepare a dataset 😅
https://github.com/facebookresearch/IMGUR5K-Handwriting-Dataset/blob/main/download_imgur5k.py

We can definitely do this, but I would prefer to discuss it before any implementation.

fg-mindee · 2021-12-14T16:50:42Z

Mmmh, it looks more like a script that simply downloads the dataset 😅
It parses the list of image URLs that is in the repo and downloads them! (cf. https://github.com/facebookresearch/IMGUR5K-Handwriting-Dataset/blob/main/download_imgur5k.py#L95)

And it's a real-world handwritten image dataset, not a synthetic one, to the best of my understanding!

felixdittrich92 · 2021-12-14T17:10:37Z

@fg-mindee
Yes, that was a misunderstanding: by "real" I meant distributed in a compressed format 😅
like other datasets

The repo provides some "lists" with URLs, hashes and labels... of course we could provide something like a "live creation", but:

Let's discuss this briefly on Slack tomorrow :)

felixdittrich92 · 2021-12-15T14:01:48Z

@fg-mindee
Please replace IAM with IMGUR5K: https://github.com/facebookresearch/IMGUR5K-Handwriting-Dataset

Thanks :)

fg-mindee added the "type: enhancement" and "module: datasets" labels on Nov 5, 2021

fg-mindee added this to the 1.0.0 milestone on Nov 5, 2021

felixdittrich92 mentioned this issue on Nov 8, 2021: IIIT-5K dataset integration #589 (Merged)

fg-mindee mentioned this issue on Nov 15, 2021: SynthText dataset integration #624 (Merged)

This was referenced on Nov 15, 2021: Add multi download in parallel #625 (Closed), SVHN dataset integration #634 (Merged)

This was referenced on Nov 26, 2021: ICDAR2019 dataset integration #652 (Closed), ICDAR2003 dataset integration #653 (Merged)

felixdittrich92 mentioned this issue on Nov 30, 2021: ICDAR2013 dataset integration #662 (Merged)

fg-mindee added the "help wanted" label on Dec 24, 2021

felixdittrich92 mentioned this issue on Jan 5, 2022: Imgur5k dataset integration #785 (Merged)

fg-mindee mentioned this issue on Jan 10, 2022: Release tracker - v0.6.0 #791 (Closed)

fg-mindee linked a pull request on Jan 10, 2022 that will close this issue: Imgur5k dataset integration #785 (Merged)

fg-mindee modified the milestones: 1.0.0, 0.6.0 on Jan 10, 2022

fg-mindee closed this as completed in #785 on Jan 10, 2022

tobiascornille mentioned this issue on May 13, 2023: Support for Handwritten text #1049 (Open)
