project-anuvaad/anuvaad-ocr-corpus

Anuvaad OCR Corpus

This repository contains links to corpora for popular Indian languages, developed as part of the Anuvaad project.

Please reach out to nlp-nmt@tarento.com for any clarification/interpretation/usage of the linked datasets.

This work is licensed under a Creative Commons Attribution 4.0 International License.

Goal

The goal is to build a high-quality corpus, extracted from PDFs, for Indian languages across various domains (General, Legal, Education, Healthcare, Automobile, News, etc.). The corpus can eventually be used to train ML models for specific use cases.

Read more about Anuvaad at http://anuvaad.org/

Links

English

| Domain | Source | Sentence count | Corpus Download Link |
| --- | --- | --- | --- |
| Educational | NCERT | 2,03,000 | Class-1, Class-2, Class-3, Class-4, Class-5, Class-6, Class-7, Class-8, Class-9, Class-10, Class-11, Class-12 |
| Educational | ebalbharti | 90,900 | Class-1, Class-2, Class-3, Class-4, Class-5, Class-6, Class-7, Class-8, Class-9, Class-10, Class-11, Class-12 |
| Educational | NIOS-Diploma | 32,100 | All |

Hindi

| Domain | Source | Sentence count | Corpus Download Link |
| --- | --- | --- | --- |
| Educational | NCERT | 2,19,000 | Class-1, Class-2, Class-3, Class-4, Class-5, Class-6, Class-7, Class-8, Class-9, Class-10, Class-11, Class-12 |
| Educational | ebalbharti | 61,500 | Class-1, Class-2, Class-3, Class-4, Class-5, Class-6, Class-7, Class-8, Class-9, Class-10, Class-11, Class-12 |
| Educational | NIOS-Diploma | 31,500 | All |

Bengali

| Domain | Source | Sentence count | Corpus Download Link |
| --- | --- | --- | --- |
| Educational | ebalbharti | 17,200 | Class-1, Class-2, Class-3, Class-4, Class-5, Class-11, Class-12 |
| Educational | NIOS-Diploma | 29,800 | All |

Tamil

| Domain | Source | Sentence count | Corpus Download Link |
| --- | --- | --- | --- |
| Educational | ebalbharti | 10,800 | Class-1, Class-2, Class-3, Class-4 |
| Educational | NIOS-Diploma | 31,700 | All |

Malayalam

| Domain | Source | Sentence count | Corpus Download Link |
| --- | --- | --- | --- |

Telugu

| Domain | Source | Sentence count | Corpus Download Link |
| --- | --- | --- | --- |
| Educational | ebalbharti | 69,600 | Class-1, Class-2, Class-3, Class-4, Class-5, Class-6, Class-7, Class-8, Class-9, Class-10, Class-11, Class-12 |
| Educational | NIOS-Diploma | 29,800 | All |

Kannada

| Domain | Source | Sentence count | Corpus Download Link |
| --- | --- | --- | --- |
| Educational | ebalbharti | 61,600 | Class-1, Class-2, Class-3, Class-4, Class-5, Class-6, Class-7, Class-8, Class-9, Class-10, Class-11, Class-12 |
| Educational | NIOS-Diploma | 27,200 | All |

Marathi

| Domain | Source | Sentence count | Corpus Download Link |
| --- | --- | --- | --- |
| Educational | ebalbharti | 68,900 | Class-1, Class-2, Class-3, Class-4, Class-5, Class-6, Class-7, Class-8, Class-9, Class-10, Class-11, Class-12 |
| Educational | NIOS-Diploma | 26,100 | All |

Punjabi

| Domain | Source | Sentence count | Corpus Download Link |
| --- | --- | --- | --- |
| Educational | NIOS-Diploma | 18,900 | All |

Gujarati

| Domain | Source | Sentence count | Corpus Download Link |
| --- | --- | --- | --- |
| Educational | ebalbharti | 63,600 | Class-1, Class-2, Class-3, Class-4, Class-5, Class-6, Class-7, Class-8, Class-9, Class-10, Class-11, Class-12 |
| Educational | NIOS-Diploma | 36,400 | All |

Assamese

| Domain | Source | Sentence count | Corpus Download Link |
| --- | --- | --- | --- |
| Educational | NIOS-Diploma | 27,400 | All |

Urdu

| Domain | Source | Sentence count | Corpus Download Link |
| --- | --- | --- | --- |

Odia

| Domain | Source | Sentence count | Corpus Download Link |
| --- | --- | --- | --- |
| Educational | NIOS-Diploma | 27,400 | All |
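
Note on sentence counts

The sentence counts in the language tables above use Indian digit grouping, so 2,03,000 means 203,000. The following is a minimal sketch, not part of the Anuvaad tooling, of how such counts could be parsed and summed in Python. The helper name `parse_count` and the `english_rows` list are hypothetical and introduced only for illustration; the figures are copied from the English table.

```python
def parse_count(text: str) -> int:
    """Parse a count written with digit-group commas, e.g. '2,03,000'."""
    # Removing the commas is enough: the grouping style (Indian or
    # Western) does not change the underlying digits.
    return int(text.replace(",", ""))


# (source, sentence count) pairs copied from the English table.
# The list name is hypothetical; it is not defined by the repository.
english_rows = [
    ("NCERT", "2,03,000"),
    ("ebalbharti", "90,900"),
    ("NIOS-Diploma", "32,100"),
]

english_total = sum(parse_count(count) for _, count in english_rows)
print(english_total)  # 326000
```

The same approach should apply to any other language table by swapping in its rows.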
