Licenses

License considerations for each source are given below. All of the licenses permit open use for non-commercial purposes.

If you believe any part of this dataset violates intellectual property rights, please let us know and we will remove it.

Source	Description	License
Arabic Dialects Dataset	Dataset of Arabic dialects for Gulf, Egyptian, Levantine, and Tunisian Arabic dialects plus MSA	No explicit license; website describes data as "some free and useful Arabic corpora that I have created for researchers working on Arabic Natural Language Processing, Corpus and Computational Linguistics."
BLTR	Monolingual Bhojpuri corpus	CC BY-NC-SA 4.0
Global Voices	A parallel corpus of news stories from the website Global Voices	The website for Global Voices is licensed under Creative Commons Attribution 3.0. There is no explicit additional license accompanying the dataset.
Guaraní Parallel Set	Parallel Guaraní-Spanish news corpus sourced from Paraguayan websites	No explicit license
HKCanCor	Transcribed conversations in Hong Kong Cantonese	CC BY 4.0
IADD	Arabic dialect identification dataset covering 5 regions (Maghrebi, Levantine, Egypt, Iraq, and Gulf) and 9 countries (Algeria, Morocco, Tunisia, Palestine, Jordan, Syria, Lebanon, Egypt and Iraq). It is created from five corpora: DART, SHAMI, TSAC, PADIC, and AOC.	Multiple licenses: Apache License 2.0 (SHAMI); GNU Lesser General Public License v3.0 (TSAC); GNU General Public License v3 (PADIC). DART and AOC had no explicit license.
Leipzig Corpora Collection	A collection of corpora in different languages with an identical format.	The Terms of Usage state "Permission for use is granted free of charge solely for non-commercial personal and scientific purposes licensed under the Creative Commons License CC BY-NC."
LTI	Training data for language identification	From the README: "With the exception of the contents of the Europarl/, ProjectGutenberg/, and PublicDomain/ directories, all code and text in this corpus are copyrighted. However, they may be redistributed under the terms of various Creative Commons licenses and the GNU GPL. Copying the unmodified archive noncommercially is permitted by all of the licenses. For commercial redistribution or redistribution of modified versions, please consult the individual licenses."
MADAR Shared Task 2019, subtask 1	Dialectal Arabic in the travel domain	The MADAR Corpus has a custom license, the text of which can be found in this repo.
EM corpus	Parallel Manipuri-English sentences crawled from The Sangai Express	CC BY-NC 4.0
MIZAN	Parallel Persian-English corpus from literature domain	CC BY 4.0
MT560 v1	A machine translation dataset for over 500 languages to English. We have filtered out data from OPUS-100, Europarl, Open Subtitles, Paracrawl, Wikimedia, Wikimatrix, Wikititles, and Common Crawl due to issues with the fidelity of the language labels.	Apache License 2.0
NLLB Seed	Around 6000 sentences in 39 languages sampled from Wikipedia, intended to cover languages lacking training data.	CC BY-SA 4.0
SETIMES	A parallel corpus of news articles in the Balkan languages	CC BY-SA 3.0
Tatoeba	Collaborative sentence translations	CC BY 2.0 FR
Tehran English-Persian parallel corpus (TEP)	Parallel Persian-English sentences sourced from subtitles	GNU General Public License
Turkic Interlingua (TIL) Corpus	A large-scale parallel corpus combining most of the public datasets for 22 Turkic languages	CC BY-NC-SA 4.0
WiLI-2018	Wikipedia language identification benchmark containing 235K paragraphs of 235 languages	Open Data Commons Open Database License (ODbL) v1.0
XL-Sum	Summarisation dataset covering 44 languages, sourced from BBC News	CC BY-NC-SA 4.0
