Changelog

All notable changes to this project will be documented in this file.

The format is based on Keep a Changelog, and this project adheres to Semantic Versioning.

In particular:

Any updates to pre-existing files shall entail a major version bump.
The inclusion of a new language, without changes to other pre-existing data, shall only entail a minor version bump.

It follows from the above that benchmark scores across different releases of FLORES+ with the same major version number are comparable.
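As an illustration of the comparability rule above, the following minimal sketch checks whether two release identifiers share a major version number. The function names `major_version` and `scores_comparable` are hypothetical and not part of any FLORES+ tooling. The sketch also assumes that only the major component matters; Semantic Versioning treats pre-release tags such as `-rc.1` as potentially unstable, so release candidates may warrant extra caution.

```python
def major_version(version: str) -> int:
    """Return the major component of a version string such as '2.0-rc.3'."""
    # Drop any pre-release suffix ('-rc.3'), then take the first dotted field.
    core = version.split("-", 1)[0]
    return int(core.split(".")[0])


def scores_comparable(release_a: str, release_b: str) -> bool:
    """Assumption: scores are comparable when the major versions match."""
    return major_version(release_a) == major_version(release_b)


if __name__ == "__main__":
    # '2.1' is a hypothetical future release used only for illustration.
    print(scores_comparable("2.0", "2.1"))
    print(scores_comparable("1.0", "2.0"))
```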

[2.0-rc.3] – 2024-04-25

Changed

Updated nob_Latn after additional quality assessment.

[2.0-rc.2] – 2024-03-12

Changed

Relabeled macrolanguage codes to the correct individual language codes: est_Latn to ekk_Latn, grn_Latn to gug_Latn, kon_Latn to ktu_Latn.
The ajp code has been deprecated. Relabeled ajp_Arab to apc_Arab_sout3123 and apc_Arab to apc_Arab_nort3139.
Relabeled Asante and Akuapem data: aka_Latn to twi_Latn_asan1239 and twi_Latn to twi_Latn_akua1239.
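For users migrating scripts that still refer to the older codes, the relabelings in this release can be captured in a lookup table. This is only a sketch: the names `RELABELED_2_0_RC_2` and `relabel` are hypothetical, and the mapping covers this release's relabelings only.

```python
# Old code -> new code, as listed under [2.0-rc.2] above.
RELABELED_2_0_RC_2 = {
    "est_Latn": "ekk_Latn",
    "grn_Latn": "gug_Latn",
    "kon_Latn": "ktu_Latn",
    "ajp_Arab": "apc_Arab_sout3123",
    "apc_Arab": "apc_Arab_nort3139",
    "aka_Latn": "twi_Latn_asan1239",
    "twi_Latn": "twi_Latn_akua1239",
}


def relabel(code: str) -> str:
    """Map a pre-2.0-rc.2 code to its new label; unknown codes pass through."""
    # A single lookup, never chained: old 'aka_Latn' becomes
    # 'twi_Latn_asan1239', while old 'twi_Latn' becomes 'twi_Latn_akua1239'.
    return RELABELED_2_0_RC_2.get(code, code)


if __name__ == "__main__":
    for old in ("est_Latn", "ajp_Arab", "fra_Latn"):
        print(old, "->", relabel(old))
```

Note that the mapping must be applied to codes from before this release only; applying it to already-updated codes would be incorrect if an old code is ever reused with a different meaning.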

[2.0-rc.1] – 2024-02-29

Added

Added nqo_Nkoo from https://github.com/common-parallel-corpora/common-parallel-corpora.
Added the dev splits for brx_Deva, dgo_Deva, mni_Mtei, snd_Deva, gom_Deva from https://github.com/ai4bharat/IndicTrans2.
Added the dev split for mhr_Cyrl. Many thanks to Andrey Chemyshev @fu-lab.
Added the dev split for chv_Cyrl. Many thanks to @AlAntonov.

Changed

Relabeled zho_Hans to cmn_Hans after confirming the data is in Standard Beijing Mandarin.
Relabeled zho_Hant to cmn_Hant after confirming the data is in Taiwanese Mandarin.
Relabeled tgl_Latn to fil_Latn after confirming the data is in Filipino.
Relabeled tzm_Tfng to zgh_Tfng after additional quality assessment revealed the data was in Standard Moroccan Tamazight. Many thanks to @MedAymenF for pointing out the issues with the original FLORES-200 data.
Updated lij_Latn after additional quality assessment. Data has undergone minor spelling and syntactic fixes.
Updated ckb_Arab to replace usage of the non-standard ك character and improve the translation quality. Many thanks to @Sarchia for pointing out the issues with the original FLORES-200 data.
Updated yue_Hant to better conform to Hong Kong Cantonese. Thanks to the users who reported issues in facebookresearch/flores#61.
Updated several datasets in which a small number of translations (typically 1-2 sentences per dataset) had been overwritten by other sentences. The issue was reported in facebookresearch/flores#62 and facebookresearch/flores#67 – many thanks to @sotwi and @kargaranamir! Affected dev sets: ary_Arab, azb_Arab, ban_Latn, bod_Tibt, bos_Latn, bug_Latn, crh_Latn, dik_Latn, dyu_Latn, dzo_Tibt, fao_Latn, hat_Latn, jav_Latn, kam_Latn, kas_Deva, kaz_Cyrl, kbp_Latn, lim_Latn, lin_Latn, lit_Latn, lus_Latn, npi_Deva, run_Latn, san_Deva, sat_Olck, spa_Latn, ssw_Latn, sun_Latn, szl_Latn, taq_Tfng, urd_Arab, ydd_Hebr. Affected devtest sets: ary_Arab, bam_Latn, bod_Tibt, dyu_Latn, gla_Latn, grn_Latn, hat_Latn, hne_Deva, kam_Latn, kaz_Cyrl, kik_Latn, lin_Latn, lit_Latn, lua_Latn, min_Arab, min_Latn, npi_Deva, run_Latn, san_Deva, smo_Latn, spa_Latn, szl_Latn, taq_Tfng, tgk_Cyrl, uig_Arab, urd_Arab, ydd_Hebr.

[1.0] – 2023-10-24

Initial release. This is the exact same data that was released under the name FLORES-200 by NLLB Team et al. (2022). It can be downloaded from https://tinyurl.com/flores200dataset.
