- An overview of multi-modal datasets proposed for large-scale pre-training.
- Youku-mPLUG: A 10 Million Large-scale Chinese Video-Language Pre-training Dataset and Benchmarks, [Paper] [Github]
- LAION-5B: An open large-scale dataset for training next generation image-text models, [Paper] [Project]
- COYO-700M: Image-Text Pair Dataset, [Code]

No. | Dataset | Year | Scale | Modality | Language | Available | URL |
---|---|---|---|---|---|---|---|
01 | SBU Captions | 2011 | 1M | image-text | English | Yes | [Link] |
02 | Flickr30k | 2014 | 145K | image-text | English | Yes | [Link] |
03 | COCO | 2014 | 567K | image-text | English | Yes | [Link] |
04 | Visual Genome | 2017 | 5.4M | image-text | English | Yes | [Link] |
05 | VQA v2.0 | 2017 | 1.1M | image-text | English | Yes | [Link] |
06 | FashionGen | 2018 | 300K | image-text | English | Yes | [Link] |
07 | CC3M | 2018 | 3M | image-text | English | Yes | [Link] |
08 | GQA | 2019 | 1M | image-text | English | Yes | [Link] |
09 | LAIT | 2020 | 10M | image-text | English | No | - |
10 | CC12M | 2021 | 12M | image-text | English | Yes | [Link] |
11 | AltText | 2021 | 1.8B | image-text | English | No | - |
12 | TVQA | 2018 | 21,793 | video-text | English | Yes | [Link] |
13 | HT100M | 2019 | 136M | video-text | English | Yes | [Link] |
14 | WebVid2M | 2021 | 2.5M | video-text | English | Yes | [Link] |
15 | YFCC-100M | 2015 | 100M | image-text | English | Yes | [Link] |
16 | LAION-400M | 2021 | 400M | image-text | English | Yes | [Link] |
17 | RedCaps | 2021 | 12M | image-text | English | Yes | [Link] |
18 | Wukong | 2022 | 100M | image-text | Chinese | Yes | [Link] |
19 | CxC | 2021 | 24K | image-text | English | Yes | [Link] |
20 | Product1M | 2021 | 1M | image-text | Chinese | Yes | [Link] |
21 | WIT | 2021 | 37.5M | image-text | Multi-lingual | Yes | [Link] |
22 | JFT-300M | 2017 | 300M | image-text | English | No | - |
23 | JFT-3B | 2021 | 3B | image-text | English | No | - |
24 | IG-3.5B-17k | 2018 | 3.5B | image-text | English | No | - |
25 | M6-Corpus | 2021 | 60M | image, image-text | Chinese | No | - |
26 | M5Product | 2021 | 6M | image, text, table, video, audio | English | Yes | [Link] |
27 | Localized Narratives | 2020 | 849K | image, audio, text, mouse trace | English | Yes | [Link] |
28 | RUC-CAS-WenLan | 2021 | 30M | image-text | Chinese | No | - |
29 | WuDaoMM | 2022 | 600M | image-text | Chinese | Yes | [Link] |
30 | MEP-3M | 2021 | 3M | image-text | Chinese | Yes | [Link] |
31 | WSCD | 2021 | 650M | image-text | Chinese | No | - |