An overview of multi-modal datasets proposed for large-scale pre-training.
Youku-mPLUG: A 10 Million Large-scale Chinese Video-Language Pre-training Dataset and Benchmarks, [Paper] [Github]
LAION-5B: An open large-scale dataset for training next generation image-text models, [Paper] [Project]
COYO-700M: Image-Text Pair Dataset, [Code]

No.	Dataset	Year	Scale	Modality	Language	Available	URL
01	SBU Captions	2011	1M	image-text	English	Yes	[Link]
02	Flickr30k	2014	145K	image-text	English	Yes	[Link]
03	COCO	2014	567K	image-text	English	Yes	[Link]
04	Visual Genome	2017	5.4M	image-text	English	Yes	[Link]
05	VQA v2.0	2017	1.1M	image-text	English	Yes	[Link]
06	FashionGen	2018	300K	image-text	English	Yes	[Link]
07	CC3M	2018	3M	image-text	English	Yes	[Link]
08	GQA	2019	1M	image-text	English	Yes	[Link]
09	LAIT	2020	10M	image-text	English	No	-
10	CC12M	2021	12M	image-text	English	Yes	[Link]
11	AltText	2021	1.8B	image-text	English	No	-
12	TVQA	2018	21,793	video-text	English	Yes	[Link]
13	HT100M	2019	136M	video-text	English	Yes	[Link]
14	WebVid2M	2021	2.5M	video-text	English	Yes	[Link]
15	YFCC-100M	2015	100M	image-text	English	Yes	[Link]
16	LAION-400M	2021	400M	image-text	English	Yes	[Link]
17	RedCaps	2021	12M	image-text	English	Yes	[Link]
18	Wukong	2022	100M	image-text	Chinese	Yes	[Link]
19	CxC	2021	24K	image-text	English	Yes	[Link]
20	Product1M	2021	1M	image-text	Chinese	Yes	[Link]
21	WIT	2021	37.5M	image-text	Multi-lingual	Yes	[Link]
22	JFT-300M	2017	300M	image-text	English	No	-
23	JFT-3B	2021	3000M	image-text	English	No	-
24	IG-3.5B-17k	2018	3.5B	image-text	English	No	-
25	M6-Corpus	2021	60M	image, image-text	Chinese	No	-
26	M5Product	2021	6M	image, text, table, video, audio	English	Yes	[Link]
27	Localized Narratives	2020	849K	image, audio, text, mouse trace	English	Yes	[Link]
28	RUC-CAS-WenLan	2021	30M	image-text	Chinese	No	-
29	WuDaoMM	2022	600M	image-text	Chinese	Yes	[Link]
30	MEP-3M	2021	3M	image-text	Chinese	Yes	[Link]
31	WSCD	2021	650M	image-text	Chinese	No	-
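
The Scale column mixes plain counts (21,793) with K, M, and B suffixes, so the rows cannot be compared as written. The following is a minimal sketch, not part of this repository, of one way to normalize those values in Python. The helper name `parse_scale` and the sample rows are illustrative assumptions; the sketch also assumes K, M, and B mean thousand, million, and billion.

```python
import re
from decimal import Decimal

# Assumed meaning of the suffixes used in the Scale column.
_FACTORS = {"": 1, "K": 10**3, "M": 10**6, "B": 10**9}


def parse_scale(text: str) -> int:
    """Convert a Scale cell such as '1.8B', '300k' or '21,793' to an integer."""
    cleaned = text.strip().replace(",", "")
    match = re.fullmatch(r"(\d+(?:\.\d+)?)([KMB]?)", cleaned, flags=re.IGNORECASE)
    if match is None:
        raise ValueError(f"unrecognized scale value: {text!r}")
    number, suffix = match.groups()
    # Decimal avoids floating-point rounding on values like 5.4M.
    return int(Decimal(number) * _FACTORS[suffix.upper()])


# A few (dataset, scale) pairs copied from the table above.
sample_rows = [
    ("SBU Captions", "1M"),
    ("TVQA", "21,793"),
    ("AltText", "1.8B"),
    ("Visual Genome", "5.4M"),
]

# Sort the sample from largest to smallest.
by_size = sorted(sample_rows, key=lambda row: parse_scale(row[1]), reverse=True)

if __name__ == "__main__":
    for name, scale in by_size:
        print(f"{name}\t{parse_scale(scale)}")
```

Note that the counted unit differs between datasets (images, image-text pairs, video clips), so the normalized numbers are only a rough guide to relative size.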
