
MMPreTrain Release v1.0.0: Backbones, Self-Supervised Learning and Multi-Modality

Released by @fangyixiao18 on 05 Jul 08:33 (commit ae7a7b7)


Support more multi-modal algorithms and datasets

We are excited to announce that several advanced multi-modal methods are now supported! We integrated huggingface/transformers with the vision backbones in MMPreTrain to run inference, and training support is in development. A minimal inference sketch follows the table below.

| Methods                   | Datasets                       |
| :------------------------ | :----------------------------- |
| BLIP (arXiv'2022)         | COCO (caption, retrieval, vqa) |
| BLIP-2 (arXiv'2023)       | Flickr30k (caption, retrieval) |
| OFA (CoRR'2022)           | GQA                            |
| Flamingo (NeurIPS'2022)   | NLVR2                          |
| Chinese CLIP (arXiv'2022) | NoCaps                         |
| MiniGPT-4 (arXiv'2023)    | OCR VQA                        |
| LLaVA (arXiv'2023)        | Text VQA                       |
| Otter (arXiv'2023)        | VG VQA                         |
|                           | VisualGenomeQA                 |
|                           | VizWiz                         |
|                           | VSR                            |
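
As a quick illustration, here is a minimal sketch of running image captioning with one of the new multi-modal models through the mmpretrain inference API. The model name and image path below are assumptions; check `list_models` for the names actually available in your installation.

```python
from mmpretrain import inference_model, list_models

# Discover registered multi-modal models; the search pattern is illustrative.
print(list_models(pattern='blip'))

# Run captioning with a BLIP model. The model name and image path are
# placeholders; substitute an entry from list_models() and your own image.
result = inference_model('blip-base_3rdparty_caption', 'demo/cat-dog.png')
print(result)
```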

Add iTPN and SparK self-supervised learning algorithms

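For reference, a minimal sketch of looking up and instantiating these models through the mmpretrain API; the name patterns and the model name are assumptions, and the exact entries depend on the installed model zoo.

```python
from mmpretrain import get_model, list_models

# Find the registered iTPN and SparK models; patterns are illustrative.
print(list_models(pattern='itpn'))
print(list_models(pattern='spark'))

# Instantiate one of the listed models by name. The name below is a
# placeholder; replace it with an entry printed by list_models().
model = get_model('spark_sparse-resnet50_800e_in1k', pretrained=False)
```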

Provide examples of New Config and DeepSpeed/FSDP

We tested DeepSpeed and FSDP with MMEngine. The figures below report memory usage and training time for ViT-large, ViT-huge, and an 8B multi-modal model: the left figure shows memory usage and the right figure shows training time.

Test environment: 8×A100 (80 GB), PyTorch 2.0.0
[Figure: memory usage (left) and training time (right) for FSDP and DeepSpeed]
Remark: both FSDP and DeepSpeed were tested with default configurations and were not tuned; manually tuning the FSDP wrap policy can further reduce training time and memory usage.
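
To show how these strategies are switched on, here is a hedged sketch of a config fragment enabling DeepSpeed via MMEngine's FlexibleRunner, modeled on its strategy/optimizer-wrapper pattern; the specific field values are illustrative assumptions, so consult the shipped example configs for tested settings.

```python
# Sketch of a config fragment for DeepSpeed training through MMEngine's
# FlexibleRunner. Field values are illustrative assumptions.
runner_type = 'FlexibleRunner'

strategy = dict(
    type='DeepSpeedStrategy',
    fp16=dict(enabled=True, loss_scale=0, initial_scale_power=16),
    zero_optimization=dict(
        stage=3,                 # ZeRO stage-3 parameter sharding
        overlap_comm=True,
        contiguous_gradients=True),
)

# DeepSpeed manages the optimizer state, so the optimizer wrapper changes too.
optim_wrapper = dict(
    type='DeepSpeedOptimWrapper',
    optimizer=dict(type='AdamW', lr=1e-4, weight_decay=0.05))

# For FSDP, one would instead use something like:
# strategy = dict(type='FSDPStrategy')
```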
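
The release also ships examples of the New Config style: MMEngine's pure-Python config format, where base configs are inherited with read_base() and modules are referenced as imported classes rather than registry strings. The sketch below is a minimal illustration; the base-config paths and exact model fields are assumptions.

```python
from mmengine.config import read_base

# Inherit shared settings from base configs; the paths are illustrative.
with read_base():
    from .._base_.datasets.imagenet_bs32 import *
    from .._base_.schedules.imagenet_bs256 import *
    from .._base_.default_runtime import *

from mmpretrain.models import (CrossEntropyLoss, GlobalAveragePooling,
                               ImageClassifier, LinearClsHead, ResNet)

# Classes are referenced directly instead of by string type names.
model = dict(
    type=ImageClassifier,
    backbone=dict(type=ResNet, depth=50, num_stages=4, out_indices=(3,)),
    neck=dict(type=GlobalAveragePooling),
    head=dict(
        type=LinearClsHead,
        num_classes=1000,
        in_channels=2048,
        loss=dict(type=CrossEntropyLoss)))
```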

New Features

  • Transfer shape-bias tool from mmselfsup (#1658)
  • Download datasets using MIM & OpenDataLab (#1630)
  • Support New Configs (#1639, #1647, #1665)
  • Support Flickr30k Retrieval dataset (#1625)
  • Support SparK (#1531)
  • Support LLaVA (#1652)
  • Support Otter (#1651)
  • Support MiniGPT-4 (#1642)
  • Add support for VizWiz dataset (#1636)
  • Add support for VSR dataset (#1634)
  • Add InternImage Classification project (#1569)
  • Support OCR-VQA dataset (#1621)
  • Support OK-VQA dataset (#1615)
  • Support TextVQA dataset (#1569)
  • Support iTPN and HiViT (#1584)
  • Add retrieval mAP metric (#1552)
  • Support NoCaps dataset based on BLIP (#1582)
  • Add GQA dataset (#1585)

Improvements

  • Update fsdp vit-huge and vit-large config (#1675)
  • Support deepspeed with flexible runner (#1673)
  • Update Otter and LLaVA docs and configs (#1653)
  • Add image_only param of ScienceQA (#1613)
  • Support using "split" to specify the training/validation set (#1535)

Bug Fixes

  • Refactor _prepare_pos_embed in ViT (#1656, #1679)
  • Freeze pre norm in vision transformer (#1672)
  • Fix bug loading IN1k dataset (#1641)
  • Fix SAM bug (#1633)
  • Fix circular import error for new transform (#1609)
  • Update torchvision transform wrapper (#1595)
  • Set default out_type in CAM visualization (#1586)

Docs Update

New Contributors