✨✨Beyond LLaVA-HD: Diving into High-Resolution Large Multimodal Models

Apache-2.0 license

[📖 arXiv Paper] [📊 Dataset] [🏆 Models]

🔥 Update

[07/16] 🔥 The SliME strategy extends to video analysis (see Slime_video.md). Although the model has never been trained on video data, it can process up to 8 frames. On the Video-MME benchmark, it surpasses numerous 7B/8B baselines that were trained on video datasets.
[06/11] 🔥 SliME is coming! We release the paper, code, models, and data for SliME!
[06/11] 🔥 SliME-70B will be released soon.

👀 Contents

Install
Model
Preparation
Train
Evaluation
Examples
Citation

🔮 Install

Please follow the instructions below to install the required packages.

Clone this repository

git clone https://github.com/yfzhang114/SliME.git

Install Package

conda create -n slime python=3.10 -y
conda activate slime
cd SliME
pip install --upgrade pip  # enable PEP 660 support
pip install -e .

Install additional packages for training cases

pip install -e ".[train]"
pip install ninja
pip install datasets
pip install flash-attn --no-build-isolation

🔍 Model

We provide all our fully finetuned models on Stage 1/2 and 3 data for SliME:

Model	Base LLM	Vision Encoder	Finetuning Data	Finetuning schedule	Download
SliME-7B	Vicuna-7B-v1.5	CLIP-L	SharedGPT+SMR	full_ft	ckpt
SliME-8B	Llama-3-8B-Instruct	CLIP-L	SharedGPT+SMR	full_ft	ckpt
SliME-13B	Vicuna-13B-v1.5	CLIP-L	SharedGPT+SMR	full_ft	ckpt
SliME-70B	Llama-3-70B-Instruct	CLIP-L	SharedGPT+SMR	LoRA	ckpt

Here are the pretrained weights on Stage 1/2 data only:

Model	Base LLM	Vision Encoder	Pretrain Data	Pretrain schedule	Download
SliME-7B	Vicuna-7B-v1.5	CLIP-L	LLaVA-Pretrain	1e	ckpt
SliME-8B	Llama-3-8B-Instruct	CLIP-L	LLaVA-Pretrain	1e	ckpt
SliME-13B	Vicuna-13B-v1.5	CLIP-L	LLaVA-Pretrain	1e	ckpt
SliME-70B	Llama-3-70B-Instruct	CLIP-L	LLaVA-Pretrain	1e	ckpt

🔮 Preparation

Dataset

Please follow LLaVA and ShareGPT4V to prepare the corresponding images and data.

SMR data structure

data
├── arxivqa
│   └── images
├── DVQA
│   └── images
├── Geometry3K
│   └── 0-2400 dirs
├── ChartQA
│   └── train_images
├── GeoQA3
│   ├── image
│   └── json
├── mathvision
├── scienceqa
├── tabmwp
├── GeoQA3
│   ├── train
│   ├── test
│   └── val
├── ai2d
│   ├── abc_images
│   └── images
└── geoqa+
    └── images
You can find the pre-processing code at this URL. If you have any questions about file names or image paths, please refer to the pre-processing code.
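Before training, it can help to check that the folders from the tree above are in place. The following is a minimal sketch, not part of the repository: the function name `missing_dirs` and the choice of which directories to check are assumptions, and the list covers only a subset of the tree.

```python
from pathlib import Path

# Sub-directories copied from the SMR data tree (relative to the data root).
# This is an illustrative subset, not an official manifest.
EXPECTED_DIRS = [
    "arxivqa/images",
    "DVQA/images",
    "Geometry3K",
    "ChartQA/train_images",
    "GeoQA3/image",
    "GeoQA3/json",
    "mathvision",
    "scienceqa",
    "tabmwp",
    "ai2d/abc_images",
    "ai2d/images",
    "geoqa+/images",
]


def missing_dirs(data_root):
    """Return the expected sub-directories that do not exist under data_root."""
    root = Path(data_root)
    return [rel for rel in EXPECTED_DIRS if not (root / rel).is_dir()]


if __name__ == "__main__":
    # "data" is the root folder name used in the tree above.
    for rel in missing_dirs("data"):
        print(f"missing: {rel}")
```

Running the script from the directory that contains `data` prints one line per missing folder and nothing if all listed folders exist.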

Arxiv QA

Download images using this download url

python playground/data/process_arxivqa.py

DVQA

Download images using this url.

ChartQA

Clone this repo, then extract all the training images from ChartQA_Dataset/train/png into ChartQA.

Geometry3K

Download images using this url.

The image path in our json file will be os.path.join(f'Geometry3K/{i}', 'img_diagram.png'), where i is the index of the problem directory.
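As a small illustration of the path convention, the sketch below builds the relative image path for one problem. It assumes that `i` stands for the numbered problem directory (the data tree lists `0-2400 dirs`); the helper name `geometry3k_image_path` is hypothetical.

```python
import os


def geometry3k_image_path(i):
    """Relative image path for Geometry3K problem directory i (assumed numbering)."""
    return os.path.join(f"Geometry3K/{i}", "img_diagram.png")


# On POSIX systems this prints Geometry3K/0/img_diagram.png
print(geometry3k_image_path(0))
```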

GeoQA3

Download images using this url

Extract all the training images into GeoQA3/image.

MathVision

Download images using this url

Our data does not include images from the test-mini split.

ScienceQA

wget https://scienceqa.s3.us-west-1.amazonaws.com/images/train.zip
wget https://scienceqa.s3.us-west-1.amazonaws.com/images/val.zip
wget https://scienceqa.s3.us-west-1.amazonaws.com/images/test.zip

unzip -q train.zip
unzip -q val.zip
unzip -q test.zip

rm train.zip
rm val.zip
rm test.zip

Tabmwp

Download images using this url

TextbookQA

Download images using this url

AI2D

Download images using this url

GeoQA+

Download images using this url

📈 Train

SliME training consists of three stages: (1) training the global projector and attention adapter specifically; (2) training the local compression layer; and (3) training the full model.
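The staged schedule can be pictured as choosing which parameters are trainable in each stage. The sketch below is only an illustration under stated assumptions: the parameter-name prefixes are hypothetical and do not come from the code base, and it assumes that everything not listed for a stage is frozen.

```python
# Hypothetical parameter-name prefixes; the real module names may differ.
STAGE_TRAINABLE_PREFIXES = {
    1: ("global_projector.", "attention_adapter."),
    2: ("local_compression.",),
    3: ("",),  # the empty prefix matches every name: full-model training
}


def trainable_names(param_names, stage):
    """Return the parameter names that would be trained in the given stage."""
    prefixes = STAGE_TRAINABLE_PREFIXES[stage]
    # str.startswith accepts a tuple and returns True if any prefix matches.
    return [name for name in param_names if name.startswith(prefixes)]


# Made-up parameter names, used only to exercise the function.
example_names = [
    "global_projector.weight",
    "attention_adapter.query.weight",
    "local_compression.weight",
    "llm.layers.0.weight",
]

for stage in (1, 2, 3):
    print(stage, trainable_names(example_names, stage))
```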

SliME is trained on 8 A100 GPUs with 80GB memory. To train on fewer GPUs, you can reduce the per_device_train_batch_size and increase the gradient_accumulation_steps accordingly. Always keep the global batch size the same: per_device_train_batch_size x gradient_accumulation_steps x num_gpus.
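The batch-size rule above is simple arithmetic. The sketch below computes the gradient accumulation steps needed to keep the global batch size fixed; the helper name and the numbers are illustrative assumptions, not values taken from the training scripts.

```python
def grad_accum_steps(global_batch_size, per_device_batch_size, num_gpus):
    """Accumulation steps that keep the global batch size unchanged."""
    per_step = per_device_batch_size * num_gpus
    if global_batch_size % per_step != 0:
        raise ValueError(
            "global batch size must be divisible by "
            "per_device_batch_size * num_gpus"
        )
    return global_batch_size // per_step


# Illustrative reference setup: 16 per device x 1 accumulation step x 8 GPUs = 128.
reference_global_batch = 16 * 1 * 8

# Moving to 4 GPUs with 8 samples per device needs 4 accumulation steps.
print(grad_accum_steps(reference_global_batch, per_device_batch_size=8, num_gpus=4))  # 4
```

Check the resulting product against the values in the script you are editing before launching a run.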

Please make sure you download and organize the data following Preparation before training.

If you want to train and finetune SliME, please run the following commands for SliME-7B with image size 336:

bash scripts/vicuna/vicuna_7b_pt.sh
bash scripts/vicuna/vicuna_7b_sft.sh

or for SliME-8B with image size 336:

bash scripts/llama/llama3_8b_pt.sh
bash scripts/llama/llama3_8b_sft.sh

Because we reuse the pre-trained projector weights from SliME-7B, you can directly use the sft command for stage-3 instruction tuning by changing PROJECTOR_DIR:

bash scripts/llama/llama3_8b_sft.sh

Please find more training scripts in scripts/.

📈 Evaluation

We perform evaluation on several image-based benchmarks. Please see Evaluation for the details.

If you want to evaluate the model on image-based benchmarks, please use the scripts in scripts/MODEL_PATH/eval. For example, run the following command for TextVQA evaluation with SliME-7B:

bash scripts/llama/eval/textvqa.sh

Please find more evaluation scripts in scripts/MODEL_PATH.

The evaluation code and needed files can be found here.

👀 Examples

We provide some examples in this section. More examples can be found on our project page.

Hi-Resolution Understanding

Citation

If you find this repo useful for your research, please consider citing the paper

@article{zhang2024beyond,
  title={Beyond LLaVA-HD: Diving into High-Resolution Large Multimodal Models},
  author={Zhang, Yi-Fan and Wen, Qingsong and Fu, Chaoyou and Wang, Xue and Zhang, Zhang and Wang, Liang and Jin, Rong},
  journal={arXiv preprint arXiv:2406.08487},
  year={2024}
}

Acknowledgement

We would like to thank the following repos for their great work:

This work is built upon LLaVA.
This work utilizes LLMs from Vicuna and Llama3.

License

The data and checkpoints are intended and licensed for research use only. They are also restricted to uses that follow the license agreements of LLaVA, LLaMA, Vicuna and GPT-4. The dataset is CC BY NC 4.0 (allowing only non-commercial use), and models trained using the dataset should not be used outside of research purposes.
