✨✨Beyond LLaVA-HD: Diving into High-Resolution Large Multimodal Models

yfzhang114/SliME

🔥 Update

  • [07/16]🔥The SliME strategy demonstrates exceptional versatility, extending seamlessly to video analysis (See Slime_video.md). Remarkably, even though the model has never been specifically trained on video data, it is capable of processing up to 8 frames. In the Video-MME benchmark, the model surpasses numerous 7B/8B baselines that have undergone training on video datasets.
  • [06/11]🔥SliME is coming! We release the paper, code, models, and data for SliME!
  • [06/11]🔥SliME-70B will be released soon.

🔮 Install

Please follow the instructions below to install the required packages.

  1. Clone this repository
git clone https://github.com/yfzhang114/SliME.git
  1. Install Package
conda create -n slime python=3.10 -y
conda activate slime
cd SliME
pip install --upgrade pip  # enable PEP 660 support
pip install -e .
  1. Install additional packages for training cases
pip install -e ".[train]"
pip install ninja
pip install datasets
pip install flash-attn --no-build-isolation
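
After the install steps, a quick check can confirm that the training extras are visible to the active `slime` environment. The snippet below is a minimal sketch, not part of the repository: the helper name `missing_packages` is hypothetical, and the list of import names (including `torch`, which is assumed to be pulled in by `pip install -e .`) is an assumption to adjust as needed.

```python
import importlib.util

# Import names assumed for this check (not taken from the repository).
# Note that the flash-attn distribution is imported as `flash_attn`.
REQUIRED = ["torch", "datasets", "ninja", "flash_attn"]


def missing_packages(names):
    """Return the names that cannot be located in the active environment."""
    return [name for name in names if importlib.util.find_spec(name) is None]


if __name__ == "__main__":
    missing = missing_packages(REQUIRED)
    if missing:
        print("Missing packages:", ", ".join(missing))
    else:
        print("All listed packages are importable.")
```

The check only locates the packages without importing them, so it stays fast and does not require a GPU.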

🔍 Model

We provide all our models finetuned on Stage 1/2 and Stage 3 data for SliME:

| Model | Base LLM | Vision Encoder | Finetuning Data | Finetuning schedule | Download |
| SliME-7B | Vicuna-7B-v1.5 | CLIP-L | ShareGPT4V+SMR | full_ft | ckpt |
| SliME-8B | Llama-3-8B-Instruct | CLIP-L | ShareGPT4V+SMR | full_ft | ckpt |
| SliME-13B | Vicuna-13B-v1.5 | CLIP-L | ShareGPT4V+SMR | full_ft | ckpt |
| SliME-70B | Llama-3-70B-Instruct | CLIP-L | ShareGPT4V+SMR | LoRA | ckpt |

Here are the pretrained weights on Stage 1/2 data only:

| Model | Base LLM | Vision Encoder | Pretrain Data | Pretraining schedule | Download |
| SliME-7B | Vicuna-7B-v1.5 | CLIP-L | LLaVA-Pretrain | 1e | ckpt |
| SliME-8B | Llama-3-8B-Instruct | CLIP-L | LLaVA-Pretrain | 1e | ckpt |
| SliME-13B | Vicuna-13B-v1.5 | CLIP-L | LLaVA-Pretrain | 1e | ckpt |
| SliME-70B | Llama-3-70B-Instruct | CLIP-L | LLaVA-Pretrain | 1e | ckpt |

🔮 Preparation

Dataset

Please follow LLaVA and ShareGPT4V to prepare the corresponding images and data.

SMR data structure

data
├── arxivqa
│   └── images
├── DVQA
│   └── images
├── Geometry3K
│   └── 0-2400 dirs
├── ChartQA
│   └── train_images
├── GeoQA3
│   ├── image
│   └── json
├── mathvision
├── scienceqa
├── tabmwp
├── GeoQA3
│   ├── train
│   ├── test
│   └── val
├── ai2d
│   ├── abc_images
│   └── images
└── geoqa+
    └── images
You can find the pre-processing code at this URL. If you have any questions about file names or image paths, please refer to the pre-processing code.
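
Before training, it can help to confirm that the `data` folder matches the tree above. The following is a minimal sketch rather than part of the repository's pre-processing code: the helper `missing_dirs` is hypothetical, the directory list is copied from the tree, and the `train`/`test`/`val` sub-directories are left out because the tree lists them under a second `GeoQA3` entry.

```python
from pathlib import Path

# Sub-directories copied from the SMR data tree; adjust if your layout differs.
EXPECTED_DIRS = [
    "arxivqa/images",
    "DVQA/images",
    "Geometry3K",
    "ChartQA/train_images",
    "GeoQA3/image",
    "GeoQA3/json",
    "mathvision",
    "scienceqa",
    "tabmwp",
    "ai2d/abc_images",
    "ai2d/images",
    "geoqa+/images",
]


def missing_dirs(root, expected=EXPECTED_DIRS):
    """Return the expected sub-directories that do not exist under `root`."""
    root = Path(root)
    return [rel for rel in expected if not (root / rel).is_dir()]


if __name__ == "__main__":
    # Assumes the script is run from the directory that contains `data`.
    for rel in missing_dirs("data"):
        print("missing:", rel)
```

An empty output means every listed directory was found; it does not verify that the images inside are complete.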

  1. Arxiv QA

Download images using this download url
python playground/data/process_arxivqa.py
  1. DVQA

Download images using this url.

  1. ChartQA

Clone this repo

Extract all the training images in ChartQA_Dataset/train/png into ChartQA/train_images.

  1. Geometry3K

Download images using this url.

The image path in our json file will be os.path.join(f'Geometry3K/{i}', 'img_diagram.png'), where i is the index of the problem directory.

  1. GeoQA3

Download images using this url

Extract all the training images into GeoQA3/image.

  1. MathVision

Download images using this url

Our data automatically excludes the images from the test-mini split.

  1. ScienceQA
wget https://scienceqa.s3.us-west-1.amazonaws.com/images/train.zip
wget https://scienceqa.s3.us-west-1.amazonaws.com/images/val.zip
wget https://scienceqa.s3.us-west-1.amazonaws.com/images/test.zip

unzip -q train.zip
unzip -q val.zip
unzip -q test.zip

rm train.zip
rm val.zip
rm test.zip
  1. Tabmwp

Download images using this url

  1. TextbookQA

Download images using this url

  1. AI2D

Download images using this url

  1. GeoQA+

Download images using this url
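
The Geometry3K item above describes how image paths are built from a per-problem directory. As a minimal sketch of that rule, the hypothetical helpers below construct the relative path for a given directory index and report which diagrams are absent under a data root; they are illustrations, not functions from the pre-processing code.

```python
import os


def geometry3k_image_path(index):
    """Relative image path for the Geometry3K problem directory `index`."""
    return os.path.join(f"Geometry3K/{index}", "img_diagram.png")


def missing_geometry3k_images(data_root, indices):
    """Return the indices whose diagram file is not present under `data_root`."""
    return [
        i
        for i in indices
        if not os.path.isfile(os.path.join(data_root, geometry3k_image_path(i)))
    ]


if __name__ == "__main__":
    print(geometry3k_image_path(0))
```

For a full check you might call `missing_geometry3k_images("data", range(2401))`, assuming the "0-2400 dirs" entry in the data tree means directories 0 through 2400 inclusive.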

📈 Train

SliME training consists of three stages: (1) training the global projector and attention adapter specifically; (2) training the local compression layer; and (3) training the full model.

SliME is trained on 8 A100 GPUs with 80GB memory. To train on fewer GPUs, you can reduce the per_device_train_batch_size and increase the gradient_accumulation_steps accordingly. Always keep the global batch size the same: per_device_train_batch_size x gradient_accumulation_steps x num_gpus.
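
The batch-size rule above can be turned into a small calculation. This is a minimal sketch with a hypothetical helper name; the numbers in the example are illustrative only and are not the values used in the SliME scripts, so read the real ones from the script you intend to run.

```python
def gradient_accumulation_steps(global_batch_size, per_device_batch_size, num_gpus):
    """Accumulation steps needed to keep the global batch size unchanged.

    global batch size = per_device_train_batch_size
                        x gradient_accumulation_steps
                        x num_gpus
    """
    if min(global_batch_size, per_device_batch_size, num_gpus) <= 0:
        raise ValueError("all arguments must be positive integers")
    samples_per_step = per_device_batch_size * num_gpus
    if global_batch_size % samples_per_step != 0:
        raise ValueError(
            f"global batch size {global_batch_size} is not divisible by "
            f"{per_device_batch_size} x {num_gpus}"
        )
    return global_batch_size // samples_per_step


if __name__ == "__main__":
    # Illustrative reference setup: 16 per device, 1 accumulation step, 8 GPUs.
    reference_global_batch = 16 * 1 * 8
    # Moving to 4 GPUs with the same per-device batch size:
    print(gradient_accumulation_steps(reference_global_batch, 16, 4))
```

Raising an error for non-divisible combinations is a deliberate choice here: silently rounding would change the global batch size, which is what the rule is meant to prevent.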

Please make sure you download and organize the data following Preparation before training.

If you want to train and finetune SliME, please run the following commands for SliME-7B with image size 336:

bash scripts/vicuna/vicuna_7b_pt.sh
bash scripts/vicuna/vicuna_7b_sft.sh

or for SliME-8B with image size 336:

bash scripts/llama/llama3_8b_pt.sh
bash scripts/llama/llama3_8b_sft.sh

Because we reuse the pre-trained projector weights from SliME-7B, you can run stage-3 instruction tuning directly with the SFT script after changing PROJECTOR_DIR:

bash scripts/llama/llama3_8b_sft.sh

Please find more training scripts in scripts/.
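
If you prefer to launch the stages from Python, the sketch below wraps the script paths listed above. It is a hypothetical convenience wrapper, not something shipped in the repository: the names `STAGE_SCRIPTS`, `build_commands`, and `run_all` are assumptions, the commands are assumed to run from the repository root, and PROJECTOR_DIR still has to be set inside the SFT script as described above.

```python
import subprocess

# Script paths as listed in this README; the dictionary itself is illustrative.
STAGE_SCRIPTS = {
    "SliME-7B": ["scripts/vicuna/vicuna_7b_pt.sh", "scripts/vicuna/vicuna_7b_sft.sh"],
    "SliME-8B": ["scripts/llama/llama3_8b_pt.sh", "scripts/llama/llama3_8b_sft.sh"],
}


def build_commands(model, skip_pretrain=False):
    """Return the bash commands for `model`, optionally skipping the pt script."""
    scripts = STAGE_SCRIPTS[model]
    if skip_pretrain:
        scripts = scripts[1:]
    return [["bash", script] for script in scripts]


def run_all(commands, dry_run=True):
    """Print each command and, unless dry_run is set, execute it in order."""
    for cmd in commands:
        print(" ".join(cmd))
        if not dry_run:
            # check=True stops the sequence if a stage exits with an error.
            subprocess.run(cmd, check=True)


if __name__ == "__main__":
    run_all(build_commands("SliME-8B", skip_pretrain=True))
```

With `dry_run=True` (the default) nothing is executed, so the wrapper can be used to preview the command sequence before committing GPU time.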

📈 Evaluation

We perform evaluation on several image-based benchmarks. Please see Evaluation for the details.

If you want to evaluate the model on image-based benchmarks, please use the scripts in scripts/MODEL_PATH/eval. For example, run the following command for TextVQA evaluation with SliME-7B:

bash scripts/llama/eval/textvqa.sh

Please find more evaluation scripts in scripts/MODEL_PATH/eval.

The evaluation code and needed files can be found here.

👀 Examples

We provide some examples in this section. More examples can be found on our project page.

High-Resolution Understanding

Citation

If you find this repo useful for your research, please consider citing the paper:

@article{zhang2024beyond,
  title={Beyond LLaVA-HD: Diving into High-Resolution Large Multimodal Models},
  author={Zhang, Yi-Fan and Wen, Qingsong and Fu, Chaoyou and Wang, Xue and Zhang, Zhang and Wang, Liang and Jin, Rong},
  journal={arXiv preprint arXiv:2406.08487},
  year={2024}
}

Acknowledgement

We would like to thank the following repos for their great work:

  • This work is built upon LLaVA.
  • This work utilizes LLMs from Vicuna and Llama3.

License

The data and checkpoints are intended and licensed for research use only. They are also restricted to uses that follow the license agreements of LLaVA, LLaMA, Vicuna, and GPT-4. The dataset is licensed under CC BY NC 4.0 (allowing only non-commercial use), and models trained using the dataset should not be used outside of research purposes.
