GitHub - CircleRadon/TokenPacker: The code for "TokenPacker: Efficient Visual Projector for Multimodal LLM".

Comparisons with existing methods 💡

Updates 📌

[2024/7/25] We released the checkpoints; see the Model Zoo below.
[2024/7/3] We released the TokenPacker paper on arXiv.
[2024/7/3] We released the training and inference codes.

What is TokenPacker 👀

TokenPacker is a novel visual projector that adopts a coarse-to-fine scheme, injecting enriched characteristics to generate condensed visual tokens. With TokenPacker, we can compress the visual tokens by 75%∼89% while achieving comparable or even better performance across diverse benchmarks with significantly higher efficiency.

Comparisons with various projectors

High-Resolution Image Understanding with TokenPacker 🔬

To support efficient high-resolution image understanding, we further develop an effective image cropping method, TokenPacker-HD.

Install 🛠️

Clone this repository and navigate to the TokenPacker folder

git clone https://github.com/CircleRadon/TokenPacker.git
cd TokenPacker

Install packages

conda create -n tokenpacker python=3.10 -y
conda activate tokenpacker
pip install --upgrade pip  # enable PEP 660 support
pip install -e .

Install additional packages for training

pip install -e ".[train]"
pip install flash-attn --no-build-isolation
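
After installation, a quick import check can confirm that the environment is usable. The sketch below relies only on the Python standard library. The module names `llava` and `flash_attn` are assumptions based on the repository layout and the `flash-attn` package name, and `check_packages` is a hypothetical helper, not part of this repository.

```python
from importlib.util import find_spec


def check_packages(names):
    """Return {module_name: bool} telling whether each top-level module can be located."""
    return {name: find_spec(name) is not None for name in names}


if __name__ == "__main__":
    # "llava" and "flash_attn" are assumed import names, not confirmed by this README.
    for name, found in check_packages(["torch", "llava", "flash_attn"]).items():
        print(f"{name}: {'ok' if found else 'missing'}")
```

A "missing" entry for `flash_attn` is expected if only the base install (without the training extras) was performed.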

Training 🚀

LLaVA-TokenPacker

Dataset

To make a fair comparison, we use the same training data as LLaVA-1.5, i.e., the 558K image-text pairs (LCS-558K) for stage 1 and Mix665k for stage 2.

Training

Stage1: Image-Text Alignment Pre-training

bash scripts/v1_5/pretrain.sh

Stage2: Visual Instruction Tuning

bash scripts/v1_5/finetune.sh

Note: Use --scale_factor to control the compression ratio; supported values are 2, 3, and 4.
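
As a rough illustration of what `--scale_factor` implies: the Model Zoo lists 144 tokens at a 1/4 compression ratio for a 336x336 input, which suggests 576 tokens before compression. The listed ratios 1/4, 1/9, and 1/16 line up with scale factors 2, 3, and 4 if the ratio is 1/scale_factor². That mapping is inferred from the table rather than stated explicitly, and the function names below are hypothetical.

```python
SUPPORTED_SCALE_FACTORS = (2, 3, 4)


def compressed_tokens(base_tokens: int, scale_factor: int) -> int:
    """Token count after compression, ASSUMING a ratio of 1 / scale_factor**2."""
    if scale_factor not in SUPPORTED_SCALE_FACTORS:
        raise ValueError(f"scale_factor must be one of {SUPPORTED_SCALE_FACTORS}")
    return base_tokens // (scale_factor ** 2)


def reduction_percent(scale_factor: int) -> float:
    """Percentage of visual tokens removed under the same assumed ratio."""
    return 100.0 * (1.0 - 1.0 / scale_factor ** 2)


if __name__ == "__main__":
    base = 144 * 4  # 576 tokens, inferred from the 1/4 row of the Model Zoo
    for s in SUPPORTED_SCALE_FACTORS:
        print(s, compressed_tokens(base, s), f"{reduction_percent(s):.1f}%")
```

Under this assumption, scale factors 2 and 3 remove 75% and about 89% of the tokens, which matches the 75%∼89% range quoted in the overview.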

LLaVA-TokenPacker-HD

Dataset

To obtain competitive high-resolution performance, we use the 2.7M data organized by Mini-Gemini, i.e., 1.2M for stage 1 and 1.5M for stage 2.

Training

Stage1: Image-Text Alignment Pre-training

bash scripts/v1_5/pretrain_hd.sh

Stage2: Visual Instruction Tuning

bash scripts/v1_5/finetune_hd.sh

Note:

Use --scale_factor to control the compression ratio; supported values are 2, 3, and 4.
Use --patch_num to control the maximum number of patches an image is divided into; supported values are 9, 16, and 25.

Experiments

Model Zoo

Model	Max Resolution	Compression Ratio	Token Number	Max Patch Number	Training Data	Download
TokenPacker-7b	336x336	1/4	144	-	558K+665K	checkpoints
TokenPacker-13b	336x336	1/4	144	-	558K+665K	checkpoints
TokenPacker-HD-7b	1088x1088	1/4	~954	9	1.2M+1.5M	checkpoints
TokenPacker-HD-13b	1088x1088	1/4	~954	9	1.2M+1.5M	checkpoints
TokenPacker-HD-13b	1344x1344	1/4	~1393	16	1.2M+1.5M	checkpoints
TokenPacker-HD-13b	1344x1344	1/9	~619	16	1.2M+1.5M	checkpoints
TokenPacker-HD-13b	1344x1344	1/16	~347	16	1.2M+1.5M	checkpoints

Note:

The token number for TokenPacker-HD is the average over all training and test data.
The 558K+665K training data follows LLaVA-1.5; the 1.2M+1.5M training data follows Mini-Gemini.
All models use Vicuna-7b/13b as the base LLM.
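
The three 1344x1344 rows can be cross-checked with simple arithmetic. If the average token count scales linearly with the compression ratio, the ~1393 tokens at 1/4 should shrink to roughly 619 at 1/9 and 348 at 1/16, close to the ~619 and ~347 listed. The linear-scaling assumption and the helper name `rescale_tokens` are illustrative only; the table values themselves are approximate averages.

```python
from fractions import Fraction


def rescale_tokens(tokens: int, from_ratio: Fraction, to_ratio: Fraction) -> int:
    """Estimate the token count at to_ratio from a count measured at from_ratio.

    ASSUMES the count is proportional to the compression ratio.
    """
    return round(tokens * to_ratio / from_ratio)


if __name__ == "__main__":
    measured = 1393  # average tokens at ratio 1/4 (Model Zoo, 1344x1344)
    for ratio in (Fraction(1, 9), Fraction(1, 16)):
        print(ratio, rescale_tokens(measured, Fraction(1, 4), ratio))
```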

Visualization

We provide some visual examples.

High-resolution image understanding.

TODO List 📝

Release the training and inference codes.
Release all checkpoints.

Acknowledgement 💌

LLaVA-v1.5: the codebase we built upon.
Mini-Gemini: the organized data we used to train the high-resolution method.

BibTeX 🖊️

@misc{TokenPacker,
  title={TokenPacker: Efficient Visual Projector for Multimodal LLM},
  author={Wentong Li and Yuqian Yuan and Jian Liu and Dongqi Tang and Song Wang and Jianke Zhu and Lei Zhang},
  year={2024},
  eprint={2407.02392},
  archivePrefix={arXiv},
  primaryClass={cs.CV}
}

