
pixelgpt


Harnessing visual texts represents a burgeoning frontier in the evolution of language modeling. In this paper, we introduce a novel pre-training framework for a suite of pixel-based autoregressive language models, pre-trained on a corpus of over 400 million documents rendered as RGB images. Our approach is characterized by a dual-modality training regimen, engaging both visual data through next patch prediction with a regression head and textual data via next token prediction with a classification head. This study is particularly focused on investigating the synergistic interplay between visual and textual modalities of language. Our comprehensive evaluation across a diverse array of benchmarks reveals that the confluence of visual and textual data substantially augments the efficacy of pixel-based language models. Notably, our findings show that a unidirectional pixel-based model, devoid of textual data during training, can match the performance levels of advanced bidirectional pixel-based models on various language understanding benchmarks. This work highlights the considerable untapped potential of integrating visual and textual information for language modeling purposes. We will release our code, data, and checkpoints to inspire further research advancement.
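
To make the dual-modality objective concrete, the snippet below is a minimal, illustrative PyTorch sketch of a shared autoregressive backbone with a classification head for next-token prediction and a regression head for next-patch prediction. All module names, dimensions, and the choice of mean-squared error for the patch head are assumptions made for illustration; this is not the repository's actual implementation.

import torch
import torch.nn as nn
import torch.nn.functional as F

class DualHeadSketch(nn.Module):
    # Illustrative only: one causal Transformer backbone shared by two output heads,
    # mirroring the dual-modality training regimen described above.
    def __init__(self, vocab_size=50257, patch_dim=16 * 16 * 3, d_model=768,
                 n_head=12, n_layer=12, max_len=1024):
        super().__init__()
        self.token_emb = nn.Embedding(vocab_size, d_model)    # text tokens -> embeddings
        self.patch_proj = nn.Linear(patch_dim, d_model)       # flattened RGB patches -> embeddings
        self.pos_emb = nn.Embedding(max_len, d_model)
        layer = nn.TransformerEncoderLayer(d_model, n_head, 4 * d_model, batch_first=True)
        self.backbone = nn.TransformerEncoder(layer, n_layer)
        self.cls_head = nn.Linear(d_model, vocab_size)        # classification head: next token
        self.reg_head = nn.Linear(d_model, patch_dim)         # regression head: next patch

    def _encode(self, x):
        # add positions and run the shared backbone with a causal (autoregressive) mask
        x = x + self.pos_emb(torch.arange(x.size(1), device=x.device))
        causal = torch.triu(torch.full((x.size(1), x.size(1)), float("-inf"), device=x.device), 1)
        return self.backbone(x, mask=causal)

    def text_loss(self, token_ids):
        # next-token prediction with the classification head (cross-entropy)
        h = self._encode(self.token_emb(token_ids[:, :-1]))
        logits = self.cls_head(h)
        return F.cross_entropy(logits.reshape(-1, logits.size(-1)), token_ids[:, 1:].reshape(-1))

    def pixel_loss(self, patches):
        # next-patch prediction with the regression head (MSE here is an assumed choice)
        h = self._encode(self.patch_proj(patches[:, :-1]))
        return F.mse_loss(self.reg_head(h), patches[:, 1:])

The key point of the sketch is that both objectives drive the same backbone weights; how the two losses are mixed during pre-training follows the paper, not this example.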

Requirements

To run the code, first install the required dependencies:

bash run_requirements.sh

Fine-tuning Data

We mainly fine-tune PixelGPT on the rendered GLUE and XNLI datasets. The rendered versions of these datasets are released at baidu/rendered_GLUE and baidu/rendered_xnli. Before fine-tuning, download the required dataset from HuggingFace, save it locally, and then extract it:

# Extract rendered GLUE
tar -xvf rendered_glue.tar

# Extract rendered XNLI
tar -xvf rendered_xnli.tar

For the rendered GLUE dataset, the extracted files contain multiple tasks. Each task has a corresponding training set, validation set, and test set. Note that for the MNLI task, both the validation and test sets contain matched and mismatched versions. You will need to assign the local paths of these task datasets to the --train_file, --validation_file, and --test_file parameters in the fine-tuning script. For the rendered XNLI dataset, assign the local dataset path to the --data_file_dir parameter in the corresponding fine-tuning script.
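
If you prefer to script this step, the following is a minimal sketch using the huggingface_hub Python API. The local directory names are arbitrary choices, and the exact archive layout inside the Hugging Face repositories is assumed rather than documented here.

import glob
import os
import tarfile

from huggingface_hub import snapshot_download

# Fetch the rendered datasets from the Hugging Face Hub.
snapshot_download(repo_id="baidu/rendered_GLUE", repo_type="dataset", local_dir="data/rendered_glue")
snapshot_download(repo_id="baidu/rendered_xnli", repo_type="dataset", local_dir="data/rendered_xnli")

# Extract any downloaded tar archives next to where they were saved.
for tar_path in glob.glob("data/**/*.tar", recursive=True):
    with tarfile.open(tar_path) as tf:
        tf.extractall(os.path.dirname(tar_path))

# The extracted per-task files are then passed to the fine-tuning scripts via
# --train_file / --validation_file / --test_file (GLUE) or --data_file_dir (XNLI).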

Pre-trained Models

We pre-trained PixelGPT together with MonoGPT and DualGPT. We release the checkpoints used in our experiments, which can be downloaded from baidu/PixelGPT, baidu/MonoGPT, and baidu/DualGPT. Before running the fine-tuning scripts below, download the corresponding pre-trained model from the model repositories above and place the files in the pre-trained model directory, e.g. pretrained_models/PixelGPT.
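
The checkpoints can likewise be fetched programmatically. This short sketch assumes the directory layout used by the commands below (pretrained_models/<ModelName>):

from huggingface_hub import snapshot_download

# Download each released checkpoint into the directory expected by the fine-tuning scripts.
for name in ("PixelGPT", "MonoGPT", "DualGPT"):
    snapshot_download(repo_id=f"baidu/{name}", local_dir=f"pretrained_models/{name}")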

Fine-tuning

Our main fine-tuning experiments were performed on rendered GLUE and XNLI. The scripts to run the experiments are given below.

GLUE

Unless otherwise specified, we take the MNLI dataset as an example.

PixelGPT

bash run/pixel_gpt/ft_pixel_gpt_mnli.sh pretrained_models/PixelGPT

MonoGPT

# Text-only Fine-tuning
bash run/mono_gpt/ft_mono_gpt_mnli_text.sh pretrained_models/MonoGPT

# Pixel-only Fine-tuning
bash run/mono_gpt/ft_mono_gpt_mnli_pixel.sh pretrained_models/MonoGPT

# Pair-modality Fine-tuning
bash run/mono_gpt/ft_mono_gpt_mnli_pair.sh pretrained_models/MonoGPT

DualGPT

# Text-only Fine-tuning
bash run/dual_gpt/ft_dual_gpt_mnli_text.sh pretrained_models/DualGPT

# Pixel-only Fine-tuning
bash run/dual_gpt/ft_dual_gpt_mnli_pixel.sh pretrained_models/DualGPT

# Pair-modality Fine-tuning
bash run/dual_gpt/ft_dual_gpt_mnli_pair.sh pretrained_models/DualGPT

XNLI

Our evaluation of rendered XNLI is performed in two distinct scenarios: (1) Translate-train-all, where the model is fine-tuned on a blend of the original English data and machine-translated data from the 14 other languages, to assess the model's multilingual understanding; and (2) Cross-lingual Transfer, where fine-tuning is conducted solely on English data, with multilingual test sets employed to evaluate the model's transferability across languages.

Translate-train-all

PixelGPT

bash run/cross_lingual/xnli/train_all/pixel_gpt/ft_pixel_gpt_xnli.sh pretrained_models/PixelGPT

MonoGPT

# Text-only Fine-tuning
bash run/cross_lingual/xnli/train_all/mono_gpt/ft_mono_gpt_xnli_text.sh pretrained_models/MonoGPT

# Pixel-only Fine-tuning
bash run/cross_lingual/xnli/train_all/mono_gpt/ft_mono_gpt_xnli_image.sh pretrained_models/MonoGPT

# Pair-modality Fine-tuning
bash run/cross_lingual/xnli/train_all/mono_gpt/ft_mono_gpt_xnli_pair.sh pretrained_models/MonoGPT

DualGPT

# Text-only Fine-tuning
bash run/cross_lingual/xnli/train_all/dual_gpt/ft_dual_gpt_xnli_text.sh pretrained_models/DualGPT

# Pixel-only Fine-tuning
bash run/cross_lingual/xnli/train_all/dual_gpt/ft_dual_gpt_xnli_image.sh pretrained_models/DualGPT

# Pair-modality Fine-tuning
bash run/cross_lingual/xnli/train_all/dual_gpt/ft_dual_gpt_xnli_pair.sh pretrained_models/DualGPT

Cross-lingual Transfer

PixelGPT

bash run/cross_lingual/xnli/train_en/pixel_gpt/ft_pixel_gpt_xnli.sh pretrained_models/PixelGPT

MonoGPT

# Text-only Fine-tuning
bash run/cross_lingual/xnli/train_en/mono_gpt/ft_mono_gpt_xnli_text.sh pretrained_models/MonoGPT

# Pixel-only Fine-tuning
bash run/cross_lingual/xnli/train_en/mono_gpt/ft_mono_gpt_xnli_image.sh pretrained_models/MonoGPT

# Pair-modality Fine-tuning
bash run/cross_lingual/xnli/train_en/mono_gpt/ft_mono_gpt_xnli_pair.sh pretrained_models/MonoGPT

DualGPT

# Text-only Fine-tuning
bash run/cross_lingual/xnli/train_en/dual_gpt/ft_dual_gpt_xnli_text.sh pretrained_models/DualGPT

# Pixel-only Fine-tuning
bash run/cross_lingual/xnli/train_en/dual_gpt/ft_dual_gpt_xnli_image.sh pretrained_models/DualGPT

# Pair-modality Fine-tuning
bash run/cross_lingual/xnli/train_en/dual_gpt/ft_dual_gpt_xnli_pair.sh pretrained_models/DualGPT

Citation

For attribution in academic contexts, please cite this work as:

@article{chai2024dual,
  title={Dual Modalities of Text: Visual and Textual Generative Pre-training},
  author={Chai, Yekun and Liu, Qingyi and Xiao, Jingwu and Wang, Shuohuan and Sun, Yu and Wu, Hua},
  journal={arXiv preprint arXiv:2404.10710},
  year={2024}
}
