# Add image-text-to-text task guide #31777

Merged on Jul 19, 2024 (22 commits).
Commits:

- `61fc59b` Add image-text-to-text task page (merveenoyan, Jul 3, 2024)
- `c282928` Update docs/source/en/tasks/image_text_to_text.md (merveenoyan, Jul 3, 2024)
- `9c28150` Update docs/source/en/tasks/image_text_to_text.md (merveenoyan, Jul 3, 2024)
- `ed0ce47` Update docs/source/en/tasks/image_text_to_text.md (merveenoyan, Jul 3, 2024)
- `f0227c0` Update docs/source/en/tasks/image_text_to_text.md (merveenoyan, Jul 3, 2024)
- `ba872f8` Update docs/source/en/tasks/image_text_to_text.md (merveenoyan, Jul 3, 2024)
- `78a4ee6` Update docs/source/en/tasks/image_text_to_text.md (merveenoyan, Jul 3, 2024)
- `0755adc` Update docs/source/en/tasks/image_text_to_text.md (merveenoyan, Jul 3, 2024)
- `91a6ab3` Update docs/source/en/tasks/image_text_to_text.md (merveenoyan, Jul 3, 2024)
- `43ab484` Update docs/source/en/tasks/image_text_to_text.md (merveenoyan, Jul 3, 2024)
- `5f8b08b` Update docs/source/en/tasks/image_text_to_text.md (merveenoyan, Jul 3, 2024)
- `6946805` Update docs/source/en/tasks/image_text_to_text.md (merveenoyan, Jul 3, 2024)
- `b4d1028` Address comments (merveenoyan, Jul 3, 2024)
- `755374e` Fix heading (merveenoyan, Jul 4, 2024)
- `d2e4dd6` Update docs/source/en/tasks/image_text_to_text.md (merveenoyan, Jul 10, 2024)
- `5be2b7f` Update docs/source/en/tasks/image_text_to_text.md (merveenoyan, Jul 10, 2024)
- `f862630` Update docs/source/en/tasks/image_text_to_text.md (merveenoyan, Jul 10, 2024)
- `0db46a9` Update docs/source/en/tasks/image_text_to_text.md (merveenoyan, Jul 10, 2024)
- `3a9d5f6` Update docs/source/en/tasks/image_text_to_text.md (merveenoyan, Jul 10, 2024)
- `5652ffc` Update docs/source/en/tasks/image_text_to_text.md (merveenoyan, Jul 10, 2024)
- `e834f9f` Address comments (merveenoyan, Jul 10, 2024)
- `e81a549` Update image_text_to_text.md (merveenoyan, Jul 19, 2024)
**docs/source/en/_toctree.yml** (2 changes: 2 additions & 0 deletions)

```diff
@@ -92,6 +92,8 @@
       title: Visual Question Answering
     - local: tasks/text-to-speech
       title: Text to speech
+    - local: tasks/image_text_to_text
+      title: Image-text-to-text
     title: Multimodal
   - isExpanded: false
     sections:
```
**docs/source/en/tasks/image_text_to_text.md** (new file: 227 additions & 0 deletions)
<!--Copyright 2024 The HuggingFace Team. All rights reserved.

Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
the License. You may obtain a copy of the License at

http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on
an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
specific language governing permissions and limitations under the License.

⚠️ Note that this file is in Markdown but contains specific syntax for our doc-builder (similar to MDX) that may not be
rendered properly in your Markdown viewer.

-->

# Image-text-to-text

[[open-in-colab]]

Image-text-to-text models, also known as vision language models (VLMs), are language models that take an image input. These models can tackle various tasks, from visual question answering to image segmentation. This task shares many similarities with image-to-text, with some overlapping use cases like image captioning. Image-to-text models only take image inputs and often accomplish a specific task, whereas VLMs take open-ended text and image inputs and are more generalist models.

In this guide, we provide a brief overview of VLMs and show how to use them with Transformers for inference.

To begin with, there are multiple types of VLMs:
- base models used for fine-tuning
- chat fine-tuned models for conversation
- instruction fine-tuned models

This guide focuses on inference with an instruction-tuned model.

Let's begin by installing the dependencies.

```bash
pip install -q transformers accelerate flash_attn
```

Let's initialize the model and the processor.

```python
from transformers import AutoProcessor, Idefics2ForConditionalGeneration
import torch

device = torch.device("cuda")
model = Idefics2ForConditionalGeneration.from_pretrained(
    "HuggingFaceM4/idefics2-8b",
    torch_dtype=torch.bfloat16,
    _attn_implementation="flash_attention_2",
).to(device)

processor = AutoProcessor.from_pretrained("HuggingFaceM4/idefics2-8b")
```
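
If FlashAttention 2 isn't available in your environment, a minimal variant of the same load (a sketch on our side, using the standard `attn_implementation` argument) falls back to PyTorch's SDPA attention:

```python
# Fallback load without FlashAttention, using scaled-dot-product attention instead
model = Idefics2ForConditionalGeneration.from_pretrained(
    "HuggingFaceM4/idefics2-8b",
    torch_dtype=torch.bfloat16,
    attn_implementation="sdpa",
).to(device)
```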

This model has a [chat template](./chat_templating) that formats message-style inputs into the prompt the model expects. The model can also accept multiple images as input in a single conversation or message. We will now prepare the inputs.
> **Collaborator:** This isn't quite right, we don't need to use a chat template for the model inputs. It's just useful to correctly format the prompt in the case of message-style inputs.
>
> **@merveenoyan (Contributor, author):** You likely know better, but I thought that when fine-tuning, these chat templates are included in the fine-tuning data, so it is required to use them, no? E.g. the Mistral one has `<INST> </INST>`.
>
> **Collaborator:** It really depends; technically the data could already be formatted as the chat string. It just happens that the message format is commonly used. There's no reason you can't pass a string directly to the tokenizer and model.
>
> **@merveenoyan:** I think I confused this with prompt templates.


The image inputs look like the following.

<div class="flex justify-center">
<img src="https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/cats.png" alt="Two cats sitting on a net"/>
</div>

<div class="flex justify-center">
<img src="https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/bee.jpg" alt="A bee on a pink flower"/>
</div>


```python
from PIL import Image
import requests

img_urls = ["https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/cats.png",
            "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/bee.jpg"]
images = [Image.open(requests.get(img_urls[0], stream=True).raw),
          Image.open(requests.get(img_urls[1], stream=True).raw)]
```

Below is an example of the chat message format. We can feed the previous conversation turns and the latest user message as input by appending the new message at the end of the conversation.


```python
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image"},
            {"type": "text", "text": "What do we see in this image?"},
        ]
    },
    {
        "role": "assistant",
        "content": [
            {"type": "text", "text": "In this image we can see two cats on the nets."},
        ]
    },
    {
        "role": "user",
        "content": [
            {"type": "image"},
            {"type": "text", "text": "And how about this image?"},
        ]
    },
]
```

We will now call the processor's [`~ProcessorMixin.apply_chat_template`] method to format the messages into a prompt, then pass the prompt to the processor along with the image inputs.

```python
prompt = processor.apply_chat_template(messages, add_generation_prompt=True)
inputs = processor(text=prompt, images=[images[0], images[1]], return_tensors="pt")
inputs = {k: v.to(device) for k, v in inputs.items()}
```
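
To see what the template actually produced (tying back to the review discussion above), you can inspect the rendered prompt; it is just a plain string.

```python
# The chat template is only a convenient way to build this prompt string;
# you could also construct it by hand.
print(prompt)
```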

We can now pass the preprocessed inputs to the model.

```python
generated_ids = model.generate(**inputs, max_new_tokens=500)
generated_texts = processor.batch_decode(generated_ids, skip_special_tokens=True)

print(generated_texts)
## ['User: What do we see in this image? \nAssistant: In this image we can see two cats on the nets. \nUser: And how about this image? \nAssistant: In this image we can see flowers, plants and insect.']
```
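
Note that the decoded text above contains the whole conversation, prompt included. If you only want the newly generated reply, a small sketch is to slice off the prompt tokens before decoding:

```python
# Optional: decode only the newly generated tokens by dropping the prompt portion
new_tokens = generated_ids[:, inputs["input_ids"].shape[1]:]
print(processor.batch_decode(new_tokens, skip_special_tokens=True))
```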

## Streaming

We can use [text streaming](./generation_strategies#streaming) for a better generation experience. Transformers supports streaming with the [`TextStreamer`] or [`TextIteratorStreamer`] classes. We will use the [`TextIteratorStreamer`] with IDEFICS2-8B.
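
If you just want tokens printed to stdout as they are generated, [`TextStreamer`] is the simpler of the two; here is a minimal sketch reusing the `model`, `processor`, and `inputs` from the previous section:

```python
from transformers import TextStreamer

# TextStreamer decodes tokens and prints them to stdout as soon as they are generated
streamer = TextStreamer(processor.tokenizer, skip_prompt=True, skip_special_tokens=True)
_ = model.generate(**inputs, streamer=streamer, max_new_tokens=500)
```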

Assume we have an application that keeps the chat history and takes in the new user input. We will preprocess the inputs as usual and initialize a [`TextIteratorStreamer`], then run generation in a separate thread so we can stream the generated text tokens in real time. The streamer is passed to [`~GenerationMixin.generate`] along with the rest of the generation arguments.

```python
import time
from transformers import TextIteratorStreamer
from threading import Thread

def model_inference(
    user_prompt,
    chat_history,
    max_new_tokens,
    images
):
    # Wrap the new user input in the chat message format and append it to the history
    user_prompt = {
        "role": "user",
        "content": [
            {"type": "image"},
            {"type": "text", "text": user_prompt},
        ]
    }
    chat_history.append(user_prompt)

    # The streamer yields decoded tokens as soon as they are generated
    streamer = TextIteratorStreamer(
        processor.tokenizer,
        skip_prompt=True,
        timeout=5.0,
    )

    generation_args = {
        "max_new_tokens": max_new_tokens,
        "streamer": streamer,
        "do_sample": False
    }

    # Preprocess the whole conversation and the images as usual
    prompt = processor.apply_chat_template(chat_history, add_generation_prompt=True)
    inputs = processor(
        text=prompt,
        images=images,
        return_tensors="pt",
    )
    inputs = {k: v.to(device) for k, v in inputs.items()}
    generation_args.update(inputs)

    # Run generation in a separate thread so we can consume the streamer here
    thread = Thread(
        target=model.generate,
        kwargs=generation_args,
    )
    thread.start()

    # Accumulate tokens and yield the partial text as it grows
    acc_text = ""
    for text_token in streamer:
        time.sleep(0.04)
        acc_text += text_token
        # Strip the 18-character <end_of_utterance> token if it appears
        if acc_text.endswith("<end_of_utterance>"):
            acc_text = acc_text[:-18]
        yield acc_text
```

> **Collaborator:** Something funny happening with the indentation here.

> **Collaborator:** Why do we need to add this (the `time.sleep(0.04)`)?
>
> **@merveenoyan (Contributor, author):** Otherwise the text flows super fast, which is essentially against streaming (and from my experience it was also crashing).
>
> **Collaborator:** I'm a bit confused; don't we want our models to generate text as fast as possible? My understanding of streaming is just that we don't wait for completion before returning the result.
>
> **@merveenoyan:** The streaming feature enables one to see the tokens flow and stop them if the generation is going to a bad place, as in https://huggingface.co/docs/text-generation-inference/en/conceptual/streaming, so we'd like it to wait a bit between tokens.

Now let's call the `model_inference` function we created and stream the values.

```python
generator = model_inference(
    user_prompt="And what is in this image?",
    chat_history=messages,
    max_new_tokens=100,
    images=images,
)

for value in generator:
    print(value)

# In
# In this
# In this image ...
```

## Fit models in smaller hardware

VLMs are often large and need to be optimized to fit in smaller hardware. Transformers supports many model quantization libraries, and here we will only show int8 quantization with [Quanto](./quantization/quanto#quanto).

First, install dependencies.

```bash
pip install -U quanto bitsandbytes
```

To quantize a model during loading, first create a [`QuantoConfig`], then load the model as usual, passing `quantization_config` during initialization.

```python
from transformers import Idefics2ForConditionalGeneration, QuantoConfig

model_id = "HuggingFaceM4/idefics2-8b"
quantization_config = QuantoConfig(weights="int8")
quantized_model = Idefics2ForConditionalGeneration.from_pretrained(
    model_id, device_map="cuda", quantization_config=quantization_config
)
```
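
Since `bitsandbytes` was installed above as well, 4-bit quantization is another option. Here is a minimal sketch with [`BitsAndBytesConfig`]; the specific 4-bit settings are illustrative choices rather than recommendations from this guide:

```python
import torch
from transformers import BitsAndBytesConfig, Idefics2ForConditionalGeneration

# Illustrative 4-bit configuration; adjust quantization type and compute dtype as needed
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)
quantized_model_4bit = Idefics2ForConditionalGeneration.from_pretrained(
    "HuggingFaceM4/idefics2-8b", device_map="cuda", quantization_config=bnb_config
)
```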

And that's it, we can use the model the same way with no changes.
> **Collaborator:** It would be good here to note what kind of change this makes, e.g. x% reduction in memory footprint.
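
One way to quantify the reduction the reviewer asks about is to compare memory footprints. Here is a small sketch using the existing `get_memory_footprint` method; it assumes both the bf16 `model` from earlier and `quantized_model` are loaded, which requires enough memory for both:

```python
# Compare parameter memory (in GB) of the bf16 model and the int8-quantized model
print(f"bf16 model: {model.get_memory_footprint() / 1e9:.2f} GB")
print(f"int8 model: {quantized_model.get_memory_footprint() / 1e9:.2f} GB")
```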


## Further Reading

Here are some more resources for the image-text-to-text task.

- [Image-text-to-text task page](https://huggingface.co/tasks/image-text-to-text) covers model types, use cases, datasets, and more.
- [Vision Language Models Explained](https://huggingface.co/blog/vlms) is a blog post that covers everything about vision language models and supervised fine-tuning using [TRL](https://huggingface.co/docs/trl/en/index).