Merge remote-tracking branch 'upstream/main' into new-quant-method
SunMarc committed Jul 22, 2024
2 parents 4723908 + bd9dca3 commit d517336
Showing 98 changed files with 2,373 additions and 924 deletions.
4 changes: 3 additions & 1 deletion docs/source/en/_toctree.yml
@@ -92,6 +92,8 @@
title: Visual Question Answering
- local: tasks/text-to-speech
title: Text to speech
- local: tasks/image_text_to_text
title: Image-text-to-text
title: Multimodal
- isExpanded: false
sections:
@@ -761,7 +763,7 @@
- local: model_doc/bros
title: BROS
- local: model_doc/chameleon
title: chameleon
title: Chameleon
- local: model_doc/chinese_clip
title: Chinese-CLIP
- local: model_doc/clip
21 changes: 12 additions & 9 deletions docs/source/en/model_doc/chameleon.md
@@ -34,13 +34,13 @@ being competitive with models such as Mixtral 8x7B and Gemini-Pro, and performs
generation, all in a single model. It also matches or exceeds the performance of much larger models,
including Gemini Pro and GPT-4V, according to human judgments on a new long-form mixed-modal
generation evaluation, where either the prompt or outputs contain mixed sequences of both images and
text. Chameleon marks a significant step forward in a unified modeling of full multimodal documents*
text. Chameleon marks a significant step forward in unified modeling of full multimodal documents*


<img src="https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/transformers/model_doc/chameleon_arch.png"
alt="drawing" width="600"/>

<small> Chameleon incorporates a vector quantizer module to transform images into discrete tokens. That also enables image geenration using an auto-regressive transformer. Taken from the <a href="https://arxiv.org/abs/2405.09818v1">original paper.</a> </small>
<small> Chameleon incorporates a vector quantizer module to transform images into discrete tokens. That also enables image generation using an auto-regressive transformer. Taken from the <a href="https://arxiv.org/abs/2405.09818v1">original paper.</a> </small>

This model was contributed by [joaogante](https://huggingface.co/joaogante) and [RaushanTurganbay](https://huggingface.co/RaushanTurganbay).
The original code can be found [here](https://github.com/facebookresearch/chameleon).
@@ -55,13 +55,14 @@ The original code can be found [here](https://github.com/facebookresearch/chamel
- Chameleon generates in chat format, which means that the generated text will always be the "assistant's turn". You can enable text completion generation by passing `return_for_text_completion=True` when calling the processor.
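
  For example, a minimal text-completion sketch (the prompt text and generation length here are illustrative; the flag is simply forwarded in the processor call as described above):

  ```python
  import requests
  import torch
  from PIL import Image
  from transformers import ChameleonProcessor, ChameleonForConditionalGeneration

  processor = ChameleonProcessor.from_pretrained("facebook/chameleon-7b")
  model = ChameleonForConditionalGeneration.from_pretrained(
      "facebook/chameleon-7b", torch_dtype=torch.bfloat16, device_map="cuda"
  )

  url = "http://images.cocodataset.org/val2017/000000039769.jpg"
  image = Image.open(requests.get(url, stream=True).raw)

  # return_for_text_completion=True keeps the prompt as a plain completion
  # instead of wrapping it as the assistant's chat turn
  inputs = processor(
      text="A photo of<image>",
      images=image,
      return_for_text_completion=True,
      return_tensors="pt",
  ).to(device="cuda", dtype=torch.bfloat16)

  out = model.generate(**inputs, max_new_tokens=30)
  print(processor.decode(out[0], skip_special_tokens=True))
  ```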

> [!NOTE]
> Chameleon implementation in Transformers uses a special image token to indicate where to merge image embeddings. For special image token we didn't add a new one but used one of the reserved tokens: `<reserved08707>`.
> Chameleon implementation in Transformers uses a special image token to indicate where to merge image embeddings. For special image token we didn't add a new one but used one of the reserved tokens: `<reserved08707>`. You have to add `<image>` to your prompt in the place where the image should be embedded for correct generation.
## Usage example

### Single image inference

Here's how to load the model and perform inference in half-precision (`torch.float16`):
Chameleon is a gated model, so make sure you have been granted access and are logged in to the Hugging Face Hub with a token.
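One way to authenticate is the sketch below (running `huggingface-cli login` from a shell works as well); the token must grant access to the gated checkpoint:

```python
from huggingface_hub import login

login()  # prompts for a Hugging Face access token
```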
Here's how to load the model and perform inference in half-precision (`torch.bfloat16`):

```python
from transformers import ChameleonProcessor, ChameleonForConditionalGeneration
@@ -70,7 +71,7 @@ from PIL import Image
import requests

processor = ChameleonProcessor.from_pretrained("facebook/chameleon-7b")
model = ChameleonForConditionalGeneration.from_pretrained("facebook/chameleon-7b", torch_dtype=torch.float16, device_map="auto")
model = ChameleonForConditionalGeneration.from_pretrained("facebook/chameleon-7b", torch_dtype=torch.bfloat16, device_map="cuda")

# prepare image and text prompt
url = 'http://images.cocodataset.org/val2017/000000039769.jpg'
@@ -95,7 +96,8 @@ from PIL import Image
import requests

processor = ChameleonProcessor.from_pretrained("facebook/chameleon-7b")
model = ChameleonForConditionalGeneration.from_pretrained("facebook/chameleon-7b", torch_dtype=torch.float16, device_map="auto")

model = ChameleonForConditionalGeneration.from_pretrained("facebook/chameleon-7b", torch_dtype=torch.bfloat16, device_map="cuda")

# Get three different images
url = "https://www.ilankelman.org/stopsigns/australia.jpg"
@@ -115,7 +117,7 @@ prompts = [

# We can simply feed images in the order they have to be used in the text prompt
# Each "<image>" token uses one image, leaving the remaining images for subsequent "<image>" tokens
inputs = processor(text=prompts, images=[image_stop, image_cats, image_snowman], padding=True, return_tensors="pt").to(device="cuda", dtype=torch.float16)
inputs = processor(text=prompts, images=[image_stop, image_cats, image_snowman], padding=True, return_tensors="pt").to(device="cuda", dtype=torch.bfloat16)

# Generate
generate_ids = model.generate(**inputs, max_new_tokens=50)
@@ -138,7 +140,7 @@ quantization_config = BitsAndBytesConfig(
bnb_4bit_compute_dtype=torch.float16,
)

model = ChameleonForConditionalGeneration.from_pretrained("meta-chameleon", quantization_config=quantization_config, device_map="auto")
model = ChameleonForConditionalGeneration.from_pretrained("facebook/chameleon-7b", quantization_config=quantization_config, device_map="cuda")
```

### Use Flash-Attention 2 and SDPA to further speed up generation
@@ -148,9 +150,10 @@ The model supports both Flash-Attention 2 and PyTorch's [`torch.nn.functional.
```python
import torch
from transformers import ChameleonForConditionalGeneration

model_id = "facebook/chameleon-7b"
model = ChameleonForConditionalGeneration.from_pretrained(
model_id,
torch_dtype=torch.float16,
torch_dtype=torch.bfloat16,
low_cpu_mem_usage=True,
attn_implementation="flash_attention_2"
).to(0)
2 changes: 1 addition & 1 deletion docs/source/en/model_doc/marian.md
@@ -105,7 +105,7 @@ from huggingface_hub import list_models

model_list = list_models()
org = "Helsinki-NLP"
model_ids = [x.modelId for x in model_list if x.modelId.startswith(org)]
model_ids = [x.id for x in model_list if x.id.startswith(org)]
suffix = [x.split("/")[1] for x in model_ids]
old_style_multi_models = [f"{org}/{s}" for s in suffix if s != s.lower()]
```
24 changes: 12 additions & 12 deletions docs/source/en/model_doc/roberta.md
@@ -51,19 +51,19 @@ This model was contributed by [julien-c](https://huggingface.co/julien-c). The o

## Usage tips

- This implementation is the same as [`BertModel`] with a tiny embeddings tweak as well as a setup
for Roberta pretrained models.
- RoBERTa has the same architecture as BERT, but uses a byte-level BPE as a tokenizer (same as GPT-2) and uses a
- This implementation is the same as [`BertModel`] with a minor tweak to the embeddings, as well as a setup
for RoBERTa pretrained models.
- RoBERTa has the same architecture as BERT but uses a byte-level BPE as a tokenizer (same as GPT-2) and uses a
different pretraining scheme.
- RoBERTa doesn't have `token_type_ids`, you don't need to indicate which token belongs to which segment. Just
separate your segments with the separation token `tokenizer.sep_token` (or `</s>`)
- Same as BERT with better pretraining tricks:

* dynamic masking: tokens are masked differently at each epoch, whereas BERT does it once and for all
* together to reach 512 tokens (so the sentences are in an order than may span several documents)
* train with larger batches
* use BPE with bytes as a subunit and not characters (because of unicode characters)
- [CamemBERT](camembert) is a wrapper around RoBERTa. Refer to this page for usage examples.
- RoBERTa doesn't have `token_type_ids`, so you don't need to indicate which token belongs to which segment. Just
separate your segments with the separation token `tokenizer.sep_token` (or `</s>`); see the sketch after these tips.
- RoBERTa is similar to BERT but with better pretraining techniques:

* Dynamic masking: tokens are masked differently at each epoch, whereas BERT does it once and for all.
* Sentence packing: Sentences are packed together to reach 512 tokens (so the sentences are in an order that may span several documents).
* Larger batches: Training uses larger batches.
* Byte-level BPE vocabulary: Uses BPE with bytes as a subunit instead of characters, accommodating Unicode characters.
- [CamemBERT](camembert) is a wrapper around RoBERTa. Refer to its model page for usage examples.
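
A minimal sketch of the `token_type_ids` point above (the example sentences are arbitrary):

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("roberta-base")

# Encode a segment pair: RoBERTa separates the segments with </s></s>
# instead of relying on token_type_ids
encoded = tokenizer("Paris is the capital of France.", "It sits on the Seine.")
print(tokenizer.decode(encoded["input_ids"]))
# roughly: <s>Paris is the capital of France.</s></s>It sits on the Seine.</s>
```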

## Resources

6 changes: 3 additions & 3 deletions docs/source/en/model_doc/video_llava.md
@@ -98,7 +98,7 @@ indices = np.arange(0, total_frames, total_frames / 8).astype(int)
video = read_video_pyav(container, indices)

# For better results, we recommend to prompt the model in the following format
prompt = "USER: <video>Why is this funny? ASSISTANT:"
prompt = "USER: <video>\nWhy is this funny? ASSISTANT:"
inputs = processor(text=prompt, videos=video, return_tensors="pt")

out = model.generate(**inputs, max_new_tokens=60)
@@ -108,7 +108,7 @@ processor.batch_decode(out, skip_special_tokens=True, clean_up_tokenization_spac
For multi-turn conversations, change the prompt format to:

```bash
"USER: <video>What do you see in this video? ASSISTANT: A baby reading a book. USER: Why is the it funny? ASSISTANT:"
"USER: <video>\nWhat do you see in this video? ASSISTANT: A baby reading a book. USER: Why is the it funny? ASSISTANT:"
```
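
A sketch of feeding that multi-turn prompt back through the pipeline, reusing the `processor`, `model`, and `video` objects from the single-video example above:

```python
prompt = (
    "USER: <video>\nWhat do you see in this video? ASSISTANT: A baby reading a book. "
    "USER: Why is it funny? ASSISTANT:"
)
inputs = processor(text=prompt, videos=video, return_tensors="pt")
out = model.generate(**inputs, max_new_tokens=60)
print(processor.batch_decode(out, skip_special_tokens=True, clean_up_tokenization_spaces=True)[0])
```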

### Mixed Media Mode
@@ -123,7 +123,7 @@ import requests
# Load an image and write a new prompt
url = "http://images.cocodataset.org/val2017/000000039769.jpg"
image = Image.open(requests.get(url, stream=True).raw)
prompt = "USER: <image> How many cats are there in the image? ASSISTANT: There are two cats. USER: <video>Why is this video funny? ASSISTANT:"
prompt = "USER: <image>\nHow many cats are there in the image? ASSISTANT: There are two cats. USER: <video>\nWhy is this video funny? ASSISTANT:"

inputs = processor(text=prompt, images=image, videos=clip, padding=True, return_tensors="pt")
