
Add Video Llava #29733

Merged: 57 commits into huggingface:main on May 15, 2024
Conversation

@zucchini-nlp (Member) commented Mar 19, 2024

What does this PR do?

Fixes #29640. Adds a new model, Video-LLaVA, to the library.

This is a draft PR; I will add more here later.

The model now accepts both video and image inputs in the same batch. Each visual input has its own special token, so we do not need to repeat "<image>" 8 times for 8 frames. Here is a short code snippet:

import torch
import numpy as np
import requests
from PIL import Image
from decord import VideoReader
from huggingface_hub import hf_hub_download
from transformers import VideoLlavaForConditionalGeneration, VideoLlavaProcessor


def load_video_tensor(video_path, n_frms=4):
    # Decode the video at 224x224 and sample `n_frms` frames uniformly across its length.
    vr = VideoReader(uri=video_path, height=224, width=224)
    n_frms = min(n_frms, len(vr))
    indices = np.arange(0, len(vr), len(vr) / n_frms).astype(int)
    frames = vr.get_batch(indices).asnumpy()  # (n_frms, height, width, channels) uint8 array
    return frames

# -------------------------------------------------------------------------------------------------------------------

model = VideoLlavaForConditionalGeneration.from_pretrained("/home/raushan/video_llava/")

url = "http://images.cocodataset.org/val2017/000000039769.jpg"
image = Image.open(requests.get(url, stream=True).raw)

video_path = hf_hub_download(repo_id="raushan-testing-hf/videos-test", filename="sample_demo_1.mp4", repo_type="dataset")
clip = load_video_tensor(video_path, n_frms=8)

processor = VideoLlavaProcessor.from_pretrained("/home/raushan/video_llava")
prompt_vid = "USER: <video>What do you see? ASSISTANT: A baby with a remote control. USER: <video>Why is this funny? ASSISTANT:"
prompt_img = "USER: <image>How many cats are there in the image? ASSISTANT:"
prompt_mix = "USER: <image>How many cats are there in the image? ASSISTANT: 2. USER: <video>Why is this video funny? ASSISTANT:"

inputs = processor(text=[prompt_mix, prompt_img], visual_inputs=[image, clip, image], padding=True, return_tensors="pt")

out = model.generate(**inputs, max_new_tokens=40)
print(processor.batch_decode(out, skip_special_tokens=True, clean_up_tokenization_spaces=True))

# [
#    'USER:  How many cats are there in the image? ASSISTANT: 2. USER:  Why is this video funny? ASSISTANT: The video is funny because the baby is sitting on 
#         the bed and reading a book, which is an unusual and amusing sight. Babies are typically not known for reading books, and the fact', 
#    'USER:  How many cats are there in the image? ASSISTANT: There are two cats in the image..'
# ]

@zucchini-nlp zucchini-nlp marked this pull request as draft March 19, 2024 14:21
@zucchini-nlp (Member Author)

@LinB203 hey! As we talked about before, here is a draft PR of Video-LLaVA. I checked that the modeling part runs without errors and generates output similar to the original repo's.

To update the model files on the Hub, you can use the convert_weights script and use this branch to test whether the model loads correctly. Whenever you are available, can you look through it and check whether I missed anything important? :)
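For reference, a minimal sketch of the kind of sanity check meant here, assuming this branch is installed and the weights have already been converted locally (the checkpoint path below is a placeholder, not a real location):

```python
from transformers import VideoLlavaForConditionalGeneration, VideoLlavaProcessor

# Hypothetical local path to a checkpoint produced by the conversion script.
local_checkpoint = "/path/to/converted/video_llava"

model = VideoLlavaForConditionalGeneration.from_pretrained(local_checkpoint)
processor = VideoLlavaProcessor.from_pretrained(local_checkpoint)

# If this prints without errors, the converted config and weights load correctly.
print(model.config.model_type)
```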

@HuggingFaceDocBuilderDev

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

@zucchini-nlp (Member Author)

@LinB203 pinging in case the first one got lost in notifications :)

@ArthurZucker (Collaborator)

FYI @NielsRogge and @amyeroberts

README.md (outdated review thread, resolved)
zucchini-nlp and others added 2 commits April 8, 2024 14:33
Co-authored-by: NielsRogge <48327001+NielsRogge@users.noreply.github.com>
@zucchini-nlp changed the title: Add Video Llava → [WIP] Add Video Llava (Apr 8, 2024)
@zucchini-nlp (Member Author)

I believe we can start reviewing this now. I converted the weights and added them to my Hub account temporarily so that we can run and test the model.

In the meantime, I will be waiting for @LinB203 to update the weights on the organization's Hub, which I think is required before we merge the PR.

@zucchini-nlp zucchini-nlp marked this pull request as ready for review April 8, 2024 12:38
@amyeroberts (Collaborator)

@zucchini-nlp Awesome work! The first thing to do before a full review is resolving the failing tests. Some of these are unrelated, and rebasing on main should resolve them. Looking at the CI, some of the failures are Video LLaVA specific - let me know if you need any help addressing them!

@zucchini-nlp (Member Author)

Rebased on main and resolved conflicts. The only failing doctest is the one that cannot load and run the 7B model within 120 seconds, but I think we should keep the example anyway to show how Video-LLaVA works.

@amyeroberts (Collaborator)

@zucchini-nlp You can exclude the model from running in the doc tests by adding it to slow_documentation_tests.txt.

Then, once the PR has been reviewed, is in a steady state, and is ready for merging, we can run the slow tests and the documentation tests to make sure everything is correct.
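For illustration, the entry would just be the path of the model's doc page, roughly like the line below (the doc path matches the file added in this PR; the exact location of slow_documentation_tests.txt in the repo is an assumption):

```
docs/source/en/model_doc/video_llava.md
```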

@amyeroberts (Collaborator) left a comment
Thanks for all the work adding this model - looks great!

Comment on lines 54 to 62
```bash
"USER: <video>\n<prompt> ASSISTANT:"
```

For multiple turns conversation:

```bash
"USER: <video>\n<prompt1> ASSISTANT: <answer1></s>USER: <prompt2> ASSISTANT: <answer2></s>USER: <prompt3> ASSISTANT:"
```
Collaborator:

Could you expand this to show a full example using the model? Users typically want to just copy-paste
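For what it's worth, a full docs example could look roughly like the sketch below. It reuses the prompt format above and the draft API from the PR description (in particular the `visual_inputs` argument, which may still change), with a placeholder checkpoint id:

```python
import numpy as np
from decord import VideoReader
from huggingface_hub import hf_hub_download
from transformers import VideoLlavaForConditionalGeneration, VideoLlavaProcessor

# Placeholder checkpoint; the final Hub id was still being decided at this point.
checkpoint = "path-or-hub-id/video-llava-hf"
model = VideoLlavaForConditionalGeneration.from_pretrained(checkpoint)
processor = VideoLlavaProcessor.from_pretrained(checkpoint)

# Sample 8 frames uniformly from a demo video.
video_path = hf_hub_download(
    repo_id="raushan-testing-hf/videos-test", filename="sample_demo_1.mp4", repo_type="dataset"
)
vr = VideoReader(uri=video_path, height=224, width=224)
indices = np.arange(0, len(vr), len(vr) / 8).astype(int)
clip = vr.get_batch(indices).asnumpy()

prompt = "USER: <video>\nWhy is this video funny? ASSISTANT:"
inputs = processor(text=[prompt], visual_inputs=[clip], padding=True, return_tensors="pt")

out = model.generate(**inputs, max_new_tokens=40)
print(processor.batch_decode(out, skip_special_tokens=True, clean_up_tokenization_spaces=True))
```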

Collaborator:

If the image processor takes both images and videos as input, and only one of them is required, then setting image = None seems reasonable
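A minimal sketch of what that call pattern could look like, assuming the image processor ends up exposing separate `images` and `videos` arguments (the argument names and the dummy input here are assumptions, not the final interface):

```python
import numpy as np
from transformers import VideoLlavaImageProcessor

image_processor = VideoLlavaImageProcessor()

# Dummy clip: 8 frames of 224x224 RGB, just to illustrate the call shape.
clip = np.zeros((8, 224, 224, 3), dtype=np.uint8)

# Video-only call: no image is required, so `images` is simply left as None.
batch = image_processor(images=None, videos=clip, return_tensors="pt")
print(batch.keys())
```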

input_ids,
attention_mask,
labels,
num_frames=8,
Collaborator:

Has this been added? Skimming through, I didn't spot it, but I might have just missed it.

@zucchini-nlp (Member Author)

Hey @LinB203, can you let us know whether you can upload the HF weights of VideoLlava to the organization? The model seems ready to be added to the library.

@LinB203 commented May 9, 2024

> Hey @LinB203, can you let us know whether you can upload the HF weights of VideoLlava to the organization? The model seems ready to be added to the library.

Thanks for your attention and for bringing the model into transformers. Which organization do you mean?
I have uploaded the weights to my personal repo; is that OK?

@zucchini-nlp (Member Author)

@LinB203 I mean the weights that can be loaded into the transformers model, i.e. the ones converted by the "convert_weights" script I added in this PR. I have already tested this and added the weights to my account on the Hub, but it would be nice if they were under the "LanguageBind" org. If you need to keep the current version of the state dict, you could probably name the new model "LanguageBind/Video-LLaVA-7B-hf".

@LinB203 commented May 9, 2024

> @LinB203 I mean the weights that can be loaded into the transformers model, i.e. the ones converted by the "convert_weights" script I added in this PR. I have already tested this and added the weights to my account on the Hub, but it would be nice if they were under the "LanguageBind" org. If you need to keep the current version of the state dict, you could probably name the new model "LanguageBind/Video-LLaVA-7B-hf".

I see. I will create the LanguageBind/Video-LLaVA-7B-hf repo and upload the weights newly converted with your script tonight. Thank you very much.

@LinB203 commented May 9, 2024

> @LinB203 I mean the weights that can be loaded into the transformers model, i.e. the ones converted by the "convert_weights" script I added in this PR. I have already tested this and added the weights to my account on the Hub, but it would be nice if they were under the "LanguageBind" org. If you need to keep the current version of the state dict, you could probably name the new model "LanguageBind/Video-LLaVA-7B-hf".

Finished. https://huggingface.co/LanguageBind/Video-LLaVA-7B-hf/tree/main
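With that repo up, loading is just a matter of pointing at the new id; a minimal sketch (generation itself would follow the example in the PR description):

```python
from transformers import VideoLlavaForConditionalGeneration, VideoLlavaProcessor

model = VideoLlavaForConditionalGeneration.from_pretrained("LanguageBind/Video-LLaVA-7B-hf")
processor = VideoLlavaProcessor.from_pretrained("LanguageBind/Video-LLaVA-7B-hf")
```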

@zucchini-nlp (Member Author)

@amyeroberts done! The last commit should trigger the slow tests after your approval.

Note on the failing code check: it fails because of the ARCHIVE_MAP, which was apparently removed for all models, so I did not add one for VideoLlava.

@amyeroberts (Collaborator)

@zucchini-nlp Great! :D Could you rebase to include the upstream changes such as the ARCHIVE_MAP removal? This should make everything green and confirm that it is just that triggering the errors.

@zucchini-nlp (Member Author)

The PR passed all the tests, and the slow tests are passing for me locally. Should be good to go.

@amyeroberts (Collaborator)

@zucchini-nlp Great - let's merge! Do you have permission to do so? I can merge if not.

@zucchini-nlp (Member Author)

Just fixed one slow test; will merge when I get all green.

@zucchini-nlp zucchini-nlp merged commit bd9f4d7 into huggingface:main May 15, 2024
24 checks passed
itazap pushed a commit that referenced this pull request May 24, 2024
* add model draft

* update docstring

* add tests

* support image and video as input

* update for better handling of mixed input and clean-up a bit

* bug when mixed inputs & add tests

* Update README.md

Co-authored-by: NielsRogge <48327001+NielsRogge@users.noreply.github.com>

* Merge remote-tracking branch 'upstream/main' into video_llava

* link to abstract of paper in README

* fix test

* fix-copies

* make tests happy

* skip docstest for now

* do not run doctest for now

* Update src/transformers/models/video_llava/processing_video_llava.py

Co-authored-by: amyeroberts <22614925+amyeroberts@users.noreply.github.com>

* Update src/transformers/models/video_llava/image_processing_video_llava.py

Co-authored-by: amyeroberts <22614925+amyeroberts@users.noreply.github.com>

* Update src/transformers/models/video_llava/image_processing_video_llava.py

Co-authored-by: amyeroberts <22614925+amyeroberts@users.noreply.github.com>

* Update src/transformers/models/video_llava/image_processing_video_llava.py

Co-authored-by: amyeroberts <22614925+amyeroberts@users.noreply.github.com>

* Update src/transformers/models/video_llava/image_processing_video_llava.py

Co-authored-by: amyeroberts <22614925+amyeroberts@users.noreply.github.com>

* Update tests/models/video_llava/test_modeling_video_llava.py

Co-authored-by: amyeroberts <22614925+amyeroberts@users.noreply.github.com>

* Update src/transformers/models/video_llava/image_processing_video_llava.py

Co-authored-by: amyeroberts <22614925+amyeroberts@users.noreply.github.com>

* address review comments

* failing tests

* Fix vocab_size in common tests for VLMs

* codestyle

* Update src/transformers/models/video_llava/configuration_video_llava.py

Co-authored-by: amyeroberts <22614925+amyeroberts@users.noreply.github.com>

* Update src/transformers/models/video_llava/configuration_video_llava.py

Co-authored-by: amyeroberts <22614925+amyeroberts@users.noreply.github.com>

* Update src/transformers/models/video_llava/modeling_video_llava.py

Co-authored-by: amyeroberts <22614925+amyeroberts@users.noreply.github.com>

* Update src/transformers/models/video_llava/modeling_video_llava.py

Co-authored-by: amyeroberts <22614925+amyeroberts@users.noreply.github.com>

* Update docs/source/en/model_doc/video_llava.md

Co-authored-by: amyeroberts <22614925+amyeroberts@users.noreply.github.com>

* Update docs/source/en/model_doc/video_llava.md

Co-authored-by: amyeroberts <22614925+amyeroberts@users.noreply.github.com>

* Update src/transformers/models/video_llava/image_processing_video_llava.py

Co-authored-by: amyeroberts <22614925+amyeroberts@users.noreply.github.com>

* Update docs/source/en/model_doc/video_llava.md

Co-authored-by: amyeroberts <22614925+amyeroberts@users.noreply.github.com>

* Update src/transformers/models/video_llava/processing_video_llava.py

Co-authored-by: amyeroberts <22614925+amyeroberts@users.noreply.github.com>

* Update tests/models/video_llava/test_modeling_video_llava.py

Co-authored-by: amyeroberts <22614925+amyeroberts@users.noreply.github.com>

* Update tests/models/video_llava/test_modeling_video_llava.py

Co-authored-by: amyeroberts <22614925+amyeroberts@users.noreply.github.com>

* Update tests/models/video_llava/test_modeling_video_llava.py

Co-authored-by: amyeroberts <22614925+amyeroberts@users.noreply.github.com>

* PR suggestions

* fix-copies

* Update src/transformers/models/video_llava/configuration_video_llava.py

Co-authored-by: amyeroberts <22614925+amyeroberts@users.noreply.github.com>

* Update src/transformers/models/video_llava/configuration_video_llava.py

Co-authored-by: amyeroberts <22614925+amyeroberts@users.noreply.github.com>

* add full example in docs

* clean-up with new model-id

* [run-slow] video_llava

* update docstring

* [run-slow] video_llava

* remove all achive maps

* fix some tests

* test was supposed to be skipped for llava :)

---------

Co-authored-by: NielsRogge <48327001+NielsRogge@users.noreply.github.com>
Co-authored-by: amyeroberts <22614925+amyeroberts@users.noreply.github.com>

Successfully merging this pull request may close these issues:

Video-LLaVA with transformers library (#29640)
7 participants