
Add Video Llava #29733

Merged: 57 commits into huggingface:main on May 15, 2024
Conversation

@zucchini-nlp (Member) commented Mar 19, 2024

What does this PR do?

Fixes #29640. Adds a new model, Video-LLaVA, to the library.

This is a draft PR; I will add more here later.

The model now accepts both video and image inputs in the same batch. Each visual input has its own special token, so we do not need to repeat "<image>" 8 times for 8 frames. Here is a short code snippet:

import torch
import numpy as np
import requests
from PIL import Image
from decord import VideoReader
from huggingface_hub import hf_hub_download
from transformers import VideoLlavaForConditionalGeneration, VideoLlavaProcessor


def load_video_tensor(video_path, n_frms=4):
    # Decode the video at 224x224 and sample `n_frms` frames uniformly across its length.
    vr = VideoReader(uri=video_path, height=224, width=224)
    n_frms = min(n_frms, len(vr))
    indices = np.arange(0, len(vr), len(vr) / n_frms).astype(int)
    frames = vr.get_batch(indices).asnumpy()  # (n_frms, height, width, channels) uint8 array
    return frames

# -------------------------------------------------------------------------------------------------------------------

model = VideoLlavaForConditionalGeneration.from_pretrained("/home/raushan/video_llava/")

url = "http://images.cocodataset.org/val2017/000000039769.jpg"
image = Image.open(requests.get(url, stream=True).raw)

video_path = hf_hub_download(repo_id="raushan-testing-hf/videos-test", filename="sample_demo_1.mp4", repo_type="dataset")
clip = load_video_tensor(video_path, n_frms=8)

processor = VideoLlavaProcessor.from_pretrained("/home/raushan/video_llava")
prompt_vid = "USER: <video>What do you see? ASSISTANT: A baby with a remote control. USER: <video>Why is this funny? ASSISTANT:"
prompt_img = "USER: <image>How many cats are there in the image? ASSISTANT:"
prompt_mix = "USER: <image>How many cats are there in the image? ASSISTANT: 2. USER: <video>Why is this video funny? ASSISTANT:"

inputs = processor(text=[prompt_mix, prompt_img], visual_inputs=[image, clip, image], padding=True, return_tensors="pt")

out = model.generate(**inputs, max_new_tokens=40)
print(processor.batch_decode(out, skip_special_tokens=True, clean_up_tokenization_spaces=True))

# [
#    'USER:  How many cats are there in the image? ASSISTANT: 2. USER:  Why is this video funny? ASSISTANT: The video is funny because the baby is sitting on 
#         the bed and reading a book, which is an unusual and amusing sight. Babies are typically not known for reading books, and the fact', 
#    'USER:  How many cats are there in the image? ASSISTANT: There are two cats in the image..'
# ]

@zucchini-nlp zucchini-nlp marked this pull request as draft March 19, 2024 14:21
@zucchini-nlp (Member Author)

@LinB203 hey! As we talked about before, here is a draft PR of Video-LLaVA. I checked that the modeling part runs without errors and generates output similar to the original repo's.

To update the model files on the Hub, you can use the convert_weights script and use this branch to test whether the model loads correctly. Whenever you are available, can you look through it and check whether I missed anything important? :)
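For reference, a minimal sketch of the kind of sanity check meant here, assuming this branch is installed and the weights have already been converted locally (the checkpoint path below is a placeholder, not a real location):

```python
from transformers import VideoLlavaForConditionalGeneration, VideoLlavaProcessor

# Hypothetical local path to a checkpoint produced by the conversion script.
local_checkpoint = "/path/to/converted/video_llava"

model = VideoLlavaForConditionalGeneration.from_pretrained(local_checkpoint)
processor = VideoLlavaProcessor.from_pretrained(local_checkpoint)

# If this prints without errors, the converted config and weights load correctly.
print(model.config.model_type)
```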

@HuggingFaceDocBuilderDev

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

@zucchini-nlp (Member Author)

@LinB203 pinging in case the first one got lost in notifications :)

@ArthurZucker (Collaborator)

FYI @NielsRogge and @amyeroberts

README.md (outdated review thread, resolved)
zucchini-nlp and others added 2 commits April 8, 2024 14:33
Co-authored-by: NielsRogge <48327001+NielsRogge@users.noreply.github.com>
@zucchini-nlp changed the title: Add Video Llava → [WIP] Add Video Llava (Apr 8, 2024)
@zucchini-nlp (Member Author)

I believe we can start reviewing this now. I converted the weights and added them to my Hub account temporarily so that we can run and test the model.

In the meantime, I will be waiting for @LinB203 to update the weights on the organization's Hub, which I think is required before we merge the PR.

@zucchini-nlp zucchini-nlp marked this pull request as ready for review April 8, 2024 12:38
@amyeroberts (Collaborator)

@zucchini-nlp Awesome work! The first thing to do before a full review is resolving the failing tests. Some of these are unrelated, and rebasing on main should resolve them. Looking at the CI, some of the failures are Video LLaVA specific - let me know if you need any help addressing them!

@zucchini-nlp (Member Author)

Rebased on main and resolved conflicts. The only failing doctest is the one that cannot load and run the 7B model within 120 seconds, but I think we should keep the example anyway to show how Video-LLaVA works.

@amyeroberts (Collaborator)

@zucchini-nlp You can exclude the model from running in the doc tests by adding it to slow_documentation_tests.txt.

Then, once the PR has been reviewed, is in a steady state, and is ready for merging, we can run the slow tests and the documentation tests to make sure everything is correct.
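For illustration, the entry would just be the path of the model's doc page, roughly like the line below (the doc path matches the file added in this PR; the exact location of slow_documentation_tests.txt in the repo is an assumption):

```
docs/source/en/model_doc/video_llava.md
```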

@amyeroberts (Collaborator) left a comment
Thanks for all the work adding this model - looks great!

Comment on lines 54 to 62
```bash
"USER: <video>\n<prompt> ASSISTANT:"
```

For multiple turns conversation:

```bash
"USER: <video>\n<prompt1> ASSISTANT: <answer1></s>USER: <prompt2> ASSISTANT: <answer2></s>USER: <prompt3> ASSISTANT:"
```
Collaborator:

Could you expand this to show a full example using the model? Users typically want to just copy-paste
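For what it's worth, a full docs example could look roughly like the sketch below. It reuses the prompt format above and the draft API from the PR description (in particular the `visual_inputs` argument, which may still change), with a placeholder checkpoint id:

```python
import numpy as np
from decord import VideoReader
from huggingface_hub import hf_hub_download
from transformers import VideoLlavaForConditionalGeneration, VideoLlavaProcessor

# Placeholder checkpoint; the final Hub id was still being decided at this point.
checkpoint = "path-or-hub-id/video-llava-hf"
model = VideoLlavaForConditionalGeneration.from_pretrained(checkpoint)
processor = VideoLlavaProcessor.from_pretrained(checkpoint)

# Sample 8 frames uniformly from a demo video.
video_path = hf_hub_download(
    repo_id="raushan-testing-hf/videos-test", filename="sample_demo_1.mp4", repo_type="dataset"
)
vr = VideoReader(uri=video_path, height=224, width=224)
indices = np.arange(0, len(vr), len(vr) / 8).astype(int)
clip = vr.get_batch(indices).asnumpy()

prompt = "USER: <video>\nWhy is this video funny? ASSISTANT:"
inputs = processor(text=[prompt], visual_inputs=[clip], padding=True, return_tensors="pt")

out = model.generate(**inputs, max_new_tokens=40)
print(processor.batch_decode(out, skip_special_tokens=True, clean_up_tokenization_spaces=True))
```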

Collaborator:

If the image processor takes both images and videos as input, and only one of them is required, then setting image = None seems reasonable
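A minimal sketch of what that call pattern could look like, assuming the image processor ends up exposing separate `images` and `videos` arguments (the argument names and the dummy input here are assumptions, not the final interface):

```python
import numpy as np
from transformers import VideoLlavaImageProcessor

image_processor = VideoLlavaImageProcessor()

# Dummy clip: 8 frames of 224x224 RGB, just to illustrate the call shape.
clip = np.zeros((8, 224, 224, 3), dtype=np.uint8)

# Video-only call: no image is required, so `images` is simply left as None.
batch = image_processor(images=None, videos=clip, return_tensors="pt")
print(batch.keys())
```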

input_ids,
attention_mask,
labels,
num_frames=8,
Collaborator:

Has this been added? Skimming through, I didn't spot it, but I might have just missed it.

@zucchini-nlp (Member Author)

Hey @LinB203, can you let us know whether you can upload the HF weights of VideoLlava to the organization? The model seems ready to be added to the library.

@LinB203 commented May 9, 2024

> Hey @LinB203, can you let us know whether you can upload the HF weights of VideoLlava to the organization? The model seems ready to be added to the library.

Thanks for your attention and for bringing the model into transformers. Which organization do you mean?
I have uploaded the weights to my personal repo; is that OK?

@zucchini-nlp (Member Author)

@LinB203 I mean the weights that can be loaded into the transformers model, i.e. the ones converted by the "convert_weights" script I added in this PR. I have already tested this and added the weights to my account on the Hub, but it would be nice if they were under the "LanguageBind" org. If you need to keep the current version of the state dict, you could probably name the new model "LanguageBind/Video-LLaVA-7B-hf".

@LinB203 commented May 9, 2024

> @LinB203 I mean the weights that can be loaded into the transformers model, i.e. the ones converted by the "convert_weights" script I added in this PR. I have already tested this and added the weights to my account on the Hub, but it would be nice if they were under the "LanguageBind" org. If you need to keep the current version of the state dict, you could probably name the new model "LanguageBind/Video-LLaVA-7B-hf".

I see. I will create the LanguageBind/Video-LLaVA-7B-hf repo and upload the weights newly converted with your script tonight. Thank you very much.

@LinB203 commented May 9, 2024

> @LinB203 I mean the weights that can be loaded into the transformers model, i.e. the ones converted by the "convert_weights" script I added in this PR. I have already tested this and added the weights to my account on the Hub, but it would be nice if they were under the "LanguageBind" org. If you need to keep the current version of the state dict, you could probably name the new model "LanguageBind/Video-LLaVA-7B-hf".

Finished. https://huggingface.co/LanguageBind/Video-LLaVA-7B-hf/tree/main
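With that repo up, loading is just a matter of pointing at the new id; a minimal sketch (generation itself would follow the example in the PR description):

```python
from transformers import VideoLlavaForConditionalGeneration, VideoLlavaProcessor

model = VideoLlavaForConditionalGeneration.from_pretrained("LanguageBind/Video-LLaVA-7B-hf")
processor = VideoLlavaProcessor.from_pretrained("LanguageBind/Video-LLaVA-7B-hf")
```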

@zucchini-nlp (Member Author)

@amyeroberts done! The last commit should trigger the slow tests after your approval.

Note on the failing code check: it fails because of the ARCHIVE_MAP, which was apparently removed for all models, so I did not add one for VideoLlava.

@amyeroberts (Collaborator)

@zucchini-nlp Great! :D Could you rebase to include the upstream changes such as the ARCHIVE_MAP removal? This should make everything green and confirm that it is just that triggering the errors.

@zucchini-nlp (Member Author)

The PR passed all the tests, and the slow tests are passing for me locally. Should be good to go.

@amyeroberts (Collaborator)

@zucchini-nlp Great - let's merge! Do you have permission to do so? I can merge if not.

@zucchini-nlp (Member Author)

Just fixed one slow test; will merge when I get all green.

@zucchini-nlp zucchini-nlp merged commit bd9f4d7 into huggingface:main May 15, 2024
24 checks passed
itazap pushed a commit that referenced this pull request May 24, 2024
* add model draft

* update docstring

* add tests

* support image and video as input

* update for better handling of mixed input and clean-up a bit

* bug when mixed inputs & add tests

* Update README.md

Co-authored-by: NielsRogge <48327001+NielsRogge@users.noreply.github.com>

* Merge remote-tracking branch 'upstream/main' into video_llava

* link to abstract of paper in README

* fix test

* fix-copies

* make tests happy

* skip docstest for now

* do not run doctest for now

* Update src/transformers/models/video_llava/processing_video_llava.py

Co-authored-by: amyeroberts <22614925+amyeroberts@users.noreply.github.com>

* Update src/transformers/models/video_llava/image_processing_video_llava.py

Co-authored-by: amyeroberts <22614925+amyeroberts@users.noreply.github.com>

* Update src/transformers/models/video_llava/image_processing_video_llava.py

Co-authored-by: amyeroberts <22614925+amyeroberts@users.noreply.github.com>

* Update src/transformers/models/video_llava/image_processing_video_llava.py

Co-authored-by: amyeroberts <22614925+amyeroberts@users.noreply.github.com>

* Update src/transformers/models/video_llava/image_processing_video_llava.py

Co-authored-by: amyeroberts <22614925+amyeroberts@users.noreply.github.com>

* Update tests/models/video_llava/test_modeling_video_llava.py

Co-authored-by: amyeroberts <22614925+amyeroberts@users.noreply.github.com>

* Update src/transformers/models/video_llava/image_processing_video_llava.py

Co-authored-by: amyeroberts <22614925+amyeroberts@users.noreply.github.com>

* address review comments

* failing tests

* Fix vocab_size in common tests for VLMs

* codestyle

* Update src/transformers/models/video_llava/configuration_video_llava.py

Co-authored-by: amyeroberts <22614925+amyeroberts@users.noreply.github.com>

* Update src/transformers/models/video_llava/configuration_video_llava.py

Co-authored-by: amyeroberts <22614925+amyeroberts@users.noreply.github.com>

* Update src/transformers/models/video_llava/modeling_video_llava.py

Co-authored-by: amyeroberts <22614925+amyeroberts@users.noreply.github.com>

* Update src/transformers/models/video_llava/modeling_video_llava.py

Co-authored-by: amyeroberts <22614925+amyeroberts@users.noreply.github.com>

* Update docs/source/en/model_doc/video_llava.md

Co-authored-by: amyeroberts <22614925+amyeroberts@users.noreply.github.com>

* Update docs/source/en/model_doc/video_llava.md

Co-authored-by: amyeroberts <22614925+amyeroberts@users.noreply.github.com>

* Update src/transformers/models/video_llava/image_processing_video_llava.py

Co-authored-by: amyeroberts <22614925+amyeroberts@users.noreply.github.com>

* Update docs/source/en/model_doc/video_llava.md

Co-authored-by: amyeroberts <22614925+amyeroberts@users.noreply.github.com>

* Update src/transformers/models/video_llava/processing_video_llava.py

Co-authored-by: amyeroberts <22614925+amyeroberts@users.noreply.github.com>

* Update tests/models/video_llava/test_modeling_video_llava.py

Co-authored-by: amyeroberts <22614925+amyeroberts@users.noreply.github.com>

* Update tests/models/video_llava/test_modeling_video_llava.py

Co-authored-by: amyeroberts <22614925+amyeroberts@users.noreply.github.com>

* Update tests/models/video_llava/test_modeling_video_llava.py

Co-authored-by: amyeroberts <22614925+amyeroberts@users.noreply.github.com>

* PR suggestions

* fix-copies

* Update src/transformers/models/video_llava/configuration_video_llava.py

Co-authored-by: amyeroberts <22614925+amyeroberts@users.noreply.github.com>

* Update src/transformers/models/video_llava/configuration_video_llava.py

Co-authored-by: amyeroberts <22614925+amyeroberts@users.noreply.github.com>

* add full example in docs

* clean-up with new model-id

* [run-slow] video_llava

* update docstring

* [run-slow] video_llava

* remove all achive maps

* fix some tests

* test was supposed to be skipped for llava :)

---------

Co-authored-by: NielsRogge <48327001+NielsRogge@users.noreply.github.com>
Co-authored-by: amyeroberts <22614925+amyeroberts@users.noreply.github.com>

Successfully merging this pull request may close these issues:

Video-LLaVA with transformers library (#29640)
7 participants