
Add LLaVa NeXT Video #31252

Merged · 24 commits · Jun 26, 2024
Conversation

@zucchini-nlp (Member) commented Jun 5, 2024

What does this PR do?

Adds a new model for video LLMs, built by tuning LLaVA-NeXT on a video dataset; it is the current SOTA for videos on the VideoMME benchmark.

This PR introduces some changes to how we usually handle interleaved multimodal data. I need @amyeroberts' opinion on whether this is a viable option for these kinds of models. I added a separate VideoProcessor class, which has all the same parameters as the ImageProcessor but different logic. The reason: the ImageProcessor file was already full of extra helper functions, and adding more conditions seemed to degrade readability.

Also, I added a chat template; @NielsRogge, can you verify that it's correct? 😄 (it's in convert_model.py). I will push all templates to the Hub later; currently only the 7B model has a working template.
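
For reference, a minimal sketch of what such a template could look like. The Jinja below is hypothetical, assuming the USER/ASSISTANT turn format used by LLaVA-family checkpoints; the actual template for this model is the one in convert_model.py.

```python
# Hypothetical Jinja chat template in the USER/ASSISTANT style of LLaVA-family
# models; not the exact template shipped with this PR.
CHAT_TEMPLATE = (
    "{% for message in messages %}"
    "{% if message['role'] == 'user' %}"
    "USER: {{ message['content'] }} "
    "{% else %}"
    "ASSISTANT: {{ message['content'] }}{{ eos_token }}"
    "{% endif %}"
    "{% endfor %}"
    "{% if add_generation_prompt %}ASSISTANT:{% endif %}"
)

# Applying it through the tokenizer:
# tokenizer.chat_template = CHAT_TEMPLATE
# prompt = tokenizer.apply_chat_template(
#     [{"role": "user", "content": "<video>\nWhat is happening in this clip?"}],
#     tokenize=False,
#     add_generation_prompt=True,
# )
```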

Fixes #31164

@zucchini-nlp zucchini-nlp marked this pull request as ready for review June 5, 2024 07:51
@zucchini-nlp zucchini-nlp changed the title from "Add LLaVa NeXT Video" to "[WIP] Add LLaVa NeXT Video" Jun 5, 2024
@zucchini-nlp zucchini-nlp marked this pull request as draft June 5, 2024 07:52
@HuggingFaceDocBuilderDev commented

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

@ArthurZucker (Collaborator) left a comment

The import needs to be like this, I believe.

@zucchini-nlp zucchini-nlp marked this pull request as ready for review June 7, 2024 08:04
@zucchini-nlp (Member, Author) commented

ready for review!

@zucchini-nlp zucchini-nlp changed the title from "[WIP] Add LLaVa NeXT Video" to "Add LLaVa NeXT Video" Jun 7, 2024
@amyeroberts (Collaborator) left a comment

Thanks for all the work adding this model!

Most comments are about `# Copied from` statements and diff-file choices.

Review thread on src/transformers/image_utils.py (outdated, resolved).
```python
    raise ValueError(f"Could not make batched video from {videos}")


class LlavaNextVideoVideoProcessor(BaseImageProcessor):
```
Collaborator:

To be consistent with other models, this should really be LlavaNextVideoImageProcessor.

I'd say we should do this for this model, and then in a separate PR we can consider introducing a VideoProcessor class, which we might want to use for new models and as a possible alias for the image processor for the other video models.

The outstanding question would be Video LLaVA, which takes both images and videos. Given the model, I'd say it should probably be a VideoProcessor.

Member Author:

Okay, but this model also accepts both as input. Right now it uses the plain LLaVA-NeXT image processor for images and adds a new class, LlavaNextVideoImageProcessor, for video processing. So for VideoLlava we can separate it into two classes in the same way: one for image and another for video processing.

The only major change, if we do so, is that the arg name for images will be pixel_values and not pixel_values_images.

I will call this one LlavaNextVideoImageProcessor then and try to unify video-image models in a separate PR.
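
For illustration, a rough sketch of that split (hypothetical class and argument names, not this PR's actual processor): a wrapping processor routes each modality to its own processor class and merges the outputs.

```python
# Hypothetical sketch of a combined processor that dispatches images and videos
# to two separate processor classes, as discussed above.
class CombinedProcessorSketch:
    def __init__(self, image_processor, video_processor, tokenizer):
        self.image_processor = image_processor  # e.g. the LLaVA-NeXT image processor
        self.video_processor = video_processor  # e.g. LlavaNextVideoImageProcessor
        self.tokenizer = tokenizer

    def __call__(self, text=None, images=None, videos=None, **kwargs):
        inputs = {}
        if text is not None:
            inputs.update(self.tokenizer(text, **kwargs))
        if images is not None:
            # Images keep the standard arg name, `pixel_values`.
            inputs["pixel_values"] = self.image_processor(images, **kwargs)["pixel_values"]
        if videos is not None:
            # Videos get their own tensor, e.g. `pixel_values_videos`.
            inputs["pixel_values_videos"] = self.video_processor(videos, **kwargs)["pixel_values"]
        return inputs
```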

@NielsRogge (Contributor) left a comment

Some initial comments; in particular, the diff should probably be entirely defined in diff_llava_next_video.py, which then populates the config and modeling files.

Five review threads on docs/source/en/model_doc/llava-next-video.md (outdated, resolved).
@zucchini-nlp (Member, Author) commented Jun 11, 2024

@amyeroberts ready for re-review

  • The diff was not working and wasn't copying content from the parent class (LLaVA-NeXT). I made some changes in the diff-converter file, so it would be nice to get @ArthurZucker's review on that. However, the diff still doesn't work when we have to overwrite the docstring for the config, so I simply made my own new config file. I also tried diff-converting example2; it still doesn't copy the docstring, and I couldn't solve it yet.
  • The VideoProcessor is gone, so now we just have two ImageProcessor classes, one used for each modality, simply to isolate unrelated code into two files.

@ArthurZucker (Collaborator) left a comment

🔥 kudos for using the diff file so well

Review threads on docs/source/en/model_doc/llava-next-video.md and utils/diff_model_converter.py (resolved).
Comment on lines 335 to 337:

```python
# Iterate directly from node.body as there can be property/setters with the same
# name, which are overwritten when we use a dict
for func in original_node.body.body:
    name = func.name.value if hasattr(func, "name") else func
```
Collaborator:

So you mean original_methods.items() is overriding some things?

Otherwise, the goal is indeed to overwrite the methods using func.with_changes!

Member Author:

This happens only when the original class has two methods with identical names, in my case a property and its setter. If we use a dict, only one is retained, since a dict can't hold two keys with the same name.

Not sure if there are cases where directly iterating might cause errors 🤔
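
As a small illustration of the collision (a runnable sketch assuming libcst, which the diff converter is built on):

```python
import libcst as cst

code = '''
class Config:
    @property
    def vocab_size(self):
        return self._vocab_size

    @vocab_size.setter
    def vocab_size(self, value):
        self._vocab_size = value
'''

class_body = cst.parse_module(code).body[0].body.body
funcs = [f for f in class_body if isinstance(f, cst.FunctionDef)]

# Keying by name collapses the property/setter pair: a dict keeps only one entry.
by_name = {f.name.value: f for f in funcs}
print(len(by_name))  # 1 -- the setter silently replaced the property

# Iterating the body directly preserves both definitions.
print(len(funcs))    # 2
```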

Collaborator:

Hahaha, I see!
Then it means we are not taking the decorator into account. Indeed, that is a good catch and was probably affecting @younesbelkada in his Cohere porting.

@amyeroberts (Collaborator) left a comment

Wow - looks great! 🔥

Some comments relating to the diff logic and the generated model file, but overall looks great

Collaborator:

@ArthurZucker I can foresee hitting potential issues when, e.g., we have more than one modeling file, for example the models under data2vec.

Seeing the config file here made me realise this. Rather than trying to handle which imports belong to which file (e.g. configuration_llava_next vs. modeling_llava_next), we could make this simpler (and less prone to error) by having a diff file for each, e.g. diff_modeling_llava_next_video.py and diff_configuration_llava_next_video.py.

Collaborator:

I agree, but there should be an easy way to solve the import problem: basically, paste all imports everywhere and remove what is not used.
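
A rough, stdlib-only sketch of that idea (hypothetical helper; a real converter would also need to handle multi-line and star imports):

```python
import ast

def prune_unused_imports(source: str) -> str:
    """Drop top-level import lines whose bound names are never referenced."""
    used = {node.id for node in ast.walk(ast.parse(source)) if isinstance(node, ast.Name)}
    kept = []
    for line in source.splitlines():
        stripped = line.strip()
        if stripped.startswith(("import ", "from ")):
            aliases = ast.parse(stripped).body[0].names
            names = [(alias.asname or alias.name).split(".")[0] for alias in aliases]
            if not any(name in used for name in names):
                continue  # nothing in the file references this import
        kept.append(line)
    return "\n".join(kept)
```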

Collaborator:

Keeping the config, tokenizer, and whatever else in a single diff file will be simpler for now, IMO!



```python
@dataclass
# Copied from transformers.models.idefics.modeling_idefics.IdeficsCausalLMOutputWithPast with Idefics->LlavaNextVideo
```
Collaborator:

I don't think we want to copy across the `# Copied from` statements from the original file 👀

@zucchini-nlp (Member, Author) commented Jun 13, 2024

Hehe, this is diff-generated, but I'll see if that can be fixed + ping @ArthurZucker for visibility.

UPD: Found how to remove those and can push a commit if needed. BTW, make fix-copies currently doesn't check diff files, so IMO the "Copied from" statements are needed to catch and propagate code changes.

Collaborator:

Yep, we could/should add a check to make fix-copies that compares the hash of the file against the hash of the file generated by the diff converter.
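
A minimal sketch of such a check (the `regenerate` callable is hypothetical; make fix-copies does not do this today):

```python
import hashlib
from pathlib import Path

def file_hash(path: str) -> str:
    return hashlib.sha256(Path(path).read_bytes()).hexdigest()

def diff_file_is_in_sync(diff_path: str, generated_path: str, regenerate) -> bool:
    # `regenerate` wraps the diff converter: it renders `diff_path` into a
    # temporary file and returns that file's path.
    return file_hash(regenerate(diff_path)) == file_hash(generated_path)
```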

@zucchini-nlp (Member, Author) commented Jun 20, 2024

I'll leave this one for the next PR (for whoever wants to add the checks), as it's not related to LLaVA-NeXT :)

Edit: will open an issue so we don't forget.

```python
    Model tester for `LlavaNextVideoForConditionalGeneration`.
    """

    all_model_classes = (LlavaNextVideoForConditionalGeneration,) if is_torch_available() else ()
```
Collaborator:

Do we need to define all_generative_models here?

Member Author:

Same as in another PR: VLMs cannot use GenerationTesterMixin out of the box due to pixel_values, custom generation code, or generation-related code partially living in the modeling file (e.g. expand_attn_mask). It's on my todo list, but it requires more unification of VLMs first.

Review thread on utils/diff_model_converter.py (outdated, resolved).
@zucchini-nlp (Member, Author) commented

@amyeroberts this one is also ready for a final review.

@amyeroberts (Collaborator) left a comment

Great work - thanks for adding the model and for testing the diff converter!

@amyeroberts (Collaborator) commented

@zucchini-nlp The only thing left before merging is a run of the slow tests for the model, with a [run_slow] llava-next-video commit message.

zucchini-nlp and others added 3 commits June 21, 2024 21:04
…_video.py

Co-authored-by: amyeroberts <22614925+amyeroberts@users.noreply.github.com>
Co-authored-by: amyeroberts <22614925+amyeroberts@users.noreply.github.com>
@zucchini-nlp (Member, Author) commented

Hmm, I'm not sure the slow tests are working; they were skipped here and in BlipVideo. I think the commit message format is correct.

@amyeroberts (Collaborator) commented

> Hmm, I'm not sure the slow tests are working; they were skipped here and in BlipVideo. I think the commit message format is correct.

Ah, we didn't have the run-slow label on the PR, which might have caused it to be skipped. @zucchini-nlp Could you try again?

@amyeroberts amyeroberts mentioned this pull request Jun 24, 2024
@zucchini-nlp zucchini-nlp merged commit e71f286 into huggingface:main Jun 26, 2024
26 checks passed