Add support for BLIP and GIT in image-to-text and VQA pipelines #21110

Open
NielsRogge opened this issue Jan 13, 2023 · 25 comments

Comments

@NielsRogge
Contributor

Feature request

BLIP and GIT are two recent additions to the library, providing state-of-the-art performance for tasks like image captioning and visual question answering (VQA). GIT is even capable of video captioning and video QA.

Hence it makes sense to support them in our image-to-text and VQA pipelines.
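
For reference, once pipeline support is in place, usage should look roughly like the sketch below (the BLIP checkpoint names are existing Hub checkpoints; the exact pipeline behavior is what this issue is about):

from transformers import pipeline

# Image captioning via the image-to-text pipeline
captioner = pipeline("image-to-text", model="Salesforce/blip-image-captioning-base")
print(captioner("http://images.cocodataset.org/val2017/000000039769.jpg"))

# Visual question answering via the VQA pipeline
vqa = pipeline("visual-question-answering", model="Salesforce/blip-vqa-base")
print(vqa(image="http://images.cocodataset.org/val2017/000000039769.jpg", question="How many cats are there?"))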

Motivation

Having support for better models in the pipelines is highly desirable!

See also a request for it here: https://discuss.huggingface.co/t/support-for-different-models-in-text-to-image-pipeline/29504

Your contribution

I can assist in adding support; see #18446 for a very similar case.

@atturaioe
Contributor

Hi @NielsRogge, can I work on it?

@NielsRogge
Contributor Author

Sure @atturaioe feel free to start working on it

@02shanks

@NielsRogge, can I start working on this issue? I'd like to contribute.

@pikaduck

pikaduck commented Feb 1, 2023

@NielsRogge is this issue still open for contribution?

@NielsRogge
Contributor Author

Yes

@RaghavPrabhakar66
Contributor

@NielsRogge If nobody is working on it, I would like to pick up the issue.

@strankid

I would like to pick up the issue if it's still available.

@Tanmaypatil123
Contributor

@NielsRogge, is this issue still open for contribution? I would like to work on it.

@NielsRogge
Contributor Author

Support for BLIP in the image-to-text pipeline has been added in #21904. GIT can be added as explained in this comment, feel free to open a PR.

Support for the VQA pipeline still needs to be added for both models; contributions are welcome there as well.

@sushmanthreddy
Contributor

@NielsRogge, can I work on this issue?

@marechaux

Hello @NielsRogge!

I would like to work on this issue (add support for VQA to GIT model) as a first contribution.

But before I start, I have a question:

Currently, the only model implementing the VQA pipeline is ViltForQuestionAnswering, which solves the task as classification.

However, the GIT paper says:

For VQA, the input question is treated as a text prefix, and the answer is generated in an auto-regressive way. Furthermore, we present a new generation-based scheme for ImageNet classification, where the predicted labels come directly from our generative model without pre-defining the vocabulary.

So I wonder: should I implement it as a classifier, or should I follow the paper?

Thanks

@NielsRogge
Contributor Author

Hi @marechaux, we will need to implement both approaches in the VQA pipeline. ViLT and GIT indeed solve VQA entirely differently (ViLT is a classifier, whereas GIT is a generative GPT-like model).
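
For reference, a minimal sketch of the two approaches outside of the pipeline, assuming the public dandelin/vilt-b32-finetuned-vqa and microsoft/git-base-textvqa checkpoints (the pipeline will need to dispatch between the two):

import requests
import torch
from PIL import Image
from transformers import AutoProcessor, GitForCausalLM, ViltForQuestionAnswering, ViltProcessor

url = "http://images.cocodataset.org/val2017/000000039769.jpg"
image = Image.open(requests.get(url, stream=True).raw)
question = "How many cats are there?"

# 1) ViLT: VQA as classification over a fixed answer vocabulary
vilt_processor = ViltProcessor.from_pretrained("dandelin/vilt-b32-finetuned-vqa")
vilt_model = ViltForQuestionAnswering.from_pretrained("dandelin/vilt-b32-finetuned-vqa")
inputs = vilt_processor(image, question, return_tensors="pt")
logits = vilt_model(**inputs).logits
print(vilt_model.config.id2label[logits.argmax(-1).item()])

# 2) GIT: VQA as auto-regressive generation with the question as a text prefix
git_processor = AutoProcessor.from_pretrained("microsoft/git-base-textvqa")
git_model = GitForCausalLM.from_pretrained("microsoft/git-base-textvqa")
pixel_values = git_processor(images=image, return_tensors="pt").pixel_values
question_ids = git_processor(text=question, add_special_tokens=False).input_ids
input_ids = torch.tensor([[git_processor.tokenizer.cls_token_id] + question_ids])
generated_ids = git_model.generate(pixel_values=pixel_values, input_ids=input_ids, max_length=50)
print(git_processor.batch_decode(generated_ids, skip_special_tokens=True))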

@skiss10

skiss10 commented May 9, 2023

Support for BLIP in the image-to-text pipeline has been added in #21904. GIT can be added as explained in this comment, feel free to open a PR.

Support for the VQA pipeline still needs to be added for both models; contributions are welcome there as well.

Hey @NielsRogge, took a shot at this. Am I correct in understanding that the ideal implementation of "microsoft/git-base" in the image-to-text pipeline would look something like this?

from transformers import AutoProcessor, GitForVision2Seq, pipeline

processor = AutoProcessor.from_pretrained("microsoft/git-base")
model = GitForVision2Seq.from_pretrained("microsoft/git-base")

pipe = pipeline("image-to-text", model=model, image_processor=processor.image_processor, tokenizer=processor.tokenizer)
print(pipe("https://www.wideopenpets.com/wp-content/uploads/sites/6/2021/12/Popular-Horse-Feature-Image.png"))

If so, I got this to work by:

  1. Adding the GitForVision2Seq class and making it available for imports / in MODEL_FOR_VISION_2_SEQ_MAPPING_NAMES
  2. Updating src/transformers/models/git/processing_git.py to use a custom GITImageProcessor. This GITImageProcessor is an exact copy of the CLIPImageProcessor that GitProcessor already wraps; the only difference is how the GITImageProcessor.preprocess method returns data when called by ImageToTextPipeline.preprocess (basically adding an input_ids key with a value of None).

So the GITImageProcessor.preprocess method ends with this:

data = {"pixel_values": images} 
return_data = BatchFeature(data=data, tensor_type=return_tensors) 
return_data['input_ids'] = None 
return return_data

rather than the CLIPImageProcessor.preprocess method, which returns this:

data = {"pixel_values": images} 
return BatchFeature(data=data, tensor_type=return_tensors) 

Curious about your thoughts on this approach. How would this affect other GIT image processing workflows (e.g. VQA)? Could we use a conditional to account for those?

@NielsRogge
Contributor Author

Thanks for taking a stab at this. I'm fine with adding a GitForVision2Seq (as proposed by @Narsil); however, it'd be great not to have to add a custom GITImageProcessor. What's the reason for adding it? Is it only to include "input_ids" set to None?

@skiss10

skiss10 commented May 10, 2023

Exactly this: 'only to include "input_ids" which are set to None'.

I see how adding an entirely new GITImageProcessor seems excessive when all it would do is add the input_ids: None key-value pair to the data returned from the .preprocess method.

As you describe in #21514 (comment), once we hit the preprocess method in ImageToTextPipeline and map the model to GIT, the model_inputs are returned (via the CLIPImageProcessor, through the GitProcessor in processing_git.py) without the input_ids key. So AFAIK, the best we can do, without changing the CLIPImageProcessor class itself, is to modify the return value of its preprocess method by replicating CLIPImageProcessor, rebranding it as GITImageProcessor, and modifying the .preprocess method.

Let me know if that works or if you feel there is a better approach. Is the idea that there would be some way to do this within GitForVision2Seq?

As an aside, I read some best practices for working in the transformers library (https://huggingface.co/transformers/v4.10.1/add_new_model.html#general-overview-of-transformers). Would it be preferable to copy the entire CLIPImageProcessor class as GITImageProcessor within processing_git.py, or to do something more like this within processing_git.py?

from transformers import CLIPImageProcessor


class GITImageProcessor(CLIPImageProcessor):
    def preprocess(self, *args, **kwargs):
        # Call the original CLIP preprocess method
        return_data = super().preprocess(*args, **kwargs)

        # Add an 'input_ids' key so ImageToTextPipeline.preprocess finds it
        return_data['input_ids'] = None

        return return_data

@NielsRogge
Contributor Author

Hmm, I don't get why input_ids need to be set to None. Could you clarify?

This example shows that you only need to pass pixel_values to the generate method to do image captioning.
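
In other words, something like this minimal sketch (using the public microsoft/git-base-coco checkpoint):

import requests
from PIL import Image
from transformers import AutoProcessor, GitForCausalLM

processor = AutoProcessor.from_pretrained("microsoft/git-base-coco")
model = GitForCausalLM.from_pretrained("microsoft/git-base-coco")

url = "http://images.cocodataset.org/val2017/000000039769.jpg"
image = Image.open(requests.get(url, stream=True).raw)

# Only pixel_values are passed; the caption is generated auto-regressively
pixel_values = processor(images=image, return_tensors="pt").pixel_values
generated_ids = model.generate(pixel_values=pixel_values, max_length=50)
print(processor.batch_decode(generated_ids, skip_special_tokens=True)[0])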

@jprivera44
Contributor

Hello, it seems that BLIP support in the image-to-text pipeline has been completed, but the VQA pipeline for both BLIP and GIT is not complete, along with the image-to-text pipeline for GIT. @marechaux, how is the VQA for GIT going?

@jucamohedano

Hi! I'm also interested in helping out if we can divide the work :)

@Tanmaypatil123
Contributor

Hey @NielsRogge, I was working on the VQA pipeline for BLIP, but I am confused about how to pass pixel_values to the _forward method in VisualQuestionAnsweringPipeline (src), because BLIP requires pixel values and those are generated by the preprocessor. Sorry if this is a silly question; this is my first open-source contribution.

@NielsRogge
Contributor Author

Hi @Tanmaypatil123, there's already this PR: #23348. Feel free to take it over/improve it.

@PinakShome

Hello, can I work on this?

@Vipul-Pandey-22

Hi team, can I start working on it?

@jpizarrom
Contributor

Hi @NielsRogge, I would like to try to add GIT for VQA as my first contribution, is that ok?
I looked at #23348, and I want to know if it is fine to return the full generated text. I made it work locally, so I could prepare a PR if no one else is working on this.

I believe the input_ids (or their length) could be used in the postprocess of VisualQuestionAnsweringPipeline to remove the prompt/prefix, like in TextGenerationPipeline, but it will require a refactor of _forward in VisualQuestionAnsweringPipeline to also return the input_ids.

e.g.

prompt_length = len(
    self.tokenizer.decode(
        input_ids[0],
        skip_special_tokens=True,
        clean_up_tokenization_spaces=clean_up_tokenization_spaces,
    )
)
if return_type == ReturnType.FULL_TEXT:
    all_text = prompt_text + text[prompt_length:]
else:
    all_text = text[prompt_length:]
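
And a rough sketch (not an actual implementation) of what the _forward refactor could look like, assuming something like PreTrainedModel.can_generate is used to dispatch between generative and classification models:

from transformers import VisualQuestionAnsweringPipeline


class GenerativeVqaPipelineSketch(VisualQuestionAnsweringPipeline):
    # Hypothetical refactor, for illustration only: also return the input_ids so
    # that postprocess can strip the question prefix from the generated text.
    def _forward(self, model_inputs, **generate_kwargs):
        if self.model.can_generate():
            input_ids = model_inputs.get("input_ids")
            generated_ids = self.model.generate(**model_inputs, **generate_kwargs)
            return {"generated_ids": generated_ids, "input_ids": input_ids}
        # Classification-style models (e.g. ViLT) keep the existing forward path
        return self.model(**model_inputs)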

@astern21

Is this still open for contribution? Would love to help out.

@jpizarrom
Contributor

Hi @astern21, I started a draft PR, but didn't get to finish it, and now I'm not really working on it.
