Add support for BLIP and GIT in image-to-text and VQA pipelines #21110

Open
NielsRogge opened this issue Jan 13, 2023 · 25 comments

Comments

@NielsRogge
Contributor

Feature request

BLIP and GIT are two recent additions to the library, providing state-of-the-art performance for tasks like image captioning and visual question answering (VQA). GIT is even capable of video captioning and video QA.

Hence it makes sense to support them in our image-to-text and VQA pipelines.
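
For reference, once pipeline support is in place, usage should look roughly like the sketch below (the BLIP checkpoint names are existing Hub checkpoints; the exact pipeline behavior is what this issue is about):

from transformers import pipeline

# Image captioning via the image-to-text pipeline
captioner = pipeline("image-to-text", model="Salesforce/blip-image-captioning-base")
print(captioner("http://images.cocodataset.org/val2017/000000039769.jpg"))

# Visual question answering via the VQA pipeline
vqa = pipeline("visual-question-answering", model="Salesforce/blip-vqa-base")
print(vqa(image="http://images.cocodataset.org/val2017/000000039769.jpg", question="How many cats are there?"))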

Motivation

Having support for better models in the pipelines is highly desirable!

See also a request for it here: https://discuss.huggingface.co/t/support-for-different-models-in-text-to-image-pipeline/29504

Your contribution

I can assist in adding support; see #18446 for a very similar case.

@atturaioe
Contributor

Hi @NielsRogge, can I work on it?

@NielsRogge
Contributor Author

Sure @atturaioe feel free to start working on it

@02shanks

@NielsRogge, can I start working on this issue? I'd like to contribute.

@pikaduck

pikaduck commented Feb 1, 2023

@NielsRogge is this issue still open for contribution?

@NielsRogge
Contributor Author

Yes

@RaghavPrabhakar66
Contributor

@NielsRogge If nobody is working on it, I would like to pick up the issue.

@strankid

I would like to pick up the issue if it's still available.

@Tanmaypatil123
Contributor

@NielsRogge, is this issue still open for contribution? I would like to work on it.

@NielsRogge
Contributor Author

Support for BLIP in the image-to-text pipeline has been added in #21904. GIT can be added as explained in this comment, feel free to open a PR.

Support for the VQA pipeline still needs to be added for both models; contributions are welcome there as well.

@sushmanthreddy
Contributor

@NielsRogge, can I work on this issue?

@marechaux

Hello @NielsRogge!

I would like to work on this issue (add support for VQA to GIT model) as a first contribution.

But before I start, I have a question:

Currently, the only model implementing the VQA pipeline is ViltForQuestionAnswering, which solves the task as classification.

However, the GIT paper says:

For VQA, the input question is treated as a text prefix, and the answer is generated in an auto-regressive way. Furthermore, we present a new generation-based scheme for ImageNet classification, where the predicted labels come directly from our generative model without pre-defining the vocabulary.

So I wonder: should I implement it as a classifier, or should I follow the paper?

Thanks

@NielsRogge
Contributor Author

Hi @marechaux, we will need to implement both approaches in the VQA pipeline. ViLT and GIT indeed solve VQA entirely differently (ViLT is a classifier, whereas GIT is a generative GPT-like model).
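
For reference, a minimal sketch of the two approaches outside of the pipeline, assuming the public dandelin/vilt-b32-finetuned-vqa and microsoft/git-base-textvqa checkpoints (the pipeline will need to dispatch between the two):

import requests
import torch
from PIL import Image
from transformers import AutoProcessor, GitForCausalLM, ViltForQuestionAnswering, ViltProcessor

url = "http://images.cocodataset.org/val2017/000000039769.jpg"
image = Image.open(requests.get(url, stream=True).raw)
question = "How many cats are there?"

# 1) ViLT: VQA as classification over a fixed answer vocabulary
vilt_processor = ViltProcessor.from_pretrained("dandelin/vilt-b32-finetuned-vqa")
vilt_model = ViltForQuestionAnswering.from_pretrained("dandelin/vilt-b32-finetuned-vqa")
inputs = vilt_processor(image, question, return_tensors="pt")
logits = vilt_model(**inputs).logits
print(vilt_model.config.id2label[logits.argmax(-1).item()])

# 2) GIT: VQA as auto-regressive generation with the question as a text prefix
git_processor = AutoProcessor.from_pretrained("microsoft/git-base-textvqa")
git_model = GitForCausalLM.from_pretrained("microsoft/git-base-textvqa")
pixel_values = git_processor(images=image, return_tensors="pt").pixel_values
question_ids = git_processor(text=question, add_special_tokens=False).input_ids
input_ids = torch.tensor([[git_processor.tokenizer.cls_token_id] + question_ids])
generated_ids = git_model.generate(pixel_values=pixel_values, input_ids=input_ids, max_length=50)
print(git_processor.batch_decode(generated_ids, skip_special_tokens=True))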

@skiss10

skiss10 commented May 9, 2023

Support for BLIP in the image-to-text pipeline has been added in #21904. GIT can be added as explained in this comment, feel free to open a PR.

Support for the VQA pipeline still needs to be added for both models; contributions are welcome there as well.

Hey @NielsRogge, took a shot at this. Am I correct in understanding that the ideal implementation of "microsoft/git-base" in the image-to-text pipeline would look something like this?

from transformers import AutoProcessor, GitForVision2Seq, pipeline

processor = AutoProcessor.from_pretrained("microsoft/git-base")
model = GitForVision2Seq.from_pretrained("microsoft/git-base")

pipe = pipeline("image-to-text", model=model, image_processor=processor.image_processor, tokenizer=processor.tokenizer)
print(pipe("https://www.wideopenpets.com/wp-content/uploads/sites/6/2021/12/Popular-Horse-Feature-Image.png"))

If so, I got this to work by:

  1. Adding the GitForVision2Seq class and making it available for imports / in MODEL_FOR_VISION_2_SEQ_MAPPING_NAMES
  2. Updating src/transformers/models/git/processing_git.py to use a custom GITImageProcessor. This GITImageProcessor is an exact copy of the CLIPImageProcessor that GitProcessor already wraps; the only difference is how the GITImageProcessor.preprocess method returns data when called by ImageToTextPipeline.preprocess (basically adding an input_ids key with a value of None).

So the GITImageProcessor.preprocess method ends with this:

data = {"pixel_values": images} 
return_data = BatchFeature(data=data, tensor_type=return_tensors) 
return_data['input_ids'] = None 
return return_data

rather than the CLIPImageProcessor.preprocess method, which returns this:

data = {"pixel_values": images} 
return BatchFeature(data=data, tensor_type=return_tensors) 

Curious about your thoughts on this approach. How would this affect other GIT image processing workflows (e.g. VQA)? Could we use a conditional to account for those?

@NielsRogge
Contributor Author

Thanks for taking a stab at this. I'm fine with adding a GitForVision2Seq (as proposed by @Narsil); however, it'd be great not to have to add a custom GITImageProcessor. What's the reason for adding it? Is it only to include "input_ids" set to None?

@skiss10

skiss10 commented May 10, 2023

Exactly this: 'only to include "input_ids" which are set to None'.

I see how adding an entirely new GITImageProcessor seems excessive when all it would do is add the input_ids: None key-value pair to the data returned from the .preprocess method.

As you describe in #21514 (comment), once we hit the preprocess method in ImageToTextPipeline and map the model to GIT, the model_inputs are returned (via the CLIPImageProcessor, through the GitProcessor in processing_git.py) without the input_ids key. So AFAIK, the best we can do, without changing the CLIPImageProcessor class itself, is to modify the return value of its preprocess method by replicating CLIPImageProcessor, rebranding it as GITImageProcessor, and modifying the .preprocess method.

Let me know if that works or if you feel there is a better approach. Is the idea that there would be some way to do this within GitForVision2Seq?

As an aside, I read some best practices for working in the transformers library (https://huggingface.co/transformers/v4.10.1/add_new_model.html#general-overview-of-transformers). Would it be preferable to copy the entire CLIPImageProcessor class as GITImageProcessor within processing_git.py, or to do something more like this within processing_git.py?

from transformers import CLIPImageProcessor


class GITImageProcessor(CLIPImageProcessor):
    def preprocess(self, *args, **kwargs):
        # Call the original CLIP preprocess method
        return_data = super().preprocess(*args, **kwargs)

        # Add an 'input_ids' key so ImageToTextPipeline.preprocess finds it
        return_data['input_ids'] = None

        return return_data

@NielsRogge
Contributor Author

Hmm, I don't get why input_ids need to be set to None. Could you clarify?

This example shows that you only need to pass pixel_values to the generate method to do image captioning.
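
In other words, something like this minimal sketch (using the public microsoft/git-base-coco checkpoint):

import requests
from PIL import Image
from transformers import AutoProcessor, GitForCausalLM

processor = AutoProcessor.from_pretrained("microsoft/git-base-coco")
model = GitForCausalLM.from_pretrained("microsoft/git-base-coco")

url = "http://images.cocodataset.org/val2017/000000039769.jpg"
image = Image.open(requests.get(url, stream=True).raw)

# Only pixel_values are passed; the caption is generated auto-regressively
pixel_values = processor(images=image, return_tensors="pt").pixel_values
generated_ids = model.generate(pixel_values=pixel_values, max_length=50)
print(processor.batch_decode(generated_ids, skip_special_tokens=True)[0])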

@jprivera44
Contributor

Hello, it seems that BLIP support in the image-to-text pipeline has been completed, but the VQA pipeline for both BLIP and GIT is not complete, along with the image-to-text pipeline for GIT. @marechaux, how is the VQA for GIT going?

@jucamohedano

Hi! I'm also interested in helping out if we can divide the work :)

@Tanmaypatil123
Contributor

Hey @NielsRogge, I was working on the VQA pipeline for BLIP, but I am confused about how to pass pixel_values to the _forward method in VisualQuestionAnsweringPipeline (src), because BLIP requires pixel values and those are generated by the preprocessor. Sorry if this is a silly question; this is my first open-source contribution.

@NielsRogge
Contributor Author

Hi @Tanmaypatil123, there's already this PR: #23348. Feel free to take it over/improve it.

@PinakShome

Hello, can I work on this?

@Vipul-Pandey-22

Hi team, can I start working on it?

@jpizarrom
Contributor

Hi @NielsRogge, I would like to try to add GIT for VQA as my first contribution, is that ok?
I looked at #23348, and I want to know if it is fine to return the full generated text. I made it work locally, so I could prepare a PR if no one else is working on this.

I believe the input_ids (or their length) could be used in the postprocess of VisualQuestionAnsweringPipeline to remove the prompt/prefix, like in TextGenerationPipeline, but it will require a refactor of _forward in VisualQuestionAnsweringPipeline to also return the input_ids.

e.g.

prompt_length = len(
    self.tokenizer.decode(
        input_ids[0],
        skip_special_tokens=True,
        clean_up_tokenization_spaces=clean_up_tokenization_spaces,
    )
)
if return_type == ReturnType.FULL_TEXT:
    all_text = prompt_text + text[prompt_length:]
else:
    all_text = text[prompt_length:]
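
And a rough sketch (not an actual implementation) of what the _forward refactor could look like, assuming something like PreTrainedModel.can_generate is used to dispatch between generative and classification models:

from transformers import VisualQuestionAnsweringPipeline


class GenerativeVqaPipelineSketch(VisualQuestionAnsweringPipeline):
    # Hypothetical refactor, for illustration only: also return the input_ids so
    # that postprocess can strip the question prefix from the generated text.
    def _forward(self, model_inputs, **generate_kwargs):
        if self.model.can_generate():
            input_ids = model_inputs.get("input_ids")
            generated_ids = self.model.generate(**model_inputs, **generate_kwargs)
            return {"generated_ids": generated_ids, "input_ids": input_ids}
        # Classification-style models (e.g. ViLT) keep the existing forward path
        return self.model(**model_inputs)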

@astern21

Is this still open for contribution? Would love to help out.

@jpizarrom
Contributor

Hi @astern21, I started a draft PR, but didn't get to finish it, and now I'm not really working on it.
