
Nice work! Does Mantis have an image separator when sending to the LLM? #4

Closed
lucasjinreal opened this issue May 18, 2024 · 14 comments

@lucasjinreal

lucasjinreal commented May 18, 2024

Hi, I wanted to ask: does Mantis use an image separator between images when sending them to the LLM? From what I can tell, LLaVA doesn't have one, and the data used in Mantis doesn't provide a separator string either.

Also, which way do you think is better, especially if we consider video frame input as well?

@jdf-prog
Collaborator

jdf-prog commented May 22, 2024

Yes! Mantis uses an image separator between images, but in text form. We automatically prepend `<BOI>` and append `<EOI>` to each image placeholder in the text. This logic lives in `MllavaProcessor`, so you don't need to add the separators manually.

Personally, I would say adding image separators is better. There is an ablation study on this in the Co-Instruct paper (https://arxiv.org/abs/2402.16641), Table 8, where you can see that adding separators really improves performance.

For video frames, Mantis follows the same use of separators, and judging by our results on MVBench, using image separators achieves good performance (on par with VideoChat, which is designed specifically for video understanding). Since video frames can also be seen as a kind of multi-image input, this suggests that we may not need a separate video encoder; a single image encoder can do well.

Feel free to ask more questions!

@lucasjinreal
Author

Hi, thanks for the reply. This is an interesting conclusion. How does Mantis add the separator, for instance, if a single sample has 3 frames? I'm just wondering because LLaVA's `<im_start>Image1Image2Image3<im_end>` doesn't seem to really work.

Regarding the image encoder performing well on video frames: does it use the same separators as for images? And how is the problem of the massive number of concatenated frame tokens solved? (If using an MLP projector, the output tokens from even 5 images would blow up the context length.)

@jdf-prog
Collaborator

We actually describe it in the paper. Every image token `<image>` is transformed into `(image {i}, <BOI><image><EOI>)`, where `<BOI>` is the begin-of-image token and `<EOI>` is the end-of-image token.

For video tasks, we process each video frame the same way as a separate image. So yes, we indeed add the above image separators and denotations to each video frame.
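To make that concrete, here is a minimal sketch of the placeholder rewriting described above; the function name and the exact template string are illustrative assumptions, not the actual `MllavaProcessor` code:

```python
# Illustrative sketch only: the real logic lives in MllavaProcessor.
def add_image_separators(text: str, image_token: str = "<image>") -> str:
    parts = text.split(image_token)
    out = parts[0]
    for i, rest in enumerate(parts[1:], start=1):
        # Wrap each placeholder with an index and begin/end-of-image tokens.
        out += f"(image {i}, <BOI>{image_token}<EOI>)" + rest
    return out

# Three video frames in one sample are treated like three separate images:
prompt = "<image><image><image> What happens across these frames?"
print(add_image_separators(prompt))
# (image 1, <BOI><image><EOI>)(image 2, <BOI><image><EOI>)(image 3, <BOI><image><EOI>) What happens across these frames?
```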

About the image token cost, that's indeed a problem. However, Mantis models are trained with an 8192 context length. If each image costs 576 tokens (in the CLIP and SigLIP cases), the model can accept at most 14 images.

Recently, we released Mantis-Idefics2. It uses a perceiver resampler to compress each image into a fixed 64 tokens, which is much more efficient than the MLP mapping. With the same 8192 context length, it can accept at most 128 images. With the help of FlashAttention-2, they run very efficiently.
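As a quick back-of-the-envelope check of those numbers (just the arithmetic, not code from the repo):

```python
# Image budget under an 8192-token context window.
context_length = 8192

tokens_per_image_mlp = 576       # CLIP/SigLIP patch tokens through an MLP projector
tokens_per_image_perceiver = 64  # Idefics2-style perceiver resampler output

print(context_length // tokens_per_image_mlp)        # 14 images at most
print(context_length // tokens_per_image_perceiver)  # 128 images at most
```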

Currently, we have not optimized in the video direction. Mantis-Video might also be one of our future works. Stay tuned.

@lucasjinreal
Author

lucasjinreal commented May 24, 2024

@jdf-prog Hi, Mantis-Idefics2 uses NaViT and a maximum 980 input resolution. During testing, do you also resize to a maximum of 980, or use the original-size input?

Have you also used image slicing, the same as Idefics2?

Also, how was the model trained? With Idefics2 as the pretrained model?

Also, I noticed that Mantis-Idefics2 got 51.8 on MVBench for video. That's not bad, but still not great given the image-side improvements; the best 7B baseline on MVBench is 54.85. Do you have any thoughts on this?

@jdf-prog
Collaborator

Thanks for the questions!
During the training of Mantis-Idefics2, we disabled the image splitting (slicing) by default to make the image tokens more efficient. We also disabled image splitting during evaluation.
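For reference, a hedged sketch of how image splitting can be disabled with the Hugging Face Idefics2 processor; the checkpoint name here is the base Idefics2 model and is used only as an example:

```python
from transformers import AutoProcessor

# do_image_splitting controls Idefics2's sub-image slicing; turning it off keeps
# one image = one set of visual tokens, as described above.
processor = AutoProcessor.from_pretrained(
    "HuggingFaceM4/idefics2-8b",
    do_image_splitting=False,
)
```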

For the model initialization, yes, we directly use Idefics2 as the pretrained model.

Well, it's worth noting that Mantis-Instruct contains only 14K video-associated samples, which is much smaller than the data used to train other video understanding models, such as Valley (720K) and Video-ChatGPT (100K). Mantis-Video might be our future work.

@lucasjinreal
Author

lucasjinreal commented May 25, 2024

Hi, I still have some questions I'd like to discuss.

You mentioned that training used Idefics2 as the pretrained model. That raises a serious question, and possibly my biggest concern:

If you use Idefics2 as the pretrained model (whether only the pretrained ViT or the model as a whole), the ViT part is already aligned with a particular LLM, Mistral in this case. So no matter how you train, your upper bound would be Idefics2 itself. I mean, you can't surpass it, since the ViT has already been trained to fit that LLM, unless you take the ViT and train a new, bigger LLM.

From the scores in your table, I also noticed some metrics dropped a lot compared with Idefics2 itself, and not only TextVQA, even considering that you disabled the slicing strategy.

Do you think further training on a well-trained VLM can achieve better results? What if we only use its ViT part and train everything else?

@jdf-prog
Collaborator

jdf-prog commented May 27, 2024

@lucasjinreal Thanks for the questions!

The first question sounds like a problem of continued fine-tuning (continued pretraining). Indeed, knowledge forgetting in continual training cannot be avoided, and the most intuitive way to mitigate it is to replay some previously trained data. For Idefics2, that would be HuggingFaceM4/the_cauldron. We did not do this, for the simplicity of our experiments, but any attempt at this from the community is welcome. There are also many papers about continual learning in the community; maybe this survey can help you: https://arxiv.org/abs/2302.00487

About the claim that "the upper bound would be Idefics2 itself", I personally have doubts. I think it actually depends on the task. There must be some trade-offs if we want to optimize a model for different abilities. In our case, we are trading off multi-image ability against single-image ability. Mantis-Idefics2 indeed improves multi-image ability a lot, at an acceptable cost (2% to 6%) of single-image degradation. I would say this is an interesting observation that is worth further investigation.

"Do u think using the well-trained VLM can achieve better results for further training on it?". Still, there is always a cost, for training LLMs, it's a trade-off of old knowledge and new knowledge. The answer actually depends on what kind of metrics you are using to evaluate.

Thanks again for your deep thoughts about Mantis. We indeed have many problems to solve in the multi-image VLM setting. Feel free to ask any other questions; I'd be happy to discuss.

@lucasjinreal
Author

@jdf-prog Hi, thanks for your deep insights.

It looks like building on a strong baseline for continued training can also be very beneficial. As a next step, might you consider switching to MiniCPM as your base? It has a large margin and even surpasses Idefics2, since it uses Llama 3 as the language model.

If so, what would the plan be? If not, what would the next move be to push the margin further?

@jdf-prog
Collaborator

jdf-prog commented Jun 4, 2024

MiniCPM is a great baseline! We will include it as both a baseline and a backbone in the future. We have several directions for Mantis's future work; we are actively working on them and aim to build Mantis as a brand. Stay tuned!

@lucasjinreal
Author

Hello, I have to say, VideoChat2 has now reached 60 points on MVBench....

@wenhuchen
Contributor

> Hello, I have to say, VideoChat2 has now reached 60 points on MVBench....

Yes, we are training Mantis-Video now to specifically focus on video tasks. Stay tuned!

@patrick-tssn

> Hello, I have to say, VideoChat2 has now reached 60 points on MVBench....

I hope this message finds you well. I have a question out of curiosity: How do you compare methods that utilize different training data and LLM backbones?

@yeppp27

yeppp27 commented Jul 16, 2024

Hi! Thanks for your great work~ I want to ask: where is the interleaved text-image processing with the image separator? I didn't find it in data.py.

@jdf-prog
Collaborator

It's written in the model's processor file; see this function.

@jdf-prog jdf-prog closed this as completed Aug 5, 2024