
Nice work! Does Mantis have an image separator when sending to the LLM? #4

Closed
lucasjinreal opened this issue May 18, 2024 · 14 comments

@lucasjinreal

lucasjinreal commented May 18, 2024

Hi, I wanted to ask: does Mantis use an image separator between images when sending them to the LLM? From what I can tell, LLaVA doesn't have one, and the data used in Mantis doesn't provide a separator string either.

Also, which way do you think is better, especially if we consider video frame input as well?

@jdf-prog
Collaborator

jdf-prog commented May 22, 2024

Yes! Mantis uses an image separator between images, but in text form. We automatically prepend `<BOI>` and append `<EOI>` to each image placeholder in the text. This logic lives in `MllavaProcessor`, so you don't need to add the separators manually.

Personally, I would say adding image separators is better. There is an ablation study on this in the Co-Instruct paper (https://arxiv.org/abs/2402.16641), Table 8, where you can see that adding separators really improves performance.

For video frames, Mantis follows the same use of separators, and judging by our results on MVBench, using image separators achieves good performance (on par with VideoChat, which is designed specifically for video understanding). Since video frames can also be seen as a kind of multi-image input, this suggests that we may not need a separate video encoder; a single image encoder can do well.

Feel free to ask more questions!

@lucasjinreal
Author

Hi, thanks for the reply. This is an interesting conclusion. How does Mantis add the separator, for instance, if a single sample has 3 frames? I'm just wondering because LLaVA's `<im_start>Image1Image2Image3<im_end>` doesn't seem to really work.

Regarding the image encoder performing well on video frames: does it use the same separators as for images? And how is the problem of the massive number of concatenated frame tokens solved? (If using an MLP projector, the output tokens from even 5 images would blow up the context length.)

@jdf-prog
Collaborator

We actually describe it in the paper. Every image token `<image>` is transformed into `(image {i}, <BOI><image><EOI>)`, where `<BOI>` is the begin-of-image token and `<EOI>` is the end-of-image token.

For video tasks, we process each video frame the same way as a separate image. So yes, we indeed add the above image separators and denotations to each video frame.
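To make that concrete, here is a minimal sketch of the placeholder rewriting described above; the function name and the exact template string are illustrative assumptions, not the actual `MllavaProcessor` code:

```python
# Illustrative sketch only: the real logic lives in MllavaProcessor.
def add_image_separators(text: str, image_token: str = "<image>") -> str:
    parts = text.split(image_token)
    out = parts[0]
    for i, rest in enumerate(parts[1:], start=1):
        # Wrap each placeholder with an index and begin/end-of-image tokens.
        out += f"(image {i}, <BOI>{image_token}<EOI>)" + rest
    return out

# Three video frames in one sample are treated like three separate images:
prompt = "<image><image><image> What happens across these frames?"
print(add_image_separators(prompt))
# (image 1, <BOI><image><EOI>)(image 2, <BOI><image><EOI>)(image 3, <BOI><image><EOI>) What happens across these frames?
```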

About the image token cost, that's indeed a problem. However, Mantis models are trained with an 8192 context length. If each image costs 576 tokens (in the CLIP and SigLIP cases), the model can accept at most 14 images.

Recently, we released Mantis-Idefics2. It uses a perceiver resampler to compress each image into a fixed 64 tokens, which is much more efficient than the MLP mapping. With the same 8192 context length, it can accept at most 128 images. With the help of FlashAttention-2, they run very efficiently.
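As a quick back-of-the-envelope check of those numbers (just the arithmetic, not code from the repo):

```python
# Image budget under an 8192-token context window.
context_length = 8192

tokens_per_image_mlp = 576       # CLIP/SigLIP patch tokens through an MLP projector
tokens_per_image_perceiver = 64  # Idefics2-style perceiver resampler output

print(context_length // tokens_per_image_mlp)        # 14 images at most
print(context_length // tokens_per_image_perceiver)  # 128 images at most
```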

Currently, we have not optimized in the video direction. Mantis-Video might also be one of our future works. Stay tuned.

@lucasjinreal
Author

lucasjinreal commented May 24, 2024

@jdf-prog Hi, Mantis-Idefics2 uses NaViT and a maximum 980 input resolution. During testing, do you also resize to a maximum of 980, or use the original-size input?

Have you also used image slicing, the same as Idefics2?

Also, how was the model trained? With Idefics2 as the pretrained model?

Also, I noticed that Mantis-Idefics2 got 51.8 on MVBench for video. That's not bad, but still not great given the image-side improvements; the best 7B baseline on MVBench is 54.85. Do you have any thoughts on this?

@jdf-prog
Collaborator

Thanks for the questions!
During the training of Mantis-Idefics2, we disabled the image splitting (slicing) by default to make the image tokens more efficient. We also disabled image splitting during evaluation.
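For reference, a hedged sketch of how image splitting can be disabled with the Hugging Face Idefics2 processor; the checkpoint name here is the base Idefics2 model and is used only as an example:

```python
from transformers import AutoProcessor

# do_image_splitting controls Idefics2's sub-image slicing; turning it off keeps
# one image = one set of visual tokens, as described above.
processor = AutoProcessor.from_pretrained(
    "HuggingFaceM4/idefics2-8b",
    do_image_splitting=False,
)
```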

For the model initialization, yes, we directly use Idefics2 as the pretrained model.

Well, it's worth noting that Mantis-Instruct contains only 14K video-associated samples, which is much smaller than the data used to train other video understanding models, such as Valley (720K) and Video-ChatGPT (100K). Mantis-Video might be our future work.

@lucasjinreal
Author

lucasjinreal commented May 25, 2024

Hi, I still have some questions I'd like to discuss.

You mentioned that training used Idefics2 as the pretrained model. That raises a serious question, and possibly my biggest concern:

If you use Idefics2 as the pretrained model (whether only the pretrained ViT or the model as a whole), the ViT part is already aligned with a particular LLM, Mistral in this case. So no matter how you train, your upper bound would be Idefics2 itself. I mean, you can't surpass it, since the ViT has already been trained to fit that LLM, unless you take the ViT and train a new, bigger LLM.

From the scores in your table, I also noticed some metrics dropped a lot compared with Idefics2 itself, and not only TextVQA, even considering that you disabled the slicing strategy.

Do you think further training on a well-trained VLM can achieve better results? What if we only use its ViT part and train everything else?

@jdf-prog
Collaborator

jdf-prog commented May 27, 2024

@lucasjinreal Thanks for the questions!

The first question sounds like a problem of continued fine-tuning (continued pretraining). Indeed, knowledge forgetting in continual training cannot be avoided, and the most intuitive way to mitigate it is to replay some previously trained data. For Idefics2, that would be HuggingFaceM4/the_cauldron. We did not do this, for the simplicity of our experiments, but any attempt at this from the community is welcome. There are also many papers about continual learning in the community; maybe this survey can help you: https://arxiv.org/abs/2302.00487

About the claim that "the upper bound would be Idefics2 itself", I personally have doubts. I think it actually depends on the task. There must be some trade-offs if we want to optimize a model for different abilities. In our case, we are trading off multi-image ability against single-image ability. Mantis-Idefics2 indeed improves multi-image ability a lot, at an acceptable cost (2% to 6%) of single-image degradation. I would say this is an interesting observation that is worth further investigation.

"Do u think using the well-trained VLM can achieve better results for further training on it?". Still, there is always a cost, for training LLMs, it's a trade-off of old knowledge and new knowledge. The answer actually depends on what kind of metrics you are using to evaluate.

Thanks again for your deep thoughts about Mantis. We indeed have many problems to solve in the multi-image VLM setting. Feel free to ask any other questions; I'd be happy to discuss.

@lucasjinreal
Author

@jdf-prog Hi, thanks for your deep insights.

It looks like building on a strong baseline for continued training can also be very beneficial. As a next step, might you consider switching to MiniCPM as your base? It has a large margin and even surpasses Idefics2, since it uses Llama 3 as the language model.

If so, what would the plan be? If not, what would the next move be to push the margin further?

@jdf-prog
Collaborator

jdf-prog commented Jun 4, 2024

MiniCPM is a great baseline! We will include it as both a baseline and a backbone in the future. We have several directions for Mantis's future work; we are actively working on them and aim to build Mantis as a brand. Stay tuned!

@lucasjinreal
Author

Hello, I have to say, VideoChat2 has now reached 60 points on MVBench....

@wenhuchen
Contributor

> Hello, I have to say, VideoChat2 has now reached 60 points on MVBench....

Yes, we are training Mantis-Video now to specifically focus on video tasks. Stay tuned!

@patrick-tssn

> Hello, I have to say, VideoChat2 has now reached 60 points on MVBench....

I hope this message finds you well. I have a question out of curiosity: How do you compare methods that utilize different training data and LLM backbones?

@yeppp27

yeppp27 commented Jul 16, 2024

Hi! Thanks for your great work~ I want to ask: where is the interleaved text-image processing with the image separator? I didn't find it in data.py.

@jdf-prog
Collaborator

It's written in the model's processor file; see this function.

@jdf-prog jdf-prog closed this as completed Aug 5, 2024