
support MiniCPM-V-2.5 #7599

Merged: 67 commits merged into ggerganov:master on Aug 9, 2024
Conversation

@tc-mb (Contributor) commented May 28, 2024

Dear llama.cpp Official,

Hi, I'm writing to present our new PR for integrating our model MiniCPM-Llama3-V 2.5 into llama.cpp. The model has been trending on Hugging Face for over a week and has garnered significant user demand. During the previous MiniCPM-V PR attempt, we identified several critical implementation bugs. The official MiniCPM-V team has since fixed all of these issues, resulting in performance that matches our PyTorch version. These changes also distinguish our implementation significantly from the LLaVA example codebase.

Here are some key differences and improvements we've made:

  1. Flexible Image Handling: We support arbitrary image sizes by dynamically segmenting images into sub-images, allowing our ViT to accept various aspect ratios, unlike the fixed input dimensions required by other models (a conceptual sketch follows this list).
  2. 2D Resampler: Our model uses a 2D resampler to downsample image features into shorter sequences, significantly speeding up inference (also sketched after this list).
  3. Enhanced Embedding: Unlike the original ViT positional encoding used in previous VLMs, we employ a new approach to image embedding with a PosEmbedding layer.
  4. Distinct Tokenizer: Our tokenizer is different from LLaVA's, leading to unique special token decoding.
  5. Higher-Level Framework Support: We've optimized our model for better integration with frameworks built on top of llama.cpp, such as Ollama.
  6. CLI Optimization: We've made modifications to better adapt the CLI for Android use.
  7. NPU-Optimized ViT: We've rewritten the Vision Transformer (ViT) component to leverage NPU on mobile devices, optimizing I/O for Android inference. (this week)
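
To make item 1 concrete, here is a minimal, hypothetical sketch of adaptive image slicing. The function name, the 448-pixel tile size, and the slice limit are illustrative assumptions for this thread, not the actual MiniCPM-V or llama.cpp code.

```python
import math

def choose_slice_grid(width, height, tile=448, max_slices=9):
    """Hypothetical sketch: estimate how many tile-sized slices the image's
    area "deserves", then pick the rows x cols grid whose slices end up
    closest to square, so each slice suits the ViT's expected input."""
    ideal = (width * height) / (tile * tile)
    n = max(1, min(max_slices, round(ideal)))
    best, best_err = (1, n), float("inf")
    for rows in range(1, n + 1):
        if n % rows:
            continue  # only grids that use exactly n slices
        cols = n // rows
        # log-ratio distance of a slice's aspect ratio from 1:1
        err = abs(math.log((width / cols) / (height / rows)))
        if err < best_err:
            best, best_err = (rows, cols), err
    return best

# A wide 1344x672 image maps to 4 slices; this prints (1, 4).
print(choose_slice_grid(1344, 672))
```

And a similarly hypothetical PyTorch sketch of the 2D resampler from item 2: a fixed set of learned queries cross-attends over the ViT patch grid, so a grid of any size collapses to `num_queries` tokens. The class name, dimensions, and the simplified learned row/column position embeddings are assumptions made for illustration, not the actual MiniCPM-V module.

```python
import torch
import torch.nn as nn

class Resampler2DSketch(nn.Module):
    """Hypothetical perceiver-style resampler with 2D positional information."""
    def __init__(self, num_queries=96, embed_dim=1024, vit_dim=1152, max_hw=64):
        super().__init__()
        self.query = nn.Parameter(torch.randn(num_queries, embed_dim) * 0.02)
        self.kv_proj = nn.Linear(vit_dim, embed_dim)
        # simplified 2D position: one learned embedding per row, one per column
        self.pos_h = nn.Embedding(max_hw, embed_dim)
        self.pos_w = nn.Embedding(max_hw, embed_dim)
        self.attn = nn.MultiheadAttention(embed_dim, num_heads=8, batch_first=True)

    def forward(self, vit_feats, grid_h, grid_w):
        # vit_feats: (batch, grid_h * grid_w, vit_dim) patch features from the ViT
        dev = vit_feats.device
        kv = self.kv_proj(vit_feats)
        rows = torch.arange(grid_h, device=dev).repeat_interleave(grid_w)
        cols = torch.arange(grid_w, device=dev).repeat(grid_h)
        kv = kv + self.pos_h(rows) + self.pos_w(cols)            # inject 2D position
        q = self.query.unsqueeze(0).expand(kv.size(0), -1, -1)   # fixed-length queries
        out, _ = self.attn(q, kv, kv)                            # (batch, num_queries, embed_dim)
        return out
```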

While some aspects of our implementation may appear similar to the LLaVA example codebase, these distinct features and optimizations set our model apart. We could reference LLaVA for the overlapping components to avoid duplication, but this might compromise the standalone nature of the individual examples; Hugging Face Transformers similarly ensures each model has its own self-contained implementation.

Given the extensive user interest and the robust performance of our implementation, merging this model would significantly benefit the community. We are open to collaborating on any adjustments you deem necessary and are committed to ensuring the highest code quality and usability.

Thank you for considering our request. We look forward to your feedback and hope for a positive resolution.

Best regards,
MiniCPM-V Official ^_^

@tc-mb tc-mb marked this pull request as ready for review May 28, 2024 20:41
@tc-mb (Contributor, Author) commented Aug 7, 2024

> Ah there appears to be a conflict - need to resolve this so the CI can run


I tried to solve most of the problems, but there seems to be an error in the original llava-cli. Can you help me check it?

@ggerganov (Owner)
Does this patch fix it?

diff --git a/examples/llava/clip.h b/examples/llava/clip.h
index f028f187..2ff4d399 100644
--- a/examples/llava/clip.h
+++ b/examples/llava/clip.h
@@ -18,8 +18,6 @@
 #    define CLIP_API
 #endif
 
-struct clip_ctx;
-
 #ifdef __cplusplus
 extern "C" {
 #endif

@tc-mb (Contributor, Author) commented Aug 8, 2024

> Does this patch fix it?
>
> [patch quoted above]

This does work, and only one issue remains now.

@tc-mb (Contributor, Author) commented Aug 9, 2024

@ggerganov I'm glad that the CI is all green. Can we merge this PR now?
If it is merged, I will submit a PR for MiniCPM-V 2.6 today.

@ggerganov ggerganov merged commit 3071c0a into ggerganov:master Aug 9, 2024
54 checks passed
@cmp-nct (Contributor) commented Aug 9, 2024

Congrats, awesome to see this progress so much. Thanks for the effort; looking forward to seeing 2.6.

@chigkim commented Aug 11, 2024

@tc-mb, This is awesome!!! Hopefully 2.6 is on the way!
@ggerganov, is the server still broken with no support for vision language models? Any plan to bring it back?
Thanks EVERYONE!

@tc-mb tc-mb deleted the prepare-PR-of-minicpm-v2.5 branch August 12, 2024 08:48
@tc-mb (Contributor, Author) commented Aug 13, 2024

Hi @ggerganov, I have submitted the PR for MiniCPM-V 2.6. This PR only updates the model, and the CI is all green too. Could you please take a look at it when you are free in the near future?

#8967

@lin72h commented Aug 15, 2024

@cmp-nct I've been following your contributions on vision models for a while. Very interested to hear your opinion on the MiniCPM-V-2.6 and MiniCPM-V-2.5 versions.

@fairydreaming (Collaborator)
Did anyone actually try to convert the model with the provided scripts as described in README-minicpmv2.5.md? It looks like there is a problem: #9098

@tc-mb (Contributor, Author) commented Sep 11, 2024

> mmproj and image are invalid tokens when using llama-minicpmv-cli binary file, but both of these tokens are used in the "example usage" line when running the binary file. Neither of these tokens are listed in "llama-minicpmv-cli --help" section. Attempting to run on android phone (Qualcomm 8650) through adb shell.

Hi, have you tried it on a PC? I think the problem is not in the code logic; it may be caused by cross-compilation.
