- Conduct learning and research on MLLMs based on the MME rankings.
The leaderboards cover 36 advanced MLLMs, including BLIP-2, InstructBLIP, LLaVA, MiniGPT-4, mPLUG-Owl, LLaMA-Adapter V2, ImageBind_LLM, Otter, VisualGLM-6B, Multimodal-GPT, PandaGPT, VPGTrans, LaVIN, Lynx, Octopus, LRV-Instruction, Cheetor, MMICL, GIT2, BLIVA, Skywork-MM, Qwen-VL-Chat, InternLM-XComposer-VL, Lion, Muffin, WeMM, SPHINX, InfMLLM, mPLUG-Owl2, GPT-4V, CVLM, LVIS-INSTRUCT4V, Kanva, DataOptim, ShareGPT4V, and BELLE-VL.
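In MME, each subtask is worth up to 200 points, so a model's perception score is the sum of its ten perception subtask scores (2000 maximum) and its cognition score is the sum of its four cognition subtask scores (800 maximum). The totals in the detailed tables below can therefore be reproduced by summing the subtask columns; here is a minimal sanity-check sketch (the helper name `mme_total` is ours, not part of MME):

```python
def mme_total(subtask_scores):
    """Sum per-subtask MME scores (each subtask is worth up to 200 points)."""
    return round(sum(subtask_scores), 2)

# BLIP-2 perception subtask scores, copied from the detailed table below:
# existence, count, position, color, OCR, posters, cast, scene, landmark, artwork
blip2_perception = [160.00, 135.00, 73.33, 148.33, 110.00,
                    141.84, 105.59, 145.25, 138.00, 136.50]

assert mme_total(blip2_perception) == 1293.84  # matches the reported score
```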
*Figure: overall leaderboard charts for Perception and Cognition (images omitted).*
Num. | Arch. | Model | Version | Perception | Cognition |
---|---|---|---|---|---|
1 | FlanT5xxl | BLIP-2 | FlanT5xxl | 1293.84 | 290.00 |
2 | FlanT5xxl | InstructBLIP | FlanT5xxl | 1212.82 | 291.79 |
3 | FlanT5xxl | MMICL | FlanT5xxl | 1381.73 | 428.93 |
4 | FlanT5xxl | BLIVA | FlanT5xxl | 1337.73 | 331.43 |
#### Perception
Models | version | existence | count | position | color | OCR | posters | cast | scene | landmark | artwork | score |
---|---|---|---|---|---|---|---|---|---|---|---|---|
BLIP-2 | FlanT5xxl | 160.00 | 135.00 | 73.33 | 148.33 | 110.00 | 141.84 | 105.59 | 145.25 | 138.00 | 136.50 | 1293.84 |
InstructBLIP | FlanT5xxl | 185.00 | 143.33 | 66.67 | 153.33 | 72.50 | 123.81 | 101.18 | 153.00 | 79.75 | 134.25 | 1212.82 |
MMICL | FlanT5xxl | 170.00 | 160.00 | 81.67 | 156.67 | 100.00 | 146.26 | 141.76 | 153.75 | 136.13 | 135.50 | 1381.73 |
BLIVA | FlanT5xxl | 180.00 | 138.33 | 81.67 | 180.00 | 87.50 | 155.10 | 140.88 | 151.50 | 89.50 | 133.25 | 1337.73 |
#### Cognition
Models | version | Common_Sense_Reasoning | Numerical_Calculation | Text_Translation | Code_Reasoning | score |
---|---|---|---|---|---|---|
BLIP-2 | FlanT5xxl | 110.00 | 40.00 | 65.00 | 75.00 | 290.00 |
InstructBLIP | FlanT5xxl | 129.29 | 40.00 | 65.00 | 57.50 | 291.79 |
MMICL | FlanT5xxl | 136.43 | 82.50 | 132.50 | 77.50 | 428.93 |
BLIVA | FlanT5xxl | 136.43 | 57.50 | 77.50 | 60.00 | 331.43 |
Num. | Arch. | Model | Version | Perception | Cognition |
---|---|---|---|---|---|
1 | LLaMA | mPLUG-Owl | Llama-7B | 967.34 | 276.07 |
2 | LLaMA | SPHINX | LLaMA2-13B | 1560.15 | 310.00 |
3 | LLaMA | LaVIN | LAVIN-13B | 963.60 | 249.64 |
4 | LLaMA | mPLUG-Owl2 | LLaMA2-7B | 1450.20 | 313.21 |
5 | LLaMA | LLaMA-Adapter V2 | LLaMA-Adapter-v2.1-7B | 1328.39 | 356.43 |
#### Perception
Models | version | existence | count | position | color | OCR | posters | cast | scene | landmark | artwork | score |
---|---|---|---|---|---|---|---|---|---|---|---|---|
mPLUG-Owl | Llama-7B | 120.00 | 50.00 | 50.00 | 55.00 | 65.00 | 136.05 | 100.29 | 135.50 | 159.25 | 96.25 | 967.34 |
SPHINX | LLaMA2-13B | 195.00 | 160.00 | 153.33 | 160.00 | 87.50 | 164.29 | 177.94 | 160.00 | 168.09 | 134.00 | 1560.15 |
LaVIN | LAVIN-13B | 185.00 | 88.33 | 63.33 | 75.00 | 107.50 | 79.59 | 47.35 | 136.75 | 93.50 | 87.25 | 963.60 |
mPLUG-Owl2 | LLaMA2-7B | 185.00 | 155.00 | 88.33 | 150.00 | 102.50 | 160.20 | 164.41 | 153.25 | 157.25 | 134.25 | 1450.20 |
LLaMA-Adapter V2 | LLaMA-Adapter-v2.1-7B | 185.00 | 133.33 | 56.67 | 118.33 | 102.50 | 147.96 | 136.76 | 156.25 | 167.84 | 123.75 | 1328.39 |
#### Cognition
Models | version | Common_Sense_Reasoning | Numerical_Calculation | Text_Translation | Code_Reasoning | score |
---|---|---|---|---|---|---|
mPLUG-Owl | Llama-7B | 78.57 | 60.00 | 80.00 | 57.50 | 276.07 |
SPHINX | LLaMA2-13B | 130.00 | 55.00 | 75.00 | 50.00 | 310.00 |
LaVIN | LAVIN-13B | 87.14 | 65.00 | 47.50 | 50.00 | 249.64 |
mPLUG-Owl2 | LLaMA2-7B | 115.71 | 35.00 | 102.50 | 60.00 | 313.21 |
LLaMA-Adapter V2 | LLaMA-Adapter-v2.1-7B | 106.43 | 47.50 | 112.50 | 90.00 | 356.43 |
Num. | Arch. | Model | Version | Perception | Cognition |
---|---|---|---|---|---|
1 | Vicuna | MiniGPT-4 | Vicuna-13B | 581.66 | 144.29 |
2 | Vicuna | PandaGPT | Vicuna-7B | 642.59 | 228.57 |
3 | Vicuna | LLaVA | Vicuna-13B | 1531.31 | 295.36 |
4 | Vicuna | LaVIN | LAVIN-13B | 963.60 | 249.64 |
5 | Vicuna | VPGTrans | Vicuna-7B | 790.45 | 249.29 |
6 | Vicuna | Lynx | Vicuna-7B | 1373.24 | 215.71 |
7 | Vicuna | Cheetor | Vicuna-7B | 1299.97 | 321.07 |
8 | Vicuna | Muffin | Vicuna-13B | 1281.02 | 290.00 |
9 | Vicuna | InfMLLM | Vicuna-13B | 1567.99 | 347.14 |
10 | Vicuna | CVLM | Vicuna-13B | 1636.45 | 488.93 |
11 | Vicuna | LVIS-INSTRUCT4V | Vicuna-13B | 1574.89 | 286.79 |
12 | Vicuna | ShareGPT4V | Vicuna-13B | 1618.70 | 303.21 |
13 | Vicuna | DataOptim-LLaVA | Vicuna-13B | 1563.56 | 361.07 |
#### Perception
Models | version | existence | count | position | color | OCR | posters | cast | scene | landmark | artwork | score |
---|---|---|---|---|---|---|---|---|---|---|---|---|
MiniGPT-4 | Vicuna-13B | 68.33 | 55.00 | 43.33 | 75.00 | 57.50 | 41.84 | 54.41 | 71.75 | 54.00 | 60.50 | 581.66 |
PandaGPT | Vicuna-7B | 70.00 | 50.00 | 50.00 | 50.00 | 50.00 | 76.53 | 57.06 | 118.00 | 69.75 | 51.25 | 642.59 |
LLaVA | Vicuna-13B | 185.00 | 155.00 | 133.33 | 170.00 | 125.00 | 160.54 | 152.94 | 161.25 | 170.50 | 117.75 | 1531.31 |
LaVIN | LAVIN-13B | 185.00 | 88.33 | 63.33 | 75.00 | 107.50 | 79.59 | 47.35 | 136.75 | 93.50 | 87.25 | 963.60 |
VPGTrans | Vicuna-7B | 70.00 | 85.00 | 63.33 | 73.33 | 77.50 | 84.01 | 53.53 | 141.75 | 64.75 | 77.25 | 790.45 |
Lynx | Vicuna-7B | 195.00 | 151.67 | 90.00 | 170.00 | 77.50 | 124.83 | 118.24 | 164.50 | 162.00 | 119.50 | 1373.24 |
Cheetor | Vicuna-7B | 180.00 | 96.67 | 80.00 | 116.67 | 100.00 | 147.28 | 164.12 | 156.00 | 145.73 | 113.50 | 1299.97 |
Muffin | Vicuna-13B | 195.00 | 163.33 | 66.67 | 165.00 | 57.50 | 137.76 | 81.76 | 151.25 | 146.25 | 116.50 | 1281.02 |
InfMLLM | Vicuna-13B | 190.00 | 151.67 | 143.33 | 185.00 | 132.50 | 163.27 | 161.47 | 165.25 | 167.00 | 108.50 | 1567.99 |
CVLM | Vicuna-13B | 185.00 | 155.00 | 178.33 | 185.00 | 155.00 | 162.24 | 155.88 | 162.75 | 169.50 | 127.75 | 1636.45 |
LVIS-INSTRUCT4V | Vicuna-13B | 195.00 | 160.00 | 128.33 | 180.00 | 132.50 | 162.59 | 161.47 | 163.25 | 161.50 | 130.25 | 1574.89 |
ShareGPT4V | Vicuna-13B | 190.00 | 165.00 | 153.33 | 185.00 | 132.50 | 169.05 | 153.82 | 168.00 | 174.00 | 128.00 | 1618.70 |
DataOptim-LLaVA | Vicuna-13B | 190.00 | 165.00 | 121.67 | 155.00 | 162.50 | 169.73 | 159.41 | 166.50 | 160.00 | 113.75 | 1563.56 |
#### Cognition
Models | version | Common_Sense_Reasoning | Numerical_Calculation | Text_Translation | Code_Reasoning | score |
---|---|---|---|---|---|---|
MiniGPT-4 | Vicuna-13B | 59.29 | 45.00 | 0.00 | 40.00 | 144.29 |
PandaGPT | Vicuna-7B | 73.57 | 50.00 | 57.50 | 47.50 | 228.57 |
LLaVA | Vicuna-13B | 127.86 | 42.50 | 77.50 | 47.50 | 295.36 |
LaVIN | LAVIN-13B | 87.14 | 65.00 | 47.50 | 50.00 | 249.64 |
VPGTrans | Vicuna-7B | 64.29 | 50.00 | 77.50 | 57.50 | 249.29 |
Lynx | Vicuna-7B | 110.71 | 17.50 | 42.50 | 45.00 | 215.71 |
Cheetor | Vicuna-7B | 98.57 | 77.50 | 57.50 | 87.50 | 321.07 |
Muffin | Vicuna-13B | – | – | – | – | 290.00 |
InfMLLM | Vicuna-13B | 132.14 | 60.00 | 102.50 | 52.50 | 347.14 |
CVLM | Vicuna-13B | 131.43 | 137.50 | 147.50 | 72.50 | 488.93 |
LVIS-INSTRUCT4V | Vicuna-13B | 134.29 | 40.00 | 70.00 | 42.50 | 286.79 |
ShareGPT4V | Vicuna-13B | 125.71 | 45.00 | 80.00 | 52.50 | 303.21 |
DataOptim-LLaVA | Vicuna-13B | 123.57 | 47.50 | 110.00 | 80.00 | 361.07 |
Num. | Arch. | Model | Version | Perception | Cognition |
---|---|---|---|---|---|
1 | OpenFlamingo | Multimodal-GPT | Multimodal-GPT-9B | 654.72 | 226.79 |
2 | OpenFlamingo | Otter | OTTER-Image-MPT7B | 1292.26 | 306.43 |
#### Perception
Models | version | existence | count | position | color | OCR | posters | cast | scene | landmark | artwork | score |
---|---|---|---|---|---|---|---|---|---|---|---|---|
Multimodal-GPT | Multimodal-GPT-9B | 61.67 | 55.00 | 58.33 | 68.33 | 82.50 | 57.82 | 73.82 | 68.00 | 69.75 | 59.50 | 654.72 |
Otter | OTTER-Image-MPT7B | 195.00 | 88.33 | 86.67 | 113.33 | 72.50 | 138.78 | 172.65 | 158.75 | 137.25 | 129.00 | 1292.26 |
#### Cognition
Models | version | Common_Sense_Reasoning | Numerical_Calculation | Text_Translation | Code_Reasoning | score |
---|---|---|---|---|---|---|
Multimodal-GPT | Multimodal-GPT-9B | 49.29 | 62.50 | 60.00 | 55.00 | 226.79 |
Otter | OTTER-Image-MPT7B | 106.43 | 72.50 | 57.50 | 70.00 | 306.43 |
Num. | Arch. | Model | Version | Perception | Cognition |
---|---|---|---|---|---|
1 | InternLM | InternLM-XComposer-VL | InternLM-7B | 1528.45 | 391.07 |
2 | InternLM | Lion | InternLM-7B | 1545.80 | 445.71 |
3 | InternLM | WeMM | InternLM-7B | 1621.66 | 445.00 |
#### Perception
Models | version | existence | count | position | color | OCR | posters | cast | scene | landmark | artwork | score |
---|---|---|---|---|---|---|---|---|---|---|---|---|
InternLM-XComposer-VL | InternLM-7B | 190.00 | 158.33 | 126.67 | 165.00 | 125.00 | 161.90 | 150.29 | 159.75 | 165.25 | 126.25 | 1528.45 |
Lion | InternLM-7B | 190.00 | 155.00 | 153.33 | 180.00 | 72.50 | 181.63 | 150.59 | 159.00 | 173.00 | 130.75 | 1545.80 |
WeMM | InternLM-7B | 195.00 | 140.00 | 126.67 | 168.33 | 147.50 | 160.54 | 179.12 | 176.25 | 172.25 | 156.00 | 1621.66 |
#### Cognition
Models | version | Common_Sense_Reasoning | Numerical_Calculation | Text_Translation | Code_Reasoning | score |
---|---|---|---|---|---|---|
InternLM-XComposer-VL | InternLM-7B | 138.57 | 55.00 | 112.50 | 85.00 | 391.07 |
Lion | InternLM-7B | 125.71 | 105.00 | 147.50 | 67.50 | 445.71 |
WeMM | InternLM-7B | 140.00 | 57.50 | 130.00 | 117.50 | 445.00 |
Num. | Arch. | Model | Version | Perception | Cognition |
---|---|---|---|---|---|
1 | Qwen | Qwen-VL-Chat | Qwen-7B | 1487.58 | 360.71 |
2 | Qwen | Kanva | Qwen-14B | 1666.08 | 217.14 |
3 | Qwen | BELLE-VL | Qwen-14B | 1595.34 | 332.14 |
#### Perception
Models | version | existence | count | position | color | OCR | posters | cast | scene | landmark | artwork | score |
---|---|---|---|---|---|---|---|---|---|---|---|---|
Qwen-VL-Chat | Qwen-7B | 158.33 | 150.00 | 128.33 | 170.00 | 140.00 | 178.57 | 120.59 | 152.25 | 164.00 | 125.50 | 1487.58 |
Kanva | Qwen-14B | 195.00 | 156.67 | 185.00 | 160.00 | 152.50 | 140.82 | 145.00 | 179.75 | 184.34 | 167.00 | 1666.08 |
BELLE-VL | Qwen-14B | 190.00 | 150.00 | 130.00 | 175.00 | 177.50 | 166.33 | 136.76 | 156.25 | 174.00 | 139.50 | 1595.34 |
#### Cognition
Models | version | Common_Sense_Reasoning | Numerical_Calculation | Text_Translation | Code_Reasoning | score |
---|---|---|---|---|---|---|
Qwen-VL-Chat | Qwen-7B | 130.71 | 40.00 | 147.50 | 42.50 | 360.71 |
Kanva | Qwen-14B | 72.14 | 50.00 | 50.00 | 45.00 | 217.14 |
BELLE-VL | Qwen-14B | 127.14 | 47.50 | 102.50 | 55.00 | 332.14 |
Num. | Arch. | Model | Version | Perception | Cognition |
---|---|---|---|---|---|
1 | MPT | Octopus | MPT7B | 1095.75 | 312.50 |
2 | MPT | Otter | OTTER-Image-MPT7B | 1292.26 | 306.43 |
#### Perception
Models | version | existence | count | position | color | OCR | posters | cast | scene | landmark | artwork | score |
---|---|---|---|---|---|---|---|---|---|---|---|---|
Octopus | MPT7B | 180.00 | 53.33 | 48.33 | 103.33 | 65.00 | 138.10 | 129.41 | 157.25 | 126.00 | 95.00 | 1095.75 |
Otter | OTTER-Image-MPT7B | 195.00 | 88.33 | 86.67 | 113.33 | 72.50 | 138.78 | 172.65 | 158.75 | 137.25 | 129.00 | 1292.26 |
#### Cognition
Models | version | Common_Sense_Reasoning | Numerical_Calculation | Text_Translation | Code_Reasoning | score |
---|---|---|---|---|---|---|
Octopus | MPT7B | 100.00 | 47.50 | 102.50 | 62.50 | 312.50 |
Otter | OTTER-Image-MPT7B | 106.43 | 72.50 | 57.50 | 70.00 | 306.43 |
Num. | Arch. | Model | Version | Perception | Cognition |
---|---|---|---|---|---|
1 | GLM | VisualGLM-6B | VisualGLM-6B | 705.31 | 181.79 |
#### Perception
Models | version | existence | count | position | color | OCR | posters | cast | scene | landmark | artwork | score |
---|---|---|---|---|---|---|---|---|---|---|---|---|
VisualGLM-6B | VisualGLM-6B | 85.00 | 50.00 | 48.33 | 55.00 | 42.50 | 65.99 | 53.24 | 146.25 | 83.75 | 75.25 | 705.31 |
#### Cognition
Models | version | Common_Sense_Reasoning | Numerical_Calculation | Text_Translation | Code_Reasoning | score |
---|---|---|---|---|---|---|
VisualGLM-6B | VisualGLM-6B | 39.29 | 45.00 | 50.00 | 47.50 | 181.79 |
Num. | Arch. | Model | Version | Perception | Cognition |
---|---|---|---|---|---|
1 | imagebind_huge+Open-Chinese-LLaMA-7B | ImageBind_LLM | imagebind_LLM-7B | 775.77 | 213.57 |
2 | MiniGPT/LLaMA | LRV-Instruction | LRV-7B | 1299.79 | 328.21 |
#### Perception
Models | version | existence | count | position | color | OCR | posters | cast | scene | landmark | artwork | score |
---|---|---|---|---|---|---|---|---|---|---|---|---|
ImageBind_LLM | imagebind_LLM-7B | 128.33 | 60.00 | 46.67 | 73.33 | 80.00 | 64.97 | 76.47 | 113.25 | 62.00 | 70.75 | 775.77 |
LRV-Instruction | LRV-7B | 165.00 | 111.67 | 86.67 | 165.00 | 110.00 | 139.04 | 112.65 | 147.98 | 160.53 | 101.25 | 1299.79 |
#### Cognition
Models | version | Common_Sense_Reasoning | Numerical_Calculation | Text_Translation | Code_Reasoning | score |
---|---|---|---|---|---|---|
ImageBind_LLM | imagebind_LLM-7B | 48.57 | 55.00 | 50.00 | 60.00 | 213.57 |
LRV-Instruction | LRV-7B | 100.71 | 70.00 | 85.00 | 72.50 | 328.21 |
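Because the rankings above are split by architecture, a single cross-architecture view can be handy. Below is a minimal sketch that re-ranks a hand-copied subset of the summary rows by the combined perception-plus-cognition total; this combined total is just one convenient criterion, not an official MME metric:

```python
# Re-rank a subset of the summary tables above by perception + cognition.
# Rows are hand-copied from the tables; the combined total is our own
# convenience metric, since MME reports the two axes separately.
leaderboard = [
    # (model, version, perception, cognition)
    ("Kanva",      "Qwen-14B",    1666.08, 217.14),
    ("CVLM",       "Vicuna-13B",  1636.45, 488.93),
    ("WeMM",       "InternLM-7B", 1621.66, 445.00),
    ("ShareGPT4V", "Vicuna-13B",  1618.70, 303.21),
    ("BELLE-VL",   "Qwen-14B",    1595.34, 332.14),
]

for model, version, perception, cognition in sorted(
    leaderboard, key=lambda row: row[2] + row[3], reverse=True
):
    print(f"{model:<12} {version:<12} total={perception + cognition:7.2f}")
```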