Are They the Same? Exploring Visual Correspondence Shortcomings of Multimodal LLMs

Teaser Image

Recent advancements in multimodal models have demonstrated strong alignment between vision and language, resulting in enhanced capabilities in visual perception, reasoning, and vision-language understanding. However, is the vision component sufficiently discriminative for visual matching across multiple images? Our research indicates that the visual matching capabilities of current multimodal LLMs (MLLMs) still exhibit systematic shortcomings, even for strong models such as GPT-4o.

Evaluating Visual Matching Capabilities of MLLMs

To understand the visual matching shortcomings of MLLMs, we first build a new and challenging benchmark for instance-level correspondence across multiple images. Specifically, we collect 1,510 samples from 15 public datasets and internet video platforms. These samples cover a variety of scenes, such as indoor environments, urban settings, and card and billiards games. Each sample is carefully annotated with multi-image QA pairs by three skilled annotators. On this benchmark, we evaluate 36 state-of-the-art (SOTA) MLLMs across eight matching aspects, including color, shape, posture, size, textual or logo markers, and more.

MMVM Benchmark
Fig 1. Visualization of the MMVM Benchmark. Our MMVM benchmark contains 1,510 manually annotated multi-image QA pairs, 8 matching patterns, and 2 types of object referring methods. We collect these evaluation samples from 15 open-source video datasets and various internet video platforms.
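For illustration, a single evaluation sample could be organized roughly as shown in the sketch below. Every field name here is a hypothetical assumption made for readability, not the benchmark's released annotation format.

# Hypothetical sketch of one MMVM evaluation sample; field names are
# illustrative assumptions, not the released schema.
sample = {
    "images": ["frame_0001.jpg", "frame_0102.jpg"],   # two frames of the same scene
    "match_type": "color",                            # one of the 8 matching patterns
    "referring": "visual_marker",                     # one of the 2 object referring methods
    "question": "Which candidate object in the second image (A-D) is the same "
                "instance as the object marked in the first image?",
    "options": ["A", "B", "C", "D"],                  # assuming a multiple-choice format
    "answer": "B",
}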

More MMVM Cases
Fig 2. Challenging test cases from our MMVM benchmark, where each row shows cases of a different match type.

We have evaluated 36 SOTA MLLMs (e.g., InternVL2, Qwen2VL, LLaVA-OneVision, GeminiPro, GPT-4o) on the MMVM benchmark. However, none of them achieves an overall accuracy exceeding 50%.

MMVM Benchmark Results
Tab 1. MMVM Benchmark results. Accuracy is the metric, and the overall accuracy is computed across all 1,510 evaluation samples. The accuracy for each of the eight match types is calculated separately on their respective samples.

Analysis of Current MLLMs' Shortcomings on Visual Matching Ability

None of these MLLMs achieves an overall accuracy exceeding 50% on the MMVM benchmark. This suggests significant deficiencies in current MLLMs' performance on the visual matching task. We argue that two main factors cause this phenomenon:

Attention Map
Fig 3. Attention maps over image pairs as the MLLMs generate their responses. In each pair, a query object is specified in the first image, while multiple candidate objects are specified in the second image; the MLLM must select the candidate object in the second image that matches the query object.
  1. Although current MLLMs possess the capabilities needed for visual matching, such as producing detailed descriptions of objects' appearances and positions, they lack training data that teaches them to use this basic knowledge and these abilities for correct visual matching. This hypothesis stems from two observations. First, as shown in Fig. 3, we visualize the attention maps between multiple SOTA MLLMs' responses and the input images (a minimal extraction sketch is given after this list); the maps show that the MLLMs correctly focus on the query objects. Second, we find that these MLLMs, such as InternVL2-76B, can accurately identify a query object's basic attributes, including category, color, position, shape, posture, and relationships with other objects in the scene, yet InternVL2-76B only achieves a score of 25 on the MMVM benchmark. These results indicate that current MLLMs possess the capabilities needed for visual matching but cannot properly leverage them to perform the matching itself.
  2. Current MLLMs rely on CLIP models to understand images and cannot capture the fine-grained, discriminative visual features that are essential for visual matching, since candidate objects often share extremely similar semantic information. As shown in Fig. 3, the attention maps demonstrate that the MLLMs assign almost equal attention to all candidate objects and cannot single out the correct one.
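The attention maps in Fig. 3 can be reproduced approximately from standard decoder attention outputs. Below is a minimal sketch that assumes a HuggingFace-style generate call with output_attentions=True and return_dict_in_generate=True; the image-token span and patch grid size are model-specific assumptions.

# Hedged sketch: average decoder attention from the generated answer tokens
# onto the image-token span to obtain a per-patch heatmap (cf. Fig. 3).
# `image_token_start/end` and `grid_hw` are model-specific assumptions.
import torch

def attention_heatmap(attentions, image_token_start, image_token_end, grid_hw):
    # `attentions` is a tuple over generation steps; each step is a tuple over
    # layers of tensors shaped (batch, heads, query_len, key_len), as returned
    # by generate(..., output_attentions=True, return_dict_in_generate=True).
    per_step = []
    for step_attn in attentions:
        last_layer = step_attn[-1]                    # (B, H, Q, K)
        # attention of the newly generated token (last query position)
        attn = last_layer[0, :, -1, image_token_start:image_token_end]
        per_step.append(attn.mean(dim=0))             # average over heads
    heatmap = torch.stack(per_step).mean(dim=0)       # average over answer tokens
    h, w = grid_hw
    return heatmap.reshape(h, w)                      # per-patch attention grid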

CoLVA Improves MLLMs' Visual Matching Ability

To address these shortcomings, we propose a novel Object-level Contrastive Learning (OCL) strategy and introduce a fine-grained vision expert to provide discriminative, fine-grained visual features, thereby improving the MLLM's visual matching performance.

CoLVA Model
Fig 4. The overview of CoLVA. The left side shows how we use the object-level contrastive loss to train the RADIO adapter, simultaneously obtaining discriminative features and aligning the RADIO feature space with the MLLM's feature space. The right side shows how we integrate the learned contrastive visual tokens into the MLLM: we directly concatenate them with the original visual tokens output by the MLLM's visual encoder and feed the result into the MLLM's LLM for answer generation.

Object-Level Contrastive Learning

Inspired by the success of contrastive learning in visual representation learning, tracking, and video segmentation, we introduce a novel object-level contrastive learning (OCL) strategy to help the MLLM learn more discriminative features for better visual matching. First, we obtain object-level representations by masked average pooling over the image features. Then, the object-level contrastive loss is applied to these object-level representations:

OCL Loss
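The exact loss is the one shown above. As a hedged reconstruction following the standard InfoNCE template (the cosine similarity, temperature, and positive/negative construction here are assumptions rather than the paper's exact formulation), an object-level contrastive objective over the pooled representations can be written as

\mathcal{L}_{\mathrm{OCL}} = -\frac{1}{N} \sum_{i=1}^{N} \log \frac{\exp\big(\mathrm{sim}(f_i, f_i^{+}) / \tau\big)}{\exp\big(\mathrm{sim}(f_i, f_i^{+}) / \tau\big) + \sum_{j \in \mathcal{N}(i)} \exp\big(\mathrm{sim}(f_i, f_j^{-}) / \tau\big)}

where f_i is the masked-average-pooled representation of object i in one image, f_i^{+} is the same instance in the other image, \mathcal{N}(i) indexes the non-matching candidate objects, \mathrm{sim} is cosine similarity, and \tau is a temperature.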

Fine-grained Vision Expert

We incorporate an additional fine-grained vision expert, RADIO, into the MLLM to provide more powerful visual representations. RADIO is distilled from the SAM encoder, DINOv2, CLIP, and other vision foundation models, and therefore combines fine-grained visual features with good image-text alignment. Because of the significant gap between RADIO's and the MLLM's feature spaces, we introduce an additional pre-training stage to align them, and OCL can be easily integrated into this stage. As depicted on the left side of Fig. 4, we incorporate RADIO into the MLLM and use OCL in the pre-training stage to simultaneously obtain discriminative features and align the RADIO feature space with the MLLM's feature space. For an input image pair, one image is fed into the MLLM's original visual encoder, while the other is fed into RADIO. We then apply OCL to all objects' representations.
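A minimal sketch of this wiring is given below, assuming illustrative module names (RadioAdapter, the encoder handles, and masked_average_pool are not CoLVA's actual API):

# Hedged sketch of the fine-grained vision expert integration described above.
# Module and function names are illustrative assumptions, not CoLVA's code.
import torch
import torch.nn as nn

class RadioAdapter(nn.Module):
    # Projects RADIO patch features into the MLLM's visual-token space.
    def __init__(self, radio_dim: int, llm_dim: int):
        super().__init__()
        self.proj = nn.Sequential(
            nn.Linear(radio_dim, llm_dim),
            nn.GELU(),
            nn.Linear(llm_dim, llm_dim),
        )

    def forward(self, radio_feats: torch.Tensor) -> torch.Tensor:
        # radio_feats: (B, N_patches, radio_dim) -> contrastive visual tokens
        return self.proj(radio_feats)

def masked_average_pool(feats: torch.Tensor, masks: torch.Tensor) -> torch.Tensor:
    # Object-level representations used by OCL.
    # feats: (B, N_patches, C); masks: (B, K_objects, N_patches), binary.
    masks = masks.float()
    weights = masks / masks.sum(dim=-1, keepdim=True).clamp(min=1)
    return torch.einsum("bkn,bnc->bkc", weights, feats)  # (B, K_objects, C)

def build_visual_tokens(image, clip_encoder, radio_encoder, adapter):
    # Concatenate the MLLM's original visual tokens with the learned
    # contrastive visual tokens before feeding them into the LLM.
    original_tokens = clip_encoder(image)                 # (B, N1, llm_dim)
    contrastive_tokens = adapter(radio_encoder(image))    # (B, N2, llm_dim)
    return torch.cat([original_tokens, contrastive_tokens], dim=1)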

Experiments

We integrate CoLVA with several base models (InternVL2-4B, Qwen2VL-2B, Qwen2VL-7B) and significantly improve their visual matching capabilities. The best variant, CoLVA-Qwen2VL-7B, achieves 51.06% overall accuracy on the MMVM benchmark, surpassing GPT-4o and its baseline by 8.41% and 23.58% overall accuracy, respectively.

CoLVA Results

CoLVA generalizes across different MLLMs

More MLLMs as Base Model

CoLVA maintains the general visual question answering capabilities of its base model.

General VQA Results

BibTeX

@misc{zhou2025sameexploringvisualcorrespondence,
  title={Are They the Same? Exploring Visual Correspondence Shortcomings of Multimodal LLMs},
  author={Yikang Zhou and Tao Zhang and Shilin Xu and Shihao Chen and Qianyu Zhou and Yunhai Tong and Shunping Ji and Jiangning Zhang and Xiangtai Li and Lu Qi},
  year={2025},
  eprint={2501.04670},
  archivePrefix={arXiv},
  primaryClass={cs.CV}
}