Recent advancements in multimodal models have demonstrated a strong alignment between vision and language, resulting in enhanced capabilities in visual perception, reasoning, and vision-language understanding. However, is the vision component sufficiently discriminative for visual matching across multiple images? Our research indicates that the visual matching capabilities of current multimodal LLMs (MLLMs) still exhibit systematic shortcomings, even in strong models such as GPT-4o.
To understand the visual matching shortcomings of MLLMs, we first build MMVM, a new and challenging benchmark for instance-level correspondence across multiple images. Specifically, we collect 1,510 samples from 15 public datasets and internet video platforms. These samples cover diverse scenes, such as indoor environments, urban settings, and card and billiards games, and each is carefully annotated with multi-image QA pairs by three skilled annotators. On this benchmark, we evaluate 36 state-of-the-art (SOTA) MLLMs across eight aspects of matching cues, including color, shape, posture, size, and textual or logo markers.
The evaluated MLLMs include InternVL2, Qwen2VL, LLaVA-OneVision, GeminiPro, and GPT-4o, yet none achieves an overall accuracy above 50% on the MMVM benchmark. This reveals significant deficiencies in current MLLMs on the visual matching task, which we attribute largely to visual features that are not discriminative or fine-grained enough for instance-level matching.
To address these shortcomings, we propose CoLVA, which combines a novel Object-level Contrastive Learning (OCL) strategy with a fine-grained vision expert that provides discriminative, fine-grained visual features, thereby improving the MLLM's visual matching performance.
Inspired by the success of contrastive learning in visual representation learning, tracking, and video segmentation, OCL helps the MLLM learn more discriminative features for better visual matching. First, we obtain object-level representations via masked average pooling over the image features. Then, an object-level contrastive loss is applied to these representations, as sketched below.
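A minimal PyTorch-style sketch of this step is given below; the tensor shapes, function names, and the InfoNCE-style symmetric formulation of the loss are illustrative assumptions rather than CoLVA's exact implementation.

```python
# Sketch of object-level contrastive learning (OCL): masked average pooling
# to obtain per-object representations, then an InfoNCE-style loss over them.
# Shapes, names, and the symmetric InfoNCE formulation are assumptions made
# for illustration only.
import torch
import torch.nn.functional as F


def masked_average_pooling(feat, masks):
    """Pool per-object representations from a dense feature map.

    feat:  (C, H, W) image features from a vision encoder.
    masks: (N, H, W) binary instance masks, one per object.
    returns: (N, C) object-level representations.
    """
    masks = masks.float()
    pooled = torch.einsum("nhw,chw->nc", masks, feat)          # sum features inside each mask
    area = masks.sum(dim=(1, 2)).clamp(min=1.0).unsqueeze(-1)  # mask area, avoid divide-by-zero
    return pooled / area


def object_contrastive_loss(obj_a, obj_b, temperature=0.07):
    """InfoNCE-style loss over object representations from an image pair.

    obj_a, obj_b: (N, C) representations of the same N objects, aligned by
    index across the two images; matched indices are positives, all other
    pairs are negatives.
    """
    a = F.normalize(obj_a, dim=-1)
    b = F.normalize(obj_b, dim=-1)
    logits = a @ b.t() / temperature                    # (N, N) similarity matrix
    targets = torch.arange(a.size(0), device=a.device)
    # Symmetric cross-entropy pulls matched objects together and pushes
    # mismatched objects apart in both directions.
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.t(), targets))
```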
We further incorporate a fine-grained vision expert, RADIO, into the MLLM to provide more powerful visual representations. RADIO is distilled from the SAM encoder, DINOv2, CLIP, and other vision foundation models, and therefore combines fine-grained visual features with good image-text alignment. Because of the significant gap between RADIO's feature space and the MLLM's, we introduce an additional pre-training stage to align them, into which OCL is easily integrated. As depicted on the left side of Fig. 3, OCL is used during pre-training to simultaneously learn discriminative features and align the RADIO feature space with the MLLM's. For an input image pair, one image is fed into the MLLM's original visual encoder while the other is fed into RADIO, and OCL is then applied to all object representations.
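To make the pre-training stage concrete, here is a hedged sketch of how the two encoders and OCL could fit together, reusing the helpers from the sketch above; `clip_encoder`, `radio_encoder`, the linear `radio_projector`, and the index-aligned mask convention are our assumptions rather than CoLVA's exact design.

```python
# Hedged sketch of the OCL pre-training step with the RADIO vision expert.
# `clip_encoder` and `radio_encoder` are placeholders for the MLLM's original
# visual encoder and the frozen fine-grained expert; this reuses
# masked_average_pooling and object_contrastive_loss from the previous sketch.
import torch.nn as nn


class OCLPretrainStep(nn.Module):
    def __init__(self, clip_encoder, radio_encoder, radio_dim, llm_dim):
        super().__init__()
        self.clip_encoder = clip_encoder    # assumed to output features of dim llm_dim
        self.radio_encoder = radio_encoder  # frozen fine-grained vision expert
        self.radio_projector = nn.Linear(radio_dim, llm_dim)  # aligns RADIO features with the MLLM space

    def forward(self, image_a, image_b, masks_a, masks_b):
        # One image goes through the original encoder, the other through RADIO.
        feat_a = self.clip_encoder(image_a)       # (llm_dim, H, W)
        radio_feat = self.radio_encoder(image_b)  # (radio_dim, H, W)
        feat_b = self.radio_projector(
            radio_feat.permute(1, 2, 0)           # (H, W, radio_dim)
        ).permute(2, 0, 1)                        # (llm_dim, H, W)

        # Object-level representations; masks_a/masks_b are instance masks of
        # the matched objects, aligned by index across the two images.
        obj_a = masked_average_pooling(feat_a, masks_a)
        obj_b = masked_average_pooling(feat_b, masks_b)

        # OCL over all objects: matched instances attract, others repel.
        return object_contrastive_loss(obj_a, obj_b)
```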
We build CoLVA on several base models (InternVL2-4B, Qwen2VL-2B, Qwen2VL-7B) and significantly improve their visual matching capabilities. The best variant, CoLVA-Qwen2VL-7B, achieves 51.06% overall accuracy on the MMVM benchmark, surpassing GPT-4o and its baseline by 8.41% and 23.58% in overall accuracy, respectively.
Moreover, CoLVA generalizes across different base MLLMs and maintains the inherent general visual question answering capabilities of its base model.
@misc{zhou2025sameexploringvisualcorrespondence,
title={Are They the Same? Exploring Visual Correspondence Shortcomings of Multimodal LLMs},
      author={Yikang Zhou and Tao Zhang and Shilin Xu and Shihao Chen and Qianyu Zhou and Yunhai Tong and Shunping Ji and Jiangning Zhang and Xiangtai Li and Lu Qi},
year={2025},
eprint={2501.04670},
archivePrefix={arXiv},
primaryClass={cs.CV}
}