Recent advances in Vision–Language Models (VLMs) have achieved state-of-the-art performance on numerous benchmark tasks. However, the use of internet-scale, often proprietary, pretraining corpora raises a critical concern for both practitioners and users: inflated performance due to test-set leakage. While prior work on LLMs has proposed mitigation strategies such as decontamination of pretraining data and benchmark redesign, the complementary direction of detecting whether a given VLM is contaminated remains underexplored. To address this gap, we deliberately contaminate open-source VLMs on popular benchmarks and show that existing detection approaches either fail outright or exhibit inconsistent behavior. We then propose a novel, simple, yet effective detection method based on multi-modal semantic perturbation, demonstrating that contaminated models fail to generalize under controlled perturbations. Finally, we validate our approach across multiple contamination strategies, confirming its robustness and effectiveness.
Detection results for contaminated LLaVA-v1.5-7B and Qwen2-VL-7B on RealWorldQA. _P denotes the semantically perturbed variant. Accuracies are measured on the 440 images that remain after manual filtering. The Detected? column indicates whether the model is flagged as contaminated.
Detection results for contaminated LLaVA-v1.5-7B and Qwen2-VL-7B on MMStar. _P denotes the semantically perturbed variant. Accuracies are measured on the 478 images that remain after manual filtering. The Detected? column indicates whether the model is flagged as contaminated.
From the results, we observe that:
Example where the perturbed variant is easier to solve than the original question. In the original image, the traffic sign is small and the text barely legible; after perturbation, the sign is enlarged and clearly visible.
When Flux ControlNet generates the perturbed image, it often renders the salient visual cues more clearly than they appear in the original image.
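For concreteness, below is a minimal sketch of how such a perturbed image could be produced with the publicly available diffusers Flux ControlNet pipeline. The checkpoint names, the Canny conditioning, the prompt, and the sampling parameters are illustrative assumptions and need not match the exact setup used in the paper.

import torch
from diffusers import FluxControlNetModel, FluxControlNetPipeline
from diffusers.utils import load_image

# Illustrative checkpoints; the paper's exact ControlNet variant may differ.
controlnet = FluxControlNetModel.from_pretrained(
    "InstantX/FLUX.1-dev-Controlnet-Canny", torch_dtype=torch.bfloat16
)
pipe = FluxControlNetPipeline.from_pretrained(
    "black-forest-labs/FLUX.1-dev", controlnet=controlnet, torch_dtype=torch.bfloat16
).to("cuda")

# The control image is derived from the original benchmark image
# (for a Canny ControlNet, its edge map); the prompt describes the
# semantic edit tied to the new answer choice.
control_image = load_image("original_image_canny_edges.png")  # hypothetical path
prompt = "a street scene with a large, clearly legible speed limit sign"

perturbed = pipe(
    prompt,
    control_image=control_image,
    controlnet_conditioning_scale=0.6,
    num_inference_steps=28,
    guidance_scale=3.5,
).images[0]
perturbed.save("perturbed_image.png")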
We assume that preserving the original question and altering only the answer choice keeps the question difficulty comparable. Combined with examples like the one above, the perturbed dataset as a whole should be of similar or lower difficulty than the original dataset.
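Under this assumption, the detection rule reduces to comparing a model's accuracy on the original benchmark with its accuracy on the perturbed variant: since the perturbed set is no harder, a sharp drop points to memorization rather than genuine capability. A minimal sketch of this rule is shown below; the eval_accuracy helper and the drop threshold are illustrative placeholders, not values taken from the paper.

from typing import Callable, Sequence

def detect_contamination(
    model,
    original: Sequence[dict],
    perturbed: Sequence[dict],
    eval_accuracy: Callable[[object, Sequence[dict]], float],
    drop_threshold: float = 0.05,  # illustrative placeholder, not a value from the paper
) -> bool:
    """Flag a model as contaminated if its accuracy drops sharply on the
    semantically perturbed variant of a benchmark it may have memorized."""
    acc_original = eval_accuracy(model, original)
    acc_perturbed = eval_accuracy(model, perturbed)
    # The perturbed set is assumed to be of similar or lower difficulty,
    # so a clean model should score about as well on it as on the original.
    return (acc_original - acc_perturbed) > drop_threshold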
@misc{park2025contaminationdetectionvlmsusing,
title={Contamination Detection for VLMs using Multi-Modal Semantic Perturbation},
author={Jaden Park and Mu Cai and Feng Yao and Jingbo Shang and Soochahn Lee and Yong Jae Lee},
year={2025},
eprint={2511.03774},
archivePrefix={arXiv},
primaryClass={cs.LG},
url={https://arxiv.org/abs/2511.03774},
}
This website is adapted from Nerfies, licensed under a Creative Commons Attribution-ShareAlike 4.0 International License.