Contamination Detection for VLMs using Multi-Modal Semantic Perturbation

arXiv 2025
1 University of Wisconsin-Madison 2 University of California, San Diego 3 Kookmin University

We introduce Multi-modal Semantic Perturbation, a pipeline to create perturbed benchmarks that can be used to detect data contamination in VLMs.

Example of our multi-modal semantic perturbation pipeline applied to the RealWorldQA benchmark. Using Flux ControlNet, a new speed limit sign is generated, changing the correct answer from (B) to (C) while preserving the original image's overall composition. A contaminated model that has memorized the original question is likely to fail on the perturbed version.


The perturbation pipeline generates image-question pairs that keep the original image's composition intact while modifying it just enough to change the correct answer.

The perturbed benchmark has difficulty similar to or lower than the original, so clean models that truly generalize should perform comparably or better. However, we discover that contaminated models consistently underperform, showing dramatic performance drops of up to 45%.

Abstract

Recent advances in Vision–Language Models (VLMs) have achieved state-of-the-art performance on numerous benchmark tasks. However, the use of internet-scale, often proprietary, pretraining corpora raises a critical concern for both practitioners and users: inflated performance due to test-set leakage. While prior work has proposed mitigation strategies such as decontamination of pretraining data and benchmark redesign for LLMs, the complementary direction of developing detection methods for contaminated VLMs remains underexplored. To address this gap, we deliberately contaminate open-source VLMs on popular benchmarks and show that existing detection approaches either fail outright or exhibit inconsistent behavior. We then propose a novel simple yet effective detection method based on multi-modal semantic perturbation, demonstrating that contaminated models fail to generalize under controlled perturbations. Finally, we validate our approach across multiple contamination strategies, confirming its robustness and effectiveness.

Pipeline of Multi-modal Semantic Perturbation

  1. First, we randomly sample an answer choice that differs from the original answer, then feed the original image, the question, and the new answer to an LLM.
  2. The LLM generates a dense caption, which is used to prompt Flux ControlNet to produce a perturbed image consistent with the new answer while preserving the original image's overall composition.
  3. We manually verify the perturbed image-question pairs to ensure they are valid and that the answer has actually changed. This step can also be automated with a strong reasoning model such as o3. A sketch of the end-to-end pipeline is shown below.
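Below is a minimal Python sketch of the three steps above. The helper callables (llm_dense_caption, controlnet_generate, verify_pair) are hypothetical placeholders for the components described in the pipeline, not the authors' released code; in our main experiments these roles are filled by GPT-4o, Flux ControlNet, and manual filtering (or o3), respectively.

import random
from typing import Callable, Optional, Sequence, Tuple

def perturb_example(
    image,                          # original benchmark image
    question: str,                  # original question text (kept unchanged)
    choices: Sequence[str],         # multiple-choice options
    original_answer: str,           # ground-truth answer of the original pair
    llm_dense_caption: Callable,    # e.g. GPT-4o: (image, question, answer) -> caption
    controlnet_generate: Callable,  # e.g. Flux ControlNet: (prompt, control_image) -> image
    verify_pair: Callable,          # manual check, or a reasoning model such as o3
) -> Optional[Tuple]:
    """Produce a perturbed image whose correct answer differs from the original."""
    # Step 1: sample a different answer choice from the original options.
    new_answer = random.choice([c for c in choices if c != original_answer])

    # Step 2: the LLM writes a dense caption describing the scene edited so
    # that `new_answer` becomes the correct choice.
    caption = llm_dense_caption(image, question, new_answer)

    # Step 3: generate the perturbed image, conditioning on the original image
    # to preserve its overall composition.
    perturbed_image = controlnet_generate(prompt=caption, control_image=image)

    # Verification: keep only pairs whose answer has genuinely changed.
    if verify_pair(perturbed_image, question, new_answer):
        return perturbed_image, question, new_answer
    return None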

We verify that contaminated models consistently underperform on the perturbed benchmark across varying numbers of training epochs, training strategies, and model architectures.

Contamination Detection Results

We evaluate on RealWorldQA and MMStar, two popular VLM benchmarks whose questions strictly require the visual information in the image. Across varying numbers of training epochs, training strategies, and model architectures, multi-modal semantic perturbation reliably detects contamination.

Detection results of contaminated LLaVA-v1.5-7B and Qwen2-VL-7B on RealWorldQA. _P denotes the semantically perturbed variant. Accuracies are measured on the 440 images retained after manual filtering. Detected? indicates whether the model was flagged as contaminated.

Detection results of contaminated LLaVA-v1.5-7B and Qwen2-VL-7B on MMStar. _P denotes the semantically perturbed variant. Accuracies are measured on the 478 images retained after manual filtering. Detected? indicates whether the model was flagged as contaminated.

From the results, we observe that:

  1. Our method requires no manually tuned detection threshold, making it practical in the realistic setting where we do not know in advance which models are contaminated (see the sketch after this list).
  2. Our method is robust to a variety of realistic contamination strategies (e.g., models contaminated for only a single epoch!).
  3. The gap between performance on the original and perturbed variants is positively correlated with the amount of contamination, consistent with our hypothesis that contaminated models fail to generalize under controlled perturbations.
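As an illustration of observation 1, the sketch below shows one way to operationalize the threshold-free decision rule implied above: a model is flagged when its accuracy on the perturbed benchmark falls below its accuracy on the original, since the perturbed benchmark is of similar or lower difficulty for clean models. The exact criterion and numbers used in the paper are in Section 5; the code and values here are illustrative only.

from typing import Dict, Tuple

def detect_contamination(results: Dict[str, Tuple[float, float]]) -> Dict[str, bool]:
    """Map model name -> contamination flag.

    `results` maps each model to (accuracy on original, accuracy on perturbed).
    A drop on the perturbed variant flags the model; no tunable threshold.
    """
    return {name: acc_p < acc_o for name, (acc_o, acc_p) in results.items()}

# Illustrative numbers, not taken from the paper's tables.
print(detect_contamination({
    "clean-model":        (0.62, 0.66),  # improves on the similar-or-easier perturbed set
    "contaminated-model": (0.95, 0.60),  # large drop reveals memorization
}))
# {'clean-model': False, 'contaminated-model': True}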

For a full comparison to existing detection methods, please refer to Section 5 of our paper.

Why perturbations can generate easier variants

Example where the perturbed variant is easier to solve than the original question. In the original image, the traffic sign is small and the text barely legible; after perturbation, the sign is enlarged and clearly visible.

When Flux ControlNet generates the perturbed image, it often renders the salient visual cues more clearly than they appear in the original image.

We assume that preserving the original question and altering only the answer choice keeps the question difficulty comparable. Combined with examples like the one above, the perturbed dataset as a whole should have difficulty similar to or lower than that of the original dataset.

Ablation Studies

Real-world Counterfactuals.

We use NaturalBench, a dataset of natural adversarial counterfactual examples. We simulate our pipeline by training on one variant of each counterfactual pair and evaluating on the other.

We observe that clean models show similar performance on both variants, while contaminated models show dramatic performance drops of up to 45.58 percentage points (98.63% -> 53.05%). Note that this is a two-way multiple-choice benchmark, so a contaminated model with near-perfect accuracy on the leaked variant performs no better than random guessing on the unseen variant.
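A hypothetical sketch of this ablation is shown below: the model is deliberately contaminated on one variant of each counterfactual pair and then evaluated on both, with the gap between the seen and unseen variants serving as the detection signal. The `finetune` and `evaluate` helpers are placeholders, not a real training API.

def naturalbench_ablation(model, paired_examples, finetune, evaluate):
    """Simulate the perturbation pipeline with NaturalBench's natural counterfactuals."""
    # Each pair holds two natural counterfactual variants whose images differ
    # just enough to flip the correct answer.
    seen   = [pair["A"] for pair in paired_examples]  # leaked into training
    unseen = [pair["B"] for pair in paired_examples]  # held out

    contaminated = finetune(model, seen)         # deliberate contamination
    acc_seen   = evaluate(contaminated, seen)    # near-perfect when memorized
    acc_unseen = evaluate(contaminated, unseen)  # near chance (50% for two-way MC)
    return acc_seen - acc_unseen                 # large gap indicates contamination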

This result suggests that any reliable source of semantic variation, whether natural, procedural, or synthetic, fits our framework.

Our pipeline is modular.

For our main experiments, we use GPT-4o as the LLM and Flux ControlNet as the text-to-image model. However, we verify that our approach still works after replacing GPT-4o with Molmo-7B-D, and after replacing manual filtering with automated verification by a strong reasoning model such as o3.
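The sketch below illustrates this modularity under assumed, illustrative interfaces: each component of the pipeline is just a callable, so the captioning LLM, the image generator, and the verifier can each be swapped independently (GPT-4o vs. Molmo-7B-D, manual filtering vs. o3) without touching the rest of the pipeline.

from dataclasses import dataclass
from typing import Callable

@dataclass
class PerturbationPipeline:
    caption_llm: Callable      # GPT-4o in the main experiments, or Molmo-7B-D
    image_generator: Callable  # Flux ControlNet (or another controllable T2I model)
    verifier: Callable         # manual filtering, or a reasoning model such as o3

    def run(self, image, question, new_answer):
        caption = self.caption_llm(image, question, new_answer)
        perturbed = self.image_generator(prompt=caption, control_image=image)
        return perturbed if self.verifier(perturbed, question, new_answer) else None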

Our pipeline generalizes to pretraining and larger models.

We verify that our approach can detect contamination that occurs during the pretraining stage, when LLaVA-v1.5-7B is trained from scratch. We also verify that our approach extends to larger models like LLaVA-v1.5-13B.

For a full list of results on our ablation experiments, please refer to the Appendix of our paper.

BibTeX


@misc{park2025contaminationdetectionvlmsusing,
  title={Contamination Detection for VLMs using Multi-Modal Semantic Perturbation},
  author={Jaden Park and Mu Cai and Feng Yao and Jingbo Shang and Soochahn Lee and Yong Jae Lee},
  year={2025},
  eprint={2511.03774},
  archivePrefix={arXiv},
  primaryClass={cs.LG},
  url={https://arxiv.org/abs/2511.03774},
}
  

Acknowledgement

This website is adapted from Nerfies, licensed under a Creative Commons Attribution-ShareAlike 4.0 International License.