MVTamperBench: Evaluating Robustness of Vision-Language Models
Main authors: , , , , , , , ,
Format: Article
Language: English
Subjects:
Online access: Order full text
Summary: Recent advancements in Vision-Language Models (VLMs) have enabled significant progress in complex video understanding tasks. However, their robustness to real-world manipulations remains underexplored, limiting their reliability in critical applications. To address this gap, we introduce MVTamperBench, a comprehensive benchmark designed to evaluate VLMs' resilience to video tampering effects, including rotation, dropping, masking, substitution, and repetition. By systematically assessing state-of-the-art models, MVTamperBench reveals substantial variability in robustness: models such as InternVL2-8B achieve high performance, while others, such as Llama-VILA1.5-8B, exhibit severe vulnerabilities. To foster broader adoption and reproducibility, MVTamperBench is integrated into VLMEvalKit, a modular evaluation toolkit, enabling streamlined testing and facilitating advances in model robustness. Our benchmark represents a critical step towards developing tamper-resilient VLMs, ensuring their dependability in real-world scenarios.
Project Page: https://amitbcp.github.io/MVTamperBench/
DOI: 10.48550/arxiv.2412.19794
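
To make the five tampering effects named in the summary concrete, the following is a minimal illustrative sketch, not the authors' implementation, of how rotation, dropping, masking, substitution, and repetition could be applied to a clip held as a list of numpy frames. The function name `tamper`, the segment parameters, the 180-degree rotation, and the centered mask are all hypothetical choices made for this example.

```python
import numpy as np


def tamper(frames, effect, start=0, length=8, rng=None):
    """Return a tampered copy of `frames` (a list of H x W x 3 uint8 arrays)."""
    rng = rng if rng is not None else np.random.default_rng(0)
    frames = [f.copy() for f in frames]
    stop = min(start + length, len(frames))
    seg = slice(start, stop)

    if effect == "rotation":
        # Rotate every frame in the segment by 180 degrees (shape-preserving).
        for i in range(start, stop):
            frames[i] = np.rot90(frames[i], k=2)
    elif effect == "dropping":
        # Delete the segment entirely, shortening the clip.
        del frames[seg]
    elif effect == "masking":
        # Black out a centered rectangle in every frame of the segment.
        for i in range(start, stop):
            h, w = frames[i].shape[:2]
            frames[i][h // 4: 3 * h // 4, w // 4: 3 * w // 4] = 0
    elif effect == "substitution":
        # Replace the segment with random-noise frames of the same shape.
        for i in range(start, stop):
            frames[i] = rng.integers(0, 256, frames[i].shape, dtype=np.uint8)
    elif effect == "repetition":
        # Duplicate the segment immediately after itself.
        frames[stop:stop] = [f.copy() for f in frames[seg]]
    else:
        raise ValueError(f"unknown effect: {effect}")
    return frames
```

For instance, `tamper(frames, "masking", start=16)` would black out the center of frames 16-23 before the clip is handed to a VLM for evaluation; how the benchmark itself constructs and inserts tampered segments is described in the paper.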