Exploring What Why and How: A Multifaceted Benchmark for Causation Understanding of Video Anomaly
Saved in:
Main authors: | , , , , , , , , , |
---|---|
Format: | Article |
Language: | eng |
Subjects: | |
Online access: | Order full text |
Summary: | Recent advancements in video anomaly understanding (VAU) have opened the door
to groundbreaking applications in various fields, such as traffic monitoring
and industrial automation. While current benchmarks in VAU predominantly
emphasize the detection and localization of anomalies, here we endeavor to
delve deeper into the practical aspects of VAU by addressing the essential
questions: "what anomaly occurred?", "why did it happen?", and "how severe is
this abnormal event?". In pursuit of these answers, we introduce a
comprehensive benchmark for Exploring the Causation of Video Anomalies (ECVA).
Our benchmark is meticulously designed, with each video accompanied by detailed
human annotations. Specifically, each instance of our ECVA involves three sets
of human annotations to indicate "what", "why" and "how" of an anomaly,
including 1) anomaly type, start and end times, and event descriptions, 2)
natural language explanations for the cause of an anomaly, and 3) free text
reflecting the effect of the abnormality. Building upon this foundation, we
propose a novel prompt-based methodology that serves as a baseline for tackling
the intricate challenges posed by ECVA. We use "hard prompts" to guide the
model to focus on the critical parts related to video anomaly segments, and
"soft prompts" to establish temporal and spatial relationships within these
anomaly segments. Furthermore, we propose AnomEval, a specialized evaluation
metric crafted to align closely with human judgment criteria for ECVA. This
metric leverages the unique features of the ECVA dataset to provide a more
comprehensive and reliable assessment of various video large language models.
We demonstrate the efficacy of our approach through rigorous experimental
analysis and delineate possible avenues for further investigation into the
comprehension of video anomaly causation. |
---|---|
DOI: | 10.48550/arxiv.2412.07183 |