Exploring Explainability in Video Action Recognition
Format: | Article |
Language: | eng |
Online access: | Order full text |
Abstract: | Image Classification and Video Action Recognition are perhaps the two most
foundational tasks in computer vision. Consequently, explaining the inner
workings of trained deep neural networks is of prime importance. While numerous
efforts focus on explaining the decisions of trained deep neural networks in
image classification, exploration in the domain of its temporal version, video
action recognition, has been scant. In this work, we take a deeper look at this
problem. We begin by revisiting Grad-CAM, one of the popular feature
attribution methods for Image Classification, and its extension to Video Action
Recognition tasks and examine the method's limitations. To address these, we
introduce Video-TCAV, by building on TCAV for Image Classification tasks, which
aims to quantify the importance of specific concepts in the decision-making
process of Video Action Recognition models. As the scalable generation of
concepts is still an open problem, we propose a machine-assisted approach to
generate spatial and spatiotemporal concepts relevant to Video Action
Recognition for testing Video-TCAV. We then establish the importance of
temporally-varying concepts by demonstrating the superiority of dynamic
spatiotemporal concepts over trivial spatial concepts. In conclusion, we
introduce a framework for investigating hypotheses in action recognition and
quantitatively testing them, thus advancing research in the explainability of
deep neural networks used in video action recognition. |
DOI: | 10.48550/arxiv.2404.09067 |
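
For a concrete picture of the TCAV-style scoring the abstract refers to, the following is a minimal, illustrative sketch of computing a concept-importance score with a Concept Activation Vector (CAV). It is not the paper's Video-TCAV implementation; the helpers `layer_activations` and `layer_gradients` are hypothetical stand-ins for hooks into a video action-recognition model, and the shapes assume flattened per-clip activations.

```python
# Illustrative TCAV-style concept-importance sketch (not the paper's code).
# Assumes activations and gradients are flattened to [num_clips, num_features].
import numpy as np
from sklearn.linear_model import LogisticRegression

def train_cav(concept_acts, random_acts):
    """Fit a linear classifier separating concept from random activations;
    the CAV is the unit normal of its decision boundary."""
    X = np.concatenate([concept_acts, random_acts])
    y = np.concatenate([np.ones(len(concept_acts)), np.zeros(len(random_acts))])
    clf = LogisticRegression(max_iter=1000).fit(X, y)
    cav = clf.coef_[0]
    return cav / np.linalg.norm(cav)

def tcav_score(grads, cav):
    """Fraction of test clips whose class-logit gradient (taken w.r.t. the
    chosen layer's activations) has a positive component along the CAV,
    i.e. for which the concept pushes the prediction toward the target class."""
    directional_derivs = grads @ cav
    return float(np.mean(directional_derivs > 0))

# Usage sketch (hypothetical hooks into a video model):
# cav = train_cav(layer_activations(concept_clips), layer_activations(random_clips))
# score = tcav_score(layer_gradients(test_clips, target_action_class), cav)
```

In this sketch, a spatial concept would be represented by clips sharing a static visual pattern, while a spatiotemporal concept would be represented by clips sharing a motion pattern; comparing their scores is the kind of quantitative hypothesis test the abstract describes.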