Leveraging Temporal Contextualization for Video Action Recognition
Format: Article
Language: English
Online access: Order full text
Abstract: We propose a novel framework for video understanding, called Temporally
Contextualized CLIP (TC-CLIP), which leverages essential temporal information
through global interactions in a spatio-temporal domain within a video. To be
specific, we introduce Temporal Contextualization (TC), a layer-wise temporal
information infusion mechanism for videos, which 1) extracts core information
from each frame, 2) connects relevant information across frames for the
summarization into context tokens, and 3) leverages the context tokens for
feature encoding. Furthermore, the Video-conditional Prompting (VP) module
processes context tokens to generate informative prompts in the text modality.
Extensive experiments in zero-shot, few-shot, base-to-novel, and
fully-supervised action recognition validate the effectiveness of our model.
Ablation studies for TC and VP support our design choices. Our project page
with the source code is available at https://github.com/naver-ai/tc-clip
DOI: 10.48550/arxiv.2404.09490
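The three TC steps in the abstract (extract core tokens per frame, connect them across frames, summarize into context tokens) can be illustrated with a minimal NumPy sketch. This is not the paper's implementation: the saliency score (token norm) and the chunked-averaging summarization are simplifying assumptions standing in for the attention-based selection and summarization the method actually uses, and all names here are hypothetical.

```python
import numpy as np

def temporal_contextualization(frames, k=4, num_ctx=8):
    """Hypothetical sketch of the three TC steps.

    frames: (T, N, D) array of patch tokens for T video frames.
    Returns (num_ctx, D) context tokens that later layers could
    attend to alongside per-frame patch tokens.
    """
    T, N, D = frames.shape
    # Step 1: extract core information from each frame.
    # Saliency here is just the token L2 norm (a stand-in for
    # the attention-based informativeness score in the paper).
    norms = np.linalg.norm(frames, axis=-1)                    # (T, N)
    topk = np.argsort(-norms, axis=1)[:, :k]                   # top-k per frame
    core = np.take_along_axis(frames, topk[..., None], axis=1) # (T, k, D)
    # Step 2: connect the selected tokens across frames into one pool.
    pooled = core.reshape(T * k, D)
    # Step 3: summarize the pool into num_ctx context tokens
    # (chunked averaging as a simple placeholder summarizer).
    chunks = np.array_split(pooled, num_ctx, axis=0)
    return np.stack([c.mean(axis=0) for c in chunks])          # (num_ctx, D)

# Usage: 8 frames, 16 patch tokens each, 32-dim features.
rng = np.random.default_rng(0)
ctx = temporal_contextualization(rng.normal(size=(8, 16, 32)))
print(ctx.shape)  # (8, 32)
```

In the actual model these context tokens are infused layer-wise into the encoder and also feed the Video-conditional Prompting module on the text side; the sketch only shows the token-selection-and-summarization flow.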