Revisiting Kernel Temporal Segmentation as an Adaptive Tokenizer for Long-form Video Understanding
Saved in:
Main authors:
Format: Article
Language: eng
Subjects:
Online Access: Order full text
Abstract: While most modern video understanding models operate on short-range clips,
real-world videos are often several minutes long with semantically consistent
segments of variable length. A common approach to process long videos is
applying a short-form video model over uniformly sampled clips of fixed
temporal length and aggregating the outputs. This approach neglects the
underlying nature of long videos since fixed-length clips are often redundant
or uninformative. In this paper, we aim to provide a generic and adaptive
sampling approach for long-form videos in lieu of the de facto uniform
sampling. Viewing videos as semantically consistent segments, we formulate a
task-agnostic, unsupervised, and scalable approach based on Kernel Temporal
Segmentation (KTS) for sampling and tokenizing long videos. We evaluate our
method on long-form video understanding tasks such as video classification and
temporal action localization, showing consistent gains over existing approaches
and achieving state-of-the-art performance on long-form video modeling.
DOI: 10.48550/arxiv.2309.11569
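
To make the contrast with uniform fixed-length clips concrete, below is a minimal NumPy sketch of KTS-style change-point detection on per-frame features: within-segment scatter is measured under a kernel and a dynamic program picks the segment boundaries. The function names, the linear kernel, the fixed segment count, and the toy data are illustrative assumptions, not the paper's implementation.

```python
# Illustrative KTS-style sketch; not the paper's code.
import numpy as np

def segment_cost(K):
    """Within-segment scatter for every frame interval [i, j] of a Gram matrix K."""
    n = K.shape[0]
    diag_csum = np.concatenate([[0.0], np.cumsum(np.diag(K))])
    P = np.zeros((n + 1, n + 1))
    P[1:, 1:] = np.cumsum(np.cumsum(K, axis=0), axis=1)  # 2D prefix sums of K
    cost = np.zeros((n, n))
    for i in range(n):
        for j in range(i, n):
            block = P[j + 1, j + 1] - P[i, j + 1] - P[j + 1, i] + P[i, i]
            cost[i, j] = (diag_csum[j + 1] - diag_csum[i]) - block / (j - i + 1)
    return cost

def kts_change_points(features, n_segments):
    """Split (n_frames, dim) features into n_segments variance-homogeneous segments."""
    K = features @ features.T          # linear kernel; any PSD kernel could be used
    cost = segment_cost(K)
    n = K.shape[0]
    dp = np.full((n_segments + 1, n + 1), np.inf)
    back = np.zeros((n_segments + 1, n + 1), dtype=int)
    dp[0, 0] = 0.0
    for m in range(1, n_segments + 1):      # number of segments used so far
        for j in range(m, n + 1):           # frames 0..j-1 are covered
            for i in range(m - 1, j):       # last segment is frames i..j-1
                c = dp[m - 1, i] + cost[i, j - 1]
                if c < dp[m, j]:
                    dp[m, j] = c
                    back[m, j] = i
    # Recover the start index of every segment except the first.
    cps, j = [], n
    for m in range(n_segments, 0, -1):
        i = back[m, j]
        if m > 1:
            cps.append(i)
        j = i
    return sorted(cps)

# Toy usage: 2-D frame features with an obvious regime change at frame 50.
feats = np.vstack([np.random.randn(50, 2), np.random.randn(70, 2) + 5.0])
print(kts_change_points(feats, n_segments=2))   # expected to print roughly [50]
```

In the original KTS formulation the number of change points can also be selected automatically via a penalized criterion; it is fixed here for brevity. The resulting variable-length segments, rather than uniform fixed-length clips, would then be sampled and tokenized for the downstream video model.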