HERMES: temporal-coHERent long-forM understanding with Episodes and Semantics
Main Authors:
Format: Article
Language: English
Subjects:
Online Access: Order full text
Abstract: Existing research often treats long-form videos as extended short videos, leading to several limitations: inadequate capture of long-range dependencies, inefficient processing of redundant information, and failure to extract high-level semantic concepts. To address these issues, we propose a novel approach that more accurately reflects human cognition. This paper introduces HERMES: temporal-coHERent long-forM understanding with Episodes and Semantics, a model that simulates episodic memory accumulation to capture action sequences and reinforces them with semantic knowledge dispersed throughout the video. Our work makes two key contributions: First, we develop an Episodic COmpressor (ECO) that efficiently aggregates crucial representations from micro to semi-macro levels, overcoming the challenge of long-range dependencies. Second, we propose a Semantics ReTRiever (SeTR) that enhances these aggregated representations with semantic information by focusing on the broader context, dramatically reducing feature dimensionality while preserving relevant macro-level information. This addresses the issues of redundancy and lack of high-level concept extraction. Extensive experiments demonstrate that HERMES achieves state-of-the-art performance across multiple long-video understanding benchmarks in both zero-shot and fully-supervised settings.
DOI: 10.48550/arxiv.2408.17443
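
To make the two ideas named in the abstract concrete, the following is a minimal, hypothetical sketch: a bounded episodic memory that merges the most redundant neighbouring frame features (an ECO-like compressor) and a selector that keeps a handful of video-level features (a SeTR-like retriever). The function names, the merging rule, the similarity scoring, and all sizes are illustrative assumptions, not the authors' implementation.

```python
# Hypothetical sketch, not the paper's code: compress-then-retrieve over frame features.
import torch
import torch.nn.functional as F


def episodic_compress(frames: torch.Tensor, memory_size: int) -> torch.Tensor:
    """Stream frame features (T, D) into a bounded memory.

    Whenever the memory overflows, the two most similar adjacent entries are
    averaged, so redundant consecutive content is compressed while distinct
    "episodes" are kept in temporal order.
    """
    memory = []
    for frame in frames:
        memory.append(frame)
        if len(memory) > memory_size:
            mem = torch.stack(memory)                      # (M, D)
            sims = F.cosine_similarity(mem[:-1], mem[1:])  # adjacent pairs, (M-1,)
            i = int(torch.argmax(sims))                    # most redundant pair
            merged = (memory[i] + memory[i + 1]) / 2
            memory = memory[:i] + [merged] + memory[i + 2:]
    return torch.stack(memory)                             # (memory_size, D)


def retrieve_semantics(frames: torch.Tensor, k: int) -> torch.Tensor:
    """Keep the k frame features closest to the global (video-level) mean,
    a crude stand-in for retrieving broad semantic context."""
    video_level = frames.mean(dim=0, keepdim=True)         # (1, D)
    scores = F.cosine_similarity(frames, video_level)      # (T,)
    return frames[torch.topk(scores, k).indices]           # (k, D)


if __name__ == "__main__":
    feats = torch.randn(512, 256)                          # 512 frames, 256-dim features
    episodes = episodic_compress(feats, memory_size=32)    # (32, 256)
    semantics = retrieve_semantics(feats, k=8)             # (8, 256)
    print(episodes.shape, semantics.shape)
```

In the paper, both modules operate on learned token representations inside a video-language model; this sketch only mirrors the overall flow of bounding episodic memory and then enriching it with a small set of video-level semantic features.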