Open-Vocabulary Action Localization with Iterative Visual Prompting
Saved in:

Main Authors:
Format: Article
Language: English
Subjects:
Online Access: Order full text
Abstract: Video action localization aims to find the timings of specific actions from a long video. Although existing learning-based approaches have been successful, they require annotating videos, which comes with a considerable labor cost. This paper proposes a learning-free, open-vocabulary approach based on emerging off-the-shelf vision-language models (VLMs). The challenge stems from the fact that VLMs are neither designed to process long videos nor tailored to finding actions. We overcome these problems by extending an iterative visual prompting technique. Specifically, we sample video frames and create a concatenated image with frame index labels, having a VLM guess the frame that is closest to the start and end of the action. Iterating this process while narrowing the sampling time window identifies the specific frames corresponding to the start and end of an action. We demonstrate that this technique yields reasonable performance, achieving results comparable to state-of-the-art zero-shot action localization. These results illustrate a practical extension of VLMs for understanding videos. Sample code is available at https://microsoft.github.io/VLM-Video-Action-Localization/.
DOI: 10.48550/arxiv.2408.17422
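
The abstract describes an iterative procedure: sample frames over a time window, tile them into one image with index labels, ask a VLM which labeled frame is closest to the action boundary, then narrow the window around that frame and repeat. Below is a minimal Python sketch of that idea, not the authors' implementation (see the linked sample code for that). The `ask_vlm` function is a hypothetical placeholder for an off-the-shelf VLM call, and the frame count, tiling layout, and narrowing schedule are illustrative assumptions.

```python
# Minimal sketch of the iterative visual prompting idea summarized in the
# abstract. `ask_vlm` is a hypothetical placeholder for an off-the-shelf VLM
# call; frame counts, tiling, and the narrowing schedule are illustrative.
import cv2
import numpy as np


def sample_frames(video_path, t_start, t_end, num_frames=8):
    """Uniformly sample frames between t_start and t_end (in seconds)."""
    cap = cv2.VideoCapture(video_path)
    times, frames = [], []
    for t in np.linspace(t_start, t_end, num_frames):
        cap.set(cv2.CAP_PROP_POS_MSEC, float(t) * 1000.0)
        ok, frame = cap.read()
        if ok:
            times.append(float(t))
            frames.append(cv2.resize(frame, (256, 256)))
    cap.release()
    return times, frames


def concatenate_with_labels(frames):
    """Tile the frames side by side and draw an index label on each tile."""
    labeled = []
    for i, frame in enumerate(frames):
        tile = frame.copy()
        cv2.putText(tile, str(i), (10, 40), cv2.FONT_HERSHEY_SIMPLEX,
                    1.2, (0, 0, 255), 3)
        labeled.append(tile)
    return np.hstack(labeled)


def ask_vlm(image, action, boundary):
    """Hypothetical VLM query: return the index of the labeled frame the model
    judges closest to the `boundary` ('start' or 'end') of `action`."""
    raise NotImplementedError("Replace with a call to an off-the-shelf VLM.")


def localize_boundary(video_path, action, boundary, t_start, t_end,
                      num_frames=8, iterations=3):
    """Iteratively narrow the sampling window around the frame chosen by the VLM."""
    for _ in range(iterations):
        times, frames = sample_frames(video_path, t_start, t_end, num_frames)
        image = concatenate_with_labels(frames)
        idx = ask_vlm(image, action, boundary)
        # Zoom in: resample between the neighbors of the selected frame.
        lo, hi = max(idx - 1, 0), min(idx + 1, len(times) - 1)
        t_start, t_end = times[lo], times[hi]
    return 0.5 * (t_start + t_end)
```

Calling `localize_boundary(video, "pouring water", "start", 0.0, duration)` and the same call with `"end"` would give estimated start and end times for the action; the window shrinks around the VLM's choice at each iteration, which is the narrowing step the abstract refers to.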