AVSegFormer: Audio-Visual Segmentation with Transformer
Saved in:
Main authors:
Format: Article
Language: eng
Subjects:
Online access: Order full text
Summary: The combination of audio and vision has long been a topic of interest in the
multi-modal community. Recently, a new audio-visual segmentation (AVS) task has been
introduced, aiming to locate and segment the sounding objects in a given video. This
task demands audio-driven pixel-level scene understanding for the first time, posing
significant challenges. In this paper, we propose AVSegFormer, a novel framework for
AVS tasks that leverages the transformer architecture. Specifically, we introduce
audio queries and learnable queries into the transformer decoder, enabling the network
to selectively attend to visual features of interest. In addition, we present an
audio-visual mixer, which can dynamically adjust visual features by amplifying
relevant and suppressing irrelevant spatial channels. We also devise an intermediate
mask loss to enhance the supervision of the decoder, encouraging the network to
produce more accurate intermediate predictions. Extensive experiments demonstrate that
AVSegFormer achieves state-of-the-art results on the AVS benchmark. The code is
available at https://github.com/vvvb-github/AVSegFormer.
DOI: 10.48550/arxiv.2307.01146
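
To make the abstract's description of the audio-visual mixer more concrete, the following is a minimal sketch of the general idea: an audio embedding produces per-channel weights that amplify relevant and suppress irrelevant channels of the visual feature map. This is not the authors' implementation (see the linked repository for that); the class, parameter names, and feature dimensions are illustrative assumptions.

```python
# Minimal sketch of audio-conditioned channel re-weighting, in the spirit of the
# audio-visual mixer described in the abstract. Names and sizes are assumptions,
# not AVSegFormer's actual modules.
import torch
import torch.nn as nn


class AudioVisualMixerSketch(nn.Module):
    def __init__(self, audio_dim: int, visual_channels: int):
        super().__init__()
        # Map the audio embedding to one gate per visual channel.
        self.gate = nn.Sequential(
            nn.Linear(audio_dim, visual_channels),
            nn.Sigmoid(),  # gates in (0, 1): amplify relevant, suppress irrelevant channels
        )

    def forward(self, visual: torch.Tensor, audio: torch.Tensor) -> torch.Tensor:
        # visual: (B, C, H, W) feature map; audio: (B, audio_dim) embedding
        weights = self.gate(audio)                 # (B, C) channel gates
        return visual * weights[:, :, None, None]  # channel-wise re-weighting


if __name__ == "__main__":
    mixer = AudioVisualMixerSketch(audio_dim=128, visual_channels=256)
    video_feat = torch.randn(2, 256, 28, 28)
    audio_feat = torch.randn(2, 128)
    print(mixer(video_feat, audio_feat).shape)  # torch.Size([2, 256, 28, 28])
```

The gating design is only one plausible reading of "amplifying relevant and suppressing irrelevant spatial channels"; the paper's mixer may instead use attention between audio and visual tokens.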