How Does Audio Influence Visual Attention in Omnidirectional Videos? Database and Model
Saved in:
Main authors: , , , , , , ,
Format: Article
Language: English
Subjects:
Online access: Order full text
Summary: Understanding and predicting viewer attention in omnidirectional videos (ODVs) is crucial for enhancing user engagement in virtual and augmented reality applications. Although both audio and visual modalities are essential for saliency prediction in ODVs, the joint exploitation of these two modalities has been limited, primarily due to the absence of large-scale audio-visual saliency databases and comprehensive analyses. This paper comprehensively investigates audio-visual attention in ODVs from both subjective and objective perspectives. Specifically, we first introduce a new audio-visual saliency database for omnidirectional videos, termed the AVS-ODV database, containing 162 ODVs and corresponding eye movement data collected from 60 subjects under three audio modes: mute, mono, and ambisonics. Based on the constructed AVS-ODV database, we perform an in-depth analysis of how audio influences visual attention in ODVs. To advance research on audio-visual saliency prediction for ODVs, we further establish a new benchmark based on the AVS-ODV database by testing numerous state-of-the-art saliency models, including visual-only and audio-visual models. In addition, given the limitations of current models, we propose an innovative omnidirectional audio-visual saliency prediction network (OmniAVS), which is built on the U-Net architecture and hierarchically fuses audio and visual features from a multimodal aligned embedding space. Extensive experimental results demonstrate that the proposed OmniAVS model outperforms other state-of-the-art models on both ODV AVS prediction and traditional AVS prediction tasks. The AVS-ODV database and OmniAVS model will be released to facilitate future research.
DOI: 10.48550/arxiv.2408.05411
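
The abstract describes OmniAVS only at a high level: a U-Net backbone that hierarchically fuses audio and visual features drawn from a multimodal aligned embedding space. The sketch below illustrates what such hierarchical fusion could look like in PyTorch. It is a minimal sketch under stated assumptions, not the authors' implementation: the module names (`AudioVisualUNet`, `ConvBlock`), the channel widths, the additive fusion of a projected audio vector at each encoder scale, and the availability of a precomputed audio embedding from a shared embedding space are all hypothetical choices for exposition.

```python
# Illustrative sketch only: the paper's actual OmniAVS layers, feature
# extractors, and fusion mechanism are not given in this record, so every
# module name and dimension below is an assumption for demonstration.
import torch
import torch.nn as nn
import torch.nn.functional as F


class ConvBlock(nn.Module):
    """Two 3x3 convolutions with ReLU, the basic U-Net building block."""
    def __init__(self, in_ch, out_ch):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(in_ch, out_ch, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(out_ch, out_ch, 3, padding=1), nn.ReLU(inplace=True),
        )

    def forward(self, x):
        return self.net(x)


class AudioVisualUNet(nn.Module):
    """U-Net-style saliency network that injects an audio embedding at each
    encoder scale (one reading of "hierarchical fusion"). Dimensions are
    illustrative, not taken from the paper."""
    def __init__(self, audio_dim=128, base_ch=32):
        super().__init__()
        chs = [base_ch, base_ch * 2, base_ch * 4, base_ch * 8]
        # Visual encoder: four stages separated by 2x downsampling.
        self.enc = nn.ModuleList(
            [ConvBlock(3, chs[0])]
            + [ConvBlock(chs[i], chs[i + 1]) for i in range(3)])
        self.pool = nn.MaxPool2d(2)
        # Project the (assumed precomputed) audio embedding to each scale.
        self.audio_proj = nn.ModuleList(
            [nn.Linear(audio_dim, c) for c in chs])
        # Decoder: upsample and fuse with encoder skip connections.
        self.dec = nn.ModuleList(
            [ConvBlock(chs[i + 1] + chs[i], chs[i])
             for i in reversed(range(3))])
        self.head = nn.Conv2d(chs[0], 1, 1)  # single-channel saliency map

    def forward(self, frame, audio_emb):
        # frame: (B, 3, H, W) equirectangular ODV frame
        # audio_emb: (B, audio_dim) embedding from a shared audio-visual space
        feats, x = [], frame
        for i, enc in enumerate(self.enc):
            x = enc(x)
            # Hierarchical fusion: add the projected audio vector as a
            # per-channel bias at this scale.
            x = x + self.audio_proj[i](audio_emb)[:, :, None, None]
            feats.append(x)
            if i < len(self.enc) - 1:
                x = self.pool(x)
        for dec, skip in zip(self.dec, reversed(feats[:-1])):
            x = F.interpolate(x, scale_factor=2, mode="bilinear",
                              align_corners=False)
            x = dec(torch.cat([x, skip], dim=1))
        return torch.sigmoid(self.head(x))  # predicted saliency in [0, 1]


# Smoke test with dummy inputs.
model = AudioVisualUNet()
sal = model(torch.randn(1, 3, 128, 256), torch.randn(1, 128))
print(sal.shape)  # torch.Size([1, 1, 128, 256])
```

Broadcasting the projected audio vector as a per-channel bias is only one simple fusion choice; scale-wise cross-attention between audio and visual features would be an equally plausible reading of the abstract's "hierarchically fuses" phrasing.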