How Does Audio Influence Visual Attention in Omnidirectional Videos? Database and Model

Understanding and predicting viewer attention in omnidirectional videos (ODVs) is crucial for enhancing user engagement in virtual and augmented reality applications. Although both audio and visual modalities are essential for saliency prediction in ODVs, the joint exploitation of these two modalities has been limited, primarily due to the absence of large-scale audio-visual saliency databases and comprehensive analyses. This paper comprehensively investigates audio-visual attention in ODVs from both subjective and objective perspectives. Specifically, we first introduce a new audio-visual saliency database for omnidirectional videos, termed the AVS-ODV database, containing 162 ODVs and corresponding eye movement data collected from 60 subjects under three audio modes: mute, mono, and ambisonics. Based on the constructed AVS-ODV database, we perform an in-depth analysis of how audio influences visual attention in ODVs. To advance research on audio-visual saliency prediction for ODVs, we further establish a new benchmark on the AVS-ODV database by testing numerous state-of-the-art saliency models, including visual-only and audio-visual models. In addition, given the limitations of current models, we propose an omnidirectional audio-visual saliency prediction network (OmniAVS), built on the U-Net architecture, which hierarchically fuses audio and visual features from a multimodal aligned embedding space. Extensive experimental results demonstrate that the proposed OmniAVS model outperforms other state-of-the-art models on both ODV AVS prediction and traditional AVS prediction tasks. The AVS-ODV database and OmniAVS model will be released to facilitate future research.
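To make the architecture described in the abstract concrete, below is a minimal PyTorch sketch of what "hierarchically fusing" an audio embedding into a U-Net saliency decoder could look like: a global audio vector is projected and broadcast at each decoder scale, then concatenated with the upsampled visual features and the encoder skip connection. This is not the authors' released OmniAVS code; the module names, channel widths, concatenation-based fusion, and the 128-dimensional audio embedding are all illustrative assumptions.

    # Illustrative sketch only (NOT the OmniAVS implementation):
    # a U-Net-style saliency decoder that fuses a global audio
    # embedding with visual features at every decoder scale.
    import torch
    import torch.nn as nn


    class ConvBlock(nn.Module):
        def __init__(self, in_ch: int, out_ch: int):
            super().__init__()
            self.block = nn.Sequential(
                nn.Conv2d(in_ch, out_ch, 3, padding=1),
                nn.ReLU(inplace=True),
                nn.Conv2d(out_ch, out_ch, 3, padding=1),
                nn.ReLU(inplace=True),
            )

        def forward(self, x):
            return self.block(x)


    class AudioVisualUNet(nn.Module):
        """U-Net that injects an audio embedding at each decoder level."""

        def __init__(self, audio_dim: int = 128):
            super().__init__()
            self.enc1 = ConvBlock(3, 32)
            self.enc2 = ConvBlock(32, 64)
            self.enc3 = ConvBlock(64, 128)
            self.pool = nn.MaxPool2d(2)
            self.up = nn.Upsample(scale_factor=2, mode="bilinear",
                                  align_corners=False)
            # One projection of the audio embedding per decoder scale.
            self.audio_proj2 = nn.Linear(audio_dim, 64)
            self.audio_proj1 = nn.Linear(audio_dim, 32)
            # Decoder blocks see [upsampled; skip; audio] channels.
            self.dec2 = ConvBlock(128 + 64 + 64, 64)
            self.dec1 = ConvBlock(64 + 32 + 32, 32)
            self.head = nn.Conv2d(32, 1, 1)  # one-channel saliency logits

        @staticmethod
        def _broadcast(a: torch.Tensor, like: torch.Tensor) -> torch.Tensor:
            # Tile a (B, C) audio vector to a (B, C, H, W) feature map.
            return a[:, :, None, None].expand(-1, -1, like.shape[2],
                                              like.shape[3])

        def forward(self, frame: torch.Tensor, audio_emb: torch.Tensor):
            e1 = self.enc1(frame)          # (B, 32,  H,   W)
            e2 = self.enc2(self.pool(e1))  # (B, 64,  H/2, W/2)
            e3 = self.enc3(self.pool(e2))  # (B, 128, H/4, W/4)
            # Hierarchical fusion: concatenate audio at each decoder scale.
            a2 = self._broadcast(self.audio_proj2(audio_emb), e2)
            d2 = self.dec2(torch.cat([self.up(e3), e2, a2], dim=1))
            a1 = self._broadcast(self.audio_proj1(audio_emb), e1)
            d1 = self.dec1(torch.cat([self.up(d2), e1, a1], dim=1))
            return torch.sigmoid(self.head(d1))  # per-pixel saliency map


    # Smoke test with dummy data.
    model = AudioVisualUNet(audio_dim=128)
    frame = torch.randn(2, 3, 64, 128)   # e.g. an equirectangular frame
    audio = torch.randn(2, 128)          # e.g. an ambisonics embedding
    print(model(frame, audio).shape)     # torch.Size([2, 1, 64, 128])

The point illustrated is that fusion happens at every decoder resolution rather than once at the bottleneck, which is one plausible reading of "hierarchically fuses audio and visual features"; the paper's actual fusion mechanism operates on a multimodal aligned embedding space and may differ.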

Bibliographic Details
Main Authors: Zhu, Yuxin; Duan, Huiyu; Zhang, Kaiwei; Zhu, Yucheng; Zhu, Xilei; Teng, Long; Min, Xiongkuo; Zhai, Guangtao
Format: Article
Language: English
Subjects: Computer Science - Computer Vision and Pattern Recognition
Published: 2024-08-09
DOI: 10.48550/arXiv.2408.05411
Source: arXiv.org
Online Access: https://arxiv.org/abs/2408.05411