How Does Audio Influence Visual Attention in Omnidirectional Videos? Database and Model

Understanding and predicting viewer attention in omnidirectional videos (ODVs) is crucial for enhancing user engagement in virtual and augmented reality applications. Although both audio and visual modalities are essential for saliency prediction in ODVs, the joint exploitation of these two modalities has been limited, primarily due to the absence of large-scale audio-visual saliency databases and comprehensive analyses. This paper comprehensively investigates audio-visual attention in ODVs from both subjective and objective perspectives. Specifically, we first introduce a new audio-visual saliency database for omnidirectional videos, termed the AVS-ODV database, containing 162 ODVs and corresponding eye movement data collected from 60 subjects under three audio modes: mute, mono, and ambisonics. Based on the constructed AVS-ODV database, we perform an in-depth analysis of how audio influences visual attention in ODVs. To advance research on audio-visual saliency prediction for ODVs, we further establish a new benchmark on the AVS-ODV database by testing numerous state-of-the-art saliency models, including visual-only and audio-visual models. In addition, given the limitations of current models, we propose an omnidirectional audio-visual saliency prediction network (OmniAVS), built on the U-Net architecture, which hierarchically fuses audio and visual features from a multimodal aligned embedding space. Extensive experimental results demonstrate that the proposed OmniAVS model outperforms other state-of-the-art models on both ODV AVS prediction and traditional AVS prediction tasks. The AVS-ODV database and OmniAVS model will be released to facilitate future research.
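To make the architecture described in the abstract concrete, below is a minimal PyTorch sketch of what "hierarchically fusing" an audio embedding into a U-Net saliency decoder could look like: a global audio vector is projected and broadcast at each decoder scale, then concatenated with the upsampled visual features and the encoder skip connection. This is not the authors' released OmniAVS code; the module names, channel widths, concatenation-based fusion, and the 128-dimensional audio embedding are all illustrative assumptions.

    # Illustrative sketch only (NOT the OmniAVS implementation):
    # a U-Net-style saliency decoder that fuses a global audio
    # embedding with visual features at every decoder scale.
    import torch
    import torch.nn as nn


    class ConvBlock(nn.Module):
        def __init__(self, in_ch: int, out_ch: int):
            super().__init__()
            self.block = nn.Sequential(
                nn.Conv2d(in_ch, out_ch, 3, padding=1),
                nn.ReLU(inplace=True),
                nn.Conv2d(out_ch, out_ch, 3, padding=1),
                nn.ReLU(inplace=True),
            )

        def forward(self, x):
            return self.block(x)


    class AudioVisualUNet(nn.Module):
        """U-Net that injects an audio embedding at each decoder level."""

        def __init__(self, audio_dim: int = 128):
            super().__init__()
            self.enc1 = ConvBlock(3, 32)
            self.enc2 = ConvBlock(32, 64)
            self.enc3 = ConvBlock(64, 128)
            self.pool = nn.MaxPool2d(2)
            self.up = nn.Upsample(scale_factor=2, mode="bilinear",
                                  align_corners=False)
            # One projection of the audio embedding per decoder scale.
            self.audio_proj2 = nn.Linear(audio_dim, 64)
            self.audio_proj1 = nn.Linear(audio_dim, 32)
            # Decoder blocks see [upsampled; skip; audio] channels.
            self.dec2 = ConvBlock(128 + 64 + 64, 64)
            self.dec1 = ConvBlock(64 + 32 + 32, 32)
            self.head = nn.Conv2d(32, 1, 1)  # one-channel saliency logits

        @staticmethod
        def _broadcast(a: torch.Tensor, like: torch.Tensor) -> torch.Tensor:
            # Tile a (B, C) audio vector to a (B, C, H, W) feature map.
            return a[:, :, None, None].expand(-1, -1, like.shape[2],
                                              like.shape[3])

        def forward(self, frame: torch.Tensor, audio_emb: torch.Tensor):
            e1 = self.enc1(frame)          # (B, 32,  H,   W)
            e2 = self.enc2(self.pool(e1))  # (B, 64,  H/2, W/2)
            e3 = self.enc3(self.pool(e2))  # (B, 128, H/4, W/4)
            # Hierarchical fusion: concatenate audio at each decoder scale.
            a2 = self._broadcast(self.audio_proj2(audio_emb), e2)
            d2 = self.dec2(torch.cat([self.up(e3), e2, a2], dim=1))
            a1 = self._broadcast(self.audio_proj1(audio_emb), e1)
            d1 = self.dec1(torch.cat([self.up(d2), e1, a1], dim=1))
            return torch.sigmoid(self.head(d1))  # per-pixel saliency map


    # Smoke test with dummy data.
    model = AudioVisualUNet(audio_dim=128)
    frame = torch.randn(2, 3, 64, 128)   # e.g. an equirectangular frame
    audio = torch.randn(2, 128)          # e.g. an ambisonics embedding
    print(model(frame, audio).shape)     # torch.Size([2, 1, 64, 128])

The point illustrated is that fusion happens at every decoder resolution rather than once at the bottleneck, which is one plausible reading of "hierarchically fuses audio and visual features"; the paper's actual fusion mechanism operates on a multimodal aligned embedding space and may differ.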

Bibliographic Details
Main Authors: Zhu, Yuxin; Duan, Huiyu; Zhang, Kaiwei; Zhu, Yucheng; Zhu, Xilei; Teng, Long; Min, Xiongkuo; Zhai, Guangtao
Format: Article
Language: English
Subjects: Computer Science - Computer Vision and Pattern Recognition
Published: 2024-08-09
DOI: 10.48550/arXiv.2408.05411
Source: arXiv.org
Online Access: https://arxiv.org/abs/2408.05411