Temporal video scene segmentation using deep-learning


Detailed description

Saved in:
Bibliographic details
Published in: Multimedia tools and applications, 2021-05, Vol. 80 (12), p. 17487-17513
Main authors: Trojahn, Tiago Henrique; Goularte, Rudinei
Format: Article
Language: English
Subjects:
Online access: Full text
description Automatic temporal video scene segmentation (also known as video story segmentation) is still an open problem without a definitive solution in most cases. Among the available techniques, the ones that show the best results are multimodal, using features extracted from multiple modalities. Multimodal fusion may be performed by fusing all modalities into a single representation (early fusion) or by combining the segmentation produced for each modality (late fusion), the latter being widely used due to its simplicity. Recently, deep learning techniques such as convolutional neural networks (CNNs) have been successfully employed to extract features from multiple data sources, easing the development of early fusion methods. However, CNNs cannot adequately learn cues that are temporally distributed along the video because of difficulties in modeling the temporal dependencies among features. A particular deep learning approach that can learn such cues is the recurrent neural network (RNN). Successfully employed in text processing, RNNs are suited to analyzing sequences of variable length and may better capture the temporal relationships among low-level features of video segments, potentially yielding more accurate scene boundary detection. This paper goes beyond directly applying RNNs and proposes a new multimodal approach to temporally segmenting a video into scenes. The approach builds a new architecture that carefully combines CNN and RNN capabilities, obtaining better efficacy on the task when compared with related techniques on a public video dataset.
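The description contrasts early fusion (concatenating per-modality features into a single representation before classification) with late fusion (segmenting each modality independently and combining the per-modality decisions). A minimal sketch of the two strategies, using hypothetical per-shot feature lists and a simple score average in place of the paper's actual combiner:

```python
def early_fusion(visual, audio):
    # Early fusion: concatenate per-modality feature vectors into a
    # single representation for each shot before any classification.
    return [v + a for v, a in zip(visual, audio)]

def late_fusion(visual_scores, audio_scores, threshold=0.5):
    # Late fusion: each modality is segmented independently; the
    # per-modality boundary scores are then combined (here: averaged)
    # and thresholded into boundary decisions.
    return [(v + a) / 2.0 > threshold
            for v, a in zip(visual_scores, audio_scores)]

# Hypothetical per-shot features: 3 shots, 4-dim visual, 2-dim audio.
visual = [[0.1, 0.2, 0.3, 0.4], [0.5, 0.6, 0.7, 0.8], [0.9, 1.0, 1.1, 1.2]]
audio = [[0.0, 1.0], [0.5, 0.5], [1.0, 0.0]]
fused = early_fusion(visual, audio)   # each shot now has 6 features

boundaries = late_fusion([0.9, 0.2, 0.7], [0.8, 0.3, 0.2])
# boundaries -> [True, False, False]
```

The dimensions, scores, and averaging rule are illustrative only; the paper's architecture performs the fusion with learned CNN/RNN components.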
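The description's central claim — that a recurrent network over per-shot features can capture temporally distributed cues a per-shot extractor misses — can be illustrated with a toy single-unit RNN in pure Python. The weights and 1-D features are arbitrary, untrained placeholders, not the authors' model:

```python
import math

def rnn_boundary_scores(features, w_in=2.0, w_rec=1.0, w_out=3.0):
    # Toy single-hidden-unit recurrent network: the hidden state h
    # carries information from earlier shots forward, which a purely
    # per-shot (CNN-style) feature extractor cannot do on its own.
    h = 0.0
    scores = []
    for x in features:
        h = math.tanh(w_in * x + w_rec * h)                # recurrent update
        scores.append(1.0 / (1.0 + math.exp(-w_out * h)))  # boundary prob.
    return scores

# Hypothetical 1-D per-shot features, fed in two different orders.
forward = rnn_boundary_scores([0.2, 0.9, 0.1, 0.8])
backward = rnn_boundary_scores([0.8, 0.1, 0.9, 0.2])
# The same feature value (0.8) gets a different score depending on
# its temporal context: the model is order-sensitive.
```

This order sensitivity is exactly the property the paper exploits when it stacks an RNN on top of CNN-extracted features to detect scene boundaries.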
doi 10.1007/s11042-020-10450-2
identifier ISSN: 1380-7501
identifier EISSN: 1573-7721
source SpringerLink Journals - AutoHoldings
subjects Artificial neural networks
Computer Communication Networks
Computer Science
Data Structures and Information Theory
Deep learning
Feature extraction
Multimedia Information Systems
Neural networks
Recurrent neural networks
Segmentation
Segments
Special Purpose and Application-Based Systems
title Temporal video scene segmentation using deep-learning