Voxel-based Viterbi Active Speaker Tracking (V-VAST) with best view selection for video lecture post-production

An automated system is presented for reducing a multi-view lecture recording into a single view video containing a best view summary of active speakers. The system uses skin color detection and voxel-based analysis in locating likely speaker locations. Using time-delay estimates from multiple micro...

Bibliographic details
Main authors: Kelly, Damien; Kokaram, Anil; Boland, Frank
Format: Conference proceeding
Language: eng
container_end_page 2299
container_issue
container_start_page 2296
container_title
container_volume
creator Kelly, Damien
Kokaram, Anil
Boland, Frank
description An automated system is presented for reducing a multi-view lecture recording to a single-view video containing a best-view summary of the active speakers. The system uses skin color detection and voxel-based analysis to locate likely speaker positions. Using time-delay estimates from multiple microphones, speech activity is analyzed for each speaker position. The Viterbi algorithm is then used to estimate a track of the active speaker that maximizes the observed speech activity. This novel approach is termed Voxel-based Viterbi Active Speaker Tracking (V-VAST) and is shown to track speakers with an accuracy of 0.23 m. Using the tracking information, the system then extracts, from the available camera views, the most frontal face view of the active speaker for display.
doi_str_mv 10.1109/ICASSP.2011.5946941
format Conference Proceeding
fulltext fulltext_linktorsrc
identifier ISSN: 1520-6149
ispartof 2011 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2011, p.2296-2299
issn 1520-6149
2379-190X
language eng
recordid cdi_ieee_primary_5946941
source IEEE Electronic Library (IEL) Conference Proceedings
subjects Audio-Visual Tracking
Cameras
Face
Microphones
Multi-camera
Multi-microphone
Skin
Speech
Three dimensional displays
Time-Delay Estimation
Viterbi
Viterbi algorithm
title Voxel-based Viterbi Active Speaker Tracking (V-VAST) with best view selection for video lecture post-production
url https://sfx.bib-bvb.de/sfx_tum?ctx_ver=Z39.88-2004&ctx_enc=info:ofi/enc:UTF-8&ctx_tim=2025-02-10T05%3A44%3A12IST&url_ver=Z39.88-2004&url_ctx_fmt=infofi/fmt:kev:mtx:ctx&rfr_id=info:sid/primo.exlibrisgroup.com:primo3-Article-ieee_6IE&rft_val_fmt=info:ofi/fmt:kev:mtx:book&rft.genre=proceeding&rft.atitle=Voxel-based%20Viterbi%20Active%20Speaker%20Tracking%20(V-VAST)%20with%20best%20view%20selection%20for%20video%20lecture%20post-production&rft.btitle=2011%20IEEE%20International%20Conference%20on%20Acoustics,%20Speech%20and%20Signal%20Processing%20(ICASSP)&rft.au=Kelly,%20Damien&rft.date=2011-05&rft.spage=2296&rft.epage=2299&rft.pages=2296-2299&rft.issn=1520-6149&rft.eissn=2379-190X&rft.isbn=9781457705380&rft.isbn_list=1457705389&rft_id=info:doi/10.1109/ICASSP.2011.5946941&rft_dat=%3Cieee_6IE%3E5946941%3C/ieee_6IE%3E%3Curl%3E%3C/url%3E&rft.eisbn=1457705397&rft.eisbn_list=9781457705373&rft.eisbn_list=9781457705397&rft.eisbn_list=1457705370&disable_directlink=true&sfx.directlink=off&sfx.report_link=0&rft_id=info:oai/&rft_id=info:pmid/&rft_ieee_id=5946941&rfr_iscdi=true
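
The Viterbi step mentioned in the description can be illustrated with a minimal sketch. This is not the authors' implementation: the record does not give the paper's state space, scoring, or transition model, so the function name `viterbi_track`, the per-frame `activity` scores, the fixed list of candidate `positions`, and the distance-based `move_penalty` are all assumptions made for illustration. The sketch picks one candidate position per frame so that the summed speech-activity score, minus a penalty for moving between frames, is as large as possible.

```python
import math


def viterbi_track(activity, positions, move_penalty=1.0):
    """Return one candidate index per frame, chosen by dynamic programming.

    All inputs are hypothetical stand-ins, not quantities from the paper:
      activity[t][k]  speech-activity score of candidate k at frame t
      positions[k]    (x, y, z) coordinates of candidate k, in metres
      move_penalty    assumed cost per metre moved between frames
    """
    n_cand = len(positions)
    best = list(activity[0])  # best path score ending at each candidate
    back = []                 # back-pointers, one list per later frame

    for frame_scores in activity[1:]:
        new_best, pointers = [], []
        for k in range(n_cand):
            # Score of arriving at k from every possible predecessor j.
            arrive = [
                best[j] - move_penalty * math.dist(positions[j], positions[k])
                for j in range(n_cand)
            ]
            j_star = max(range(n_cand), key=arrive.__getitem__)
            new_best.append(arrive[j_star] + frame_scores[k])
            pointers.append(j_star)
        best = new_best
        back.append(pointers)

    # Backtrack from the best final candidate to recover the full path.
    k = max(range(n_cand), key=best.__getitem__)
    path = [k]
    for pointers in reversed(back):
        k = pointers[k]
        path.append(k)
    path.reverse()
    return path


if __name__ == "__main__":
    # Two candidate positions 3 m apart and three frames of made-up scores.
    positions = [(0.0, 0.0, 0.0), (3.0, 0.0, 0.0)]
    activity = [[1.0, 0.0], [0.2, 0.4], [1.0, 0.0]]
    print(viterbi_track(activity, positions, move_penalty=1.0))
    print(viterbi_track(activity, positions, move_penalty=0.0))
```

With a non-zero `move_penalty`, a brief spike in activity at a distant position is not enough to pull the track away, which is the smoothing effect a Viterbi-style tracker is typically used for. Setting the penalty to zero reduces the method to picking the highest-scoring candidate in each frame independently.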