Source Separation of Convolutive and Noisy Mixtures Using Audio-Visual Dictionary Learning and Probabilistic Time-Frequency Masking

Bibliographic Details
Published in: IEEE transactions on signal processing, 2013-11, Vol.61 (22), p.5520-5535
Authors: Qingju Liu, Wenwu Wang, Jackson, Philip J. B., Barnard, Mark, Kittler, Josef, Chambers, Jonathon
Format: Article
Language: English
description In existing audio-visual blind source separation (AV-BSS) algorithms, the AV coherence is usually established through statistical modelling, using, e.g., Gaussian mixture models (GMMs). These methods often operate in a low-dimensional feature space, rendering an effective global representation of the data. The local information, which is important in capturing the temporal structure of the data, however, has not been explicitly exploited. In this paper, we propose a new method for capturing such local information, based on audio-visual dictionary learning (AVDL). We address several challenges associated with AVDL, including cross-modality differences in size, dimension and sampling rate, as well as the issues of scalability and computational complexity. Following a commonly employed bootstrap coding-learning process, we have developed a new AVDL algorithm which features a bimodality-balanced and scalable matching criterion, a size- and dimension-adaptive dictionary, a fast search index for efficient coding, and cross-modality diverse sparsity. We also show how the proposed AVDL can be incorporated into a BSS algorithm. As an example, we consider binaural mixtures, mimicking aspects of human binaural hearing, and derive a new noise-robust AV-BSS algorithm by combining the proposed AVDL algorithm with Mandel's BSS method, which is a state-of-the-art audio-domain method using time-frequency masking. We have systematically evaluated the proposed AVDL and AV-BSS algorithms, and show their advantages over the corresponding baseline methods, using both synthetic data and visual speech data from the multimodal LILiR Twotalk corpus.
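The separation stage the abstract refers to rests on time-frequency masking: each cell of the mixture's spectrogram is weighted by the estimated share of the target source's energy. The following is a minimal sketch of a soft (ratio) mask on synthetic toy signals; it is illustrative only, not the paper's AVDL algorithm or Mandel's method, and every signal, window size, and helper name here is our own assumption.

```python
import numpy as np

rng = np.random.default_rng(0)

def stft(x, win=64, hop=32):
    """Naive STFT: Hann-windowed frames; rows = frames, cols = frequency bins."""
    w = np.hanning(win)
    frames = [x[i:i + win] * w for i in range(0, len(x) - win + 1, hop)]
    return np.fft.rfft(np.array(frames), axis=1)

# Two toy sources occupying different frequency bands, plus a little noise.
t = np.arange(2048) / 8000.0
s1 = np.sin(2 * np.pi * 440 * t)    # low-frequency source
s2 = np.sin(2 * np.pi * 1760 * t)   # high-frequency source
mix = s1 + s2 + 0.01 * rng.standard_normal(t.size)

S1, S2, X = stft(s1), stft(s2), stft(mix)

# Soft (ratio) mask: each time-frequency cell weighted by source 1's
# share of the total energy there (oracle mask, since we know S1 and S2).
mask = np.abs(S1) ** 2 / (np.abs(S1) ** 2 + np.abs(S2) ** 2 + 1e-12)
S1_hat = mask * X  # masked mixture = estimate of source 1's spectrogram

# The masked mixture matches source 1's spectrogram far better than
# the raw mixture does.
err_masked = np.linalg.norm(np.abs(S1_hat) - np.abs(S1))
err_mix = np.linalg.norm(np.abs(X) - np.abs(S1))
print(err_masked < err_mix)
```

In a real BSS system the mask is of course not an oracle: it must be inferred from the mixture alone, which is where the paper's probabilistic modelling and audio-visual dictionary come in.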
doi 10.1109/TSP.2013.2277834
publisher New York, NY: IEEE
identifier ISSN: 1053-587X; EISSN: 1941-0476
source IEEE Electronic Library (IEL)
subjects Applied sciences
Audio-visual coherence
blind source separation
Coding, codes
Coherence
convolutive mixtures
Detection, estimation, filtering, equalization, prediction
Dictionaries
dictionary learning
Encoding
Exact sciences and technology
Information, signal and communications theory
Matching pursuit algorithms
noisy mixtures
Signal and communications theory
Signal, noise
Source separation
Telecommunications and information theory
Training
Visualization