M-BEST-RQ: A Multi-Channel Speech Foundation Model for Smart Glasses

The growing popularity of multi-channel wearable devices, such as smart glasses, has led to a surge of applications such as targeted speech recognition and enhanced hearing. However, current approaches to solve these tasks use independently trained models, which may not benefit from large amounts of unlabeled data. In this paper, we propose M-BEST-RQ, the first multi-channel speech foundation model for smart glasses, which is designed to leverage large-scale self-supervised learning (SSL) in an array-geometry agnostic approach. While prior work on multi-channel speech SSL only evaluated on simulated settings, we curate a suite of real downstream tasks to evaluate our model, namely (i) conversational automatic speech recognition (ASR), (ii) spherical active source localization, and (iii) glasses wearer voice activity detection, which are sourced from the MMCSG and EasyCom datasets. We show that a general-purpose M-BEST-RQ encoder is able to match or surpass supervised models across all tasks. For the conversational ASR task in particular, using only 8 hours of labeled speech, our model outperforms a supervised ASR baseline that is trained on 2000 hours of labeled data, which demonstrates the effectiveness of our approach.
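The name indicates a BEST-RQ-style objective: a frozen random projection and random codebook turn speech features into discrete targets, and the encoder is trained to predict those targets at masked frames. The abstract does not spell out how the multi-channel, array-geometry-agnostic fusion is implemented, so the sketch below simply mean-pools per-channel log-mel features before a small Transformer stand-in encoder; all module names, dimensions, and the per-frame (rather than span) masking are illustrative assumptions, not the authors' implementation.

```python
# Minimal sketch (not the authors' code) of a BEST-RQ-style masked-prediction
# objective, naively extended to multi-channel input by mean-pooling channels.
import torch
import torch.nn as nn
import torch.nn.functional as F

class RandomProjectionQuantizer(nn.Module):
    """Frozen random projection + random codebook producing discrete targets."""
    def __init__(self, feat_dim, code_dim=16, num_codes=8192):
        super().__init__()
        # Buffers stay frozen throughout pre-training.
        self.register_buffer("proj", torch.randn(feat_dim, code_dim))
        self.register_buffer("codebook", F.normalize(torch.randn(num_codes, code_dim), dim=-1))

    @torch.no_grad()
    def forward(self, feats):                       # feats: (B, T, feat_dim)
        z = F.normalize(feats @ self.proj, dim=-1)  # project and l2-normalize
        # Nearest codebook entry by cosine similarity (both sides normalized).
        return (z @ self.codebook.t()).argmax(dim=-1)  # (B, T) target ids

class MaskedPredictionModel(nn.Module):
    def __init__(self, feat_dim=80, d_model=256, num_codes=8192):
        super().__init__()
        self.quantizer = RandomProjectionQuantizer(feat_dim, num_codes=num_codes)
        self.in_proj = nn.Linear(feat_dim, d_model)
        layer = nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=4)  # stand-in for the real encoder
        self.head = nn.Linear(d_model, num_codes)
        self.mask_emb = nn.Parameter(torch.zeros(feat_dim))  # learnable mask embedding

    def forward(self, multi_chan_feats, mask_prob=0.3):
        # multi_chan_feats: (B, C, T, feat_dim). Mean-pool channels as a
        # placeholder for whatever geometry-agnostic fusion the model uses.
        feats = multi_chan_feats.mean(dim=1)                  # (B, T, feat_dim)
        targets = self.quantizer(feats)                       # frozen discrete targets
        # Per-frame masking for brevity; BEST-RQ-style setups mask contiguous spans.
        mask = torch.rand(feats.shape[:2], device=feats.device) < mask_prob
        masked = torch.where(mask.unsqueeze(-1), self.mask_emb, feats)
        logits = self.head(self.encoder(self.in_proj(masked)))
        return F.cross_entropy(logits[mask], targets[mask])   # loss on masked frames only

# Example: 2 utterances, 5 microphone channels, 200 frames of 80-dim log-mel features.
model = MaskedPredictionModel()
loss = model(torch.randn(2, 5, 200, 80))
```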


Bibliographic Details
Main Authors: Yang, Yufeng; Raj, Desh; Lin, Ju; Moritz, Niko; Jia, Junteng; Keren, Gil; Lakomkin, Egor; Huang, Yiteng; Donley, Jacob; Mahadeokar, Jay; Kalinli, Ozlem
Format: Article
Language: English
Subjects: Computer Science - Sound
Online Access: Order full text
creator Yang, Yufeng; Raj, Desh; Lin, Ju; Moritz, Niko; Jia, Junteng; Keren, Gil; Lakomkin, Egor; Huang, Yiteng; Donley, Jacob; Mahadeokar, Jay; Kalinli, Ozlem
description The growing popularity of multi-channel wearable devices, such as smart glasses, has led to a surge of applications such as targeted speech recognition and enhanced hearing. However, current approaches to solve these tasks use independently trained models, which may not benefit from large amounts of unlabeled data. In this paper, we propose M-BEST-RQ, the first multi-channel speech foundation model for smart glasses, which is designed to leverage large-scale self-supervised learning (SSL) in an array-geometry agnostic approach. While prior work on multi-channel speech SSL only evaluated on simulated settings, we curate a suite of real downstream tasks to evaluate our model, namely (i) conversational automatic speech recognition (ASR), (ii) spherical active source localization, and (iii) glasses wearer voice activity detection, which are sourced from the MMCSG and EasyCom datasets. We show that a general-purpose M-BEST-RQ encoder is able to match or surpass supervised models across all tasks. For the conversational ASR task in particular, using only 8 hours of labeled speech, our model outperforms a supervised ASR baseline that is trained on 2000 hours of labeled data, which demonstrates the effectiveness of our approach.
doi_str_mv 10.48550/arxiv.2409.11494
format Article
fulltext fulltext_linktorsrc
identifier DOI: 10.48550/arxiv.2409.11494
language eng
recordid cdi_arxiv_primary_2409_11494
source arXiv.org
subjects Computer Science - Sound
title M-BEST-RQ: A Multi-Channel Speech Foundation Model for Smart Glasses