Multi-Head State Space Model for Speech Recognition

State space models (SSMs) have recently shown promising results on small-scale sequence and language modelling tasks, rivalling and outperforming many attention-based approaches. In this paper, we propose a multi-head state space (MH-SSM) architecture equipped with special gating mechanisms, where parallel heads are taught to learn local and global temporal dynamics on sequence data. As a drop-in replacement for multi-head attention in transformer encoders, this new model significantly outperforms the transformer transducer on the LibriSpeech speech recognition corpus. Furthermore, we augment the transformer block with MH-SSM layers, referred to as the Stateformer, achieving state-of-the-art performance on the LibriSpeech task, with word error rates of 1.76%/4.37% on the development and 1.91%/4.36% on the test sets without using an external language model.
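The core idea the abstract describes — several state-space heads run in parallel over the sequence, combined through a gating mechanism in place of multi-head attention — can be sketched roughly as below. This is an illustrative toy using diagonal linear SSM heads and a sigmoid gate; the function names, shapes, and parameterisation are assumptions for exposition, not the paper's actual model.

```python
import numpy as np

def ssm_head(x, A, B, C):
    """One diagonal linear state-space head over a sequence.

    x: (T, d_in) input sequence
    A: (n,) diagonal state-transition coefficients
    B: (n, d_in) input projection, C: (d_out, n) output projection
    Recurrence: h_t = A * h_{t-1} + B x_t ;  y_t = C h_t
    """
    h = np.zeros(A.shape[0])
    ys = []
    for t in range(x.shape[0]):
        h = A * h + B @ x[t]
        ys.append(C @ h)
    return np.stack(ys)  # (T, d_out)

def mh_ssm(x, heads, W_gate, W_out):
    """Multi-head SSM sketch: run H parallel heads, gate, project back.

    heads: list of (A, B, C) tuples, one per head.
    The concatenated head outputs are modulated by a sigmoid gate
    computed from the input, then mixed by an output projection.
    """
    outs = np.concatenate([ssm_head(x, A, B, C) for A, B, C in heads], axis=-1)
    gate = 1.0 / (1.0 + np.exp(-(x @ W_gate)))  # sigmoid gate, same shape as outs
    return (gate * outs) @ W_out  # (T, d_model)
```

Used as a drop-in replacement for a multi-head attention sub-layer, the output shape matches the input's, so residual connections and layer norm around it work unchanged.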

Detailed Description

Saved in:
Bibliographic Details
Main authors: Fathullah, Yassir, Wu, Chunyang, Shangguan, Yuan, Jia, Junteng, Xiong, Wenhan, Mahadeokar, Jay, Liu, Chunxi, Shi, Yangyang, Kalinli, Ozlem, Seltzer, Mike, Gales, Mark J. F.
Format: Article
Language: eng
Subjects:
Online access: Order full text
DOI: 10.48550/arxiv.2305.12498
Source: arXiv.org
Subjects:
Computer Science - Artificial Intelligence
Computer Science - Computation and Language
Computer Science - Learning
Computer Science - Sound
URL: https://arxiv.org/abs/2305.12498