Hiformer: Sequence Modeling Networks With Hierarchical Attention Mechanisms

The attention-based encoder-decoder structure, such as the Transformer, has achieved state-of-the-art performance on various sequence modeling tasks, e.g., machine translation (MT) and automatic speech recognition (ASR), benefiting from the superior capability of the layer-wise self-attention mechanism in the encoder/decoder to access long-distance contextual information. Recent analysis of Transformer layers has shown that different levels of information, e.g., phoneme level, word level, and semantic level, are represented at different layers. Effectively integrating information from these levels is important for structured prediction. However, the self-attention in the conventional Transformer structure only performs intra-layer integration and does not explicitly model inter-layer information relationships. Also, the attention across the encoder and decoder (cross-coder) only focuses on the top encoder layer and ignores the intermediate layers. In this article, we propose a sequence modeling structure equipped with a hierarchical attention mechanism, named Hiformer, that considers inter-layer and cross-coder hierarchical information to improve structured prediction performance. Extensive experiments conducted on both MT and ASR tasks demonstrate the effectiveness of the proposed Hiformer model.
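The abstract describes the key architectural change only in prose. Below is a minimal, hypothetical PyTorch sketch (not the authors' implementation; the simple softmax-weighted fusion and all names such as HierarchicalCrossAttention are assumptions for illustration) of the general cross-coder idea: the decoder cross-attends to a learned combination of all encoder layer outputs instead of only the top layer.

# Illustrative sketch only, not the released Hiformer code.
import torch
import torch.nn as nn

class HierarchicalCrossAttention(nn.Module):
    def __init__(self, d_model: int, n_heads: int, n_enc_layers: int):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        # One learnable scalar per encoder layer; softmax turns them into fusion weights.
        self.layer_weights = nn.Parameter(torch.zeros(n_enc_layers))

    def forward(self, dec_states, enc_layer_outputs):
        # dec_states: (batch, tgt_len, d_model)
        # enc_layer_outputs: list of n_enc_layers tensors, each (batch, src_len, d_model)
        stacked = torch.stack(enc_layer_outputs, dim=0)       # (L, B, S, D)
        w = torch.softmax(self.layer_weights, dim=0)          # (L,)
        fused = (w.view(-1, 1, 1, 1) * stacked).sum(dim=0)    # (B, S, D), fused encoder hierarchy
        out, _ = self.attn(dec_states, fused, fused)          # standard cross-attention over the fusion
        return out

# Usage sketch: fuse 6 encoder layers for one 512-dim, 8-head decoder block.
block = HierarchicalCrossAttention(d_model=512, n_heads=8, n_enc_layers=6)
dec = torch.randn(2, 10, 512)                                 # 2 sequences, 10 target positions
enc_layers = [torch.randn(2, 20, 512) for _ in range(6)]      # 6 encoder layers, 20 source positions
print(block(dec, enc_layers).shape)                           # torch.Size([2, 10, 512])

The actual Hiformer model also applies hierarchical attention between layers within the encoder/decoder (inter-layer); this sketch illustrates only the cross-coder part described in the abstract.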

Bibliographic Details
Published in: IEEE/ACM Transactions on Audio, Speech, and Language Processing, 2023, Vol. 31, pp. 3993-4003
Authors: Wu, Xixin; Lu, Hui; Li, Kun; Wu, Zhiyong; Liu, Xunying; Meng, Helen
Format: Article
Language: English
Online Access: Order full text
DOI: 10.1109/TASLP.2023.3313428
ISSN: 2329-9290
EISSN: 2329-9304
Source: IEEE Electronic Library (IEL)
Subjects:
Attention
Automatic speech recognition
Coders
Computational modeling
Decoding
Encoders-Decoders
Hierarchical attention mechanism
Machine translation
Modelling
neural machine translation
Phonemes
Semantics
Speech recognition
Task analysis
Training
transformer
Transformers
Voice recognition
URL: https://sfx.bib-bvb.de/sfx_tum?ctx_ver=Z39.88-2004&ctx_enc=info:ofi/enc:UTF-8&ctx_tim=2024-12-22T06%3A34%3A42IST&url_ver=Z39.88-2004&url_ctx_fmt=infofi/fmt:kev:mtx:ctx&rfr_id=info:sid/primo.exlibrisgroup.com:primo3-Article-proquest_RIE&rft_val_fmt=info:ofi/fmt:kev:mtx:journal&rft.genre=article&rft.atitle=Hiformer:%20Sequence%20Modeling%20Networks%20With%20Hierarchical%20Attention%20Mechanisms&rft.jtitle=IEEE/ACM%20transactions%20on%20audio,%20speech,%20and%20language%20processing&rft.au=Wu,%20Xixin&rft.date=2023&rft.volume=31&rft.spage=3993&rft.epage=4003&rft.pages=3993-4003&rft.issn=2329-9290&rft.eissn=2329-9304&rft.coden=ITASFA&rft_id=info:doi/10.1109/TASLP.2023.3313428&rft_dat=%3Cproquest_RIE%3E2881500927%3C/proquest_RIE%3E%3Curl%3E%3C/url%3E&disable_directlink=true&sfx.directlink=off&sfx.report_link=0&rft_id=info:oai/&rft_pqid=2881500927&rft_id=info:pmid/&rft_ieee_id=10244068&rfr_iscdi=true