Improving Speech Translation by Cross-modal Multi-grained Contrastive Learning

Saved in:
Bibliographic details
Published in: IEEE/ACM Transactions on Audio, Speech, and Language Processing, 2023-01, Vol. 31, p. 1-12
Main authors: Zhang, Hao, Si, Nianwen, Chen, Yaqi, Zhang, Wenlin, Yang, Xukui, Qu, Dan, Zhang, Weiqiang
Format: Article
Language: eng
Subjects:
Online access: Order full text
container_end_page 12
container_issue
container_start_page 1
container_title IEEE/ACM transactions on audio, speech, and language processing
container_volume 31
creator Zhang, Hao
Si, Nianwen
Chen, Yaqi
Zhang, Wenlin
Yang, Xukui
Qu, Dan
Zhang, Weiqiang
description The end-to-end speech translation (E2E-ST) model has gradually become a mainstream paradigm due to its low latency and reduced error propagation. However, it is non-trivial to train such a model well because of the task complexity and data scarcity. The differences between the speech and text modalities usually leave E2E-ST performance inferior to that of the corresponding machine translation (MT) model. Based on this observation, existing methods often use sharing mechanisms to carry out implicit knowledge transfer by imposing various constraints. However, the final model often performs worse on the MT task than an MT model trained alone, which means the knowledge transfer ability of such methods is also limited. To deal with these problems, we propose FCCL (Fine- and Coarse-Granularity Contrastive Learning) for E2E-ST, which performs explicit knowledge transfer through cross-modal multi-grained contrastive learning. A key ingredient of our approach is applying contrastive learning at both the sentence and frame level to provide comprehensive guidance for extracting speech representations rich in semantic information. In addition, we adopt a simple whitening method to alleviate representation degeneration in the MT model, which adversely affects contrastive learning. Experiments on the MuST-C benchmark show that our proposed approach significantly outperforms state-of-the-art E2E-ST baselines on all eight language pairs. Further analysis indicates that FCCL can free up its capacity from learning grammatical structure information and force more layers to learn semantic information.
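
The description above centers on two mechanisms: sentence-level (coarse-grained) cross-modal contrastive learning between paired speech and text representations, and a whitening transform applied to the MT-side embeddings to counter representation degeneration. The sketch below is a minimal PyTorch illustration of those two ideas, not the authors' implementation: the function names (whiten, sentence_level_infonce), the SVD-based whitening, the temperature value, and the toy tensor shapes are assumptions made here for illustration; the paper's frame-level (fine-grained) objective is not shown.

    import torch
    import torch.nn.functional as F

    def whiten(x, eps=1e-5):
        # Whitening sketch (assumption: SVD-based, BERT-whitening style):
        # center the embeddings, then rotate/scale by the inverse square root
        # of the covariance so dimensions are decorrelated.
        mu = x.mean(dim=0, keepdim=True)
        cov = torch.cov((x - mu).T)                    # (dim, dim) covariance
        u, s, _ = torch.linalg.svd(cov)
        w = u @ torch.diag(1.0 / torch.sqrt(s + eps))  # whitening matrix
        return (x - mu) @ w

    def sentence_level_infonce(speech_emb, text_emb, temperature=0.1):
        # Sentence-level cross-modal InfoNCE: the paired speech/text sentence
        # is the positive, every other sentence in the batch is a negative.
        # speech_emb, text_emb: (batch, dim) pooled encoder outputs.
        speech = F.normalize(speech_emb, dim=-1)
        text = F.normalize(text_emb, dim=-1)
        logits = speech @ text.T / temperature          # (batch, batch) cosine similarities
        targets = torch.arange(speech.size(0), device=speech.device)
        return F.cross_entropy(logits, targets)

    # Toy usage: 8 sentence pairs with 256-dim pooled representations, with the
    # whitening applied to the text (MT-side) embeddings before the loss.
    speech_emb = torch.randn(8, 256)
    text_emb = torch.randn(8, 256)
    loss = sentence_level_infonce(speech_emb, whiten(text_emb))
    print(loss.item())
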
doi_str_mv 10.1109/TASLP.2023.3244521
format Article
fulltext fulltext_linktorsrc
identifier ISSN: 2329-9290
ispartof IEEE/ACM transactions on audio, speech, and language processing, 2023-01, Vol.31, p.1-12
issn 2329-9290
2329-9304
language eng
recordid cdi_proquest_journals_2780986786
source IEEE Electronic Library (IEL)
subjects Constraint modelling
Contrastive Learning
Data mining
Data models
Degeneration
End-to-End
Explicit knowledge
Feature extraction
Knowledge management
Learning
Machine translation
Representations
Semantics
Speech recognition
Speech Translation
Task analysis
Task complexity
Training
Training data
title Improving Speech Translation by Cross-modal Multi-grained Contrastive Learning
url https://sfx.bib-bvb.de/sfx_tum?ctx_ver=Z39.88-2004&ctx_enc=info:ofi/enc:UTF-8&ctx_tim=2025-01-05T21%3A56%3A53IST&url_ver=Z39.88-2004&url_ctx_fmt=infofi/fmt:kev:mtx:ctx&rfr_id=info:sid/primo.exlibrisgroup.com:primo3-Article-proquest_RIE&rft_val_fmt=info:ofi/fmt:kev:mtx:journal&rft.genre=article&rft.atitle=Improving%20Speech%20Translation%20by%20Cross-modal%20Multi-grained%20Contrastive%20Learning&rft.jtitle=IEEE/ACM%20transactions%20on%20audio,%20speech,%20and%20language%20processing&rft.au=Zhang,%20Hao&rft.date=2023-01-01&rft.volume=31&rft.spage=1&rft.epage=12&rft.pages=1-12&rft.issn=2329-9290&rft.eissn=2329-9304&rft.coden=ITASFA&rft_id=info:doi/10.1109/TASLP.2023.3244521&rft_dat=%3Cproquest_RIE%3E2780986786%3C/proquest_RIE%3E%3Curl%3E%3C/url%3E&disable_directlink=true&sfx.directlink=off&sfx.report_link=0&rft_id=info:oai/&rft_pqid=2780986786&rft_id=info:pmid/&rft_ieee_id=10042965&rfr_iscdi=true