Improving Speech Translation by Cross-modal Multi-grained Contrastive Learning

Saved in:
Bibliographic details
Published in: IEEE/ACM Transactions on Audio, Speech, and Language Processing, 2023-01, Vol. 31, p. 1-12
Main authors: Zhang, Hao, Si, Nianwen, Chen, Yaqi, Zhang, Wenlin, Yang, Xukui, Qu, Dan, Zhang, Weiqiang
Format: Article
Language: eng
Subjects:
Online access: Order full text
container_end_page 12
container_issue
container_start_page 1
container_title IEEE/ACM transactions on audio, speech, and language processing
container_volume 31
creator Zhang, Hao
Si, Nianwen
Chen, Yaqi
Zhang, Wenlin
Yang, Xukui
Qu, Dan
Zhang, Weiqiang
description The end-to-end speech translation (E2E-ST) model has gradually become a mainstream paradigm due to its low latency and reduced error propagation. However, it is non-trivial to train such a model well because of the task complexity and data scarcity. The differences between the speech and text modalities usually leave E2E-ST performance inferior to that of the corresponding machine translation (MT) model. Based on this observation, existing methods often use sharing mechanisms to carry out implicit knowledge transfer by imposing various constraints. However, the final model often performs worse on the MT task than an MT model trained alone, which means the knowledge transfer ability of such methods is also limited. To deal with these problems, we propose FCCL (Fine- and Coarse-Granularity Contrastive Learning) for E2E-ST, which performs explicit knowledge transfer through cross-modal multi-grained contrastive learning. A key ingredient of our approach is applying contrastive learning at both the sentence and frame level to provide comprehensive guidance for extracting speech representations rich in semantic information. In addition, we adopt a simple whitening method to alleviate representation degeneration in the MT model, which adversely affects contrastive learning. Experiments on the MuST-C benchmark show that our proposed approach significantly outperforms state-of-the-art E2E-ST baselines on all eight language pairs. Further analysis indicates that FCCL can free up its capacity from learning grammatical structure information and force more layers to learn semantic information.
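
The description above centers on two mechanisms: sentence-level (coarse-grained) cross-modal contrastive learning between paired speech and text representations, and a whitening transform applied to the MT-side embeddings to counter representation degeneration. The sketch below is a minimal PyTorch illustration of those two ideas, not the authors' implementation: the function names (whiten, sentence_level_infonce), the SVD-based whitening, the temperature value, and the toy tensor shapes are assumptions made here for illustration; the paper's frame-level (fine-grained) objective is not shown.

    import torch
    import torch.nn.functional as F

    def whiten(x, eps=1e-5):
        # Whitening sketch (assumption: SVD-based, BERT-whitening style):
        # center the embeddings, then rotate/scale by the inverse square root
        # of the covariance so dimensions are decorrelated.
        mu = x.mean(dim=0, keepdim=True)
        cov = torch.cov((x - mu).T)                    # (dim, dim) covariance
        u, s, _ = torch.linalg.svd(cov)
        w = u @ torch.diag(1.0 / torch.sqrt(s + eps))  # whitening matrix
        return (x - mu) @ w

    def sentence_level_infonce(speech_emb, text_emb, temperature=0.1):
        # Sentence-level cross-modal InfoNCE: the paired speech/text sentence
        # is the positive, every other sentence in the batch is a negative.
        # speech_emb, text_emb: (batch, dim) pooled encoder outputs.
        speech = F.normalize(speech_emb, dim=-1)
        text = F.normalize(text_emb, dim=-1)
        logits = speech @ text.T / temperature          # (batch, batch) cosine similarities
        targets = torch.arange(speech.size(0), device=speech.device)
        return F.cross_entropy(logits, targets)

    # Toy usage: 8 sentence pairs with 256-dim pooled representations, with the
    # whitening applied to the text (MT-side) embeddings before the loss.
    speech_emb = torch.randn(8, 256)
    text_emb = torch.randn(8, 256)
    loss = sentence_level_infonce(speech_emb, whiten(text_emb))
    print(loss.item())
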
doi_str_mv 10.1109/TASLP.2023.3244521
format Article
fulltext fulltext_linktorsrc
identifier ISSN: 2329-9290
ispartof IEEE/ACM transactions on audio, speech, and language processing, 2023-01, Vol.31, p.1-12
issn 2329-9290
2329-9304
language eng
recordid cdi_proquest_journals_2780986786
source IEEE Electronic Library (IEL)
subjects Constraint modelling
Contrastive Learning
Data mining
Data models
Degeneration
End-to-End
Explicit knowledge
Feature extraction
Knowledge management
Learning
Machine translation
Representations
Semantics
Speech recognition
Speech Translation
Task analysis
Task complexity
Training
Training data
title Improving Speech Translation by Cross-modal Multi-grained Contrastive Learning
url https://sfx.bib-bvb.de/sfx_tum?ctx_ver=Z39.88-2004&ctx_enc=info:ofi/enc:UTF-8&ctx_tim=2025-01-05T21%3A56%3A53IST&url_ver=Z39.88-2004&url_ctx_fmt=infofi/fmt:kev:mtx:ctx&rfr_id=info:sid/primo.exlibrisgroup.com:primo3-Article-proquest_RIE&rft_val_fmt=info:ofi/fmt:kev:mtx:journal&rft.genre=article&rft.atitle=Improving%20Speech%20Translation%20by%20Cross-modal%20Multi-grained%20Contrastive%20Learning&rft.jtitle=IEEE/ACM%20transactions%20on%20audio,%20speech,%20and%20language%20processing&rft.au=Zhang,%20Hao&rft.date=2023-01-01&rft.volume=31&rft.spage=1&rft.epage=12&rft.pages=1-12&rft.issn=2329-9290&rft.eissn=2329-9304&rft.coden=ITASFA&rft_id=info:doi/10.1109/TASLP.2023.3244521&rft_dat=%3Cproquest_RIE%3E2780986786%3C/proquest_RIE%3E%3Curl%3E%3C/url%3E&disable_directlink=true&sfx.directlink=off&sfx.report_link=0&rft_id=info:oai/&rft_pqid=2780986786&rft_id=info:pmid/&rft_ieee_id=10042965&rfr_iscdi=true