Improving Speech Translation by Cross-modal Multi-grained Contrastive Learning
The end-to-end speech translation (E2E-ST) model has gradually become a mainstream paradigm due to its low latency and less error propagation. However, it is non-trivial to train such a model well due to the task complexity and data scarcity. The speech-and-text modality differences result in the E2...
Saved in:
Published in: | IEEE/ACM transactions on audio, speech, and language processing, 2023-01, Vol.31, p.1-12 |
---|---|
Main authors: | Zhang, Hao; Si, Nianwen; Chen, Yaqi; Zhang, Wenlin; Yang, Xukui; Qu, Dan; Zhang, Weiqiang |
Format: | Article |
Language: | eng |
Subjects: | |
Online access: | Order full text |
container_end_page | 12 |
---|---|
container_issue | |
container_start_page | 1 |
container_title | IEEE/ACM transactions on audio, speech, and language processing |
container_volume | 31 |
creator | Zhang, Hao; Si, Nianwen; Chen, Yaqi; Zhang, Wenlin; Yang, Xukui; Qu, Dan; Zhang, Weiqiang |
description | The end-to-end speech translation (E2E-ST) model has gradually become a mainstream paradigm due to its low latency and reduced error propagation. However, it is non-trivial to train such a model well due to task complexity and data scarcity. The speech-and-text modality differences result in E2E-ST model performance that is usually inferior to the corresponding machine translation (MT) model. Based on the above observation, existing methods often use sharing mechanisms to carry out implicit knowledge transfer by imposing various constraints. However, the final model often performs worse on the MT task than the MT model trained alone, which means that the knowledge transfer ability of this method is also limited. To deal with these problems, we propose the FCCL (Fine- and Coarse-Granularity Contrastive Learning) approach for E2E-ST, which makes explicit knowledge transfer through cross-modal multi-grained contrastive learning. A key ingredient of our approach is applying contrastive learning at both the sentence and frame level to give comprehensive guidance for extracting speech representations containing rich semantic information. In addition, we adopt a simple whitening method to alleviate the representation degeneration in the MT model, which adversely affects contrastive learning. Experiments on the MuST-C benchmark show that our proposed approach significantly outperforms the state-of-the-art E2E-ST baselines on all eight language pairs. Further analysis indicates that FCCL can free up its capacity from learning grammatical structure information and force more layers to learn semantic information. |
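The sentence-level cross-modal contrastive objective and the whitening step described in the abstract can be sketched as follows. This is a minimal PyTorch sketch under our own assumptions: the function names, the InfoNCE formulation with in-batch negatives, and the SVD-based whitening transform are illustrative, not the authors' exact implementation.

```python
import torch
import torch.nn.functional as F

def sentence_contrastive_loss(speech_repr, text_repr, temperature=0.1):
    """InfoNCE-style loss: each speech sentence embedding is pulled toward
    the text embedding of its own transcript, while the other sentences in
    the batch act as negatives. Both inputs have shape (batch, dim)."""
    speech_repr = F.normalize(speech_repr, dim=-1)
    text_repr = F.normalize(text_repr, dim=-1)
    # Cosine-similarity logits between every speech/text pair in the batch.
    logits = speech_repr @ text_repr.t() / temperature  # (batch, batch)
    # The matching pair sits on the diagonal.
    targets = torch.arange(logits.size(0))
    return F.cross_entropy(logits, targets)

def whiten(embeddings, eps=1e-6):
    """Whitening as used against representation degeneration: center the
    embeddings and map them through the inverse square root of their
    covariance (via SVD), so the output distribution is roughly isotropic."""
    mu = embeddings.mean(dim=0, keepdim=True)
    centered = embeddings - mu
    cov = centered.t() @ centered / embeddings.size(0)
    u, s, _ = torch.linalg.svd(cov)
    w = u @ torch.diag(1.0 / torch.sqrt(s + eps))
    return centered @ w
```

The frame-level loss in the paper follows the same contrastive pattern at a finer granularity (speech frames against token representations); the sentence-level version above conveys the core idea.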
doi_str_mv | 10.1109/TASLP.2023.3244521 |
format | Article |
fulltext | fulltext_linktorsrc |
identifier | ISSN: 2329-9290 |
ispartof | IEEE/ACM transactions on audio, speech, and language processing, 2023-01, Vol.31, p.1-12 |
issn | 2329-9290 2329-9304 |
language | eng |
recordid | cdi_proquest_journals_2780986786 |
source | IEEE Electronic Library (IEL) |
subjects | Constraint modelling; Contrastive Learning; Data mining; Data models; Degeneration; End-to-End; Explicit knowledge; Feature extraction; Knowledge management; Learning; Machine translation; Representations; Semantics; Speech recognition; Speech Translation; Task analysis; Task complexity; Training; Training data |
title | Improving Speech Translation by Cross-modal Multi-grained Contrastive Learning |
url | https://sfx.bib-bvb.de/sfx_tum?ctx_ver=Z39.88-2004&ctx_enc=info:ofi/enc:UTF-8&ctx_tim=2025-01-05T21%3A56%3A53IST&url_ver=Z39.88-2004&url_ctx_fmt=infofi/fmt:kev:mtx:ctx&rfr_id=info:sid/primo.exlibrisgroup.com:primo3-Article-proquest_RIE&rft_val_fmt=info:ofi/fmt:kev:mtx:journal&rft.genre=article&rft.atitle=Improving%20Speech%20Translation%20by%20Cross-modal%20Multi-grained%20Contrastive%20Learning&rft.jtitle=IEEE/ACM%20transactions%20on%20audio,%20speech,%20and%20language%20processing&rft.au=Zhang,%20Hao&rft.date=2023-01-01&rft.volume=31&rft.spage=1&rft.epage=12&rft.pages=1-12&rft.issn=2329-9290&rft.eissn=2329-9304&rft.coden=ITASFA&rft_id=info:doi/10.1109/TASLP.2023.3244521&rft_dat=%3Cproquest_RIE%3E2780986786%3C/proquest_RIE%3E%3Curl%3E%3C/url%3E&disable_directlink=true&sfx.directlink=off&sfx.report_link=0&rft_id=info:oai/&rft_pqid=2780986786&rft_id=info:pmid/&rft_ieee_id=10042965&rfr_iscdi=true |