On Leveraging the Visual Modality for Neural Machine Translation

Leveraging the visual modality effectively for Neural Machine Translation (NMT) remains an open problem in computational linguistics. Recently, Caglayan et al. posited that the observed gains are limited mainly because the very simple, short, repetitive sentences of the Multi30k dataset (the only multimodal MT dataset available at the time) render the source text sufficient as context on its own. In this work, we further investigate this hypothesis on a new large-scale multimodal Machine Translation (MMT) dataset, How2, whose mean sentence length is 1.57 times that of Multi30k and which contains no repetition. We propose and evaluate three novel fusion techniques, each designed to ensure the use of visual context at a different stage of the sequence-to-sequence transduction pipeline, even under full linguistic context. However, we still obtain only marginal gains under full linguistic context, and we posit that the visual embeddings extracted from deep vision models (ResNet for Multi30k, ResNeXt for How2) do little to increase the discriminativeness between vocabulary elements in token-level prediction. We demonstrate this qualitatively by analyzing attention distributions and quantitatively through Principal Component Analysis, and conclude that it is the quality of the visual embeddings, rather than the length of the sentences, that needs to be improved in existing MMT datasets.
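The abstract describes injecting visual context into a sequence-to-sequence model and then probing the visual embeddings with Principal Component Analysis. As an illustration only, and not the authors' released code, the following minimal PyTorch sketch shows one generic way such fusion is often done: project a pooled image feature into the decoder's hidden space and use it to initialize the decoder state. All class names, dimensions, and the random toy inputs are assumptions made for this sketch.

import torch
import torch.nn as nn

class VisualInitDecoder(nn.Module):
    # Toy decoder whose initial hidden state mixes textual and visual summaries.
    def __init__(self, vocab_size=8000, emb_dim=256, hid_dim=512, img_dim=2048):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb_dim)
        self.img_proj = nn.Linear(img_dim, hid_dim)  # project a ResNet/ResNeXt-style feature
        self.rnn = nn.GRU(emb_dim, hid_dim, batch_first=True)
        self.out = nn.Linear(hid_dim, vocab_size)

    def forward(self, prev_tokens, enc_summary, img_feat):
        # Fuse the encoder's sentence summary with the projected visual feature.
        h0 = torch.tanh(enc_summary + self.img_proj(img_feat)).unsqueeze(0)
        dec_out, _ = self.rnn(self.embed(prev_tokens), h0)
        return self.out(dec_out)  # per-token vocabulary logits

decoder = VisualInitDecoder()
logits = decoder(torch.randint(0, 8000, (4, 20)),  # previous target tokens (batch, length)
                 torch.randn(4, 512),               # encoder summary vector
                 torch.randn(4, 2048))              # pooled visual feature
print(logits.shape)  # torch.Size([4, 20, 8000])

The quantitative check mentioned in the abstract can be approximated with a standard PCA over the matrix of visual features: if a handful of principal components already explain most of the variance, the embeddings offer little signal for discriminating between target tokens. The sketch below uses random data as a stand-in for real How2/Multi30k features.

import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
visual_embeddings = rng.normal(size=(1000, 2048))  # placeholder for pooled CNN features

pca = PCA(n_components=10)
pca.fit(visual_embeddings)

# Fraction of total variance captured by each of the top 10 components and their
# cumulative sum; a cumulative value near 1.0 from only a few components would
# indicate low effective dimensionality of the embeddings.
print(pca.explained_variance_ratio_)
print(pca.explained_variance_ratio_.cumsum()[-1])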


Bibliographic Details
Main Authors: Raunak, Vikas; Choe, Sang Keun; Lu, Quanyang; Xu, Yi; Metze, Florian
Format: Article
Language: English
Subjects: Computer Science - Computation and Language; Computer Science - Learning
Published: 2019-10-07
Source: arXiv.org
DOI: 10.48550/arXiv.1910.02754
Online Access: Order full text