On Leveraging the Visual Modality for Neural Machine Translation

Leveraging the visual modality effectively for Neural Machine Translation (NMT) remains an open problem in computational linguistics. Recently, Caglayan et al. posited that the observed gains are limited mainly because the very simple, short, repetitive sentences of the Multi30k dataset (the only multimodal MT dataset available at the time) render the source text sufficient as context on its own. In this work, we further investigate this hypothesis on a new large-scale multimodal Machine Translation (MMT) dataset, How2, whose mean sentence length is 1.57 times that of Multi30k and which contains no repetition. We propose and evaluate three novel fusion techniques, each designed to ensure the use of visual context at a different stage of the sequence-to-sequence transduction pipeline, even under full linguistic context. However, we still obtain only marginal gains under full linguistic context, and we posit that the visual embeddings extracted from deep vision models (ResNet for Multi30k, ResNeXt for How2) do little to increase the discriminativeness between vocabulary elements in token-level prediction. We demonstrate this qualitatively by analyzing attention distributions and quantitatively through Principal Component Analysis, and conclude that it is the quality of the visual embeddings, rather than the length of the sentences, that needs to be improved in existing MMT datasets.
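The abstract describes injecting visual context into a sequence-to-sequence model and then probing the visual embeddings with Principal Component Analysis. As an illustration only, and not the authors' released code, the following minimal PyTorch sketch shows one generic way such fusion is often done: project a pooled image feature into the decoder's hidden space and use it to initialize the decoder state. All class names, dimensions, and the random toy inputs are assumptions made for this sketch.

import torch
import torch.nn as nn

class VisualInitDecoder(nn.Module):
    # Toy decoder whose initial hidden state mixes textual and visual summaries.
    def __init__(self, vocab_size=8000, emb_dim=256, hid_dim=512, img_dim=2048):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb_dim)
        self.img_proj = nn.Linear(img_dim, hid_dim)  # project a ResNet/ResNeXt-style feature
        self.rnn = nn.GRU(emb_dim, hid_dim, batch_first=True)
        self.out = nn.Linear(hid_dim, vocab_size)

    def forward(self, prev_tokens, enc_summary, img_feat):
        # Fuse the encoder's sentence summary with the projected visual feature.
        h0 = torch.tanh(enc_summary + self.img_proj(img_feat)).unsqueeze(0)
        dec_out, _ = self.rnn(self.embed(prev_tokens), h0)
        return self.out(dec_out)  # per-token vocabulary logits

decoder = VisualInitDecoder()
logits = decoder(torch.randint(0, 8000, (4, 20)),  # previous target tokens (batch, length)
                 torch.randn(4, 512),               # encoder summary vector
                 torch.randn(4, 2048))              # pooled visual feature
print(logits.shape)  # torch.Size([4, 20, 8000])

The quantitative check mentioned in the abstract can be approximated with a standard PCA over the matrix of visual features: if a handful of principal components already explain most of the variance, the embeddings offer little signal for discriminating between target tokens. The sketch below uses random data as a stand-in for real How2/Multi30k features.

import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
visual_embeddings = rng.normal(size=(1000, 2048))  # placeholder for pooled CNN features

pca = PCA(n_components=10)
pca.fit(visual_embeddings)

# Fraction of total variance captured by each of the top 10 components and their
# cumulative sum; a cumulative value near 1.0 from only a few components would
# indicate low effective dimensionality of the embeddings.
print(pca.explained_variance_ratio_)
print(pca.explained_variance_ratio_.cumsum()[-1])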


Bibliographic Details
Main Authors: Raunak, Vikas; Choe, Sang Keun; Lu, Quanyang; Xu, Yi; Metze, Florian
Format: Article
Language: English
Subjects: Computer Science - Computation and Language; Computer Science - Learning
Published: 2019-10-07
Source: arXiv.org
DOI: 10.48550/arXiv.1910.02754
Online Access: Order full text