Improved feature decay algorithms for statistical machine translation

In machine-learning applications, data selection is of crucial importance if good runtime performance is to be achieved. In a scenario where the test set is accessible when the model is being built, training instances can be selected so they are the most relevant for the test set. Feature Decay Algo...

Full description

Saved in:
Bibliographic details
Published in: Natural language engineering 2022-01, Vol.28 (1), p.71-91
Main authors: Poncelas, Alberto; Maillette de Buy Wenniger, Gideon; Way, Andy
Format: Article
Language: eng
Subjects:
Online access: Full text
container_end_page 91
container_issue 1
container_start_page 71
container_title Natural language engineering
container_volume 28
creator Poncelas, Alberto
Maillette de Buy Wenniger, Gideon
Way, Andy
description In machine-learning applications, data selection is of crucial importance if good runtime performance is to be achieved. In a scenario where the test set is accessible when the model is being built, training instances can be selected so they are the most relevant for the test set. Feature Decay Algorithms (FDA) are a technique for data selection that has demonstrated excellent performance in a number of tasks. This method maximizes the diversity of the n-grams in the training set by devaluing those ones that have already been included. We focus on this method to undertake deeper research on how to select better training data instances. We give an overview of FDA and propose improvements in terms of speed and quality. Using German-to-English parallel data, first we create a novel approach that decreases the execution time of FDA when multiple computation units are available. In addition, we obtain improvements on translation quality by extending FDA using information from the parallel corpus that is generally ignored.
doi_str_mv 10.1017/S1351324920000467
format Article
fullrecord <record><control><sourceid>proquest_cross</sourceid><recordid>TN_cdi_proquest_journals_2618439427</recordid><sourceformat>XML</sourceformat><sourcesystem>PC</sourcesystem><cupid>10_1017_S1351324920000467</cupid><sourcerecordid>2618439427</sourcerecordid><originalsourceid>FETCH-LOGICAL-c317t-cea55318cd1957528d8024e0acfe718407f1c842e25d88653adaf061392bc1a73</originalsourceid><addsrcrecordid>eNp1UEtLAzEQDqJgrf4AbwueVzN5bLJHKdUWCh7U8zJNsu2WfdQkFfrvzdKCB3EuM8z3mOEj5B7oI1BQT-_AJXAmSkZTiUJdkAmIosw1AL1Mc4LzEb8mNyHsRg4oMSHzZbf3w7ezWe0wHrzLrDN4zLDdDL6J2y5k9eCzEDE2ITYG26xDs216l0WPfWjTfuhvyVWNbXB35z4lny_zj9kiX729LmfPq9xwUDE3DqXkoI2FUirJtNWUCUfR1E6BFlTVYLRgjkmrdSE5WqxpAbxkawOo-JQ8nHzTz18HF2K1Gw6-TycrViQDXgo2suDEMn4Iwbu62vumQ3-sgFZjWtWftJKGnzXYrX1jN-7X-n_VD-Y3a3Y</addsrcrecordid><sourcetype>Aggregation Database</sourcetype><iscdi>true</iscdi><recordtype>article</recordtype><pqid>2618439427</pqid></control><display><type>article</type><title>Improved feature decay algorithms for statistical machine translation</title><source>Cambridge University Press Journals Complete</source><creator>Poncelas, Alberto ; Maillette de Buy Wenniger, Gideon ; Way, Andy</creator><creatorcontrib>Poncelas, Alberto ; Maillette de Buy Wenniger, Gideon ; Way, Andy</creatorcontrib><description>In machine-learning applications, data selection is of crucial importance if good runtime performance is to be achieved. In a scenario where the test set is accessible when the model is being built, training instances can be selected so they are the most relevant for the test set. Feature Decay Algorithms (FDA) are a technique for data selection that has demonstrated excellent performance in a number of tasks. This method maximizes the diversity of the n-grams in the training set by devaluing those ones that have already been included. We focus on this method to undertake deeper research on how to select better training data instances. 
We give an overview of FDA and propose improvements in terms of speed and quality. Using German-to-English parallel data, first we create a novel approach that decreases the execution time of FDA when multiple computation units are available. In addition, we obtain improvements on translation quality by extending FDA using information from the parallel corpus that is generally ignored.</description><identifier>ISSN: 1351-3249</identifier><identifier>EISSN: 1469-8110</identifier><identifier>DOI: 10.1017/S1351324920000467</identifier><language>eng</language><publisher>Cambridge, UK: Cambridge University Press</publisher><subject>Algorithms ; Bilingualism ; Decay ; English language ; German language ; Language ; Machine learning ; Machine translation ; Methods ; N-Gram language models ; Parallel corpora ; Test sets ; Training ; Translation methods and strategies</subject><ispartof>Natural language engineering, 2022-01, Vol.28 (1), p.71-91</ispartof><rights>The Author(s), 2020. Published by Cambridge University Press</rights><lds50>peer_reviewed</lds50><woscitedreferencessubscribed>false</woscitedreferencessubscribed><citedby>FETCH-LOGICAL-c317t-cea55318cd1957528d8024e0acfe718407f1c842e25d88653adaf061392bc1a73</citedby><cites>FETCH-LOGICAL-c317t-cea55318cd1957528d8024e0acfe718407f1c842e25d88653adaf061392bc1a73</cites><orcidid>0000-0002-5089-1687 ; 0000-0001-8427-7055 ; 0000-0001-5736-5930</orcidid></display><links><openurl>$$Topenurl_article</openurl><openurlfulltext>$$Topenurlfull_article</openurlfulltext><thumbnail>$$Tsyndetics_thumb_exl</thumbnail><linktohtml>$$Uhttps://www.cambridge.org/core/product/identifier/S1351324920000467/type/journal_article$$EHTML$$P50$$Gcambridge$$H</linktohtml><link.rule.ids>164,314,780,784,27924,27925,55628</link.rule.ids></links><search><creatorcontrib>Poncelas, Alberto</creatorcontrib><creatorcontrib>Maillette de Buy Wenniger, Gideon</creatorcontrib><creatorcontrib>Way, Andy</creatorcontrib><title>Improved feature decay algorithms for 
statistical machine translation</title><title>Natural language engineering</title><addtitle>Nat. Lang. Eng</addtitle><description>In machine-learning applications, data selection is of crucial importance if good runtime performance is to be achieved. In a scenario where the test set is accessible when the model is being built, training instances can be selected so they are the most relevant for the test set. Feature Decay Algorithms (FDA) are a technique for data selection that has demonstrated excellent performance in a number of tasks. This method maximizes the diversity of the n-grams in the training set by devaluing those ones that have already been included. We focus on this method to undertake deeper research on how to select better training data instances. We give an overview of FDA and propose improvements in terms of speed and quality. Using German-to-English parallel data, first we create a novel approach that decreases the execution time of FDA when multiple computation units are available. 
In addition, we obtain improvements on translation quality by extending FDA using information from the parallel corpus that is generally ignored.</description><subject>Algorithms</subject><subject>Bilingualism</subject><subject>Decay</subject><subject>English language</subject><subject>German language</subject><subject>Language</subject><subject>Machine learning</subject><subject>Machine translation</subject><subject>Methods</subject><subject>N-Gram language models</subject><subject>Parallel corpora</subject><subject>Test sets</subject><subject>Training</subject><subject>Translation methods and strategies</subject><issn>1351-3249</issn><issn>1469-8110</issn><fulltext>true</fulltext><rsrctype>article</rsrctype><creationdate>2022</creationdate><recordtype>article</recordtype><sourceid>ABUWG</sourceid><sourceid>AFKRA</sourceid><sourceid>AZQEC</sourceid><sourceid>BENPR</sourceid><sourceid>CCPQU</sourceid><sourceid>DWQXO</sourceid><sourceid>GNUQQ</sourceid><recordid>eNp1UEtLAzEQDqJgrf4AbwueVzN5bLJHKdUWCh7U8zJNsu2WfdQkFfrvzdKCB3EuM8z3mOEj5B7oI1BQT-_AJXAmSkZTiUJdkAmIosw1AL1Mc4LzEb8mNyHsRg4oMSHzZbf3w7ezWe0wHrzLrDN4zLDdDL6J2y5k9eCzEDE2ITYG26xDs216l0WPfWjTfuhvyVWNbXB35z4lny_zj9kiX729LmfPq9xwUDE3DqXkoI2FUirJtNWUCUfR1E6BFlTVYLRgjkmrdSE5WqxpAbxkawOo-JQ8nHzTz18HF2K1Gw6-TycrViQDXgo2suDEMn4Iwbu62vumQ3-sgFZjWtWftJKGnzXYrX1jN-7X-n_VD-Y3a3Y</recordid><startdate>20220101</startdate><enddate>20220101</enddate><creator>Poncelas, Alberto</creator><creator>Maillette de Buy Wenniger, Gideon</creator><creator>Way, Andy</creator><general>Cambridge University 
Press</general><scope>AAYXX</scope><scope>CITATION</scope><scope>3V.</scope><scope>7T9</scope><scope>7XB</scope><scope>88G</scope><scope>8AL</scope><scope>8FE</scope><scope>8FG</scope><scope>8FI</scope><scope>8FJ</scope><scope>8FK</scope><scope>ABJCF</scope><scope>ABUWG</scope><scope>AEUYN</scope><scope>AFKRA</scope><scope>ALSLI</scope><scope>ARAPS</scope><scope>AZQEC</scope><scope>BENPR</scope><scope>BGLVJ</scope><scope>CCPQU</scope><scope>CPGLG</scope><scope>CRLPW</scope><scope>DWQXO</scope><scope>FYUFA</scope><scope>GHDGH</scope><scope>GNUQQ</scope><scope>HCIFZ</scope><scope>JQ2</scope><scope>K7-</scope><scope>L6V</scope><scope>M0N</scope><scope>M2M</scope><scope>M7S</scope><scope>P5Z</scope><scope>P62</scope><scope>PQEST</scope><scope>PQQKQ</scope><scope>PQUKI</scope><scope>PSYQQ</scope><scope>PTHSS</scope><scope>Q9U</scope><orcidid>https://orcid.org/0000-0002-5089-1687</orcidid><orcidid>https://orcid.org/0000-0001-8427-7055</orcidid><orcidid>https://orcid.org/0000-0001-5736-5930</orcidid></search><sort><creationdate>20220101</creationdate><title>Improved feature decay algorithms for statistical machine translation</title><author>Poncelas, Alberto ; Maillette de Buy Wenniger, Gideon ; Way, Andy</author></sort><facets><frbrtype>5</frbrtype><frbrgroupid>cdi_FETCH-LOGICAL-c317t-cea55318cd1957528d8024e0acfe718407f1c842e25d88653adaf061392bc1a73</frbrgroupid><rsrctype>articles</rsrctype><prefilter>articles</prefilter><language>eng</language><creationdate>2022</creationdate><topic>Algorithms</topic><topic>Bilingualism</topic><topic>Decay</topic><topic>English language</topic><topic>German language</topic><topic>Language</topic><topic>Machine learning</topic><topic>Machine translation</topic><topic>Methods</topic><topic>N-Gram language models</topic><topic>Parallel corpora</topic><topic>Test sets</topic><topic>Training</topic><topic>Translation methods and strategies</topic><toplevel>peer_reviewed</toplevel><toplevel>online_resources</toplevel><creatorcontrib>Poncelas, 
Alberto</creatorcontrib><creatorcontrib>Maillette de Buy Wenniger, Gideon</creatorcontrib><creatorcontrib>Way, Andy</creatorcontrib><collection>CrossRef</collection><collection>ProQuest Central (Corporate)</collection><collection>Linguistics and Language Behavior Abstracts (LLBA)</collection><collection>ProQuest Central (purchase pre-March 2016)</collection><collection>Psychology Database (Alumni)</collection><collection>Computing Database (Alumni Edition)</collection><collection>ProQuest SciTech Collection</collection><collection>ProQuest Technology Collection</collection><collection>Hospital Premium Collection</collection><collection>Hospital Premium Collection (Alumni Edition)</collection><collection>ProQuest Central (Alumni) (purchase pre-March 2016)</collection><collection>Materials Science &amp; Engineering Collection</collection><collection>ProQuest Central (Alumni Edition)</collection><collection>ProQuest One Sustainability</collection><collection>ProQuest Central UK/Ireland</collection><collection>Social Science Premium Collection</collection><collection>Advanced Technologies &amp; Aerospace Collection</collection><collection>ProQuest Central Essentials</collection><collection>ProQuest Central</collection><collection>Technology Collection</collection><collection>ProQuest One Community College</collection><collection>Linguistics Collection</collection><collection>Linguistics Database</collection><collection>ProQuest Central Korea</collection><collection>Health Research Premium Collection</collection><collection>Health Research Premium Collection (Alumni)</collection><collection>ProQuest Central Student</collection><collection>SciTech Premium Collection</collection><collection>ProQuest Computer Science Collection</collection><collection>Computer Science Database</collection><collection>ProQuest Engineering Collection</collection><collection>Computing Database</collection><collection>Psychology Database</collection><collection>Engineering 
Database</collection><collection>Advanced Technologies &amp; Aerospace Database</collection><collection>ProQuest Advanced Technologies &amp; Aerospace Collection</collection><collection>ProQuest One Academic Eastern Edition (DO NOT USE)</collection><collection>ProQuest One Academic</collection><collection>ProQuest One Academic UKI Edition</collection><collection>ProQuest One Psychology</collection><collection>Engineering Collection</collection><collection>ProQuest Central Basic</collection><jtitle>Natural language engineering</jtitle></facets><delivery><delcategory>Remote Search Resource</delcategory><fulltext>fulltext</fulltext></delivery><addata><au>Poncelas, Alberto</au><au>Maillette de Buy Wenniger, Gideon</au><au>Way, Andy</au><format>journal</format><genre>article</genre><ristype>JOUR</ristype><atitle>Improved feature decay algorithms for statistical machine translation</atitle><jtitle>Natural language engineering</jtitle><addtitle>Nat. Lang. Eng</addtitle><date>2022-01-01</date><risdate>2022</risdate><volume>28</volume><issue>1</issue><spage>71</spage><epage>91</epage><pages>71-91</pages><issn>1351-3249</issn><eissn>1469-8110</eissn><abstract>In machine-learning applications, data selection is of crucial importance if good runtime performance is to be achieved. In a scenario where the test set is accessible when the model is being built, training instances can be selected so they are the most relevant for the test set. Feature Decay Algorithms (FDA) are a technique for data selection that has demonstrated excellent performance in a number of tasks. This method maximizes the diversity of the n-grams in the training set by devaluing those ones that have already been included. We focus on this method to undertake deeper research on how to select better training data instances. We give an overview of FDA and propose improvements in terms of speed and quality. 
Using German-to-English parallel data, first we create a novel approach that decreases the execution time of FDA when multiple computation units are available. In addition, we obtain improvements on translation quality by extending FDA using information from the parallel corpus that is generally ignored.</abstract><cop>Cambridge, UK</cop><pub>Cambridge University Press</pub><doi>10.1017/S1351324920000467</doi><tpages>21</tpages><orcidid>https://orcid.org/0000-0002-5089-1687</orcidid><orcidid>https://orcid.org/0000-0001-8427-7055</orcidid><orcidid>https://orcid.org/0000-0001-5736-5930</orcidid></addata></record>
fulltext fulltext
identifier ISSN: 1351-3249
ispartof Natural language engineering, 2022-01, Vol.28 (1), p.71-91
issn 1351-3249
1469-8110
language eng
recordid cdi_proquest_journals_2618439427
source Cambridge University Press Journals Complete
subjects Algorithms
Bilingualism
Decay
English language
German language
Language
Machine learning
Machine translation
Methods
N-Gram language models
Parallel corpora
Test sets
Training
Translation methods and strategies
title Improved feature decay algorithms for statistical machine translation
url https://sfx.bib-bvb.de/sfx_tum?ctx_ver=Z39.88-2004&ctx_enc=info:ofi/enc:UTF-8&ctx_tim=2025-01-07T20%3A21%3A47IST&url_ver=Z39.88-2004&url_ctx_fmt=infofi/fmt:kev:mtx:ctx&rfr_id=info:sid/primo.exlibrisgroup.com:primo3-Article-proquest_cross&rft_val_fmt=info:ofi/fmt:kev:mtx:journal&rft.genre=article&rft.atitle=Improved%20feature%20decay%20algorithms%20for%20statistical%20machine%20translation&rft.jtitle=Natural%20language%20engineering&rft.au=Poncelas,%20Alberto&rft.date=2022-01-01&rft.volume=28&rft.issue=1&rft.spage=71&rft.epage=91&rft.pages=71-91&rft.issn=1351-3249&rft.eissn=1469-8110&rft_id=info:doi/10.1017/S1351324920000467&rft_dat=%3Cproquest_cross%3E2618439427%3C/proquest_cross%3E%3Curl%3E%3C/url%3E&disable_directlink=true&sfx.directlink=off&sfx.report_link=0&rft_id=info:oai/&rft_pqid=2618439427&rft_id=info:pmid/&rft_cupid=10_1017_S1351324920000467&rfr_iscdi=true
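
The description field above says that FDA selects training instances relevant to the test set and keeps the selected n-grams diverse by devaluing those already included. A minimal sketch of that idea follows. It is an illustration only, not the authors' implementation: the function names (`ngrams`, `fda_select`), the exponential decay factor of 0.5, the length normalisation, and the maximum n-gram order of 3 are all assumptions, since this record does not give the scoring formula.

```python
from collections import Counter


def ngrams(tokens, max_n=3):
    """Return the set of all n-grams (as tuples) of order 1..max_n."""
    return {
        tuple(tokens[i:i + n])
        for n in range(1, max_n + 1)
        for i in range(len(tokens) - n + 1)
    }


def fda_select(candidates, test_sentences, k, decay=0.5, max_n=3):
    """Greedily pick up to k candidate sentences (hypothetical helper).

    Assumed scoring: each test-set n-gram found in a candidate is worth
    decay ** (times already selected); the sum is divided by the
    candidate's length in tokens.
    """
    # Features are the n-grams that occur in the test set.
    features = set()
    for sentence in test_sentences:
        features |= ngrams(sentence.split(), max_n)

    # Keep only the test-set n-grams each candidate contains.
    cand_feats = [ngrams(c.split(), max_n) & features for c in candidates]
    selected_counts = Counter()  # how often each feature was selected so far

    def score(i):
        length = max(len(candidates[i].split()), 1)
        return sum(decay ** selected_counts[f] for f in cand_feats[i]) / length

    remaining = set(range(len(candidates)))
    chosen = []
    while remaining and len(chosen) < k:
        # sorted() makes ties resolve to the lowest index.
        best = max(sorted(remaining), key=score)
        if score(best) == 0:
            break  # no remaining candidate shares an n-gram with the test set
        chosen.append(best)
        remaining.remove(best)
        # Decay: features just selected are worth less in later rounds.
        selected_counts.update(cand_feats[best])
    return [candidates[i] for i in chosen]


if __name__ == "__main__":
    pool = ["the cat sat", "the cat sat", "a dog barked", "unrelated words here"]
    test = ["the cat sat", "a dog barked"]
    print(fda_select(pool, test, k=2))
```

In this toy pool the duplicate sentence loses half its score once its twin has been chosen, so the second pick goes to the sentence covering unseen n-grams. The loop rescores every remaining candidate in each round, which keeps the sketch short but would be slow on a real corpus.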