Improved feature decay algorithms for statistical machine translation

In machine-learning applications, data selection is of crucial importance if good runtime performance is to be achieved. In a scenario where the test set is accessible when the model is being built, training instances can be selected so they are the most relevant for the test set. Feature Decay Algo...

Detailed description

Bibliographic details
Published in:	Natural language engineering 2022-01, Vol.28 (1), p.71-91
Main authors:	Poncelas, Alberto ; Maillette de Buy Wenniger, Gideon ; Way, Andy
Format:	Article
Language:	eng
Subjects:	Algorithms ; Bilingualism ; Decay ; English language ; German language ; Language ; Machine learning ; Machine translation ; Methods ; N-Gram language models ; Parallel corpora ; Test sets ; Training ; Translation methods and strategies
Online access:	Full text

container_end_page	91
container_issue	1
container_start_page	71
container_title	Natural language engineering
container_volume	28
creator	Poncelas, Alberto ; Maillette de Buy Wenniger, Gideon ; Way, Andy
description	In machine-learning applications, data selection is of crucial importance if good runtime performance is to be achieved. In a scenario where the test set is accessible when the model is being built, training instances can be selected so they are the most relevant for the test set. Feature Decay Algorithms (FDA) are a technique for data selection that has demonstrated excellent performance in a number of tasks. This method maximizes the diversity of the n-grams in the training set by devaluing those ones that have already been included. We focus on this method to undertake deeper research on how to select better training data instances. We give an overview of FDA and propose improvements in terms of speed and quality. Using German-to-English parallel data, first we create a novel approach that decreases the execution time of FDA when multiple computation units are available. In addition, we obtain improvements on translation quality by extending FDA using information from the parallel corpus that is generally ignored.
doi_str_mv	10.1017/S1351324920000467
format	Article
fullrecord	<record><control><sourceid>proquest_cross</sourceid><recordid>TN_cdi_proquest_journals_2618439427</recordid><sourceformat>XML</sourceformat><sourcesystem>PC</sourcesystem><cupid>10_1017_S1351324920000467</cupid><sourcerecordid>2618439427</sourcerecordid><originalsourceid>FETCH-LOGICAL-c317t-cea55318cd1957528d8024e0acfe718407f1c842e25d88653adaf061392bc1a73</originalsourceid><addsrcrecordid>eNp1UEtLAzEQDqJgrf4AbwueVzN5bLJHKdUWCh7U8zJNsu2WfdQkFfrvzdKCB3EuM8z3mOEj5B7oI1BQT-_AJXAmSkZTiUJdkAmIosw1AL1Mc4LzEb8mNyHsRg4oMSHzZbf3w7ezWe0wHrzLrDN4zLDdDL6J2y5k9eCzEDE2ITYG26xDs216l0WPfWjTfuhvyVWNbXB35z4lny_zj9kiX729LmfPq9xwUDE3DqXkoI2FUirJtNWUCUfR1E6BFlTVYLRgjkmrdSE5WqxpAbxkawOo-JQ8nHzTz18HF2K1Gw6-TycrViQDXgo2suDEMn4Iwbu62vumQ3-sgFZjWtWftJKGnzXYrX1jN-7X-n_VD-Y3a3Y</addsrcrecordid><sourcetype>Aggregation Database</sourcetype><iscdi>true</iscdi><recordtype>article</recordtype><pqid>2618439427</pqid></control><display><type>article</type><title>Improved feature decay algorithms for statistical machine translation</title><source>Cambridge University Press Journals Complete</source><creator>Poncelas, Alberto ; Maillette de Buy Wenniger, Gideon ; Way, Andy</creator><creatorcontrib>Poncelas, Alberto ; Maillette de Buy Wenniger, Gideon ; Way, Andy</creatorcontrib><description>In machine-learning applications, data selection is of crucial importance if good runtime performance is to be achieved. In a scenario where the test set is accessible when the model is being built, training instances can be selected so they are the most relevant for the test set. Feature Decay Algorithms (FDA) are a technique for data selection that has demonstrated excellent performance in a number of tasks. This method maximizes the diversity of the n-grams in the training set by devaluing those ones that have already been included. We focus on this method to undertake deeper research on how to select better training data instances. 
We give an overview of FDA and propose improvements in terms of speed and quality. Using German-to-English parallel data, first we create a novel approach that decreases the execution time of FDA when multiple computation units are available. In addition, we obtain improvements on translation quality by extending FDA using information from the parallel corpus that is generally ignored.</description><identifier>ISSN: 1351-3249</identifier><identifier>EISSN: 1469-8110</identifier><identifier>DOI: 10.1017/S1351324920000467</identifier><language>eng</language><publisher>Cambridge, UK: Cambridge University Press</publisher><subject>Algorithms ; Bilingualism ; Decay ; English language ; German language ; Language ; Machine learning ; Machine translation ; Methods ; N-Gram language models ; Parallel corpora ; Test sets ; Training ; Translation methods and strategies</subject><ispartof>Natural language engineering, 2022-01, Vol.28 (1), p.71-91</ispartof><rights>The Author(s), 2020. Published by Cambridge University Press</rights><lds50>peer_reviewed</lds50><woscitedreferencessubscribed>false</woscitedreferencessubscribed><citedby>FETCH-LOGICAL-c317t-cea55318cd1957528d8024e0acfe718407f1c842e25d88653adaf061392bc1a73</citedby><cites>FETCH-LOGICAL-c317t-cea55318cd1957528d8024e0acfe718407f1c842e25d88653adaf061392bc1a73</cites><orcidid>0000-0002-5089-1687 ; 0000-0001-8427-7055 ; 0000-0001-5736-5930</orcidid></display><links><openurl>$$Topenurl_article</openurl><openurlfulltext>$$Topenurlfull_article</openurlfulltext><thumbnail>$$Tsyndetics_thumb_exl</thumbnail><linktohtml>$$Uhttps://www.cambridge.org/core/product/identifier/S1351324920000467/type/journal_article$$EHTML$$P50$$Gcambridge$$H</linktohtml><link.rule.ids>164,314,780,784,27924,27925,55628</link.rule.ids></links><search><creatorcontrib>Poncelas, Alberto</creatorcontrib><creatorcontrib>Maillette de Buy Wenniger, Gideon</creatorcontrib><creatorcontrib>Way, Andy</creatorcontrib><title>Improved feature decay algorithms for 
statistical machine translation</title><title>Natural language engineering</title><addtitle>Nat. Lang. Eng</addtitle><description>In machine-learning applications, data selection is of crucial importance if good runtime performance is to be achieved. In a scenario where the test set is accessible when the model is being built, training instances can be selected so they are the most relevant for the test set. Feature Decay Algorithms (FDA) are a technique for data selection that has demonstrated excellent performance in a number of tasks. This method maximizes the diversity of the n-grams in the training set by devaluing those ones that have already been included. We focus on this method to undertake deeper research on how to select better training data instances. We give an overview of FDA and propose improvements in terms of speed and quality. Using German-to-English parallel data, first we create a novel approach that decreases the execution time of FDA when multiple computation units are available. 
In addition, we obtain improvements on translation quality by extending FDA using information from the parallel corpus that is generally ignored.</description><subject>Algorithms</subject><subject>Bilingualism</subject><subject>Decay</subject><subject>English language</subject><subject>German language</subject><subject>Language</subject><subject>Machine learning</subject><subject>Machine translation</subject><subject>Methods</subject><subject>N-Gram language models</subject><subject>Parallel corpora</subject><subject>Test sets</subject><subject>Training</subject><subject>Translation methods and strategies</subject><issn>1351-3249</issn><issn>1469-8110</issn><fulltext>true</fulltext><rsrctype>article</rsrctype><creationdate>2022</creationdate><recordtype>article</recordtype><sourceid>ABUWG</sourceid><sourceid>AFKRA</sourceid><sourceid>AZQEC</sourceid><sourceid>BENPR</sourceid><sourceid>CCPQU</sourceid><sourceid>DWQXO</sourceid><sourceid>GNUQQ</sourceid><recordid>eNp1UEtLAzEQDqJgrf4AbwueVzN5bLJHKdUWCh7U8zJNsu2WfdQkFfrvzdKCB3EuM8z3mOEj5B7oI1BQT-_AJXAmSkZTiUJdkAmIosw1AL1Mc4LzEb8mNyHsRg4oMSHzZbf3w7ezWe0wHrzLrDN4zLDdDL6J2y5k9eCzEDE2ITYG26xDs216l0WPfWjTfuhvyVWNbXB35z4lny_zj9kiX729LmfPq9xwUDE3DqXkoI2FUirJtNWUCUfR1E6BFlTVYLRgjkmrdSE5WqxpAbxkawOo-JQ8nHzTz18HF2K1Gw6-TycrViQDXgo2suDEMn4Iwbu62vumQ3-sgFZjWtWftJKGnzXYrX1jN-7X-n_VD-Y3a3Y</recordid><startdate>20220101</startdate><enddate>20220101</enddate><creator>Poncelas, Alberto</creator><creator>Maillette de Buy Wenniger, Gideon</creator><creator>Way, Andy</creator><general>Cambridge University 
Press</general><scope>AAYXX</scope><scope>CITATION</scope><scope>3V.</scope><scope>7T9</scope><scope>7XB</scope><scope>88G</scope><scope>8AL</scope><scope>8FE</scope><scope>8FG</scope><scope>8FI</scope><scope>8FJ</scope><scope>8FK</scope><scope>ABJCF</scope><scope>ABUWG</scope><scope>AEUYN</scope><scope>AFKRA</scope><scope>ALSLI</scope><scope>ARAPS</scope><scope>AZQEC</scope><scope>BENPR</scope><scope>BGLVJ</scope><scope>CCPQU</scope><scope>CPGLG</scope><scope>CRLPW</scope><scope>DWQXO</scope><scope>FYUFA</scope><scope>GHDGH</scope><scope>GNUQQ</scope><scope>HCIFZ</scope><scope>JQ2</scope><scope>K7-</scope><scope>L6V</scope><scope>M0N</scope><scope>M2M</scope><scope>M7S</scope><scope>P5Z</scope><scope>P62</scope><scope>PQEST</scope><scope>PQQKQ</scope><scope>PQUKI</scope><scope>PSYQQ</scope><scope>PTHSS</scope><scope>Q9U</scope><orcidid>https://orcid.org/0000-0002-5089-1687</orcidid><orcidid>https://orcid.org/0000-0001-8427-7055</orcidid><orcidid>https://orcid.org/0000-0001-5736-5930</orcidid></search><sort><creationdate>20220101</creationdate><title>Improved feature decay algorithms for statistical machine translation</title><author>Poncelas, Alberto ; Maillette de Buy Wenniger, Gideon ; Way, Andy</author></sort><facets><frbrtype>5</frbrtype><frbrgroupid>cdi_FETCH-LOGICAL-c317t-cea55318cd1957528d8024e0acfe718407f1c842e25d88653adaf061392bc1a73</frbrgroupid><rsrctype>articles</rsrctype><prefilter>articles</prefilter><language>eng</language><creationdate>2022</creationdate><topic>Algorithms</topic><topic>Bilingualism</topic><topic>Decay</topic><topic>English language</topic><topic>German language</topic><topic>Language</topic><topic>Machine learning</topic><topic>Machine translation</topic><topic>Methods</topic><topic>N-Gram language models</topic><topic>Parallel corpora</topic><topic>Test sets</topic><topic>Training</topic><topic>Translation methods and strategies</topic><toplevel>peer_reviewed</toplevel><toplevel>online_resources</toplevel><creatorcontrib>Poncelas, 
Alberto</creatorcontrib><creatorcontrib>Maillette de Buy Wenniger, Gideon</creatorcontrib><creatorcontrib>Way, Andy</creatorcontrib><collection>CrossRef</collection><collection>ProQuest Central (Corporate)</collection><collection>Linguistics and Language Behavior Abstracts (LLBA)</collection><collection>ProQuest Central (purchase pre-March 2016)</collection><collection>Psychology Database (Alumni)</collection><collection>Computing Database (Alumni Edition)</collection><collection>ProQuest SciTech Collection</collection><collection>ProQuest Technology Collection</collection><collection>Hospital Premium Collection</collection><collection>Hospital Premium Collection (Alumni Edition)</collection><collection>ProQuest Central (Alumni) (purchase pre-March 2016)</collection><collection>Materials Science & Engineering Collection</collection><collection>ProQuest Central (Alumni Edition)</collection><collection>ProQuest One Sustainability</collection><collection>ProQuest Central UK/Ireland</collection><collection>Social Science Premium Collection</collection><collection>Advanced Technologies & Aerospace Collection</collection><collection>ProQuest Central Essentials</collection><collection>ProQuest Central</collection><collection>Technology Collection</collection><collection>ProQuest One Community College</collection><collection>Linguistics Collection</collection><collection>Linguistics Database</collection><collection>ProQuest Central Korea</collection><collection>Health Research Premium Collection</collection><collection>Health Research Premium Collection (Alumni)</collection><collection>ProQuest Central Student</collection><collection>SciTech Premium Collection</collection><collection>ProQuest Computer Science Collection</collection><collection>Computer Science Database</collection><collection>ProQuest Engineering Collection</collection><collection>Computing Database</collection><collection>Psychology Database</collection><collection>Engineering 
Database</collection><collection>Advanced Technologies & Aerospace Database</collection><collection>ProQuest Advanced Technologies & Aerospace Collection</collection><collection>ProQuest One Academic Eastern Edition (DO NOT USE)</collection><collection>ProQuest One Academic</collection><collection>ProQuest One Academic UKI Edition</collection><collection>ProQuest One Psychology</collection><collection>Engineering Collection</collection><collection>ProQuest Central Basic</collection><jtitle>Natural language engineering</jtitle></facets><delivery><delcategory>Remote Search Resource</delcategory><fulltext>fulltext</fulltext></delivery><addata><au>Poncelas, Alberto</au><au>Maillette de Buy Wenniger, Gideon</au><au>Way, Andy</au><format>journal</format><genre>article</genre><ristype>JOUR</ristype><atitle>Improved feature decay algorithms for statistical machine translation</atitle><jtitle>Natural language engineering</jtitle><addtitle>Nat. Lang. Eng</addtitle><date>2022-01-01</date><risdate>2022</risdate><volume>28</volume><issue>1</issue><spage>71</spage><epage>91</epage><pages>71-91</pages><issn>1351-3249</issn><eissn>1469-8110</eissn><abstract>In machine-learning applications, data selection is of crucial importance if good runtime performance is to be achieved. In a scenario where the test set is accessible when the model is being built, training instances can be selected so they are the most relevant for the test set. Feature Decay Algorithms (FDA) are a technique for data selection that has demonstrated excellent performance in a number of tasks. This method maximizes the diversity of the n-grams in the training set by devaluing those ones that have already been included. We focus on this method to undertake deeper research on how to select better training data instances. We give an overview of FDA and propose improvements in terms of speed and quality. 
Using German-to-English parallel data, first we create a novel approach that decreases the execution time of FDA when multiple computation units are available. In addition, we obtain improvements on translation quality by extending FDA using information from the parallel corpus that is generally ignored.</abstract><cop>Cambridge, UK</cop><pub>Cambridge University Press</pub><doi>10.1017/S1351324920000467</doi><tpages>21</tpages><orcidid>https://orcid.org/0000-0002-5089-1687</orcidid><orcidid>https://orcid.org/0000-0001-8427-7055</orcidid><orcidid>https://orcid.org/0000-0001-5736-5930</orcidid></addata></record>
fulltext	fulltext
identifier	ISSN: 1351-3249
ispartof	Natural language engineering, 2022-01, Vol.28 (1), p.71-91
issn	1351-3249 ; 1469-8110 (EISSN)
language	eng
recordid	cdi_proquest_journals_2618439427
source	Cambridge University Press Journals Complete
subjects	Algorithms ; Bilingualism ; Decay ; English language ; German language ; Language ; Machine learning ; Machine translation ; Methods ; N-Gram language models ; Parallel corpora ; Test sets ; Training ; Translation methods and strategies
title	Improved feature decay algorithms for statistical machine translation
url	https://sfx.bib-bvb.de/sfx_tum?ctx_ver=Z39.88-2004&ctx_enc=info:ofi/enc:UTF-8&ctx_tim=2025-01-07T20%3A21%3A47IST&url_ver=Z39.88-2004&url_ctx_fmt=infofi/fmt:kev:mtx:ctx&rfr_id=info:sid/primo.exlibrisgroup.com:primo3-Article-proquest_cross&rft_val_fmt=info:ofi/fmt:kev:mtx:journal&rft.genre=article&rft.atitle=Improved%20feature%20decay%20algorithms%20for%20statistical%20machine%20translation&rft.jtitle=Natural%20language%20engineering&rft.au=Poncelas,%20Alberto&rft.date=2022-01-01&rft.volume=28&rft.issue=1&rft.spage=71&rft.epage=91&rft.pages=71-91&rft.issn=1351-3249&rft.eissn=1469-8110&rft_id=info:doi/10.1017/S1351324920000467&rft_dat=%3Cproquest_cross%3E2618439427%3C/proquest_cross%3E%3Curl%3E%3C/url%3E&disable_directlink=true&sfx.directlink=off&sfx.report_link=0&rft_id=info:oai/&rft_pqid=2618439427&rft_id=info:pmid/&rft_cupid=10_1017_S1351324920000467&rfr_iscdi=true