Improved feature decay algorithms for statistical machine translation
In machine-learning applications, data selection is of crucial importance if good runtime performance is to be achieved. In a scenario where the test set is accessible when the model is being built, training instances can be selected so they are the most relevant for the test set. Feature Decay Algo...
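Published in: | Natural language engineering 2022-01, Vol.28 (1), p.71-91 |
---|---|
Main authors: | Poncelas, Alberto ; Maillette de Buy Wenniger, Gideon ; Way, Andy |
Format: | Article |
Language: | eng |
Subjects: | Algorithms ; Bilingualism ; Decay ; English language ; German language ; Language ; Machine learning ; Machine translation ; Methods ; N-Gram language models ; Parallel corpora ; Test sets ; Training ; Translation methods and strategies |
Online access: | Full text |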
container_end_page | 91 |
---|---|
container_issue | 1 |
container_start_page | 71 |
container_title | Natural language engineering |
container_volume | 28 |
creator | Poncelas, Alberto ; Maillette de Buy Wenniger, Gideon ; Way, Andy |
description | In machine-learning applications, data selection is of crucial importance if good runtime performance is to be achieved. In a scenario where the test set is accessible when the model is being built, training instances can be selected so they are the most relevant for the test set. Feature Decay Algorithms (FDA) are a technique for data selection that has demonstrated excellent performance in a number of tasks. This method maximizes the diversity of the n-grams in the training set by devaluing those ones that have already been included. We focus on this method to undertake deeper research on how to select better training data instances. We give an overview of FDA and propose improvements in terms of speed and quality. Using German-to-English parallel data, first we create a novel approach that decreases the execution time of FDA when multiple computation units are available. In addition, we obtain improvements on translation quality by extending FDA using information from the parallel corpus that is generally ignored. |
doi_str_mv | 10.1017/S1351324920000467 |
format | Article |
fullrecord | <record><control><sourceid>proquest_cross</sourceid><recordid>TN_cdi_proquest_journals_2618439427</recordid><sourceformat>XML</sourceformat><sourcesystem>PC</sourcesystem><cupid>10_1017_S1351324920000467</cupid><sourcerecordid>2618439427</sourcerecordid><originalsourceid>FETCH-LOGICAL-c317t-cea55318cd1957528d8024e0acfe718407f1c842e25d88653adaf061392bc1a73</originalsourceid><addsrcrecordid>eNp1UEtLAzEQDqJgrf4AbwueVzN5bLJHKdUWCh7U8zJNsu2WfdQkFfrvzdKCB3EuM8z3mOEj5B7oI1BQT-_AJXAmSkZTiUJdkAmIosw1AL1Mc4LzEb8mNyHsRg4oMSHzZbf3w7ezWe0wHrzLrDN4zLDdDL6J2y5k9eCzEDE2ITYG26xDs216l0WPfWjTfuhvyVWNbXB35z4lny_zj9kiX729LmfPq9xwUDE3DqXkoI2FUirJtNWUCUfR1E6BFlTVYLRgjkmrdSE5WqxpAbxkawOo-JQ8nHzTz18HF2K1Gw6-TycrViQDXgo2suDEMn4Iwbu62vumQ3-sgFZjWtWftJKGnzXYrX1jN-7X-n_VD-Y3a3Y</addsrcrecordid><sourcetype>Aggregation Database</sourcetype><iscdi>true</iscdi><recordtype>article</recordtype><pqid>2618439427</pqid></control><display><type>article</type><title>Improved feature decay algorithms for statistical machine translation</title><source>Cambridge University Press Journals Complete</source><creator>Poncelas, Alberto ; Maillette de Buy Wenniger, Gideon ; Way, Andy</creator><creatorcontrib>Poncelas, Alberto ; Maillette de Buy Wenniger, Gideon ; Way, Andy</creatorcontrib><description>In machine-learning applications, data selection is of crucial importance if good runtime performance is to be achieved. In a scenario where the test set is accessible when the model is being built, training instances can be selected so they are the most relevant for the test set. Feature Decay Algorithms (FDA) are a technique for data selection that has demonstrated excellent performance in a number of tasks. This method maximizes the diversity of the n-grams in the training set by devaluing those ones that have already been included. We focus on this method to undertake deeper research on how to select better training data instances. 
We give an overview of FDA and propose improvements in terms of speed and quality. Using German-to-English parallel data, first we create a novel approach that decreases the execution time of FDA when multiple computation units are available. In addition, we obtain improvements on translation quality by extending FDA using information from the parallel corpus that is generally ignored.</description><identifier>ISSN: 1351-3249</identifier><identifier>EISSN: 1469-8110</identifier><identifier>DOI: 10.1017/S1351324920000467</identifier><language>eng</language><publisher>Cambridge, UK: Cambridge University Press</publisher><subject>Algorithms ; Bilingualism ; Decay ; English language ; German language ; Language ; Machine learning ; Machine translation ; Methods ; N-Gram language models ; Parallel corpora ; Test sets ; Training ; Translation methods and strategies</subject><ispartof>Natural language engineering, 2022-01, Vol.28 (1), p.71-91</ispartof><rights>The Author(s), 2020. Published by Cambridge University Press</rights><lds50>peer_reviewed</lds50><woscitedreferencessubscribed>false</woscitedreferencessubscribed><citedby>FETCH-LOGICAL-c317t-cea55318cd1957528d8024e0acfe718407f1c842e25d88653adaf061392bc1a73</citedby><cites>FETCH-LOGICAL-c317t-cea55318cd1957528d8024e0acfe718407f1c842e25d88653adaf061392bc1a73</cites><orcidid>0000-0002-5089-1687 ; 0000-0001-8427-7055 ; 0000-0001-5736-5930</orcidid></display><links><openurl>$$Topenurl_article</openurl><openurlfulltext>$$Topenurlfull_article</openurlfulltext><thumbnail>$$Tsyndetics_thumb_exl</thumbnail><linktohtml>$$Uhttps://www.cambridge.org/core/product/identifier/S1351324920000467/type/journal_article$$EHTML$$P50$$Gcambridge$$H</linktohtml><link.rule.ids>164,314,780,784,27924,27925,55628</link.rule.ids></links><search><creatorcontrib>Poncelas, Alberto</creatorcontrib><creatorcontrib>Maillette de Buy Wenniger, Gideon</creatorcontrib><creatorcontrib>Way, Andy</creatorcontrib><title>Improved feature decay algorithms for 
statistical machine translation</title><title>Natural language engineering</title><addtitle>Nat. Lang. Eng</addtitle><description>In machine-learning applications, data selection is of crucial importance if good runtime performance is to be achieved. In a scenario where the test set is accessible when the model is being built, training instances can be selected so they are the most relevant for the test set. Feature Decay Algorithms (FDA) are a technique for data selection that has demonstrated excellent performance in a number of tasks. This method maximizes the diversity of the n-grams in the training set by devaluing those ones that have already been included. We focus on this method to undertake deeper research on how to select better training data instances. We give an overview of FDA and propose improvements in terms of speed and quality. Using German-to-English parallel data, first we create a novel approach that decreases the execution time of FDA when multiple computation units are available. 
In addition, we obtain improvements on translation quality by extending FDA using information from the parallel corpus that is generally ignored.</description><subject>Algorithms</subject><subject>Bilingualism</subject><subject>Decay</subject><subject>English language</subject><subject>German language</subject><subject>Language</subject><subject>Machine learning</subject><subject>Machine translation</subject><subject>Methods</subject><subject>N-Gram language models</subject><subject>Parallel corpora</subject><subject>Test sets</subject><subject>Training</subject><subject>Translation methods and strategies</subject><issn>1351-3249</issn><issn>1469-8110</issn><fulltext>true</fulltext><rsrctype>article</rsrctype><creationdate>2022</creationdate><recordtype>article</recordtype><sourceid>ABUWG</sourceid><sourceid>AFKRA</sourceid><sourceid>AZQEC</sourceid><sourceid>BENPR</sourceid><sourceid>CCPQU</sourceid><sourceid>DWQXO</sourceid><sourceid>GNUQQ</sourceid><recordid>eNp1UEtLAzEQDqJgrf4AbwueVzN5bLJHKdUWCh7U8zJNsu2WfdQkFfrvzdKCB3EuM8z3mOEj5B7oI1BQT-_AJXAmSkZTiUJdkAmIosw1AL1Mc4LzEb8mNyHsRg4oMSHzZbf3w7ezWe0wHrzLrDN4zLDdDL6J2y5k9eCzEDE2ITYG26xDs216l0WPfWjTfuhvyVWNbXB35z4lny_zj9kiX729LmfPq9xwUDE3DqXkoI2FUirJtNWUCUfR1E6BFlTVYLRgjkmrdSE5WqxpAbxkawOo-JQ8nHzTz18HF2K1Gw6-TycrViQDXgo2suDEMn4Iwbu62vumQ3-sgFZjWtWftJKGnzXYrX1jN-7X-n_VD-Y3a3Y</recordid><startdate>20220101</startdate><enddate>20220101</enddate><creator>Poncelas, Alberto</creator><creator>Maillette de Buy Wenniger, Gideon</creator><creator>Way, Andy</creator><general>Cambridge University 
Press</general><scope>AAYXX</scope><scope>CITATION</scope><scope>3V.</scope><scope>7T9</scope><scope>7XB</scope><scope>88G</scope><scope>8AL</scope><scope>8FE</scope><scope>8FG</scope><scope>8FI</scope><scope>8FJ</scope><scope>8FK</scope><scope>ABJCF</scope><scope>ABUWG</scope><scope>AEUYN</scope><scope>AFKRA</scope><scope>ALSLI</scope><scope>ARAPS</scope><scope>AZQEC</scope><scope>BENPR</scope><scope>BGLVJ</scope><scope>CCPQU</scope><scope>CPGLG</scope><scope>CRLPW</scope><scope>DWQXO</scope><scope>FYUFA</scope><scope>GHDGH</scope><scope>GNUQQ</scope><scope>HCIFZ</scope><scope>JQ2</scope><scope>K7-</scope><scope>L6V</scope><scope>M0N</scope><scope>M2M</scope><scope>M7S</scope><scope>P5Z</scope><scope>P62</scope><scope>PQEST</scope><scope>PQQKQ</scope><scope>PQUKI</scope><scope>PSYQQ</scope><scope>PTHSS</scope><scope>Q9U</scope><orcidid>https://orcid.org/0000-0002-5089-1687</orcidid><orcidid>https://orcid.org/0000-0001-8427-7055</orcidid><orcidid>https://orcid.org/0000-0001-5736-5930</orcidid></search><sort><creationdate>20220101</creationdate><title>Improved feature decay algorithms for statistical machine translation</title><author>Poncelas, Alberto ; Maillette de Buy Wenniger, Gideon ; Way, Andy</author></sort><facets><frbrtype>5</frbrtype><frbrgroupid>cdi_FETCH-LOGICAL-c317t-cea55318cd1957528d8024e0acfe718407f1c842e25d88653adaf061392bc1a73</frbrgroupid><rsrctype>articles</rsrctype><prefilter>articles</prefilter><language>eng</language><creationdate>2022</creationdate><topic>Algorithms</topic><topic>Bilingualism</topic><topic>Decay</topic><topic>English language</topic><topic>German language</topic><topic>Language</topic><topic>Machine learning</topic><topic>Machine translation</topic><topic>Methods</topic><topic>N-Gram language models</topic><topic>Parallel corpora</topic><topic>Test sets</topic><topic>Training</topic><topic>Translation methods and strategies</topic><toplevel>peer_reviewed</toplevel><toplevel>online_resources</toplevel><creatorcontrib>Poncelas, 
Alberto</creatorcontrib><creatorcontrib>Maillette de Buy Wenniger, Gideon</creatorcontrib><creatorcontrib>Way, Andy</creatorcontrib><collection>CrossRef</collection><collection>ProQuest Central (Corporate)</collection><collection>Linguistics and Language Behavior Abstracts (LLBA)</collection><collection>ProQuest Central (purchase pre-March 2016)</collection><collection>Psychology Database (Alumni)</collection><collection>Computing Database (Alumni Edition)</collection><collection>ProQuest SciTech Collection</collection><collection>ProQuest Technology Collection</collection><collection>Hospital Premium Collection</collection><collection>Hospital Premium Collection (Alumni Edition)</collection><collection>ProQuest Central (Alumni) (purchase pre-March 2016)</collection><collection>Materials Science & Engineering Collection</collection><collection>ProQuest Central (Alumni Edition)</collection><collection>ProQuest One Sustainability</collection><collection>ProQuest Central UK/Ireland</collection><collection>Social Science Premium Collection</collection><collection>Advanced Technologies & Aerospace Collection</collection><collection>ProQuest Central Essentials</collection><collection>ProQuest Central</collection><collection>Technology Collection</collection><collection>ProQuest One Community College</collection><collection>Linguistics Collection</collection><collection>Linguistics Database</collection><collection>ProQuest Central Korea</collection><collection>Health Research Premium Collection</collection><collection>Health Research Premium Collection (Alumni)</collection><collection>ProQuest Central Student</collection><collection>SciTech Premium Collection</collection><collection>ProQuest Computer Science Collection</collection><collection>Computer Science Database</collection><collection>ProQuest Engineering Collection</collection><collection>Computing Database</collection><collection>Psychology Database</collection><collection>Engineering 
Database</collection><collection>Advanced Technologies & Aerospace Database</collection><collection>ProQuest Advanced Technologies & Aerospace Collection</collection><collection>ProQuest One Academic Eastern Edition (DO NOT USE)</collection><collection>ProQuest One Academic</collection><collection>ProQuest One Academic UKI Edition</collection><collection>ProQuest One Psychology</collection><collection>Engineering Collection</collection><collection>ProQuest Central Basic</collection><jtitle>Natural language engineering</jtitle></facets><delivery><delcategory>Remote Search Resource</delcategory><fulltext>fulltext</fulltext></delivery><addata><au>Poncelas, Alberto</au><au>Maillette de Buy Wenniger, Gideon</au><au>Way, Andy</au><format>journal</format><genre>article</genre><ristype>JOUR</ristype><atitle>Improved feature decay algorithms for statistical machine translation</atitle><jtitle>Natural language engineering</jtitle><addtitle>Nat. Lang. Eng</addtitle><date>2022-01-01</date><risdate>2022</risdate><volume>28</volume><issue>1</issue><spage>71</spage><epage>91</epage><pages>71-91</pages><issn>1351-3249</issn><eissn>1469-8110</eissn><abstract>In machine-learning applications, data selection is of crucial importance if good runtime performance is to be achieved. In a scenario where the test set is accessible when the model is being built, training instances can be selected so they are the most relevant for the test set. Feature Decay Algorithms (FDA) are a technique for data selection that has demonstrated excellent performance in a number of tasks. This method maximizes the diversity of the n-grams in the training set by devaluing those ones that have already been included. We focus on this method to undertake deeper research on how to select better training data instances. We give an overview of FDA and propose improvements in terms of speed and quality. 
Using German-to-English parallel data, first we create a novel approach that decreases the execution time of FDA when multiple computation units are available. In addition, we obtain improvements on translation quality by extending FDA using information from the parallel corpus that is generally ignored.</abstract><cop>Cambridge, UK</cop><pub>Cambridge University Press</pub><doi>10.1017/S1351324920000467</doi><tpages>21</tpages><orcidid>https://orcid.org/0000-0002-5089-1687</orcidid><orcidid>https://orcid.org/0000-0001-8427-7055</orcidid><orcidid>https://orcid.org/0000-0001-5736-5930</orcidid></addata></record> |
fulltext | fulltext |
identifier | ISSN: 1351-3249 |
ispartof | Natural language engineering, 2022-01, Vol.28 (1), p.71-91 |
issn | 1351-3249 ; 1469-8110 |
language | eng |
recordid | cdi_proquest_journals_2618439427 |
source | Cambridge University Press Journals Complete |
subjects | Algorithms ; Bilingualism ; Decay ; English language ; German language ; Language ; Machine learning ; Machine translation ; Methods ; N-Gram language models ; Parallel corpora ; Test sets ; Training ; Translation methods and strategies |
title | Improved feature decay algorithms for statistical machine translation |
url | https://sfx.bib-bvb.de/sfx_tum?ctx_ver=Z39.88-2004&ctx_enc=info:ofi/enc:UTF-8&ctx_tim=2025-01-07T20%3A21%3A47IST&url_ver=Z39.88-2004&url_ctx_fmt=infofi/fmt:kev:mtx:ctx&rfr_id=info:sid/primo.exlibrisgroup.com:primo3-Article-proquest_cross&rft_val_fmt=info:ofi/fmt:kev:mtx:journal&rft.genre=article&rft.atitle=Improved%20feature%20decay%20algorithms%20for%20statistical%20machine%20translation&rft.jtitle=Natural%20language%20engineering&rft.au=Poncelas,%20Alberto&rft.date=2022-01-01&rft.volume=28&rft.issue=1&rft.spage=71&rft.epage=91&rft.pages=71-91&rft.issn=1351-3249&rft.eissn=1469-8110&rft_id=info:doi/10.1017/S1351324920000467&rft_dat=%3Cproquest_cross%3E2618439427%3C/proquest_cross%3E%3Curl%3E%3C/url%3E&disable_directlink=true&sfx.directlink=off&sfx.report_link=0&rft_id=info:oai/&rft_pqid=2618439427&rft_id=info:pmid/&rft_cupid=10_1017_S1351324920000467&rfr_iscdi=true |
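
The description field in this record says that FDA selects training instances relevant to the test set and promotes n-gram diversity by devaluing n-grams that have already been included. The record does not give the scoring formula, so the following is only a minimal sketch of that general idea, not the authors' implementation. The function names (`ngrams`, `fda_select`), the initial feature value of 1.0, the exponential `decay` factor, the length normalization, and the maximum n-gram order are all assumptions made for illustration.

```python
from collections import Counter


def ngrams(tokens, max_n=3):
    """Return all n-grams of order 1..max_n as tuples."""
    return [tuple(tokens[i:i + n])
            for n in range(1, max_n + 1)
            for i in range(len(tokens) - n + 1)]


def fda_select(candidates, test_sentences, k, decay=0.5, max_n=3):
    """Greedily pick up to k candidate sentences (hypothetical helper).

    Assumptions, not taken from the record: every test-set n-gram starts
    with value 1.0, its value is multiplied by `decay` each time it is
    included in a selected sentence, and scores are divided by length.
    """
    # Features are the n-grams that occur in the test set.
    features = {g for s in test_sentences for g in ngrams(s.split(), max_n)}
    # For each candidate, keep only the n-grams that are test-set features.
    cand_feats = [[g for g in ngrams(c.split(), max_n) if g in features]
                  for c in candidates]
    selected_counts = Counter()  # how often each feature was already included
    remaining = list(range(len(candidates)))
    chosen = []

    def score(i):
        length = len(candidates[i].split()) or 1
        # A feature seen c times in the selection is worth decay ** c.
        return sum(decay ** selected_counts[g] for g in cand_feats[i]) / length

    while remaining and len(chosen) < k:
        best = max(remaining, key=score)  # ties resolve to the earliest candidate
        chosen.append(best)
        remaining.remove(best)
        selected_counts.update(cand_feats[best])  # devalue included n-grams
    return [candidates[i] for i in chosen]


if __name__ == "__main__":
    pool = ["the dog ran", "the dog ran", "a cat sat"]
    print(fda_select(pool, ["the dog sat"], k=2, decay=0.25))
```

With a strong decay, the second copy of an already selected sentence scores lower than a sentence contributing an unseen test-set n-gram, which is the diversity effect the description refers to. A weaker decay (closer to 1.0) would favor raw overlap with the test set instead.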