Language Chameleon: Transformation analysis between languages using Cross-lingual Post-training based on Pre-trained language models

As pre-trained language models become more resource-demanding, the inequality between resource-rich languages such as English and resource-scarce languages is worsening. This can be attributed to the fact that the amount of available training data in each language follows the power-law distribution,...

Detailed description

Saved in:
Bibliographic details
Published in: arXiv.org 2022-09
Main authors: Son, Suhyune; Park, Chanjun; Lee, Jungseob; Shim, Midan; Lee, Chanhee; Jang, Yoonna; Seo, Jaehyung; Lim, Heuiseok
Format: Article
Language: English
Subjects:
Online access: Full text
container_title arXiv.org
creator Son, Suhyune; Park, Chanjun; Lee, Jungseob; Shim, Midan; Lee, Chanhee; Jang, Yoonna; Seo, Jaehyung; Lim, Heuiseok
description As pre-trained language models become more resource-demanding, the inequality between resource-rich languages such as English and resource-scarce languages is worsening. This can be attributed to the fact that the amount of available training data in each language follows the power-law distribution, and most of the languages belong to the long tail of the distribution. Some research areas attempt to mitigate this problem. For example, in cross-lingual transfer learning and multilingual training, the goal is to benefit long-tail languages via the knowledge acquired from resource-rich languages. Although being successful, existing work has mainly focused on experimenting on as many languages as possible. As a result, targeted in-depth analysis is mostly absent. In this study, we focus on a single low-resource language and perform extensive evaluation and probing experiments using cross-lingual post-training (XPT). To make the transfer scenario challenging, we choose Korean as the target language, as it is a language isolate and thus shares almost no typology with English. Results show that XPT not only outperforms or performs on par with monolingual models trained with orders of magnitude more data but also is highly efficient in the transfer process.
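The abstract's point that per-language training-data volumes follow a power-law (long-tail) distribution can be made concrete with a small numeric sketch. The Python snippet below is illustrative only and is not taken from the paper; the constant C and exponent ALPHA are hypothetical values chosen solely to show how quickly corpus size drops off with a language's resource rank.

# Minimal illustrative sketch (not from the paper): per-language training-data
# sizes under a Zipf-like power law, size(rank) ~ C / rank**ALPHA.
# C and ALPHA are hypothetical values chosen only to show the long tail
# the abstract refers to.

C = 1_000_000_000   # hypothetical token count for the rank-1 (highest-resource) language
ALPHA = 1.2         # hypothetical power-law exponent

def corpus_size(rank: int) -> float:
    """Approximate token count for the language at the given resource rank."""
    return C / rank ** ALPHA

for rank in (1, 10, 100, 1000):
    print(f"rank {rank:>4}: ~{corpus_size(rank):,.0f} tokens")

Under these assumed parameters, the language at rank 1000 ends up with roughly four orders of magnitude less data than the rank-1 language, which is the kind of gap that cross-lingual transfer methods such as XPT aim to bridge.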
format Article
fulltext fulltext
identifier EISSN: 2331-8422
ispartof arXiv.org, 2022-09
issn 2331-8422
language eng
recordid cdi_proquest_journals_2714785107
source Free E-Journals
subjects English language
Knowledge acquisition
Language
Languages
Training
title Language Chameleon: Transformation analysis between languages using Cross-lingual Post-training based on Pre-trained language models
url https://sfx.bib-bvb.de/sfx_tum?ctx_ver=Z39.88-2004&ctx_enc=info:ofi/enc:UTF-8&ctx_tim=2025-01-21T09%3A39%3A42IST&url_ver=Z39.88-2004&url_ctx_fmt=infofi/fmt:kev:mtx:ctx&rfr_id=info:sid/primo.exlibrisgroup.com:primo3-Article-proquest&rft_val_fmt=info:ofi/fmt:kev:mtx:book&rft.genre=document&rft.atitle=Language%20Chameleon:%20Transformation%20analysis%20between%20languages%20using%20Cross-lingual%20Post-training%20based%20on%20Pre-trained%20language%20models&rft.jtitle=arXiv.org&rft.au=Son,%20Suhyune&rft.date=2022-09-14&rft.eissn=2331-8422&rft_id=info:doi/&rft_dat=%3Cproquest%3E2714785107%3C/proquest%3E%3Curl%3E%3C/url%3E&disable_directlink=true&sfx.directlink=off&sfx.report_link=0&rft_id=info:oai/&rft_pqid=2714785107&rft_id=info:pmid/&rfr_iscdi=true