Language Chameleon: Transformation analysis between languages using Cross-lingual Post-training based on Pre-trained language models
As pre-trained language models become more resource-demanding, the inequality between resource-rich languages such as English and resource-scarce languages is worsening. This can be attributed to the fact that the amount of available training data in each language follows the power-law distribution,...
Saved in:
Published in: | arXiv.org 2022-09 |
---|---|
Main Authors: | Son, Suhyune; Park, Chanjun; Lee, Jungseob; Shim, Midan; Lee, Chanhee; Jang, Yoonna; Seo, Jaehyung; Lim, Heuiseok |
Format: | Article |
Language: | eng |
Subjects: | English language; Knowledge acquisition; Language; Languages; Training |
Online Access: | Full text |
container_end_page | |
---|---|
container_issue | |
container_start_page | |
container_title | arXiv.org |
container_volume | |
creator | Son, Suhyune; Park, Chanjun; Lee, Jungseob; Shim, Midan; Lee, Chanhee; Jang, Yoonna; Seo, Jaehyung; Lim, Heuiseok |
description | As pre-trained language models become more resource-demanding, the inequality between resource-rich languages such as English and resource-scarce languages is worsening. This can be attributed to the fact that the amount of available training data in each language follows the power-law distribution, and most of the languages belong to the long tail of the distribution. Some research areas attempt to mitigate this problem. For example, in cross-lingual transfer learning and multilingual training, the goal is to benefit long-tail languages via the knowledge acquired from resource-rich languages. Although successful, existing work has mainly focused on experimenting on as many languages as possible. As a result, targeted in-depth analysis is mostly absent. In this study, we focus on a single low-resource language and perform extensive evaluation and probing experiments using cross-lingual post-training (XPT). To make the transfer scenario challenging, we choose Korean as the target language, as it is a language isolate and thus shares almost no typology with English. Results show that XPT not only outperforms or performs on par with monolingual models trained with orders of magnitude more data but also is highly efficient in the transfer process. |
format | Article |
fulltext | fulltext |
identifier | EISSN: 2331-8422 |
ispartof | arXiv.org, 2022-09 |
issn | 2331-8422 |
language | eng |
recordid | cdi_proquest_journals_2714785107 |
source | Free E-Journals |
subjects | English language; Knowledge acquisition; Language; Languages; Training |
title | Language Chameleon: Transformation analysis between languages using Cross-lingual Post-training based on Pre-trained language models |
url | https://sfx.bib-bvb.de/sfx_tum?ctx_ver=Z39.88-2004&ctx_enc=info:ofi/enc:UTF-8&ctx_tim=2025-01-21T09%3A39%3A42IST&url_ver=Z39.88-2004&url_ctx_fmt=infofi/fmt:kev:mtx:ctx&rfr_id=info:sid/primo.exlibrisgroup.com:primo3-Article-proquest&rft_val_fmt=info:ofi/fmt:kev:mtx:book&rft.genre=document&rft.atitle=Language%20Chameleon:%20Transformation%20analysis%20between%20languages%20using%20Cross-lingual%20Post-training%20based%20on%20Pre-trained%20language%20models&rft.jtitle=arXiv.org&rft.au=Son,%20Suhyune&rft.date=2022-09-14&rft.eissn=2331-8422&rft_id=info:doi/&rft_dat=%3Cproquest%3E2714785107%3C/proquest%3E%3Curl%3E%3C/url%3E&disable_directlink=true&sfx.directlink=off&sfx.report_link=0&rft_id=info:oai/&rft_pqid=2714785107&rft_id=info:pmid/&rfr_iscdi=true |