Empirical Analysis of Korean Public AI Hub Parallel Corpora and in-depth Analysis using LIWC

A machine translation (MT) system aims to translate a source language into a target language. Recent studies on MT systems mainly focus on neural machine translation (NMT). One factor that significantly affects the performance of NMT is the availability of high-quality parallel corpora. However, high-qual...

Bibliographic details
Published in: arXiv.org 2021-10
Main authors: Park, Chanjun, Shim, Midan, Eo, Sugyeong, Lee, Seolhwa, Seo, Jaehyung, Moon, Hyeonseok, Lim, Heuiseok
Format: Article
Language: eng
Subjects:
Online access: Full text
container_title arXiv.org
creator Park, Chanjun
Shim, Midan
Eo, Sugyeong
Lee, Seolhwa
Seo, Jaehyung
Moon, Hyeonseok
Lim, Heuiseok
description A machine translation (MT) system aims to translate a source language into a target language. Recent studies on MT systems mainly focus on neural machine translation (NMT). One factor that significantly affects the performance of NMT is the availability of high-quality parallel corpora. However, high-quality parallel corpora for Korean are relatively scarce compared to those for other high-resource languages, such as German or Italian. To address this problem, AI Hub recently released seven types of parallel corpora for Korean. In this study, we conduct an in-depth verification of the quality of these parallel corpora through Linguistic Inquiry and Word Count (LIWC) and several relevant experiments. LIWC is a word-counting software program that can analyze corpora in multiple ways and extract linguistic features on a dictionary basis. To the best of our knowledge, this study is the first to use LIWC to analyze parallel corpora in the field of NMT. Through a correlation analysis between LIWC features and NMT performance, our findings suggest directions for further research toward obtaining higher-quality parallel corpora.
format Article
fullrecord <record><control><sourceid>proquest</sourceid><recordid>TN_cdi_proquest_journals_2588154517</recordid><sourceformat>XML</sourceformat><sourcesystem>PC</sourcesystem><sourcerecordid>2588154517</sourcerecordid><originalsourceid>FETCH-proquest_journals_25881545173</originalsourceid><addsrcrecordid>eNqNy0ELgjAYgOERBEn5Hz7oLOjm0quIodShQ9AlkKmzJmtbmzv07-sQdO30Xt5ngQJMSBLlKcYrFDo3xXGMdxmmlAToWj2MsKJnEgrF5MsJB3qEg7acKTj5TooeigZq38GJWSYll1Bqa7RlwNQAQkUDN_P9x70T6gbH5lJu0HJk0vHw2zXa7qtzWUfG6qfnbm4n7e3HuRbTPE9oSpOM_He9AfRwQdw</addsrcrecordid><sourcetype>Aggregation Database</sourcetype><iscdi>true</iscdi><recordtype>article</recordtype><pqid>2588154517</pqid></control><display><type>article</type><title>Empirical Analysis of Korean Public AI Hub Parallel Corpora and in-depth Analysis using LIWC</title><source>Free E- Journals</source><creator>Park, Chanjun ; Shim, Midan ; Eo, Sugyeong ; Lee, Seolhwa ; Seo, Jaehyung ; Moon, Hyeonseok ; Lim, Heuiseok</creator><creatorcontrib>Park, Chanjun ; Shim, Midan ; Eo, Sugyeong ; Lee, Seolhwa ; Seo, Jaehyung ; Moon, Hyeonseok ; Lim, Heuiseok</creatorcontrib><description>Machine translation (MT) system aims to translate source language into target language. Recent studies on MT systems mainly focus on neural machine translation (NMT). One factor that significantly affects the performance of NMT is the availability of high-quality parallel corpora. However, high-quality parallel corpora concerning Korean are relatively scarce compared to those associated with other high-resource languages, such as German or Italian. To address this problem, AI Hub recently released seven types of parallel corpora for Korean. In this study, we conduct an in-depth verification of the quality of corresponding parallel corpora through Linguistic Inquiry and Word Count (LIWC) and several relevant experiments. LIWC is a word-counting software program that can analyze corpora in multiple ways and extract linguistic features as a dictionary base. 
To the best of our knowledge, this study is the first to use LIWC to analyze parallel corpora in the field of NMT. Our findings suggest the direction of further research toward obtaining the improved quality parallel corpora through our correlation analysis in LIWC and NMT performance.</description><identifier>EISSN: 2331-8422</identifier><language>eng</language><publisher>Ithaca: Cornell University Library, arXiv.org</publisher><subject>Correlation analysis ; Counting ; Empirical analysis ; Feature extraction ; Linguistics ; Machine translation ; Software ; Words (language)</subject><ispartof>arXiv.org, 2021-10</ispartof><rights>2021. This work is published under http://creativecommons.org/licenses/by/4.0/ (the “License”). Notwithstanding the ProQuest Terms and Conditions, you may use this content in accordance with the terms of the License.</rights><oa>free_for_read</oa><woscitedreferencessubscribed>false</woscitedreferencessubscribed></display><links><openurl>$$Topenurl_article</openurl><openurlfulltext>$$Topenurlfull_article</openurlfulltext><thumbnail>$$Tsyndetics_thumb_exl</thumbnail><link.rule.ids>777,781</link.rule.ids></links><search><creatorcontrib>Park, Chanjun</creatorcontrib><creatorcontrib>Shim, Midan</creatorcontrib><creatorcontrib>Eo, Sugyeong</creatorcontrib><creatorcontrib>Lee, Seolhwa</creatorcontrib><creatorcontrib>Seo, Jaehyung</creatorcontrib><creatorcontrib>Moon, Hyeonseok</creatorcontrib><creatorcontrib>Lim, Heuiseok</creatorcontrib><title>Empirical Analysis of Korean Public AI Hub Parallel Corpora and in-depth Analysis using LIWC</title><title>arXiv.org</title><description>Machine translation (MT) system aims to translate source language into target language. Recent studies on MT systems mainly focus on neural machine translation (NMT). One factor that significantly affects the performance of NMT is the availability of high-quality parallel corpora. 
However, high-quality parallel corpora concerning Korean are relatively scarce compared to those associated with other high-resource languages, such as German or Italian. To address this problem, AI Hub recently released seven types of parallel corpora for Korean. In this study, we conduct an in-depth verification of the quality of corresponding parallel corpora through Linguistic Inquiry and Word Count (LIWC) and several relevant experiments. LIWC is a word-counting software program that can analyze corpora in multiple ways and extract linguistic features as a dictionary base. To the best of our knowledge, this study is the first to use LIWC to analyze parallel corpora in the field of NMT. Our findings suggest the direction of further research toward obtaining the improved quality parallel corpora through our correlation analysis in LIWC and NMT performance.</description><subject>Correlation analysis</subject><subject>Counting</subject><subject>Empirical analysis</subject><subject>Feature extraction</subject><subject>Linguistics</subject><subject>Machine translation</subject><subject>Software</subject><subject>Words (language)</subject><issn>2331-8422</issn><fulltext>true</fulltext><rsrctype>article</rsrctype><creationdate>2021</creationdate><recordtype>article</recordtype><sourceid>ABUWG</sourceid><sourceid>AFKRA</sourceid><sourceid>AZQEC</sourceid><sourceid>BENPR</sourceid><sourceid>CCPQU</sourceid><sourceid>DWQXO</sourceid><recordid>eNqNy0ELgjAYgOERBEn5Hz7oLOjm0quIodShQ9AlkKmzJmtbmzv07-sQdO30Xt5ngQJMSBLlKcYrFDo3xXGMdxmmlAToWj2MsKJnEgrF5MsJB3qEg7acKTj5TooeigZq38GJWSYll1Bqa7RlwNQAQkUDN_P9x70T6gbH5lJu0HJk0vHw2zXa7qtzWUfG6qfnbm4n7e3HuRbTPE9oSpOM_He9AfRwQdw</recordid><startdate>20211028</startdate><enddate>20211028</enddate><creator>Park, Chanjun</creator><creator>Shim, Midan</creator><creator>Eo, Sugyeong</creator><creator>Lee, Seolhwa</creator><creator>Seo, Jaehyung</creator><creator>Moon, Hyeonseok</creator><creator>Lim, Heuiseok</creator><general>Cornell 
University Library, arXiv.org</general><scope>8FE</scope><scope>8FG</scope><scope>ABJCF</scope><scope>ABUWG</scope><scope>AFKRA</scope><scope>AZQEC</scope><scope>BENPR</scope><scope>BGLVJ</scope><scope>CCPQU</scope><scope>DWQXO</scope><scope>HCIFZ</scope><scope>L6V</scope><scope>M7S</scope><scope>PIMPY</scope><scope>PQEST</scope><scope>PQQKQ</scope><scope>PQUKI</scope><scope>PRINS</scope><scope>PTHSS</scope></search><sort><creationdate>20211028</creationdate><title>Empirical Analysis of Korean Public AI Hub Parallel Corpora and in-depth Analysis using LIWC</title><author>Park, Chanjun ; Shim, Midan ; Eo, Sugyeong ; Lee, Seolhwa ; Seo, Jaehyung ; Moon, Hyeonseok ; Lim, Heuiseok</author></sort><facets><frbrtype>5</frbrtype><frbrgroupid>cdi_FETCH-proquest_journals_25881545173</frbrgroupid><rsrctype>articles</rsrctype><prefilter>articles</prefilter><language>eng</language><creationdate>2021</creationdate><topic>Correlation analysis</topic><topic>Counting</topic><topic>Empirical analysis</topic><topic>Feature extraction</topic><topic>Linguistics</topic><topic>Machine translation</topic><topic>Software</topic><topic>Words (language)</topic><toplevel>online_resources</toplevel><creatorcontrib>Park, Chanjun</creatorcontrib><creatorcontrib>Shim, Midan</creatorcontrib><creatorcontrib>Eo, Sugyeong</creatorcontrib><creatorcontrib>Lee, Seolhwa</creatorcontrib><creatorcontrib>Seo, Jaehyung</creatorcontrib><creatorcontrib>Moon, Hyeonseok</creatorcontrib><creatorcontrib>Lim, Heuiseok</creatorcontrib><collection>ProQuest SciTech Collection</collection><collection>ProQuest Technology Collection</collection><collection>Materials Science &amp; Engineering Collection</collection><collection>ProQuest Central (Alumni Edition)</collection><collection>ProQuest Central UK/Ireland</collection><collection>ProQuest Central Essentials</collection><collection>ProQuest Central</collection><collection>Technology Collection</collection><collection>ProQuest One Community 
College</collection><collection>ProQuest Central Korea</collection><collection>SciTech Premium Collection</collection><collection>ProQuest Engineering Collection</collection><collection>Engineering Database</collection><collection>Publicly Available Content Database</collection><collection>ProQuest One Academic Eastern Edition (DO NOT USE)</collection><collection>ProQuest One Academic</collection><collection>ProQuest One Academic UKI Edition</collection><collection>ProQuest Central China</collection><collection>Engineering Collection</collection></facets><delivery><delcategory>Remote Search Resource</delcategory><fulltext>fulltext</fulltext></delivery><addata><au>Park, Chanjun</au><au>Shim, Midan</au><au>Eo, Sugyeong</au><au>Lee, Seolhwa</au><au>Seo, Jaehyung</au><au>Moon, Hyeonseok</au><au>Lim, Heuiseok</au><format>book</format><genre>document</genre><ristype>GEN</ristype><atitle>Empirical Analysis of Korean Public AI Hub Parallel Corpora and in-depth Analysis using LIWC</atitle><jtitle>arXiv.org</jtitle><date>2021-10-28</date><risdate>2021</risdate><eissn>2331-8422</eissn><abstract>Machine translation (MT) system aims to translate source language into target language. Recent studies on MT systems mainly focus on neural machine translation (NMT). One factor that significantly affects the performance of NMT is the availability of high-quality parallel corpora. However, high-quality parallel corpora concerning Korean are relatively scarce compared to those associated with other high-resource languages, such as German or Italian. To address this problem, AI Hub recently released seven types of parallel corpora for Korean. In this study, we conduct an in-depth verification of the quality of corresponding parallel corpora through Linguistic Inquiry and Word Count (LIWC) and several relevant experiments. LIWC is a word-counting software program that can analyze corpora in multiple ways and extract linguistic features as a dictionary base. 
To the best of our knowledge, this study is the first to use LIWC to analyze parallel corpora in the field of NMT. Our findings suggest the direction of further research toward obtaining the improved quality parallel corpora through our correlation analysis in LIWC and NMT performance.</abstract><cop>Ithaca</cop><pub>Cornell University Library, arXiv.org</pub><oa>free_for_read</oa></addata></record>
fulltext fulltext
identifier EISSN: 2331-8422
ispartof arXiv.org, 2021-10
issn 2331-8422
language eng
recordid cdi_proquest_journals_2588154517
source Free E- Journals
subjects Correlation analysis
Counting
Empirical analysis
Feature extraction
Linguistics
Machine translation
Software
Words (language)
title Empirical Analysis of Korean Public AI Hub Parallel Corpora and in-depth Analysis using LIWC
url https://sfx.bib-bvb.de/sfx_tum?ctx_ver=Z39.88-2004&ctx_enc=info:ofi/enc:UTF-8&ctx_tim=2025-01-21T08%3A24%3A58IST&url_ver=Z39.88-2004&url_ctx_fmt=infofi/fmt:kev:mtx:ctx&rfr_id=info:sid/primo.exlibrisgroup.com:primo3-Article-proquest&rft_val_fmt=info:ofi/fmt:kev:mtx:book&rft.genre=document&rft.atitle=Empirical%20Analysis%20of%20Korean%20Public%20AI%20Hub%20Parallel%20Corpora%20and%20in-depth%20Analysis%20using%20LIWC&rft.jtitle=arXiv.org&rft.au=Park,%20Chanjun&rft.date=2021-10-28&rft.eissn=2331-8422&rft_id=info:doi/&rft_dat=%3Cproquest%3E2588154517%3C/proquest%3E%3Curl%3E%3C/url%3E&disable_directlink=true&sfx.directlink=off&sfx.report_link=0&rft_id=info:oai/&rft_pqid=2588154517&rft_id=info:pmid/&rfr_iscdi=true