Mixed-Distil-BERT: Code-mixed Language Modeling for Bangla, English, and Hindi

One of the most popular downstream tasks in Natural Language Processing is text classification. Text classification becomes more challenging when the texts are code-mixed. Although they are not exposed to such text during pre-training, different BERT models have demonstrated success in tackling code-mixed NLP challenges. To enhance their performance further, code-mixed NLP models have also relied on combining synthetic data with real-world data. It is therefore crucial to understand how BERT models' performance is affected when they are pre-trained on the corresponding code-mixed languages. In this paper, we introduce Tri-Distil-BERT, a multilingual model pre-trained on Bangla, English, and Hindi, and Mixed-Distil-BERT, a model further fine-tuned on code-mixed data. Both models are evaluated across multiple NLP tasks and demonstrate competitive performance against larger models such as mBERT and XLM-R. Our two-tiered pre-training approach offers an efficient alternative for multilingual and code-mixed language understanding, contributing to advancements in the field.
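The two-tiered recipe described in the abstract (first continue masked-language-model pre-training on monolingual Bangla, English, and Hindi text, then fine-tune the resulting checkpoint on code-mixed classification data) can be sketched with the Hugging Face Transformers library. The sketch below is illustrative only and rests on several assumptions: it uses `distilbert-base-multilingual-cased` as a stand-in for the paper's base model, and the corpus file `bn_en_hi_corpus.txt`, the labelled CSV `code_mixed_train.csv` (columns `text`, `label`), and the three-class label set are hypothetical placeholders rather than artifacts released with the paper.

```python
# Illustrative sketch of a two-tiered setup: (1) continued MLM pre-training on
# Bangla/English/Hindi text, (2) classification fine-tuning on code-mixed data.
# Checkpoint names and data files below are placeholders, not the paper's artifacts.
from datasets import load_dataset
from transformers import (
    AutoModelForMaskedLM,
    AutoModelForSequenceClassification,
    AutoTokenizer,
    DataCollatorForLanguageModeling,
    Trainer,
    TrainingArguments,
)

BASE = "distilbert-base-multilingual-cased"  # stand-in for the paper's base model
tokenizer = AutoTokenizer.from_pretrained(BASE)

def tokenize(batch):
    # Fixed-length padding keeps the default collators happy in both tiers.
    return tokenizer(batch["text"], truncation=True, padding="max_length", max_length=128)

# ---- Tier 1: continued masked-language-model pre-training on the three languages ----
raw = load_dataset("text", data_files={"train": "bn_en_hi_corpus.txt"})  # hypothetical corpus
lm_data = raw.map(tokenize, batched=True, remove_columns=["text"])
mlm_trainer = Trainer(
    model=AutoModelForMaskedLM.from_pretrained(BASE),
    args=TrainingArguments(output_dir="tri-distil-bert", num_train_epochs=1,
                           per_device_train_batch_size=32),
    train_dataset=lm_data["train"],
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm_probability=0.15),
)
mlm_trainer.train()
mlm_trainer.save_model("tri-distil-bert")
tokenizer.save_pretrained("tri-distil-bert")

# ---- Tier 2: fine-tune the Tier-1 checkpoint on a code-mixed classification task ----
cm = load_dataset("csv", data_files={"train": "code_mixed_train.csv"})  # hypothetical: text,label
cm_data = cm.map(tokenize, batched=True)
clf_trainer = Trainer(
    # num_labels=3 is an example 3-class task, not the paper's label set
    model=AutoModelForSequenceClassification.from_pretrained("tri-distil-bert", num_labels=3),
    args=TrainingArguments(output_dir="mixed-distil-bert", num_train_epochs=3,
                           per_device_train_batch_size=32),
    train_dataset=cm_data["train"],
)
clf_trainer.train()
```

In practice the released Tri-Distil-BERT and Mixed-Distil-BERT checkpoints would replace the stand-in base model; the sketch only mirrors the order of the two training stages.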

Bibliographic Details
Published in: arXiv.org, 2024-03
Main authors: Md Nishat Raihan; Goswami, Dhiman; Mahmud, Antara
Format: Article
Language: English
Subjects: Classification; English language; Natural language processing; Synthetic data; Training
EISSN: 2331-8422
Publisher: Cornell University Library, arXiv.org (Ithaca)
Online access: Full text