BanglaEmbed: Efficient Sentence Embedding Models for a Low-Resource Language Using Cross-Lingual Distillation Techniques

Sentence-level embedding is essential for various tasks that require understanding natural language. Many studies have explored such embeddings for high-resource languages like English. However, low-resource languages like Bengali (a language spoken by almost two hundred and thirty million people) are still under-explored. This work introduces two lightweight sentence transformers for the Bangla language, leveraging a novel cross-lingual knowledge distillation approach that distills knowledge from a pre-trained, high-performing English sentence transformer. The proposed models are evaluated across multiple downstream tasks, including paraphrase detection, semantic textual similarity (STS), and Bangla hate speech detection, and consistently outperform existing Bangla sentence transformers. Moreover, the lightweight architecture and shorter inference time make the models highly suitable for deployment in resource-constrained environments, making them valuable for practical NLP applications in low-resource languages.
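For readers who want a concrete picture of the cross-lingual knowledge distillation described in the abstract, the following is a minimal, hypothetical sketch in PyTorch with Hugging Face transformers, not the authors' code: a frozen English sentence transformer acts as the teacher, and a small multilingual encoder is trained as the Bangla student so that its embedding of a Bangla sentence matches the teacher's embedding of the parallel English sentence. The checkpoint names, the projection layer, the mean pooling, and the single parallel pair are illustrative assumptions.

```python
import torch
from torch import nn
from transformers import AutoModel, AutoTokenizer

# Placeholder checkpoints (assumptions), not the models used in the paper.
TEACHER_NAME = "sentence-transformers/all-MiniLM-L6-v2"  # frozen English teacher
STUDENT_NAME = "distilbert-base-multilingual-cased"      # lightweight student

teacher_tok = AutoTokenizer.from_pretrained(TEACHER_NAME)
teacher = AutoModel.from_pretrained(TEACHER_NAME).eval()
student_tok = AutoTokenizer.from_pretrained(STUDENT_NAME)
student = AutoModel.from_pretrained(STUDENT_NAME).train()

# Project student embeddings into the teacher's embedding space in case the
# hidden sizes differ.
proj = nn.Linear(student.config.hidden_size, teacher.config.hidden_size)

def mean_pool(last_hidden, attention_mask):
    """Average the token embeddings, ignoring padding positions."""
    mask = attention_mask.unsqueeze(-1).float()
    return (last_hidden * mask).sum(dim=1) / mask.sum(dim=1).clamp(min=1e-9)

def embed(model, tokenizer, sentences):
    batch = tokenizer(sentences, padding=True, truncation=True, return_tensors="pt")
    out = model(**batch)
    return mean_pool(out.last_hidden_state, batch["attention_mask"])

# One hypothetical English-Bangla parallel pair standing in for a corpus.
english = ["Sentence embeddings are useful for many NLP tasks."]
bangla = ["বাক্য এমবেডিং অনেক এনএলপি কাজের জন্য দরকারী।"]

optimizer = torch.optim.AdamW(
    list(student.parameters()) + list(proj.parameters()), lr=2e-5
)
loss_fn = nn.MSELoss()

with torch.no_grad():                              # teacher targets stay fixed
    target = embed(teacher, teacher_tok, english)

pred = proj(embed(student, student_tok, bangla))   # student embeds the Bangla side
loss = loss_fn(pred, target)                       # pull the two spaces together
loss.backward()
optimizer.step()
```

At inference time only the distilled student is loaded, which is what makes this kind of setup attractive for the resource-constrained deployments the abstract mentions.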


Bibliographic Details
Published in: arXiv.org, 2024-11
Main Authors: Kabir, Muhammad Rafsan; Md Mohibur Rahman Nabil; Khan, Mohammad Ashrafuzzaman
Format: Article
Language: English (eng)
EISSN: 2331-8422
Publisher: Cornell University Library, arXiv.org (Ithaca)
Rights: http://creativecommons.org/licenses/by/4.0/
Subjects: Embedding; English language; Lightweight; Natural language processing; Sentences; Weight reduction
Online Access: Full text