BanglaEmbed: Efficient Sentence Embedding Models for a Low-Resource Language Using Cross-Lingual Distillation Techniques
Sentence-level embedding is essential for various tasks that require understanding natural language. Many studies have explored such embeddings for high-resource languages like English. However, low-resource languages like Bengali (a language spoken by almost two hundred and thirty million people) are still under-explored. This work introduces two lightweight sentence transformers for the Bangla language, leveraging a novel cross-lingual knowledge distillation approach. This method distills knowledge from a pre-trained, high-performing English sentence transformer. Proposed models are evaluated across multiple downstream tasks, including paraphrase detection, semantic textual similarity (STS), and Bangla hate speech detection. The new method consistently outperformed existing Bangla sentence transformers. Moreover, the lightweight architecture and shorter inference time make the models highly suitable for deployment in resource-constrained environments, making them valuable for practical NLP applications in low-resource languages.
Saved in:
Published in: | arXiv.org 2024-11 |
Main authors: | Kabir, Muhammad Rafsan; Md Mohibur Rahman Nabil; Khan, Mohammad Ashrafuzzaman |
Format: | Article |
Language: | eng |
Keywords: | |
Online access: | Full text |
container_end_page | |
container_issue | |
container_start_page | |
container_title | arXiv.org |
container_volume | |
creator | Kabir, Muhammad Rafsan; Md Mohibur Rahman Nabil; Khan, Mohammad Ashrafuzzaman |
description | Sentence-level embedding is essential for various tasks that require understanding natural language. Many studies have explored such embeddings for high-resource languages like English. However, low-resource languages like Bengali (a language spoken by almost two hundred and thirty million people) are still under-explored. This work introduces two lightweight sentence transformers for the Bangla language, leveraging a novel cross-lingual knowledge distillation approach. This method distills knowledge from a pre-trained, high-performing English sentence transformer. Proposed models are evaluated across multiple downstream tasks, including paraphrase detection, semantic textual similarity (STS), and Bangla hate speech detection. The new method consistently outperformed existing Bangla sentence transformers. Moreover, the lightweight architecture and shorter inference time make the models highly suitable for deployment in resource-constrained environments, making them valuable for practical NLP applications in low-resource languages. |
format | Article |
publisher | Cornell University Library, arXiv.org (Ithaca) |
date | 2024-11-22 |
rights | 2024. This work is published under http://creativecommons.org/licenses/by/4.0/ (the "License"). Notwithstanding the ProQuest Terms and Conditions, you may use this content in accordance with the terms of the License. |
fulltext | fulltext |
identifier | EISSN: 2331-8422 |
ispartof | arXiv.org, 2024-11 |
issn | 2331-8422 |
language | eng |
recordid | cdi_proquest_journals_3133048730 |
source | Free E- Journals |
subjects | Embedding; English language; Lightweight; Natural language processing; Sentences; Weight reduction |
title | BanglaEmbed: Efficient Sentence Embedding Models for a Low-Resource Language Using Cross-Lingual Distillation Techniques |
url | https://sfx.bib-bvb.de/sfx_tum?ctx_ver=Z39.88-2004&ctx_enc=info:ofi/enc:UTF-8&ctx_tim=2025-02-05T01%3A21%3A38IST&url_ver=Z39.88-2004&url_ctx_fmt=infofi/fmt:kev:mtx:ctx&rfr_id=info:sid/primo.exlibrisgroup.com:primo3-Article-proquest&rft_val_fmt=info:ofi/fmt:kev:mtx:book&rft.genre=document&rft.atitle=BanglaEmbed:%20Efficient%20Sentence%20Embedding%20Models%20for%20a%20Low-Resource%20Language%20Using%20Cross-Lingual%20Distillation%20Techniques&rft.jtitle=arXiv.org&rft.au=Kabir,%20Muhammad%20Rafsan&rft.date=2024-11-22&rft.eissn=2331-8422&rft_id=info:doi/&rft_dat=%3Cproquest%3E3133048730%3C/proquest%3E%3Curl%3E%3C/url%3E&disable_directlink=true&sfx.directlink=off&sfx.report_link=0&rft_id=info:oai/&rft_pqid=3133048730&rft_id=info:pmid/&rfr_iscdi=true |