BanglaEmbed: Efficient Sentence Embedding Models for a Low-Resource Language Using Cross-Lingual Distillation Techniques
Sentence-level embedding is essential for various tasks that require understanding natural language. Many studies have explored such embeddings for high-resource languages like English. However, low-resource languages like Bengali (a language spoken by almost two hundred and thirty million people) are still under-explored. This work introduces two lightweight sentence transformers for the Bangla language, leveraging a novel cross-lingual knowledge distillation approach. This method distills knowledge from a pre-trained, high-performing English sentence transformer. Proposed models are evaluated across multiple downstream tasks, including paraphrase detection, semantic textual similarity (STS), and Bangla hate speech detection. The new method consistently outperformed existing Bangla sentence transformers. Moreover, the lightweight architecture and shorter inference time make the models highly suitable for deployment in resource-constrained environments, making them valuable for practical NLP applications in low-resource languages.
Saved in:
Published in: | arXiv.org 2024-11 |
Main authors: | Kabir, Muhammad Rafsan; Md Mohibur Rahman Nabil; Khan, Mohammad Ashrafuzzaman |
Format: | Article |
Language: | eng |
Keywords: | |
Online access: | Full text |
container_end_page | |
container_issue | |
container_start_page | |
container_title | arXiv.org |
container_volume | |
creator | Kabir, Muhammad Rafsan; Md Mohibur Rahman Nabil; Khan, Mohammad Ashrafuzzaman |
description | Sentence-level embedding is essential for various tasks that require understanding natural language. Many studies have explored such embeddings for high-resource languages like English. However, low-resource languages like Bengali (a language spoken by almost two hundred and thirty million people) are still under-explored. This work introduces two lightweight sentence transformers for the Bangla language, leveraging a novel cross-lingual knowledge distillation approach. This method distills knowledge from a pre-trained, high-performing English sentence transformer. Proposed models are evaluated across multiple downstream tasks, including paraphrase detection, semantic textual similarity (STS), and Bangla hate speech detection. The new method consistently outperformed existing Bangla sentence transformers. Moreover, the lightweight architecture and shorter inference time make the models highly suitable for deployment in resource-constrained environments, making them valuable for practical NLP applications in low-resource languages. |
format | Article |
publisher | Cornell University Library, arXiv.org (Ithaca) |
date | 2024-11-22 |
rights | 2024. This work is published under http://creativecommons.org/licenses/by/4.0/ (the "License"). Notwithstanding the ProQuest Terms and Conditions, you may use this content in accordance with the terms of the License. |
fulltext | fulltext |
identifier | EISSN: 2331-8422 |
ispartof | arXiv.org, 2024-11 |
issn | 2331-8422 |
language | eng |
recordid | cdi_proquest_journals_3133048730 |
source | Free E- Journals |
subjects | Embedding; English language; Lightweight; Natural language processing; Sentences; Weight reduction |
title | BanglaEmbed: Efficient Sentence Embedding Models for a Low-Resource Language Using Cross-Lingual Distillation Techniques |
url | https://sfx.bib-bvb.de/sfx_tum?ctx_ver=Z39.88-2004&ctx_enc=info:ofi/enc:UTF-8&ctx_tim=2025-02-05T01%3A21%3A38IST&url_ver=Z39.88-2004&url_ctx_fmt=infofi/fmt:kev:mtx:ctx&rfr_id=info:sid/primo.exlibrisgroup.com:primo3-Article-proquest&rft_val_fmt=info:ofi/fmt:kev:mtx:book&rft.genre=document&rft.atitle=BanglaEmbed:%20Efficient%20Sentence%20Embedding%20Models%20for%20a%20Low-Resource%20Language%20Using%20Cross-Lingual%20Distillation%20Techniques&rft.jtitle=arXiv.org&rft.au=Kabir,%20Muhammad%20Rafsan&rft.date=2024-11-22&rft.eissn=2331-8422&rft_id=info:doi/&rft_dat=%3Cproquest%3E3133048730%3C/proquest%3E%3Curl%3E%3C/url%3E&disable_directlink=true&sfx.directlink=off&sfx.report_link=0&rft_id=info:oai/&rft_pqid=3133048730&rft_id=info:pmid/&rfr_iscdi=true |