Canonical and Surface Morphological Segmentation for Nguni Languages

Morphological Segmentation involves decomposing words into morphemes, the smallest meaning-bearing units of language. This is an important NLP task for morphologically-rich agglutinative languages such as the Southern African Nguni language group. In this paper, we investigate supervised and unsuper...

Bibliographic details
Main authors: Moeng, Tumi; Reay, Sheldon; Daniels, Aaron; Buys, Jan
Format: Article
Language: eng
Subjects:
Online access: Order full text
container_end_page
container_issue
container_start_page
container_title
container_volume
creator Moeng, Tumi
Reay, Sheldon
Daniels, Aaron
Buys, Jan
description Morphological segmentation involves decomposing words into morphemes, the smallest meaning-bearing units of language. This is an important NLP task for morphologically-rich agglutinative languages such as the Southern African Nguni language group. In this paper, we investigate supervised and unsupervised models for two variants of morphological segmentation: canonical and surface segmentation. We train sequence-to-sequence models for canonical segmentation, where the underlying morphemes may not be equal to the surface form of the word, and Conditional Random Fields (CRF) for surface segmentation. Transformers outperform LSTMs with attention on canonical segmentation, obtaining an average F1 score of 72.5% across 4 languages. Feature-based CRFs outperform bidirectional LSTM-CRFs to obtain an average of 97.1% F1 on surface segmentation. In the unsupervised setting, an entropy-based approach using a character-level LSTM language model fails to outperform a Morfessor baseline, while on some of the languages neither approach performs much better than a random baseline. We hope that the high performance of the supervised segmentation models will help to facilitate the development of better NLP tools for Nguni languages.
doi_str_mv 10.48550/arxiv.2104.00767
format Article
fullrecord <record><control><sourceid>arxiv_GOX</sourceid><recordid>TN_cdi_arxiv_primary_2104_00767</recordid><sourceformat>XML</sourceformat><sourcesystem>PC</sourcesystem><sourcerecordid>2104_00767</sourcerecordid><sourcetype>Open Access Repository</sourcetype><iscdi>true</iscdi><recordtype>article</recordtype></control><display><type>article</type><title>Canonical and Surface Morphological Segmentation for Nguni Languages</title><source>arXiv.org</source><creator>Moeng, Tumi ; Reay, Sheldon ; Daniels, Aaron ; Buys, Jan</creator><identifier>DOI: 10.48550/arxiv.2104.00767</identifier><language>eng</language><subject>Computer Science - Computation and Language</subject><creationdate>2021-04</creationdate><rights>http://creativecommons.org/licenses/by/4.0</rights><oa>free_for_read</oa></display><links><linktorsrc>$$Uhttps://arxiv.org/abs/2104.00767$$EView_record_in_Cornell_University$$FView_record_in_$$GCornell_University$$Hfree_for_read</linktorsrc><backlink>$$Uhttps://doi.org/10.48550/arXiv.2104.00767$$DView paper in arXiv$$Hfree_for_read</backlink></links><addata><format>journal</format><genre>article</genre><ristype>JOUR</ristype><date>2021-04-01</date><doi>10.48550/arxiv.2104.00767</doi></addata></record>
fulltext fulltext_linktorsrc
identifier DOI: 10.48550/arxiv.2104.00767
ispartof
issn
language eng
recordid cdi_arxiv_primary_2104_00767
source arXiv.org
subjects Computer Science - Computation and Language
title Canonical and Surface Morphological Segmentation for Nguni Languages
url https://sfx.bib-bvb.de/sfx_tum?ctx_ver=Z39.88-2004&ctx_enc=info:ofi/enc:UTF-8&ctx_tim=2025-01-25T08%3A52%3A40IST&url_ver=Z39.88-2004&url_ctx_fmt=infofi/fmt:kev:mtx:ctx&rfr_id=info:sid/primo.exlibrisgroup.com:primo3-Article-arxiv_GOX&rft_val_fmt=info:ofi/fmt:kev:mtx:journal&rft.genre=article&rft.atitle=Canonical%20and%20Surface%20Morphological%20Segmentation%20for%20Nguni%20Languages&rft.au=Moeng,%20Tumi&rft.date=2021-04-01&rft_id=info:doi/10.48550/arxiv.2104.00767&rft_dat=%3Carxiv_GOX%3E2104_00767%3C/arxiv_GOX%3E%3Curl%3E%3C/url%3E&disable_directlink=true&sfx.directlink=off&sfx.report_link=0&rft_id=info:oai/&rft_id=info:pmid/&rfr_iscdi=true