Kencorpus: Kenyan Languages Corpus

This project collected text and speech corpora for languages in Kenya. In the KenCorpus project, three languages were strategically selected: Kiswahili, Luhya, and Dholuo. The Luhya language has several dialects. In the project, three dialects were chosen as a start: Lumarachi, Logooli, and Lubukusu. Pri...

Bibliographic details
Main authors: Wanjawa, Barack; Wanzare, Lilian D.A.; Indede, Florence; McOnyango, Owen; Ombui, Edward; Muchemi, Lawrence
Format: Dataset
Language: eng
creator Wanjawa, Barack
Wanzare, Lilian D.A.
Indede, Florence
McOnyango, Owen
Ombui, Edward
Muchemi, Lawrence
description This project collected text and speech corpora for languages in Kenya. In the KenCorpus project, three languages were strategically selected: Kiswahili, Luhya, and Dholuo. The Luhya language has several dialects. In the project, three dialects were chosen as a start: Lumarachi, Logooli, and Lubukusu. Primary data was collected from the respective language communities and included indigenous stories and other narratives drawn from student compositions, native-language media stations, and publishers. This went beyond the conventional religious texts to include other genres that made the corpus more representative of everyday language use in the communities. Text data: A total of 4,442 texts were collected: 546 texts for Dholuo, 483 texts for Luhya-Lumarachi, 135 texts for Luhya-Lubukusu, and 359 texts for Luhya-Logooli. Spontaneous speech data: A total of 1,152 files were collected, amounting to 176hr 29min 46sec of spontaneous speech data: 104 files (19hr 10min 57sec) for Swahili, 512 files (99hr 3min 8sec) for Dholuo, 138 files (15hr 37min 46sec) for Luhya-Lumarachi, 354 files (30hr 11min) for Luhya-Lubukusu, and 44 files (12hr 26min 55sec) for Luhya-Logooli. Acknowledgement of data collectors: Kiswahili: Rose Felynix, Khalid Kitito, Dr. Benard Okal; Luo: Jotham Ondu Ajiki, Dr. Jackline Okello, Jonathan Muga, Mercy Lavinca Oduoll; Luhyia (Logooli): Salano Odari, Dr. Phillip Lumwamu; Luhyia (Bukusu): Mactilda Nekesa Makana, Mulwale Martin; Luhyia (Marachi): Yonah Weunda
doi_str_mv 10.7910/dvn/6n5v1k
format Dataset
publisher Harvard Dataverse
creationdate 2022
oa free_for_read
linktorsrc https://commons.datacite.org/doi.org/10.7910/dvn/6n5v1k
identifier DOI: 10.7910/dvn/6n5v1k
language eng
recordid cdi_datacite_primary_10_7910_dvn_6n5v1k
source DataCite
subjects African languages
Computer and Information Science
Dataset curation
Datasets
low resource languages
Social Sciences
title Kencorpus: Kenyan Languages Corpus
url https://sfx.bib-bvb.de/sfx_tum?ctx_ver=Z39.88-2004&ctx_enc=info:ofi/enc:UTF-8&ctx_tim=2025-01-27T05%3A47%3A27IST&url_ver=Z39.88-2004&url_ctx_fmt=infofi/fmt:kev:mtx:ctx&rfr_id=info:sid/primo.exlibrisgroup.com:primo3-Article-datacite_PQ8&rft_val_fmt=info:ofi/fmt:kev:mtx:book&rft.genre=unknown&rft.au=Wanjawa,%20Barack&rft.date=2022&rft_id=info:doi/10.7910/dvn/6n5v1k&rft_dat=%3Cdatacite_PQ8%3E10_7910_dvn_6n5v1k%3C/datacite_PQ8%3E%3Curl%3E%3C/url%3E&disable_directlink=true&sfx.directlink=off&sfx.report_link=0&rft_id=info:oai/&rft_id=info:pmid/&rfr_iscdi=true