Bloom Library: Multimodal Datasets in 300+ Languages for a Variety of Downstream Tasks

We present Bloom Library, a linguistically diverse set of multimodal and multilingual datasets for language modeling, image captioning, visual storytelling, and speech synthesis/recognition. These datasets represent either the most, or among the most, multilingual datasets for each of the included d...

Bibliographic details
Published in:	arXiv.org, 2022-10
Main authors:	Leong, Colin; Nemecek, Joshua; Mansdorfer, Jacob; Filighera, Anna; Owodunni, Abraham; Whitenack, Daniel
Format:	Article
Language:	eng
Subjects:	Datasets; Languages; Libraries; Object recognition; Population statistics; Speech recognition
Online access:	Full text

container_end_page
container_issue
container_start_page
container_title	arXiv.org
container_volume
creator	Leong, Colin ; Nemecek, Joshua ; Mansdorfer, Jacob ; Filighera, Anna ; Owodunni, Abraham ; Whitenack, Daniel
description	We present Bloom Library, a linguistically diverse set of multimodal and multilingual datasets for language modeling, image captioning, visual storytelling, and speech synthesis/recognition. These datasets represent either the most, or among the most, multilingual datasets for each of the included downstream tasks. In total, the initial release of the Bloom Library datasets covers 363 languages across 32 language families. We train downstream task models for various languages represented in the data, showing the viability of the data for future work in low-resource, multimodal NLP and establishing the first known baselines for these downstream tasks in certain languages (e.g., Bisu [bzi], with an estimated population of 700 users). Some of these first-of-their-kind baselines are comparable to state-of-the-art performance for higher-resourced languages. The Bloom Library datasets are released under Creative Commons licenses on the Hugging Face datasets hub to catalyze more linguistically diverse research in the included downstream tasks.
format	Article
fullrecord	<record><control><sourceid>proquest</sourceid><recordid>TN_cdi_proquest_journals_2729291377</recordid><sourceformat>XML</sourceformat><sourcesystem>PC</sourcesystem><sourcerecordid>2729291377</sourcerecordid><originalsourceid>FETCH-proquest_journals_27292913773</originalsourceid><addsrcrecordid>eNqNykELgjAYgOERBEn5Hz7oGMLcMrNjWXSom3iVL5qiTVf7NsJ_X4d-QKf38LwTFggp42i7FmLGQqKOcy42qUgSGbByr43p4dLeLNpxB1evXdubO2rI0SEpR9AOIDlfwQWHxmOjCGpjAaFE2yo3gqkhN--BnFXYQ4H0oAWb1qhJhb_O2fJ0LA7n6GnNyytyVWe8Hb5UiVRkIotlmsr_rg9BlT_N</addsrcrecordid><sourcetype>Aggregation Database</sourcetype><iscdi>true</iscdi><recordtype>article</recordtype><pqid>2729291377</pqid></control><display><type>article</type><title>Bloom Library: Multimodal Datasets in 300+ Languages for a Variety of Downstream Tasks</title><source>Free E- Journals</source><creator>Leong, Colin ; Nemecek, Joshua ; Mansdorfer, Jacob ; Filighera, Anna ; Owodunni, Abraham ; Whitenack, Daniel</creator><creatorcontrib>Leong, Colin ; Nemecek, Joshua ; Mansdorfer, Jacob ; Filighera, Anna ; Owodunni, Abraham ; Whitenack, Daniel</creatorcontrib><description>We present Bloom Library, a linguistically diverse set of multimodal and multilingual datasets for language modeling, image captioning, visual storytelling, and speech synthesis/recognition. These datasets represent either the most, or among the most, multilingual datasets for each of the included downstream tasks. In total, the initial release of the Bloom Library datasets covers 363 languages across 32 language families. We train downstream task models for various languages represented in the data, showing the viability of the data for future work in low-resource, multimodal NLP and establishing the first known baselines for these downstream tasks in certain languages (e.g., Bisu [bzi], with an estimated population of 700 users). Some of these first-of-their-kind baselines are comparable to state-of-the-art performance for higher-resourced languages. 
The Bloom Library datasets are released under Creative Commons licenses on the Hugging Face datasets hub to catalyze more linguistically diverse research in the included downstream tasks.</description><identifier>EISSN: 2331-8422</identifier><language>eng</language><publisher>Ithaca: Cornell University Library, arXiv.org</publisher><subject>Datasets ; Languages ; Libraries ; Object recognition ; Population statistics ; Speech recognition</subject><ispartof>arXiv.org, 2022-10</ispartof><rights>2022. This work is published under http://creativecommons.org/licenses/by/4.0/ (the “License”). Notwithstanding the ProQuest Terms and Conditions, you may use this content in accordance with the terms of the License.</rights><oa>free_for_read</oa><woscitedreferencessubscribed>false</woscitedreferencessubscribed></display><links><openurl>$$Topenurl_article</openurl><openurlfulltext>$$Topenurlfull_article</openurlfulltext><thumbnail>$$Tsyndetics_thumb_exl</thumbnail><link.rule.ids>780,784</link.rule.ids></links><search><creatorcontrib>Leong, Colin</creatorcontrib><creatorcontrib>Nemecek, Joshua</creatorcontrib><creatorcontrib>Mansdorfer, Jacob</creatorcontrib><creatorcontrib>Filighera, Anna</creatorcontrib><creatorcontrib>Owodunni, Abraham</creatorcontrib><creatorcontrib>Whitenack, Daniel</creatorcontrib><title>Bloom Library: Multimodal Datasets in 300+ Languages for a Variety of Downstream Tasks</title><title>arXiv.org</title><description>We present Bloom Library, a linguistically diverse set of multimodal and multilingual datasets for language modeling, image captioning, visual storytelling, and speech synthesis/recognition. These datasets represent either the most, or among the most, multilingual datasets for each of the included downstream tasks. In total, the initial release of the Bloom Library datasets covers 363 languages across 32 language families. 
We train downstream task models for various languages represented in the data, showing the viability of the data for future work in low-resource, multimodal NLP and establishing the first known baselines for these downstream tasks in certain languages (e.g., Bisu [bzi], with an estimated population of 700 users). Some of these first-of-their-kind baselines are comparable to state-of-the-art performance for higher-resourced languages. The Bloom Library datasets are released under Creative Commons licenses on the Hugging Face datasets hub to catalyze more linguistically diverse research in the included downstream tasks.</description><subject>Datasets</subject><subject>Languages</subject><subject>Libraries</subject><subject>Object recognition</subject><subject>Population statistics</subject><subject>Speech recognition</subject><issn>2331-8422</issn><fulltext>true</fulltext><rsrctype>article</rsrctype><creationdate>2022</creationdate><recordtype>article</recordtype><sourceid>ABUWG</sourceid><sourceid>AFKRA</sourceid><sourceid>AZQEC</sourceid><sourceid>BENPR</sourceid><sourceid>CCPQU</sourceid><sourceid>DWQXO</sourceid><recordid>eNqNykELgjAYgOERBEn5Hz7oGMLcMrNjWXSom3iVL5qiTVf7NsJ_X4d-QKf38LwTFggp42i7FmLGQqKOcy42qUgSGbByr43p4dLeLNpxB1evXdubO2rI0SEpR9AOIDlfwQWHxmOjCGpjAaFE2yo3gqkhN--BnFXYQ4H0oAWb1qhJhb_O2fJ0LA7n6GnNyytyVWe8Hb5UiVRkIotlmsr_rg9BlT_N</recordid><startdate>20221026</startdate><enddate>20221026</enddate><creator>Leong, Colin</creator><creator>Nemecek, Joshua</creator><creator>Mansdorfer, Jacob</creator><creator>Filighera, Anna</creator><creator>Owodunni, Abraham</creator><creator>Whitenack, Daniel</creator><general>Cornell University Library, 
arXiv.org</general><scope>8FE</scope><scope>8FG</scope><scope>ABJCF</scope><scope>ABUWG</scope><scope>AFKRA</scope><scope>AZQEC</scope><scope>BENPR</scope><scope>BGLVJ</scope><scope>CCPQU</scope><scope>DWQXO</scope><scope>HCIFZ</scope><scope>L6V</scope><scope>M7S</scope><scope>PIMPY</scope><scope>PQEST</scope><scope>PQQKQ</scope><scope>PQUKI</scope><scope>PRINS</scope><scope>PTHSS</scope></search><sort><creationdate>20221026</creationdate><title>Bloom Library: Multimodal Datasets in 300+ Languages for a Variety of Downstream Tasks</title><author>Leong, Colin ; Nemecek, Joshua ; Mansdorfer, Jacob ; Filighera, Anna ; Owodunni, Abraham ; Whitenack, Daniel</author></sort><facets><frbrtype>5</frbrtype><frbrgroupid>cdi_FETCH-proquest_journals_27292913773</frbrgroupid><rsrctype>articles</rsrctype><prefilter>articles</prefilter><language>eng</language><creationdate>2022</creationdate><topic>Datasets</topic><topic>Languages</topic><topic>Libraries</topic><topic>Object recognition</topic><topic>Population statistics</topic><topic>Speech recognition</topic><toplevel>online_resources</toplevel><creatorcontrib>Leong, Colin</creatorcontrib><creatorcontrib>Nemecek, Joshua</creatorcontrib><creatorcontrib>Mansdorfer, Jacob</creatorcontrib><creatorcontrib>Filighera, Anna</creatorcontrib><creatorcontrib>Owodunni, Abraham</creatorcontrib><creatorcontrib>Whitenack, Daniel</creatorcontrib><collection>ProQuest SciTech Collection</collection><collection>ProQuest Technology Collection</collection><collection>Materials Science & Engineering Collection</collection><collection>ProQuest Central (Alumni Edition)</collection><collection>ProQuest Central UK/Ireland</collection><collection>ProQuest Central Essentials</collection><collection>ProQuest Central</collection><collection>Technology Collection</collection><collection>ProQuest One Community College</collection><collection>ProQuest Central Korea</collection><collection>SciTech Premium Collection</collection><collection>ProQuest Engineering 
Collection</collection><collection>Engineering Database</collection><collection>Access via ProQuest (Open Access)</collection><collection>ProQuest One Academic Eastern Edition (DO NOT USE)</collection><collection>ProQuest One Academic</collection><collection>ProQuest One Academic UKI Edition</collection><collection>ProQuest Central China</collection><collection>Engineering Collection</collection></facets><delivery><delcategory>Remote Search Resource</delcategory><fulltext>fulltext</fulltext></delivery><addata><au>Leong, Colin</au><au>Nemecek, Joshua</au><au>Mansdorfer, Jacob</au><au>Filighera, Anna</au><au>Owodunni, Abraham</au><au>Whitenack, Daniel</au><format>book</format><genre>document</genre><ristype>GEN</ristype><atitle>Bloom Library: Multimodal Datasets in 300+ Languages for a Variety of Downstream Tasks</atitle><jtitle>arXiv.org</jtitle><date>2022-10-26</date><risdate>2022</risdate><eissn>2331-8422</eissn><abstract>We present Bloom Library, a linguistically diverse set of multimodal and multilingual datasets for language modeling, image captioning, visual storytelling, and speech synthesis/recognition. These datasets represent either the most, or among the most, multilingual datasets for each of the included downstream tasks. In total, the initial release of the Bloom Library datasets covers 363 languages across 32 language families. We train downstream task models for various languages represented in the data, showing the viability of the data for future work in low-resource, multimodal NLP and establishing the first known baselines for these downstream tasks in certain languages (e.g., Bisu [bzi], with an estimated population of 700 users). Some of these first-of-their-kind baselines are comparable to state-of-the-art performance for higher-resourced languages. 
The Bloom Library datasets are released under Creative Commons licenses on the Hugging Face datasets hub to catalyze more linguistically diverse research in the included downstream tasks.</abstract><cop>Ithaca</cop><pub>Cornell University Library, arXiv.org</pub><oa>free_for_read</oa></addata></record>
fulltext	fulltext
identifier	EISSN: 2331-8422
ispartof	arXiv.org, 2022-10
issn	2331-8422
language	eng
recordid	cdi_proquest_journals_2729291377
source	Free E- Journals
subjects	Datasets ; Languages ; Libraries ; Object recognition ; Population statistics ; Speech recognition
title	Bloom Library: Multimodal Datasets in 300+ Languages for a Variety of Downstream Tasks
url	https://sfx.bib-bvb.de/sfx_tum?ctx_ver=Z39.88-2004&ctx_enc=info:ofi/enc:UTF-8&ctx_tim=2024-12-25T23%3A32%3A00IST&url_ver=Z39.88-2004&url_ctx_fmt=infofi/fmt:kev:mtx:ctx&rfr_id=info:sid/primo.exlibrisgroup.com:primo3-Article-proquest&rft_val_fmt=info:ofi/fmt:kev:mtx:book&rft.genre=document&rft.atitle=Bloom%20Library:%20Multimodal%20Datasets%20in%20300+%20Languages%20for%20a%20Variety%20of%20Downstream%20Tasks&rft.jtitle=arXiv.org&rft.au=Leong,%20Colin&rft.date=2022-10-26&rft.eissn=2331-8422&rft_id=info:doi/&rft_dat=%3Cproquest%3E2729291377%3C/proquest%3E%3Curl%3E%3C/url%3E&disable_directlink=true&sfx.directlink=off&sfx.report_link=0&rft_id=info:oai/&rft_pqid=2729291377&rft_id=info:pmid/&rfr_iscdi=true