Bloom Library: Multimodal Datasets in 300+ Languages for a Variety of Downstream Tasks

We present Bloom Library, a linguistically diverse set of multimodal and multilingual datasets for language modeling, image captioning, visual storytelling, and speech synthesis/recognition. These datasets represent either the most, or among the most, multilingual datasets for each of the included d...

Bibliographic details
Published in: arXiv.org 2022-10
Main authors: Leong, Colin; Nemecek, Joshua; Mansdorfer, Jacob; Filighera, Anna; Owodunni, Abraham; Whitenack, Daniel
Format: Article
Language: eng
Subjects: Datasets; Languages; Libraries; Object recognition; Population statistics; Speech recognition
Online access: Full text
container_end_page
container_issue
container_start_page
container_title arXiv.org
container_volume
creator Leong, Colin
Nemecek, Joshua
Mansdorfer, Jacob
Filighera, Anna
Owodunni, Abraham
Whitenack, Daniel
description We present Bloom Library, a linguistically diverse set of multimodal and multilingual datasets for language modeling, image captioning, visual storytelling, and speech synthesis/recognition. These datasets represent either the most, or among the most, multilingual datasets for each of the included downstream tasks. In total, the initial release of the Bloom Library datasets covers 363 languages across 32 language families. We train downstream task models for various languages represented in the data, showing the viability of the data for future work in low-resource, multimodal NLP and establishing the first known baselines for these downstream tasks in certain languages (e.g., Bisu [bzi], with an estimated population of 700 users). Some of these first-of-their-kind baselines are comparable to state-of-the-art performance for higher-resourced languages. The Bloom Library datasets are released under Creative Commons licenses on the Hugging Face datasets hub to catalyze more linguistically diverse research in the included downstream tasks.
format Article
fullrecord <record><control><sourceid>proquest</sourceid><recordid>TN_cdi_proquest_journals_2729291377</recordid><sourceformat>XML</sourceformat><sourcesystem>PC</sourcesystem><sourcerecordid>2729291377</sourcerecordid><originalsourceid>FETCH-proquest_journals_27292913773</originalsourceid><addsrcrecordid>eNqNykELgjAYgOERBEn5Hz7oGMLcMrNjWXSom3iVL5qiTVf7NsJ_X4d-QKf38LwTFggp42i7FmLGQqKOcy42qUgSGbByr43p4dLeLNpxB1evXdubO2rI0SEpR9AOIDlfwQWHxmOjCGpjAaFE2yo3gqkhN--BnFXYQ4H0oAWb1qhJhb_O2fJ0LA7n6GnNyytyVWe8Hb5UiVRkIotlmsr_rg9BlT_N</addsrcrecordid><sourcetype>Aggregation Database</sourcetype><iscdi>true</iscdi><recordtype>article</recordtype><pqid>2729291377</pqid></control><display><type>article</type><title>Bloom Library: Multimodal Datasets in 300+ Languages for a Variety of Downstream Tasks</title><source>Free E- Journals</source><creator>Leong, Colin ; Nemecek, Joshua ; Mansdorfer, Jacob ; Filighera, Anna ; Owodunni, Abraham ; Whitenack, Daniel</creator><creatorcontrib>Leong, Colin ; Nemecek, Joshua ; Mansdorfer, Jacob ; Filighera, Anna ; Owodunni, Abraham ; Whitenack, Daniel</creatorcontrib><description>We present Bloom Library, a linguistically diverse set of multimodal and multilingual datasets for language modeling, image captioning, visual storytelling, and speech synthesis/recognition. These datasets represent either the most, or among the most, multilingual datasets for each of the included downstream tasks. In total, the initial release of the Bloom Library datasets covers 363 languages across 32 language families. We train downstream task models for various languages represented in the data, showing the viability of the data for future work in low-resource, multimodal NLP and establishing the first known baselines for these downstream tasks in certain languages (e.g., Bisu [bzi], with an estimated population of 700 users). Some of these first-of-their-kind baselines are comparable to state-of-the-art performance for higher-resourced languages. 
The Bloom Library datasets are released under Creative Commons licenses on the Hugging Face datasets hub to catalyze more linguistically diverse research in the included downstream tasks.</description><identifier>EISSN: 2331-8422</identifier><language>eng</language><publisher>Ithaca: Cornell University Library, arXiv.org</publisher><subject>Datasets ; Languages ; Libraries ; Object recognition ; Population statistics ; Speech recognition</subject><ispartof>arXiv.org, 2022-10</ispartof><rights>2022. This work is published under http://creativecommons.org/licenses/by/4.0/ (the “License”). Notwithstanding the ProQuest Terms and Conditions, you may use this content in accordance with the terms of the License.</rights><oa>free_for_read</oa><woscitedreferencessubscribed>false</woscitedreferencessubscribed></display><links><openurl>$$Topenurl_article</openurl><openurlfulltext>$$Topenurlfull_article</openurlfulltext><thumbnail>$$Tsyndetics_thumb_exl</thumbnail><link.rule.ids>780,784</link.rule.ids></links><search><creatorcontrib>Leong, Colin</creatorcontrib><creatorcontrib>Nemecek, Joshua</creatorcontrib><creatorcontrib>Mansdorfer, Jacob</creatorcontrib><creatorcontrib>Filighera, Anna</creatorcontrib><creatorcontrib>Owodunni, Abraham</creatorcontrib><creatorcontrib>Whitenack, Daniel</creatorcontrib><title>Bloom Library: Multimodal Datasets in 300+ Languages for a Variety of Downstream Tasks</title><title>arXiv.org</title><description>We present Bloom Library, a linguistically diverse set of multimodal and multilingual datasets for language modeling, image captioning, visual storytelling, and speech synthesis/recognition. These datasets represent either the most, or among the most, multilingual datasets for each of the included downstream tasks. In total, the initial release of the Bloom Library datasets covers 363 languages across 32 language families. 
We train downstream task models for various languages represented in the data, showing the viability of the data for future work in low-resource, multimodal NLP and establishing the first known baselines for these downstream tasks in certain languages (e.g., Bisu [bzi], with an estimated population of 700 users). Some of these first-of-their-kind baselines are comparable to state-of-the-art performance for higher-resourced languages. The Bloom Library datasets are released under Creative Commons licenses on the Hugging Face datasets hub to catalyze more linguistically diverse research in the included downstream tasks.</description><subject>Datasets</subject><subject>Languages</subject><subject>Libraries</subject><subject>Object recognition</subject><subject>Population statistics</subject><subject>Speech recognition</subject><issn>2331-8422</issn><fulltext>true</fulltext><rsrctype>article</rsrctype><creationdate>2022</creationdate><recordtype>article</recordtype><sourceid>ABUWG</sourceid><sourceid>AFKRA</sourceid><sourceid>AZQEC</sourceid><sourceid>BENPR</sourceid><sourceid>CCPQU</sourceid><sourceid>DWQXO</sourceid><recordid>eNqNykELgjAYgOERBEn5Hz7oGMLcMrNjWXSom3iVL5qiTVf7NsJ_X4d-QKf38LwTFggp42i7FmLGQqKOcy42qUgSGbByr43p4dLeLNpxB1evXdubO2rI0SEpR9AOIDlfwQWHxmOjCGpjAaFE2yo3gqkhN--BnFXYQ4H0oAWb1qhJhb_O2fJ0LA7n6GnNyytyVWe8Hb5UiVRkIotlmsr_rg9BlT_N</recordid><startdate>20221026</startdate><enddate>20221026</enddate><creator>Leong, Colin</creator><creator>Nemecek, Joshua</creator><creator>Mansdorfer, Jacob</creator><creator>Filighera, Anna</creator><creator>Owodunni, Abraham</creator><creator>Whitenack, Daniel</creator><general>Cornell University Library, 
arXiv.org</general><scope>8FE</scope><scope>8FG</scope><scope>ABJCF</scope><scope>ABUWG</scope><scope>AFKRA</scope><scope>AZQEC</scope><scope>BENPR</scope><scope>BGLVJ</scope><scope>CCPQU</scope><scope>DWQXO</scope><scope>HCIFZ</scope><scope>L6V</scope><scope>M7S</scope><scope>PIMPY</scope><scope>PQEST</scope><scope>PQQKQ</scope><scope>PQUKI</scope><scope>PRINS</scope><scope>PTHSS</scope></search><sort><creationdate>20221026</creationdate><title>Bloom Library: Multimodal Datasets in 300+ Languages for a Variety of Downstream Tasks</title><author>Leong, Colin ; Nemecek, Joshua ; Mansdorfer, Jacob ; Filighera, Anna ; Owodunni, Abraham ; Whitenack, Daniel</author></sort><facets><frbrtype>5</frbrtype><frbrgroupid>cdi_FETCH-proquest_journals_27292913773</frbrgroupid><rsrctype>articles</rsrctype><prefilter>articles</prefilter><language>eng</language><creationdate>2022</creationdate><topic>Datasets</topic><topic>Languages</topic><topic>Libraries</topic><topic>Object recognition</topic><topic>Population statistics</topic><topic>Speech recognition</topic><toplevel>online_resources</toplevel><creatorcontrib>Leong, Colin</creatorcontrib><creatorcontrib>Nemecek, Joshua</creatorcontrib><creatorcontrib>Mansdorfer, Jacob</creatorcontrib><creatorcontrib>Filighera, Anna</creatorcontrib><creatorcontrib>Owodunni, Abraham</creatorcontrib><creatorcontrib>Whitenack, Daniel</creatorcontrib><collection>ProQuest SciTech Collection</collection><collection>ProQuest Technology Collection</collection><collection>Materials Science &amp; Engineering Collection</collection><collection>ProQuest Central (Alumni Edition)</collection><collection>ProQuest Central UK/Ireland</collection><collection>ProQuest Central Essentials</collection><collection>ProQuest Central</collection><collection>Technology Collection</collection><collection>ProQuest One Community College</collection><collection>ProQuest Central Korea</collection><collection>SciTech Premium Collection</collection><collection>ProQuest 
Engineering Collection</collection><collection>Engineering Database</collection><collection>Access via ProQuest (Open Access)</collection><collection>ProQuest One Academic Eastern Edition (DO NOT USE)</collection><collection>ProQuest One Academic</collection><collection>ProQuest One Academic UKI Edition</collection><collection>ProQuest Central China</collection><collection>Engineering Collection</collection></facets><delivery><delcategory>Remote Search Resource</delcategory><fulltext>fulltext</fulltext></delivery><addata><au>Leong, Colin</au><au>Nemecek, Joshua</au><au>Mansdorfer, Jacob</au><au>Filighera, Anna</au><au>Owodunni, Abraham</au><au>Whitenack, Daniel</au><format>book</format><genre>document</genre><ristype>GEN</ristype><atitle>Bloom Library: Multimodal Datasets in 300+ Languages for a Variety of Downstream Tasks</atitle><jtitle>arXiv.org</jtitle><date>2022-10-26</date><risdate>2022</risdate><eissn>2331-8422</eissn><abstract>We present Bloom Library, a linguistically diverse set of multimodal and multilingual datasets for language modeling, image captioning, visual storytelling, and speech synthesis/recognition. These datasets represent either the most, or among the most, multilingual datasets for each of the included downstream tasks. In total, the initial release of the Bloom Library datasets covers 363 languages across 32 language families. We train downstream task models for various languages represented in the data, showing the viability of the data for future work in low-resource, multimodal NLP and establishing the first known baselines for these downstream tasks in certain languages (e.g., Bisu [bzi], with an estimated population of 700 users). Some of these first-of-their-kind baselines are comparable to state-of-the-art performance for higher-resourced languages. 
The Bloom Library datasets are released under Creative Commons licenses on the Hugging Face datasets hub to catalyze more linguistically diverse research in the included downstream tasks.</abstract><cop>Ithaca</cop><pub>Cornell University Library, arXiv.org</pub><oa>free_for_read</oa></addata></record>
fulltext fulltext
identifier EISSN: 2331-8422
ispartof arXiv.org, 2022-10
issn 2331-8422
language eng
recordid cdi_proquest_journals_2729291377
source Free E- Journals
subjects Datasets
Languages
Libraries
Object recognition
Population statistics
Speech recognition
title Bloom Library: Multimodal Datasets in 300+ Languages for a Variety of Downstream Tasks
url https://sfx.bib-bvb.de/sfx_tum?ctx_ver=Z39.88-2004&ctx_enc=info:ofi/enc:UTF-8&ctx_tim=2024-12-25T23%3A32%3A00IST&url_ver=Z39.88-2004&url_ctx_fmt=infofi/fmt:kev:mtx:ctx&rfr_id=info:sid/primo.exlibrisgroup.com:primo3-Article-proquest&rft_val_fmt=info:ofi/fmt:kev:mtx:book&rft.genre=document&rft.atitle=Bloom%20Library:%20Multimodal%20Datasets%20in%20300+%20Languages%20for%20a%20Variety%20of%20Downstream%20Tasks&rft.jtitle=arXiv.org&rft.au=Leong,%20Colin&rft.date=2022-10-26&rft.eissn=2331-8422&rft_id=info:doi/&rft_dat=%3Cproquest%3E2729291377%3C/proquest%3E%3Curl%3E%3C/url%3E&disable_directlink=true&sfx.directlink=off&sfx.report_link=0&rft_id=info:oai/&rft_pqid=2729291377&rft_id=info:pmid/&rfr_iscdi=true