Bloom Library: Multimodal Datasets in 300+ Languages for a Variety of Downstream Tasks
We present Bloom Library, a linguistically diverse set of multimodal and multilingual datasets for language modeling, image captioning, visual storytelling, and speech synthesis/recognition. These datasets represent either the most, or among the most, multilingual datasets for each of the included d...
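Published in: | arXiv.org 2022-10 |
---|---|
Main authors: | Leong, Colin; Nemecek, Joshua; Mansdorfer, Jacob; Filighera, Anna; Owodunni, Abraham; Whitenack, Daniel |
Format: | Article |
Language: | eng |
Subjects: | Datasets; Languages; Libraries; Object recognition; Population statistics; Speech recognition |
Online access: | Full text |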
container_end_page | |
---|---|
container_issue | |
container_start_page | |
container_title | arXiv.org |
container_volume | |
creator | Leong, Colin; Nemecek, Joshua; Mansdorfer, Jacob; Filighera, Anna; Owodunni, Abraham; Whitenack, Daniel |
description | We present Bloom Library, a linguistically diverse set of multimodal and multilingual datasets for language modeling, image captioning, visual storytelling, and speech synthesis/recognition. These datasets represent either the most, or among the most, multilingual datasets for each of the included downstream tasks. In total, the initial release of the Bloom Library datasets covers 363 languages across 32 language families. We train downstream task models for various languages represented in the data, showing the viability of the data for future work in low-resource, multimodal NLP and establishing the first known baselines for these downstream tasks in certain languages (e.g., Bisu [bzi], with an estimated population of 700 users). Some of these first-of-their-kind baselines are comparable to state-of-the-art performance for higher-resourced languages. The Bloom Library datasets are released under Creative Commons licenses on the Hugging Face datasets hub to catalyze more linguistically diverse research in the included downstream tasks. |
format | Article |
fullrecord | <record><control><sourceid>proquest</sourceid><recordid>TN_cdi_proquest_journals_2729291377</recordid><sourceformat>XML</sourceformat><sourcetype>Aggregation Database</sourcetype><recordtype>article</recordtype><pqid>2729291377</pqid></control><display><type>article</type><publisher>Ithaca: Cornell University Library, arXiv.org</publisher><rights>2022. This work is published under http://creativecommons.org/licenses/by/4.0/ (the “License”). Notwithstanding the ProQuest Terms and Conditions, you may use this content in accordance with the terms of the License.</rights><oa>free_for_read</oa></display><addata><genre>document</genre><date>2022-10-26</date><cop>Ithaca</cop><pub>Cornell University Library, arXiv.org</pub></addata></record> |
fulltext | fulltext |
identifier | EISSN: 2331-8422 |
ispartof | arXiv.org, 2022-10 |
issn | 2331-8422 |
language | eng |
recordid | cdi_proquest_journals_2729291377 |
source | Free E- Journals |
subjects | Datasets; Languages; Libraries; Object recognition; Population statistics; Speech recognition |
title | Bloom Library: Multimodal Datasets in 300+ Languages for a Variety of Downstream Tasks |
url | https://sfx.bib-bvb.de/sfx_tum?ctx_ver=Z39.88-2004&ctx_enc=info:ofi/enc:UTF-8&ctx_tim=2024-12-25T23%3A32%3A00IST&url_ver=Z39.88-2004&url_ctx_fmt=infofi/fmt:kev:mtx:ctx&rfr_id=info:sid/primo.exlibrisgroup.com:primo3-Article-proquest&rft_val_fmt=info:ofi/fmt:kev:mtx:book&rft.genre=document&rft.atitle=Bloom%20Library:%20Multimodal%20Datasets%20in%20300+%20Languages%20for%20a%20Variety%20of%20Downstream%20Tasks&rft.jtitle=arXiv.org&rft.au=Leong,%20Colin&rft.date=2022-10-26&rft.eissn=2331-8422&rft_id=info:doi/&rft_dat=%3Cproquest%3E2729291377%3C/proquest%3E%3Curl%3E%3C/url%3E&disable_directlink=true&sfx.directlink=off&sfx.report_link=0&rft_id=info:oai/&rft_pqid=2729291377&rft_id=info:pmid/&rfr_iscdi=true |