Acoustic models of Brazilian Portuguese Speech based on Neural Transformers - Pretraining Datasets raw audios from CORAA

Bibliographic details
Main authors:	Matheus Gauy, Marcelo; Finger, Marcelo; Aluisio, Sandra Maria; Svartman, Flaviane Romani Fernandes; Candido Junior, Arnaldo; Casanova, Edresson; Leite, Marli Quadros; Soares, Anderson; Oliveira, Frederico Santos de; Oliveira, Lucas; Fernandes Jr, Ricardo; Silva, Daniel da; Fayet, Fernando Gorgulho; Carlotto, Bruno Baldissera; Gris, Lucas R; Santos, Vinícius Gonçalves dos
Format:	Dataset
Language:	Portuguese (por)
Subjects:	Brazilian Portuguese speech; Unsupervised pretraining (self-supervision)

Description
Summary:	This repository contains all the pretraining datasets used in the paper "Acoustic models of Brazilian Portuguese Speech based on Neural Transformers" by Marcelo Gauy and Marcelo Finger. The datasets belong to a collection from the TaRSila project (see https://sites.google.com/view/tarsila-c4ai). Part of the audio published here was also released, with annotations and transcriptions, as the CORAA dataset (see https://github.com/nilc-nlp/CORAA). This repository provides the original raw audio, without transcriptions, from the following datasets: ALIP, C-Oral, SP2010, NURC-Recife, NURC-São Paulo (NURC-SP) and Programa Certas Palavras. In total, the datasets contain about 800 hours of Brazilian Portuguese speech. The audio files were converted to MP3 to make uploading easier. ALIP, C-Oral and SP2010 are each contained in a single file. Programa Certas Palavras and NURC-Recife are each split into three parts, and NURC-SP is split into seven parts of roughly equal size. More information on the datasets can be found in the paper and in the original references for each dataset.
DOI:	10.5281/zenodo.6794923
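
The file layout given in the summary can be written down as a small manifest, which may help when checking that a download is complete. This is only a sketch: the dictionary keys are the dataset names from the summary, not actual file names in the repository, and it assumes that each "part" corresponds to exactly one archive file. The names `PARTS_PER_DATASET`, `total_parts` and `missing_parts` are hypothetical helpers introduced for illustration.

```python
# Number of parts per dataset, as stated in the summary.
# Keys are dataset names, NOT real file names (those are not given in the record).
PARTS_PER_DATASET = {
    "ALIP": 1,
    "C-Oral": 1,
    "SP2010": 1,
    "Programa Certas Palavras": 3,
    "NURC-Recife": 3,
    "NURC-São Paulo": 7,
}


def total_parts(parts):
    """Return the total number of parts across all datasets."""
    return sum(parts.values())


def missing_parts(parts, downloaded):
    """Return how many parts are still missing for each incomplete dataset.

    `downloaded` maps a dataset name to the number of parts already fetched;
    datasets absent from it are treated as having zero parts fetched.
    """
    return {
        name: expected - downloaded.get(name, 0)
        for name, expected in parts.items()
        if downloaded.get(name, 0) < expected
    }


if __name__ == "__main__":
    print(total_parts(PARTS_PER_DATASET))
    print(missing_parts(PARTS_PER_DATASET, {"ALIP": 1, "NURC-São Paulo": 5}))
```

Under the one-file-per-part assumption, the manifest adds up to 16 files in total (three single-file datasets, two datasets of three parts, and one of seven parts).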