Acoustic models of Brazilian Portuguese Speech based on Neural Transformers - Pretraining Datasets raw audios from CORAA
Format: | Dataset |
---|---|
Language: | por |
Summary: | This repository contains all the pretraining datasets used in the paper "Acoustic models of Brazilian Portuguese Speech based on Neural Transformers" by Marcelo Gauy and Marcelo Finger. These datasets are part of a collection of datasets from the TaRSila project (see https://sites.google.com/view/tarsila-c4ai). Some of the audio published here was also released, with annotations and transcriptions, as the CORAA dataset (see https://github.com/nilc-nlp/CORAA). Here we publish the original raw audio, without transcriptions, from the following datasets: ALIP, C-Oral, SP2010, NURC-Recife, NURC-São Paulo and Programa Certas Palavras. In total, the datasets contain about 800 hours of Brazilian Portuguese speech. The audio has been converted to mp3 to facilitate the upload. ALIP, C-Oral and SP2010 are each contained in a single file. Programa Certas Palavras and NURC-Recife are split into 3 parts each, while NURC-São Paulo (NURC-SP) is split into 7 parts of roughly equal size. More information on the datasets can be found in the paper, as well as in the original references for each dataset. |
---|---|
DOI: | 10.5281/zenodo.6794923 |