Towards a Speech Foundation Model for Singapore and Beyond
Main Authors: | , , , , , , , , |
---|---|
Format: | Article |
Language: | eng |
Summary: | This technical report describes the MERaLiON Speech Encoder, a foundation
model designed to support a wide range of downstream speech applications.
Developed as part of Singapore's National Multimodal Large Language Model
Programme, the MERaLiON Speech Encoder is tailored to address the speech
processing needs of Singapore and the surrounding Southeast Asian region. The
model currently supports mainly English, including the variety spoken in
Singapore. We are actively expanding our datasets to gradually cover other
languages in subsequent releases. The MERaLiON Speech Encoder was pre-trained
from scratch on 200K hours of unlabelled speech data using a self-supervised
learning approach based on masked language modelling. We describe our training
procedure and hyperparameter tuning experiments in detail below. Our evaluation
demonstrates improvements on spontaneous and Singapore speech benchmarks for
speech recognition, while remaining competitive with other state-of-the-art
speech encoders across ten other speech tasks. We commit to releasing our
model to support broader research endeavours, both in Singapore and beyond. |
---|---|
DOI: | 10.48550/arxiv.2412.11538 |