Dynamic-SUPERB Phase-2: A Collaboratively Expanding Benchmark for Measuring the Capabilities of Spoken Language Models with 180 Tasks

Bibliographic Details
Published in:	arXiv.org 2024-11
Main Authors:	Chien-yu, Huang, Chen, Wei-Chih, Shu-wen, Yang, Liu, Andy T, Chen-An, Li, Yu-Xiang, Lin, Wei-Cheng, Tseng, Diwan, Anuj, Yi-Jen Shih, Shi, Jiatong, Chen, William, Chen, Xuanjun, Chi-Yuan, Hsiao, Peng, Puyuan, Shih-Heng, Wang, Chun-Yi, Kuan, Ke-Han, Lu, Kai-Wei, Chang, Chih-Kai, Yang, Ritter-Gutierrez, Fabian, Ming To Chuang, Kuan-Po Huang, Arora, Siddhant, You-Kuan, Lin, Yeo, Eunjung, Chang, Kalvin, Chung-Ming, Chien, Choi, Kwanghee, Cheng-Hsiu Hsieh, Yi-Cheng, Lin, Chee-En Yu, I-Hsiang, Chiu, Guimarães, Heitor R, Han, Jionghao, Lin, Tzu-Quan, Lin, Tzu-Yuan, Chang, Homu, Ting-Wu, Chang, Chun Wei Chen, Shou-Jen Chen, Yu-Hua, Chen, Hsi-Chun, Cheng, Dhawan, Kunal, Jia-Lin, Fang, Shi-Xin, Fang, Kuan-Yu, Fang Chiang, Chi An Fu, Hsien-Fu Hsiao, Ching Yu Hsu, Huang, Shao-Syuan, Lee Chen Wei, Hsi-Che Lin, Hsuan-Hao, Lin, Hsuan-Ting, Lin, Jian-Ren, Lin, Ting-Chun, Liu, Li-Chun, Lu, Tsung-Min Pai, Pasad, Ankita, Shih-Yun, Shan Kuan, Shon, Suwon, Tang, Yuxun, Yun-Shao, Tsai, Jui-Chiang, Wei, Wei, Tzu-Chieh, Wu, Chengxi, Wu, Dien-Ruei, Chao-Han, Huck Yang, Chieh-Chi Yang, Jia Qi Yip, Shao-Xiang, Yuan, Noroozi, Vahid, Chen, Zhehuai, Wu, Haibin, Livescu, Karen, Harwath, David, Watanabe, Shinji, Hung-yi, Lee
Format:	Article
Language:	English
Subjects:	Audio data; Benchmarks; Emotion recognition; Evaluation; Natural language processing; Speech
Online Access:	Full text

Description
Abstract:	Multimodal foundation models, such as Gemini and ChatGPT, have revolutionized human-machine interactions by seamlessly integrating various forms of data. Developing a universal spoken language model that comprehends a wide range of natural language instructions is critical for bridging communication gaps and facilitating more intuitive interactions. However, the absence of a comprehensive evaluation benchmark poses a significant challenge. We present Dynamic-SUPERB Phase-2, an open and evolving benchmark for the comprehensive evaluation of instruction-based universal speech models. Building upon the first generation, this second version incorporates 125 new tasks contributed collaboratively by the global research community, expanding the benchmark to a total of 180 tasks, making it the largest benchmark for speech and audio evaluation. While the first generation of Dynamic-SUPERB was limited to classification tasks, Dynamic-SUPERB Phase-2 broadens its evaluation capabilities by introducing a wide array of novel and diverse tasks, including regression and sequence generation, across speech, music, and environmental audio. Evaluation results indicate that none of the models performed well universally. SALMONN-13B excelled in English ASR, while WavLLM demonstrated high accuracy in emotion recognition, but current models still require further innovations to handle a broader range of tasks. We will soon open-source all task data and the evaluation pipeline.
ISSN:	2331-8422