Audio self-supervised learning: A survey

Bibliographic details
Published in:	Patterns (New York, N.Y.), 2022-12, Vol. 3 (12), p. 100616, Article 100616
Main authors:	Liu, Shuo; Mallol-Ragolta, Adria; Parada-Cabaleiro, Emilia; Qian, Kun; Jing, Xin; Kathan, Alexander; Hu, Bin; Schuller, Björn W.
Format:	Article
Language:	English
Subjects:	audio and speech processing; multi-modal SSL; representation learning; review; self-supervised learning; unsupervised learning
Online access:	Full text

Description
Summary:	Similar to humans’ cognitive ability to generalize knowledge and skills, self-supervised learning (SSL) targets discovering general representations from large-scale data. Through the use of pre-trained SSL models for downstream tasks, this alleviates the need for human annotation, which is expensive and time-consuming. Its success in the fields of computer vision and natural language processing has prompted its recent adoption into the field of audio and speech processing. Comprehensive reviews summarizing the knowledge in audio SSL are currently missing. To fill this gap, we provide an overview of the SSL methods used for audio and speech processing applications. Herein, we also summarize the empirical works that exploit the audio modality in multi-modal SSL frameworks and the existing suitable benchmarks to evaluate the power of SSL in the computer audition domain. Finally, we discuss some open problems and point out future directions in the development of audio SSL.
Several current review studies seek to provide the scientific community with an overview of the existing literature on self-supervised learning (SSL). However, these studies clearly favor computer vision (CV) and natural language processing (NLP), owing to the widespread use of SSL in these domains. The success of SSL in these fields has inspired its incorporation into audio processing. Therefore, the purpose of this survey is to present an overview of the SSL techniques used in audio and speech processing applications. In addition, we summarize the empirical research that uses the audio modality in multi-modal SSL frameworks, as well as the available benchmarks that can be used to assess the effectiveness of SSL in the area of computer audition.
Recent research has shown an ever-growing interest in applying SSL to audio and speech processing. As this rapidly emerging field has not yet been thoroughly explored, we provide a survey on SSL with a focus on recent advancements, including, for the first time, an overview of SSL in audio within unified frameworks. This review is intended to benefit practitioners, both beginners and more experienced researchers, who are interested in the use of SSL for audio signal processing.
ISSN:	2666-3899
DOI:	10.1016/j.patter.2022.100616