Zero-Shot Mono-to-Binaural Speech Synthesis
Saved in:
Main authors: | , , , , , , |
---|---|
Format: | Article |
Language: | eng |
Subjects: | |
Online access: | Order full text |
Abstract: | We present ZeroBAS, a neural method to synthesize binaural audio from monaural audio recordings and positional information without training on any binaural data. To our knowledge, this is the first published zero-shot neural approach to mono-to-binaural audio synthesis. Specifically, we show that a parameter-free geometric time warping and amplitude scaling based on source location suffices to produce an initial binaural synthesis that can be refined by iteratively applying a pretrained denoising vocoder. Furthermore, we find this leads to generalization across room conditions, which we measure by introducing a new dataset, TUT Mono-to-Binaural, to evaluate state-of-the-art monaural-to-binaural synthesis methods on unseen conditions. Our zero-shot method is perceptually on par with supervised methods on the standard mono-to-binaural dataset, and even surpasses them on our out-of-distribution TUT Mono-to-Binaural dataset. Our results highlight the potential of pretrained generative audio models and zero-shot learning to unlock robust binaural audio synthesis. |
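The "geometric time warping and amplitude scaling" the abstract describes can be illustrated with a minimal sketch. This is an assumption-laden reconstruction, not the paper's implementation: it assumes a simple two-ear head model, a per-ear propagation delay derived from source-to-ear distance, and inverse-distance attenuation; the function names and ear coordinates are hypothetical.

```python
import numpy as np

SPEED_OF_SOUND = 343.0  # m/s, at roughly room temperature

def geometric_warp(mono, sr, src_pos, ear_pos):
    """Delay and attenuate a mono signal for one ear based on the
    source-to-ear distance (illustrative sketch, not the paper's code)."""
    dist = np.linalg.norm(np.asarray(src_pos) - np.asarray(ear_pos))
    delay = dist * sr / SPEED_OF_SOUND      # propagation delay in (fractional) samples
    gain = 1.0 / max(dist, 1e-3)            # inverse-distance amplitude scaling
    # Fractional-sample time warp via linear interpolation.
    t = np.arange(len(mono)) - delay
    warped = np.interp(t, np.arange(len(mono)), mono, left=0.0, right=0.0)
    return gain * warped

def mono_to_initial_binaural(mono, sr, src_pos,
                             left_ear=(-0.09, 0.0, 0.0),   # hypothetical ear
                             right_ear=(0.09, 0.0, 0.0)):  # positions (meters)
    """Initial two-channel estimate: warp and scale the mono signal per ear.
    In the paper this estimate is then refined by a pretrained vocoder."""
    left = geometric_warp(mono, sr, src_pos, left_ear)
    right = geometric_warp(mono, sr, src_pos, right_ear)
    return np.stack([left, right])
```

For a source placed to the listener's left, this produces a left channel that is both louder and earlier than the right channel, i.e. the interaural level and time differences that make the result sound spatialized.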
DOI: | 10.48550/arxiv.2412.08356 |