Phoneme-Level Contrastive Learning for User-Defined Keyword Spotting with Flexible Enrollment
Main authors: | , , , , |
---|---|
Format: | Article |
Language: | English |
Subjects: | |
Online access: | Order full text |
Summary: | User-defined keyword spotting (KWS) enhances the user experience by allowing
individuals to customize keywords. However, in open-vocabulary scenarios, most
existing methods commonly suffer from high false alarm rates with confusable
words and are limited to either audio-only or text-only enrollment. Therefore,
in this paper, we first explore the model's robustness against confusable
words. Specifically, we propose Phoneme-Level Contrastive Learning (PLCL),
which refines and aligns query and source feature representations at the
phoneme level. This method enhances the model's disambiguation capability
through fine-grained positive and negative comparisons for more accurate
alignment, and it is generalizable to jointly optimize both audio-text and
audio-audio matching, adapting to various enrollment modes. Furthermore, we
maintain a context-agnostic phoneme memory bank to construct confusable
negatives for data augmentation. Based on this, a third-category discriminator
is specifically designed to distinguish hard negatives. Overall, we develop a
robust and flexible KWS system, supporting different modality enrollment
methods within a unified framework. Verified on the LibriPhrase dataset, the
proposed approach achieves state-of-the-art performance. |
---|---|
DOI: | 10.48550/arxiv.2412.20805 |
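
The summary describes contrasting query and source representations at the phoneme level, with confusable phonemes serving as hard negatives. The sketch below illustrates that general idea with a standard InfoNCE-style loss in plain Python. It is not the paper's formulation: the function names, the cosine similarity, the temperature value, and the assumption that query and source phonemes are already aligned one-to-one are all hypothetical choices made for illustration.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm > 0 else 0.0


def phoneme_info_nce(query, source, negatives, temperature=0.1):
    """Mean InfoNCE loss over aligned phoneme positions (illustrative only).

    query[i] and source[i] are embeddings of the same phoneme (a positive pair).
    negatives[i] is a list of embeddings of confusable phonemes for position i,
    standing in for samples drawn from a phoneme memory bank (an assumption).
    """
    if not query or not (len(query) == len(source) == len(negatives)):
        raise ValueError("query, source, and negatives must be non-empty and equal in length")
    losses = []
    for q, pos, negs in zip(query, source, negatives):
        # The positive logit comes first, followed by one logit per negative.
        logits = [cosine(q, pos) / temperature]
        logits += [cosine(q, n) / temperature for n in negs]
        # Numerically stable log-sum-exp for the softmax denominator.
        m = max(logits)
        log_denom = m + math.log(sum(math.exp(x - m) for x in logits))
        # Negative log-probability of picking the positive among all candidates.
        losses.append(log_denom - logits[0])
    return sum(losses) / len(losses)
```

Under these assumptions, the loss is close to zero when each query phoneme is most similar to its aligned source phoneme and grows when a confusable negative is the closer match, which is the behavior a disambiguation objective of this kind would reward.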