Device-Directed Speech Detection: Regularization via Distillation for Weakly-Supervised Models
Format: Article
Language: English
Online access: Order full text
Abstract: We address the problem of detecting speech directed to a device that does not contain a specific wake-word. Specifically, we focus on audio coming from a touch-based invocation. Mitigating accidental activations of virtual assistants (VAs) caused by unintended button presses is critical for user experience. While the majority of approaches to false-trigger mitigation (FTM) are designed to detect the presence of a target keyword, inferring user intent in the absence of a keyword is difficult. This also makes it challenging to create training and evaluation data for such systems, due to the inherent ambiguity in the users' data. To this end, we propose a novel FTM approach that uses weakly-labeled training data obtained with a newly introduced data sampling strategy. While this sampling strategy reduces data annotation effort, the labels are noisy because the data are not annotated manually. We use these data to train an acoustics-only model for the FTM task by regularizing its loss function via knowledge distillation from an ASR-based (LatticeRNN) model. This improves the model's decisions, resulting in a 66% accuracy gain, measured by equal error rate (EER), over the base acoustics-only model. We also show that an ensemble of the LatticeRNN and acoustic-distilled models brings a further accuracy improvement of 20%.
DOI: 10.48550/arxiv.2203.15975
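The abstract describes the training objective only in words: a weak-label classification term regularized by knowledge distillation from the ASR-based LatticeRNN teacher, followed by a score-level ensemble. Below is a minimal sketch of how such an objective could look, assuming PyTorch; the function names, the binary two-class KL formulation, the temperature, and the weighting coefficients are illustrative assumptions, not the paper's implementation.

```python
import torch
import torch.nn.functional as F

def distillation_regularized_loss(student_logits, weak_labels, teacher_probs,
                                  alpha=0.5, temperature=2.0):
    """Weak-label BCE plus a KL regularizer toward the teacher's posterior.

    student_logits: (batch,) raw scores from the acoustics-only student.
    weak_labels:    (batch,) noisy 0/1 labels from the data sampling strategy.
    teacher_probs:  (batch,) device-directed probabilities from the ASR-based teacher.
    alpha:          weight of the distillation (regularization) term (assumed value).
    """
    # Supervised term on the weakly (noisily) labeled data.
    bce = F.binary_cross_entropy_with_logits(student_logits, weak_labels.float())

    # Distillation term: pull the student's temperature-softened posterior
    # toward the teacher's posterior; for a binary task this is a two-class KL.
    student_p = torch.sigmoid(student_logits / temperature).clamp(1e-6, 1 - 1e-6)
    teacher_p = teacher_probs.clamp(1e-6, 1 - 1e-6)
    kl = (teacher_p * (teacher_p / student_p).log()
          + (1 - teacher_p) * ((1 - teacher_p) / (1 - student_p)).log()).mean()

    return (1 - alpha) * bce + alpha * kl

def ensemble_score(student_prob, teacher_prob, w=0.5):
    # Simple score-level ensemble of the distilled student and the LatticeRNN
    # teacher, as mentioned in the abstract; equal weighting is an assumption.
    return w * student_prob + (1 - w) * teacher_prob
```

In this sketch the distillation term acts purely as a regularizer on the loss, so the student remains an acoustics-only model at inference time and does not require the ASR lattice, while the ensemble combines both models when the teacher's scores are available.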