Boosting keyword spotting through on-device learnable user speech characteristics
Saved in:
Main authors:
Format: Article
Language: eng
Subjects:
Online access: Order full text
Abstract: Keyword spotting systems for always-on TinyML-constrained applications
require on-site tuning to boost the accuracy of offline-trained classifiers
when deployed in unseen inference conditions. Adapting to the speech
peculiarities of target users requires many in-domain samples, which are often
unavailable in real-world scenarios. Furthermore, current on-device learning
techniques rely on computationally intensive and memory-hungry backbone update
schemes, unfit for always-on, battery-powered devices. In this work, we propose
a novel on-device learning architecture composed of a pretrained backbone and
a user-aware embedding that learns the user's speech characteristics. The
resulting features are fused and used to classify the input utterance. For
domain shifts caused by unseen speakers, we measure error rate reductions of
up to 19% (from 30.1% to 24.3%) on the 35-class problem of the Google
Speech Commands dataset, achieved through the inexpensive update of the user
projections. We moreover demonstrate the few-shot learning capabilities of our
proposed architecture in sample- and class-scarce learning conditions. With
23.7 k parameters and 1 MFLOP per epoch required for on-device training, our
system is feasible for TinyML applications aimed at battery-powered
microcontrollers.
DOI: 10.48550/arxiv.2403.07802
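The abstract describes the architecture only at a high level; the minimal PyTorch sketch below shows how such a system could be structured. All names, dimensions, and the concatenation-based fusion are illustrative assumptions, not the paper's exact design: a frozen pretrained backbone extracts acoustic features, a small learnable user projection encodes speaker characteristics, and only that projection is updated on device.

```python
import torch
import torch.nn as nn

class UserAwareKWS(nn.Module):
    """Hypothetical keyword spotter with an on-device learnable user embedding."""
    def __init__(self, backbone: nn.Module, feat_dim: int = 64,
                 user_dim: int = 32, num_classes: int = 35):
        super().__init__()
        self.backbone = backbone
        for p in self.backbone.parameters():
            p.requires_grad = False  # pretrained backbone stays frozen on device
        # User-aware embedding: the small set of parameters adapted on device.
        self.user_proj = nn.Linear(feat_dim, user_dim)
        # Fusion by concatenation (an assumed choice), followed by classification.
        self.classifier = nn.Linear(feat_dim + user_dim, num_classes)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        feats = self.backbone(x)                    # (batch, feat_dim)
        user_feats = self.user_proj(feats)          # speaker-adapted projection
        fused = torch.cat([feats, user_feats], -1)  # fuse backbone + user features
        return self.classifier(fused)               # 35-way keyword logits

# Toy stand-in for the pretrained acoustic feature extractor.
backbone = nn.Sequential(nn.Flatten(), nn.Linear(40 * 49, 64), nn.ReLU())
model = UserAwareKWS(backbone)

# Few-shot on-device adaptation: only the cheap user projection is updated,
# which keeps per-epoch compute and memory far below full backbone training.
optimizer = torch.optim.SGD(model.user_proj.parameters(), lr=1e-2)
loss_fn = nn.CrossEntropyLoss()
x = torch.randn(8, 1, 40, 49)           # e.g., 8 log-Mel / MFCC utterances
y = torch.randint(0, 35, (8,))          # their keyword labels
for _ in range(5):                      # a handful of inexpensive epochs
    optimizer.zero_grad()
    loss_fn(model(x), y).backward()
    optimizer.step()
```

Restricting the optimizer to `model.user_proj.parameters()` reflects the abstract's claim that adaptation happens through the inexpensive update of the user projections alone, rather than a full backbone update.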