Beyond $L_p$ clipping: Equalization-based Psychoacoustic Attacks against ASRs
Format: Article
Language: English
Abstract: Automatic Speech Recognition (ASR) systems convert speech into text and
can be placed into two broad categories: traditional and fully end-to-end. Both
types have been shown to be vulnerable to adversarial audio examples that sound
benign to the human ear but force the ASR to produce malicious transcriptions.
Of these attacks, only the "psychoacoustic" attacks can create examples with
relatively imperceptible perturbations, as they leverage knowledge of the
human auditory system. Unfortunately, existing psychoacoustic attacks can only
be applied against traditional models and are obsolete against the newer,
fully end-to-end ASRs. In this paper, we propose an equalization-based
psychoacoustic attack that can exploit both traditional and fully end-to-end
ASRs. We successfully demonstrate our attack against real-world ASRs, including
DeepSpeech and Wav2Letter. Moreover, we conducted a user study to verify
that our method creates low audible distortion: 80 of the 100 participants
rated all of our attack audio samples as less noisy than those of the existing
state-of-the-art attack. Through this, we demonstrate that both types of
existing ASR pipelines can be exploited with minimal degradation to the
attack audio quality.
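The equalization idea summarized above can be illustrated with a minimal sketch: shape the spectrum of an adversarial perturbation with per-band gains so that energy is reduced in frequency ranges where the human ear is most sensitive. This is a generic FFT-based equalizer, not the paper's actual method; the band edges, gains, and function names here are illustrative assumptions.

```python
import numpy as np

def equalize(perturbation, gains, sr=16000):
    """Apply per-band gain shaping to a 1-D perturbation signal.

    gains: list of (low_hz, high_hz, gain) triples (illustrative only).
    Attenuating bands where hearing is most sensitive (roughly 1-4 kHz)
    makes the perturbation less audible at the cost of some attack energy.
    """
    n = len(perturbation)
    spectrum = np.fft.rfft(perturbation)
    freqs = np.fft.rfftfreq(n, d=1.0 / sr)
    weights = np.ones_like(freqs)
    for lo, hi, g in gains:
        weights[(freqs >= lo) & (freqs < hi)] = g
    # Weight the spectrum and transform back to the time domain.
    return np.fft.irfft(spectrum * weights, n=n)

# Example: halve the perturbation's energy in the 1-4 kHz band,
# where human hearing sensitivity peaks.
noise = np.random.randn(16000)
shaped = equalize(noise, [(1000, 4000, 0.5)])
```

In the paper's setting, the gain curve would instead be chosen to hide the perturbation under the masking thresholds of the carrier audio; the sketch only shows the mechanical shaping step.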
DOI: 10.48550/arxiv.2110.13250