How adversarial attacks can disrupt seemingly stable accurate classifiers

Bibliographic details
Published in:	Neural Networks 2024-12, Vol.180, p.106711, Article 106711
Main authors:	Sutton, Oliver J.; Zhou, Qinghua; Tyukin, Ivan Y.; Gorban, Alexander N.; Bastounis, Alexander; Higham, Desmond J.
Format:	Article
Language:	English
Subjects:	Adversarial attacks; Algorithms; Humans; Machine Learning; Measure concentration theory; Neural networks; Neural Networks, Computer; Stability
Online access:	Full text

Description
Abstract:	Adversarial attacks dramatically change the output of an otherwise accurate learning system using a seemingly inconsequential modification to a piece of input data. Paradoxically, empirical evidence indicates that even systems which are robust to large random perturbations of the input data remain susceptible to small, easily constructed, adversarial perturbations of their inputs. Here, we show that this may be seen as a fundamental feature of classifiers working with high dimensional input data. We introduce a simple, generic and generalisable framework for which key behaviours observed in practical systems arise with high probability, notably the simultaneous susceptibility of the (otherwise accurate) model to easily constructed adversarial attacks, and robustness to random perturbations of the input data. We confirm that the same phenomena are directly observed in practical neural networks trained on standard image classification problems, where even large additive random noise fails to trigger the adversarial instability of the network. A surprising takeaway is that even small margins separating a classifier’s decision surface from training and testing data can hide adversarial susceptibility from being detected using randomly sampled perturbations. Counter-intuitively, using additive noise during training or testing is therefore inefficient for eradicating or detecting adversarial examples, and more demanding adversarial training is required.
Highlights:
• A new theory for studying accuracy, adversarial attacks, and robustness is presented.
• We present experiments confirming the theory on standard benchmarks.
• The theory reveals when adversarial attacks affect seemingly stable classifiers.
• Adding noise during training is inefficient for eradicating adversarial examples.
ISSN:	0893-6080, 1879-2782
DOI:	10.1016/j.neunet.2024.106711
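
Illustrative note: the central claim of the abstract, that a small margin can hide adversarial susceptibility from random noise in high dimensions, can be checked numerically on a toy problem. The sketch below is not the framework of the paper; it assumes a hypothetical linear classifier sign(w·x) with a unit normal w, a single correctly classified point placed at distance `margin` from the decision surface, and random perturbations drawn uniformly from a sphere of radius `noise_norm`. The function name `flip_rates` and all parameter values are assumptions made for illustration.

```python
import numpy as np

def flip_rates(n=2000, margin=0.1, noise_norm=1.0, trials=1000, seed=0):
    """Compare random and adversarial perturbations for a toy linear classifier.

    Returns (fraction of random perturbations that change the label,
             whether the adversarial perturbation changes the label,
             norm of the adversarial perturbation).
    """
    rng = np.random.default_rng(seed)

    # Hypothetical classifier: label = sign(w . x), with a unit normal w.
    w = rng.standard_normal(n)
    w /= np.linalg.norm(w)

    # A correctly classified point at distance `margin` from the surface.
    x = margin * w

    # Random perturbations: uniform directions, each rescaled to noise_norm.
    d = rng.standard_normal((trials, n))
    d *= noise_norm / np.linalg.norm(d, axis=1, keepdims=True)
    random_flip_rate = float(np.mean((x + d) @ w <= 0))

    # Adversarial perturbation: step against the normal, 10% past the margin.
    delta = -1.1 * margin * w
    adversarial_flip = bool((x + delta) @ w <= 0)

    return random_flip_rate, adversarial_flip, float(np.linalg.norm(delta))

if __name__ == "__main__":
    for dim in (2, 2000):
        rate, adv, adv_norm = flip_rates(n=dim)
        print(f"n={dim}: random flip rate={rate:.3f}, "
              f"adversarial flip={adv}, adversarial norm={adv_norm:.2f}")
```

In this toy setting the adversarial step has norm 0.11 and always crosses the decision surface, while the random perturbations are roughly nine times larger. The component of a random direction along w is typically of order 1/sqrt(n), so in dimension 2 a large share of the random perturbations change the label, whereas in dimension 2000 essentially none do. Real networks have curved decision surfaces, so this only sketches the mechanism under the stated assumptions.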