Focus-Shifting Attack: An Adversarial Attack That Retains Saliency Map Information and Manipulates Model Explanations

With the increased use of deep learning in many fields, a question has been raised: "How much should we trust the results generated by deep learning models?" Thus, there has been much research into the interpretations of model results, in order to open the black box of deep learning. The f...

Ausführliche Beschreibung

Gespeichert in:

Bibliographische Detailangaben
Veröffentlicht in:	IEEE transactions on reliability 2024-06, Vol.73 (2), p.808-819
Hauptverfasser:	Huang, Qi-Xian, Chiang, Lin-Kuan, Chiu, Min-Yi, Sun, Hung-Min
Format:	Artikel
Sprache:	eng
Schlagworte:	Adversarial attack Algorithms Artificial intelligence Data models Deep learning deep learning (DL) Explainable AI explainable artificial intelligence (XAI) model explanations Perturbation Perturbation methods Predictive models Salience Saliency detection saliency map Training
Online-Zugang:	Volltext bestellen
Tags:	Tag hinzufügen Keine Tags, Fügen Sie den ersten Tag hinzu!

Beschreibung
Zusammenfassung:	With the increased use of deep learning in many fields, a question has been raised: "How much should we trust the results generated by deep learning models?" Thus, there has been much research into the interpretations of model results, in order to open the black box of deep learning. The focus is more on interpretation than prediction in some fields such as medicine. Adversarial attacks are the most direct threats to deep learning models. They can add undetectable perturbations to the data to make the models give incorrect results, and model explanations are also susceptible to attacks. This leads to a loss of trust in explanations provided by the models, limiting the application and commercial value of deep learning. This research proposes a targeted adversarial attack algorithm that manipulates the interpretation of the model. Unlike other adversarial attacks on model interpretation, focus-shifting attack (FS Attack) can preserve the numerical depth of the original saliency map without specifying a perturbation budget. Experiments have shown that the FS Attack has a higher degree of image similarity and misleading interpretation than other adversarial attacks, and the property of preserving the numerical depth of the original saliency map makes it more difficult to detect. This study uses several common explanation methods as experimental subjects to investigate how these explanations can be manipulated and evaluate the effectiveness of the attack under different conditions. Under a particular interpretation, the FS Attack has a highly successful attack rate of 94.6\%, which is a critical adversarial attack.
ISSN:	0018-9529 1558-1721
DOI:	10.1109/TR.2023.3303923