Do Counterfactual Examples Complicate Adversarial Training?
Format: Article
Language: English
Online access: Order full text
Abstract: We leverage diffusion models to study the robustness-performance tradeoff of robust classifiers. Our approach introduces a simple, pretrained diffusion method to generate low-norm counterfactual examples (CEs): semantically altered data that result in different true class membership. We report that the confidence and accuracy of robust models on their clean training data are associated with the proximity of the data to their CEs. Moreover, robust models perform very poorly when evaluated on the CEs directly, as they become increasingly invariant to the low-norm, semantic changes brought by CEs. The results indicate a significant overlap between non-robust and semantic features, countering the common assumption that non-robust features are not interpretable.
DOI: 10.48550/arxiv.2404.10588
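
The abstract does not give implementation details, but the general idea of diffusion-generated, low-norm counterfactuals can be sketched roughly as follows. This is a minimal, hypothetical sketch (an SDEdit-style partial-noising approach, not necessarily the paper's actual procedure); `denoise_fn`, `noise_level`, and `max_l2` are assumed stand-ins for a pretrained class-conditional diffusion denoiser and its hyperparameters.

```python
import torch

def generate_counterfactual(x, target_class, denoise_fn, noise_level=0.4, max_l2=5.0):
    """
    Hypothetical sketch: partially noise a clean batch x, then denoise it
    conditioned on a *different* target class. If the result stays within a
    small L2 ball of x, treat it as a low-norm counterfactual example (CE).

    denoise_fn(noisy_x, t, class_label) stands in for a pretrained
    class-conditional diffusion denoiser; its interface is an assumption,
    not the paper's actual API.
    """
    # Partially noise the input (forward diffusion up to an intermediate time t).
    t = torch.tensor(noise_level)
    noise = torch.randn_like(x)
    noisy_x = torch.sqrt(1 - t) * x + torch.sqrt(t) * noise

    # Denoise toward the counterfactual class, yielding a semantic edit of x.
    x_cf = denoise_fn(noisy_x, t, target_class)

    # Keep only low-norm edits, so the change is semantic but small.
    delta = (x_cf - x).flatten(1).norm(dim=1)
    return x_cf, delta, delta <= max_l2
```

A robust classifier could then be evaluated directly on the accepted `x_cf` samples (relabeled with `target_class`) to probe its invariance to such low-norm semantic changes.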