HOMOGRAPH: a novel textual adversarial attack architecture to unmask the susceptibility of linguistic acceptability classifiers



Bibliographic details
Published in: International Journal of Information Security, 2025-02, Vol. 24 (1), p. 6, Article 6
Authors: Aggarwal, Sajal; Bajaj, Ashish; Vishwakarma, Dinesh Kumar
Format: Article
Language: English
Online access: Full text
Description
Abstract: The increasing adoption of deep learning algorithms for automating downstream natural language processing (NLP) tasks has created a need to enhance their capability to assess linguistic acceptability. The CoLA corpus was created to aid the development of models that can accurately assess grammatical acceptability and evaluate linguistic proficiency. Transformer models, widely used across NLP tasks including the evaluation of linguistic acceptability, may possess limitations that undermine their perceived robustness. These models are susceptible to adversarial text attacks, which make inconspicuous modifications to the original input text. The modifications are chosen so that the generated adversarial examples, although still correctly classified by human observers, mislead the targeted model and thereby undermine its reliability. This paper presents a novel framework called ‘Homograph’ to generate adversarial text in a black-box setting. The efficacy of the proposed attack in undermining models designed for linguistic acceptability is significantly enhanced by its ability to generate visually similar adversarial examples that do not compromise the grammatical acceptability of the original input samples; these examples effectively deceive the model into changing its predicted label. For the linguistic acceptability task, our attack was successfully applied to five transformer models fine-tuned on the CoLA dataset: ALBERT, BERT, DistilBERT, RoBERTa, and XLNet. Our work is distinguished from existing text-based attacks by several contributions. First, we surpass previous baselines in attack success rate (ASR) and average perturbation rate (APR) for models trained on the CoLA dataset. Second, we generate more potent adversarial examples containing imperceptible modifications, thereby preserving the original label. Finally, we employ a straightforward character-level transformation technique to produce adversarial examples that closely resemble the original text.
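
To illustrate the kind of character-level transformation the abstract describes, the Python sketch below shows a generic greedy black-box homograph substitution attack against a classifier exposed only through a predict function. The homograph mapping table, the greedy left-to-right search, the perturbation budget, and the toy victim classifier are illustrative assumptions and not the paper's exact Homograph procedure.

    # Minimal sketch of a black-box homograph substitution attack.
    # Assumptions: the mapping table, greedy search order, and budget are
    # illustrative and not taken from the paper.

    # Latin characters mapped to visually similar Cyrillic homographs.
    HOMOGRAPHS = {
        "a": "\u0430",  # Cyrillic small letter a
        "c": "\u0441",  # Cyrillic small letter es
        "e": "\u0435",  # Cyrillic small letter ie
        "i": "\u0456",  # Cyrillic small letter byelorussian-ukrainian i
        "o": "\u043e",  # Cyrillic small letter o
        "p": "\u0440",  # Cyrillic small letter er
    }

    def homograph_attack(text, predict, max_ratio=0.3):
        """Greedily swap characters for visually similar homographs until the
        black-box classifier `predict` (text -> label) changes its prediction."""
        original_label = predict(text)
        chars = list(text)
        budget = max(1, int(len(chars) * max_ratio))  # cap on perturbed characters
        perturbed = 0
        for idx, ch in enumerate(chars):
            if perturbed >= budget:
                break
            substitute = HOMOGRAPHS.get(ch.lower())
            if substitute is None:
                continue
            chars[idx] = substitute
            perturbed += 1
            candidate = "".join(chars)
            if predict(candidate) != original_label:
                return candidate  # label flipped: adversarial example found
        return None  # no flip within the perturbation budget

    if __name__ == "__main__":
        # Toy stand-in victim: "acceptable" iff the text is pure ASCII.
        def toy_predict(text):
            return "acceptable" if text.isascii() else "unacceptable"

        print(homograph_attack("The cat sat on the mat.", toy_predict))

In a real evaluation, the predict callable would wrap a fine-tuned acceptability classifier (e.g. one of the CoLA models named above), and ASR and APR would be measured over a held-out evaluation set.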
ISSN: 1615-5262, 1615-5270
DOI: 10.1007/s10207-024-00925-w