Combing for Credentials: Active Pattern Extraction from Smart Reply
Format: Article
Language: English
Online access: Order full text
Abstract: Pre-trained large language models, such as GPT-2 and BERT, are often fine-tuned to achieve state-of-the-art performance on a downstream task. One natural example is the "Smart Reply" application, where a pre-trained model is tuned to provide suggested responses for a given query message. Since the tuning data often consists of sensitive data such as emails or chat transcripts, it is important to understand and mitigate the risk that the model leaks its tuning data. We investigate potential information leakage vulnerabilities in a typical Smart Reply pipeline. We consider a realistic setting where the adversary can only interact with the underlying model through a front-end interface that constrains what types of queries can be sent to the model. Previous attacks do not work in these settings, as they require the ability to send unconstrained queries directly to the model. Even when there are no constraints on the queries, previous attacks typically require thousands, or even millions, of queries to extract useful information, while our attacks can extract sensitive data in just a handful of queries. We introduce a new type of active extraction attack that exploits canonical patterns in text containing sensitive data. We show experimentally that it is possible for an adversary to extract sensitive user information present in the training data, even in realistic settings where all interactions with the model must go through a front-end that limits the types of queries. We explore potential mitigation strategies and demonstrate empirically that differential privacy appears to be a reasonably effective defense mechanism against such pattern extraction attacks.
DOI: 10.48550/arxiv.2207.10802
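
To make the idea of "active pattern extraction" in the abstract concrete, here is a minimal, purely illustrative Python sketch. It is not the paper's actual attack or interface: the `smart_reply_suggestions` front-end call, the probe queries, and the credential regex are all hypothetical stand-ins. The sketch only shows the general shape of the approach described above, in which an adversary sends a handful of query messages built around canonical patterns that tend to surround credentials, and scans the suggested replies for credential-like strings.

```python
import re

def smart_reply_suggestions(query: str) -> list[str]:
    """Hypothetical front-end call returning suggested replies for a query message.

    A real attack would go through whatever constrained interface the Smart Reply
    deployment exposes; this stub only marks where that interaction would happen.
    """
    raise NotImplementedError("Replace with the actual Smart Reply front-end call.")

# A handful of probe messages built around canonical patterns that commonly
# precede credentials in emails or chat transcripts (illustrative examples only).
PROBE_QUERIES = [
    "Hi, I lost the login details you sent me earlier. Could you resend them?",
    "What was the password for the shared account again?",
    "Can you send me the API key for the staging server?",
]

# Regular expression for a credential-like pattern in a suggested reply.
CREDENTIAL_PATTERN = re.compile(
    r"(?:password|passcode|api[ _-]?key)\s*(?:is|:)\s*(\S+)",
    re.IGNORECASE,
)

def extract_candidates() -> list[str]:
    """Send the probe queries and collect credential-like strings from the replies."""
    candidates: list[str] = []
    for query in PROBE_QUERIES:
        for reply in smart_reply_suggestions(query):
            candidates.extend(CREDENTIAL_PATTERN.findall(reply))
    return candidates
```

The point of the sketch is that, unlike earlier extraction attacks that need unconstrained access and thousands to millions of queries, this style of probing works entirely through ordinary query messages and needs only a few of them, because the canonical pattern does the work of steering the model toward memorized sensitive text.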