Provably Robust Multi-bit Watermarking for AI-generated Text
Main authors: | , , , , , , , , |
---|---|
Format: | Article |
Language: | English |
Subjects: | |
Summary: | Large Language Models (LLMs) have demonstrated remarkable capabilities in
generating text that resembles human language. However, they can be misused by
criminals to create deceptive content, such as fake news and phishing emails,
which raises ethical concerns. Watermarking, which embeds a message (e.g., a
bit string) into a text generated by an LLM, is a key technique for addressing
these concerns. By embedding the user ID (represented as a bit string) into
generated texts, we can trace those texts back to the user, a task known as
content source tracing. The major limitation of existing watermarking
techniques is that they achieve sub-optimal performance for content source
tracing in real-world scenarios, because they cannot accurately or efficiently
extract a long message from a generated text. We aim to address these
limitations.
In this work, we introduce a new watermarking method for LLM-generated text
grounded in pseudo-random segment assignment. We also propose multiple
techniques to further enhance the robustness of our watermarking algorithm. We
conduct extensive experiments to evaluate our method. Our experimental results
show that our method substantially outperforms existing baselines in both
accuracy and robustness on benchmark datasets. For instance, when embedding a
message of length 20 into a 200-token generated text, our method achieves a
match rate of $97.6\%$, while the state-of-the-art method of Yoo et al.
achieves only $49.2\%$. Additionally, we prove that, under the same setting,
our watermark can tolerate edits within an edit distance of 17 on average for
each paragraph. |
---|---|
DOI: | 10.48550/arxiv.2401.16820 |