Increasing Trust in Language Models through the Reuse of Verified Circuits
Format: | Article |
Language: | eng |
Online access: | Order full text |
Abstract: | Language Models (LMs) are increasingly used for a wide range of prediction tasks, but their training can often neglect rare edge cases, reducing their reliability. Here, we define a stringent standard of trustworthiness whereby the task algorithm and circuit implementation must be verified, accounting for edge cases, with no known failure modes. We show that a model can be trained to meet this standard if built using mathematically and logically specified frameworks. In this paper, we fully verify an auto-regressive transformer model for n-digit integer addition. To demonstrate the reusability of verified modules, we insert the trained integer addition model into a larger untrained model and train the combined model to perform both addition and subtraction. We find extensive reuse of the addition circuits in both tasks, easing verification of the more complex subtractor model. We discuss how inserting verified task modules into LMs can leverage model reuse to improve the verifiability and trustworthiness of the language models built with them. The reuse of verified circuits reduces the effort needed to verify more complex composite models, which we believe is a significant step towards the safety of language models. |
DOI: | 10.48550/arxiv.2402.02619 |
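
The abstract describes inserting a trained (and verified) addition model into a larger untrained model before training the combined model on both addition and subtraction. The PyTorch sketch below illustrates only that module-insertion idea; it is not the authors' code, and the architecture, layer counts, vocabulary size, and copy procedure are assumptions made purely for illustration.

```python
import torch
import torch.nn as nn

class TinyTransformer(nn.Module):
    """Small decoder-style transformer stub; all dimensions are illustrative."""
    def __init__(self, vocab_size=14, d_model=64, n_heads=4, n_layers=2):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, d_model)
        block = nn.TransformerEncoderLayer(
            d_model, n_heads, dim_feedforward=4 * d_model, batch_first=True
        )
        self.blocks = nn.TransformerEncoder(block, num_layers=n_layers)
        self.unembed = nn.Linear(d_model, vocab_size)

    def forward(self, tokens):
        x = self.embed(tokens)
        # Causal mask keeps the model auto-regressive.
        mask = nn.Transformer.generate_square_subsequent_mask(tokens.size(1))
        return self.unembed(self.blocks(x, mask=mask))

# 1) Stand-in for an addition model that has already been trained and verified.
addition_model = TinyTransformer(n_layers=2)

# 2) Larger, untrained model intended to learn both addition and subtraction.
combined_model = TinyTransformer(n_layers=3)

# 3) Copy the trained weights into the first blocks of the larger model.
combined_model.embed.load_state_dict(addition_model.embed.state_dict())
for i, trained_block in enumerate(addition_model.blocks.layers):
    combined_model.blocks.layers[i].load_state_dict(trained_block.state_dict())

# combined_model would then be trained on mixed addition/subtraction data,
# so that the inserted addition circuits can be reused by both tasks.
logits = combined_model(torch.randint(0, 14, (1, 12)))  # sanity check: (1, 12, 14)
```

In this hedged reading of the method, the benefit is that the copied blocks arrive already understood: if the larger model keeps reusing those circuits after fine-tuning, the remaining verification effort is concentrated on the newly trained components.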