Do LLMs "know" internally when they follow instructions?
Format: Article
Language: English
Abstract: Instruction-following is crucial for building AI agents with large language models (LLMs), as these models must adhere strictly to user-provided constraints and guidelines. However, LLMs often fail to follow even simple and clear instructions. To improve instruction-following behavior and prevent undesirable outputs, a deeper understanding of how LLMs' internal states relate to these outcomes is required. Our analysis of LLM internal states reveals a dimension in the input embedding space linked to successful instruction-following. We demonstrate that modifying representations along this dimension improves instruction-following success rates compared to random changes, without compromising response quality. Further investigation reveals that this dimension is more closely related to the phrasing of prompts than to the inherent difficulty of the task or instructions. This finding also suggests explanations for why LLMs sometimes fail to follow clear instructions and why prompt engineering is often effective even when the content remains largely unchanged. This work provides insight into the internal workings of LLMs' instruction-following, paving the way for reliable LLM agents.
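The intervention the abstract describes (modifying representations along a single direction to raise instruction-following success rates) is, mechanically, a small additive shift in embedding space. The sketch below illustrates that general idea only, not the paper's actual procedure: it assumes a HuggingFace-style causal LM, and both the unit vector `direction` (standing in for the discovered dimension) and the steering strength `alpha` are hypothetical placeholders.

```python
# Minimal sketch, not the paper's implementation: add a fixed vector to the
# input embeddings of a causal LM during generation. `direction` and `alpha`
# are hypothetical stand-ins for the dimension and scale the paper estimates.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"  # placeholder model, chosen only for illustration
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)
model.eval()

hidden_size = model.config.hidden_size
direction = torch.randn(hidden_size)
direction = direction / direction.norm()  # unit vector along the candidate dimension
alpha = 4.0                               # steering strength (arbitrary here)


def steer_embeddings(module, inputs, output):
    # Shift every token's input embedding along `direction`.
    return output + alpha * direction.to(output.dtype)


# Hook the token-embedding layer so the shift applies throughout generation.
handle = model.get_input_embeddings().register_forward_hook(steer_embeddings)

prompt = "List three colors. Respond in uppercase letters only."
inputs = tokenizer(prompt, return_tensors="pt")
with torch.no_grad():
    out = model.generate(**inputs, max_new_tokens=40)
print(tokenizer.decode(out[0], skip_special_tokens=True))

handle.remove()  # remove the hook so later calls run unmodified
```

The abstract does not specify how the instruction-following dimension is estimated, so `direction` is random here purely to keep the sketch self-contained; in practice it would be replaced by the vector identified from the model's internal states.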
DOI: 10.48550/arxiv.2410.14516