From Language Modeling to Instruction Following: Understanding the Behavior Shift in LLMs after Instruction Tuning
Main authors:
Format: Article
Language: eng
Online access: Order full text
Summary: Large Language Models (LLMs) have achieved remarkable success, and instruction tuning is the critical step in aligning LLMs with user intentions. In this work, we investigate how instruction tuning adjusts pre-trained models, with a focus on their intrinsic changes. Specifically, we first develop several local and global explanation methods, including a gradient-based method for input-output attribution and techniques for interpreting patterns and concepts in self-attention and feed-forward layers. The impact of instruction tuning is then studied by comparing the explanations derived from the pre-trained and instruction-tuned models. This approach provides an internal perspective on the model shifts at a human-comprehensible level. Our findings reveal three significant impacts of instruction tuning: 1) It empowers LLMs to recognize the instruction parts of user prompts and promotes response generation consistently conditioned on those instructions. 2) It encourages the self-attention heads to capture more word-word relationships involving instruction verbs. 3) It encourages the feed-forward networks to rotate their pre-trained knowledge toward user-oriented tasks. These insights contribute to a more comprehensive understanding of instruction tuning and lay the groundwork for future work aimed at explaining and optimizing LLMs for various applications. Our code and data are publicly available at https://github.com/JacksonWuxs/Interpret_Instruction_Tuning_LLMs.
DOI: 10.48550/arxiv.2310.00492
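The gradient-based input-output attribution mentioned in the summary can be illustrated with a minimal input-times-gradient sketch. This is not the authors' exact implementation: the saliency definition, the choice to attribute the summed target-token log-likelihood, and the checkpoint names below are assumptions for illustration only.

```python
# Minimal sketch of input-times-gradient attribution for a causal LM.
# Assumptions (not from the paper): the saliency definition, attributing the
# summed target-token log-likelihood, and the placeholder checkpoint names.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

def token_attributions(model_name: str, prompt: str, target: str):
    tok = AutoTokenizer.from_pretrained(model_name)
    model = AutoModelForCausalLM.from_pretrained(model_name)
    model.eval()

    enc = tok(prompt + target, return_tensors="pt")
    input_ids = enc["input_ids"]                      # [1, seq_len]
    n_prompt = len(tok(prompt)["input_ids"])          # prompt/response boundary

    # Take gradients with respect to the input embeddings.
    embeds = model.get_input_embeddings()(input_ids).detach().requires_grad_(True)
    out = model(inputs_embeds=embeds, attention_mask=enc["attention_mask"])

    # Log-likelihood of the target tokens under next-token prediction.
    logprobs = torch.log_softmax(out.logits[0, :-1], dim=-1)  # position t predicts t+1
    next_ids = input_ids[0, 1:]
    token_ll = logprobs[torch.arange(next_ids.numel()), next_ids]
    target_ll = token_ll[n_prompt - 1:].sum()
    target_ll.backward()

    # Input-times-gradient saliency, summed over the embedding dimension.
    saliency = (embeds.grad * embeds).sum(dim=-1).abs()[0]
    return list(zip(tok.convert_ids_to_tokens(input_ids[0].tolist()),
                    saliency.tolist()))

# Compare explanations from a pre-trained and an instruction-tuned checkpoint
# (checkpoint names are placeholders):
# base  = token_attributions("base-checkpoint",  "Summarize: ...", " ...")
# tuned = token_attributions("tuned-checkpoint", "Summarize: ...", " ...")
```

Comparing the per-token scores returned for the same prompt under both checkpoints gives one concrete way to inspect whether the instruction-tuned model attends more to the instruction tokens, in the spirit of the comparison described in the summary.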