Understanding Game-Playing Agents with Natural Language Annotations
Main authors: | , , |
---|---|
Format: | Article |
Language: | English |
Summary: | We present a new dataset containing 10K human-annotated games of Go and show
how these natural language annotations can be used as a tool for model
interpretability. Given a board state and its associated comment, our approach
uses linear probing to predict mentions of domain-specific terms (e.g., ko,
atari) from the intermediate state representations of game-playing agents like
AlphaGo Zero. We find these game concepts are nontrivially encoded in two
distinct policy networks, one trained via imitation learning and another
trained via reinforcement learning. Furthermore, mentions of domain-specific
terms are most easily predicted from the later layers of both models,
suggesting that these policy networks encode high-level abstractions similar to
those used in the natural language annotations. |
---|---|
DOI: | 10.48550/arxiv.2204.07531 |
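
The linear probing idea described in the summary can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the "activations" are synthetic vectors rather than representations from a real policy network, the binary label stands in for "the comment mentions a given term (e.g., ko)", and all names (`train_linear_probe`, `make_synthetic_layer`, `early_layer`, `late_layer`, the `signal` parameter) are hypothetical. The signal strengths are chosen only to show how probe accuracy can be compared across layers; they are not results from the paper.

```python
import math
import random


def sigmoid(z):
    # Numerically stable logistic function.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def train_linear_probe(features, labels, epochs=200, lr=0.1):
    """Fit a logistic-regression probe with plain batch gradient descent."""
    dim = len(features[0])
    weights = [0.0] * dim
    bias = 0.0
    n = len(features)
    for _ in range(epochs):
        grad_w = [0.0] * dim
        grad_b = 0.0
        for x, y in zip(features, labels):
            p = sigmoid(sum(w * xi for w, xi in zip(weights, x)) + bias)
            err = p - y  # gradient of the log loss w.r.t. the logit
            for j in range(dim):
                grad_w[j] += err * x[j]
            grad_b += err
        weights = [w - lr * g / n for w, g in zip(weights, grad_w)]
        bias -= lr * grad_b / n
    return weights, bias


def probe_accuracy(weights, bias, features, labels):
    """Fraction of examples where the probe's thresholded logit matches the label."""
    correct = 0
    for x, y in zip(features, labels):
        logit = sum(w * xi for w, xi in zip(weights, x)) + bias
        correct += int((1 if logit > 0 else 0) == y)
    return correct / len(labels)


def make_synthetic_layer(n, dim, signal, rng):
    """Hypothetical stand-in for one layer's activations.

    Each vector is Gaussian noise; `signal` controls how strongly the
    (assumed) term-mention label is linearly encoded along dimension 0.
    """
    features, labels = [], []
    for _ in range(n):
        y = rng.randint(0, 1)
        x = [rng.gauss(0.0, 1.0) for _ in range(dim)]
        x[0] += signal if y == 1 else -signal
        features.append(x)
        labels.append(y)
    return features, labels


rng = random.Random(0)
results = {}
# Assumed setup: no linear signal in the "early" layer, a clear one in the "late" layer.
for name, signal in [("early_layer", 0.0), ("late_layer", 2.0)]:
    train_x, train_y = make_synthetic_layer(400, 8, signal, rng)
    test_x, test_y = make_synthetic_layer(200, 8, signal, rng)
    w, b = train_linear_probe(train_x, train_y)
    results[name] = probe_accuracy(w, b, test_x, test_y)
    print(name, round(results[name], 3))
```

With real data, the synthetic generator would be replaced by activations extracted from each layer of a policy network and labels derived from the annotation text. Held-out accuracy per layer then indicates where a concept is most linearly decodable, which is the kind of comparison the summary reports for later layers.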