Michelangelo: Long Context Evaluations Beyond Haystacks via Latent Structure Queries

Bibliographic Details
Published in:	arXiv.org 2024-09
Main Authors:	Vodrahalli, Kiran; Ontanon, Santiago; Tripuraneni, Nilesh; Xu, Kelvin; Jain, Sanil; Shivanna, Rakesh; Hui, Jeffrey; Dikkala, Nishanth; Kazemi, Mehran; Fatemi, Bahare; Anil, Rohan; Dyer, Ethan; Shakeri, Siamak; Vij, Roopali; Mehta, Harsh; Ramasesh, Vinay; Le, Quoc; Chi, Ed; Lu, Yifeng; Firat, Orhan; Lazaridou, Angeliki; Lespiau, Jean-Baptiste; Attaluri, Nithya; Olszewska, Kate
Format:	Article
Language:	English
Subjects:	Context; Information retrieval; Large language models; State-of-the-art reviews
Online Access:	Full text

Description
Summary:	We introduce Michelangelo: a minimal, synthetic, and unleaked long-context reasoning evaluation for large language models that is also easy to score automatically. This evaluation is derived via a novel, unifying framework for evaluations over arbitrarily long contexts which measure the model's ability to do more than retrieve a single piece of information from its context. The central idea of the Latent Structure Queries (LSQ) framework is to construct tasks which require a model to "chisel away" the irrelevant information in the context, revealing a latent structure in the context. To verify a model's understanding of this latent structure, we query the model for details of the structure. Using LSQ, we produce three diagnostic long-context evaluations across code and natural-language domains intended to provide a stronger signal of long-context language model capabilities. We perform evaluations on several state-of-the-art models and demonstrate that a) the proposed evaluations are high-signal and b) there is significant room for improvement in synthesizing long-context information.
ISSN:	2331-8422