Archon: An Architecture Search Framework for Inference-Time Techniques
Saved in:
Main authors:
Format: Article
Language: English
Subjects:
Online access: Order full text
Abstract: Inference-time techniques are emerging as highly effective tools to enhance
large language model (LLM) capabilities. However, best practices for developing
systems that combine these techniques remain underdeveloped due to our limited
understanding of the utility of individual inference-time techniques and the
interactions between them. Additionally, efficiently and automatically
searching the space of model choices, inference-time techniques, and their
compositions is challenging due to the large design space. To address these
challenges, we introduce Archon, a modular framework for selecting, combining,
and stacking layers of inference-time techniques to construct optimized LLM
systems for target benchmarks. Rather than relying on a single LLM called once,
we leverage a diverse set of LLMs and inference-time techniques, creating LLM
systems greater than the sum of their parts. Archon defines an extensible
design space, encompassing techniques such as generation ensembling, repeated
sampling, ranking, fusion, critiquing, verification, and unit testing. It
transforms the problem of building LLM systems into a hyperparameter
optimization objective. Given the available LLMs, inference-time techniques,
and compute budget, Archon utilizes hyperparameter search techniques to
discover optimized architectures for target benchmark(s). We evaluate Archon
architectures across a range of instruction-following, reasoning, and coding
benchmarks, including MT-Bench, Arena-Hard-Auto, AlpacaEval 2.0, MixEval,
MixEval Hard, MATH, and CodeContests. Archon architectures outperform frontier
models, such as GPT-4o and Claude 3.5 Sonnet, on these benchmarks, achieving an
average accuracy increase of 15.1 percentage points by using all available
LLMs. We make our code and datasets available publicly on GitHub:
https://github.com/ScalingIntelligence/Archon.
DOI: 10.48550/arxiv.2409.15254
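
The abstract frames the construction of an LLM system as a hyperparameter search over models and stacked inference-time technique layers. The following is a minimal, hypothetical sketch of that framing using plain random search; the names (`LAYER_CHOICES`, `random_architecture`, `evaluate`) and the scoring function are illustrative assumptions, not Archon's actual API or search procedure.

```python
# Hypothetical sketch: treat the choice and stacking of inference-time
# techniques (ensembling, ranking, fusion, critiquing, verification) as a
# hyperparameter search problem under a fixed evaluation budget.
import random

# Candidate technique "layers" and models that define the design space.
LAYER_CHOICES = ["ensemble", "rank", "fuse", "critique", "verify"]
MODEL_CHOICES = ["model-a", "model-b", "model-c"]


def random_architecture(max_layers: int = 4) -> dict:
    """Sample one candidate architecture from the design space."""
    depth = random.randint(1, max_layers)
    return {
        "models": random.sample(
            MODEL_CHOICES, k=random.randint(1, len(MODEL_CHOICES))
        ),
        "layers": [random.choice(LAYER_CHOICES) for _ in range(depth)],
    }


def evaluate(architecture: dict) -> float:
    """Placeholder score; a real search would run the target benchmark."""
    # Stand-in scoring so the sketch is runnable end to end.
    return random.random() + 0.1 * len(architecture["layers"])


def search(budget: int = 50) -> tuple[dict, float]:
    """Random search over architectures within the evaluation budget."""
    best_arch, best_score = None, float("-inf")
    for _ in range(budget):
        arch = random_architecture()
        score = evaluate(arch)
        if score > best_score:
            best_arch, best_score = arch, score
    return best_arch, best_score


if __name__ == "__main__":
    arch, score = search()
    print(f"Best architecture: {arch} (score {score:.3f})")
```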