Revolutionizing Database Q&A with Large Language Models: Comprehensive Benchmark and Evaluation
Abstract: The development of Large Language Models (LLMs) has revolutionized QA across various industries, including the database domain. However, there is still no comprehensive benchmark for evaluating the capabilities of different LLMs and their modular components in database QA. To this end, we introduce DQABench, the first comprehensive database QA benchmark for LLMs. DQABench features an innovative LLM-based method that automates the generation, cleaning, and rewriting of the evaluation dataset, yielding over 200,000 QA pairs each in English and Chinese. These QA pairs cover a wide range of database-related knowledge extracted from manuals, online communities, and database instances, which also enables assessment of LLMs' Retrieval-Augmented Generation (RAG) and Tool Invocation Generation (TIG) capabilities in the database QA task. Furthermore, we propose DQATestbed, a comprehensive LLM-based database QA testbed. This testbed is highly modular and scalable, with basic and advanced components such as Question Classification Routing (QCR), RAG, TIG, and Prompt Template Engineering (PTE). Moreover, DQABench provides a comprehensive evaluation pipeline that computes various metrics throughout a standardized evaluation process to ensure the accuracy and fairness of the evaluation. We use DQABench to comprehensively evaluate database QA capabilities under the proposed testbed. The evaluation reveals findings such as (i) the strengths and limitations of nine LLM-based QA bots and (ii) the performance impact and potential improvements of various service components (e.g., QCR, RAG, TIG). Our benchmark and findings will guide the future development of LLM-based database QA research.
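To make the dataset-construction idea concrete, here is a minimal sketch of a generate-clean-rewrite loop for QA pairs. Everything below is illustrative: the function names, prompts, and the naive Q:/A: parsing are assumptions, not DQABench's actual method, which is defined in the paper.

```python
# Hypothetical sketch of an LLM-driven generate -> clean -> rewrite pipeline
# in the spirit of the abstract. All names and prompts here are assumptions.
from dataclasses import dataclass

@dataclass
class QAPair:
    question: str
    answer: str
    source: str  # e.g. "manual", "community", "instance"

def llm_complete(prompt: str) -> str:
    """Placeholder for a call to any chat/completion API (assumed)."""
    raise NotImplementedError

def parse_pairs(raw: str) -> list[tuple[str, str]]:
    """Naive parser for 'Q: ... / A: ...' lines (assumed output format)."""
    pairs, q = [], None
    for line in raw.splitlines():
        if line.startswith("Q:"):
            q = line[2:].strip()
        elif line.startswith("A:") and q:
            pairs.append((q, line[2:].strip()))
            q = None
    return pairs

def generate_pairs(doc_chunk: str, source: str) -> list[QAPair]:
    """Generation stage: draft QA pairs grounded in a documentation chunk."""
    raw = llm_complete(f"Write database QA pairs based on:\n{doc_chunk}")
    return [QAPair(q, a, source) for q, a in parse_pairs(raw)]

def clean(pair: QAPair) -> bool:
    """Cleaning stage: LLM-as-judge filter keeping only faithful pairs."""
    verdict = llm_complete(
        f"Is this answer supported by database documentation?\n"
        f"Q: {pair.question}\nA: {pair.answer}\nReply YES or NO."
    )
    return verdict.strip().upper().startswith("YES")

def rewrite(pair: QAPair) -> QAPair:
    """Rewriting stage: rephrase the question without changing its intent."""
    new_q = llm_complete(f"Rewrite more naturally: {pair.question}")
    return QAPair(new_q, pair.answer, pair.source)
```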
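The abstract's modular components also suggest one natural composition: QCR routes a question, RAG or TIG supplies context, and PTE assembles the final prompt. The interfaces below are assumptions made for illustration, not DQATestbed's actual API.

```python
# A minimal sketch of how QCR, RAG, TIG, and PTE could compose into one
# QA pipeline. Component interfaces are assumed, not taken from the paper.
from typing import Callable, Protocol

class Retriever(Protocol):
    def search(self, query: str, k: int = 4) -> list[str]: ...

def classify(question: str) -> str:
    """QCR: route the question to a category (stubbed heuristic here)."""
    return "tool" if "show me" in question.lower() else "knowledge"

def answer(question: str,
           retriever: Retriever,
           run_tool: Callable[[str], str],
           llm: Callable[[str], str]) -> str:
    route = classify(question)
    if route == "tool":
        # TIG: let the LLM emit a tool call (e.g. SQL), then execute it.
        call = llm(f"Emit a SQL query answering: {question}")
        context = run_tool(call)
    else:
        # RAG: ground the answer in retrieved manual/community passages.
        context = "\n".join(retriever.search(question))
    # PTE: a prompt template combines the question with its context.
    prompt = f"Context:\n{context}\n\nQuestion: {question}\nAnswer:"
    return llm(prompt)
```

Keeping the retriever, tool runner, and LLM as injected callables mirrors the plug-and-play modularity the abstract attributes to the testbed: each component can be swapped to measure its impact in isolation.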
DOI: 10.48550/arxiv.2409.04475