BabelBench: An Omni Benchmark for Code-Driven Analysis of Multimodal and Multistructured Data
Format: Article
Language: English
Online access: Order full text
Abstract: Large language models (LLMs) have become increasingly pivotal across various
domains, especially in handling complex data types. This includes structured
data processing, as exemplified by ChartQA and ChatGPT-Ada, and multimodal
unstructured data processing as seen in Visual Question Answering (VQA). These
areas have attracted significant attention from both industry and academia.
Despite this, there remains a lack of unified evaluation methodologies for
these diverse data handling scenarios. In response, we introduce BabelBench, an
innovative benchmark framework that evaluates the proficiency of LLMs in
managing multimodal multistructured data with code execution. BabelBench
incorporates a dataset comprising 247 meticulously curated problems that
challenge the models with tasks in perception, commonsense reasoning, logical
reasoning, and more. Beyond the basic capabilities of multimodal
understanding, structured data processing, and code generation, these
tasks demand advanced capabilities in exploration, planning, reasoning, and
debugging. Our experimental findings on BabelBench indicate that even
cutting-edge models like ChatGPT 4 exhibit substantial room for improvement.
The insights derived from our comprehensive analysis offer valuable guidance
for future research within the community. The benchmark data can be found at
https://github.com/FFD8FFE/babelbench.
DOI: 10.48550/arxiv.2410.00773