MMGenBench: Evaluating the Limits of LMMs from the Text-to-Image Generation Perspective
Main Authors: | , , , , , , |
---|---|
Format: | Article |
Language: | English |
Subjects: | |
Summary: | Large Multimodal Models (LMMs) have demonstrated remarkable capabilities.
While existing benchmarks for evaluating LMMs mainly focus on image
comprehension, few works evaluate them from the image generation perspective.
To address this issue, we propose a straightforward automated evaluation
pipeline. Specifically, this pipeline requires LMMs to generate an image-prompt
from a given input image. Subsequently, it employs text-to-image generative
models to create a new image based on these generated prompts. Finally, we
evaluate the performance of LMMs by comparing the original image with the
generated one. Furthermore, we introduce MMGenBench-Test, a comprehensive
benchmark developed to evaluate LMMs across 13 distinct image patterns, and
MMGenBench-Domain, targeting the performance evaluation of LMMs within the
generative image domain. A thorough evaluation involving over 50 popular LMMs
demonstrates the effectiveness and reliability of both the pipeline and
the benchmark. Our observations indicate that numerous LMMs excelling in existing
benchmarks fail to adequately complete basic tasks related to image
understanding and description. This finding highlights the substantial
potential for performance improvement in current LMMs and suggests avenues for
future model optimization. Concurrently, our pipeline facilitates the efficient
assessment of LMM performance across diverse domains using only image
inputs. |
---|---|
DOI: | 10.48550/arxiv.2411.14062 |
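
The three-step pipeline described in the summary (image to image-prompt, image-prompt to new image, comparison of original and generated image) can be illustrated with a minimal sketch. Everything named here is an assumption made for illustration: `describe` stands in for an LMM, `generate` stands in for a text-to-image model, images are represented as plain feature vectors, and cosine similarity is used as a placeholder comparison, since the summary does not state which metric the authors use.

```python
from math import sqrt
from typing import Callable, Sequence

# Assumption: an image is represented by a feature vector (e.g. an embedding).
Image = Sequence[float]


def cosine_similarity(a: Image, b: Image) -> float:
    """Placeholder comparison metric; the record does not name the real one."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = sqrt(sum(x * x for x in a)) * sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def evaluate_lmm(
    images: Sequence[Image],
    describe: Callable[[Image], str],   # hypothetical LMM: image -> image-prompt
    generate: Callable[[str], Image],   # hypothetical text-to-image model
    similarity: Callable[[Image, Image], float] = cosine_similarity,
) -> float:
    """Return the mean similarity between each input image and its regeneration."""
    scores = []
    for original in images:
        prompt = describe(original)      # step 1: LMM writes an image-prompt
        regenerated = generate(prompt)   # step 2: new image from that prompt
        scores.append(similarity(original, regenerated))  # step 3: compare
    return sum(scores) / len(scores) if scores else 0.0
```

In this sketch a higher mean score means the prompts written by the LMM preserved more of what the comparison metric measures. Because only images are required as input, the same loop can be pointed at any image collection without extra annotations.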