A Pilot Evaluation of ChatGPT and DALL-E 2 on Decision Making and Spatial Reasoning
Main authors: | , |
---|---|
Format: | Article |
Language: | eng |
Subjects: | |
Summary: | We conduct a pilot study selectively evaluating the cognitive abilities
(decision making and spatial reasoning) of two recently released generative
transformer models, ChatGPT and DALL-E 2. Input prompts were constructed
following neutral a priori guidelines, rather than adversarial intent. Post hoc
qualitative analysis of the outputs shows that DALL-E 2 is able to generate at
least one correct image for each spatial reasoning prompt, but most images
generated are incorrect (even though the model seems to have a clear
understanding of the objects mentioned in the prompt). Similarly, in evaluating
ChatGPT on the rationality axioms developed under the classical Von
Neumann-Morgenstern utility theorem, we find that, although it demonstrates
some level of rational decision-making, many of its decisions violate at least
one of the axioms even under reasonable constructions of preferences, bets, and
decision-making prompts. ChatGPT's outputs on such problems generally tended to
be unpredictable: even as it made irrational decisions (or employed an
incorrect reasoning process) for some simpler decision-making problems, it was
able to draw correct conclusions for more complex bet structures. We briefly
comment on the nuances and challenges involved in scaling up such a 'cognitive'
evaluation or conducting it with a closed set of answer keys ('ground truth'),
given that these models are inherently generative and open-ended in responding
to prompts. |
---|---|
DOI: | 10.48550/arxiv.2302.09068 |
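
The summary refers to testing decisions against the von Neumann-Morgenstern rationality axioms. As a rough illustration only, the sketch below checks two of those axioms, completeness and transitivity, on a set of stated pairwise preferences. The function name `check_axioms`, the option labels, and the example preferences are all hypothetical; this is not the evaluation procedure used in the study, which relied on qualitative analysis of free-text outputs.

```python
from itertools import permutations


def check_axioms(options, prefers):
    """Check completeness and transitivity of stated preferences.

    `prefers` is a set of (a, b) pairs meaning "a is weakly preferred to b".
    Returns two lists: unranked pairs and intransitive triples.
    """
    # Completeness: every distinct pair must be ranked in at least one direction.
    incomplete = [
        (a, b)
        for i, a in enumerate(options)
        for b in options[i + 1:]
        if (a, b) not in prefers and (b, a) not in prefers
    ]
    # Transitivity: a over b and b over c must imply a over c.
    intransitive = [
        (a, b, c)
        for a, b, c in permutations(options, 3)
        if (a, b) in prefers and (b, c) in prefers and (a, c) not in prefers
    ]
    return incomplete, intransitive


# Hypothetical answers forming a cycle: A over B, B over C, C over A.
cyclic = {("A", "B"), ("B", "C"), ("C", "A")}
incomplete, intransitive = check_axioms(["A", "B", "C"], cyclic)

# Hypothetical answers that are consistent with both axioms.
consistent = {("A", "B"), ("B", "C"), ("A", "C")}
ok_incomplete, ok_intransitive = check_axioms(["A", "B", "C"], consistent)
```

In the cyclic case every pair is ranked, so completeness holds, but each rotation of the cycle is reported as a transitivity violation. Mapping a generative model's open-ended response onto such preference pairs is itself a judgment call, which is one of the scaling challenges the summary mentions.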