MenatQA: A New Dataset for Testing the Temporal Comprehension and Reasoning Abilities of Large Language Models
Main authors: | , , , , , , , |
---|---|
Format: | Article |
Language: | eng |
Subjects: | |
Online access: | Order full text |
Summary: | Large language models (LLMs) have shown nearly saturated performance on many
natural language processing (NLP) tasks. As a result, it is natural to assume
that LLMs have also mastered abilities such as time understanding
and reasoning. However, the temporal sensitivity of LLMs has received
insufficient research attention. To fill this gap, this paper constructs Multiple
Sensitive Factors Time QA (MenatQA), which encompasses three temporal factors
(scope factor, order factor, counterfactual factor) with a total of 2,853 samples
for evaluating the time comprehension and reasoning abilities of LLMs. This
paper tests current mainstream LLMs with different parameter sizes, ranging
from billions to hundreds of billions. The results show that most LLMs fall behind
smaller temporal reasoning models to varying degrees on these factors.
Specifically, LLMs show a significant vulnerability to temporal biases and depend
heavily on the temporal information provided in questions. Furthermore, this
paper undertakes a preliminary investigation into potential improvement
strategies by devising specific prompts and leveraging external tools. These
approaches serve as valuable baselines or references for future
research. |
---|---|
DOI: | 10.48550/arxiv.2310.05157 |