Are Large Language Models Temporally Grounded?
Are large language models (LLMs) temporally grounded? Since LLMs cannot perceive and interact with the environment, it is impossible to answer this question directly. Instead, we provide LLMs with textual narratives and probe them with respect to their common-sense knowledge of the structure and duration of events, their ability to order events along a timeline, and self-consistency within their temporal model.
Saved in:
Main Authors: | Qiu, Yifu; Zhao, Zheng; Ziser, Yftah; Korhonen, Anna; Ponti, Edoardo M; Cohen, Shay B |
Format: | Article |
Language: | eng |
Subjects: | Computer Science - Artificial Intelligence; Computer Science - Computation and Language |
Online Access: | Order full text |
creator | Qiu, Yifu; Zhao, Zheng; Ziser, Yftah; Korhonen, Anna; Ponti, Edoardo M; Cohen, Shay B |
description | Are large language models (LLMs) temporally grounded? Since LLMs cannot perceive and interact with the environment, it is impossible to answer this question directly. Instead, we provide LLMs with textual narratives and probe them with respect to their common-sense knowledge of the structure and duration of events, their ability to order events along a timeline, and self-consistency within their temporal model (e.g., temporal relations such as after and before are mutually exclusive for any pair of events). We evaluate state-of-the-art LLMs (such as LLaMA 2 and GPT-4) on three tasks reflecting these abilities. Generally, we find that LLMs lag significantly behind both human performance and small-scale, specialised LMs. In-context learning, instruction tuning, and chain-of-thought prompting reduce this gap only to a limited degree. Crucially, LLMs struggle the most with self-consistency, displaying incoherent behaviour in at least 27.23% of their predictions. Contrary to expectations, we also find that scaling the model size does not guarantee positive gains in performance. To explain these results, we study the sources from which LLMs may gather temporal information: we find that sentence ordering in unlabelled texts, available during pre-training, is only weakly correlated with event ordering. Moreover, public instruction-tuning mixtures contain few temporal tasks. Hence, we conclude that current LLMs lack a consistent temporal model of textual narratives. Code, datasets, and LLM outputs are available at https://github.com/yfqiu-nlp/temporal-llms. |
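
The self-consistency criterion mentioned in the abstract can be made concrete with a small sketch. The Python snippet below is a hypothetical illustration, not taken from the paper or the linked repository; the event names and the `predicted_before` answers are invented. It flags any pair of events for which a model asserts both "a before b" and "b before a", the kind of incoherent behaviour the abstract reports in at least 27.23% of predictions.

```python
from itertools import combinations

# Hypothetical model answers to the ordered query
# "Did event a happen before event b?" -- invented for illustration only.
predicted_before = {
    ("boarding", "takeoff"): True,
    ("takeoff", "boarding"): True,   # contradicts the line above
    ("takeoff", "landing"): True,
    ("landing", "takeoff"): False,
    ("boarding", "landing"): True,
    ("landing", "boarding"): False,
}

def is_consistent(a: str, b: str, before: dict) -> bool:
    """Exactly one of 'a before b' and 'b before a' may be predicted as true."""
    return before[(a, b)] != before[(b, a)]

events = sorted({e for pair in predicted_before for e in pair})
inconsistent = [
    (a, b) for a, b in combinations(events, 2)
    if not is_consistent(a, b, predicted_before)
]
print("Inconsistent pairs:", inconsistent)
# -> Inconsistent pairs: [('boarding', 'takeoff')]
```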
doi_str_mv | 10.48550/arxiv.2311.08398 |
format | Article |
identifier | DOI: 10.48550/arxiv.2311.08398 |
language | eng |
recordid | cdi_arxiv_primary_2311_08398 |
source | arXiv.org |
subjects | Computer Science - Artificial Intelligence; Computer Science - Computation and Language |
title | Are Large Language Models Temporally Grounded? |
url | https://sfx.bib-bvb.de/sfx_tum?ctx_ver=Z39.88-2004&ctx_enc=info:ofi/enc:UTF-8&ctx_tim=2025-02-06T18%3A05%3A23IST&url_ver=Z39.88-2004&url_ctx_fmt=infofi/fmt:kev:mtx:ctx&rfr_id=info:sid/primo.exlibrisgroup.com:primo3-Article-arxiv_GOX&rft_val_fmt=info:ofi/fmt:kev:mtx:journal&rft.genre=article&rft.atitle=Are%20Large%20Language%20Models%20Temporally%20Grounded?&rft.au=Qiu,%20Yifu&rft.date=2023-11-14&rft_id=info:doi/10.48550/arxiv.2311.08398&rft_dat=%3Carxiv_GOX%3E2311_08398%3C/arxiv_GOX%3E%3Curl%3E%3C/url%3E&disable_directlink=true&sfx.directlink=off&sfx.report_link=0&rft_id=info:oai/&rft_id=info:pmid/&rfr_iscdi=true |