ReXTime: A Benchmark Suite for Reasoning-Across-Time in Videos
We introduce ReXTime, a benchmark designed to rigorously test AI models' ability to perform temporal reasoning within video events. Specifically, ReXTime focuses on reasoning across time, i.e., human-like understanding when the question and its corresponding answer occur in different video segments. This form of reasoning, which requires an advanced understanding of cause-and-effect relationships across video segments, poses significant challenges even to frontier multimodal large language models. To facilitate this evaluation, we develop an automated pipeline for generating temporal reasoning question-answer pairs, significantly reducing the need for labor-intensive manual annotations. Our benchmark includes 921 carefully vetted validation samples and 2,143 test samples, each manually curated for accuracy and relevance. Evaluation results show that while frontier large language models outperform academic models, they still lag behind human performance by a significant 14.3% accuracy gap. Additionally, our pipeline creates a training dataset of 9,695 machine-generated samples without manual effort, which empirical studies suggest can enhance across-time reasoning via fine-tuning.
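To make "reasoning across time" concrete, the sketch below shows a question grounded in one video segment whose answer lies in a different segment. It is purely hypothetical: the field names, values, and the span-overlap (IoU) check are illustrative assumptions, not ReXTime's actual data schema or evaluation metric.

```python
# Hypothetical sketch of a reasoning-across-time QA sample.
# All field names and values are illustrative assumptions,
# not ReXTime's actual schema.
sample = {
    "video_id": "example_0001",
    "question": "Why does the man put on oven mitts?",
    "question_span": [12.0, 18.5],  # seconds where the questioned event occurs
    "answer": "He is about to take a hot tray out of the oven.",
    "answer_span": [31.0, 40.0],    # a different segment: reasoning across time
}

def iou_1d(a, b):
    """Temporal IoU between two [start, end] spans (a common way to score
    localization; whether ReXTime scores spans this way is an assumption)."""
    inter = max(0.0, min(a[1], b[1]) - max(a[0], b[0]))
    union = max(a[1], b[1]) - min(a[0], b[0])
    return inter / union if union > 0 else 0.0

pred_span = [30.0, 39.0]
print(iou_1d(pred_span, sample["answer_span"]))  # 0.8
```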
Saved in:
Main authors: | Chen, Jr-Jen; Liao, Yu-Chien; Lin, Hsi-Che; Yu, Yu-Chu; Chen, Yen-Chun; Wang, Yu-Chiang Frank |
---|---|
Format: | Article |
Language: | eng |
Subjects: | Computer Science - Computer Vision and Pattern Recognition |
Online access: | Full text at https://arxiv.org/abs/2406.19392 |
creator | Chen, Jr-Jen; Liao, Yu-Chien; Lin, Hsi-Che; Yu, Yu-Chu; Chen, Yen-Chun; Wang, Yu-Chiang Frank |
---|---|
description | We introduce ReXTime, a benchmark designed to rigorously test AI models' ability to perform temporal reasoning within video events. Specifically, ReXTime focuses on reasoning across time, i.e., human-like understanding when the question and its corresponding answer occur in different video segments. This form of reasoning, which requires an advanced understanding of cause-and-effect relationships across video segments, poses significant challenges even to frontier multimodal large language models. To facilitate this evaluation, we develop an automated pipeline for generating temporal reasoning question-answer pairs, significantly reducing the need for labor-intensive manual annotations. Our benchmark includes 921 carefully vetted validation samples and 2,143 test samples, each manually curated for accuracy and relevance. Evaluation results show that while frontier large language models outperform academic models, they still lag behind human performance by a significant 14.3% accuracy gap. Additionally, our pipeline creates a training dataset of 9,695 machine-generated samples without manual effort, which empirical studies suggest can enhance across-time reasoning via fine-tuning. |
doi_str_mv | 10.48550/arxiv.2406.19392 |
format | Article |
creationdate | 2024-06-27 |
rights | http://arxiv.org/licenses/nonexclusive-distrib/1.0 (free to read) |
link | https://arxiv.org/abs/2406.19392 |
fulltext | fulltext_linktorsrc |
identifier | DOI: 10.48550/arxiv.2406.19392 |
language | eng |
recordid | cdi_arxiv_primary_2406_19392 |
source | arXiv.org |
subjects | Computer Science - Computer Vision and Pattern Recognition |
title | ReXTime: A Benchmark Suite for Reasoning-Across-Time in Videos |