Summary of a Haystack: A Challenge to Long-Context LLMs and RAG Systems
LLMs and RAG systems are now capable of handling millions of input tokens or more. However, evaluating the output quality of such systems on long-context tasks remains challenging, as tasks like Needle-in-a-Haystack lack complexity. In this work, we argue that summarization can play a central role in such evaluation. We design a procedure to synthesize Haystacks of documents, ensuring that specific *insights* repeat across documents. The "Summary of a Haystack" (SummHay) task then requires a system to process the Haystack and generate, given a query, a summary that identifies the relevant insights and precisely cites the source documents. Since we have precise knowledge of what insights should appear in a haystack summary and what documents should be cited, we implement a highly reproducible automatic evaluation that can score summaries on two aspects: Coverage and Citation. We generate Haystacks in two domains (conversation, news) and perform a large-scale evaluation of 10 LLMs and 50 corresponding RAG systems. Our findings indicate that SummHay is an open challenge for current systems: even systems provided with an Oracle signal of document relevance lag our estimate of human performance (56%) by 10+ points on a Joint Score. Without a retriever, long-context LLMs like GPT-4o and Claude 3 Opus score below 20% on SummHay. We show SummHay can also be used to study enterprise RAG systems and position bias in long-context models. We hope future systems can equal and surpass human performance on SummHay.
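
The abstract names two automatic metrics (Coverage and Citation) and a Joint Score that combines them. As a rough illustration of how a reference-based scorer of this shape can be structured, here is a minimal Python sketch. The `Insight`/`Bullet` types, the `matches()` stub, and the way the Joint Score scales coverage credit by citation quality are illustrative assumptions, not the paper's exact protocol; in particular, the paper matches summary bullets to reference insights with an LLM judge, which the string-containment stub below merely stands in for.

```python
# Hypothetical sketch of a SummHay-style scorer (not the paper's exact metrics).
from dataclasses import dataclass, field


@dataclass
class Insight:
    text: str                                        # reference insight planted in the Haystack
    sources: set[str] = field(default_factory=set)   # gold documents containing this insight


@dataclass
class Bullet:
    text: str                                        # one bullet of the system's summary
    citations: set[str] = field(default_factory=set) # documents the system cites for it


def matches(bullet: Bullet, insight: Insight) -> bool:
    """Stub: the paper uses an LLM judge to decide semantic coverage."""
    return insight.text.lower() in bullet.text.lower()


def citation_f1(cited: set[str], gold: set[str]) -> float:
    """F1 of the cited documents against the gold source documents."""
    if not cited or not gold:
        return 0.0
    precision = len(cited & gold) / len(cited)
    recall = len(cited & gold) / len(gold)
    return 0.0 if precision + recall == 0 else 2 * precision * recall / (precision + recall)


def score_summary(bullets: list[Bullet], insights: list[Insight]) -> dict[str, float]:
    """Score one summary: Coverage, Citation (over covered insights), and a Joint Score."""
    per_insight_citation = []
    covered = 0
    for insight in insights:
        hit = next((b for b in bullets if matches(b, insight)), None)
        if hit is not None:
            covered += 1
            per_insight_citation.append(citation_f1(hit.citations, insight.sources))
    coverage = covered / len(insights)
    citation = sum(per_insight_citation) / covered if covered else 0.0
    # Assumption: combine the two aspects multiplicatively for the joint score.
    return {"coverage": coverage, "citation": citation, "joint": coverage * citation}


if __name__ == "__main__":
    insights = [Insight("remote work boosts productivity", {"doc1", "doc4"})]
    bullets = [Bullet("Several sources note remote work boosts productivity.", {"doc1"})]
    print(score_summary(bullets, insights))  # coverage 1.0, citation ~0.67, joint ~0.67
```

Because the gold insights and their source documents are known by construction, a scorer of this form is deterministic given the judge, which is what makes the evaluation highly reproducible.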
Saved in:
Main authors: Laban, Philippe; Fabbri, Alexander R; Xiong, Caiming; Wu, Chien-Sheng
Format: Article
Language: eng
Subjects: Computer Science - Computation and Language
Online access: order full text
| Field | Value |
|---|---|
| container_end_page | |
| container_issue | |
| container_start_page | |
| container_title | |
| container_volume | |
| creator | Laban, Philippe; Fabbri, Alexander R; Xiong, Caiming; Wu, Chien-Sheng |
| description | LLMs and RAG systems are now capable of handling millions of input tokens or more. However, evaluating the output quality of such systems on long-context tasks remains challenging, as tasks like Needle-in-a-Haystack lack complexity. In this work, we argue that summarization can play a central role in such evaluation. We design a procedure to synthesize Haystacks of documents, ensuring that specific *insights* repeat across documents. The "Summary of a Haystack" (SummHay) task then requires a system to process the Haystack and generate, given a query, a summary that identifies the relevant insights and precisely cites the source documents. Since we have precise knowledge of what insights should appear in a haystack summary and what documents should be cited, we implement a highly reproducible automatic evaluation that can score summaries on two aspects: Coverage and Citation. We generate Haystacks in two domains (conversation, news) and perform a large-scale evaluation of 10 LLMs and 50 corresponding RAG systems. Our findings indicate that SummHay is an open challenge for current systems: even systems provided with an Oracle signal of document relevance lag our estimate of human performance (56%) by 10+ points on a Joint Score. Without a retriever, long-context LLMs like GPT-4o and Claude 3 Opus score below 20% on SummHay. We show SummHay can also be used to study enterprise RAG systems and position bias in long-context models. We hope future systems can equal and surpass human performance on SummHay. |
| doi_str_mv | 10.48550/arxiv.2407.01370 |
| format | Article |
| fulltext | fulltext_linktorsrc |
| identifier | DOI: 10.48550/arxiv.2407.01370 |
| ispartof | |
| issn | |
| language | eng |
| recordid | cdi_arxiv_primary_2407_01370 |
| source | arXiv.org |
| subjects | Computer Science - Computation and Language |
| title | Summary of a Haystack: A Challenge to Long-Context LLMs and RAG Systems |
| url | https://sfx.bib-bvb.de/sfx_tum?ctx_ver=Z39.88-2004&ctx_enc=info:ofi/enc:UTF-8&ctx_tim=2025-01-19T16%3A18%3A09IST&url_ver=Z39.88-2004&url_ctx_fmt=infofi/fmt:kev:mtx:ctx&rfr_id=info:sid/primo.exlibrisgroup.com:primo3-Article-arxiv_GOX&rft_val_fmt=info:ofi/fmt:kev:mtx:journal&rft.genre=article&rft.atitle=Summary%20of%20a%20Haystack:%20A%20Challenge%20to%20Long-Context%20LLMs%20and%20RAG%20Systems&rft.au=Laban,%20Philippe&rft.date=2024-07-01&rft_id=info:doi/10.48550/arxiv.2407.01370&rft_dat=%3Carxiv_GOX%3E2407_01370%3C/arxiv_GOX%3E%3Curl%3E%3C/url%3E&disable_directlink=true&sfx.directlink=off&sfx.report_link=0&rft_id=info:oai/&rft_id=info:pmid/&rfr_iscdi=true |