LV-Eval: A Balanced Long-Context Benchmark with 5 Length Levels Up to 256K

State-of-the-art large language models (LLMs) are now claiming remarkable supported context lengths of 256k or even more. In contrast, the average context lengths of mainstream benchmarks are insufficient (5k-21k), and they suffer from potential knowledge leakage and inaccurate metrics, resulting in biased evaluation.

Bibliographic Details
Main Authors: Yuan, Tao; Ning, Xuefei; Zhou, Dong; Yang, Zhijie; Li, Shiyao; Zhuang, Minghui; Tan, Zheyue; Yao, Zhuyu; Lin, Dahua; Li, Boxun; Dai, Guohao; Yan, Shengen; Wang, Yu
Format: Article
Language: eng
Subjects: Computer Science - Computation and Language
Published: 2024-02-06
Online Access: Order full text (arXiv: https://arxiv.org/abs/2402.05136)

description State-of-the-art large language models (LLMs) are now claiming remarkable supported context lengths of 256k or even more. In contrast, the average context lengths of mainstream benchmarks are insufficient (5k-21k), and they suffer from potential knowledge leakage and inaccurate metrics, resulting in biased evaluation. This paper introduces LV-Eval, a challenging long-context benchmark with five length levels (16k, 32k, 64k, 128k, and 256k) reaching up to 256k words. LV-Eval features two main tasks, single-hop QA and multi-hop QA, comprising 11 bilingual datasets. The design of LV-Eval has incorporated three key techniques, namely confusing facts insertion, keyword and phrase replacement, and keyword-recall-based metric design. The advantages of LV-Eval include controllable evaluation across different context lengths, challenging test instances with confusing facts, mitigated knowledge leakage, and more objective evaluations. We evaluate 15 LLMs on LV-Eval and conduct ablation studies on the benchmarking techniques. The results reveal that: (i) Moonshot-v1 and recent large-scale open-source models, such as Qwen-2.5-72B and Llama-3.1-70B, achieve the highest performance on LV-Eval, particularly at lengths below 64k. (ii) Models exhibit distinct score trends. For example, GLM-4-9B-128k, Yi-6B-200k, and Llama3-8B-1M exhibit a relatively gentle degradation of performance, but their absolute performances may not necessarily be higher than those of LLMs with shorter context lengths. (iii) LLMs' performances can significantly degrade in the presence of confusing information, especially in the pressure test of "needle in a haystack". (iv) Issues related to knowledge leakage and inaccurate metrics introduce bias in evaluation, and these concerns are alleviated in LV-Eval. All datasets and evaluation codes are released at: https://github.com/infinigence/LVEval.
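The keyword-recall-based metric named in the description can be illustrated with a short sketch. The function names, the 0.5 recall threshold, and the linear scaling below are assumptions made for illustration, not the exact formulation used by LV-Eval; the released evaluation code at https://github.com/infinigence/LVEval defines the authoritative metric.

# Sketch of a keyword-recall-based QA metric (assumed design, not the
# exact LV-Eval implementation): the predicted answer is first checked
# for recall of annotated golden keywords, and a word-overlap F1 score
# is then scaled by that recall (or zeroed out below a threshold).
from collections import Counter

def keyword_recall(prediction: str, keywords: list[str]) -> float:
    """Fraction of golden keywords that appear in the predicted answer."""
    if not keywords:
        return 1.0
    pred = prediction.lower()
    return sum(kw.lower() in pred for kw in keywords) / len(keywords)

def token_f1(prediction: str, reference: str) -> float:
    """Plain word-overlap F1 between prediction and reference answer."""
    pred_tokens = prediction.lower().split()
    ref_tokens = reference.lower().split()
    overlap = sum((Counter(pred_tokens) & Counter(ref_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)

def keyword_recall_based_score(prediction: str, reference: str,
                               keywords: list[str],
                               threshold: float = 0.5) -> float:
    """Scale F1 by keyword recall; return 0 if recall falls below threshold."""
    kr = keyword_recall(prediction, keywords)
    if kr < threshold:
        return 0.0
    return kr * token_f1(prediction, reference)

# Example: the prediction recalls only one of the two golden keywords,
# so its word-overlap F1 is scaled down by a factor of 0.5.
print(keyword_recall_based_score(
    "The benchmark has five length levels up to 256k words.",
    "LV-Eval has five length levels reaching up to 256k words.",
    keywords=["LV-Eval", "256k"],
))

One plausible motivation for such a design, consistent with the description above, is a more objective score: word overlap with the reference only counts when the answer also recalls the keywords that pin down the gold fact.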
doi_str_mv 10.48550/arxiv.2402.05136
format Article
identifier DOI: 10.48550/arxiv.2402.05136
language eng
recordid cdi_arxiv_primary_2402_05136
source arXiv.org
subjects Computer Science - Computation and Language
title LV-Eval: A Balanced Long-Context Benchmark with 5 Length Levels Up to 256K