LV-Eval: A Balanced Long-Context Benchmark with 5 Length Levels Up to 256K

State-of-the-art large language models (LLMs) are now claiming remarkable supported context lengths of 256k or even more. In contrast, the average context lengths of mainstream benchmarks are insufficient (5k-21k), and they suffer from potential knowledge leakage and inaccurate metrics, resulting in biased evaluation.

Bibliographic Details
Main Authors: Yuan, Tao; Ning, Xuefei; Zhou, Dong; Yang, Zhijie; Li, Shiyao; Zhuang, Minghui; Tan, Zheyue; Yao, Zhuyu; Lin, Dahua; Li, Boxun; Dai, Guohao; Yan, Shengen; Wang, Yu
Format: Article
Language: eng
Subjects: Computer Science - Computation and Language
Published: 2024-02-06
Online Access: Order full text (arXiv: https://arxiv.org/abs/2402.05136)

description State-of-the-art large language models (LLMs) are now claiming remarkable supported context lengths of 256k or even more. In contrast, the average context lengths of mainstream benchmarks are insufficient (5k-21k), and they suffer from potential knowledge leakage and inaccurate metrics, resulting in biased evaluation. This paper introduces LV-Eval, a challenging long-context benchmark with five length levels (16k, 32k, 64k, 128k, and 256k) reaching up to 256k words. LV-Eval features two main tasks, single-hop QA and multi-hop QA, comprising 11 bilingual datasets. The design of LV-Eval has incorporated three key techniques, namely confusing facts insertion, keyword and phrase replacement, and keyword-recall-based metric design. The advantages of LV-Eval include controllable evaluation across different context lengths, challenging test instances with confusing facts, mitigated knowledge leakage, and more objective evaluations. We evaluate 15 LLMs on LV-Eval and conduct ablation studies on the benchmarking techniques. The results reveal that: (i) Moonshot-v1 and recent large-scale open-source models, such as Qwen-2.5-72B and Llama-3.1-70B, achieve the highest performance on LV-Eval, particularly at lengths below 64k. (ii) Models exhibit distinct score trends. For example, GLM-4-9B-128k, Yi-6B-200k, and Llama3-8B-1M exhibit a relatively gentle degradation of performance, but their absolute performances may not necessarily be higher than those of LLMs with shorter context lengths. (iii) LLMs' performances can significantly degrade in the presence of confusing information, especially in the pressure test of "needle in a haystack". (iv) Issues related to knowledge leakage and inaccurate metrics introduce bias in evaluation, and these concerns are alleviated in LV-Eval. All datasets and evaluation codes are released at: https://github.com/infinigence/LVEval.
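The keyword-recall-based metric named in the description can be illustrated with a short sketch. The function names, the 0.5 recall threshold, and the linear scaling below are assumptions made for illustration, not the exact formulation used by LV-Eval; the released evaluation code at https://github.com/infinigence/LVEval defines the authoritative metric.

# Sketch of a keyword-recall-based QA metric (assumed design, not the
# exact LV-Eval implementation): the predicted answer is first checked
# for recall of annotated golden keywords, and a word-overlap F1 score
# is then scaled by that recall (or zeroed out below a threshold).
from collections import Counter

def keyword_recall(prediction: str, keywords: list[str]) -> float:
    """Fraction of golden keywords that appear in the predicted answer."""
    if not keywords:
        return 1.0
    pred = prediction.lower()
    return sum(kw.lower() in pred for kw in keywords) / len(keywords)

def token_f1(prediction: str, reference: str) -> float:
    """Plain word-overlap F1 between prediction and reference answer."""
    pred_tokens = prediction.lower().split()
    ref_tokens = reference.lower().split()
    overlap = sum((Counter(pred_tokens) & Counter(ref_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)

def keyword_recall_based_score(prediction: str, reference: str,
                               keywords: list[str],
                               threshold: float = 0.5) -> float:
    """Scale F1 by keyword recall; return 0 if recall falls below threshold."""
    kr = keyword_recall(prediction, keywords)
    if kr < threshold:
        return 0.0
    return kr * token_f1(prediction, reference)

# Example: the prediction recalls only one of the two golden keywords,
# so its word-overlap F1 is scaled down by a factor of 0.5.
print(keyword_recall_based_score(
    "The benchmark has five length levels up to 256k words.",
    "LV-Eval has five length levels reaching up to 256k words.",
    keywords=["LV-Eval", "256k"],
))

One plausible motivation for such a design, consistent with the description above, is a more objective score: word overlap with the reference only counts when the answer also recalls the keywords that pin down the gold fact.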
doi_str_mv 10.48550/arxiv.2402.05136
format Article
identifier DOI: 10.48550/arxiv.2402.05136
language eng
recordid cdi_arxiv_primary_2402_05136
source arXiv.org
subjects Computer Science - Computation and Language
title LV-Eval: A Balanced Long-Context Benchmark with 5 Length Levels Up to 256K