SemScore: Automated Evaluation of Instruction-Tuned LLMs based on Semantic Textual Similarity
Instruction-tuned Large Language Models (LLMs) have recently showcased remarkable advancements in their ability to generate fitting responses to natural language instructions. However, many current works rely on manual evaluation to judge the quality of generated responses. Since such manual evaluation is time-consuming, it does not easily scale to the evaluation of multiple models and model variants. In this short paper, we propose a straightforward but remarkably effective evaluation metric called SemScore, in which we directly compare model outputs to gold target responses using semantic textual similarity (STS). We conduct a comparative evaluation of the model outputs of 12 prominent instruction-tuned LLMs using 8 widely-used evaluation metrics for text generation. We find that our proposed SemScore metric outperforms all other, in many cases more complex, evaluation metrics in terms of correlation to human evaluation. These findings indicate the utility of our proposed metric for the evaluation of instruction-tuned LLMs.
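The abstract describes SemScore as a direct comparison of a model output against a gold target response via semantic textual similarity. Below is a minimal sketch of how such a score could be computed; it assumes the sentence-transformers library, the all-mpnet-base-v2 encoder, and cosine similarity as the STS measure, none of which are specified in the abstract itself.

```python
# Minimal sketch of a SemScore-style evaluation: embed the model output and the
# gold target response, then score their semantic textual similarity.
# Assumptions (not stated in the abstract): the sentence-transformers library,
# the all-mpnet-base-v2 encoder, and cosine similarity as the STS measure.
from sentence_transformers import SentenceTransformer, util

encoder = SentenceTransformer("sentence-transformers/all-mpnet-base-v2")

def semscore(prediction: str, reference: str) -> float:
    """Cosine similarity between the embeddings of a model output and its
    gold target response; higher means more semantically similar."""
    embeddings = encoder.encode([prediction, reference], convert_to_tensor=True)
    return util.cos_sim(embeddings[0], embeddings[1]).item()

# Example: score one instruction-following response against its gold target.
pred = "The capital of France is Paris."
gold = "Paris is the capital city of France."
print(f"SemScore-style similarity: {semscore(pred, gold):.3f}")
```

In practice such a score would be averaged over all instruction-response pairs in an evaluation set to rank models, analogous to the model-level comparison described in the abstract.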
Saved in:
Published in: | arXiv.org, 2024-02 |
---|---|
Main authors: | Aynetdinov, Ansar; Akbik, Alan |
Format: | Article |
Language: | eng |
Subjects: | Large language models; Natural language processing; Semantics; Similarity |
Online access: | Full text |
creator | Aynetdinov, Ansar; Akbik, Alan |
format | Article |
fulltext | fulltext |
identifier | EISSN: 2331-8422 |
ispartof | arXiv.org, 2024-02 |
issn | 2331-8422 |
language | eng |
recordid | cdi_proquest_journals_2920396800 |
source | Free E-Journals |
subjects | Large language models; Natural language processing; Semantics; Similarity |
title | SemScore: Automated Evaluation of Instruction-Tuned LLMs based on Semantic Textual Similarity |
url | https://sfx.bib-bvb.de/sfx_tum?ctx_ver=Z39.88-2004&ctx_enc=info:ofi/enc:UTF-8&ctx_tim=2025-02-19T23%3A54%3A42IST&url_ver=Z39.88-2004&url_ctx_fmt=infofi/fmt:kev:mtx:ctx&rfr_id=info:sid/primo.exlibrisgroup.com:primo3-Article-proquest&rft_val_fmt=info:ofi/fmt:kev:mtx:book&rft.genre=document&rft.atitle=SemScore:%20Automated%20Evaluation%20of%20Instruction-Tuned%20LLMs%20based%20on%20Semantic%20Textual%20Similarity&rft.jtitle=arXiv.org&rft.au=Aynetdinov,%20Ansar&rft.date=2024-02-05&rft.eissn=2331-8422&rft_id=info:doi/&rft_dat=%3Cproquest%3E2920396800%3C/proquest%3E%3Curl%3E%3C/url%3E&disable_directlink=true&sfx.directlink=off&sfx.report_link=0&rft_id=info:oai/&rft_pqid=2920396800&rft_id=info:pmid/&rfr_iscdi=true |