On the effect of relevance scales in crowdsourcing relevance assessments for Information Retrieval evaluation

Relevance is a key concept in information retrieval and widely used for the evaluation of search systems using test collections. We present a comprehensive study of the effect of the choice of relevance scales on the evaluation of information retrieval systems. Our work analyzes and compares four crowdsourced scales (2-levels, 4-levels, and 100-levels ordinal scales, and a magnitude estimation scale) and two expert-labeled datasets (on 2- and 4-levels ordinal scales). We compare the scales considering internal and external agreement, the effect on IR evaluation both in terms of system effectiveness and topic ease, and we discuss the effect of such scales and datasets on the perception of relevance levels by assessors. Our analyses show that: crowdsourced judgment distributions are consistent across scales, both overall and at the per-topic level; on all scales crowdsourced judgments agree with the expert judgments, and overall the crowd assessors are able to express reliable relevance judgments; all scales lead to a similar level of external agreement with the ground truth, while the internal agreement among crowd workers is higher for fine-grained scales; more fine-grained scales consistently lead to higher correlation values for both system ranking and topic ease; finally, we found that the considered scales lead to different perceived distances between relevance levels.

• We collect relevance judgments for 4 crowdsourced scales.
• We compare the crowd judgments with two expert-labeled datasets.
• We study the effect on IR evaluation in terms of system effectiveness and topic ease.
• We release the data publicly.

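The abstract reports that finer-grained scales yield higher correlation values for system rankings. As a minimal, hedged sketch of how such a comparison is typically set up (not the authors' actual code; the effectiveness scores below are placeholder values), one can rank systems by an effectiveness measure computed under each relevance scale and correlate the two rankings with Kendall's tau:

```python
# Illustrative sketch only: comparing the system rankings induced by two
# relevance scales via Kendall's tau, a common step in IR evaluation studies.
from scipy.stats import kendalltau

# Hypothetical per-system effectiveness scores (e.g., MAP or NDCG) obtained
# when relevance judgments come from a 2-level scale vs. a 100-level scale.
scores_scale_2level = [0.31, 0.28, 0.45, 0.22, 0.39]    # placeholder values
scores_scale_100level = [0.31, 0.30, 0.44, 0.20, 0.41]  # placeholder values

# kendalltau ranks the scores internally, so raw effectiveness values can be
# passed directly; tau close to 1 means the two scales rank systems alike.
tau, p_value = kendalltau(scores_scale_2level, scores_scale_100level)
print(f"Kendall's tau between system rankings: {tau:.3f} (p = {p_value:.3f})")
```
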

Bibliographic Details
Published in: Information processing & management, 2021-11, Vol. 58 (6), p. 102688, Article 102688
Authors: Roitero, Kevin; Maddalena, Eddy; Mizzaro, Stefano; Scholer, Falk
Format: Article
Language: English
Subjects: Agreements; Crowdsourcing; Datasets; Evaluation; Information retrieval; Information Retrieval evaluation; Mathematical models; Relevance; Relevance assessment; Relevance scales; System effectiveness; Systems analysis
Online access: Full text
DOI: 10.1016/j.ipm.2021.102688
ISSN: 0306-4573
EISSN: 1873-5371
Publisher: Elsevier Ltd (Oxford)
Source: Elsevier ScienceDirect Journals