The strange case of reproducibility versus representativeness in contextual suggestion test collections

The most common approach to measuring the effectiveness of Information Retrieval systems is by using test collections. The Contextual Suggestion (CS) TREC track provides an evaluation framework for systems that recommend items to users given their geographical context. The specific nature of this track allows the participating teams to identify candidate documents either from the Open Web or from the ClueWeb12 collection, a static version of the web. In the judging pool, documents from the Open Web and from the ClueWeb12 collection are distinguished; hence, each system submission should be based on only one resource, either the Open Web (identified by URLs) or ClueWeb12 (identified by ids). To achieve reproducibility, ranking web pages from ClueWeb12 should be the preferred method for scientific evaluation of CS systems, but it has been found that systems that build their suggestion algorithms on top of input taken from the Open Web consistently achieve higher effectiveness. Because most of the systems take a rather similar approach to making contextual suggestions, this raises the question of whether systems built by researchers on top of ClueWeb12 are still representative of those that would work directly on industry-strength web search engines. Do we need to sacrifice reproducibility for the sake of representativeness?

We study the difference in effectiveness between Open Web systems and ClueWeb12 systems by analyzing the relevance assessments of documents identified from both the Open Web and ClueWeb12. We then identify documents that overlap between the relevance assessments of the Open Web and ClueWeb12, observing a dependency between the relevance assessments and whether the document was taken from the Open Web or from ClueWeb12. After that, we identify documents from the relevance assessments of the Open Web which exist in the ClueWeb12 collection but do not appear in the ClueWeb12 relevance assessments, and we use these documents to expand the ClueWeb12 relevance assessments.

Our main findings are twofold. First, our empirical analysis of the relevance assessments of two years of the CS track shows that Open Web documents receive better ratings than ClueWeb12 documents, especially if we look at the documents in the overlap. Second, our approach for selecting candidate documents from the ClueWeb12 collection based on information obtained from the Open Web is a step towards partially bridging the gap in effectiveness between Open Web and ClueWeb12 systems, while at the same time achieving reproducible results on a well-known, representative sample of the web.
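The expansion step described above, copying Open Web judgments onto pages that also exist in ClueWeb12 but were never judged as ClueWeb12 documents, can be pictured with a short sketch. The following Python is only an illustration of that idea, not the authors' code: `lookup_clueweb12_id` is a hypothetical helper that would map a judged Open Web URL to the id of the corresponding ClueWeb12 document (returning None when the page is not in the collection), and a simple TREC-style qrels layout (topic, iteration, document id, judgment) is assumed.

```python
# Illustrative sketch only (not the paper's implementation).
# Assumes TREC-style qrels lines: "<topic> <iteration> <doc> <judgment>".

def read_qrels(path):
    """Read a qrels file into a {(topic, doc): judgment} dictionary."""
    qrels = {}
    with open(path) as handle:
        for line in handle:
            topic, _iteration, doc, judgment = line.split()
            qrels[(topic, doc)] = int(judgment)
    return qrels


def expand_clueweb_qrels(openweb_qrels, clueweb_qrels, lookup_clueweb12_id):
    """Copy Open Web judgments onto ClueWeb12 ids for judged URLs that exist
    in ClueWeb12 but were never judged as ClueWeb12 documents."""
    expanded = dict(clueweb_qrels)
    for (topic, url), judgment in openweb_qrels.items():
        clueweb_id = lookup_clueweb12_id(url)  # hypothetical URL-to-id lookup; None if absent
        if clueweb_id is not None and (topic, clueweb_id) not in expanded:
            expanded[(topic, clueweb_id)] = judgment
    return expanded
```

Under these assumptions, calling `expand_clueweb_qrels(read_qrels("openweb.qrels"), read_qrels("clueweb12.qrels"), lookup_clueweb12_id)` (the file names here are placeholders) would yield the expanded assessment set against which ClueWeb12 runs could be re-evaluated.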

Bibliographic details
Published in: Information Retrieval (Boston), 2016-06, Vol. 19 (3), pp. 230-255
Main authors: Samar, Thaer; Bellogín, Alejandro; de Vries, Arjen P.
Format: Article
Language: English
Subjects: Algorithms; Collections; Computer Science; Data Mining and Knowledge Discovery; Data Structures and Information Theory; Information retrieval; Information Retrieval Evaluation Using Test Collections; Information Storage and Retrieval; Natural Language Processing (NLP); Pattern Recognition; Recommender systems; Relevance; Reproducibility; Search engines; Searches; Social networks; Studies
Publisher: Springer Netherlands, Dordrecht
DOI: 10.1007/s10791-015-9276-9
ISSN: 1386-4564
EISSN: 1573-7659
Online access: Full text