A technique for measuring the relative size and overlap of public Web search engines

Search engines are among the most useful and popular services on the Web. Users are eager to know how they compare. Which one has the largest coverage? Have they indexed the same portion of the Web? How many pages are out there? Although these questions have been debated in the popular and technical...

Detailed description

Bibliographic details
Published in: Computer networks (Amsterdam, Netherlands : 1999), 1998-04, Vol.30 (1), p.379-388
Main authors: Bharat, Krishna; Broder, Andrei
Format: Article
Language: eng
Subjects:
Online access: Full text
container_end_page 388
container_issue 1
container_start_page 379
container_title Computer networks (Amsterdam, Netherlands : 1999)
container_volume 30
creator Bharat, Krishna
Broder, Andrei
description Search engines are among the most useful and popular services on the Web. Users are eager to know how they compare. Which one has the largest coverage? Have they indexed the same portion of the Web? How many pages are out there? Although these questions have been debated in the popular and technical press, no objective evaluation methodology has been proposed and few clear answers have emerged. In this paper we describe a standardized, statistical way of measuring search engine coverage and overlap through random queries. Our technique does not require privileged access to any database. It can be implemented by third-party evaluators using only public query interfaces. We present results from our experiments showing size and overlap estimates for HotBot, AltaVista, Excite, and Infoseek as percentages of their total joint coverage in mid 1997 and in November 1997. Our method does not provide absolute values. However using data from other sources we estimate that as of November 1997 the number of pages indexed by HotBot, AltaVista, Excite, and Infoseek were respectively roughly 77M, 100M, 32M, and 17M and the joint total coverage was 160 million pages. We further conjecture that the size of the static, public Web as of November was over 200 million pages. The most startling finding is that the overlap is very small: less than 1.4% of the total coverage, or about 2.2 million pages were indexed by all four engines.
doi_str_mv 10.1016/S0169-7552(98)00127-5
format Article
fullrecord <record><control><sourceid>proquest_cross</sourceid><recordid>TN_cdi_proquest_miscellaneous_57459824</recordid><sourceformat>XML</sourceformat><sourcesystem>PC</sourcesystem><els_id>S0169755298001275</els_id><sourcerecordid>29436750</sourcerecordid><originalsourceid>FETCH-LOGICAL-c396t-98b3f8623c412536b3d733df1a154dee0c421e00e5749fa97abbd2b66a8ea9a43</originalsourceid><addsrcrecordid>eNqFkE1LxDAQhoMouH78BCGIiB6q-WjS5iSL-AWCBxWPYZpO3Ug3XZN2QX-9XVc8ePEyc3ned4aHkAPOzjjj-vxxHCYrlBInpjxljIsiUxtkwstCZAXTZpNMfpFtspPSGxspXpgJeZrSHt0s-PcBadNFOkdIQ_ThlfYzpBFb6P0SafKfSCHUtFtibGFBu4Yuhqr1jr5gRRNCdDOK4dUHTHtkq4E24f7P3iXP11dPl7fZ_cPN3eX0PnPS6D4zZSWbUgvpci6U1JWsCynrhgNXeY3IXC44MoaqyE0DpoCqqkWlNZQIBnK5S47XvYvYjf-n3s59cti2ELAbkh1zypRiBR7-Ad-6IYbxN8uN0bmSSo-QWkMudilFbOwi-jnED8uZXYm236LtyqI1pf0WbdWYO_oph-SgbSIE59NvWEimZbmqv1hjOBpZeow2OY_BYe0jut7Wnf_n0BdlUpFW</addsrcrecordid><sourcetype>Aggregation Database</sourcetype><iscdi>true</iscdi><recordtype>article</recordtype><pqid>199645356</pqid></control><display><type>article</type><title>A technique for measuring the relative size and overlap of public Web search engines</title><source>Alma/SFX Local Collection</source><creator>Bharat, Krishna ; Broder, Andrei</creator><creatorcontrib>Bharat, Krishna ; Broder, Andrei</creatorcontrib><description>Search engines are among the most useful and popular services on the Web. Users are eager to know how they compare. Which one has the largest coverage? Have they indexed the same portion of the Web? How many pages are out there? Although these questions have been debated in the popular and technical press, no objective evaluation methodology has been proposed and few clear answers have emerged. In this paper we describe a standardized, statistical way of measuring search engine coverage and overlap through random queries. Our technique does not require privileged access to any database. It can be implemented by third-party evaluators using only public query interfaces. 
We present results from our experiments showing size and overlap estimates for HotBot, AltaVista, Excite, and Infoseek as percentages of their total joint coverage in mid 1997 and in November 1997. Our method does not provide absolute values. However using data from other sources we estimate that as of November 1997 the number of pages indexed by HotBot, AltaVista, Excite, and Infoseek were respectively roughly 77M, 100M, 32M, and 17M and the joint total coverage was 160 million pages. We further conjecture that the size of the static, public Web as of November was over 200 million pages. The most startling finding is that the overlap is very small: less than 1.4% of the total coverage, or about 2.2 million pages were indexed by all four engines.</description><identifier>ISSN: 0169-7552</identifier><identifier>ISSN: 1389-1286</identifier><identifier>EISSN: 1872-7069</identifier><identifier>DOI: 10.1016/S0169-7552(98)00127-5</identifier><identifier>CODEN: CNISE9</identifier><language>eng</language><publisher>Amsterdam: Elsevier B.V</publisher><subject>Applied sciences ; Computer science; control theory; systems ; Coverage ; Exact sciences and technology ; Information systems. Data bases ; Memory organisation. Data processing ; Overlap ; Search engines ; Size ; Software ; Studies ; Web page sampling ; Websites ; World Wide Web</subject><ispartof>Computer networks (Amsterdam, Netherlands : 1999), 1998-04, Vol.30 (1), p.379-388</ispartof><rights>1998</rights><rights>1998 INIST-CNRS</rights><rights>Copyright Elsevier Sequoia S.A. 
Apr 1998</rights><lds50>peer_reviewed</lds50><woscitedreferencessubscribed>false</woscitedreferencessubscribed><citedby>FETCH-LOGICAL-c396t-98b3f8623c412536b3d733df1a154dee0c421e00e5749fa97abbd2b66a8ea9a43</citedby><cites>FETCH-LOGICAL-c396t-98b3f8623c412536b3d733df1a154dee0c421e00e5749fa97abbd2b66a8ea9a43</cites></display><links><openurl>$$Topenurl_article</openurl><openurlfulltext>$$Topenurlfull_article</openurlfulltext><thumbnail>$$Tsyndetics_thumb_exl</thumbnail><link.rule.ids>309,310,314,780,784,789,790,23930,23931,25140,27924,27925</link.rule.ids><backlink>$$Uhttp://pascal-francis.inist.fr/vibad/index.php?action=getRecordDetail&amp;idt=2306386$$DView record in Pascal Francis$$Hfree_for_read</backlink></links><search><creatorcontrib>Bharat, Krishna</creatorcontrib><creatorcontrib>Broder, Andrei</creatorcontrib><title>A technique for measuring the relative size and overlap of public Web search engines</title><title>Computer networks (Amsterdam, Netherlands : 1999)</title><description>Search engines are among the most useful and popular services on the Web. Users are eager to know how they compare. Which one has the largest coverage? Have they indexed the same portion of the Web? How many pages are out there? Although these questions have been debated in the popular and technical press, no objective evaluation methodology has been proposed and few clear answers have emerged. In this paper we describe a standardized, statistical way of measuring search engine coverage and overlap through random queries. Our technique does not require privileged access to any database. It can be implemented by third-party evaluators using only public query interfaces. We present results from our experiments showing size and overlap estimates for HotBot, AltaVista, Excite, and Infoseek as percentages of their total joint coverage in mid 1997 and in November 1997. Our method does not provide absolute values. 
However using data from other sources we estimate that as of November 1997 the number of pages indexed by HotBot, AltaVista, Excite, and Infoseek were respectively roughly 77M, 100M, 32M, and 17M and the joint total coverage was 160 million pages. We further conjecture that the size of the static, public Web as of November was over 200 million pages. The most startling finding is that the overlap is very small: less than 1.4% of the total coverage, or about 2.2 million pages were indexed by all four engines.</description><subject>Applied sciences</subject><subject>Computer science; control theory; systems</subject><subject>Coverage</subject><subject>Exact sciences and technology</subject><subject>Information systems. Data bases</subject><subject>Memory organisation. Data processing</subject><subject>Overlap</subject><subject>Search engines</subject><subject>Size</subject><subject>Software</subject><subject>Studies</subject><subject>Web page sampling</subject><subject>Websites</subject><subject>World Wide Web</subject><issn>0169-7552</issn><issn>1389-1286</issn><issn>1872-7069</issn><fulltext>true</fulltext><rsrctype>article</rsrctype><creationdate>1998</creationdate><recordtype>article</recordtype><recordid>eNqFkE1LxDAQhoMouH78BCGIiB6q-WjS5iSL-AWCBxWPYZpO3Ug3XZN2QX-9XVc8ePEyc3ned4aHkAPOzjjj-vxxHCYrlBInpjxljIsiUxtkwstCZAXTZpNMfpFtspPSGxspXpgJeZrSHt0s-PcBadNFOkdIQ_ThlfYzpBFb6P0SafKfSCHUtFtibGFBu4Yuhqr1jr5gRRNCdDOK4dUHTHtkq4E24f7P3iXP11dPl7fZ_cPN3eX0PnPS6D4zZSWbUgvpci6U1JWsCynrhgNXeY3IXC44MoaqyE0DpoCqqkWlNZQIBnK5S47XvYvYjf-n3s59cti2ELAbkh1zypRiBR7-Ad-6IYbxN8uN0bmSSo-QWkMudilFbOwi-jnED8uZXYm236LtyqI1pf0WbdWYO_oph-SgbSIE59NvWEimZbmqv1hjOBpZeow2OY_BYe0jut7Wnf_n0BdlUpFW</recordid><startdate>19980401</startdate><enddate>19980401</enddate><creator>Bharat, Krishna</creator><creator>Broder, Andrei</creator><general>Elsevier B.V</general><general>Elsevier Science</general><general>Elsevier Sequoia 
S.A</general><scope>IQODW</scope><scope>AAYXX</scope><scope>CITATION</scope><scope>7SC</scope><scope>8FD</scope><scope>E3H</scope><scope>F2A</scope><scope>JQ2</scope><scope>L7M</scope><scope>L~C</scope><scope>L~D</scope></search><sort><creationdate>19980401</creationdate><title>A technique for measuring the relative size and overlap of public Web search engines</title><author>Bharat, Krishna ; Broder, Andrei</author></sort><facets><frbrtype>5</frbrtype><frbrgroupid>cdi_FETCH-LOGICAL-c396t-98b3f8623c412536b3d733df1a154dee0c421e00e5749fa97abbd2b66a8ea9a43</frbrgroupid><rsrctype>articles</rsrctype><prefilter>articles</prefilter><language>eng</language><creationdate>1998</creationdate><topic>Applied sciences</topic><topic>Computer science; control theory; systems</topic><topic>Coverage</topic><topic>Exact sciences and technology</topic><topic>Information systems. Data bases</topic><topic>Memory organisation. Data processing</topic><topic>Overlap</topic><topic>Search engines</topic><topic>Size</topic><topic>Software</topic><topic>Studies</topic><topic>Web page sampling</topic><topic>Websites</topic><topic>World Wide Web</topic><toplevel>peer_reviewed</toplevel><toplevel>online_resources</toplevel><creatorcontrib>Bharat, Krishna</creatorcontrib><creatorcontrib>Broder, Andrei</creatorcontrib><collection>Pascal-Francis</collection><collection>CrossRef</collection><collection>Computer and Information Systems Abstracts</collection><collection>Technology Research Database</collection><collection>Library &amp; Information Sciences Abstracts (LISA)</collection><collection>Library &amp; Information Science Abstracts (LISA)</collection><collection>ProQuest Computer Science Collection</collection><collection>Advanced Technologies Database with Aerospace</collection><collection>Computer and Information Systems Abstracts – Academic</collection><collection>Computer and Information Systems Abstracts Professional</collection><jtitle>Computer networks (Amsterdam, Netherlands : 
1999)</jtitle></facets><delivery><delcategory>Remote Search Resource</delcategory><fulltext>fulltext</fulltext></delivery><addata><au>Bharat, Krishna</au><au>Broder, Andrei</au><format>journal</format><genre>article</genre><ristype>JOUR</ristype><atitle>A technique for measuring the relative size and overlap of public Web search engines</atitle><jtitle>Computer networks (Amsterdam, Netherlands : 1999)</jtitle><date>1998-04-01</date><risdate>1998</risdate><volume>30</volume><issue>1</issue><spage>379</spage><epage>388</epage><pages>379-388</pages><issn>0169-7552</issn><issn>1389-1286</issn><eissn>1872-7069</eissn><coden>CNISE9</coden><abstract>Search engines are among the most useful and popular services on the Web. Users are eager to know how they compare. Which one has the largest coverage? Have they indexed the same portion of the Web? How many pages are out there? Although these questions have been debated in the popular and technical press, no objective evaluation methodology has been proposed and few clear answers have emerged. In this paper we describe a standardized, statistical way of measuring search engine coverage and overlap through random queries. Our technique does not require privileged access to any database. It can be implemented by third-party evaluators using only public query interfaces. We present results from our experiments showing size and overlap estimates for HotBot, AltaVista, Excite, and Infoseek as percentages of their total joint coverage in mid 1997 and in November 1997. Our method does not provide absolute values. However using data from other sources we estimate that as of November 1997 the number of pages indexed by HotBot, AltaVista, Excite, and Infoseek were respectively roughly 77M, 100M, 32M, and 17M and the joint total coverage was 160 million pages. We further conjecture that the size of the static, public Web as of November was over 200 million pages. 
The most startling finding is that the overlap is very small: less than 1.4% of the total coverage, or about 2.2 million pages were indexed by all four engines.</abstract><cop>Amsterdam</cop><pub>Elsevier B.V</pub><doi>10.1016/S0169-7552(98)00127-5</doi><tpages>10</tpages></addata></record>
fulltext fulltext
identifier ISSN: 0169-7552
ispartof Computer networks (Amsterdam, Netherlands : 1999), 1998-04, Vol.30 (1), p.379-388
issn 0169-7552
1389-1286
1872-7069
language eng
recordid cdi_proquest_miscellaneous_57459824
source Alma/SFX Local Collection
subjects Applied sciences
Computer science; control theory; systems
Coverage
Exact sciences and technology
Information systems. Data bases
Memory organisation. Data processing
Overlap
Search engines
Size
Software
Studies
Web page sampling
Websites
World Wide Web
title A technique for measuring the relative size and overlap of public Web search engines
url https://sfx.bib-bvb.de/sfx_tum?ctx_ver=Z39.88-2004&ctx_enc=info:ofi/enc:UTF-8&ctx_tim=2024-12-23T12%3A16%3A51IST&url_ver=Z39.88-2004&url_ctx_fmt=infofi/fmt:kev:mtx:ctx&rfr_id=info:sid/primo.exlibrisgroup.com:primo3-Article-proquest_cross&rft_val_fmt=info:ofi/fmt:kev:mtx:journal&rft.genre=article&rft.atitle=A%20technique%20for%20measuring%20the%20relative%20size%20and%20overlap%20of%20public%20Web%20search%20engines&rft.jtitle=Computer%20networks%20(Amsterdam,%20Netherlands%20:%201999)&rft.au=Bharat,%20Krishna&rft.date=1998-04-01&rft.volume=30&rft.issue=1&rft.spage=379&rft.epage=388&rft.pages=379-388&rft.issn=0169-7552&rft.eissn=1872-7069&rft.coden=CNISE9&rft_id=info:doi/10.1016/S0169-7552(98)00127-5&rft_dat=%3Cproquest_cross%3E29436750%3C/proquest_cross%3E%3Curl%3E%3C/url%3E&disable_directlink=true&sfx.directlink=off&sfx.report_link=0&rft_id=info:oai/&rft_pqid=199645356&rft_id=info:pmid/&rft_els_id=S0169755298001275&rfr_iscdi=true
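
The description above says that coverage and overlap are measured statistically through random queries, and that the method yields relative rather than absolute sizes. A minimal sketch of that idea follows, under loud assumptions: each engine's index is modelled as an in-memory Python set, and pages are drawn uniformly from it, whereas the paper can only approximate sampling and membership checks through public query interfaces. The function name, parameters, and toy indexes are hypothetical and are not taken from the record.

```python
import random


def estimate_overlap_and_ratio(index_a, index_b, sample_size, rng):
    """Estimate overlap fractions and the relative size |A| / |B|.

    Assumption: index_a and index_b are sets of page identifiers that we can
    sample uniformly. A real evaluation would sample and check pages through
    each engine's public query interface instead.
    """
    # Draw a sample from each index (sorted() gives a stable sequence to sample from).
    sample_a = rng.sample(sorted(index_a), sample_size)
    sample_b = rng.sample(sorted(index_b), sample_size)

    # Fraction of A's sampled pages that B also indexes, and vice versa.
    frac_a_in_b = sum(page in index_b for page in sample_a) / sample_size
    frac_b_in_a = sum(page in index_a for page in sample_b) / sample_size

    if frac_a_in_b == 0:
        raise ValueError("no overlap observed; the size ratio is undefined")

    # |A ∩ B| ≈ frac_a_in_b * |A| ≈ frac_b_in_a * |B|, hence:
    size_ratio = frac_b_in_a / frac_a_in_b
    return frac_a_in_b, frac_b_in_a, size_ratio


if __name__ == "__main__":
    # Toy indexes (hypothetical): A holds 1000 pages, B holds 500 of the same pages.
    engine_a = set(range(1000))
    engine_b = set(range(500, 1000))
    a_in_b, b_in_a, ratio = estimate_overlap_and_ratio(
        engine_a, engine_b, sample_size=400, rng=random.Random(0)
    )
    print(f"A in B: {a_in_b:.2f}, B in A: {b_in_a:.2f}, |A|/|B| estimate: {ratio:.2f}")
```

Only ratios come out of this calculation, which matches the description's remark that the method does not provide absolute values; turning the ratios into page counts requires an outside size estimate for at least one engine.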