Effective and efficient classification on a search-engine model

Traditional document classification frameworks, which apply the learned classifier to each document in a corpus one by one, are infeasible for extremely large document corpora, like the Web or large corporate intranets. We consider the classification problem on a corpus that has been processed prima...

Full description

Bibliographic details
Published in:	Knowledge and information systems, 2008-08, Vol. 16 (2), p. 129-154
Main authors:	Anagnostopoulos, Aris; Broder, Andrei; Punera, Kunal
Format:	Article
Language:	eng
Subjects:	Applied sciences; Artificial intelligence; Computer Science; Computer science, control theory, systems; Computer systems and distributed systems. User interface; Data Mining and Knowledge Discovery; Database Management; Exact sciences and technology; Information Storage and Retrieval; Information Systems and Communication Service; Information Systems Applications (incl.Internet); Information systems. Data bases; IT in Business; Memory organisation. Data processing; Regular Paper; Software; Speech and sound recognition and synthesis. Linguistics
Online access:	Full text

container_end_page	154
container_issue	2
container_start_page	129
container_title	Knowledge and information systems
container_volume	16
creator	Anagnostopoulos, Aris; Broder, Andrei; Punera, Kunal
description	Traditional document classification frameworks, which apply the learned classifier to each document in a corpus one by one, are infeasible for extremely large document corpora, like the Web or large corporate intranets. We consider the classification problem on a corpus that has been processed primarily for the purpose of searching, and thus our access to documents is solely through the inverted index of a large scale search engine. Our main goal is to build the “best” short query that characterizes a document class using operators normally available within search engines. We show that surprisingly good classification accuracy can be achieved on average over multiple classes by queries with as few as 10 terms. As part of our study, we enhance some of the feature-selection techniques that are found in the literature by forcing the inclusion of terms that are negatively correlated with the target class and by making use of term correlations; we show that both of those techniques can offer significant advantages. Moreover, we show that optimizing the efficiency of query execution by careful selection of terms can further reduce the query costs. More precisely, we show that on our set-up the best 10-term query can achieve 93% of the accuracy of the best SVM classifier (14,000 terms), and if we are willing to tolerate a reduction to 89% of the best SVM, we can build a 10-term query that can be executed more than twice as fast as the best 10-term query.
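The idea in the description, a short query that stands in for a classifier and includes terms negatively correlated with the target class, can be illustrated with a toy sketch. Everything here is an assumption made for illustration: the scoring rule (difference in document frequency between classes), the function names `term_scores`, `build_query`, and `matches`, the weighted-sum matching rule, and the four-document corpus. The record does not specify the authors' feature-selection techniques or query operators, so this is not their method.

```python
from collections import Counter


def term_scores(docs, labels):
    """Signed score per term: P(term | in class) - P(term | not in class)."""
    pos = [d for d, y in zip(docs, labels) if y]
    neg = [d for d, y in zip(docs, labels) if not y]
    pos_df = Counter(t for d in pos for t in set(d))
    neg_df = Counter(t for d in neg for t in set(d))
    vocab = set(pos_df) | set(neg_df)
    return {t: pos_df[t] / len(pos) - neg_df[t] / len(neg) for t in vocab}


def build_query(scores, k, min_negative=1):
    """Pick k terms by |score|, forcing in at least min_negative negative terms."""
    ranked = sorted(scores, key=lambda t: (-abs(scores[t]), t))
    # Reserve slots for the strongest negatively correlated terms first.
    chosen = [t for t in ranked if scores[t] < 0][:min_negative]
    for t in ranked:
        if len(chosen) >= k:
            break
        if t not in chosen:
            chosen.append(t)
    return {t: scores[t] for t in chosen}


def matches(query, doc, threshold=0.0):
    """A document matches when the summed weights of its query terms exceed threshold."""
    return sum(w for t, w in query.items() if t in doc) > threshold


# Hypothetical toy corpus: each document is a set of terms.
docs = [
    {"goal", "match", "league"},
    {"goal", "striker", "league"},
    {"election", "vote", "league"},
    {"election", "senate", "vote"},
]
labels = [True, True, False, False]  # True marks the target class

query = build_query(term_scores(docs, labels), k=3)
predictions = [matches(query, d) for d in docs]
print(query)
print(predictions)
```

In a real search engine the per-document loop in `matches` would be replaced by a single weighted query against the inverted index, which is the setting the description assumes; the sketch only shows how a handful of signed term weights can separate two classes.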
doi_str_mv	10.1007/s10115-007-0102-6
format	Article
publisher	London: Springer-Verlag
rights	Springer-Verlag London Limited 2007
toplevel	peer_reviewed online_resources
tpages	26
fulltext	fulltext
identifier	ISSN: 0219-1377
ispartof	Knowledge and information systems, 2008-08, Vol.16 (2), p.129-154
issn	0219-1377 0219-3116
language	eng
recordid	cdi_proquest_miscellaneous_34867687
source	Springer Nature - Complete Springer Journals
subjects	Applied sciences; Artificial intelligence; Computer Science; Computer science, control theory, systems; Computer systems and distributed systems. User interface; Data Mining and Knowledge Discovery; Database Management; Exact sciences and technology; Information Storage and Retrieval; Information Systems and Communication Service; Information Systems Applications (incl.Internet); Information systems. Data bases; IT in Business; Memory organisation. Data processing; Regular Paper; Software; Speech and sound recognition and synthesis. Linguistics
title	Effective and efficient classification on a search-engine model
url	https://sfx.bib-bvb.de/sfx_tum?ctx_ver=Z39.88-2004&ctx_enc=info:ofi/enc:UTF-8&ctx_tim=2025-01-20T12%3A23%3A33IST&url_ver=Z39.88-2004&url_ctx_fmt=infofi/fmt:kev:mtx:ctx&rfr_id=info:sid/primo.exlibrisgroup.com:primo3-Article-proquest_cross&rft_val_fmt=info:ofi/fmt:kev:mtx:journal&rft.genre=article&rft.atitle=Effective%20and%20efficient%20classification%20on%20a%20search-engine%20model&rft.jtitle=Knowledge%20and%20information%20systems&rft.au=Anagnostopoulos,%20Aris&rft.date=2008-08-01&rft.volume=16&rft.issue=2&rft.spage=129&rft.epage=154&rft.pages=129-154&rft.issn=0219-1377&rft.eissn=0219-3116&rft_id=info:doi/10.1007/s10115-007-0102-6&rft_dat=%3Cproquest_cross%3E34867687%3C/proquest_cross%3E%3Curl%3E%3C/url%3E&disable_directlink=true&sfx.directlink=off&sfx.report_link=0&rft_id=info:oai/&rft_pqid=34867687&rft_id=info:pmid/&rfr_iscdi=true