Live and learn from mistakes: A lightweight system for document classification

► Text processing. ► Clusterheads leashed to class centroids. ► Online learning with negative feedback. ► A lightweight well-performing system for online document classification is proposed. We present a Life-Long Learning from Mistakes (3LM) algorithm for document classification, which could be use...

Bibliographic details
Published in:	Information processing & management 2013-01, Vol.49 (1), p.83-98
Main authors:	Borodin, Yevgen; Polishchuk, Valentin; Mahmud, Jalal; Ramakrishnan, I.V.; Stent, Amanda
Format:	Article
Language:	eng
Subjects:	3LM; Accuracy; Algorithms; Artificial intelligence; Automatic classification; Centroid; Classification; Classifier; Clusterhead; Clusters; Distance learning; Document management; Documents; Errors; Exact sciences and technology; Heuristic; Information and communication sciences; Information processing; Information processing and retrieval; Information retrieval systems. Information and document management system; Information retrieval. Man machine relationship; Information science. Documentation; Learning; Lifelong; Lifelong learning; On-line systems; Online; Research process. Evaluation; Sciences and techniques of general use; Studies; Text processing
Online access:	Full text

container_end_page	98
container_issue	1
container_start_page	83
container_title	Information processing & management
container_volume	49
creator	Borodin, Yevgen; Polishchuk, Valentin; Mahmud, Jalal; Ramakrishnan, I.V.; Stent, Amanda
description	► Text processing. ► Clusterheads leashed to class centroids. ► Online learning with negative feedback. ► A lightweight well-performing system for online document classification is proposed. We present a Life-Long Learning from Mistakes (3LM) algorithm for document classification, which could be used in various scenarios such as spam filtering, blog classification, and web resource categorization. We extend the ideas of online clustering and batch-mode centroid-based classification to online learning with negative feedback. The 3LM is a competitive learning algorithm, which avoids over-smoothing, characteristic of the centroid-based classifiers, by using a different class representative, which we call clusterhead. The clusterheads competing for vector-space dominance are drawn toward misclassified documents, eventually bringing the model to a “balanced state” for a fixed distribution of documents. Subsequently, the clusterheads oscillate between the misclassified documents, heuristically minimizing the rate of misclassifications, an NP-complete problem. Further, the 3LM algorithm prevents over-fitting by “leashing” the clusterheads to their respective centroids. A clusterhead provably converges if its class can be separated by a hyper-plane from all other classes. Lifelong learning with fixed learning rate allows 3LM to adapt to possibly changing distribution of the data and continually learn and unlearn document classes. We report on our experiments, which demonstrate high accuracy of document classification on Reuters21578, OHSUMED, and TREC07p-spam datasets. The 3LM algorithm did not show over-fitting, while consistently outperforming centroid-based, Naïve Bayes, C4.5, AdaBoost, kNN, and SVM whose accuracy had been reported on the same three corpora.
doi_str_mv	10.1016/j.ipm.2012.02.001
format	Article
fullrecord	<record><control><sourceid>proquest_cross</sourceid><recordid>TN_cdi_proquest_miscellaneous_1323210039</recordid><sourceformat>XML</sourceformat><sourcesystem>PC</sourcesystem><els_id>S0306457312000179</els_id><sourcerecordid>1323210039</sourcerecordid><originalsourceid>FETCH-LOGICAL-c421t-bb8a7de9ceb0692ecc9d3e252ff743a614cb50c47a1d586964c7a285d81ea0d33</originalsourceid><addsrcrecordid>eNqFkU1vFDEMhqMKpC6FH8AtEkLiMovzMZkMnKoKaKUVXNpzlHU8kGVmsiSzrfrvyWorDhxAsuzLY_u1X8ZeC1gLEOb9bh3301qCkGuoAeKMrYTtVNOqTjxjK1BgGt126py9KGUHALoVcsW-buI9cT8HPpLPMx9ymvgUy-J_UvnAL_kYv_9YHuiYeXksC018SJmHhIeJ5oXj6EuJQ0S_xDS_ZM8HPxZ69VQv2N3nT7dX183m25ebq8tNg1qKpdlure8C9UhbML0kxD4okq0chk4rb4TGbQuoOy9Ca01vNHZe2jZYQR6CUhfs3WnuPqdfByqLq5qRxtHPlA7FCSWVFACq_z8qrTJGgzUVffMXukuHPNdDKgW2axVYWSlxojCnUjINbp_j5POjE-COZridq2a4oxkOaoCoPW-fJvuCfhyynzGWP43SGFu16sp9PHFUn3cfKbuCkWakEDPh4kKK_9jyG7GkndE</addsrcrecordid><sourcetype>Aggregation Database</sourcetype><iscdi>true</iscdi><recordtype>article</recordtype><pqid>1208753082</pqid></control><display><type>article</type><title>Live and learn from mistakes: A lightweight system for document classification</title><source>Elsevier ScienceDirect Journals Complete - AutoHoldings</source><creator>Borodin, Yevgen ; Polishchuk, Valentin ; Mahmud, Jalal ; Ramakrishnan, I.V. ; Stent, Amanda</creator><creatorcontrib>Borodin, Yevgen ; Polishchuk, Valentin ; Mahmud, Jalal ; Ramakrishnan, I.V. ; Stent, Amanda</creatorcontrib><description>► Text processing. ► Clusterheads leashed to class centroids. ► Online learning with negative feedback. ► A lightweight well-performing system for online document classification is proposed. We present a Life-Long Learning from Mistakes (3LM) algorithm for document classification, which could be used in various scenarios such as spam filtering, blog classification, and web resource categorization. We extend the ideas of online clustering and batch-mode centroid-based classification to online learning with negative feedback. 
The 3LM is a competitive learning algorithm, which avoids over-smoothing, characteristic of the centroid-based classifiers, by using a different class representative, which we call clusterhead. The clusterheads competing for vector-space dominance are drawn toward misclassified documents, eventually bringing the model to a “balanced state” for a fixed distribution of documents. Subsequently, the clusterheads oscillate between the misclassified documents, heuristically minimizing the rate of misclassifications, an NP-complete problem. Further, the 3LM algorithm prevents over-fitting by “leashing” the clusterheads to their respective centroids. A clusterhead provably converges if its class can be separated by a hyper-plane from all other classes. Lifelong learning with fixed learning rate allows 3LM to adapt to possibly changing distribution of the data and continually learn and unlearn document classes. We report on our experiments, which demonstrate high accuracy of document classification on Reuters21578, OHSUMED, and TREC07p-spam datasets. The 3LM algorithm did not show over-fitting, while consistently outperforming centroid-based, Naïve Bayes, C4.5, AdaBoost, kNN, and SVM whose accuracy had been reported on the same three corpora.</description><identifier>ISSN: 0306-4573</identifier><identifier>EISSN: 1873-5371</identifier><identifier>DOI: 10.1016/j.ipm.2012.02.001</identifier><identifier>CODEN: IPMADK</identifier><language>eng</language><publisher>Kidlington: Elsevier Ltd</publisher><subject>3LM ; Accuracy ; Algorithms ; Artificial intelligence ; Automatic classification ; Centroid ; Classification ; Classifier ; Clusterhead ; Clusters ; Distance learning ; Document management ; Documents ; Errors ; Exact sciences and technology ; Heuristic ; Information and communication sciences ; Information processing ; Information processing and retrieval ; Information retrieval systems. Information and document management system ; Information retrieval. 
Man machine relationship ; Information science. Documentation ; Learning ; Lifelong ; Lifelong learning ; On-line systems ; Online ; Research process. Evaluation ; Sciences and techniques of general use ; Studies ; Text processing</subject><ispartof>Information processing & management, 2013-01, Vol.49 (1), p.83-98</ispartof><rights>2012 Elsevier Ltd</rights><rights>2015 INIST-CNRS</rights><rights>Copyright Pergamon Press Inc. Jan 2013</rights><lds50>peer_reviewed</lds50><woscitedreferencessubscribed>false</woscitedreferencessubscribed><citedby>FETCH-LOGICAL-c421t-bb8a7de9ceb0692ecc9d3e252ff743a614cb50c47a1d586964c7a285d81ea0d33</citedby><cites>FETCH-LOGICAL-c421t-bb8a7de9ceb0692ecc9d3e252ff743a614cb50c47a1d586964c7a285d81ea0d33</cites></display><links><openurl>$$Topenurl_article</openurl><openurlfulltext>$$Topenurlfull_article</openurlfulltext><thumbnail>$$Tsyndetics_thumb_exl</thumbnail><linktohtml>$$Uhttps://dx.doi.org/10.1016/j.ipm.2012.02.001$$EHTML$$P50$$Gelsevier$$H</linktohtml><link.rule.ids>314,778,782,3539,4012,27906,27907,27908,45978</link.rule.ids><backlink>$$Uhttp://pascal-francis.inist.fr/vibad/index.php?action=getRecordDetail&idt=26680394$$DView record in Pascal Francis$$Hfree_for_read</backlink></links><search><creatorcontrib>Borodin, Yevgen</creatorcontrib><creatorcontrib>Polishchuk, Valentin</creatorcontrib><creatorcontrib>Mahmud, Jalal</creatorcontrib><creatorcontrib>Ramakrishnan, I.V.</creatorcontrib><creatorcontrib>Stent, Amanda</creatorcontrib><title>Live and learn from mistakes: A lightweight system for document classification</title><title>Information processing & management</title><description>► Text processing. ► Clusterheads leashed to class centroids. ► Online learning with negative feedback. ► A lightweight well-performing system for online document classification is proposed. 
We present a Life-Long Learning from Mistakes (3LM) algorithm for document classification, which could be used in various scenarios such as spam filtering, blog classification, and web resource categorization. We extend the ideas of online clustering and batch-mode centroid-based classification to online learning with negative feedback. The 3LM is a competitive learning algorithm, which avoids over-smoothing, characteristic of the centroid-based classifiers, by using a different class representative, which we call clusterhead. The clusterheads competing for vector-space dominance are drawn toward misclassified documents, eventually bringing the model to a “balanced state” for a fixed distribution of documents. Subsequently, the clusterheads oscillate between the misclassified documents, heuristically minimizing the rate of misclassifications, an NP-complete problem. Further, the 3LM algorithm prevents over-fitting by “leashing” the clusterheads to their respective centroids. A clusterhead provably converges if its class can be separated by a hyper-plane from all other classes. Lifelong learning with fixed learning rate allows 3LM to adapt to possibly changing distribution of the data and continually learn and unlearn document classes. We report on our experiments, which demonstrate high accuracy of document classification on Reuters21578, OHSUMED, and TREC07p-spam datasets. 
The 3LM algorithm did not show over-fitting, while consistently outperforming centroid-based, Naïve Bayes, C4.5, AdaBoost, kNN, and SVM whose accuracy had been reported on the same three corpora.</description><subject>3LM</subject><subject>Accuracy</subject><subject>Algorithms</subject><subject>Artificial intelligence</subject><subject>Automatic classification</subject><subject>Centroid</subject><subject>Classification</subject><subject>Classifier</subject><subject>Clusterhead</subject><subject>Clusters</subject><subject>Distance learning</subject><subject>Document management</subject><subject>Documents</subject><subject>Errors</subject><subject>Exact sciences and technology</subject><subject>Heuristic</subject><subject>Information and communication sciences</subject><subject>Information processing</subject><subject>Information processing and retrieval</subject><subject>Information retrieval systems. Information and document management system</subject><subject>Information retrieval. Man machine relationship</subject><subject>Information science. Documentation</subject><subject>Learning</subject><subject>Lifelong</subject><subject>Lifelong learning</subject><subject>On-line systems</subject><subject>Online</subject><subject>Research process. 
Evaluation</subject><subject>Sciences and techniques of general use</subject><subject>Studies</subject><subject>Text processing</subject><issn>0306-4573</issn><issn>1873-5371</issn><fulltext>true</fulltext><rsrctype>article</rsrctype><creationdate>2013</creationdate><recordtype>article</recordtype><recordid>eNqFkU1vFDEMhqMKpC6FH8AtEkLiMovzMZkMnKoKaKUVXNpzlHU8kGVmsiSzrfrvyWorDhxAsuzLY_u1X8ZeC1gLEOb9bh3301qCkGuoAeKMrYTtVNOqTjxjK1BgGt126py9KGUHALoVcsW-buI9cT8HPpLPMx9ymvgUy-J_UvnAL_kYv_9YHuiYeXksC018SJmHhIeJ5oXj6EuJQ0S_xDS_ZM8HPxZ69VQv2N3nT7dX183m25ebq8tNg1qKpdlure8C9UhbML0kxD4okq0chk4rb4TGbQuoOy9Ca01vNHZe2jZYQR6CUhfs3WnuPqdfByqLq5qRxtHPlA7FCSWVFACq_z8qrTJGgzUVffMXukuHPNdDKgW2axVYWSlxojCnUjINbp_j5POjE-COZridq2a4oxkOaoCoPW-fJvuCfhyynzGWP43SGFu16sp9PHFUn3cfKbuCkWakEDPh4kKK_9jyG7GkndE</recordid><startdate>201301</startdate><enddate>201301</enddate><creator>Borodin, Yevgen</creator><creator>Polishchuk, Valentin</creator><creator>Mahmud, Jalal</creator><creator>Ramakrishnan, I.V.</creator><creator>Stent, Amanda</creator><general>Elsevier Ltd</general><general>Elsevier</general><general>Elsevier Science Ltd</general><scope>IQODW</scope><scope>AAYXX</scope><scope>CITATION</scope><scope>E3H</scope><scope>F2A</scope><scope>7SC</scope><scope>7TA</scope><scope>8FD</scope><scope>JG9</scope><scope>JQ2</scope><scope>L7M</scope><scope>L~C</scope><scope>L~D</scope></search><sort><creationdate>201301</creationdate><title>Live and learn from mistakes: A lightweight system for document classification</title><author>Borodin, Yevgen ; Polishchuk, Valentin ; Mahmud, Jalal ; Ramakrishnan, I.V. 
; Stent, Amanda</author></sort><facets><frbrtype>5</frbrtype><frbrgroupid>cdi_FETCH-LOGICAL-c421t-bb8a7de9ceb0692ecc9d3e252ff743a614cb50c47a1d586964c7a285d81ea0d33</frbrgroupid><rsrctype>articles</rsrctype><prefilter>articles</prefilter><language>eng</language><creationdate>2013</creationdate><topic>3LM</topic><topic>Accuracy</topic><topic>Algorithms</topic><topic>Artificial intelligence</topic><topic>Automatic classification</topic><topic>Centroid</topic><topic>Classification</topic><topic>Classifier</topic><topic>Clusterhead</topic><topic>Clusters</topic><topic>Distance learning</topic><topic>Document management</topic><topic>Documents</topic><topic>Errors</topic><topic>Exact sciences and technology</topic><topic>Heuristic</topic><topic>Information and communication sciences</topic><topic>Information processing</topic><topic>Information processing and retrieval</topic><topic>Information retrieval systems. Information and document management system</topic><topic>Information retrieval. Man machine relationship</topic><topic>Information science. Documentation</topic><topic>Learning</topic><topic>Lifelong</topic><topic>Lifelong learning</topic><topic>On-line systems</topic><topic>Online</topic><topic>Research process. 
Evaluation</topic><topic>Sciences and techniques of general use</topic><topic>Studies</topic><topic>Text processing</topic><toplevel>peer_reviewed</toplevel><toplevel>online_resources</toplevel><creatorcontrib>Borodin, Yevgen</creatorcontrib><creatorcontrib>Polishchuk, Valentin</creatorcontrib><creatorcontrib>Mahmud, Jalal</creatorcontrib><creatorcontrib>Ramakrishnan, I.V.</creatorcontrib><creatorcontrib>Stent, Amanda</creatorcontrib><collection>Pascal-Francis</collection><collection>CrossRef</collection><collection>Library & Information Sciences Abstracts (LISA)</collection><collection>Library & Information Science Abstracts (LISA)</collection><collection>Computer and Information Systems Abstracts</collection><collection>Materials Business File</collection><collection>Technology Research Database</collection><collection>Materials Research Database</collection><collection>ProQuest Computer Science Collection</collection><collection>Advanced Technologies Database with Aerospace</collection><collection>Computer and Information Systems Abstracts Academic</collection><collection>Computer and Information Systems Abstracts Professional</collection><jtitle>Information processing & management</jtitle></facets><delivery><delcategory>Remote Search Resource</delcategory><fulltext>fulltext</fulltext></delivery><addata><au>Borodin, Yevgen</au><au>Polishchuk, Valentin</au><au>Mahmud, Jalal</au><au>Ramakrishnan, I.V.</au><au>Stent, Amanda</au><format>journal</format><genre>article</genre><ristype>JOUR</ristype><atitle>Live and learn from mistakes: A lightweight system for document classification</atitle><jtitle>Information processing & management</jtitle><date>2013-01</date><risdate>2013</risdate><volume>49</volume><issue>1</issue><spage>83</spage><epage>98</epage><pages>83-98</pages><issn>0306-4573</issn><eissn>1873-5371</eissn><coden>IPMADK</coden><abstract>► Text processing. ► Clusterheads leashed to class centroids. ► Online learning with negative feedback. 
► A lightweight well-performing system for online document classification is proposed. We present a Life-Long Learning from Mistakes (3LM) algorithm for document classification, which could be used in various scenarios such as spam filtering, blog classification, and web resource categorization. We extend the ideas of online clustering and batch-mode centroid-based classification to online learning with negative feedback. The 3LM is a competitive learning algorithm, which avoids over-smoothing, characteristic of the centroid-based classifiers, by using a different class representative, which we call clusterhead. The clusterheads competing for vector-space dominance are drawn toward misclassified documents, eventually bringing the model to a “balanced state” for a fixed distribution of documents. Subsequently, the clusterheads oscillate between the misclassified documents, heuristically minimizing the rate of misclassifications, an NP-complete problem. Further, the 3LM algorithm prevents over-fitting by “leashing” the clusterheads to their respective centroids. A clusterhead provably converges if its class can be separated by a hyper-plane from all other classes. Lifelong learning with fixed learning rate allows 3LM to adapt to possibly changing distribution of the data and continually learn and unlearn document classes. We report on our experiments, which demonstrate high accuracy of document classification on Reuters21578, OHSUMED, and TREC07p-spam datasets. The 3LM algorithm did not show over-fitting, while consistently outperforming centroid-based, Naïve Bayes, C4.5, AdaBoost, kNN, and SVM whose accuracy had been reported on the same three corpora.</abstract><cop>Kidlington</cop><pub>Elsevier Ltd</pub><doi>10.1016/j.ipm.2012.02.001</doi><tpages>16</tpages></addata></record>
fulltext	fulltext
identifier	ISSN: 0306-4573
ispartof	Information processing & management, 2013-01, Vol.49 (1), p.83-98
issn	0306-4573 1873-5371
language	eng
recordid	cdi_proquest_miscellaneous_1323210039
source	Elsevier ScienceDirect Journals Complete - AutoHoldings
subjects	3LM; Accuracy; Algorithms; Artificial intelligence; Automatic classification; Centroid; Classification; Classifier; Clusterhead; Clusters; Distance learning; Document management; Documents; Errors; Exact sciences and technology; Heuristic; Information and communication sciences; Information processing; Information processing and retrieval; Information retrieval systems. Information and document management system; Information retrieval. Man machine relationship; Information science. Documentation; Learning; Lifelong; Lifelong learning; On-line systems; Online; Research process. Evaluation; Sciences and techniques of general use; Studies; Text processing
title	Live and learn from mistakes: A lightweight system for document classification
url	https://sfx.bib-bvb.de/sfx_tum?ctx_ver=Z39.88-2004&ctx_enc=info:ofi/enc:UTF-8&ctx_tim=2025-01-16T10%3A17%3A48IST&url_ver=Z39.88-2004&url_ctx_fmt=infofi/fmt:kev:mtx:ctx&rfr_id=info:sid/primo.exlibrisgroup.com:primo3-Article-proquest_cross&rft_val_fmt=info:ofi/fmt:kev:mtx:journal&rft.genre=article&rft.atitle=Live%20and%20learn%20from%20mistakes:%20A%20lightweight%20system%20for%20document%20classification&rft.jtitle=Information%20processing%20&%20management&rft.au=Borodin,%20Yevgen&rft.date=2013-01&rft.volume=49&rft.issue=1&rft.spage=83&rft.epage=98&rft.pages=83-98&rft.issn=0306-4573&rft.eissn=1873-5371&rft.coden=IPMADK&rft_id=info:doi/10.1016/j.ipm.2012.02.001&rft_dat=%3Cproquest_cross%3E1323210039%3C/proquest_cross%3E%3Curl%3E%3C/url%3E&disable_directlink=true&sfx.directlink=off&sfx.report_link=0&rft_id=info:oai/&rft_pqid=1208753082&rft_id=info:pmid/&rft_els_id=S0306457312000179&rfr_iscdi=true
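
The description above explains the 3LM idea only in words: a clusterhead per class is drawn toward misclassified documents at a fixed learning rate and is "leashed" to its class centroid. The Python sketch below is one possible reading of that description, not the authors' implementation. Everything specific in it is an assumption made for illustration: the class name `LeashedClusterheads`, Euclidean distance as the similarity measure, seeding a clusterhead with the first document of its class, the centroid as a running mean, the leash as a hard cap on the clusterhead-to-centroid distance, and the default values `rate=0.2` and `leash=0.5`.

```python
import math


def distance(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((p - q) ** 2 for p, q in zip(a, b)))


class LeashedClusterheads:
    """Illustrative online classifier with one clusterhead per class.

    Hypothetical sketch: the update rule and the leash are assumptions,
    not the formulas from the paper.
    """

    def __init__(self, n_classes, dim, rate=0.2, leash=0.5):
        self.rate = rate      # fixed learning rate (assumed value)
        self.leash = leash    # max clusterhead-to-centroid distance (assumed)
        self.sums = [[0.0] * dim for _ in range(n_classes)]
        self.counts = [0] * n_classes
        self.heads = [[0.0] * dim for _ in range(n_classes)]

    def centroid(self, c):
        """Running mean of all documents seen so far for class c."""
        n = max(self.counts[c], 1)
        return [s / n for s in self.sums[c]]

    def predict(self, x):
        """Return the class whose clusterhead is nearest to x."""
        return min(range(len(self.heads)),
                   key=lambda c: distance(self.heads[c], x))

    def learn(self, x, label):
        """Classify x, then apply the feedback; returns the prediction made."""
        guess = self.predict(x)
        first = self.counts[label] == 0
        self.sums[label] = [s + v for s, v in zip(self.sums[label], x)]
        self.counts[label] += 1
        head = self.heads[label]
        if first:
            # Assumption: seed the clusterhead with the first class document.
            head = list(x)
        elif guess != label:
            # Negative feedback: pull the correct clusterhead toward x.
            head = [h + self.rate * (v - h) for h, v in zip(head, x)]
        # Leash: keep the clusterhead within `leash` of its class centroid.
        center = self.centroid(label)
        d = distance(head, center)
        if d > self.leash:
            head = [c + (h - c) * self.leash / d for h, c in zip(head, center)]
        self.heads[label] = head
        return guess
```

In this sketch a correctly classified document updates only the centroid, so the clusterhead moves on mistakes alone, which mirrors the "learning from mistakes" framing. Because the learning rate never decays, later mistakes keep shifting the clusterheads, which is one simple way to follow a changing document distribution.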