Comparative Analysis of the Impact of Discretization on the Classification with Naïve Bayes and Semi-Naïve Bayes Classifiers

While data can be discrete or continuous (i.e., ordinal numerical features), some classifiers, such as Naive Bayes (NB), either require discrete data or may perform better with it. We focus on NB due to its popularity and linear training time. We investigate the impact of eight discretization...

Bibliographic Details
Main Authors: Mizianty, M., Kurgan, L., Ogiela, M.
Format: Conference Proceeding
Language: English
container_end_page 828
container_issue
container_start_page 823
container_title 2008 Seventh International Conference on Machine Learning and Applications
container_volume
creator Mizianty, M.
Kurgan, L.
Ogiela, M.
description While data can be discrete or continuous (i.e., ordinal numerical features), some classifiers, such as Naive Bayes (NB), either require discrete data or may perform better with it. We focus on NB due to its popularity and linear training time. We investigate the impact of eight discretization algorithms (Equal Width, Equal Frequency, Maximum Entropy, IEM, CADD, CAIM, MODL, and CACC) on classification with NB and two modern semi-NB classifiers, LBR and AODE. Our comprehensive empirical study indicates that the unsupervised discretization algorithms are the fastest, while among the supervised algorithms the fastest is Maximum Entropy, followed by CAIM and IEM. The CAIM and MODL discretizers generate the lowest and the highest number of discrete values, respectively. We compare the time to build the classification model and the classification accuracy when using raw and discretized data. We show that discretization helps to improve classification with NB when compared with flexible NB, which models continuous features using Gaussian kernels. The AODE classifier obtains on average the best accuracy, and the best-performing setup combines discretization with IEM and classification with AODE. The runner-up setups include CAIM and CACC coupled with AODE, and CAIM and IEM coupled with LBR. IEM and CAIM are shown to provide statistically significant improvements across all considered datasets for the LBR and AODE classifiers when compared with using NB on the continuous data. We also show that the improved accuracy comes at the cost of substantially increased runtime.
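
The following is a minimal illustrative sketch (not the authors' experimental setup) of the kind of comparison the abstract describes, written in Python with scikit-learn: Naive Bayes applied to raw continuous features versus Naive Bayes applied to unsupervised-discretized features. Only Equal Width and Equal Frequency map directly onto KBinsDiscretizer strategies ('uniform' and 'quantile'); the supervised discretizers (Maximum Entropy, IEM, CADD, CAIM, MODL, CACC) and the semi-NB classifiers LBR and AODE are not available in scikit-learn and are omitted. The dataset, bin count, and cross-validation settings are illustrative assumptions.

# Illustrative sketch only: compares Gaussian NB on raw continuous features
# against NB on unsupervised-discretized features (Equal Width / Equal Frequency).
# The dataset, n_bins=5, and 10-fold CV are arbitrary choices, not the paper's setup.
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import CategoricalNB, GaussianNB
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import KBinsDiscretizer

X, y = load_iris(return_X_y=True)  # four continuous features, three classes

setups = {
    # Baseline: NB modelling each continuous feature with a Gaussian.
    "Gaussian NB (raw)": GaussianNB(),
    # Equal Width discretization -> NB over the resulting ordinal bins.
    "Equal Width + NB": make_pipeline(
        KBinsDiscretizer(n_bins=5, encode="ordinal", strategy="uniform"),
        CategoricalNB(min_categories=5),  # guards against bins absent in a training fold
    ),
    # Equal Frequency discretization -> NB over the resulting ordinal bins.
    "Equal Frequency + NB": make_pipeline(
        KBinsDiscretizer(n_bins=5, encode="ordinal", strategy="quantile"),
        CategoricalNB(min_categories=5),
    ),
}

for name, model in setups.items():
    scores = cross_val_score(model, X, y, cv=10)
    print(f"{name:22s} accuracy = {scores.mean():.3f} +/- {scores.std():.3f}")

The same loop could also time model fitting to mirror the paper's runtime comparison, but accuracy alone is enough to show the raw-versus-discretized contrast.
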
doi_str_mv 10.1109/ICMLA.2008.29
format Conference Proceeding
fulltext fulltext_linktorsrc
identifier ISBN: 0769534953
ispartof 2008 Seventh International Conference on Machine Learning and Applications, 2008, p.823-828
issn
language eng
recordid cdi_ieee_primary_4725074
source IEEE Electronic Library (IEL) Conference Proceedings
subjects accuracy
aode
Application software
CACC
CADD
CAIM
classification
Classification tree analysis
Computer science
continuous features
Decision trees
Discretization
Entropy
Equal Frequency
Equal Width
Frequency
IEM
lbr
Machine learning
Maximum Entropy
MODL
naive bayes
Niobium
Performance analysis
Physics
runtime
supervised discretization
unsupervised discretization
title Comparative Analysis of the Impact of Discretization on the Classification with Naïve Bayes and Semi-Naïve Bayes Classifiers
url https://sfx.bib-bvb.de/sfx_tum?ctx_ver=Z39.88-2004&ctx_enc=info:ofi/enc:UTF-8&ctx_tim=2024-12-26T03%3A21%3A11IST&url_ver=Z39.88-2004&url_ctx_fmt=infofi/fmt:kev:mtx:ctx&rfr_id=info:sid/primo.exlibrisgroup.com:primo3-Article-ieee_6IE&rft_val_fmt=info:ofi/fmt:kev:mtx:book&rft.genre=proceeding&rft.atitle=Comparative%20Analysis%20of%20the%20Impact%20of%20Discretization%20on%20the%20Classification%20with%20Na%C3%AFve%20Bayes%20and%20Semi-Na%C3%AFve%20Bayes%20Classifiers&rft.btitle=2008%20Seventh%20International%20Conference%20on%20Machine%20Learning%20and%20Applications&rft.au=Mizianty,%20M.&rft.date=2008-12&rft.spage=823&rft.epage=828&rft.pages=823-828&rft.isbn=0769534953&rft.isbn_list=9780769534954&rft_id=info:doi/10.1109/ICMLA.2008.29&rft_dat=%3Cieee_6IE%3E4725074%3C/ieee_6IE%3E%3Curl%3E%3C/url%3E&disable_directlink=true&sfx.directlink=off&sfx.report_link=0&rft_id=info:oai/&rft_id=info:pmid/&rft_ieee_id=4725074&rfr_iscdi=true