Controlling Overfitting in Classification-Tree Models of Software Quality

Predicting which modules are likely to have faults during operations is important to software developers, so that software enhancement efforts can be focused on those modules that need improvement the most. Modeling software quality with classification trees is attractive because they readily model nonmo...
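
As an illustration of the cost-based view of overfitting described in the full abstract, the following Python sketch computes an expected cost of misclassification per module and compares it between a training and a test data set. The function names, the cost values, and the use of a simple test-minus-training difference are assumptions made for illustration; this record does not give the paper's exact formula.

```python
from typing import Sequence


def expected_cost(actual: Sequence[bool], predicted: Sequence[bool],
                  cost_false_alarm: float, cost_missed_fault: float) -> float:
    """Average misclassification cost per module (hypothetical helper).

    True means "fault-prone". A false alarm flags a module that is not
    fault-prone; a missed fault fails to flag a fault-prone module.
    """
    if len(actual) != len(predicted) or len(actual) == 0:
        raise ValueError("inputs must be non-empty and of equal length")
    false_alarms = sum(1 for a, p in zip(actual, predicted) if not a and p)
    missed_faults = sum(1 for a, p in zip(actual, predicted) if a and not p)
    total = cost_false_alarm * false_alarms + cost_missed_fault * missed_faults
    return total / len(actual)


def overfitting_gap(train_actual: Sequence[bool], train_predicted: Sequence[bool],
                    test_actual: Sequence[bool], test_predicted: Sequence[bool],
                    cost_false_alarm: float, cost_missed_fault: float) -> float:
    """Assumed overfitting indicator: test cost minus training cost."""
    train_cost = expected_cost(train_actual, train_predicted,
                               cost_false_alarm, cost_missed_fault)
    test_cost = expected_cost(test_actual, test_predicted,
                              cost_false_alarm, cost_missed_fault)
    return test_cost - train_cost


# Illustrative data only: the model fits the training set perfectly
# but makes one false alarm and one missed fault on the test set.
train_actual = [True, False, False, True]
train_predicted = [True, False, False, True]
test_actual = [True, False, False, True]
test_predicted = [True, True, False, False]

gap = overfitting_gap(train_actual, train_predicted,
                      test_actual, test_predicted,
                      cost_false_alarm=1.0, cost_missed_fault=10.0)
print(gap)
```

Weighting a missed fault more heavily than a false alarm means that two models with the same error count can differ sharply in cost, which is the motivation the abstract gives for not counting misclassifications alone.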

Detailed description


Bibliographic details
Published in:	Empirical software engineering : an international journal 2001-03, Vol.6 (1), p.59-79
Main authors:	Khoshgoftaar, Taghi M; Allen, Edward B
Format:	Article
Language:	English
Subjects:	Algorithms; Classification; Computer programs; Construction; Mathematical models; Modules; Software; Software quality; Studies; Training
Online access:	Full text

container_end_page	79
container_issue	1
container_start_page	59
container_title	Empirical software engineering : an international journal
container_volume	6
creator	Khoshgoftaar, Taghi M; Allen, Edward B
description	Predicting which modules are likely to have faults during operations is important to software developers, so that software enhancement efforts can be focused on those modules that need improvement the most. Modeling software quality with classification trees is attractive because they readily model nonmonotonic relationships. In this paper, we apply the TREEDISC algorithm, which is a refinement of the CHAID algorithm, to build classification-tree models. CHAID-based algorithms differ from other classification-tree algorithms in their reliance on chi-squared tests when building the tree. Classification-tree models are vulnerable to overfitting, where the model reflects the structure of the training data set too closely. Even though a model appears to be accurate on training data, if overfitted, it may be much less accurate when applied to a current data set. To account for the severe consequences of misclassifying fault-prone modules, our measure of overfitting is based on expected costs of misclassification, rather than the total number of misclassifications. We conducted a case study of a very large telecommunications system. A two-way analysis of variance with repetitions found that TREEDISC's significance level was highly related to overfitting, and can be used to control it. Moreover, the minimum number of modules in a leaf also influenced the degree of overfitting. [PUBLICATION ABSTRACT]
doi_str_mv	10.1023/A:1009803004576
format	Article
fullrecord	<record><control><sourceid>proquest</sourceid><recordid>TN_cdi_proquest_miscellaneous_26657328</recordid><sourceformat>XML</sourceformat><sourcesystem>PC</sourcesystem><sourcerecordid>1671234131</sourcerecordid><originalsourceid>FETCH-LOGICAL-c205t-dcedc66b9aecfc4eac2f61b239ceed2cb6005f862f0fe472626eb07d95e804f73</originalsourceid><addsrcrecordid>eNp9z0tLxDAUBeAgCo6ja7fFhbip5nnTupPiY2BkEMf1kKY3kiE22qSK_96Krly4OmfxceAQcszoOaNcXFxdMkrrigpKpdKwQ2ZMaVFqYLA7dVHxUnAF--QgpS2dqJZqRhZN7PMQQ_D9c7F6x8H5nL-774smmJS889ZkH_tyPSAW97HDkIroisfo8ocZsHgYTfD585DsORMSHv3mnDzdXK-bu3K5ul00V8vScqpy2VnsLEBbG7TOSjSWO2AtF7VF7LhtgVLlKuCOOpSaAwdsqe5qhRWVTos5Of3ZfR3i24gpb158shiC6TGOacMBpt-8muDZv5CBZlxIJthET_7QbRyHfrqxqXQNquISxBcWd2rA</addsrcrecordid><sourcetype>Aggregation Database</sourcetype><iscdi>true</iscdi><recordtype>article</recordtype><pqid>879658246</pqid></control><display><type>article</type><title>Controlling Overfitting in Classification-Tree Models of Software Quality</title><source>SpringerLink Journals - AutoHoldings</source><creator>Khoshgoftaar, Taghi M ; Allen, Edward B</creator><creatorcontrib>Khoshgoftaar, Taghi M ; Allen, Edward B</creatorcontrib><description>Predictingwhich modules are likely to have faults during operations isimportant to software developers, so that software enhancementefforts can be focused on those modules that need improvementthe most. Modeling software quality with classification treesis attractive because they readily model nonmonotonic relationships.In this paper, we apply the TREEDISCalgorithm which is a refinement of the CHAID algorithmto build classification-tree models. Chaid-based algorithmsdiffer from other classification-tree algorithms in their relianceon chi-squared tests when building the tree. Classification-treemodels are vulnerable to overfitting, where the model reflectsthe structure of the training data set too closely. 
Even thougha model appears to be accurate on training data, if overfitted,it may be much less accurate when applied to a current data set.To account for the severe consequences of misclassifying fault-pronemodules, our measure of overfitting is based on expected costsof misclassification, rather than the total number of misclassifications.We conducted a case study of a very large telecommunicationssystem. A two-way analysis of variance with repetitions foundthat TREEDISC's significance level was highly relatedto overfitting, and can be used to control it. Moreover, theminimum number of modules in a leaf also influenced the degreeof overfitting.[PUBLICATION ABSTRACT]</description><identifier>ISSN: 1382-3256</identifier><identifier>EISSN: 1573-7616</identifier><identifier>DOI: 10.1023/A:1009803004576</identifier><language>eng</language><publisher>Dordrecht: Springer Nature B.V</publisher><subject>Algorithms ; Classification ; Computer programs ; Construction ; Mathematical models ; Modules ; Software ; Software quality ; Studies ; Training</subject><ispartof>Empirical software engineering : an international journal, 2001-03, Vol.6 (1), p.59-79</ispartof><rights>Kluwer Academic Publishers 2001</rights><lds50>peer_reviewed</lds50><woscitedreferencessubscribed>false</woscitedreferencessubscribed><citedby>FETCH-LOGICAL-c205t-dcedc66b9aecfc4eac2f61b239ceed2cb6005f862f0fe472626eb07d95e804f73</citedby></display><links><openurl>$$Topenurl_article</openurl><openurlfulltext>$$Topenurlfull_article</openurlfulltext><thumbnail>$$Tsyndetics_thumb_exl</thumbnail><link.rule.ids>314,780,784,27924,27925</link.rule.ids></links><search><creatorcontrib>Khoshgoftaar, Taghi M</creatorcontrib><creatorcontrib>Allen, Edward B</creatorcontrib><title>Controlling Overfitting in Classification-Tree Models of Software Quality</title><title>Empirical software engineering : an international journal</title><description>Predictingwhich modules are likely to have faults during operations isimportant to 
software developers, so that software enhancementefforts can be focused on those modules that need improvementthe most. Modeling software quality with classification treesis attractive because they readily model nonmonotonic relationships.In this paper, we apply the TREEDISCalgorithm which is a refinement of the CHAID algorithmto build classification-tree models. Chaid-based algorithmsdiffer from other classification-tree algorithms in their relianceon chi-squared tests when building the tree. Classification-treemodels are vulnerable to overfitting, where the model reflectsthe structure of the training data set too closely. Even thougha model appears to be accurate on training data, if overfitted,it may be much less accurate when applied to a current data set.To account for the severe consequences of misclassifying fault-pronemodules, our measure of overfitting is based on expected costsof misclassification, rather than the total number of misclassifications.We conducted a case study of a very large telecommunicationssystem. A two-way analysis of variance with repetitions foundthat TREEDISC's significance level was highly relatedto overfitting, and can be used to control it. 
Moreover, theminimum number of modules in a leaf also influenced the degreeof overfitting.[PUBLICATION ABSTRACT]</description><subject>Algorithms</subject><subject>Classification</subject><subject>Computer programs</subject><subject>Construction</subject><subject>Mathematical models</subject><subject>Modules</subject><subject>Software</subject><subject>Software quality</subject><subject>Studies</subject><subject>Training</subject><issn>1382-3256</issn><issn>1573-7616</issn><fulltext>true</fulltext><rsrctype>article</rsrctype><creationdate>2001</creationdate><recordtype>article</recordtype><sourceid>AFKRA</sourceid><sourceid>BENPR</sourceid><sourceid>CCPQU</sourceid><sourceid>DWQXO</sourceid><recordid>eNp9z0tLxDAUBeAgCo6ja7fFhbip5nnTupPiY2BkEMf1kKY3kiE22qSK_96Krly4OmfxceAQcszoOaNcXFxdMkrrigpKpdKwQ2ZMaVFqYLA7dVHxUnAF--QgpS2dqJZqRhZN7PMQQ_D9c7F6x8H5nL-774smmJS889ZkH_tyPSAW97HDkIroisfo8ocZsHgYTfD585DsORMSHv3mnDzdXK-bu3K5ul00V8vScqpy2VnsLEBbG7TOSjSWO2AtF7VF7LhtgVLlKuCOOpSaAwdsqe5qhRWVTos5Of3ZfR3i24gpb158shiC6TGOacMBpt-8muDZv5CBZlxIJthET_7QbRyHfrqxqXQNquISxBcWd2rA</recordid><startdate>20010301</startdate><enddate>20010301</enddate><creator>Khoshgoftaar, Taghi M</creator><creator>Allen, Edward B</creator><general>Springer Nature B.V</general><scope>7SC</scope><scope>8FD</scope><scope>8FE</scope><scope>8FG</scope><scope>ABJCF</scope><scope>AFKRA</scope><scope>ARAPS</scope><scope>BENPR</scope><scope>BGLVJ</scope><scope>CCPQU</scope><scope>DWQXO</scope><scope>HCIFZ</scope><scope>JQ2</scope><scope>L6V</scope><scope>L7M</scope><scope>L~C</scope><scope>L~D</scope><scope>M7S</scope><scope>P5Z</scope><scope>P62</scope><scope>PQEST</scope><scope>PQQKQ</scope><scope>PQUKI</scope><scope>PRINS</scope><scope>PTHSS</scope><scope>S0W</scope></search><sort><creationdate>20010301</creationdate><title>Controlling Overfitting in Classification-Tree Models of Software Quality</title><author>Khoshgoftaar, Taghi M ; Allen, Edward 
B</author></sort><facets><frbrtype>5</frbrtype><frbrgroupid>cdi_FETCH-LOGICAL-c205t-dcedc66b9aecfc4eac2f61b239ceed2cb6005f862f0fe472626eb07d95e804f73</frbrgroupid><rsrctype>articles</rsrctype><prefilter>articles</prefilter><language>eng</language><creationdate>2001</creationdate><topic>Algorithms</topic><topic>Classification</topic><topic>Computer programs</topic><topic>Construction</topic><topic>Mathematical models</topic><topic>Modules</topic><topic>Software</topic><topic>Software quality</topic><topic>Studies</topic><topic>Training</topic><toplevel>peer_reviewed</toplevel><toplevel>online_resources</toplevel><creatorcontrib>Khoshgoftaar, Taghi M</creatorcontrib><creatorcontrib>Allen, Edward B</creatorcontrib><collection>Computer and Information Systems Abstracts</collection><collection>Technology Research Database</collection><collection>ProQuest SciTech Collection</collection><collection>ProQuest Technology Collection</collection><collection>Materials Science & Engineering Collection</collection><collection>ProQuest Central UK/Ireland</collection><collection>Advanced Technologies & Aerospace Collection</collection><collection>ProQuest Central</collection><collection>Technology Collection</collection><collection>ProQuest One Community College</collection><collection>ProQuest Central Korea</collection><collection>SciTech Premium Collection</collection><collection>ProQuest Computer Science Collection</collection><collection>ProQuest Engineering Collection</collection><collection>Advanced Technologies Database with Aerospace</collection><collection>Computer and Information Systems Abstracts Academic</collection><collection>Computer and Information Systems Abstracts Professional</collection><collection>Engineering Database</collection><collection>Advanced Technologies & Aerospace Database</collection><collection>ProQuest Advanced Technologies & Aerospace Collection</collection><collection>ProQuest One Academic Eastern Edition (DO NOT 
USE)</collection><collection>ProQuest One Academic</collection><collection>ProQuest One Academic UKI Edition</collection><collection>ProQuest Central China</collection><collection>Engineering Collection</collection><collection>DELNET Engineering & Technology Collection</collection><jtitle>Empirical software engineering : an international journal</jtitle></facets><delivery><delcategory>Remote Search Resource</delcategory><fulltext>fulltext</fulltext></delivery><addata><au>Khoshgoftaar, Taghi M</au><au>Allen, Edward B</au><format>journal</format><genre>article</genre><ristype>JOUR</ristype><atitle>Controlling Overfitting in Classification-Tree Models of Software Quality</atitle><jtitle>Empirical software engineering : an international journal</jtitle><date>2001-03-01</date><risdate>2001</risdate><volume>6</volume><issue>1</issue><spage>59</spage><epage>79</epage><pages>59-79</pages><issn>1382-3256</issn><eissn>1573-7616</eissn><abstract>Predictingwhich modules are likely to have faults during operations isimportant to software developers, so that software enhancementefforts can be focused on those modules that need improvementthe most. Modeling software quality with classification treesis attractive because they readily model nonmonotonic relationships.In this paper, we apply the TREEDISCalgorithm which is a refinement of the CHAID algorithmto build classification-tree models. Chaid-based algorithmsdiffer from other classification-tree algorithms in their relianceon chi-squared tests when building the tree. Classification-treemodels are vulnerable to overfitting, where the model reflectsthe structure of the training data set too closely. 
Even thougha model appears to be accurate on training data, if overfitted,it may be much less accurate when applied to a current data set.To account for the severe consequences of misclassifying fault-pronemodules, our measure of overfitting is based on expected costsof misclassification, rather than the total number of misclassifications.We conducted a case study of a very large telecommunicationssystem. A two-way analysis of variance with repetitions foundthat TREEDISC's significance level was highly relatedto overfitting, and can be used to control it. Moreover, theminimum number of modules in a leaf also influenced the degreeof overfitting.[PUBLICATION ABSTRACT]</abstract><cop>Dordrecht</cop><pub>Springer Nature B.V</pub><doi>10.1023/A:1009803004576</doi><tpages>21</tpages></addata></record>
fulltext	fulltext
identifier	ISSN: 1382-3256
ispartof	Empirical software engineering : an international journal, 2001-03, Vol.6 (1), p.59-79
issn	1382-3256 1573-7616
language	eng
recordid	cdi_proquest_miscellaneous_26657328
source	SpringerLink Journals - AutoHoldings
subjects	Algorithms; Classification; Computer programs; Construction; Mathematical models; Modules; Software; Software quality; Studies; Training
title	Controlling Overfitting in Classification-Tree Models of Software Quality
url	https://sfx.bib-bvb.de/sfx_tum?ctx_ver=Z39.88-2004&ctx_enc=info:ofi/enc:UTF-8&ctx_tim=2024-12-26T19%3A17%3A35IST&url_ver=Z39.88-2004&url_ctx_fmt=infofi/fmt:kev:mtx:ctx&rfr_id=info:sid/primo.exlibrisgroup.com:primo3-Article-proquest&rft_val_fmt=info:ofi/fmt:kev:mtx:journal&rft.genre=article&rft.atitle=Controlling%20Overfitting%20in%20Classification-Tree%20Models%20of%20Software%20Quality&rft.jtitle=Empirical%20software%20engineering%20:%20an%20international%20journal&rft.au=Khoshgoftaar,%20Taghi%20M&rft.date=2001-03-01&rft.volume=6&rft.issue=1&rft.spage=59&rft.epage=79&rft.pages=59-79&rft.issn=1382-3256&rft.eissn=1573-7616&rft_id=info:doi/10.1023/A:1009803004576&rft_dat=%3Cproquest%3E1671234131%3C/proquest%3E%3Curl%3E%3C/url%3E&disable_directlink=true&sfx.directlink=off&sfx.report_link=0&rft_id=info:oai/&rft_pqid=879658246&rft_id=info:pmid/&rfr_iscdi=true