Controlling Overfitting in Classification-Tree Models of Software Quality

Predicting which modules are likely to have faults during operations is important to software developers, so that software enhancement efforts can be focused on those modules that need improvement the most. Modeling software quality with classification trees is attractive because they readily model nonmo...

Detailed description

Bibliographic details
Published in: Empirical software engineering : an international journal 2001-03, Vol.6 (1), p.59-79
Main authors: Khoshgoftaar, Taghi M, Allen, Edward B
Format: Article
Language: eng
Online access: Full text
container_end_page 79
container_issue 1
container_start_page 59
container_title Empirical software engineering : an international journal
container_volume 6
creator Khoshgoftaar, Taghi M
Allen, Edward B
description Predicting which modules are likely to have faults during operations is important to software developers, so that software enhancement efforts can be focused on those modules that need improvement the most. Modeling software quality with classification trees is attractive because they readily model nonmonotonic relationships. In this paper, we apply the TREEDISC algorithm, which is a refinement of the CHAID algorithm, to build classification-tree models. CHAID-based algorithms differ from other classification-tree algorithms in their reliance on chi-squared tests when building the tree. Classification-tree models are vulnerable to overfitting, where the model reflects the structure of the training data set too closely. Even though a model appears to be accurate on training data, if overfitted, it may be much less accurate when applied to a current data set. To account for the severe consequences of misclassifying fault-prone modules, our measure of overfitting is based on expected costs of misclassification, rather than the total number of misclassifications. We conducted a case study of a very large telecommunications system. A two-way analysis of variance with repetitions found that TREEDISC's significance level was highly related to overfitting, and can be used to control it. Moreover, the minimum number of modules in a leaf also influenced the degree of overfitting.
doi_str_mv 10.1023/A:1009803004576
format Article
fullrecord <record><control><sourceid>proquest</sourceid><recordid>TN_cdi_proquest_miscellaneous_26657328</recordid><sourceformat>XML</sourceformat><sourcesystem>PC</sourcesystem><sourcerecordid>1671234131</sourcerecordid><originalsourceid>FETCH-LOGICAL-c205t-dcedc66b9aecfc4eac2f61b239ceed2cb6005f862f0fe472626eb07d95e804f73</originalsourceid><addsrcrecordid>eNp9z0tLxDAUBeAgCo6ja7fFhbip5nnTupPiY2BkEMf1kKY3kiE22qSK_96Krly4OmfxceAQcszoOaNcXFxdMkrrigpKpdKwQ2ZMaVFqYLA7dVHxUnAF--QgpS2dqJZqRhZN7PMQQ_D9c7F6x8H5nL-774smmJS889ZkH_tyPSAW97HDkIroisfo8ocZsHgYTfD585DsORMSHv3mnDzdXK-bu3K5ul00V8vScqpy2VnsLEBbG7TOSjSWO2AtF7VF7LhtgVLlKuCOOpSaAwdsqe5qhRWVTos5Of3ZfR3i24gpb158shiC6TGOacMBpt-8muDZv5CBZlxIJthET_7QbRyHfrqxqXQNquISxBcWd2rA</addsrcrecordid><sourcetype>Aggregation Database</sourcetype><iscdi>true</iscdi><recordtype>article</recordtype><pqid>879658246</pqid></control><display><type>article</type><title>Controlling Overfitting in Classification-Tree Models of Software Quality</title><source>SpringerLink Journals - AutoHoldings</source><creator>Khoshgoftaar, Taghi M ; Allen, Edward B</creator><creatorcontrib>Khoshgoftaar, Taghi M ; Allen, Edward B</creatorcontrib><description>Predicting which modules are likely to have faults during operations is important to software developers, so that software enhancement efforts can be focused on those modules that need improvement the most. Modeling software quality with classification trees is attractive because they readily model nonmonotonic relationships. In this paper, we apply the TREEDISC algorithm, which is a refinement of the CHAID algorithm, to build classification-tree models. CHAID-based algorithms differ from other classification-tree algorithms in their reliance on chi-squared tests when building the tree. Classification-tree models are vulnerable to overfitting, where the model reflects the structure of the training data set too closely.
Even though a model appears to be accurate on training data, if overfitted, it may be much less accurate when applied to a current data set. To account for the severe consequences of misclassifying fault-prone modules, our measure of overfitting is based on expected costs of misclassification, rather than the total number of misclassifications. We conducted a case study of a very large telecommunications system. A two-way analysis of variance with repetitions found that TREEDISC's significance level was highly related to overfitting, and can be used to control it. Moreover, the minimum number of modules in a leaf also influenced the degree of overfitting.</description><identifier>ISSN: 1382-3256</identifier><identifier>EISSN: 1573-7616</identifier><identifier>DOI: 10.1023/A:1009803004576</identifier><language>eng</language><publisher>Dordrecht: Springer Nature B.V</publisher><subject>Algorithms ; Classification ; Computer programs ; Construction ; Mathematical models ; Modules ; Software ; Software quality ; Studies ; Training</subject><ispartof>Empirical software engineering : an international journal, 2001-03, Vol.6 (1), p.59-79</ispartof><rights>Kluwer Academic Publishers 2001</rights><lds50>peer_reviewed</lds50><woscitedreferencessubscribed>false</woscitedreferencessubscribed><citedby>FETCH-LOGICAL-c205t-dcedc66b9aecfc4eac2f61b239ceed2cb6005f862f0fe472626eb07d95e804f73</citedby></display><links><openurl>$$Topenurl_article</openurl><openurlfulltext>$$Topenurlfull_article</openurlfulltext><thumbnail>$$Tsyndetics_thumb_exl</thumbnail><link.rule.ids>314,780,784,27924,27925</link.rule.ids></links><search><creatorcontrib>Khoshgoftaar, Taghi M</creatorcontrib><creatorcontrib>Allen, Edward B</creatorcontrib><title>Controlling Overfitting in Classification-Tree Models of Software Quality</title><title>Empirical software engineering : an international journal</title><description>Predicting which modules are likely to have faults during operations is important to
software developers, so that software enhancement efforts can be focused on those modules that need improvement the most. Modeling software quality with classification trees is attractive because they readily model nonmonotonic relationships. In this paper, we apply the TREEDISC algorithm, which is a refinement of the CHAID algorithm, to build classification-tree models. CHAID-based algorithms differ from other classification-tree algorithms in their reliance on chi-squared tests when building the tree. Classification-tree models are vulnerable to overfitting, where the model reflects the structure of the training data set too closely. Even though a model appears to be accurate on training data, if overfitted, it may be much less accurate when applied to a current data set. To account for the severe consequences of misclassifying fault-prone modules, our measure of overfitting is based on expected costs of misclassification, rather than the total number of misclassifications. We conducted a case study of a very large telecommunications system. A two-way analysis of variance with repetitions found that TREEDISC's significance level was highly related to overfitting, and can be used to control it.
Moreover, the minimum number of modules in a leaf also influenced the degree of overfitting.</description><subject>Algorithms</subject><subject>Classification</subject><subject>Computer programs</subject><subject>Construction</subject><subject>Mathematical models</subject><subject>Modules</subject><subject>Software</subject><subject>Software quality</subject><subject>Studies</subject><subject>Training</subject><issn>1382-3256</issn><issn>1573-7616</issn><fulltext>true</fulltext><rsrctype>article</rsrctype><creationdate>2001</creationdate><recordtype>article</recordtype><sourceid>AFKRA</sourceid><sourceid>BENPR</sourceid><sourceid>CCPQU</sourceid><sourceid>DWQXO</sourceid><recordid>eNp9z0tLxDAUBeAgCo6ja7fFhbip5nnTupPiY2BkEMf1kKY3kiE22qSK_96Krly4OmfxceAQcszoOaNcXFxdMkrrigpKpdKwQ2ZMaVFqYLA7dVHxUnAF--QgpS2dqJZqRhZN7PMQQ_D9c7F6x8H5nL-774smmJS889ZkH_tyPSAW97HDkIroisfo8ocZsHgYTfD585DsORMSHv3mnDzdXK-bu3K5ul00V8vScqpy2VnsLEBbG7TOSjSWO2AtF7VF7LhtgVLlKuCOOpSaAwdsqe5qhRWVTos5Of3ZfR3i24gpb158shiC6TGOacMBpt-8muDZv5CBZlxIJthET_7QbRyHfrqxqXQNquISxBcWd2rA</recordid><startdate>20010301</startdate><enddate>20010301</enddate><creator>Khoshgoftaar, Taghi M</creator><creator>Allen, Edward B</creator><general>Springer Nature B.V</general><scope>7SC</scope><scope>8FD</scope><scope>8FE</scope><scope>8FG</scope><scope>ABJCF</scope><scope>AFKRA</scope><scope>ARAPS</scope><scope>BENPR</scope><scope>BGLVJ</scope><scope>CCPQU</scope><scope>DWQXO</scope><scope>HCIFZ</scope><scope>JQ2</scope><scope>L6V</scope><scope>L7M</scope><scope>L~C</scope><scope>L~D</scope><scope>M7S</scope><scope>P5Z</scope><scope>P62</scope><scope>PQEST</scope><scope>PQQKQ</scope><scope>PQUKI</scope><scope>PRINS</scope><scope>PTHSS</scope><scope>S0W</scope></search><sort><creationdate>20010301</creationdate><title>Controlling Overfitting in Classification-Tree Models of Software Quality</title><author>Khoshgoftaar, Taghi M ; Allen, Edward 
B</author></sort><facets><frbrtype>5</frbrtype><frbrgroupid>cdi_FETCH-LOGICAL-c205t-dcedc66b9aecfc4eac2f61b239ceed2cb6005f862f0fe472626eb07d95e804f73</frbrgroupid><rsrctype>articles</rsrctype><prefilter>articles</prefilter><language>eng</language><creationdate>2001</creationdate><topic>Algorithms</topic><topic>Classification</topic><topic>Computer programs</topic><topic>Construction</topic><topic>Mathematical models</topic><topic>Modules</topic><topic>Software</topic><topic>Software quality</topic><topic>Studies</topic><topic>Training</topic><toplevel>peer_reviewed</toplevel><toplevel>online_resources</toplevel><creatorcontrib>Khoshgoftaar, Taghi M</creatorcontrib><creatorcontrib>Allen, Edward B</creatorcontrib><collection>Computer and Information Systems Abstracts</collection><collection>Technology Research Database</collection><collection>ProQuest SciTech Collection</collection><collection>ProQuest Technology Collection</collection><collection>Materials Science &amp; Engineering Collection</collection><collection>ProQuest Central UK/Ireland</collection><collection>Advanced Technologies &amp; Aerospace Collection</collection><collection>ProQuest Central</collection><collection>Technology Collection</collection><collection>ProQuest One Community College</collection><collection>ProQuest Central Korea</collection><collection>SciTech Premium Collection</collection><collection>ProQuest Computer Science Collection</collection><collection>ProQuest Engineering Collection</collection><collection>Advanced Technologies Database with Aerospace</collection><collection>Computer and Information Systems Abstracts – Academic</collection><collection>Computer and Information Systems Abstracts Professional</collection><collection>Engineering Database</collection><collection>Advanced Technologies &amp; Aerospace Database</collection><collection>ProQuest Advanced Technologies &amp; Aerospace Collection</collection><collection>ProQuest One Academic Eastern Edition (DO NOT 
USE)</collection><collection>ProQuest One Academic</collection><collection>ProQuest One Academic UKI Edition</collection><collection>ProQuest Central China</collection><collection>Engineering Collection</collection><collection>DELNET Engineering &amp; Technology Collection</collection><jtitle>Empirical software engineering : an international journal</jtitle></facets><delivery><delcategory>Remote Search Resource</delcategory><fulltext>fulltext</fulltext></delivery><addata><au>Khoshgoftaar, Taghi M</au><au>Allen, Edward B</au><format>journal</format><genre>article</genre><ristype>JOUR</ristype><atitle>Controlling Overfitting in Classification-Tree Models of Software Quality</atitle><jtitle>Empirical software engineering : an international journal</jtitle><date>2001-03-01</date><risdate>2001</risdate><volume>6</volume><issue>1</issue><spage>59</spage><epage>79</epage><pages>59-79</pages><issn>1382-3256</issn><eissn>1573-7616</eissn><abstract>Predicting which modules are likely to have faults during operations is important to software developers, so that software enhancement efforts can be focused on those modules that need improvement the most. Modeling software quality with classification trees is attractive because they readily model nonmonotonic relationships. In this paper, we apply the TREEDISC algorithm, which is a refinement of the CHAID algorithm, to build classification-tree models. CHAID-based algorithms differ from other classification-tree algorithms in their reliance on chi-squared tests when building the tree. Classification-tree models are vulnerable to overfitting, where the model reflects the structure of the training data set too closely.
Even though a model appears to be accurate on training data, if overfitted, it may be much less accurate when applied to a current data set. To account for the severe consequences of misclassifying fault-prone modules, our measure of overfitting is based on expected costs of misclassification, rather than the total number of misclassifications. We conducted a case study of a very large telecommunications system. A two-way analysis of variance with repetitions found that TREEDISC's significance level was highly related to overfitting, and can be used to control it. Moreover, the minimum number of modules in a leaf also influenced the degree of overfitting.</abstract><cop>Dordrecht</cop><pub>Springer Nature B.V</pub><doi>10.1023/A:1009803004576</doi><tpages>21</tpages></addata></record>
fulltext fulltext
identifier ISSN: 1382-3256
ispartof Empirical software engineering : an international journal, 2001-03, Vol.6 (1), p.59-79
issn 1382-3256
1573-7616
language eng
recordid cdi_proquest_miscellaneous_26657328
source SpringerLink Journals - AutoHoldings
subjects Algorithms
Classification
Computer programs
Construction
Mathematical models
Modules
Software
Software quality
Studies
Training
title Controlling Overfitting in Classification-Tree Models of Software Quality
url https://sfx.bib-bvb.de/sfx_tum?ctx_ver=Z39.88-2004&ctx_enc=info:ofi/enc:UTF-8&ctx_tim=2024-12-26T19%3A17%3A35IST&url_ver=Z39.88-2004&url_ctx_fmt=infofi/fmt:kev:mtx:ctx&rfr_id=info:sid/primo.exlibrisgroup.com:primo3-Article-proquest&rft_val_fmt=info:ofi/fmt:kev:mtx:journal&rft.genre=article&rft.atitle=Controlling%20Overfitting%20in%20Classification-Tree%20Models%20of%20Software%20Quality&rft.jtitle=Empirical%20software%20engineering%20:%20an%20international%20journal&rft.au=Khoshgoftaar,%20Taghi%20M&rft.date=2001-03-01&rft.volume=6&rft.issue=1&rft.spage=59&rft.epage=79&rft.pages=59-79&rft.issn=1382-3256&rft.eissn=1573-7616&rft_id=info:doi/10.1023/A:1009803004576&rft_dat=%3Cproquest%3E1671234131%3C/proquest%3E%3Curl%3E%3C/url%3E&disable_directlink=true&sfx.directlink=off&sfx.report_link=0&rft_id=info:oai/&rft_pqid=879658246&rft_id=info:pmid/&rfr_iscdi=true
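
The abstract states that overfitting is measured through expected costs of misclassification rather than through the raw number of misclassifications, but this record does not give the formula. The following is a minimal Python sketch under stated assumptions: the per-module expected cost is taken to be the cost-weighted error count divided by the number of modules, and overfitting is taken to be the increase in that cost from the training data set to a later evaluation data set. The names `Confusion`, `expected_cost`, and `overfitting`, the cost ratio, and the counts are all hypothetical illustrations, not values or definitions from the paper.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Confusion:
    """Two-class counts: not fault-prone (nfp) versus fault-prone (fp) modules."""
    nfp_as_nfp: int  # not fault-prone modules classified correctly
    nfp_as_fp: int   # Type I errors: not fault-prone modules flagged as fault-prone
    fp_as_nfp: int   # Type II errors: fault-prone modules that were missed
    fp_as_fp: int    # fault-prone modules classified correctly

    @property
    def total(self) -> int:
        return self.nfp_as_nfp + self.nfp_as_fp + self.fp_as_nfp + self.fp_as_fp


def expected_cost(c: Confusion, cost_type1: float, cost_type2: float) -> float:
    """Cost-weighted misclassifications per module (assumed definition)."""
    if c.total == 0:
        raise ValueError("confusion matrix is empty")
    return (cost_type1 * c.nfp_as_fp + cost_type2 * c.fp_as_nfp) / c.total


def overfitting(train: Confusion, evaluation: Confusion,
                cost_type1: float, cost_type2: float) -> float:
    """Increase in expected cost from training to evaluation data (assumed definition)."""
    return (expected_cost(evaluation, cost_type1, cost_type2)
            - expected_cost(train, cost_type1, cost_type2))


if __name__ == "__main__":
    # Hypothetical counts; missing a fault-prone module is assumed 10x as costly.
    train = Confusion(nfp_as_nfp=900, nfp_as_fp=50, fp_as_nfp=10, fp_as_fp=40)
    evaluation = Confusion(nfp_as_nfp=880, nfp_as_fp=70, fp_as_nfp=25, fp_as_fp=25)
    print("training cost:  ", expected_cost(train, 1.0, 10.0))
    print("evaluation cost:", expected_cost(evaluation, 1.0, 10.0))
    print("overfitting:    ", round(overfitting(train, evaluation, 1.0, 10.0), 4))
```

In this illustration the raw error count rises from 60 to 95 modules, while the cost-weighted measure more than doubles because most of the added errors are missed fault-prone modules. That asymmetry is the reason a cost-based measure can rank tree configurations differently from a plain error count.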