Submodel Selection and Evaluation in Regression. The X-Random Case

Often, in a regression situation with many variables, a sequence of submodels is generated containing fewer variables by using such methods as stepwise addition or deletion of variables, or 'best subsets'. The question is which of this sequence of submodels is 'best', and how can...
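
As a rough illustration of the comparison reported in the full abstract (5-fold versus leave-one-out cross-validation for choosing among a sequence of submodels), here is a minimal sketch in Python with NumPy. It is not the authors' simulation design. The Gaussian X-design, the coefficient vector, the sample size, and the function names `cv_prediction_error` and `select_submodel` are assumptions made for illustration. The submodel sequence is simply "the first k columns of X", standing in for stepwise or best-subsets selection.

```python
import numpy as np

def cv_prediction_error(X, y, n_folds, rng):
    """Estimate prediction error (mean squared error on held-out rows)
    of an ordinary least-squares fit using n_folds-fold cross-validation."""
    n = len(y)
    # Shuffle the row indices, then split them into n_folds nearly equal parts.
    folds = np.array_split(rng.permutation(n), n_folds)
    sq_err = 0.0
    for test_idx in folds:
        train = np.ones(n, dtype=bool)
        train[test_idx] = False
        # Fit least squares (with intercept) on the training rows only.
        A_train = np.column_stack([np.ones(train.sum()), X[train]])
        coef, *_ = np.linalg.lstsq(A_train, y[train], rcond=None)
        A_test = np.column_stack([np.ones(len(test_idx)), X[test_idx]])
        sq_err += np.sum((y[test_idx] - A_test @ coef) ** 2)
    return sq_err / n

def select_submodel(X, y, n_folds, rng):
    """Return (best_size, pe_estimates) over nested submodels that use
    the first 1, 2, ..., p columns of X (an assumed, simplified sequence)."""
    p = X.shape[1]
    pe = [cv_prediction_error(X[:, :k], y, n_folds, rng) for k in range(1, p + 1)]
    return int(np.argmin(pe)) + 1, pe

rng = np.random.default_rng(0)
n, p = 60, 8
X = rng.normal(size=(n, p))  # random X-design (assumed Gaussian)
beta = np.array([2.0, -1.5, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0])  # hypothetical: 3 active variables
y = X @ beta + rng.normal(size=n)

size_5fold, pe_5fold = select_submodel(X, y, n_folds=5, rng=rng)
size_loo, pe_loo = select_submodel(X, y, n_folds=n, rng=rng)  # n folds = leave-one-out
print("5-fold CV picks", size_5fold, "variables; leave-one-out picks", size_loo)
```

Setting `n_folds=n` turns the same routine into leave-one-out cross-validation, so both estimates share one code path. A single run like this says nothing about which method is better on average. That would require repeating the experiment over many simulated data sets, as the study describes.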

Bibliographic details
Published in:	International statistical review, 1992-12, Vol.60 (3), p.291-319
Main authors:	Breiman, Leo; Spector, Philip
Format:	Article
Language:	English
Subjects:	Cost estimates; Cost estimation models; Dimensionality; Error rates; Estimate reliability; Estimation bias; Estimation methods; Estimators; Exact sciences and technology; Induced substructures; Linear inference, regression; Mathematics; Probability and statistics; Sample size; Sciences and techniques of general use; Statistics
Online access:	Full text

container_end_page	319
container_issue	3
container_start_page	291
container_title	International statistical review
container_volume	60
creator	Breiman, Leo ; Spector, Philip
description	Often, in a regression situation with many variables, a sequence of submodels is generated containing fewer variables by using such methods as stepwise addition or deletion of variables, or 'best subsets'. The question is which of this sequence of submodels is 'best', and how can submodel performance be evaluated. This was explored in Breiman (1988) for a fixed X-design. This is a sequel exploring the case of random X-designs. Analytical results are difficult, if not impossible. This study involved an extensive simulation. The basis of the study is the theoretical definition of prediction error (PE) as the expected squared error produced by applying a prediction equation to the distributional universe of (y, x) values. This definition is used throughout to compare various submodels. There can be startling differences between the x-fixed and x-random situations and different PE estimates are appropriate. Non-resampling estimates such as Cp, adjusted R², etc. turn out to be highly biased methods for submodel selection. The two best methods are cross-validation and bootstrap. One surprise is that 5-fold cross-validation (leave out 20% of the data) is better at submodel selection and evaluation than leave-one-out cross-validation. There are a number of other surprises.
doi_str_mv	10.2307/1403680
format	Article
fullrecord	<record><control><sourceid>jstor_proqu</sourceid><recordid>TN_cdi_proquest_journals_1311323296</recordid><sourceformat>XML</sourceformat><sourcesystem>PC</sourcesystem><jstor_id>1403680</jstor_id><sourcerecordid>1403680</sourcerecordid><originalsourceid>FETCH-LOGICAL-c241t-ff02fdfa824da04c4cc23ceb13d058003276b815a575806f3a691d26360c4d7f3</originalsourceid><addsrcrecordid>eNp1kEtLAzEUhYMoWKv4FwIKrkZvcjNJutRSH1AQ2gruhjQPnTKdqcmM4L93aouuXF0OfHzncgg5Z3DNEdQNE4BSwwEZMJWzLNccD8kAEGSmFIpjcpLSCgCQazEgd_NuuW6cr-jcV962ZVNTUzs6-TRVZ35iWdOZf4s-pT5d08W7p6_ZrIeaNR2b5E_JUTBV8mf7OyQv95PF-DGbPj88jW-nmeWCtVkIwIMLRnPhDAgrrOVo_ZKhg1xv_1FyqVluctVHGdDIEXNcogQrnAo4JBc77yY2H51PbbFqulj3lQVDxpAjH8meutpRNjYpRR-KTSzXJn4VDIrtQMV-oJ683PtMsqYK0dS2TL-40DrXjP9hq9Q28V_bN6xnbNw</addsrcrecordid><sourcetype>Aggregation Database</sourcetype><iscdi>true</iscdi><recordtype>article</recordtype><pqid>1311323296</pqid></control><display><type>article</type><title>Submodel Selection and Evaluation in Regression. The X-Random Case</title><source>Periodicals Index Online</source><source>JSTOR Mathematics & Statistics</source><source>JSTOR Archive Collection A-Z Listing</source><creator>Breiman, Leo ; Spector, Philip</creator><creatorcontrib>Breiman, Leo ; Spector, Philip</creatorcontrib><description>Often, in a regression situation with many variables, a sequence of submodels is generated containing fewer variables by using such methods as stepwise addition or deletion of variables, or 'best subsets'. The question is which of this sequence of submodels is 'best', and how can submodel performance be evaluated. This was explored in Breiman (1988) for a fixed X-design. This is a sequel exploring the case of random X-designs. Analytical results are difficult, if not impossible. This study involved an extensive simulation. 
The basis of the study is the theoretical definition of prediction error (PE) as the expected squared error produced by applying a prediction equation to the distributional universe of (y, x) values. This definition is used throughout to compare various submodels. There can be startling differences between the x-fixed and x-random situations and different PE estimates are appropriate. Non-resampling estimates such as CP, adjusted R2, etc. turn out to be highly biased methods for submodel selection. The two best methods are cross-validation and bootstrap. One surprise is that 5 fold cross-validation (leave out 20% of the data) is better at submodel selection and evaluation than leave-one-out cross-validation. There are a number of other surprises. /// Dans l'analyse de problèmes de régression à plusieurs variables (indépendantes), on produit souvent une série de sous-modèles constitués d'un sous-ensemble des variables par des méthodes telles que l'addition par étape, le retrait par étape et la méthode du meilleur sous-ensemble. Le problème est de déterminer lequel de ces sous-modèles est le meilleur et d'évaluer sa performance. Ce problème fut exploré dans Breiman (1988) pour le cas d'une matrice X fixe. Dans ce qui suit on considère le cas où la matrice X est aléatoire. La détermination de résultats analytiques est difficile, sinon impossible. Notre étude a utilisé des simulations de grande envergure. Elle se base sur la définition théorique de l'erreur de prédiction (EP) comme étant l'espérance du carré de l'erreur produite en applicant une équation de prédiction à l'univers distributional des valeurs (y, x). La définition est utilisée dans toute l'étude à fin de comparer divers sous-modèles. Il y a une différence étonnante entre le cas où la matrice X est fixée et celui où elle est aléatoire. Différents estimateurs de la EP sont à propos. 
Les estimateurs n'utilisant pas de ré-échantillonage, tels que le Cpet le R2ajusté, produisent des méthodes de sélection ayant grand biais. Les deux meilleures méthodes sont la validation croisée et l'autoamorçage. Une surprise est que la validation croisée quintuple est meilleure que la validation croisée tous sauf un. Il y a plusieurs autres résultats surprenants.</description><identifier>ISSN: 0306-7734</identifier><identifier>EISSN: 1751-5823</identifier><identifier>DOI: 10.2307/1403680</identifier><identifier>CODEN: ISTRDP</identifier><language>eng</language><publisher>Malden, MA: International Statistical Institute</publisher><subject>Cost estimates ; Cost estimation models ; Dimensionality ; Error rates ; Estimate reliability ; Estimation bias ; Estimation methods ; Estimators ; Exact sciences and technology ; Induced substructures ; Linear inference, regression ; Mathematics ; Probability and statistics ; Sample size ; Sciences and techniques of general use ; Statistics</subject><ispartof>International statistical review, 1992-12, Vol.60 (3), p.291-319</ispartof><rights>Copyright 1992 International Statistical Institute</rights><rights>1993 INIST-CNRS</rights><lds50>peer_reviewed</lds50><woscitedreferencessubscribed>false</woscitedreferencessubscribed><citedby>FETCH-LOGICAL-c241t-ff02fdfa824da04c4cc23ceb13d058003276b815a575806f3a691d26360c4d7f3</citedby></display><links><openurl>$$Topenurl_article</openurl><openurlfulltext>$$Topenurlfull_article</openurlfulltext><thumbnail>$$Tsyndetics_thumb_exl</thumbnail><linktopdf>$$Uhttps://www.jstor.org/stable/pdf/1403680$$EPDF$$P50$$Gjstor$$H</linktopdf><linktohtml>$$Uhttps://www.jstor.org/stable/1403680$$EHTML$$P50$$Gjstor$$H</linktohtml><link.rule.ids>314,780,784,803,832,27869,27924,27925,58017,58021,58250,58254</link.rule.ids><backlink>$$Uhttp://pascal-francis.inist.fr/vibad/index.php?action=getRecordDetail&idt=4885812$$DView record in Pascal 
Francis$$Hfree_for_read</backlink></links><search><creatorcontrib>Breiman, Leo</creatorcontrib><creatorcontrib>Spector, Philip</creatorcontrib><title>Submodel Selection and Evaluation in Regression. The X-Random Case</title><title>International statistical review</title><description>Often, in a regression situation with many variables, a sequence of submodels is generated containing fewer variables by using such methods as stepwise addition or deletion of variables, or 'best subsets'. The question is which of this sequence of submodels is 'best', and how can submodel performance be evaluated. This was explored in Breiman (1988) for a fixed X-design. This is a sequel exploring the case of random X-designs. Analytical results are difficult, if not impossible. This study involved an extensive simulation. The basis of the study is the theoretical definition of prediction error (PE) as the expected squared error produced by applying a prediction equation to the distributional universe of (y, x) values. This definition is used throughout to compare various submodels. There can be startling differences between the x-fixed and x-random situations and different PE estimates are appropriate. Non-resampling estimates such as CP, adjusted R2, etc. turn out to be highly biased methods for submodel selection. The two best methods are cross-validation and bootstrap. One surprise is that 5 fold cross-validation (leave out 20% of the data) is better at submodel selection and evaluation than leave-one-out cross-validation. There are a number of other surprises. /// Dans l'analyse de problèmes de régression à plusieurs variables (indépendantes), on produit souvent une série de sous-modèles constitués d'un sous-ensemble des variables par des méthodes telles que l'addition par étape, le retrait par étape et la méthode du meilleur sous-ensemble. Le problème est de déterminer lequel de ces sous-modèles est le meilleur et d'évaluer sa performance. 
Ce problème fut exploré dans Breiman (1988) pour le cas d'une matrice X fixe. Dans ce qui suit on considère le cas où la matrice X est aléatoire. La détermination de résultats analytiques est difficile, sinon impossible. Notre étude a utilisé des simulations de grande envergure. Elle se base sur la définition théorique de l'erreur de prédiction (EP) comme étant l'espérance du carré de l'erreur produite en applicant une équation de prédiction à l'univers distributional des valeurs (y, x). La définition est utilisée dans toute l'étude à fin de comparer divers sous-modèles. Il y a une différence étonnante entre le cas où la matrice X est fixée et celui où elle est aléatoire. Différents estimateurs de la EP sont à propos. Les estimateurs n'utilisant pas de ré-échantillonage, tels que le Cpet le R2ajusté, produisent des méthodes de sélection ayant grand biais. Les deux meilleures méthodes sont la validation croisée et l'autoamorçage. Une surprise est que la validation croisée quintuple est meilleure que la validation croisée tous sauf un. 
Il y a plusieurs autres résultats surprenants.</description><subject>Cost estimates</subject><subject>Cost estimation models</subject><subject>Dimensionality</subject><subject>Error rates</subject><subject>Estimate reliability</subject><subject>Estimation bias</subject><subject>Estimation methods</subject><subject>Estimators</subject><subject>Exact sciences and technology</subject><subject>Induced substructures</subject><subject>Linear inference, regression</subject><subject>Mathematics</subject><subject>Probability and statistics</subject><subject>Sample size</subject><subject>Sciences and techniques of general use</subject><subject>Statistics</subject><issn>0306-7734</issn><issn>1751-5823</issn><fulltext>true</fulltext><rsrctype>article</rsrctype><creationdate>1992</creationdate><recordtype>article</recordtype><sourceid>K30</sourceid><recordid>eNp1kEtLAzEUhYMoWKv4FwIKrkZvcjNJutRSH1AQ2gruhjQPnTKdqcmM4L93aouuXF0OfHzncgg5Z3DNEdQNE4BSwwEZMJWzLNccD8kAEGSmFIpjcpLSCgCQazEgd_NuuW6cr-jcV962ZVNTUzs6-TRVZ35iWdOZf4s-pT5d08W7p6_ZrIeaNR2b5E_JUTBV8mf7OyQv95PF-DGbPj88jW-nmeWCtVkIwIMLRnPhDAgrrOVo_ZKhg1xv_1FyqVluctVHGdDIEXNcogQrnAo4JBc77yY2H51PbbFqulj3lQVDxpAjH8meutpRNjYpRR-KTSzXJn4VDIrtQMV-oJ683PtMsqYK0dS2TL-40DrXjP9hq9Q28V_bN6xnbNw</recordid><startdate>19921201</startdate><enddate>19921201</enddate><creator>Breiman, Leo</creator><creator>Spector, Philip</creator><general>International Statistical 
Institute</general><general>Blackwell</general><general>Longman</general><scope>IQODW</scope><scope>AAYXX</scope><scope>CITATION</scope><scope>JSICY</scope><scope>K30</scope><scope>PAAUG</scope><scope>PAWHS</scope><scope>PAWZZ</scope><scope>PAXOH</scope><scope>PBHAV</scope><scope>PBQSW</scope><scope>PBYQZ</scope><scope>PCIWU</scope><scope>PCMID</scope><scope>PCZJX</scope><scope>PDGRG</scope><scope>PDWWI</scope><scope>PETMR</scope><scope>PFVGT</scope><scope>PGXDX</scope><scope>PIHIL</scope><scope>PISVA</scope><scope>PJCTQ</scope><scope>PJTMS</scope><scope>PLCHJ</scope><scope>PMHAD</scope><scope>PNQDJ</scope><scope>POUND</scope><scope>PPLAD</scope><scope>PQAPC</scope><scope>PQCAN</scope><scope>PQCMW</scope><scope>PQEME</scope><scope>PQHKH</scope><scope>PQMID</scope><scope>PQNCT</scope><scope>PQNET</scope><scope>PQSCT</scope><scope>PQSET</scope><scope>PSVJG</scope><scope>PVMQY</scope><scope>PZGFC</scope></search><sort><creationdate>19921201</creationdate><title>Submodel Selection and Evaluation in Regression. 
The X-Random Case</title><author>Breiman, Leo ; Spector, Philip</author></sort><facets><frbrtype>5</frbrtype><frbrgroupid>cdi_FETCH-LOGICAL-c241t-ff02fdfa824da04c4cc23ceb13d058003276b815a575806f3a691d26360c4d7f3</frbrgroupid><rsrctype>articles</rsrctype><prefilter>articles</prefilter><language>eng</language><creationdate>1992</creationdate><topic>Cost estimates</topic><topic>Cost estimation models</topic><topic>Dimensionality</topic><topic>Error rates</topic><topic>Estimate reliability</topic><topic>Estimation bias</topic><topic>Estimation methods</topic><topic>Estimators</topic><topic>Exact sciences and technology</topic><topic>Induced substructures</topic><topic>Linear inference, regression</topic><topic>Mathematics</topic><topic>Probability and statistics</topic><topic>Sample size</topic><topic>Sciences and techniques of general use</topic><topic>Statistics</topic><toplevel>peer_reviewed</toplevel><toplevel>online_resources</toplevel><creatorcontrib>Breiman, Leo</creatorcontrib><creatorcontrib>Spector, Philip</creatorcontrib><collection>Pascal-Francis</collection><collection>CrossRef</collection><collection>Periodicals Index Online Segment 36</collection><collection>Periodicals Index Online</collection><collection>Primary Sources Access—Foundation Edition (Plan E) - West</collection><collection>Primary Sources Access (Plan D) - International</collection><collection>Primary Sources Access & Build (Plan A) - MEA</collection><collection>Primary Sources Access—Foundation Edition (Plan E) - Midwest</collection><collection>Primary Sources Access—Foundation Edition (Plan E) - Northeast</collection><collection>Primary Sources Access (Plan D) - Southeast</collection><collection>Primary Sources Access (Plan D) - North Central</collection><collection>Primary Sources Access—Foundation Edition (Plan E) - Southeast</collection><collection>Primary Sources Access (Plan D) - South Central</collection><collection>Primary Sources Access & Build (Plan A) - UK / 
I</collection><collection>Primary Sources Access (Plan D) - Canada</collection><collection>Primary Sources Access (Plan D) - EMEALA</collection><collection>Primary Sources Access—Foundation Edition (Plan E) - North Central</collection><collection>Primary Sources Access—Foundation Edition (Plan E) - South Central</collection><collection>Primary Sources Access & Build (Plan A) - International</collection><collection>Primary Sources Access—Foundation Edition (Plan E) - International</collection><collection>Primary Sources Access (Plan D) - West</collection><collection>Periodicals Index Online Segments 1-50</collection><collection>Primary Sources Access (Plan D) - APAC</collection><collection>Primary Sources Access (Plan D) - Midwest</collection><collection>Primary Sources Access (Plan D) - MEA</collection><collection>Primary Sources Access—Foundation Edition (Plan E) - Canada</collection><collection>Primary Sources Access—Foundation Edition (Plan E) - UK / I</collection><collection>Primary Sources Access—Foundation Edition (Plan E) - EMEALA</collection><collection>Primary Sources Access & Build (Plan A) - APAC</collection><collection>Primary Sources Access & Build (Plan A) - Canada</collection><collection>Primary Sources Access & Build (Plan A) - West</collection><collection>Primary Sources Access & Build (Plan A) - EMEALA</collection><collection>Primary Sources Access (Plan D) - Northeast</collection><collection>Primary Sources Access & Build (Plan A) - Midwest</collection><collection>Primary Sources Access & Build (Plan A) - North Central</collection><collection>Primary Sources Access & Build (Plan A) - Northeast</collection><collection>Primary Sources Access & Build (Plan A) - South Central</collection><collection>Primary Sources Access & Build (Plan A) - Southeast</collection><collection>Primary Sources Access (Plan D) - UK / I</collection><collection>Primary Sources Access—Foundation Edition (Plan E) - APAC</collection><collection>Primary Sources 
Access—Foundation Edition (Plan E) - MEA</collection><jtitle>International statistical review</jtitle></facets><delivery><delcategory>Remote Search Resource</delcategory><fulltext>fulltext</fulltext></delivery><addata><au>Breiman, Leo</au><au>Spector, Philip</au><format>journal</format><genre>article</genre><ristype>JOUR</ristype><atitle>Submodel Selection and Evaluation in Regression. The X-Random Case</atitle><jtitle>International statistical review</jtitle><date>1992-12-01</date><risdate>1992</risdate><volume>60</volume><issue>3</issue><spage>291</spage><epage>319</epage><pages>291-319</pages><issn>0306-7734</issn><eissn>1751-5823</eissn><coden>ISTRDP</coden><abstract>Often, in a regression situation with many variables, a sequence of submodels is generated containing fewer variables by using such methods as stepwise addition or deletion of variables, or 'best subsets'. The question is which of this sequence of submodels is 'best', and how can submodel performance be evaluated. This was explored in Breiman (1988) for a fixed X-design. This is a sequel exploring the case of random X-designs. Analytical results are difficult, if not impossible. This study involved an extensive simulation. The basis of the study is the theoretical definition of prediction error (PE) as the expected squared error produced by applying a prediction equation to the distributional universe of (y, x) values. This definition is used throughout to compare various submodels. There can be startling differences between the x-fixed and x-random situations and different PE estimates are appropriate. Non-resampling estimates such as CP, adjusted R2, etc. turn out to be highly biased methods for submodel selection. The two best methods are cross-validation and bootstrap. One surprise is that 5 fold cross-validation (leave out 20% of the data) is better at submodel selection and evaluation than leave-one-out cross-validation. There are a number of other surprises. 
/// Dans l'analyse de problèmes de régression à plusieurs variables (indépendantes), on produit souvent une série de sous-modèles constitués d'un sous-ensemble des variables par des méthodes telles que l'addition par étape, le retrait par étape et la méthode du meilleur sous-ensemble. Le problème est de déterminer lequel de ces sous-modèles est le meilleur et d'évaluer sa performance. Ce problème fut exploré dans Breiman (1988) pour le cas d'une matrice X fixe. Dans ce qui suit on considère le cas où la matrice X est aléatoire. La détermination de résultats analytiques est difficile, sinon impossible. Notre étude a utilisé des simulations de grande envergure. Elle se base sur la définition théorique de l'erreur de prédiction (EP) comme étant l'espérance du carré de l'erreur produite en applicant une équation de prédiction à l'univers distributional des valeurs (y, x). La définition est utilisée dans toute l'étude à fin de comparer divers sous-modèles. Il y a une différence étonnante entre le cas où la matrice X est fixée et celui où elle est aléatoire. Différents estimateurs de la EP sont à propos. Les estimateurs n'utilisant pas de ré-échantillonage, tels que le Cpet le R2ajusté, produisent des méthodes de sélection ayant grand biais. Les deux meilleures méthodes sont la validation croisée et l'autoamorçage. Une surprise est que la validation croisée quintuple est meilleure que la validation croisée tous sauf un. Il y a plusieurs autres résultats surprenants.</abstract><cop>Malden, MA</cop><pub>International Statistical Institute</pub><doi>10.2307/1403680</doi><tpages>29</tpages></addata></record>
fulltext	fulltext
identifier	ISSN: 0306-7734
ispartof	International statistical review, 1992-12, Vol.60 (3), p.291-319
issn	0306-7734 1751-5823
language	eng
recordid	cdi_proquest_journals_1311323296
source	Periodicals Index Online; JSTOR Mathematics & Statistics; JSTOR Archive Collection A-Z Listing
subjects	Cost estimates ; Cost estimation models ; Dimensionality ; Error rates ; Estimate reliability ; Estimation bias ; Estimation methods ; Estimators ; Exact sciences and technology ; Induced substructures ; Linear inference, regression ; Mathematics ; Probability and statistics ; Sample size ; Sciences and techniques of general use ; Statistics
title	Submodel Selection and Evaluation in Regression. The X-Random Case
url	https://sfx.bib-bvb.de/sfx_tum?ctx_ver=Z39.88-2004&ctx_enc=info:ofi/enc:UTF-8&ctx_tim=2025-01-04T22%3A18%3A15IST&url_ver=Z39.88-2004&url_ctx_fmt=infofi/fmt:kev:mtx:ctx&rfr_id=info:sid/primo.exlibrisgroup.com:primo3-Article-jstor_proqu&rft_val_fmt=info:ofi/fmt:kev:mtx:journal&rft.genre=article&rft.atitle=Submodel%20Selection%20and%20Evaluation%20in%20Regression.%20The%20X-Random%20Case&rft.jtitle=International%20statistical%20review&rft.au=Breiman,%20Leo&rft.date=1992-12-01&rft.volume=60&rft.issue=3&rft.spage=291&rft.epage=319&rft.pages=291-319&rft.issn=0306-7734&rft.eissn=1751-5823&rft.coden=ISTRDP&rft_id=info:doi/10.2307/1403680&rft_dat=%3Cjstor_proqu%3E1403680%3C/jstor_proqu%3E%3Curl%3E%3C/url%3E&disable_directlink=true&sfx.directlink=off&sfx.report_link=0&rft_id=info:oai/&rft_pqid=1311323296&rft_id=info:pmid/&rft_jstor_id=1403680&rfr_iscdi=true