A sequential split-and-conquer approach for the analysis of big dependent data in computer experiments

Massive correlated data with many inputs are often generated from computer experiments to study complex systems. The Gaussian process (GP) model is a widely used tool for the analysis of computer experiments. Although GPs provide a simple and effective approximation to computer experiments, two crit...

Bibliographic details
Published in:	Canadian journal of statistics 2020-12, Vol.48 (4), p.712-730
Main authors:	LI, Chengrui; HUNG, Ying; XIE, Minge
Format:	Article
Language:	eng
Subjects:	Approximation; Asymptotic methods; Complex systems; Computer experiment; Confidence; confidence distribution; Correlation analysis; divide‐conquer‐combine method; Experiments; Family physicians; Gaussian process; Inference; Matrices; Measurement; Predictions; predictive distribution; Simulation; Substitutes; Uncertainty
Online access:	Full text

container_end_page	730
container_issue	4
container_start_page	712
container_title	Canadian journal of statistics
container_volume	48
creator	LI, Chengrui ; HUNG, Ying ; XIE, Minge
description	Massive correlated data with many inputs are often generated from computer experiments to study complex systems. The Gaussian process (GP) model is a widely used tool for the analysis of computer experiments. Although GPs provide a simple and effective approximation to computer experiments, two critical issues remain unresolved. One is the computational issue in GP estimation and prediction where intensive manipulations of a large correlation matrix are required. For a large sample size and with a large number of variables, this task is often unstable or infeasible. The other issue is how to improve the naive plug-in predictive distribution which is known to underestimate the uncertainty. In this article, we introduce a unified framework that can tackle both issues simultaneously. It consists of a sequential split-and-conquer procedure, an information combining technique using confidence distributions (CD), and a frequentist predictive distribution based on the combined CD. It is shown that the proposed method maintains the same asymptotic efficiency as the conventional likelihood inference under mild conditions, but dramatically reduces the computation in both estimation and prediction. The predictive distribution contains comprehensive information for inference and provides a better quantification of predictive uncertainty as compared with the plug-in approach. Simulations are conducted to compare the estimation and prediction accuracy with some existing methods, and the computational advantage of the proposed method is also illustrated. The proposed method is demonstrated by a real data example based on tens of thousands of computer experiments generated from a computational fluid dynamic simulator.
doi_str_mv	10.1002/cjs.11559
format	Article
fullrecord	<record><control><sourceid>jstor_proqu</sourceid><recordid>TN_cdi_proquest_journals_2468381303</recordid><sourceformat>XML</sourceformat><sourcesystem>PC</sourcesystem><jstor_id>48744741</jstor_id><sourcerecordid>48744741</sourcerecordid><originalsourceid>FETCH-LOGICAL-c3199-8aa861bd2bbfd4f21cd8302fd481ca9c9f1269a4d7c071d4ac2df30b6604e4b3</originalsourceid><addsrcrecordid>eNp1kD1PwzAURS0EEqUw8AOQLDExhPrZTuKMVcWnKjHQgc1y_EETpXGwU0H_PYYCG5Of_M59OroInQO5BkLoTLfxGiDPqwM0gZKIrOL5yyGaEAZVlpeUH6OTGFtCWA5AJ8jNcbRvW9uPjepwHLpmzFRvMu379BuwGobglV5j5wMe1xarXnW72ETsHa6bV2zsYHuT8tioUeGmx9pvhu2YsvZjsKHZpF08RUdOddGe_bxTtLq9WS3us-XT3cNivsx08qsyoZQooDa0rp3hjoI2ghGaZgFaVbpyQItKcVNqUoLhSlPjGKmLgnDLazZFl_uzSTrpx1G2fhuScZSUF4IJYIQl6mpP6eBjDNbJIWmqsJNA5FeLMrUov1tM7GzPvjed3f0PysXj82_iYp9o4-jDX4KLkvOSA_sEg3R--A</addsrcrecordid><sourcetype>Aggregation Database</sourcetype><iscdi>true</iscdi><recordtype>article</recordtype><pqid>2468381303</pqid></control><display><type>article</type><title>A sequential split-and-conquer approach for the analysis of big dependent data in computer experiments</title><source>Wiley Journals</source><source>JSTOR Mathematics & Statistics</source><source>JSTOR Archive Collection A-Z Listing</source><creator>LI, Chengrui ; HUNG, Ying ; XIE, Minge</creator><creatorcontrib>LI, Chengrui ; HUNG, Ying ; XIE, Minge</creatorcontrib><description>Massive correlated data with many inputs are often generated from computer experiments to study complex systems. The Gaussian process (GP) model is a widely used tool for the analysis of computer experiments. Although GPs provide a simple and effective approximation to computer experiments, two critical issues remain unresolved. One is the computational issue in GP estimation and prediction where intensive manipulations of a large correlation matrix are required. For a large sample size and with a large number of variables, this task is often unstable or infeasible. 
The other issue is how to improve the naive plug-in predictive distribution which is known to underestimate the uncertainty. In this article, we introduce a unified framework that can tackle both issues simultaneously. It consists of a sequential split-and-conquer procedure, an information combining technique using confidence distributions (CD), and a frequentist predictive distribution based on the combined CD. It is shown that the proposed method maintains the same asymptotic efficiency as the conventional likelihood inference under mild conditions, but dramatically reduces the computation in both estimation and prediction. The predictive distribution contains comprehensive information for inference and provides a better quantification of predictive uncertainty as compared with the plug-in approach. Simulations are conducted to compare the estimation and prediction accuracy with some existing methods, and the computational advantage of the proposed method is also illustrated. The proposed method is demonstrated by a real data example based on tens of thousands of computer experiments generated from a computational fluid dynamic simulator. Les expériences informatiques génèrent souvent des données corrélées massives avec de nombreuses entrées pour étudier des systèmes complexes. Les processus gaussiens (PG) sont largement utilisés comme outil pour leur analyse. Même si les PG offrent une approximation simple et efficace aux expériences informatiques, ils présentent deux problèmes critiques non résolus. Le premier se trouve au niveau computationnel dans l’estimation et les prévisions des PG qui nécessitent d’intenses manipulations de grandes matrices de corrélation. Pour une taille d’échantillon élevée et un grand nombre de variables, cette tâche devient souvent instable, voire infaisable. L’autre problème réside dans l’amélioration de l’approche naïve de substitution de la distribution prédictive qui conduit à une sous-estimation de l’incertitude. 
Les auteurs introduisent un cadre unifié qui peut régler ces deux problèmes simultanément. Il repose sur une procédure séquentielle de type diviser pour régner, une technique de combinaison de l’information utilisant les distributions de confiance (DC) et une distribution prédictive fréquentiste basée sur des DC combinées. Les auteurs montrent que la méthode proposée conserve la même efficacité asymptotique que la vraisemblance conventionnelle sous des hypothèses raisonnables tout en réduisant substantiellement la quantité de calcul nécessaire pour l’estimation et la prévision. La distribution prédictive comporte une information complète pour l’inférence et une meilleure quantification de l’incertitude de prévision en comparaison de l’approche de substitution. Les auteurs présentent des simulations comparant la justesse de l’estimation et des prévisions par rapport aux méthodes existantes. Ils illustrent également l’avantage computationnel de leur approche. Ils démontrent finalement l’usage de leur méthode en analysant des données réelles de dizaines de milliers d’expériences informatiques provenant d’un simulateur numérique portant sur la dynamique des fluides.</description><identifier>ISSN: 0319-5724</identifier><identifier>EISSN: 1708-945X</identifier><identifier>DOI: 10.1002/cjs.11559</identifier><language>eng</language><publisher>Hoboken, USA: Wiley</publisher><subject>Approximation ; Asymptotic methods ; Complex systems ; Computer experiment ; Confidence ; confidence distribution ; Correlation analysis ; divide‐conquer‐combine method ; Experiments ; Family physicians ; Gaussian process ; Inference ; Matrices ; Measurement ; Predictions ; predictive distribution ; Simulation ; Substitutes ; Uncertainty</subject><ispartof>Canadian journal of statistics, 2020-12, Vol.48 (4), p.712-730</ispartof><rights>2020 Statistical Society of Canada / Société statistique du Canada</rights><rights>2020 Statistical Society of Canada / Soci\xE9t\xE9 statistique du 
Canada</rights><rights>2020 Statistical Society of Canada</rights><lds50>peer_reviewed</lds50><woscitedreferencessubscribed>false</woscitedreferencessubscribed><citedby>FETCH-LOGICAL-c3199-8aa861bd2bbfd4f21cd8302fd481ca9c9f1269a4d7c071d4ac2df30b6604e4b3</citedby><cites>FETCH-LOGICAL-c3199-8aa861bd2bbfd4f21cd8302fd481ca9c9f1269a4d7c071d4ac2df30b6604e4b3</cites><orcidid>0000-0001-7298-0966</orcidid></display><links><openurl>$$Topenurl_article</openurl><openurlfulltext>$$Topenurlfull_article</openurlfulltext><thumbnail>$$Tsyndetics_thumb_exl</thumbnail><linktopdf>$$Uhttps://www.jstor.org/stable/pdf/48744741$$EPDF$$P50$$Gjstor$$H</linktopdf><linktohtml>$$Uhttps://www.jstor.org/stable/48744741$$EHTML$$P50$$Gjstor$$H</linktohtml><link.rule.ids>314,780,784,803,832,1417,27924,27925,45574,45575,58017,58021,58250,58254</link.rule.ids></links><search><creatorcontrib>LI, Chengrui</creatorcontrib><creatorcontrib>HUNG, Ying</creatorcontrib><creatorcontrib>XIE, Minge</creatorcontrib><title>A sequential split-and-conquer approach for the analysis of big dependent data in computer experiments</title><title>Canadian journal of statistics</title><description>Massive correlated data with many inputs are often generated from computer experiments to study complex systems. The Gaussian process (GP) model is a widely used tool for the analysis of computer experiments. Although GPs provide a simple and effective approximation to computer experiments, two critical issues remain unresolved. One is the computational issue in GP estimation and prediction where intensive manipulations of a large correlation matrix are required. For a large sample size and with a large number of variables, this task is often unstable or infeasible. The other issue is how to improve the naive plug-in predictive distribution which is known to underestimate the uncertainty. In this article, we introduce a unified framework that can tackle both issues simultaneously. 
It consists of a sequential split-and-conquer procedure, an information combining technique using confidence distributions (CD), and a frequentist predictive distribution based on the combined CD. It is shown that the proposed method maintains the same asymptotic efficiency as the conventional likelihood inference under mild conditions, but dramatically reduces the computation in both estimation and prediction. The predictive distribution contains comprehensive information for inference and provides a better quantification of predictive uncertainty as compared with the plug-in approach. Simulations are conducted to compare the estimation and prediction accuracy with some existing methods, and the computational advantage of the proposed method is also illustrated. The proposed method is demonstrated by a real data example based on tens of thousands of computer experiments generated from a computational fluid dynamic simulator. Les expériences informatiques génèrent souvent des données corrélées massives avec de nombreuses entrées pour étudier des systèmes complexes. Les processus gaussiens (PG) sont largement utilisés comme outil pour leur analyse. Même si les PG offrent une approximation simple et efficace aux expériences informatiques, ils présentent deux problèmes critiques non résolus. Le premier se trouve au niveau computationnel dans l’estimation et les prévisions des PG qui nécessitent d’intenses manipulations de grandes matrices de corrélation. Pour une taille d’échantillon élevée et un grand nombre de variables, cette tâche devient souvent instable, voire infaisable. L’autre problème réside dans l’amélioration de l’approche naïve de substitution de la distribution prédictive qui conduit à une sous-estimation de l’incertitude. Les auteurs introduisent un cadre unifié qui peut régler ces deux problèmes simultanément. 
Il repose sur une procédure séquentielle de type diviser pour régner, une technique de combinaison de l’information utilisant les distributions de confiance (DC) et une distribution prédictive fréquentiste basée sur des DC combinées. Les auteurs montrent que la méthode proposée conserve la même efficacité asymptotique que la vraisemblance conventionnelle sous des hypothèses raisonnables tout en réduisant substantiellement la quantité de calcul nécessaire pour l’estimation et la prévision. La distribution prédictive comporte une information complète pour l’inférence et une meilleure quantification de l’incertitude de prévision en comparaison de l’approche de substitution. Les auteurs présentent des simulations comparant la justesse de l’estimation et des prévisions par rapport aux méthodes existantes. Ils illustrent également l’avantage computationnel de leur approche. Ils démontrent finalement l’usage de leur méthode en analysant des données réelles de dizaines de milliers d’expériences informatiques provenant d’un simulateur numérique portant sur la dynamique des fluides.</description><subject>Approximation</subject><subject>Asymptotic methods</subject><subject>Complex systems</subject><subject>Computer experiment</subject><subject>Confidence</subject><subject>confidence distribution</subject><subject>Correlation analysis</subject><subject>divide‐conquer‐combine method</subject><subject>Experiments</subject><subject>Family physicians</subject><subject>Gaussian process</subject><subject>Inference</subject><subject>Matrices</subject><subject>Measurement</subject><subject>Predictions</subject><subject>predictive 
distribution</subject><subject>Simulation</subject><subject>Substitutes</subject><subject>Uncertainty</subject><issn>0319-5724</issn><issn>1708-945X</issn><fulltext>true</fulltext><rsrctype>article</rsrctype><creationdate>2020</creationdate><recordtype>article</recordtype><recordid>eNp1kD1PwzAURS0EEqUw8AOQLDExhPrZTuKMVcWnKjHQgc1y_EETpXGwU0H_PYYCG5Of_M59OroInQO5BkLoTLfxGiDPqwM0gZKIrOL5yyGaEAZVlpeUH6OTGFtCWA5AJ8jNcbRvW9uPjepwHLpmzFRvMu379BuwGobglV5j5wMe1xarXnW72ETsHa6bV2zsYHuT8tioUeGmx9pvhu2YsvZjsKHZpF08RUdOddGe_bxTtLq9WS3us-XT3cNivsx08qsyoZQooDa0rp3hjoI2ghGaZgFaVbpyQItKcVNqUoLhSlPjGKmLgnDLazZFl_uzSTrpx1G2fhuScZSUF4IJYIQl6mpP6eBjDNbJIWmqsJNA5FeLMrUov1tM7GzPvjed3f0PysXj82_iYp9o4-jDX4KLkvOSA_sEg3R--A</recordid><startdate>20201201</startdate><enddate>20201201</enddate><creator>LI, Chengrui</creator><creator>HUNG, Ying</creator><creator>XIE, Minge</creator><general>Wiley</general><general>John Wiley & Sons, Inc</general><general>Wiley Subscription Services, Inc</general><scope>AAYXX</scope><scope>CITATION</scope><scope>7SC</scope><scope>8BJ</scope><scope>8FD</scope><scope>FQK</scope><scope>H8D</scope><scope>JBE</scope><scope>JQ2</scope><scope>L7M</scope><scope>L~C</scope><scope>L~D</scope><orcidid>https://orcid.org/0000-0001-7298-0966</orcidid></search><sort><creationdate>20201201</creationdate><title>A sequential split-and-conquer approach for the analysis of big dependent data in computer experiments</title><author>LI, Chengrui ; HUNG, Ying ; XIE, Minge</author></sort><facets><frbrtype>5</frbrtype><frbrgroupid>cdi_FETCH-LOGICAL-c3199-8aa861bd2bbfd4f21cd8302fd481ca9c9f1269a4d7c071d4ac2df30b6604e4b3</frbrgroupid><rsrctype>articles</rsrctype><prefilter>articles</prefilter><language>eng</language><creationdate>2020</creationdate><topic>Approximation</topic><topic>Asymptotic methods</topic><topic>Complex systems</topic><topic>Computer experiment</topic><topic>Confidence</topic><topic>confidence distribution</topic><topic>Correlation 
analysis</topic><topic>divide‐conquer‐combine method</topic><topic>Experiments</topic><topic>Family physicians</topic><topic>Gaussian process</topic><topic>Inference</topic><topic>Matrices</topic><topic>Measurement</topic><topic>Predictions</topic><topic>predictive distribution</topic><topic>Simulation</topic><topic>Substitutes</topic><topic>Uncertainty</topic><toplevel>peer_reviewed</toplevel><toplevel>online_resources</toplevel><creatorcontrib>LI, Chengrui</creatorcontrib><creatorcontrib>HUNG, Ying</creatorcontrib><creatorcontrib>XIE, Minge</creatorcontrib><collection>CrossRef</collection><collection>Computer and Information Systems Abstracts</collection><collection>International Bibliography of the Social Sciences (IBSS)</collection><collection>Technology Research Database</collection><collection>International Bibliography of the Social Sciences</collection><collection>Aerospace Database</collection><collection>International Bibliography of the Social Sciences</collection><collection>ProQuest Computer Science Collection</collection><collection>Advanced Technologies Database with Aerospace</collection><collection>Computer and Information Systems Abstracts Academic</collection><collection>Computer and Information Systems Abstracts Professional</collection><jtitle>Canadian journal of statistics</jtitle></facets><delivery><delcategory>Remote Search Resource</delcategory><fulltext>fulltext</fulltext></delivery><addata><au>LI, Chengrui</au><au>HUNG, Ying</au><au>XIE, Minge</au><format>journal</format><genre>article</genre><ristype>JOUR</ristype><atitle>A sequential split-and-conquer approach for the analysis of big dependent data in computer experiments</atitle><jtitle>Canadian journal of statistics</jtitle><date>2020-12-01</date><risdate>2020</risdate><volume>48</volume><issue>4</issue><spage>712</spage><epage>730</epage><pages>712-730</pages><issn>0319-5724</issn><eissn>1708-945X</eissn><abstract>Massive correlated data with many inputs are often generated from 
computer experiments to study complex systems. The Gaussian process (GP) model is a widely used tool for the analysis of computer experiments. Although GPs provide a simple and effective approximation to computer experiments, two critical issues remain unresolved. One is the computational issue in GP estimation and prediction where intensive manipulations of a large correlation matrix are required. For a large sample size and with a large number of variables, this task is often unstable or infeasible. The other issue is how to improve the naive plug-in predictive distribution which is known to underestimate the uncertainty. In this article, we introduce a unified framework that can tackle both issues simultaneously. It consists of a sequential split-and-conquer procedure, an information combining technique using confidence distributions (CD), and a frequentist predictive distribution based on the combined CD. It is shown that the proposed method maintains the same asymptotic efficiency as the conventional likelihood inference under mild conditions, but dramatically reduces the computation in both estimation and prediction. The predictive distribution contains comprehensive information for inference and provides a better quantification of predictive uncertainty as compared with the plug-in approach. Simulations are conducted to compare the estimation and prediction accuracy with some existing methods, and the computational advantage of the proposed method is also illustrated. The proposed method is demonstrated by a real data example based on tens of thousands of computer experiments generated from a computational fluid dynamic simulator. Les expériences informatiques génèrent souvent des données corrélées massives avec de nombreuses entrées pour étudier des systèmes complexes. Les processus gaussiens (PG) sont largement utilisés comme outil pour leur analyse. 
Même si les PG offrent une approximation simple et efficace aux expériences informatiques, ils présentent deux problèmes critiques non résolus. Le premier se trouve au niveau computationnel dans l’estimation et les prévisions des PG qui nécessitent d’intenses manipulations de grandes matrices de corrélation. Pour une taille d’échantillon élevée et un grand nombre de variables, cette tâche devient souvent instable, voire infaisable. L’autre problème réside dans l’amélioration de l’approche naïve de substitution de la distribution prédictive qui conduit à une sous-estimation de l’incertitude. Les auteurs introduisent un cadre unifié qui peut régler ces deux problèmes simultanément. Il repose sur une procédure séquentielle de type diviser pour régner, une technique de combinaison de l’information utilisant les distributions de confiance (DC) et une distribution prédictive fréquentiste basée sur des DC combinées. Les auteurs montrent que la méthode proposée conserve la même efficacité asymptotique que la vraisemblance conventionnelle sous des hypothèses raisonnables tout en réduisant substantiellement la quantité de calcul nécessaire pour l’estimation et la prévision. La distribution prédictive comporte une information complète pour l’inférence et une meilleure quantification de l’incertitude de prévision en comparaison de l’approche de substitution. Les auteurs présentent des simulations comparant la justesse de l’estimation et des prévisions par rapport aux méthodes existantes. Ils illustrent également l’avantage computationnel de leur approche. Ils démontrent finalement l’usage de leur méthode en analysant des données réelles de dizaines de milliers d’expériences informatiques provenant d’un simulateur numérique portant sur la dynamique des fluides.</abstract><cop>Hoboken, USA</cop><pub>Wiley</pub><doi>10.1002/cjs.11559</doi><tpages>19</tpages><orcidid>https://orcid.org/0000-0001-7298-0966</orcidid></addata></record>
fulltext	fulltext
identifier	ISSN: 0319-5724
ispartof	Canadian journal of statistics, 2020-12, Vol.48 (4), p.712-730
issn	0319-5724 1708-945X
language	eng
recordid	cdi_proquest_journals_2468381303
source	Wiley Journals; JSTOR Mathematics & Statistics; JSTOR Archive Collection A-Z Listing
subjects	Approximation; Asymptotic methods; Complex systems; Computer experiment; Confidence; confidence distribution; Correlation analysis; divide‐conquer‐combine method; Experiments; Family physicians; Gaussian process; Inference; Matrices; Measurement; Predictions; predictive distribution; Simulation; Substitutes; Uncertainty
title	A sequential split-and-conquer approach for the analysis of big dependent data in computer experiments
url	https://sfx.bib-bvb.de/sfx_tum?ctx_ver=Z39.88-2004&ctx_enc=info:ofi/enc:UTF-8&ctx_tim=2024-12-27T11%3A16%3A30IST&url_ver=Z39.88-2004&url_ctx_fmt=infofi/fmt:kev:mtx:ctx&rfr_id=info:sid/primo.exlibrisgroup.com:primo3-Article-jstor_proqu&rft_val_fmt=info:ofi/fmt:kev:mtx:journal&rft.genre=article&rft.atitle=A%20sequential%20split-and-conquer%20approach%20for%20the%20analysis%20of%20big%20dependent%20data%20in%20computer%20experiments&rft.jtitle=Canadian%20journal%20of%20statistics&rft.au=LI,%20Chengrui&rft.date=2020-12-01&rft.volume=48&rft.issue=4&rft.spage=712&rft.epage=730&rft.pages=712-730&rft.issn=0319-5724&rft.eissn=1708-945X&rft_id=info:doi/10.1002/cjs.11559&rft_dat=%3Cjstor_proqu%3E48744741%3C/jstor_proqu%3E%3Curl%3E%3C/url%3E&disable_directlink=true&sfx.directlink=off&sfx.report_link=0&rft_id=info:oai/&rft_pqid=2468381303&rft_id=info:pmid/&rft_jstor_id=48744741&rfr_iscdi=true