Small data: practical modeling issues in human-model -omic data

Human-model data are very valuable and important in biomedical research. Ethical and economic constraints limit access to such data, and consequently these datasets rarely comprise more than a few hundred observations. As measurements are comparatively cheap, the tendency is to measure as many...

Bibliographic details
Main author: Holsbø, Einar Jakobsen
Format: Dissertation
Language: eng
creator Holsbø, Einar Jakobsen
description Human-model data are very valuable and important in biomedical research. Ethical and economic constraints limit access to such data, and consequently these datasets rarely comprise more than a few hundred observations. As measurements are comparatively cheap, the tendency is to measure as many things as possible for the few, valuable participants in a study. With -omics technologies it is cheap and simple to make hundreds of thousands of measurements simultaneously. In technical language, this setting of few observations and many measurements is a high-dimensional problem. Most gene expression experiments measure the expression levels of 10 000–15 000 genes for fewer than 100 subjects. I refer to this as the small data setting.
This dissertation is an exercise in practical data analysis as it happens in a large epidemiological cohort study. It comprises three main projects: (i) predictive modeling of breast cancer metastasis from whole-blood transcriptomics measurements; (ii) standardizing a microarray data quality assessment in the Norwegian Women and Cancer (NOWAC) postgenome cohort; and (iii) shrinkage estimation of rates. All three are small data analyses, for various reasons.
Predictive modeling in the small data setting is very challenging. There are several modern methods built to tackle high-dimensional data, but there is a need to evaluate these methods against one another when analyzing data in practice. Through the metastasis prediction work we learned first-hand that common practices in machine learning can be inefficient or harmful, especially for small data. I will outline some of the more important issues.
In a large project such as NOWAC there is a need to centralize and disseminate knowledge and procedures. The standardization of NOWAC quality assessment was a project born of necessity. The standard operating procedure for outlier removal was developed so that preprocessing of the NOWAC microarray material happens in the same way every time. We take this procedure from an archaic R script that resided in people's email inboxes to a well-documented, open-source R package and present the NOWAC guidelines for microarray quality control. The procedure is built around the inherent high value of a single observation.
Small data are plagued by high variance. When working with small data it is usually profitable to bias models by shrinkage or by borrowing information from elsewhere. We present a pseudo-Bayesian estimator of rates in an informal crime rate study. We exhibit the value of such procedures in a small data setting and demonstrate some novel considerations about the coverage properties of such a procedure. In short, I gather some common practices in predictive modeling as applied to small data and assess their practical implications. I argue that with more focus on human-based datasets in biomedicine there is a need for particular consideration of these data in a small data paradigm to allow for reliable analysis. I will present what I believe to be sensible guidelines.
format Dissertation
fullrecord advisor: Bongo, Lars Ailo
publisher: UiT Norges arktiske universitet
creationdate: 2019
isbn: 8282363308, 9788282363303
rights: info:eu-repo/semantics/openAccess
sourcetype: Open Access Repository
linktorsrc: http://hdl.handle.net/10037/14660
fulltext fulltext_linktorsrc
identifier ISBN: 8282363308
language eng
recordid cdi_cristin_nora_10037_14660
source NORA - Norwegian Open Research Archives
subjects Basic biosciences: 470
Bioinformatics: 475
Genetics and genomics: 474
Information and communication science: 420
Mathematics and natural science: 400
Mathematics: 410
Statistics: 412
VDP
title Small data: practical modeling issues in human-model -omic data
url https://sfx.bib-bvb.de/sfx_tum?ctx_ver=Z39.88-2004&ctx_enc=info:ofi/enc:UTF-8&ctx_tim=2025-01-17T22%3A08%3A38IST&url_ver=Z39.88-2004&url_ctx_fmt=infofi/fmt:kev:mtx:ctx&rfr_id=info:sid/primo.exlibrisgroup.com:primo3-Article-cristin_3HK&rft_val_fmt=info:ofi/fmt:kev:mtx:dissertation&rft.genre=dissertation&rft.btitle=Small%20data:%20practical%20modeling%20issues%20in%20human-model%20-omic%20data&rft.au=Holsb%C3%B8,%20Einar%20Jakobsen&rft.date=2019&rft.isbn=8282363308&rft.isbn_list=9788282363303&rft_id=info:doi/&rft_dat=%3Ccristin_3HK%3E10037_14660%3C/cristin_3HK%3E%3Curl%3E%3C/url%3E&disable_directlink=true&sfx.directlink=off&sfx.report_link=0&rft_id=info:oai/&rft_id=info:pmid/&rfr_iscdi=true