The use of gene expression datasets in feature selection research: 20 years of inherent bias?

Feature selection algorithms are frequently employed in preprocessing machine learning pipelines applied to biological data to identify relevant features. The use of feature selection in gene expression studies began at the end of the 1990s with the analysis of human cancer microarray datasets. Sinc...

Bibliographic details
Published in:	Wiley interdisciplinary reviews. Data mining and knowledge discovery 2024-03, Vol.14 (2)
Main authors:	Grisci, Bruno I.; Feltes, Bruno César; de Faria Poloni, Joice; Narloch, Pedro H.; Dorn, Márcio
Format:	Article
Language:	eng
Online access:	Full text

container_end_page
container_issue	2
container_start_page
container_title	Wiley interdisciplinary reviews. Data mining and knowledge discovery
container_volume	14
creator	Grisci, Bruno I. ; Feltes, Bruno César ; de Faria Poloni, Joice ; Narloch, Pedro H. ; Dorn, Márcio
description	Feature selection algorithms are frequently employed in preprocessing machine learning pipelines applied to biological data to identify relevant features. The use of feature selection in gene expression studies began at the end of the 1990s with the analysis of human cancer microarray datasets. Since then, gene expression technology has been perfected, the Human Genome Project has been completed, new microarray platforms have been created and discontinued, and RNA‐seq has gradually replaced microarrays. However, most feature selection methods in the last two decades were designed, evaluated, and validated on the same datasets from the microarray technology's infancy. In this review of over 1200 publications regarding feature selection and gene expression, published between 2010 and 2020, we found that 57% of the publications used at least one outdated dataset, 23% used only outdated data, and 32% did not cite data sources. Other issues include referencing databases that are no longer available, the slow adoption of RNA‐seq datasets, and bias toward human cancer data, even for methods designed for a broader scope. In the most popular datasets, some being 23 years old, mislabeled samples, experimental biases, distribution shifts, and the absence of classification challenges are common. These problems are more predominant in publications with computer science backgrounds compared to publications from biology and can lead to inaccurate and misleading biological results. This article is categorized under: Algorithmic Development > Biological Data Mining; Technologies > Machine Learning
doi_str_mv	10.1002/widm.1523
format	Article
fullrecord	<record><control><sourceid>crossref</sourceid><recordid>TN_cdi_crossref_primary_10_1002_widm_1523</recordid><sourceformat>XML</sourceformat><sourcesystem>PC</sourcesystem><sourcerecordid>10_1002_widm_1523</sourcerecordid><originalsourceid>FETCH-LOGICAL-c264t-b9e5908eba5f46e691b06e57682246d439db5f18620f6ef828a05f456a8e4b083</originalsourceid><addsrcrecordid>eNo9kDFPwzAUhC0EElXpwD_wypBiO7bjsCBUQUGqxFJmy06eiVGbVn6uoBsrf5NfQiIQb7l3utMNHyGXnM05Y-L6PbbbOVeiPCETXktRyKpWp_-_qc7JDPGNDVcKY4yYELvugB4Q6C7QV-iBwsc-AWLc9bR12SFkpLGnAVw-JKAIG2jymA4tcKnpbqhg359fx8HguBL7DhL0mfro8PaCnAW3QZj96ZS8PNyvF4_F6nn5tLhbFY3QMhe-BlUzA96pIDXomnumQVXaCCF1K8u69SpwowULGoIRxrGhqbQzID0z5ZRc_e42aYeYINh9iluXjpYzO8KxIxw7wil_AGY5WKw</addsrcrecordid><sourcetype>Aggregation Database</sourcetype><iscdi>true</iscdi><recordtype>article</recordtype></control><display><type>article</type><title>The use of gene expression datasets in feature selection research: 20 years of inherent bias?</title><source>Access via Wiley Online Library</source><creator>Grisci, Bruno I. ; Feltes, Bruno César ; de Faria Poloni, Joice ; Narloch, Pedro H. ; Dorn, Márcio</creator><creatorcontrib>Grisci, Bruno I. ; Feltes, Bruno César ; de Faria Poloni, Joice ; Narloch, Pedro H. ; Dorn, Márcio</creatorcontrib><description>Feature selection algorithms are frequently employed in preprocessing machine learning pipelines applied to biological data to identify relevant features. The use of feature selection in gene expression studies began at the end of the 1990s with the analysis of human cancer microarray datasets. Since then, gene expression technology has been perfected, the Human Genome Project has been completed, new microarray platforms have been created and discontinued, and RNA‐seq has gradually replaced microarrays. However, most feature selection methods in the last two decades were designed, evaluated, and validated on the same datasets from the microarray technology's infancy. 
In this review of over 1200 publications regarding feature selection and gene expression, published between 2010 and 2020, we found that 57% of the publications used at least one outdated dataset, 23% used only outdated data, and 32% did not cite data sources. Other issues include referencing databases that are no longer available, the slow adoption of RNA‐seq datasets, and bias toward human cancer data, even for methods designed for a broader scope. In the most popular datasets, some being 23 years old, mislabeled samples, experimental biases, distribution shifts, and the absence of classification challenges are common. These problems are more predominant in publications with computer science backgrounds compared to publications from biology and can lead to inaccurate and misleading biological results. This article is categorized under: Algorithmic Development > Biological Data Mining Technologies > Machine Learning</description><identifier>ISSN: 1942-4787</identifier><identifier>EISSN: 1942-4795</identifier><identifier>DOI: 10.1002/widm.1523</identifier><language>eng</language><ispartof>Wiley interdisciplinary reviews. 
Data mining and knowledge discovery, 2024-03, Vol.14 (2)</ispartof><lds50>peer_reviewed</lds50><oa>free_for_read</oa><woscitedreferencessubscribed>false</woscitedreferencessubscribed><citedby>FETCH-LOGICAL-c264t-b9e5908eba5f46e691b06e57682246d439db5f18620f6ef828a05f456a8e4b083</citedby><cites>FETCH-LOGICAL-c264t-b9e5908eba5f46e691b06e57682246d439db5f18620f6ef828a05f456a8e4b083</cites><orcidid>0000-0001-8534-3480 ; 0000-0003-4083-5881</orcidid></display><links><openurl>$$Topenurl_article</openurl><openurlfulltext>$$Topenurlfull_article</openurlfulltext><thumbnail>$$Tsyndetics_thumb_exl</thumbnail><link.rule.ids>314,780,784,27924,27925</link.rule.ids></links><search><creatorcontrib>Grisci, Bruno I.</creatorcontrib><creatorcontrib>Feltes, Bruno César</creatorcontrib><creatorcontrib>de Faria Poloni, Joice</creatorcontrib><creatorcontrib>Narloch, Pedro H.</creatorcontrib><creatorcontrib>Dorn, Márcio</creatorcontrib><title>The use of gene expression datasets in feature selection research: 20 years of inherent bias?</title><title>Wiley interdisciplinary reviews. Data mining and knowledge discovery</title><description>Feature selection algorithms are frequently employed in preprocessing machine learning pipelines applied to biological data to identify relevant features. The use of feature selection in gene expression studies began at the end of the 1990s with the analysis of human cancer microarray datasets. Since then, gene expression technology has been perfected, the Human Genome Project has been completed, new microarray platforms have been created and discontinued, and RNA‐seq has gradually replaced microarrays. However, most feature selection methods in the last two decades were designed, evaluated, and validated on the same datasets from the microarray technology's infancy. 
In this review of over 1200 publications regarding feature selection and gene expression, published between 2010 and 2020, we found that 57% of the publications used at least one outdated dataset, 23% used only outdated data, and 32% did not cite data sources. Other issues include referencing databases that are no longer available, the slow adoption of RNA‐seq datasets, and bias toward human cancer data, even for methods designed for a broader scope. In the most popular datasets, some being 23 years old, mislabeled samples, experimental biases, distribution shifts, and the absence of classification challenges are common. These problems are more predominant in publications with computer science backgrounds compared to publications from biology and can lead to inaccurate and misleading biological results. This article is categorized under: Algorithmic Development > Biological Data Mining Technologies > Machine Learning</description><issn>1942-4787</issn><issn>1942-4795</issn><fulltext>true</fulltext><rsrctype>article</rsrctype><creationdate>2024</creationdate><recordtype>article</recordtype><recordid>eNo9kDFPwzAUhC0EElXpwD_wypBiO7bjsCBUQUGqxFJmy06eiVGbVn6uoBsrf5NfQiIQb7l3utMNHyGXnM05Y-L6PbbbOVeiPCETXktRyKpWp_-_qc7JDPGNDVcKY4yYELvugB4Q6C7QV-iBwsc-AWLc9bR12SFkpLGnAVw-JKAIG2jymA4tcKnpbqhg359fx8HguBL7DhL0mfro8PaCnAW3QZj96ZS8PNyvF4_F6nn5tLhbFY3QMhe-BlUzA96pIDXomnumQVXaCCF1K8u69SpwowULGoIRxrGhqbQzID0z5ZRc_e42aYeYINh9iluXjpYzO8KxIxw7wil_AGY5WKw</recordid><startdate>202403</startdate><enddate>202403</enddate><creator>Grisci, Bruno I.</creator><creator>Feltes, Bruno César</creator><creator>de Faria Poloni, Joice</creator><creator>Narloch, Pedro H.</creator><creator>Dorn, Márcio</creator><scope>AAYXX</scope><scope>CITATION</scope><orcidid>https://orcid.org/0000-0001-8534-3480</orcidid><orcidid>https://orcid.org/0000-0003-4083-5881</orcidid></search><sort><creationdate>202403</creationdate><title>The use of gene expression datasets in feature selection research: 20 years of 
inherent bias?</title><author>Grisci, Bruno I. ; Feltes, Bruno César ; de Faria Poloni, Joice ; Narloch, Pedro H. ; Dorn, Márcio</author></sort><facets><frbrtype>5</frbrtype><frbrgroupid>cdi_FETCH-LOGICAL-c264t-b9e5908eba5f46e691b06e57682246d439db5f18620f6ef828a05f456a8e4b083</frbrgroupid><rsrctype>articles</rsrctype><prefilter>articles</prefilter><language>eng</language><creationdate>2024</creationdate><toplevel>peer_reviewed</toplevel><toplevel>online_resources</toplevel><creatorcontrib>Grisci, Bruno I.</creatorcontrib><creatorcontrib>Feltes, Bruno César</creatorcontrib><creatorcontrib>de Faria Poloni, Joice</creatorcontrib><creatorcontrib>Narloch, Pedro H.</creatorcontrib><creatorcontrib>Dorn, Márcio</creatorcontrib><collection>CrossRef</collection><jtitle>Wiley interdisciplinary reviews. Data mining and knowledge discovery</jtitle></facets><delivery><delcategory>Remote Search Resource</delcategory><fulltext>fulltext</fulltext></delivery><addata><au>Grisci, Bruno I.</au><au>Feltes, Bruno César</au><au>de Faria Poloni, Joice</au><au>Narloch, Pedro H.</au><au>Dorn, Márcio</au><format>journal</format><genre>article</genre><ristype>JOUR</ristype><atitle>The use of gene expression datasets in feature selection research: 20 years of inherent bias?</atitle><jtitle>Wiley interdisciplinary reviews. Data mining and knowledge discovery</jtitle><date>2024-03</date><risdate>2024</risdate><volume>14</volume><issue>2</issue><issn>1942-4787</issn><eissn>1942-4795</eissn><abstract>Feature selection algorithms are frequently employed in preprocessing machine learning pipelines applied to biological data to identify relevant features. The use of feature selection in gene expression studies began at the end of the 1990s with the analysis of human cancer microarray datasets. 
Since then, gene expression technology has been perfected, the Human Genome Project has been completed, new microarray platforms have been created and discontinued, and RNA‐seq has gradually replaced microarrays. However, most feature selection methods in the last two decades were designed, evaluated, and validated on the same datasets from the microarray technology's infancy. In this review of over 1200 publications regarding feature selection and gene expression, published between 2010 and 2020, we found that 57% of the publications used at least one outdated dataset, 23% used only outdated data, and 32% did not cite data sources. Other issues include referencing databases that are no longer available, the slow adoption of RNA‐seq datasets, and bias toward human cancer data, even for methods designed for a broader scope. In the most popular datasets, some being 23 years old, mislabeled samples, experimental biases, distribution shifts, and the absence of classification challenges are common. These problems are more predominant in publications with computer science backgrounds compared to publications from biology and can lead to inaccurate and misleading biological results. This article is categorized under: Algorithmic Development > Biological Data Mining Technologies > Machine Learning</abstract><doi>10.1002/widm.1523</doi><orcidid>https://orcid.org/0000-0001-8534-3480</orcidid><orcidid>https://orcid.org/0000-0003-4083-5881</orcidid><oa>free_for_read</oa></addata></record>
fulltext	fulltext
identifier	ISSN: 1942-4787
ispartof	Wiley interdisciplinary reviews. Data mining and knowledge discovery, 2024-03, Vol.14 (2)
issn	1942-4787 1942-4795
language	eng
recordid	cdi_crossref_primary_10_1002_widm_1523
source	Access via Wiley Online Library
title	The use of gene expression datasets in feature selection research: 20 years of inherent bias?
url	https://sfx.bib-bvb.de/sfx_tum?ctx_ver=Z39.88-2004&ctx_enc=info:ofi/enc:UTF-8&ctx_tim=2024-12-19T03%3A57%3A13IST&url_ver=Z39.88-2004&url_ctx_fmt=infofi/fmt:kev:mtx:ctx&rfr_id=info:sid/primo.exlibrisgroup.com:primo3-Article-crossref&rft_val_fmt=info:ofi/fmt:kev:mtx:journal&rft.genre=article&rft.atitle=The%20use%20of%20gene%20expression%20datasets%20in%20feature%20selection%20research:%2020%E2%80%89years%20of%20inherent%20bias?&rft.jtitle=Wiley%20interdisciplinary%20reviews.%20Data%20mining%20and%20knowledge%20discovery&rft.au=Grisci,%20Bruno%20I.&rft.date=2024-03&rft.volume=14&rft.issue=2&rft.issn=1942-4787&rft.eissn=1942-4795&rft_id=info:doi/10.1002/widm.1523&rft_dat=%3Ccrossref%3E10_1002_widm_1523%3C/crossref%3E%3Curl%3E%3C/url%3E&disable_directlink=true&sfx.directlink=off&sfx.report_link=0&rft_id=info:oai/&rft_id=info:pmid/&rfr_iscdi=true