The use of gene expression datasets in feature selection research: 20 years of inherent bias?

Feature selection algorithms are frequently employed in preprocessing machine learning pipelines applied to biological data to identify relevant features. The use of feature selection in gene expression studies began at the end of the 1990s with the analysis of human cancer microarray datasets. Sinc...

Bibliographic details
Published in: Wiley interdisciplinary reviews. Data mining and knowledge discovery, 2024-03, Vol.14 (2)
Main authors: Grisci, Bruno I., Feltes, Bruno César, de Faria Poloni, Joice, Narloch, Pedro H., Dorn, Márcio
Format: Article
Language: English
Online access: Full text
container_end_page
container_issue 2
container_start_page
container_title Wiley interdisciplinary reviews. Data mining and knowledge discovery
container_volume 14
creator Grisci, Bruno I.
Feltes, Bruno César
de Faria Poloni, Joice
Narloch, Pedro H.
Dorn, Márcio
description Feature selection algorithms are frequently employed in preprocessing machine learning pipelines applied to biological data to identify relevant features. The use of feature selection in gene expression studies began at the end of the 1990s with the analysis of human cancer microarray datasets. Since then, gene expression technology has been perfected, the Human Genome Project has been completed, new microarray platforms have been created and discontinued, and RNA‐seq has gradually replaced microarrays. However, most feature selection methods in the last two decades were designed, evaluated, and validated on the same datasets from the microarray technology's infancy. In this review of over 1200 publications regarding feature selection and gene expression, published between 2010 and 2020, we found that 57% of the publications used at least one outdated dataset, 23% used only outdated data, and 32% did not cite data sources. Other issues include referencing databases that are no longer available, the slow adoption of RNA‐seq datasets, and bias toward human cancer data, even for methods designed for a broader scope. In the most popular datasets, some being 23 years old, mislabeled samples, experimental biases, distribution shifts, and the absence of classification challenges are common. These problems are more predominant in publications with computer science backgrounds compared to publications from biology and can lead to inaccurate and misleading biological results. This article is categorized under: Algorithmic Development > Biological Data Mining Technologies > Machine Learning
doi_str_mv 10.1002/widm.1523
format Article
fullrecord <record><control><sourceid>crossref</sourceid><recordid>TN_cdi_crossref_primary_10_1002_widm_1523</recordid><sourceformat>XML</sourceformat><sourcesystem>PC</sourcesystem><sourcerecordid>10_1002_widm_1523</sourcerecordid><originalsourceid>FETCH-LOGICAL-c264t-b9e5908eba5f46e691b06e57682246d439db5f18620f6ef828a05f456a8e4b083</originalsourceid><addsrcrecordid>eNo9kDFPwzAUhC0EElXpwD_wypBiO7bjsCBUQUGqxFJmy06eiVGbVn6uoBsrf5NfQiIQb7l3utMNHyGXnM05Y-L6PbbbOVeiPCETXktRyKpWp_-_qc7JDPGNDVcKY4yYELvugB4Q6C7QV-iBwsc-AWLc9bR12SFkpLGnAVw-JKAIG2jymA4tcKnpbqhg359fx8HguBL7DhL0mfro8PaCnAW3QZj96ZS8PNyvF4_F6nn5tLhbFY3QMhe-BlUzA96pIDXomnumQVXaCCF1K8u69SpwowULGoIRxrGhqbQzID0z5ZRc_e42aYeYINh9iluXjpYzO8KxIxw7wil_AGY5WKw</addsrcrecordid><sourcetype>Aggregation Database</sourcetype><iscdi>true</iscdi><recordtype>article</recordtype></control><display><type>article</type><title>The use of gene expression datasets in feature selection research: 20 years of inherent bias?</title><source>Access via Wiley Online Library</source><creator>Grisci, Bruno I. ; Feltes, Bruno César ; de Faria Poloni, Joice ; Narloch, Pedro H. ; Dorn, Márcio</creator><creatorcontrib>Grisci, Bruno I. ; Feltes, Bruno César ; de Faria Poloni, Joice ; Narloch, Pedro H. ; Dorn, Márcio</creatorcontrib><description>Feature selection algorithms are frequently employed in preprocessing machine learning pipelines applied to biological data to identify relevant features. The use of feature selection in gene expression studies began at the end of the 1990s with the analysis of human cancer microarray datasets. Since then, gene expression technology has been perfected, the Human Genome Project has been completed, new microarray platforms have been created and discontinued, and RNA‐seq has gradually replaced microarrays. However, most feature selection methods in the last two decades were designed, evaluated, and validated on the same datasets from the microarray technology's infancy. 
In this review of over 1200 publications regarding feature selection and gene expression, published between 2010 and 2020, we found that 57% of the publications used at least one outdated dataset, 23% used only outdated data, and 32% did not cite data sources. Other issues include referencing databases that are no longer available, the slow adoption of RNA‐seq datasets, and bias toward human cancer data, even for methods designed for a broader scope. In the most popular datasets, some being 23 years old, mislabeled samples, experimental biases, distribution shifts, and the absence of classification challenges are common. These problems are more predominant in publications with computer science backgrounds compared to publications from biology and can lead to inaccurate and misleading biological results. This article is categorized under: Algorithmic Development &gt; Biological Data Mining Technologies &gt; Machine Learning</description><identifier>ISSN: 1942-4787</identifier><identifier>EISSN: 1942-4795</identifier><identifier>DOI: 10.1002/widm.1523</identifier><language>eng</language><ispartof>Wiley interdisciplinary reviews. 
Data mining and knowledge discovery, 2024-03, Vol.14 (2)</ispartof><lds50>peer_reviewed</lds50><oa>free_for_read</oa><woscitedreferencessubscribed>false</woscitedreferencessubscribed><citedby>FETCH-LOGICAL-c264t-b9e5908eba5f46e691b06e57682246d439db5f18620f6ef828a05f456a8e4b083</citedby><cites>FETCH-LOGICAL-c264t-b9e5908eba5f46e691b06e57682246d439db5f18620f6ef828a05f456a8e4b083</cites><orcidid>0000-0001-8534-3480 ; 0000-0003-4083-5881</orcidid></display><links><openurl>$$Topenurl_article</openurl><openurlfulltext>$$Topenurlfull_article</openurlfulltext><thumbnail>$$Tsyndetics_thumb_exl</thumbnail><link.rule.ids>314,780,784,27924,27925</link.rule.ids></links><search><creatorcontrib>Grisci, Bruno I.</creatorcontrib><creatorcontrib>Feltes, Bruno César</creatorcontrib><creatorcontrib>de Faria Poloni, Joice</creatorcontrib><creatorcontrib>Narloch, Pedro H.</creatorcontrib><creatorcontrib>Dorn, Márcio</creatorcontrib><title>The use of gene expression datasets in feature selection research: 20 years of inherent bias?</title><title>Wiley interdisciplinary reviews. Data mining and knowledge discovery</title><description>Feature selection algorithms are frequently employed in preprocessing machine learning pipelines applied to biological data to identify relevant features. The use of feature selection in gene expression studies began at the end of the 1990s with the analysis of human cancer microarray datasets. Since then, gene expression technology has been perfected, the Human Genome Project has been completed, new microarray platforms have been created and discontinued, and RNA‐seq has gradually replaced microarrays. However, most feature selection methods in the last two decades were designed, evaluated, and validated on the same datasets from the microarray technology's infancy. 
In this review of over 1200 publications regarding feature selection and gene expression, published between 2010 and 2020, we found that 57% of the publications used at least one outdated dataset, 23% used only outdated data, and 32% did not cite data sources. Other issues include referencing databases that are no longer available, the slow adoption of RNA‐seq datasets, and bias toward human cancer data, even for methods designed for a broader scope. In the most popular datasets, some being 23 years old, mislabeled samples, experimental biases, distribution shifts, and the absence of classification challenges are common. These problems are more predominant in publications with computer science backgrounds compared to publications from biology and can lead to inaccurate and misleading biological results. This article is categorized under: Algorithmic Development &gt; Biological Data Mining Technologies &gt; Machine Learning</description><issn>1942-4787</issn><issn>1942-4795</issn><fulltext>true</fulltext><rsrctype>article</rsrctype><creationdate>2024</creationdate><recordtype>article</recordtype><recordid>eNo9kDFPwzAUhC0EElXpwD_wypBiO7bjsCBUQUGqxFJmy06eiVGbVn6uoBsrf5NfQiIQb7l3utMNHyGXnM05Y-L6PbbbOVeiPCETXktRyKpWp_-_qc7JDPGNDVcKY4yYELvugB4Q6C7QV-iBwsc-AWLc9bR12SFkpLGnAVw-JKAIG2jymA4tcKnpbqhg359fx8HguBL7DhL0mfro8PaCnAW3QZj96ZS8PNyvF4_F6nn5tLhbFY3QMhe-BlUzA96pIDXomnumQVXaCCF1K8u69SpwowULGoIRxrGhqbQzID0z5ZRc_e42aYeYINh9iluXjpYzO8KxIxw7wil_AGY5WKw</recordid><startdate>202403</startdate><enddate>202403</enddate><creator>Grisci, Bruno I.</creator><creator>Feltes, Bruno César</creator><creator>de Faria Poloni, Joice</creator><creator>Narloch, Pedro H.</creator><creator>Dorn, Márcio</creator><scope>AAYXX</scope><scope>CITATION</scope><orcidid>https://orcid.org/0000-0001-8534-3480</orcidid><orcidid>https://orcid.org/0000-0003-4083-5881</orcidid></search><sort><creationdate>202403</creationdate><title>The use of gene expression datasets in feature selection research: 20 years 
of inherent bias?</title><author>Grisci, Bruno I. ; Feltes, Bruno César ; de Faria Poloni, Joice ; Narloch, Pedro H. ; Dorn, Márcio</author></sort><facets><frbrtype>5</frbrtype><frbrgroupid>cdi_FETCH-LOGICAL-c264t-b9e5908eba5f46e691b06e57682246d439db5f18620f6ef828a05f456a8e4b083</frbrgroupid><rsrctype>articles</rsrctype><prefilter>articles</prefilter><language>eng</language><creationdate>2024</creationdate><toplevel>peer_reviewed</toplevel><toplevel>online_resources</toplevel><creatorcontrib>Grisci, Bruno I.</creatorcontrib><creatorcontrib>Feltes, Bruno César</creatorcontrib><creatorcontrib>de Faria Poloni, Joice</creatorcontrib><creatorcontrib>Narloch, Pedro H.</creatorcontrib><creatorcontrib>Dorn, Márcio</creatorcontrib><collection>CrossRef</collection><jtitle>Wiley interdisciplinary reviews. Data mining and knowledge discovery</jtitle></facets><delivery><delcategory>Remote Search Resource</delcategory><fulltext>fulltext</fulltext></delivery><addata><au>Grisci, Bruno I.</au><au>Feltes, Bruno César</au><au>de Faria Poloni, Joice</au><au>Narloch, Pedro H.</au><au>Dorn, Márcio</au><format>journal</format><genre>article</genre><ristype>JOUR</ristype><atitle>The use of gene expression datasets in feature selection research: 20 years of inherent bias?</atitle><jtitle>Wiley interdisciplinary reviews. Data mining and knowledge discovery</jtitle><date>2024-03</date><risdate>2024</risdate><volume>14</volume><issue>2</issue><issn>1942-4787</issn><eissn>1942-4795</eissn><abstract>Feature selection algorithms are frequently employed in preprocessing machine learning pipelines applied to biological data to identify relevant features. The use of feature selection in gene expression studies began at the end of the 1990s with the analysis of human cancer microarray datasets. 
Since then, gene expression technology has been perfected, the Human Genome Project has been completed, new microarray platforms have been created and discontinued, and RNA‐seq has gradually replaced microarrays. However, most feature selection methods in the last two decades were designed, evaluated, and validated on the same datasets from the microarray technology's infancy. In this review of over 1200 publications regarding feature selection and gene expression, published between 2010 and 2020, we found that 57% of the publications used at least one outdated dataset, 23% used only outdated data, and 32% did not cite data sources. Other issues include referencing databases that are no longer available, the slow adoption of RNA‐seq datasets, and bias toward human cancer data, even for methods designed for a broader scope. In the most popular datasets, some being 23 years old, mislabeled samples, experimental biases, distribution shifts, and the absence of classification challenges are common. These problems are more predominant in publications with computer science backgrounds compared to publications from biology and can lead to inaccurate and misleading biological results. This article is categorized under: Algorithmic Development &gt; Biological Data Mining Technologies &gt; Machine Learning</abstract><doi>10.1002/widm.1523</doi><orcidid>https://orcid.org/0000-0001-8534-3480</orcidid><orcidid>https://orcid.org/0000-0003-4083-5881</orcidid><oa>free_for_read</oa></addata></record>
fulltext fulltext
identifier ISSN: 1942-4787
ispartof Wiley interdisciplinary reviews. Data mining and knowledge discovery, 2024-03, Vol.14 (2)
issn 1942-4787
1942-4795
language eng
recordid cdi_crossref_primary_10_1002_widm_1523
source Access via Wiley Online Library
title The use of gene expression datasets in feature selection research: 20 years of inherent bias?
url https://sfx.bib-bvb.de/sfx_tum?ctx_ver=Z39.88-2004&ctx_enc=info:ofi/enc:UTF-8&ctx_tim=2024-12-19T03%3A57%3A13IST&url_ver=Z39.88-2004&url_ctx_fmt=infofi/fmt:kev:mtx:ctx&rfr_id=info:sid/primo.exlibrisgroup.com:primo3-Article-crossref&rft_val_fmt=info:ofi/fmt:kev:mtx:journal&rft.genre=article&rft.atitle=The%20use%20of%20gene%20expression%20datasets%20in%20feature%20selection%20research:%2020%E2%80%89years%20of%20inherent%20bias?&rft.jtitle=Wiley%20interdisciplinary%20reviews.%20Data%20mining%20and%20knowledge%20discovery&rft.au=Grisci,%20Bruno%20I.&rft.date=2024-03&rft.volume=14&rft.issue=2&rft.issn=1942-4787&rft.eissn=1942-4795&rft_id=info:doi/10.1002/widm.1523&rft_dat=%3Ccrossref%3E10_1002_widm_1523%3C/crossref%3E%3Curl%3E%3C/url%3E&disable_directlink=true&sfx.directlink=off&sfx.report_link=0&rft_id=info:oai/&rft_id=info:pmid/&rfr_iscdi=true