The use of gene expression datasets in feature selection research: 20 years of inherent bias?
Feature selection algorithms are frequently employed in preprocessing machine learning pipelines applied to biological data to identify relevant features. The use of feature selection in gene expression studies began at the end of the 1990s with the analysis of human cancer microarray datasets. Sinc...
Published in: | Wiley interdisciplinary reviews. Data mining and knowledge discovery 2024-03, Vol.14 (2) |
---|---|
Main authors: | Grisci, Bruno I.; Feltes, Bruno César; de Faria Poloni, Joice; Narloch, Pedro H.; Dorn, Márcio |
Format: | Article |
Language: | eng |
Online access: | Full text |
container_end_page | |
---|---|
container_issue | 2 |
container_start_page | |
container_title | Wiley interdisciplinary reviews. Data mining and knowledge discovery |
container_volume | 14 |
creator | Grisci, Bruno I.; Feltes, Bruno César; de Faria Poloni, Joice; Narloch, Pedro H.; Dorn, Márcio |
description | Feature selection algorithms are frequently employed in preprocessing machine learning pipelines applied to biological data to identify relevant features. The use of feature selection in gene expression studies began at the end of the 1990s with the analysis of human cancer microarray datasets. Since then, gene expression technology has been perfected, the Human Genome Project has been completed, new microarray platforms have been created and discontinued, and RNA‐seq has gradually replaced microarrays. However, most feature selection methods in the last two decades were designed, evaluated, and validated on the same datasets from the microarray technology's infancy. In this review of over 1200 publications regarding feature selection and gene expression, published between 2010 and 2020, we found that 57% of the publications used at least one outdated dataset, 23% used only outdated data, and 32% did not cite data sources. Other issues include referencing databases that are no longer available, the slow adoption of RNA‐seq datasets, and bias toward human cancer data, even for methods designed for a broader scope. In the most popular datasets, some being 23 years old, mislabeled samples, experimental biases, distribution shifts, and the absence of classification challenges are common. These problems are more predominant in publications with computer science backgrounds compared to publications from biology and can lead to inaccurate and misleading biological results.
This article is categorized under:
Algorithmic Development > Biological Data Mining
Technologies > Machine Learning |
doi_str_mv | 10.1002/widm.1523 |
format | Article |
fullrecord | <record><control><sourceid>crossref</sourceid><recordid>TN_cdi_crossref_primary_10_1002_widm_1523</recordid><sourceformat>XML</sourceformat><sourcesystem>PC</sourcesystem><sourcerecordid>10_1002_widm_1523</sourcerecordid><originalsourceid>FETCH-LOGICAL-c264t-b9e5908eba5f46e691b06e57682246d439db5f18620f6ef828a05f456a8e4b083</originalsourceid><addsrcrecordid>eNo9kDFPwzAUhC0EElXpwD_wypBiO7bjsCBUQUGqxFJmy06eiVGbVn6uoBsrf5NfQiIQb7l3utMNHyGXnM05Y-L6PbbbOVeiPCETXktRyKpWp_-_qc7JDPGNDVcKY4yYELvugB4Q6C7QV-iBwsc-AWLc9bR12SFkpLGnAVw-JKAIG2jymA4tcKnpbqhg359fx8HguBL7DhL0mfro8PaCnAW3QZj96ZS8PNyvF4_F6nn5tLhbFY3QMhe-BlUzA96pIDXomnumQVXaCCF1K8u69SpwowULGoIRxrGhqbQzID0z5ZRc_e42aYeYINh9iluXjpYzO8KxIxw7wil_AGY5WKw</addsrcrecordid><sourcetype>Aggregation Database</sourcetype><iscdi>true</iscdi><recordtype>article</recordtype></control><display><type>article</type><title>The use of gene expression datasets in feature selection research: 20 years of inherent bias?</title><source>Access via Wiley Online Library</source><creator>Grisci, Bruno I. ; Feltes, Bruno César ; de Faria Poloni, Joice ; Narloch, Pedro H. ; Dorn, Márcio</creator><creatorcontrib>Grisci, Bruno I. ; Feltes, Bruno César ; de Faria Poloni, Joice ; Narloch, Pedro H. ; Dorn, Márcio</creatorcontrib><description>Feature selection algorithms are frequently employed in preprocessing machine learning pipelines applied to biological data to identify relevant features. The use of feature selection in gene expression studies began at the end of the 1990s with the analysis of human cancer microarray datasets. Since then, gene expression technology has been perfected, the Human Genome Project has been completed, new microarray platforms have been created and discontinued, and RNA‐seq has gradually replaced microarrays. However, most feature selection methods in the last two decades were designed, evaluated, and validated on the same datasets from the microarray technology's infancy. 
In this review of over 1200 publications regarding feature selection and gene expression, published between 2010 and 2020, we found that 57% of the publications used at least one outdated dataset, 23% used only outdated data, and 32% did not cite data sources. Other issues include referencing databases that are no longer available, the slow adoption of RNA‐seq datasets, and bias toward human cancer data, even for methods designed for a broader scope. In the most popular datasets, some being 23 years old, mislabeled samples, experimental biases, distribution shifts, and the absence of classification challenges are common. These problems are more predominant in publications with computer science backgrounds compared to publications from biology and can lead to inaccurate and misleading biological results.
This article is categorized under:
Algorithmic Development > Biological Data Mining
Technologies > Machine Learning</description><identifier>ISSN: 1942-4787</identifier><identifier>EISSN: 1942-4795</identifier><identifier>DOI: 10.1002/widm.1523</identifier><language>eng</language><ispartof>Wiley interdisciplinary reviews. Data mining and knowledge discovery, 2024-03, Vol.14 (2)</ispartof><lds50>peer_reviewed</lds50><oa>free_for_read</oa><woscitedreferencessubscribed>false</woscitedreferencessubscribed><citedby>FETCH-LOGICAL-c264t-b9e5908eba5f46e691b06e57682246d439db5f18620f6ef828a05f456a8e4b083</citedby><cites>FETCH-LOGICAL-c264t-b9e5908eba5f46e691b06e57682246d439db5f18620f6ef828a05f456a8e4b083</cites><orcidid>0000-0001-8534-3480 ; 0000-0003-4083-5881</orcidid></display><links><openurl>$$Topenurl_article</openurl><openurlfulltext>$$Topenurlfull_article</openurlfulltext><thumbnail>$$Tsyndetics_thumb_exl</thumbnail><link.rule.ids>314,780,784,27924,27925</link.rule.ids></links><search><creatorcontrib>Grisci, Bruno I.</creatorcontrib><creatorcontrib>Feltes, Bruno César</creatorcontrib><creatorcontrib>de Faria Poloni, Joice</creatorcontrib><creatorcontrib>Narloch, Pedro H.</creatorcontrib><creatorcontrib>Dorn, Márcio</creatorcontrib><title>The use of gene expression datasets in feature selection research: 20 years of inherent bias?</title><title>Wiley interdisciplinary reviews. Data mining and knowledge discovery</title><description>Feature selection algorithms are frequently employed in preprocessing machine learning pipelines applied to biological data to identify relevant features. The use of feature selection in gene expression studies began at the end of the 1990s with the analysis of human cancer microarray datasets. Since then, gene expression technology has been perfected, the Human Genome Project has been completed, new microarray platforms have been created and discontinued, and RNA‐seq has gradually replaced microarrays. 
However, most feature selection methods in the last two decades were designed, evaluated, and validated on the same datasets from the microarray technology's infancy. In this review of over 1200 publications regarding feature selection and gene expression, published between 2010 and 2020, we found that 57% of the publications used at least one outdated dataset, 23% used only outdated data, and 32% did not cite data sources. Other issues include referencing databases that are no longer available, the slow adoption of RNA‐seq datasets, and bias toward human cancer data, even for methods designed for a broader scope. In the most popular datasets, some being 23 years old, mislabeled samples, experimental biases, distribution shifts, and the absence of classification challenges are common. These problems are more predominant in publications with computer science backgrounds compared to publications from biology and can lead to inaccurate and misleading biological results.
This article is categorized under:
Algorithmic Development > Biological Data Mining
Technologies > Machine Learning</description><issn>1942-4787</issn><issn>1942-4795</issn><fulltext>true</fulltext><rsrctype>article</rsrctype><creationdate>2024</creationdate><recordtype>article</recordtype><recordid>eNo9kDFPwzAUhC0EElXpwD_wypBiO7bjsCBUQUGqxFJmy06eiVGbVn6uoBsrf5NfQiIQb7l3utMNHyGXnM05Y-L6PbbbOVeiPCETXktRyKpWp_-_qc7JDPGNDVcKY4yYELvugB4Q6C7QV-iBwsc-AWLc9bR12SFkpLGnAVw-JKAIG2jymA4tcKnpbqhg359fx8HguBL7DhL0mfro8PaCnAW3QZj96ZS8PNyvF4_F6nn5tLhbFY3QMhe-BlUzA96pIDXomnumQVXaCCF1K8u69SpwowULGoIRxrGhqbQzID0z5ZRc_e42aYeYINh9iluXjpYzO8KxIxw7wil_AGY5WKw</recordid><startdate>202403</startdate><enddate>202403</enddate><creator>Grisci, Bruno I.</creator><creator>Feltes, Bruno César</creator><creator>de Faria Poloni, Joice</creator><creator>Narloch, Pedro H.</creator><creator>Dorn, Márcio</creator><scope>AAYXX</scope><scope>CITATION</scope><orcidid>https://orcid.org/0000-0001-8534-3480</orcidid><orcidid>https://orcid.org/0000-0003-4083-5881</orcidid></search><sort><creationdate>202403</creationdate><title>The use of gene expression datasets in feature selection research: 20 years of inherent bias?</title><author>Grisci, Bruno I. ; Feltes, Bruno César ; de Faria Poloni, Joice ; Narloch, Pedro H. ; Dorn, Márcio</author></sort><facets><frbrtype>5</frbrtype><frbrgroupid>cdi_FETCH-LOGICAL-c264t-b9e5908eba5f46e691b06e57682246d439db5f18620f6ef828a05f456a8e4b083</frbrgroupid><rsrctype>articles</rsrctype><prefilter>articles</prefilter><language>eng</language><creationdate>2024</creationdate><toplevel>peer_reviewed</toplevel><toplevel>online_resources</toplevel><creatorcontrib>Grisci, Bruno I.</creatorcontrib><creatorcontrib>Feltes, Bruno César</creatorcontrib><creatorcontrib>de Faria Poloni, Joice</creatorcontrib><creatorcontrib>Narloch, Pedro H.</creatorcontrib><creatorcontrib>Dorn, Márcio</creatorcontrib><collection>CrossRef</collection><jtitle>Wiley interdisciplinary reviews. 
Data mining and knowledge discovery</jtitle></facets><delivery><delcategory>Remote Search Resource</delcategory><fulltext>fulltext</fulltext></delivery><addata><au>Grisci, Bruno I.</au><au>Feltes, Bruno César</au><au>de Faria Poloni, Joice</au><au>Narloch, Pedro H.</au><au>Dorn, Márcio</au><format>journal</format><genre>article</genre><ristype>JOUR</ristype><atitle>The use of gene expression datasets in feature selection research: 20 years of inherent bias?</atitle><jtitle>Wiley interdisciplinary reviews. Data mining and knowledge discovery</jtitle><date>2024-03</date><risdate>2024</risdate><volume>14</volume><issue>2</issue><issn>1942-4787</issn><eissn>1942-4795</eissn><abstract>Feature selection algorithms are frequently employed in preprocessing machine learning pipelines applied to biological data to identify relevant features. The use of feature selection in gene expression studies began at the end of the 1990s with the analysis of human cancer microarray datasets. Since then, gene expression technology has been perfected, the Human Genome Project has been completed, new microarray platforms have been created and discontinued, and RNA‐seq has gradually replaced microarrays. However, most feature selection methods in the last two decades were designed, evaluated, and validated on the same datasets from the microarray technology's infancy. In this review of over 1200 publications regarding feature selection and gene expression, published between 2010 and 2020, we found that 57% of the publications used at least one outdated dataset, 23% used only outdated data, and 32% did not cite data sources. Other issues include referencing databases that are no longer available, the slow adoption of RNA‐seq datasets, and bias toward human cancer data, even for methods designed for a broader scope. In the most popular datasets, some being 23 years old, mislabeled samples, experimental biases, distribution shifts, and the absence of classification challenges are common. 
These problems are more predominant in publications with computer science backgrounds compared to publications from biology and can lead to inaccurate and misleading biological results.
This article is categorized under:
Algorithmic Development > Biological Data Mining
Technologies > Machine Learning</abstract><doi>10.1002/widm.1523</doi><orcidid>https://orcid.org/0000-0001-8534-3480</orcidid><orcidid>https://orcid.org/0000-0003-4083-5881</orcidid><oa>free_for_read</oa></addata></record> |
fulltext | fulltext |
identifier | ISSN: 1942-4787 |
ispartof | Wiley interdisciplinary reviews. Data mining and knowledge discovery, 2024-03, Vol.14 (2) |
issn | 1942-4787 1942-4795 |
language | eng |
recordid | cdi_crossref_primary_10_1002_widm_1523 |
source | Access via Wiley Online Library |
title | The use of gene expression datasets in feature selection research: 20 years of inherent bias? |
url | https://sfx.bib-bvb.de/sfx_tum?ctx_ver=Z39.88-2004&ctx_enc=info:ofi/enc:UTF-8&ctx_tim=2024-12-19T03%3A57%3A13IST&url_ver=Z39.88-2004&url_ctx_fmt=infofi/fmt:kev:mtx:ctx&rfr_id=info:sid/primo.exlibrisgroup.com:primo3-Article-crossref&rft_val_fmt=info:ofi/fmt:kev:mtx:journal&rft.genre=article&rft.atitle=The%20use%20of%20gene%20expression%20datasets%20in%20feature%20selection%20research:%2020%E2%80%89years%20of%20inherent%20bias?&rft.jtitle=Wiley%20interdisciplinary%20reviews.%20Data%20mining%20and%20knowledge%20discovery&rft.au=Grisci,%20Bruno%20I.&rft.date=2024-03&rft.volume=14&rft.issue=2&rft.issn=1942-4787&rft.eissn=1942-4795&rft_id=info:doi/10.1002/widm.1523&rft_dat=%3Ccrossref%3E10_1002_widm_1523%3C/crossref%3E%3Curl%3E%3C/url%3E&disable_directlink=true&sfx.directlink=off&sfx.report_link=0&rft_id=info:oai/&rft_id=info:pmid/&rfr_iscdi=true |