Domain expertise–agnostic feature selection for the analysis of breast cancer data


Bibliographic details
Published in: Artificial Intelligence in Medicine, 2020-08, Vol. 108, p. 101928-101928, Article 101928
Main authors: Pozzoli, Susanna; Soliman, Amira; Bahri, Leila; Branca, Rui Mamede; Girdzijauskas, Sarunas; Brambilla, Marco
Format: Article
Language: English
Online access: Full text
Description
Summary:
• We propose a three-step wrapper method for the discovery of connected protein networks underlying particular molecular and cellular processes which characterize distinct behaviors in tumors in a manner complementary to the current PAM50-based breast cancer classification.
• Our method does not depend on a body of specialist knowledge and is complementary to current research on cancer biology.
• The protein clusters that showed top-scoring modularities recapitulated many of the cellular phenotypes characteristic of cancer cells.

Progress in proteomics has enabled biologists to accurately measure the amount of protein in a tumor. This work is based on a breast cancer data set that resulted from the proteomics analysis of a cohort of tumors carried out at Karolinska Institutet. While evidence suggests that an anomaly in the protein content is related to the cancerous nature of tumors, the proteins that could be markers of cancer types and subtypes, and the underlying interactions, are not completely known. This work sheds light on the potential of unsupervised learning in the analysis of such data sets, namely in the detection of distinctive proteins for the identification of cancer subtypes in the absence of domain expertise. In the analyzed data set, the number of samples (tumors) is significantly lower than the number of features (proteins); consequently, the input data can be treated as high-dimensional data. The use of high-dimensional data has already become widespread, and a great deal of effort has been put into high-dimensional data analysis by means of feature selection, but it is still largely based on prior specialist knowledge, which in this case is incomplete.
There is a growing need for unsupervised feature selection, which raises the issue of how to generate promising subsets of features among all the possible combinations, as well as how to evaluate the quality of these subsets in the absence of specialist knowledge. We hereby propose a new wrapper method for the generation and evaluation of subsets of features via spectral clustering and modularity, respectively. We conduct experiments to test the effectiveness of the new method in the analysis of the breast cancer data, in a domain expertise–agnostic context. Furthermore, we show that we can successfully augment our method by incorporating an external source of data on known protein complexes. Our approach reveals a large number of subsets of features that
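
The generate-and-evaluate idea described in the summary can be illustrated with a minimal sketch. It is not the authors' implementation: the toy similarity matrix `W`, the helper names `fiedler_vector`, `spectral_bipartition` and `modularity`, and the restriction to a two-way split are all assumptions made for illustration. The sketch partitions a small weighted feature graph using the sign of the second-smallest Laplacian eigenvector (the simplest form of spectral clustering) and then scores the partition with Newman's modularity.

```python
import math

def fiedler_vector(W, iters=200):
    """Approximate the eigenvector of the second-smallest eigenvalue of the
    graph Laplacian L = D - W by power iteration on (shift*I - L)."""
    n = len(W)
    deg = [sum(row) for row in W]
    shift = 2 * max(deg)  # upper bound on the largest Laplacian eigenvalue
    v = [float(i) for i in range(n)]  # deterministic, non-constant start
    for _ in range(iters):
        # Remove the constant component (the eigenvector for eigenvalue 0).
        mean = sum(v) / n
        v = [x - mean for x in v]
        # Apply (shift*I - L) = (shift - deg_i) * v_i + sum_j W_ij * v_j.
        v = [(shift - deg[i]) * v[i] + sum(W[i][j] * v[j] for j in range(n))
             for i in range(n)]
        norm = math.sqrt(sum(x * x for x in v))
        v = [x / norm for x in v]
    return v

def spectral_bipartition(W):
    """Assign each node to cluster 0 or 1 by the sign of its Fiedler entry."""
    return [1 if x > 0 else 0 for x in fiedler_vector(W)]

def modularity(W, labels):
    """Newman modularity of a partition of a weighted, undirected graph."""
    n = len(W)
    deg = [sum(row) for row in W]
    two_m = sum(deg)  # twice the total edge weight
    q = 0.0
    for i in range(n):
        for j in range(n):
            if labels[i] == labels[j]:
                q += W[i][j] - deg[i] * deg[j] / two_m
    return q / two_m

# Hypothetical similarity graph over six features: two tightly linked
# triples joined by one weak edge (weights are made up for illustration).
W = [[0.0] * 6 for _ in range(6)]
for i, j, w in [(0, 1, 1.0), (0, 2, 1.0), (1, 2, 1.0),
                (3, 4, 1.0), (3, 5, 1.0), (4, 5, 1.0),
                (2, 3, 0.1)]:
    W[i][j] = W[j][i] = w

labels = spectral_bipartition(W)
print("cluster labels:", labels)
print("modularity:", modularity(W, labels))
```

In a real analysis the similarity matrix would have to be derived from the measured data (for example from correlations between protein abundance profiles, which is an assumption here), and a higher modularity would indicate a more clearly separated candidate subset of features.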
ISSN:0933-3657
1873-2860
DOI:10.1016/j.artmed.2020.101928