Utilizing Semantic Textual Similarity for Clinical Survey Data Feature Selection
Main authors: | , , , , |
---|---|
Format: | Article |
Language: | eng |
Subjects: | |
Abstract: | Survey data can contain a high number of features while having a
comparatively low quantity of examples. Machine learning models that attempt to
predict outcomes from survey data under these conditions can overfit and result
in poor generalizability. One remedy to this issue is feature selection, which
attempts to select an optimal subset of features to learn upon. A relatively
unexplored source of information in the feature selection process is the usage
of textual names of features, which may be semantically indicative of which
features are relevant to a target outcome. The relationships between feature
names and target names can be evaluated using language models (LMs) to produce
semantic textual similarity (STS) scores, which can then be used to select
features. We examine the performance of using STS to select features both directly and
within the minimal-redundancy-maximal-relevance (mRMR) algorithm. The performance
of STS as a feature selection metric is evaluated against preliminary survey
data collected as a part of a clinical study on persistent post-surgical pain
(PPSP). The results suggest that features selected with STS can yield
higher-performing models than traditional feature selection algorithms. |
---|---|
DOI: | 10.48550/arxiv.2308.09892 |
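
The idea described in the abstract, scoring each feature name against the target name and selecting features from those scores, can be illustrated with a minimal sketch. This is not the authors' implementation. The function `toy_embed` is a bag-of-words stand-in for a real language model, the feature and target names are hypothetical, and using name-to-name similarity as the mRMR redundancy term is an assumption made here for illustration (the abstract does not specify how redundancy is computed).

```python
import math
from collections import Counter


def toy_embed(text):
    """Stand-in for a language-model embedding (assumption): bag-of-words counts."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def sts_scores(feature_names, target_name, embed=toy_embed):
    """Similarity score between each feature name and the target name."""
    target_vec = embed(target_name)
    return {name: cosine(embed(name), target_vec) for name in feature_names}


def select_top_k(feature_names, target_name, k, embed=toy_embed):
    """Direct selection: keep the k feature names most similar to the target."""
    scores = sts_scores(feature_names, target_name, embed)
    return sorted(feature_names, key=lambda n: scores[n], reverse=True)[:k]


def select_mrmr(feature_names, target_name, k, embed=toy_embed):
    """Greedy mRMR (difference form): relevance minus mean redundancy.

    Relevance is the STS score to the target. Redundancy is the mean
    name-to-name similarity to already selected features (an assumption).
    """
    relevance = sts_scores(feature_names, target_name, embed)
    vecs = {n: embed(n) for n in feature_names}
    selected, remaining = [], list(feature_names)
    while remaining and len(selected) < k:
        def score(n):
            if not selected:
                return relevance[n]
            redundancy = sum(cosine(vecs[n], vecs[s]) for s in selected) / len(selected)
            return relevance[n] - redundancy
        best = max(remaining, key=score)
        selected.append(best)
        remaining.remove(best)
    return selected


# Hypothetical survey feature names and target name, for illustration only.
features = [
    "pain intensity after surgery",
    "pain severity after surgery",
    "persistent fatigue",
    "age",
]
target = "persistent pain after surgery"

top2 = select_top_k(features, target, k=2)
mrmr2 = select_mrmr(features, target, k=2)
print(top2)
print(mrmr2)
```

In this toy setup, direct selection keeps two near-duplicate pain items, while the mRMR variant swaps the second one for a less redundant feature. With a real language model, `embed` would be replaced by a function returning sentence embeddings, and `cosine` by the matching vector similarity.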