Inferring the patient’s age from implicit age clues in health forum posts

Bibliographic details
Published in: Journal of biomedical informatics 2022-01, Vol. 125, Article 103976
Main authors: Black, Christopher M.; Meng, Weilin; Yao, Lixia; Ben Miled, Zina
Format: Article
Language: English
Online access: Full text
Description
Summary:
• Estimating the age of the patient from implicit age clues in health forum posts is important to many emerging health studies.
• The proposed classifier is able to label each post in the r/Cancer health forum according to the age group of the patient with an accuracy comparable to that of human annotators.
• The methodology can apply to other health forums and potentially reduce the need for manual annotation.

Broader patient-reported experiences in oncology are largely unknown because traditional data sources offer little information about them. Online health community data provide an exploratory way to uncover these experiences at a large scale, and analyzing these data can guide further studies towards understanding patients’ needs and experiences. However, analysis of online health data is inherently difficult because the data are unstructured and the same information can be expressed in text in many ways. In particular, subscribers may not disclose critical information such as the age of the patient in their posts. In the Reddit r/Cancer health forum considered in this paper, the number of posts that explicitly mention the age of the patient is significantly lower than the number of posts that do not. Health-focused studies often need to consider or control for age as a confounder, hence the importance of having sufficient age data. This paper presents a methodology that can help classify health forum posts according to four age groups (0–17, 18–39, 40–64 and 65+ years) even when the posts do not contain explicit mention of the age of the patient. First, the subset of the posts that include explicit mention of the age of the patient is identified. Second, the explicit age clues are removed from these posts, and the resulting posts, labeled with the age group given by the removed clue, are used to train the proposed age classifier. The resulting classifier is able to infer the age of the patient using only implicit age clues with an average true positive rate (TPR) of 71%. This TPR is comparable to the average TPR of 69% obtained from human annotations for the same set of posts.
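The two preparation steps and the reported metric can be illustrated with a minimal Python sketch. It is not the authors' implementation: the regular expression `AGE_PATTERN`, the helper names `age_group`, `label_and_strip` and `average_tpr`, and the reading of "average TPR" as the unweighted mean of per-group recall are all assumptions made for illustration. Only the four age-group boundaries come from the summary above.

```python
import re
from typing import List, Optional, Tuple

# Hypothetical pattern for explicit age mentions such as "62 years old",
# "62-year-old" or "62 yo". The rules used in the paper are not given here.
AGE_PATTERN = re.compile(
    r"\b(\d{1,3})[\s-]*(?:years?|yrs?|y)[\s-]*(?:old|o)\b",
    re.IGNORECASE,
)


def age_group(age: int) -> str:
    """Map an age in years to one of the four groups named in the summary."""
    if age <= 17:
        return "0-17"
    if age <= 39:
        return "18-39"
    if age <= 64:
        return "40-64"
    return "65+"


def label_and_strip(post: str) -> Optional[Tuple[str, str]]:
    """Return (post without explicit age clues, age group), or None.

    None means the post has no explicit age mention that this simple
    pattern recognizes, so it cannot serve as a training example.
    """
    match = AGE_PATTERN.search(post)
    if match is None:
        return None
    age = int(match.group(1))
    if age > 120:  # assumed sanity bound, not taken from the paper
        return None
    stripped = AGE_PATTERN.sub(" ", post)
    # Collapse the whitespace left behind by the removed clue.
    return " ".join(stripped.split()), age_group(age)


def average_tpr(y_true: List[str], y_pred: List[str]) -> float:
    """Unweighted mean of the per-group true positive rate (recall)."""
    rates = []
    for group in sorted(set(y_true)):
        total = sum(1 for t in y_true if t == group)
        hits = sum(1 for t, p in zip(y_true, y_pred) if t == group and p == group)
        rates.append(hits / total)
    return sum(rates) / len(rates)
```

Under these assumptions, `label_and_strip("My dad is 62 years old and has lung cancer.")` yields the stripped text `"My dad is and has lung cancer."` with the label `"40-64"`, so a classifier trained on such pairs sees only the implicit clues that remain in the text.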
ISSN: 1532-0464, 1532-0480
DOI: 10.1016/j.jbi.2021.103976