Social media mining for birth defects research: A rule-based, bootstrapping approach to collecting data for rare health-related events on Twitter

Bibliographic Details
Published in: Journal of Biomedical Informatics, 2018-11, Vol. 87, p. 68-78
Main authors: Klein, Ari Z., Sarker, Abeed, Cai, Haitao, Weissenbacher, Davy, Gonzalez-Hernandez, Graciela
Format: Article
Language: English
Subjects:
Online access: Full text
Description
Abstract:
• Rare health-related events (in this case, birth defects) are reported on Twitter.
• An NLP-based approach was deployed to collect sparse tweets for manual annotation.
• Pregnancies with birth defect outcomes can be observed on Twitter.
• Congenital heart defects are the most common birth defect reported on Twitter.
• Social media mining can provide unique opportunities for epidemiological insights.

Although birth defects are the leading cause of infant mortality in the United States, methods for observing human pregnancies with birth defect outcomes are limited. The primary objectives of this study were (i) to assess whether rare health-related events (in this case, birth defects) are reported on social media, (ii) to design and deploy a natural language processing (NLP) approach for collecting such sparse data from social media, and (iii) to utilize the collected data to discover a cohort of women whose pregnancies with birth defect outcomes could be observed on social media for epidemiological analysis. To assess whether birth defects are mentioned on social media, we mined 432 million tweets posted by 112,647 users who were automatically identified via their public announcements of pregnancy on Twitter. To retrieve tweets that mention birth defects, we developed a rule-based, bootstrapping approach, which relies on a lexicon, lexical variants generated from the lexicon entries, regular expressions, post-processing, and manual analysis guided by distributional properties. To identify users whose pregnancies with birth defect outcomes could be observed for epidemiological analysis, the inclusion criteria were (i) tweets indicating that the user’s child has a birth defect, and (ii) access to the user’s tweets during pregnancy.
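Note: the retrieval step described above (a lexicon, lexical variants generated from its entries, and regular expressions) can be illustrated with a minimal sketch. This is not the authors' implementation. The three lexicon entries, the variant rules (spaces removed, hyphenated, one-character deletions as crude misspellings), and the names `LEXICON`, `lexical_variants`, `build_pattern`, and `mentions_birth_defect` are all assumptions chosen for illustration. Post-processing and the manual analysis guided by distributional properties are not modeled.

```python
import re

# Hypothetical mini-lexicon; the study's actual lexicon is not reproduced here.
LEXICON = ["cleft palate", "spina bifida", "congenital heart defect"]


def lexical_variants(term):
    """Return the term plus simple, assumed variants of it."""
    variants = {term}
    variants.add(term.replace(" ", ""))   # hashtag-style, e.g. "cleftpalate"
    variants.add(term.replace(" ", "-"))  # hyphenated, e.g. "cleft-palate"
    # Crude misspellings: delete one non-space character (never the first).
    for i in range(1, len(term)):
        if term[i] != " ":
            variants.add(term[:i] + term[i + 1:])
    return variants


def build_pattern(lexicon):
    """Compile one case-insensitive regex covering every variant."""
    all_variants = set()
    for term in lexicon:
        all_variants |= lexical_variants(term)
    # Longest variants first so the alternation prefers the fullest match.
    ordered = sorted(all_variants, key=len, reverse=True)
    alternation = "|".join(re.escape(v) for v in ordered)
    # Optional leading "#", and no word characters directly on either side.
    return re.compile(r"(?<!\w)#?(?:" + alternation + r")(?!\w)", re.IGNORECASE)


PATTERN = build_pattern(LEXICON)


def mentions_birth_defect(tweet):
    """True if the tweet text contains any lexicon entry or variant."""
    return PATTERN.search(tweet) is not None
```

A match here only means that a tweet mentions a birth defect. As the abstract goes on to explain, deciding whether the tweet indicates that the user's child has a birth defect was done by manual annotation.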
We conducted a semi-automatic evaluation to estimate the recall of the tweet-collection approach, and performed a preliminary assessment of the prevalence of selected birth defects among the pregnancy cohort derived from Twitter. We manually annotated 16,822 retrieved tweets, distinguishing tweets indicating that the user’s child has a birth defect (true positives) from tweets that merely mention birth defects (false positives). Inter-annotator agreement was substantial: κ = 0.79 (Cohen’s kappa). Analyzing the timelines of the 646 users whose tweets were true positives resulted in the discovery of 195 users that met the inclusion criteria. Congenital heart defects are the most common type of birth defect reported on Twitter
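Note: the inter-annotator agreement reported above is Cohen's kappa, which corrects the observed agreement p_o between two annotators for the agreement p_e expected by chance: κ = (p_o − p_e) / (1 − p_e). The sketch below computes it for two label sequences. The function name `cohens_kappa` and the ten toy labels are assumptions for illustration only; they are not the study's annotations and do not reproduce the reported value.

```python
from collections import Counter


def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators labeling the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("label sequences must be non-empty and equal in length")
    n = len(labels_a)
    # Observed agreement: share of items given the same label by both.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement: product of the annotators' marginal label rates,
    # summed over all labels either annotator used.
    freq_a = Counter(labels_a)
    freq_b = Counter(labels_b)
    categories = freq_a.keys() | freq_b.keys()
    expected = sum(freq_a[c] * freq_b[c] for c in categories) / (n * n)
    if expected == 1.0:
        return 1.0  # both annotators used a single, identical label
    return (observed - expected) / (1 - expected)


# Toy data (hypothetical): 1 = user's child has a birth defect, 0 = mere mention.
annotator_1 = [1, 1, 1, 0, 0, 0, 1, 0, 1, 0]
annotator_2 = [1, 1, 0, 0, 0, 0, 1, 0, 1, 1]
kappa = cohens_kappa(annotator_1, annotator_2)
```

In this toy example the annotators agree on 8 of 10 items (p_o = 0.8) and chance agreement is p_e = 0.5, so kappa comes out at approximately 0.6.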
ISSN: 1532-0464, 1532-0480
DOI: 10.1016/j.jbi.2018.10.001