Finding Malicious Cyber Discussions in Social Media


Bibliographic Details
Main Authors: Lippmann, Richard P; Campbell, Joseph P; Weller-Fahy, David J; Mensch, Alyssa C; Campbell, William M
Format: Report
Language: English
Description
Summary: Security analysts gather essential information on cyber attacks, exploits, vulnerabilities, and victims by manually searching social media sites. This effort can be dramatically reduced using natural language machine learning techniques. Using a new English text corpus containing more than 250k discussions from Stack Exchange, Reddit, and Twitter on cyber and non-cyber topics, we demonstrate the ability to detect more than 90% of the cyber discussions with fewer than 1% false alarms. If an original searched document corpus includes only 5% cyber documents, then our processing provides an enriched corpus for analysts where 83% to 95% of the documents are on cyber topics. Good performance was obtained using TF-IDF features and logistic regression. A classifier trained using prior historical data accurately detected 86% of emergent Heartbleed discussions, and retrospective experiments demonstrate that classifier performance remains stable up to a year without retraining.
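
The enrichment figure in the summary follows from the detection and false-alarm rates by a short calculation. The sketch below is illustrative only: the function name `enriched_fraction` is hypothetical, and it assumes the operating point sits exactly at 90% detection and 1% false alarms, whereas the summary gives these only as bounds ("more than 90%", "fewer than 1%"). The operating points behind the reported 83% to 95% range are not stated in this record.

```python
def enriched_fraction(prior, detection_rate, false_alarm_rate):
    """Fraction of flagged documents that are truly on cyber topics.

    prior            -- share of cyber documents in the original corpus
    detection_rate   -- share of cyber documents the classifier flags
    false_alarm_rate -- share of non-cyber documents the classifier flags
    """
    true_hits = prior * detection_rate
    false_hits = (1.0 - prior) * false_alarm_rate
    return true_hits / (true_hits + false_hits)


# Assumed operating point: 5% cyber documents, 90% detected, 1% false alarms.
print(round(enriched_fraction(0.05, 0.90, 0.01), 3))  # -> 0.826
```

Under these assumptions roughly 83% of the flagged documents are on cyber topics, which matches the lower end of the range in the summary. Holding the detection rate fixed, a lower false-alarm rate raises the enriched fraction.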