An Experimental Comparison of Naive Bayesian and Keyword-Based Anti-Spam Filtering with Personal E-mail Messages
Proceedings of the 23rd Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, N.J. Belkin, P. Ingwersen and M.-K. Leong (Eds.), Athens, Greece, July 24-28, 2000, pages 160-167 The growing problem of unsolicited bulk e-mail, also known as "spam", ha...
Main Authors: | , , , |
---|---|
Format: | Article |
Language: | eng |
Subjects: | |
Summary: | Proceedings of the 23rd Annual International ACM SIGIR Conference
on Research and Development in Information Retrieval, N.J. Belkin, P.
Ingwersen and M.-K. Leong (Eds.), Athens, Greece, July 24-28, 2000, pages
160-167. The growing problem of unsolicited bulk e-mail, also known as "spam", has
generated a need for reliable anti-spam e-mail filters. Filters of this type
have so far been based mostly on manually constructed keyword patterns. An
alternative approach has recently been proposed, whereby a Naive Bayesian
classifier is trained automatically to detect spam messages. We test this
approach on a large collection of personal e-mail messages, which we make
publicly available in "encrypted" form, contributing towards standard
benchmarks. We introduce appropriate cost-sensitive measures, investigating at
the same time the effect of attribute-set size, training-corpus size,
lemmatization, and stop lists, issues that have not been explored in previous
experiments. Finally, the Naive Bayesian filter is compared, in terms of
performance, to a filter that uses keyword patterns and is part of a
widely used e-mail reader. |
---|---|
DOI: | 10.48550/arxiv.cs/0008019 |