Pictures of relevance: A geometric analysis of similarity measures

We want computer systems that can help us assess the similarity or relevance of existing objects (e.g., documents, functions, commands, etc.) to a statement of our current needs (e.g., the query). Towards this end, a variety of similarity measures have been proposed. However, the relationship betwee...

Ausführliche Beschreibung

Gespeichert in:
Bibliographische Detailangaben
Veröffentlicht in:Journal of the American Society for Information Science 1987-11, Vol.38 (6), p.420-442
Hauptverfasser: Jones, William P., Furnas, George W.
Format: Artikel
Sprache:eng
Schlagworte:
Online-Zugang:Volltext
Tags: Tag hinzufügen
Keine Tags, Fügen Sie den ersten Tag hinzu!
Beschreibung
Zusammenfassung:We want computer systems that can help us assess the similarity or relevance of existing objects (e.g., documents, functions, commands, etc.) to a statement of our current needs (e.g., the query). Towards this end, a variety of similarity measures have been proposed. However, the relationship between a measure's formula and its performance is not always obvious. A geometric analysis is advanced and its utility demonstrated through its application to six conventional information retrieval similarity measures and a seventh spreading activation measure. All seven similarity measures work with a representational scheme wherein a query and the database objects are represented as vectors of term weights. A geometric analysis characterizes each similarity measure by the nature of its iso‐similarity contours in an n‐space containing query and object vectors. This analysis reveals important differences among the similarity measures and suggests conditions in which these differences will affect retrieval performance. The cosine coefficient, for example, is shown to be insensitive to between‐document differences in the magnitude of term weights while the inner product measure is sometimes overly affected by such differences. The context‐sensitive spreading activation measure may overcome both of these limitations and deserves further study. The geometric analysis is intended to complement, and perhaps to guide, the empirical analysis of similarity measures. © 1987 John Wiley & Sons, Inc.
ISSN:0002-8231
1097-4571
DOI:10.1002/(SICI)1097-4571(198711)38:6<420::AID-ASI3>3.0.CO;2-S