A comparative analysis of similarity measures akin to the Jaccard index in collaborative recommendations: empirical and theoretical perspective


Bibliographic details
Published in: Social network analysis and mining 2020-12, Vol.10 (1), p.43, Article 43
Main authors: Verma, Vijay; Aggarwal, Rajesh Kumar
Format: Article
Language: English (eng)
Description
Abstract: The Jaccard index, originally proposed by Jaccard (Bull Soc Vaudoise Sci Nat 37:241–272, 1901), is a measure for examining the similarity (or dissimilarity) between two sample data objects. It is defined as the ratio of the intersection size to the union size of the two data samples, and it provides a very simple and intuitive measure of similarity between them. This research examines measures that are akin to the Jaccard index and may be used for modelling affinity between users (or items) in collaborative recommendations. In particular, the simple matching coefficient (SMC), Sorensen–Dice coefficient (SDC), Salton’s cosine index (SCI), and overlap coefficient (OLC) are compared with the Jaccard index and analysed from both theoretical and empirical perspectives. Since these measures capture only the structural similarity (overlap) between the data samples, they are very useful in situations where only the associations between users and items are available, such as the browsing or buying behaviour of users on an e-commerce portal (i.e. unary rating data, a special case of ratings). Furthermore, a theoretical relation among these measures has been established. We have also derived an equivalent expression for each of these measures, stated in data mining/machine learning terms, so that it can be applied directly to binary data samples. To compare and validate the effectiveness of these structural similarity measures, several experiments have been conducted using standardized benchmark datasets (MovieLens, FilmTrust, Epinions, Yahoo! Movies, and Yahoo! Music). The empirical results demonstrate that Salton’s cosine index (SCI) provides better accuracy (in terms of MAE, RMSE, and precision) for large datasets, whereas the overlap coefficient (OLC) results in more accurate recommendations for small datasets.
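For orientation, the five measures named in the abstract can be sketched on unary data using their commonly cited set-based definitions. This is a minimal illustration, not the article's own formulation: the function name `structural_similarities`, the parameter `n_items` (the size of the item catalogue, which SMC needs in order to count items that neither user touched), and the two example users are all assumptions made here.

```python
from math import sqrt


def structural_similarities(a, b, n_items):
    """Overlap-based similarities between two users' item sets.

    a, b    -- iterables of item ids each user interacted with (unary data)
    n_items -- assumed total number of items in the catalogue (used by SMC only)
    """
    a, b = set(a), set(b)
    both = len(a & b)            # items both users interacted with
    either = len(a | b)          # items at least one user interacted with
    neither = n_items - either   # items neither user interacted with

    return {
        # intersection size over union size
        "jaccard": both / either if either else 0.0,
        # agreements (both present or both absent) over all items
        "smc": (both + neither) / n_items if n_items else 0.0,
        # twice the intersection over the sum of the set sizes
        "sdc": 2 * both / (len(a) + len(b)) if either else 0.0,
        # intersection over the geometric mean of the set sizes
        "sci": both / sqrt(len(a) * len(b)) if a and b else 0.0,
        # intersection over the smaller set size
        "olc": both / min(len(a), len(b)) if a and b else 0.0,
    }


# Hypothetical users in a catalogue of 10 items.
alice = {"i1", "i2", "i3", "i4"}
bob = {"i3", "i4", "i5"}
scores = structural_similarities(alice, bob, n_items=10)
for name, value in scores.items():
    print(f"{name}: {value:.3f}")
```

With these definitions the values always satisfy Jaccard ≤ SDC ≤ SCI ≤ OLC, which follows from the inequalities between the harmonic mean, geometric mean, and minimum of the two set sizes. Whether this is the relation derived in the article cannot be determined from the abstract alone.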
ISSN: 1869-5450, 1869-5469
DOI: 10.1007/s13278-020-00660-9