A comparative analysis of similarity measures akin to the Jaccard index in collaborative recommendations: empirical and theoretical perspective

The Jaccard index, originally proposed by Jaccard (Bull Soc Vaudoise Sci Nat 37:241–272, 1901), is a measure of the similarity (or dissimilarity) between two sample data objects. It is defined as the ratio of the intersection size to the union size of the two data samples. It provides a...

Detailed description

Saved in:
Bibliographic details
Published in: Social network analysis and mining 2020-12, Vol.10 (1), p.43, Article 43
Main authors: Verma, Vijay; Aggarwal, Rajesh Kumar
Format: Article
Language: eng
Subjects:
Online access: Full text
container_end_page
container_issue 1
container_start_page 43
container_title Social network analysis and mining
container_volume 10
creator Verma, Vijay
Aggarwal, Rajesh Kumar
description The Jaccard index, originally proposed by Jaccard (Bull Soc Vaudoise Sci Nat 37:241–272, 1901), is a measure of the similarity (or dissimilarity) between two sample data objects. It is defined as the ratio of the intersection size to the union size of the two data samples, and it provides a very simple and intuitive measure of similarity between them. This research examines measures that are akin to the Jaccard index and may be used for modelling affinity between users (or items) in collaborative recommendations. In particular, the simple matching coefficient (SMC), Sorensen–Dice coefficient (SDC), Salton’s cosine index (SCI), and overlap coefficient (OLC) are compared with the Jaccard index and analysed from both theoretical and empirical perspectives. Since these measures capture only the structural similarity (overlap information) between the data samples, they are very useful in situations where only the associations between users and items are available, such as the browsing or buying behaviour of users on an e-commerce portal (i.e. unary rating data, a special case of ratings). Furthermore, a theoretical relation among these measures has been established. We have also derived an equivalent expression for each of these measures so that it can be applied directly to binary data samples, as they are termed in data mining and machine learning. To compare and validate the effectiveness of these structural similarity measures, several experiments have been conducted using standardized benchmark datasets (MovieLens, FilmTrust, Epinions, Yahoo! Movies, and Yahoo! Music). The empirical results demonstrate that Salton’s cosine index (SCI) provides better accuracy (in terms of MAE, RMSE, and precision) for large datasets, whereas the overlap coefficient (OLC) yields more accurate recommendations for small datasets.
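The five measures named in the description can be illustrated with a minimal Python sketch. It uses the textbook set-based definitions, not the expressions derived in the article, which this record does not reproduce. The function names, the item identifiers, and the `universe` argument (the full item catalogue that the simple matching coefficient needs in order to count shared absences) are assumptions introduced here for illustration.

```python
from math import sqrt


def jaccard(a: set, b: set) -> float:
    """Intersection size divided by union size."""
    union = len(a | b)
    return len(a & b) / union if union else 0.0


def sorensen_dice(a: set, b: set) -> float:
    """Twice the intersection size divided by the sum of the set sizes."""
    total = len(a) + len(b)
    return 2 * len(a & b) / total if total else 0.0


def salton_cosine(a: set, b: set) -> float:
    """Intersection size divided by the geometric mean of the set sizes."""
    denom = sqrt(len(a) * len(b))
    return len(a & b) / denom if denom else 0.0


def overlap(a: set, b: set) -> float:
    """Intersection size divided by the size of the smaller set."""
    smaller = min(len(a), len(b))
    return len(a & b) / smaller if smaller else 0.0


def simple_matching(a: set, b: set, universe: set) -> float:
    """Share of items in the universe on which both sets agree
    (present in both, or absent from both)."""
    if not universe:
        return 0.0
    both = len(a & b)
    neither = len(universe - (a | b))
    return (both + neither) / len(universe)


# Hypothetical unary data: the items each user has interacted with.
u = {"i1", "i2", "i3"}
v = {"i2", "i3", "i4", "i5"}
catalogue = {"i1", "i2", "i3", "i4", "i5", "i6"}

for name, value in [
    ("Jaccard", jaccard(u, v)),
    ("SDC", sorensen_dice(u, v)),
    ("SCI", salton_cosine(u, v)),
    ("OLC", overlap(u, v)),
    ("SMC", simple_matching(u, v, catalogue)),
]:
    print(f"{name}: {value:.4f}")
```

For any pair of non-empty sets these definitions satisfy Jaccard ≤ SDC ≤ SCI ≤ OLC, and SDC equals 2J / (1 + J), where J is the Jaccard value. These are standard identities and may or may not coincide with the theoretical relation the article establishes.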
doi_str_mv 10.1007/s13278-020-00660-9
format Article
fullrecord <record><control><sourceid>proquest_cross</sourceid><recordid>TN_cdi_proquest_journals_2920667773</recordid><sourceformat>XML</sourceformat><sourcesystem>PC</sourcesystem><sourcerecordid>2920667773</sourcerecordid><originalsourceid>FETCH-LOGICAL-c443t-21716d624901da34e035c791651ae968dfd49029605049dd083c0907561aa91a3</originalsourceid><addsrcrecordid>eNp9kMtOwzAQRSMEEhX0B1hZYh0Yx4lTs6sqnqrEBtbWYE_BJYmDnSL6Ffwy7kOwY2N75HuP5ZNlZxwuOEB9Gbko6kkOBeQAUkKuDrIRn0iVV6VUh7_nCo6zcYxLAOAghAI5yr6nzPi2x4CD-ySGHTbr6CLzCxZd6xoMblizljCuAkWG765jg2fDG7EHNAaDZa6z9JXWBGoafPF7VKAEbqmzafRdvGLU9i44g016xW4IPtCwnXsKsSezqZ1mRwtsIo33-0n2fHP9NLvL54-397PpPDdlKYa84DWXVhalAm5RlASiMrXisuJISk7swqarQkmooFTWwkQYUFBXkiMqjuIkO99x--A_VhQHvfSrkH4fdaGKZLGua5FSxS5lgo8x0EL3wbUY1pqD3rjXO_c6uddb91qlktiVYgp3rxT-0P-0fgCXMoil</addsrcrecordid><sourcetype>Aggregation Database</sourcetype><iscdi>true</iscdi><recordtype>article</recordtype><pqid>2920667773</pqid></control><display><type>article</type><title>A comparative analysis of similarity measures akin to the Jaccard index in collaborative recommendations: empirical and theoretical perspective</title><source>Springer Nature - Complete Springer Journals</source><source>ProQuest Central</source><creator>Verma, Vijay ; Aggarwal, Rajesh Kumar</creator><creatorcontrib>Verma, Vijay ; Aggarwal, Rajesh Kumar</creatorcontrib><description>Jaccard index, originally proposed by Jaccard (Bull Soc Vaudoise Sci Nat 37:241–272, 1901), is a measure for examining the similarity (or dissimilarity) between two sample data objects. It is defined as the proportion of the intersection size to the union size of the two data samples. It provides a very simple and intuitive measure of similarity between data samples. This research examines the measures that are akin to the Jaccard index and may be used for modelling affinity between users (or items) in collaborative recommendations. 
Particularly, the measures such as simple matching coefficient (SMC), Sorensen–Dice coefficient (SDC), Salton’s cosine index (SCI), and overlap coefficient (OLC) are compared and analysed in both theoretical and empirical perspectives with respect to the Jaccard index. Since these measures apprehend only the structural similarity information (overlapping information) between the data samples, these are very useful in situations where only the associations between users and items are available such as browsing or buying behaviours of the users on an e-commerce portal (i.e. unary rating data, a special case of ratings). Furthermore, a theoretical relation among these measures has been established. We have also derived an equivalent expression for each of these measures so that it can be directly applied for binary data samples in data mining/machine learning jargon. In order to compare and validate the effectiveness of these structural similarity measures, several experiments have been conducted using standardized benchmark datasets (MovieLens, FilmTrust, Epinions, Yahoo! Movies, and Yahoo! Music). 
Empirically obtained results demonstrate that the Salton’s cosine index (SCI) provides better accuracy (in terms of MAE, RMSE, and precision) for large datasets, whereas the overlap coefficient (OLC) results in more accurate recommendations for small datasets.</description><identifier>ISSN: 1869-5450</identifier><identifier>EISSN: 1869-5469</identifier><identifier>DOI: 10.1007/s13278-020-00660-9</identifier><language>eng</language><publisher>Vienna: Springer Vienna</publisher><subject>Affinity ; Algorithms ; Applications of Graph Theory and Complex Networks ; Binary data ; Coefficients ; Collaboration ; Comparative analysis ; Computer Science ; Data mining ; Data Mining and Knowledge Discovery ; Datasets ; Economics ; Empirical analysis ; Game Theory ; Humanities ; Information overload ; Law ; Linear algebra ; Machine learning ; Methodology of the Social Sciences ; Music ; Neighborhoods ; Original Article ; Ratings &amp; rankings ; Recommender systems ; Similarity ; Similarity measures ; Social and Behav. 
Sciences ; Statistics for Social Sciences</subject><ispartof>Social network analysis and mining, 2020-12, Vol.10 (1), p.43, Article 43</ispartof><rights>Springer-Verlag GmbH Austria, part of Springer Nature 2020</rights><rights>Springer-Verlag GmbH Austria, part of Springer Nature 2020.</rights><lds50>peer_reviewed</lds50><woscitedreferencessubscribed>false</woscitedreferencessubscribed><citedby>FETCH-LOGICAL-c443t-21716d624901da34e035c791651ae968dfd49029605049dd083c0907561aa91a3</citedby><cites>FETCH-LOGICAL-c443t-21716d624901da34e035c791651ae968dfd49029605049dd083c0907561aa91a3</cites><orcidid>0000-0002-1186-3974</orcidid></display><links><openurl>$$Topenurl_article</openurl><openurlfulltext>$$Topenurlfull_article</openurlfulltext><thumbnail>$$Tsyndetics_thumb_exl</thumbnail><linktopdf>$$Uhttps://link.springer.com/content/pdf/10.1007/s13278-020-00660-9$$EPDF$$P50$$Gspringer$$H</linktopdf><linktohtml>$$Uhttps://www.proquest.com/docview/2920667773?pq-origsite=primo$$EHTML$$P50$$Gproquest$$H</linktohtml><link.rule.ids>314,776,780,21367,27901,27902,33721,41464,42533,43781,51294</link.rule.ids></links><search><creatorcontrib>Verma, Vijay</creatorcontrib><creatorcontrib>Aggarwal, Rajesh Kumar</creatorcontrib><title>A comparative analysis of similarity measures akin to the Jaccard index in collaborative recommendations: empirical and theoretical perspective</title><title>Social network analysis and mining</title><addtitle>Soc. Netw. Anal. Min</addtitle><description>Jaccard index, originally proposed by Jaccard (Bull Soc Vaudoise Sci Nat 37:241–272, 1901), is a measure for examining the similarity (or dissimilarity) between two sample data objects. It is defined as the proportion of the intersection size to the union size of the two data samples. It provides a very simple and intuitive measure of similarity between data samples. 
This research examines the measures that are akin to the Jaccard index and may be used for modelling affinity between users (or items) in collaborative recommendations. Particularly, the measures such as simple matching coefficient (SMC), Sorensen–Dice coefficient (SDC), Salton’s cosine index (SCI), and overlap coefficient (OLC) are compared and analysed in both theoretical and empirical perspectives with respect to the Jaccard index. Since these measures apprehend only the structural similarity information (overlapping information) between the data samples, these are very useful in situations where only the associations between users and items are available such as browsing or buying behaviours of the users on an e-commerce portal (i.e. unary rating data, a special case of ratings). Furthermore, a theoretical relation among these measures has been established. We have also derived an equivalent expression for each of these measures so that it can be directly applied for binary data samples in data mining/machine learning jargon. In order to compare and validate the effectiveness of these structural similarity measures, several experiments have been conducted using standardized benchmark datasets (MovieLens, FilmTrust, Epinions, Yahoo! Movies, and Yahoo! Music). 
Empirically obtained results demonstrate that the Salton’s cosine index (SCI) provides better accuracy (in terms of MAE, RMSE, and precision) for large datasets, whereas the overlap coefficient (OLC) results in more accurate recommendations for small datasets.</description><subject>Affinity</subject><subject>Algorithms</subject><subject>Applications of Graph Theory and Complex Networks</subject><subject>Binary data</subject><subject>Coefficients</subject><subject>Collaboration</subject><subject>Comparative analysis</subject><subject>Computer Science</subject><subject>Data mining</subject><subject>Data Mining and Knowledge Discovery</subject><subject>Datasets</subject><subject>Economics</subject><subject>Empirical analysis</subject><subject>Game Theory</subject><subject>Humanities</subject><subject>Information overload</subject><subject>Law</subject><subject>Linear algebra</subject><subject>Machine learning</subject><subject>Methodology of the Social Sciences</subject><subject>Music</subject><subject>Neighborhoods</subject><subject>Original Article</subject><subject>Ratings &amp; rankings</subject><subject>Recommender systems</subject><subject>Similarity</subject><subject>Similarity measures</subject><subject>Social and Behav. 
Sciences</subject><subject>Statistics for Social Sciences</subject><issn>1869-5450</issn><issn>1869-5469</issn><fulltext>true</fulltext><rsrctype>article</rsrctype><creationdate>2020</creationdate><recordtype>article</recordtype><sourceid>BENPR</sourceid><recordid>eNp9kMtOwzAQRSMEEhX0B1hZYh0Yx4lTs6sqnqrEBtbWYE_BJYmDnSL6Ffwy7kOwY2N75HuP5ZNlZxwuOEB9Gbko6kkOBeQAUkKuDrIRn0iVV6VUh7_nCo6zcYxLAOAghAI5yr6nzPi2x4CD-ySGHTbr6CLzCxZd6xoMblizljCuAkWG765jg2fDG7EHNAaDZa6z9JXWBGoafPF7VKAEbqmzafRdvGLU9i44g016xW4IPtCwnXsKsSezqZ1mRwtsIo33-0n2fHP9NLvL54-397PpPDdlKYa84DWXVhalAm5RlASiMrXisuJISk7swqarQkmooFTWwkQYUFBXkiMqjuIkO99x--A_VhQHvfSrkH4fdaGKZLGua5FSxS5lgo8x0EL3wbUY1pqD3rjXO_c6uddb91qlktiVYgp3rxT-0P-0fgCXMoil</recordid><startdate>20201201</startdate><enddate>20201201</enddate><creator>Verma, Vijay</creator><creator>Aggarwal, Rajesh Kumar</creator><general>Springer Vienna</general><general>Springer Nature B.V</general><scope>AAYXX</scope><scope>CITATION</scope><scope>0-V</scope><scope>3V.</scope><scope>7XB</scope><scope>88J</scope><scope>8BJ</scope><scope>8FE</scope><scope>8FG</scope><scope>8FK</scope><scope>ABUWG</scope><scope>AFKRA</scope><scope>ALSLI</scope><scope>ARAPS</scope><scope>AZQEC</scope><scope>BENPR</scope><scope>BGLVJ</scope><scope>CCPQU</scope><scope>DWQXO</scope><scope>FQK</scope><scope>GNUQQ</scope><scope>HCIFZ</scope><scope>JBE</scope><scope>JQ2</scope><scope>K7-</scope><scope>M2R</scope><scope>P5Z</scope><scope>P62</scope><scope>PQEST</scope><scope>PQQKQ</scope><scope>PQUKI</scope><scope>Q9U</scope><orcidid>https://orcid.org/0000-0002-1186-3974</orcidid></search><sort><creationdate>20201201</creationdate><title>A comparative analysis of similarity measures akin to the Jaccard index in collaborative recommendations: empirical and theoretical perspective</title><author>Verma, Vijay ; Aggarwal, Rajesh 
Kumar</author></sort><facets><frbrtype>5</frbrtype><frbrgroupid>cdi_FETCH-LOGICAL-c443t-21716d624901da34e035c791651ae968dfd49029605049dd083c0907561aa91a3</frbrgroupid><rsrctype>articles</rsrctype><prefilter>articles</prefilter><language>eng</language><creationdate>2020</creationdate><topic>Affinity</topic><topic>Algorithms</topic><topic>Applications of Graph Theory and Complex Networks</topic><topic>Binary data</topic><topic>Coefficients</topic><topic>Collaboration</topic><topic>Comparative analysis</topic><topic>Computer Science</topic><topic>Data mining</topic><topic>Data Mining and Knowledge Discovery</topic><topic>Datasets</topic><topic>Economics</topic><topic>Empirical analysis</topic><topic>Game Theory</topic><topic>Humanities</topic><topic>Information overload</topic><topic>Law</topic><topic>Linear algebra</topic><topic>Machine learning</topic><topic>Methodology of the Social Sciences</topic><topic>Music</topic><topic>Neighborhoods</topic><topic>Original Article</topic><topic>Ratings &amp; rankings</topic><topic>Recommender systems</topic><topic>Similarity</topic><topic>Similarity measures</topic><topic>Social and Behav. 
Sciences</topic><topic>Statistics for Social Sciences</topic><toplevel>peer_reviewed</toplevel><toplevel>online_resources</toplevel><creatorcontrib>Verma, Vijay</creatorcontrib><creatorcontrib>Aggarwal, Rajesh Kumar</creatorcontrib><collection>CrossRef</collection><collection>ProQuest Social Sciences Premium Collection</collection><collection>ProQuest Central (Corporate)</collection><collection>ProQuest Central (purchase pre-March 2016)</collection><collection>Social Science Database (Alumni Edition)</collection><collection>International Bibliography of the Social Sciences (IBSS)</collection><collection>ProQuest SciTech Collection</collection><collection>ProQuest Technology Collection</collection><collection>ProQuest Central (Alumni) (purchase pre-March 2016)</collection><collection>ProQuest Central (Alumni Edition)</collection><collection>ProQuest Central UK/Ireland</collection><collection>Social Science Premium Collection</collection><collection>Advanced Technologies &amp; Aerospace Collection</collection><collection>ProQuest Central Essentials</collection><collection>ProQuest Central</collection><collection>Technology Collection</collection><collection>ProQuest One Community College</collection><collection>ProQuest Central Korea</collection><collection>International Bibliography of the Social Sciences</collection><collection>ProQuest Central Student</collection><collection>SciTech Premium Collection</collection><collection>International Bibliography of the Social Sciences</collection><collection>ProQuest Computer Science Collection</collection><collection>Computer Science Database</collection><collection>Social Science Database</collection><collection>Advanced Technologies &amp; Aerospace Database</collection><collection>ProQuest Advanced Technologies &amp; Aerospace Collection</collection><collection>ProQuest One Academic Eastern Edition (DO NOT USE)</collection><collection>ProQuest One Academic</collection><collection>ProQuest One Academic UKI 
Edition</collection><collection>ProQuest Central Basic</collection><jtitle>Social network analysis and mining</jtitle></facets><delivery><delcategory>Remote Search Resource</delcategory><fulltext>fulltext</fulltext></delivery><addata><au>Verma, Vijay</au><au>Aggarwal, Rajesh Kumar</au><format>journal</format><genre>article</genre><ristype>JOUR</ristype><atitle>A comparative analysis of similarity measures akin to the Jaccard index in collaborative recommendations: empirical and theoretical perspective</atitle><jtitle>Social network analysis and mining</jtitle><stitle>Soc. Netw. Anal. Min</stitle><date>2020-12-01</date><risdate>2020</risdate><volume>10</volume><issue>1</issue><spage>43</spage><pages>43-</pages><artnum>43</artnum><issn>1869-5450</issn><eissn>1869-5469</eissn><abstract>Jaccard index, originally proposed by Jaccard (Bull Soc Vaudoise Sci Nat 37:241–272, 1901), is a measure for examining the similarity (or dissimilarity) between two sample data objects. It is defined as the proportion of the intersection size to the union size of the two data samples. It provides a very simple and intuitive measure of similarity between data samples. This research examines the measures that are akin to the Jaccard index and may be used for modelling affinity between users (or items) in collaborative recommendations. Particularly, the measures such as simple matching coefficient (SMC), Sorensen–Dice coefficient (SDC), Salton’s cosine index (SCI), and overlap coefficient (OLC) are compared and analysed in both theoretical and empirical perspectives with respect to the Jaccard index. Since these measures apprehend only the structural similarity information (overlapping information) between the data samples, these are very useful in situations where only the associations between users and items are available such as browsing or buying behaviours of the users on an e-commerce portal (i.e. unary rating data, a special case of ratings). 
Furthermore, a theoretical relation among these measures has been established. We have also derived an equivalent expression for each of these measures so that it can be directly applied for binary data samples in data mining/machine learning jargon. In order to compare and validate the effectiveness of these structural similarity measures, several experiments have been conducted using standardized benchmark datasets (MovieLens, FilmTrust, Epinions, Yahoo! Movies, and Yahoo! Music). Empirically obtained results demonstrate that the Salton’s cosine index (SCI) provides better accuracy (in terms of MAE, RMSE, and precision) for large datasets, whereas the overlap coefficient (OLC) results in more accurate recommendations for small datasets.</abstract><cop>Vienna</cop><pub>Springer Vienna</pub><doi>10.1007/s13278-020-00660-9</doi><orcidid>https://orcid.org/0000-0002-1186-3974</orcidid></addata></record>
fulltext fulltext
identifier ISSN: 1869-5450
ispartof Social network analysis and mining, 2020-12, Vol.10 (1), p.43, Article 43
issn 1869-5450
1869-5469
language eng
recordid cdi_proquest_journals_2920667773
source Springer Nature - Complete Springer Journals; ProQuest Central
subjects Affinity
Algorithms
Applications of Graph Theory and Complex Networks
Binary data
Coefficients
Collaboration
Comparative analysis
Computer Science
Data mining
Data Mining and Knowledge Discovery
Datasets
Economics
Empirical analysis
Game Theory
Humanities
Information overload
Law
Linear algebra
Machine learning
Methodology of the Social Sciences
Music
Neighborhoods
Original Article
Ratings & rankings
Recommender systems
Similarity
Similarity measures
Social and Behav. Sciences
Statistics for Social Sciences
title A comparative analysis of similarity measures akin to the Jaccard index in collaborative recommendations: empirical and theoretical perspective
url https://sfx.bib-bvb.de/sfx_tum?ctx_ver=Z39.88-2004&ctx_enc=info:ofi/enc:UTF-8&ctx_tim=2025-02-02T11%3A50%3A25IST&url_ver=Z39.88-2004&url_ctx_fmt=infofi/fmt:kev:mtx:ctx&rfr_id=info:sid/primo.exlibrisgroup.com:primo3-Article-proquest_cross&rft_val_fmt=info:ofi/fmt:kev:mtx:journal&rft.genre=article&rft.atitle=A%20comparative%20analysis%20of%20similarity%20measures%20akin%20to%20the%20Jaccard%20index%20in%20collaborative%20recommendations:%20empirical%20and%20theoretical%20perspective&rft.jtitle=Social%20network%20analysis%20and%20mining&rft.au=Verma,%20Vijay&rft.date=2020-12-01&rft.volume=10&rft.issue=1&rft.spage=43&rft.pages=43-&rft.artnum=43&rft.issn=1869-5450&rft.eissn=1869-5469&rft_id=info:doi/10.1007/s13278-020-00660-9&rft_dat=%3Cproquest_cross%3E2920667773%3C/proquest_cross%3E%3Curl%3E%3C/url%3E&disable_directlink=true&sfx.directlink=off&sfx.report_link=0&rft_id=info:oai/&rft_pqid=2920667773&rft_id=info:pmid/&rfr_iscdi=true