Who's who in Gnome: Using LSA to merge software repository identities

Understanding an individual's contribution to an ecosystem often necessitates integrating information from multiple repositories corresponding to different projects within the ecosystem or different kinds of repositories (e.g., mail archives and version control systems). However, recognising th...

Bibliographic details
Main authors: Kouters, E., Vasilescu, B., Serebrenik, A., van den Brand, M. G. J.
Format: Conference proceeding
Language: English
container_end_page 595
container_start_page 592
creator Kouters, E.
Vasilescu, B.
Serebrenik, A.
van den Brand, M. G. J.
description Understanding an individual's contribution to an ecosystem often necessitates integrating information from multiple repositories corresponding to different projects within the ecosystem or different kinds of repositories (e.g., mail archives and version control systems). However, recognising that different contributions belong to the same contributor is challenging, since developers may use different aliases. It is known that existing identity merging algorithms are sensitive to large discrepancies between the aliases used by the same individual: the noisier the data, the worse their performance. To assess the scale of the problem for a large software ecosystem, we study all Gnome Git repositories, classify the differences in aliases, and discuss robustness of existing algorithms with respect to these types of differences. We then propose a new identity merging algorithm based on Latent Semantic Analysis (LSA), designed to be robust against more types of differences in aliases, and evaluate it empirically by means of cross-validation on Gnome Git authors. Our results show a clear improvement over existing algorithms in terms of precision and recall on worst-case input data.
doi_str_mv 10.1109/ICSM.2012.6405329
format Conference Proceeding
fullrecord <record><control><sourceid>ieee_6IE</sourceid><recordid>TN_cdi_ieee_primary_6405329</recordid><sourceformat>XML</sourceformat><sourcesystem>PC</sourcesystem><ieee_id>6405329</ieee_id><sourcerecordid>6405329</sourcerecordid><originalsourceid>FETCH-LOGICAL-i175t-f9d16de9f453c4ed7694432994a7e784d1357e1550d8f7badb08307c25c8b4093</originalsourceid><addsrcrecordid>eNo1kM1qAjEURtM_qFofoHSTXVczTXKT3KQ7EWsFSxdWupTR3NGUOpHJgPj2LdSuvsWBA-dj7F6KUkrhn2bjxVuphFSl1cKA8hesL7VFUCCVu2Q9ZdAWILW7YkOP7p-BuGY9KSwUFhFuWT_nLyGMRtA9NvncpcfMj7vEY8OnTdrTM1_m2Gz5fDHiXeJ7arfEc6q7Y9USb-mQcuxSe-IxUNPFLlK-Yzd19Z1peN4BW75MPsavxfx9OhuP5kWUaLqi9kHaQL7WBjaaAlqv9W-I1xUSOh0kGCRpjAiuxnUV1sKBwI0yG7fWwsOAPfx5IxGtDm3cV-1pdb4DfgBiiE4g</addsrcrecordid><sourcetype>Publisher</sourcetype><iscdi>true</iscdi><recordtype>conference_proceeding</recordtype></control><display><type>conference_proceeding</type><title>Who's who in Gnome: Using LSA to merge software repository identities</title><source>IEEE Electronic Library (IEL) Conference Proceedings</source><creator>Kouters, E. ; Vasilescu, B. ; Serebrenik, A. ; van den Brand, M. G. J.</creator><creatorcontrib>Kouters, E. ; Vasilescu, B. ; Serebrenik, A. ; van den Brand, M. G. J.</creatorcontrib><description>Understanding an individual's contribution to an ecosystem often necessitates integrating information from multiple repositories corresponding to different projects within the ecosystem or different kinds of repositories (e.g., mail archives and version control systems). However, recognising that different contributions belong to the same contributor is challenging, since developers may use different aliases. It is known that existing identity merging algorithms are sensitive to large discrepancies between the aliases used by the same individual: the noisier the data, the worse their performance. 
To assess the scale of the problem for a large software ecosystem, we study all Gnome Git repositories, classify the differences in aliases, and discuss robustness of existing algorithms with respect to these types of differences. We then propose a new identity merging algorithm based on Latent Semantic Analysis (LSA), designed to be robust against more types of differences in aliases, and evaluate it empirically by means of cross-validation on Gnome Git authors. Our results show a clear improvement over existing algorithms in terms of precision and recall on worst-case input data.</description><identifier>ISSN: 1063-6773</identifier><identifier>ISBN: 9781467323130</identifier><identifier>ISBN: 1467323136</identifier><identifier>EISSN: 2576-3148</identifier><identifier>EISBN: 1467323128</identifier><identifier>EISBN: 9781467323123</identifier><identifier>DOI: 10.1109/ICSM.2012.6405329</identifier><language>eng</language><publisher>IEEE</publisher><subject>Algorithm design and analysis ; Birds ; Classification algorithms ; Electronic mail ; Gnome ; identity merging ; latent semantic analysis ; Merging ; Robustness ; Software</subject><ispartof>2012 28th IEEE International Conference on Software Maintenance (ICSM), 2012, p.592-595</ispartof><woscitedreferencessubscribed>false</woscitedreferencessubscribed></display><links><openurl>$$Topenurl_article</openurl><openurlfulltext>$$Topenurlfull_article</openurlfulltext><thumbnail>$$Tsyndetics_thumb_exl</thumbnail><linktohtml>$$Uhttps://ieeexplore.ieee.org/document/6405329$$EHTML$$P50$$Gieee$$H</linktohtml><link.rule.ids>309,310,776,780,785,786,2052,27902,54895</link.rule.ids><linktorsrc>$$Uhttps://ieeexplore.ieee.org/document/6405329$$EView_record_in_IEEE$$FView_record_in_$$GIEEE</linktorsrc></links><search><creatorcontrib>Kouters, E.</creatorcontrib><creatorcontrib>Vasilescu, B.</creatorcontrib><creatorcontrib>Serebrenik, A.</creatorcontrib><creatorcontrib>van den Brand, M. G. 
J.</creatorcontrib><title>Who's who in Gnome: Using LSA to merge software repository identities</title><title>2012 28th IEEE International Conference on Software Maintenance (ICSM)</title><addtitle>ICSM</addtitle><description>Understanding an individual's contribution to an ecosystem often necessitates integrating information from multiple repositories corresponding to different projects within the ecosystem or different kinds of repositories (e.g., mail archives and version control systems). However, recognising that different contributions belong to the same contributor is challenging, since developers may use different aliases. It is known that existing identity merging algorithms are sensitive to large discrepancies between the aliases used by the same individual: the noisier the data, the worse their performance. To assess the scale of the problem for a large software ecosystem, we study all Gnome Git repositories, classify the differences in aliases, and discuss robustness of existing algorithms with respect to these types of differences. We then propose a new identity merging algorithm based on Latent Semantic Analysis (LSA), designed to be robust against more types of differences in aliases, and evaluate it empirically by means of cross-validation on Gnome Git authors. 
Our results show a clear improvement over existing algorithms in terms of precision and recall on worst-case input data.</description><subject>Algorithm design and analysis</subject><subject>Birds</subject><subject>Classification algorithms</subject><subject>Electronic mail</subject><subject>Gnome</subject><subject>identity merging</subject><subject>latent semantic analysis</subject><subject>Merging</subject><subject>Robustness</subject><subject>Software</subject><issn>1063-6773</issn><issn>2576-3148</issn><isbn>9781467323130</isbn><isbn>1467323136</isbn><isbn>1467323128</isbn><isbn>9781467323123</isbn><fulltext>true</fulltext><rsrctype>conference_proceeding</rsrctype><creationdate>2012</creationdate><recordtype>conference_proceeding</recordtype><sourceid>6IE</sourceid><sourceid>RIE</sourceid><recordid>eNo1kM1qAjEURtM_qFofoHSTXVczTXKT3KQ7EWsFSxdWupTR3NGUOpHJgPj2LdSuvsWBA-dj7F6KUkrhn2bjxVuphFSl1cKA8hesL7VFUCCVu2Q9ZdAWILW7YkOP7p-BuGY9KSwUFhFuWT_nLyGMRtA9NvncpcfMj7vEY8OnTdrTM1_m2Gz5fDHiXeJ7arfEc6q7Y9USb-mQcuxSe-IxUNPFLlK-Yzd19Z1peN4BW75MPsavxfx9OhuP5kWUaLqi9kHaQL7WBjaaAlqv9W-I1xUSOh0kGCRpjAiuxnUV1sKBwI0yG7fWwsOAPfx5IxGtDm3cV-1pdb4DfgBiiE4g</recordid><startdate>201209</startdate><enddate>201209</enddate><creator>Kouters, E.</creator><creator>Vasilescu, B.</creator><creator>Serebrenik, A.</creator><creator>van den Brand, M. G. J.</creator><general>IEEE</general><scope>6IE</scope><scope>6IH</scope><scope>CBEJK</scope><scope>RIE</scope><scope>RIO</scope></search><sort><creationdate>201209</creationdate><title>Who's who in Gnome: Using LSA to merge software repository identities</title><author>Kouters, E. ; Vasilescu, B. ; Serebrenik, A. ; van den Brand, M. G. 
J.</author></sort><facets><frbrtype>5</frbrtype><frbrgroupid>cdi_FETCH-LOGICAL-i175t-f9d16de9f453c4ed7694432994a7e784d1357e1550d8f7badb08307c25c8b4093</frbrgroupid><rsrctype>conference_proceedings</rsrctype><prefilter>conference_proceedings</prefilter><language>eng</language><creationdate>2012</creationdate><topic>Algorithm design and analysis</topic><topic>Birds</topic><topic>Classification algorithms</topic><topic>Electronic mail</topic><topic>Gnome</topic><topic>identity merging</topic><topic>latent semantic analysis</topic><topic>Merging</topic><topic>Robustness</topic><topic>Software</topic><toplevel>online_resources</toplevel><creatorcontrib>Kouters, E.</creatorcontrib><creatorcontrib>Vasilescu, B.</creatorcontrib><creatorcontrib>Serebrenik, A.</creatorcontrib><creatorcontrib>van den Brand, M. G. J.</creatorcontrib><collection>IEEE Electronic Library (IEL) Conference Proceedings</collection><collection>IEEE Proceedings Order Plan (POP) 1998-present by volume</collection><collection>IEEE Xplore All Conference Proceedings</collection><collection>IEEE Electronic Library (IEL)</collection><collection>IEEE Proceedings Order Plans (POP) 1998-present</collection></facets><delivery><delcategory>Remote Search Resource</delcategory><fulltext>fulltext_linktorsrc</fulltext></delivery><addata><au>Kouters, E.</au><au>Vasilescu, B.</au><au>Serebrenik, A.</au><au>van den Brand, M. G. 
J.</au><format>book</format><genre>proceeding</genre><ristype>CONF</ristype><atitle>Who's who in Gnome: Using LSA to merge software repository identities</atitle><btitle>2012 28th IEEE International Conference on Software Maintenance (ICSM)</btitle><stitle>ICSM</stitle><date>2012-09</date><risdate>2012</risdate><spage>592</spage><epage>595</epage><pages>592-595</pages><issn>1063-6773</issn><eissn>2576-3148</eissn><isbn>9781467323130</isbn><isbn>1467323136</isbn><eisbn>1467323128</eisbn><eisbn>9781467323123</eisbn><abstract>Understanding an individual's contribution to an ecosystem often necessitates integrating information from multiple repositories corresponding to different projects within the ecosystem or different kinds of repositories (e.g., mail archives and version control systems). However, recognising that different contributions belong to the same contributor is challenging, since developers may use different aliases. It is known that existing identity merging algorithms are sensitive to large discrepancies between the aliases used by the same individual: the noisier the data, the worse their performance. To assess the scale of the problem for a large software ecosystem, we study all Gnome Git repositories, classify the differences in aliases, and discuss robustness of existing algorithms with respect to these types of differences. We then propose a new identity merging algorithm based on Latent Semantic Analysis (LSA), designed to be robust against more types of differences in aliases, and evaluate it empirically by means of cross-validation on Gnome Git authors. Our results show a clear improvement over existing algorithms in terms of precision and recall on worst-case input data.</abstract><pub>IEEE</pub><doi>10.1109/ICSM.2012.6405329</doi><tpages>4</tpages></addata></record>
fulltext fulltext_linktorsrc
identifier ISSN: 1063-6773
ispartof 2012 28th IEEE International Conference on Software Maintenance (ICSM), 2012, p.592-595
issn 1063-6773
2576-3148
language eng
recordid cdi_ieee_primary_6405329
source IEEE Electronic Library (IEL) Conference Proceedings
subjects Algorithm design and analysis
Birds
Classification algorithms
Electronic mail
Gnome
identity merging
latent semantic analysis
Merging
Robustness
Software
title Who's who in Gnome: Using LSA to merge software repository identities
url https://sfx.bib-bvb.de/sfx_tum?ctx_ver=Z39.88-2004&ctx_enc=info:ofi/enc:UTF-8&ctx_tim=2025-01-29T19%3A18%3A27IST&url_ver=Z39.88-2004&url_ctx_fmt=infofi/fmt:kev:mtx:ctx&rfr_id=info:sid/primo.exlibrisgroup.com:primo3-Article-ieee_6IE&rft_val_fmt=info:ofi/fmt:kev:mtx:book&rft.genre=proceeding&rft.atitle=Who's%20who%20in%20Gnome:%20Using%20LSA%20to%20merge%20software%20repository%20identities&rft.btitle=2012%2028th%20IEEE%20International%20Conference%20on%20Software%20Maintenance%20(ICSM)&rft.au=Kouters,%20E.&rft.date=2012-09&rft.spage=592&rft.epage=595&rft.pages=592-595&rft.issn=1063-6773&rft.eissn=2576-3148&rft.isbn=9781467323130&rft.isbn_list=1467323136&rft_id=info:doi/10.1109/ICSM.2012.6405329&rft_dat=%3Cieee_6IE%3E6405329%3C/ieee_6IE%3E%3Curl%3E%3C/url%3E&rft.eisbn=1467323128&rft.eisbn_list=9781467323123&disable_directlink=true&sfx.directlink=off&sfx.report_link=0&rft_id=info:oai/&rft_id=info:pmid/&rft_ieee_id=6405329&rfr_iscdi=true
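
The description field names an identity merging algorithm based on Latent Semantic Analysis (LSA) but does not give its details. The Python sketch below is therefore only a generic illustration of the idea, not the authors' algorithm. It assumes that aliases are "Name <email>" strings, that terms are character trigrams of the lowercased tokens, that raw counts are used without any weighting, and that two aliases are merged when the cosine similarity of their rank-k LSA vectors reaches a threshold. The function names, the example aliases, the rank k=2 and the threshold 0.8 are all hypothetical choices. The truncated SVD is obtained from the eigenvectors of the document Gram matrix A^T A by power iteration, which gives the same document coordinates as an SVD of the term-document matrix A.

```python
import math
import re
from collections import Counter


def alias_terms(alias, n=3):
    """Count character n-grams of the lowercased alphanumeric tokens of an alias."""
    terms = Counter()
    for token in re.findall(r"[a-z0-9]+", alias.lower()):
        if len(token) < n:
            terms[token] += 1  # keep tokens shorter than n as a single term
        else:
            for i in range(len(token) - n + 1):
                terms[token[i:i + n]] += 1
    return terms


def gram_matrix(docs):
    """Inner products between documents, i.e. A^T A for the term-document matrix A."""
    return [[sum(c * b[t] for t, c in a.items()) for b in docs] for a in docs]


def top_eigenpairs(matrix, k, iterations=200):
    """Leading k eigenpairs of a symmetric PSD matrix (power iteration with deflation)."""
    size = len(matrix)
    work = [[float(x) for x in row] for row in matrix]
    pairs = []
    for _ in range(k):
        vec = [1.0 / math.sqrt(size)] * size
        for _step in range(iterations):
            nxt = [sum(work[i][j] * vec[j] for j in range(size)) for i in range(size)]
            norm = math.sqrt(sum(x * x for x in nxt))
            if norm < 1e-12:
                break
            vec = [x / norm for x in nxt]
        # Rayleigh quotient gives the eigenvalue estimate for the converged vector.
        value = sum(vec[i] * sum(work[i][j] * vec[j] for j in range(size))
                    for i in range(size))
        pairs.append((max(value, 0.0), vec))
        # Deflate: remove the component just found so the next one can emerge.
        for i in range(size):
            for j in range(size):
                work[i][j] -= value * vec[i] * vec[j]
    return pairs


def lsa_similarity(aliases, k=2):
    """Cosine similarity between aliases in a rank-k LSA space."""
    docs = [alias_terms(a) for a in aliases]
    pairs = top_eigenpairs(gram_matrix(docs), k)
    # Document j has coordinates sigma_i * v_i[j], with sigma_i = sqrt(eigenvalue_i).
    coords = [[math.sqrt(val) * vec[j] for val, vec in pairs]
              for j in range(len(aliases))]

    def cosine(u, v):
        nu = math.sqrt(sum(x * x for x in u))
        nv = math.sqrt(sum(x * x for x in v))
        if nu < 1e-9 or nv < 1e-9:
            return 0.0
        return sum(x * y for x, y in zip(u, v)) / (nu * nv)

    return [[cosine(u, v) for v in coords] for u in coords]


def merge_aliases(aliases, threshold=0.8, k=2):
    """Group alias indices whose LSA similarity reaches the threshold."""
    sim = lsa_similarity(aliases, k)
    parent = list(range(len(aliases)))

    def find(i):
        while parent[i] != i:
            i = parent[i]
        return i

    for i in range(len(aliases)):
        for j in range(i + 1, len(aliases)):
            if sim[i][j] >= threshold:
                parent[find(j)] = find(i)
    groups = {}
    for i in range(len(aliases)):
        groups.setdefault(find(i), []).append(i)
    return sorted(groups.values())


# Made-up aliases for illustration only; they do not come from the Gnome data.
EXAMPLE_ALIASES = [
    "John Smith <john.smith@example.org>",
    "john smith <JOHN.SMITH@example.org>",
    "J. Smith <jsmith@example.org>",
    "Alice Wong <awong@mail.test>",
]

if __name__ == "__main__":
    print(merge_aliases(EXAMPLE_ALIASES))
```

The rank k controls how much variation between aliases is smoothed away, and the threshold trades precision against recall, the two measures the description uses for evaluation. Both would need tuning on labelled data, for example by cross-validation.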