The Debsources Dataset: two decades of free and open source software

We present the Debsources Dataset: source code and related metadata spanning two decades of Free and Open Source Software (FOSS) history, seen through the lens of the Debian distribution. The dataset spans more than 3 billion lines of source code as well as metadata about them such as: size metrics...

Bibliographic details
Published in: Empirical software engineering : an international journal 2017-06, Vol.22 (3), p.1405-1437
Main authors: Caneill, Matthieu, Germán, Daniel M., Zacchiroli, Stefano
Format: Article
Language: eng
Subjects:
Online access: Full text
container_end_page 1437
container_issue 3
container_start_page 1405
container_title Empirical software engineering : an international journal
container_volume 22
creator Caneill, Matthieu
Germán, Daniel M.
Zacchiroli, Stefano
description We present the Debsources Dataset: source code and related metadata spanning two decades of Free and Open Source Software (FOSS) history, seen through the lens of the Debian distribution. The dataset spans more than 3 billion lines of source code as well as metadata about them such as: size metrics (lines of code, disk usage), developer-defined symbols (ctags), file-level checksums (SHA1, SHA256, TLSH), file media types (MIME), release information (which version of which package containing which source code files has been released when), and license information (GPL, BSD, etc). The Debsources Dataset comes as a set of tarballs containing deduplicated unique source code files organized by their SHA1 checksums (the source code), plus a portable PostgreSQL database dump (the metadata). A case study is run to show how the Debsources Dataset can be used to easily and efficiently instrument very long-term analyses of the evolution of Debian from various angles (size, granularity, licensing, etc.), getting a grasp of major FOSS trends of the past two decades. The Debsources Dataset is Open Data, released under the terms of the CC BY-SA 4.0 license, and available for download from Zenodo with DOI reference 10.5281/zenodo.61089.
doi_str_mv 10.1007/s10664-016-9461-5
format Article
fullrecord <record><control><sourceid>proquest_cross</sourceid><recordid>TN_cdi_proquest_journals_2063633448</recordid><sourceformat>XML</sourceformat><sourcesystem>PC</sourcesystem><sourcerecordid>2063633448</sourcerecordid><originalsourceid>FETCH-LOGICAL-c316t-e94a642d769c96acf97b0ca3bf86400c0aa48010166c827cf0fc6a21899e105a3</originalsourceid><addsrcrecordid>eNp1UE1LAzEQDaJgrf4AbwHP0cnHzm68SWtVKHip55BmJ2rR3ZpsKf57U1bw5OkNw_uYeYxdSriWAPVNloBoBEgU1qAU1RGbyKrWokaJx2XWjRJaVXjKznLeAICtTTVh89Ub8Tmtc79LgTKf-8FnGm75sO95S8G3ZdlHHhMR913L-y11fGQXiMPeJzpnJ9F_ZLr4xSl7WdyvZo9i-fzwNLtbiqAlDoKs8WhUW6MNFn2Itl5D8HodGzQAAbw3DcjyA4ZG1SFCDOiVbKwlCZXXU3Y1-m5T_7WjPLhNOaQrkU4BatTamKaw5MgKqc85UXTb9P7p07eT4A5lubEsV4LcoSxXFY0aNblwu1dKf87_i34A5j9rEg</addsrcrecordid><sourcetype>Aggregation Database</sourcetype><iscdi>true</iscdi><recordtype>article</recordtype><pqid>2063633448</pqid></control><display><type>article</type><title>The Debsources Dataset: two decades of free and open source software</title><source>Springer Nature - Complete Springer Journals</source><creator>Caneill, Matthieu ; Germán, Daniel M. ; Zacchiroli, Stefano</creator><creatorcontrib>Caneill, Matthieu ; Germán, Daniel M. ; Zacchiroli, Stefano</creatorcontrib><description>We present the Debsources Dataset: source code and related metadata spanning two decades of Free and Open Source Software (FOSS) history, seen through the lens of the Debian distribution. The dataset spans more than 3 billion lines of source code as well as metadata about them such as: size metrics (lines of code, disk usage), developer-defined symbols (ctags), file-level checksums (SHA1, SHA256, TLSH), file media types (MIME), release information (which version of which package containing which source code files has been released when), and license information (GPL, BSD, etc). 
The Debsources Dataset comes as a set of tarballs containing deduplicated unique source code files organized by their SHA1 checksums (the source code), plus a portable PostgreSQL database dump (the metadata). A case study is run to show how the Debsources Dataset can be used to easily and efficiently instrument very long-term analyses of the evolution of Debian from various angles (size, granularity, licensing, etc.), getting a grasp of major FOSS trends of the past two decades. The Debsources Dataset is Open Data, released under the terms of the CC BY-SA 4.0 license, and available for download from Zenodo with DOI reference 10.5281/zenodo.61089.</description><identifier>ISSN: 1382-3256</identifier><identifier>EISSN: 1573-7616</identifier><identifier>DOI: 10.1007/s10664-016-9461-5</identifier><language>eng</language><publisher>New York: Springer US</publisher><subject>Compilers ; Computer Science ; Datasets ; Downloading ; Freeware ; Interpreters ; Metadata ; Open data ; Open source software ; Programming Languages ; Software Engineering/Programming and Operating Systems ; Source code</subject><ispartof>Empirical software engineering : an international journal, 2017-06, Vol.22 (3), p.1405-1437</ispartof><rights>Springer Science+Business Media New York 2016</rights><rights>Copyright Springer Science &amp; Business Media 
2017</rights><lds50>peer_reviewed</lds50><woscitedreferencessubscribed>false</woscitedreferencessubscribed><citedby>FETCH-LOGICAL-c316t-e94a642d769c96acf97b0ca3bf86400c0aa48010166c827cf0fc6a21899e105a3</citedby><cites>FETCH-LOGICAL-c316t-e94a642d769c96acf97b0ca3bf86400c0aa48010166c827cf0fc6a21899e105a3</cites></display><links><openurl>$$Topenurl_article</openurl><openurlfulltext>$$Topenurlfull_article</openurlfulltext><thumbnail>$$Tsyndetics_thumb_exl</thumbnail><linktopdf>$$Uhttps://link.springer.com/content/pdf/10.1007/s10664-016-9461-5$$EPDF$$P50$$Gspringer$$H</linktopdf><linktohtml>$$Uhttps://link.springer.com/10.1007/s10664-016-9461-5$$EHTML$$P50$$Gspringer$$H</linktohtml><link.rule.ids>314,776,780,27903,27904,41467,42536,51297</link.rule.ids></links><search><creatorcontrib>Caneill, Matthieu</creatorcontrib><creatorcontrib>Germán, Daniel M.</creatorcontrib><creatorcontrib>Zacchiroli, Stefano</creatorcontrib><title>The Debsources Dataset: two decades of free and open source software</title><title>Empirical software engineering : an international journal</title><addtitle>Empir Software Eng</addtitle><description>We present the Debsources Dataset: source code and related metadata spanning two decades of Free and Open Source Software (FOSS) history, seen through the lens of the Debian distribution. The dataset spans more than 3 billion lines of source code as well as metadata about them such as: size metrics (lines of code, disk usage), developer-defined symbols (ctags), file-level checksums (SHA1, SHA256, TLSH), file media types (MIME), release information (which version of which package containing which source code files has been released when), and license information (GPL, BSD, etc). The Debsources Dataset comes as a set of tarballs containing deduplicated unique source code files organized by their SHA1 checksums (the source code), plus a portable PostgreSQL database dump (the metadata). 
A case study is run to show how the Debsources Dataset can be used to easily and efficiently instrument very long-term analyses of the evolution of Debian from various angles (size, granularity, licensing, etc.), getting a grasp of major FOSS trends of the past two decades. The Debsources Dataset is Open Data, released under the terms of the CC BY-SA 4.0 license, and available for download from Zenodo with DOI reference 10.5281/zenodo.61089.</description><subject>Compilers</subject><subject>Computer Science</subject><subject>Datasets</subject><subject>Downloading</subject><subject>Freeware</subject><subject>Interpreters</subject><subject>Metadata</subject><subject>Open data</subject><subject>Open source software</subject><subject>Programming Languages</subject><subject>Software Engineering/Programming and Operating Systems</subject><subject>Source code</subject><issn>1382-3256</issn><issn>1573-7616</issn><fulltext>true</fulltext><rsrctype>article</rsrctype><creationdate>2017</creationdate><recordtype>article</recordtype><recordid>eNp1UE1LAzEQDaJgrf4AbwHP0cnHzm68SWtVKHip55BmJ2rR3ZpsKf57U1bw5OkNw_uYeYxdSriWAPVNloBoBEgU1qAU1RGbyKrWokaJx2XWjRJaVXjKznLeAICtTTVh89Ub8Tmtc79LgTKf-8FnGm75sO95S8G3ZdlHHhMR913L-y11fGQXiMPeJzpnJ9F_ZLr4xSl7WdyvZo9i-fzwNLtbiqAlDoKs8WhUW6MNFn2Itl5D8HodGzQAAbw3DcjyA4ZG1SFCDOiVbKwlCZXXU3Y1-m5T_7WjPLhNOaQrkU4BatTamKaw5MgKqc85UXTb9P7p07eT4A5lubEsV4LcoSxXFY0aNblwu1dKf87_i34A5j9rEg</recordid><startdate>20170601</startdate><enddate>20170601</enddate><creator>Caneill, Matthieu</creator><creator>Germán, Daniel M.</creator><creator>Zacchiroli, Stefano</creator><general>Springer US</general><general>Springer Nature B.V</general><scope>AAYXX</scope><scope>CITATION</scope><scope>7SC</scope><scope>8FD</scope><scope>JQ2</scope><scope>L7M</scope><scope>L~C</scope><scope>L~D</scope></search><sort><creationdate>20170601</creationdate><title>The Debsources Dataset: two decades of free and open source software</title><author>Caneill, Matthieu ; Germán, Daniel M. 
; Zacchiroli, Stefano</author></sort><facets><frbrtype>5</frbrtype><frbrgroupid>cdi_FETCH-LOGICAL-c316t-e94a642d769c96acf97b0ca3bf86400c0aa48010166c827cf0fc6a21899e105a3</frbrgroupid><rsrctype>articles</rsrctype><prefilter>articles</prefilter><language>eng</language><creationdate>2017</creationdate><topic>Compilers</topic><topic>Computer Science</topic><topic>Datasets</topic><topic>Downloading</topic><topic>Freeware</topic><topic>Interpreters</topic><topic>Metadata</topic><topic>Open data</topic><topic>Open source software</topic><topic>Programming Languages</topic><topic>Software Engineering/Programming and Operating Systems</topic><topic>Source code</topic><toplevel>peer_reviewed</toplevel><toplevel>online_resources</toplevel><creatorcontrib>Caneill, Matthieu</creatorcontrib><creatorcontrib>Germán, Daniel M.</creatorcontrib><creatorcontrib>Zacchiroli, Stefano</creatorcontrib><collection>CrossRef</collection><collection>Computer and Information Systems Abstracts</collection><collection>Technology Research Database</collection><collection>ProQuest Computer Science Collection</collection><collection>Advanced Technologies Database with Aerospace</collection><collection>Computer and Information Systems Abstracts – Academic</collection><collection>Computer and Information Systems Abstracts Professional</collection><jtitle>Empirical software engineering : an international journal</jtitle></facets><delivery><delcategory>Remote Search Resource</delcategory><fulltext>fulltext</fulltext></delivery><addata><au>Caneill, Matthieu</au><au>Germán, Daniel M.</au><au>Zacchiroli, Stefano</au><format>journal</format><genre>article</genre><ristype>JOUR</ristype><atitle>The Debsources Dataset: two decades of free and open source software</atitle><jtitle>Empirical software engineering : an international journal</jtitle><stitle>Empir Software 
Eng</stitle><date>2017-06-01</date><risdate>2017</risdate><volume>22</volume><issue>3</issue><spage>1405</spage><epage>1437</epage><pages>1405-1437</pages><issn>1382-3256</issn><eissn>1573-7616</eissn><abstract>We present the Debsources Dataset: source code and related metadata spanning two decades of Free and Open Source Software (FOSS) history, seen through the lens of the Debian distribution. The dataset spans more than 3 billion lines of source code as well as metadata about them such as: size metrics (lines of code, disk usage), developer-defined symbols (ctags), file-level checksums (SHA1, SHA256, TLSH), file media types (MIME), release information (which version of which package containing which source code files has been released when), and license information (GPL, BSD, etc). The Debsources Dataset comes as a set of tarballs containing deduplicated unique source code files organized by their SHA1 checksums (the source code), plus a portable PostgreSQL database dump (the metadata). A case study is run to show how the Debsources Dataset can be used to easily and efficiently instrument very long-term analyses of the evolution of Debian from various angles (size, granularity, licensing, etc.), getting a grasp of major FOSS trends of the past two decades. The Debsources Dataset is Open Data, released under the terms of the CC BY-SA 4.0 license, and available for download from Zenodo with DOI reference 10.5281/zenodo.61089.</abstract><cop>New York</cop><pub>Springer US</pub><doi>10.1007/s10664-016-9461-5</doi><tpages>33</tpages></addata></record>
fulltext fulltext
identifier ISSN: 1382-3256
ispartof Empirical software engineering : an international journal, 2017-06, Vol.22 (3), p.1405-1437
issn 1382-3256
1573-7616
language eng
recordid cdi_proquest_journals_2063633448
source Springer Nature - Complete Springer Journals
subjects Compilers
Computer Science
Datasets
Downloading
Freeware
Interpreters
Metadata
Open data
Open source software
Programming Languages
Software Engineering/Programming and Operating Systems
Source code
title The Debsources Dataset: two decades of free and open source software
url https://sfx.bib-bvb.de/sfx_tum?ctx_ver=Z39.88-2004&ctx_enc=info:ofi/enc:UTF-8&ctx_tim=2025-01-26T21%3A56%3A22IST&url_ver=Z39.88-2004&url_ctx_fmt=infofi/fmt:kev:mtx:ctx&rfr_id=info:sid/primo.exlibrisgroup.com:primo3-Article-proquest_cross&rft_val_fmt=info:ofi/fmt:kev:mtx:journal&rft.genre=article&rft.atitle=The%20Debsources%20Dataset:%20two%20decades%20of%20free%20and%20open%20source%20software&rft.jtitle=Empirical%20software%20engineering%20:%20an%20international%20journal&rft.au=Caneill,%20Matthieu&rft.date=2017-06-01&rft.volume=22&rft.issue=3&rft.spage=1405&rft.epage=1437&rft.pages=1405-1437&rft.issn=1382-3256&rft.eissn=1573-7616&rft_id=info:doi/10.1007/s10664-016-9461-5&rft_dat=%3Cproquest_cross%3E2063633448%3C/proquest_cross%3E%3Curl%3E%3C/url%3E&disable_directlink=true&sfx.directlink=off&sfx.report_link=0&rft_id=info:oai/&rft_pqid=2063633448&rft_id=info:pmid/&rfr_iscdi=true
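
The description above says the dataset ships deduplicated unique source files organized by their SHA1 checksums. The following is a minimal sketch of that general idea, not the dataset's actual tooling: the function names, the two-character fan-out directory, and the in-memory index are all assumptions made for illustration, and the record does not specify the real on-disk layout of the tarballs.

```python
import hashlib
from pathlib import Path


def sha1_of_file(path, chunk_size=65536):
    """Return the hex SHA1 digest of a file, reading it in chunks."""
    digest = hashlib.sha1()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def store_path(root, checksum, fanout=2):
    """Map a checksum to a location in a content-addressed store.

    Hypothetical layout (an assumption, not taken from the dataset):
    <root>/<first `fanout` hex chars>/<full checksum>
    """
    return Path(root) / checksum[:fanout] / checksum


def group_by_checksum(paths):
    """Group file paths by SHA1 so identical contents share one entry."""
    index = {}
    for path in paths:
        index.setdefault(sha1_of_file(path), []).append(str(path))
    return index
```

Calling `group_by_checksum` on a list of source files yields one key per unique content, so the number of keys is the number of files that would need to be stored after deduplication, and `store_path` shows where each unique file could live under the assumed layout.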