The Debsources Dataset: two decades of free and open source software
We present the Debsources Dataset: source code and related metadata spanning two decades of Free and Open Source Software (FOSS) history, seen through the lens of the Debian distribution. The dataset spans more than 3 billion lines of source code as well as metadata about them such as: size metrics...
Published in: | Empirical software engineering : an international journal 2017-06, Vol.22 (3), p.1405-1437 |
---|---|
Main authors: | Caneill, Matthieu; Germán, Daniel M.; Zacchiroli, Stefano |
Format: | Article |
Language: | eng |
Subjects: | |
Online access: | Full text |
container_end_page | 1437 |
---|---|
container_issue | 3 |
container_start_page | 1405 |
container_title | Empirical software engineering : an international journal |
container_volume | 22 |
creator | Caneill, Matthieu; Germán, Daniel M.; Zacchiroli, Stefano |
description | We present the Debsources Dataset: source code and related metadata spanning two decades of Free and Open Source Software (FOSS) history, seen through the lens of the Debian distribution. The dataset spans more than 3 billion lines of source code as well as metadata about them such as: size metrics (lines of code, disk usage), developer-defined symbols (ctags), file-level checksums (SHA1, SHA256, TLSH), file media types (MIME), release information (which version of which package containing which source code files has been released when), and license information (GPL, BSD, etc). The Debsources Dataset comes as a set of tarballs containing deduplicated unique source code files organized by their SHA1 checksums (the source code), plus a portable PostgreSQL database dump (the metadata). A case study is run to show how the Debsources Dataset can be used to easily and efficiently instrument very long-term analyses of the evolution of Debian from various angles (size, granularity, licensing, etc.), getting a grasp of major FOSS trends of the past two decades. The Debsources Dataset is Open Data, released under the terms of the CC BY-SA 4.0 license, and available for download from Zenodo with DOI reference 10.5281/zenodo.61089. |
doi_str_mv | 10.1007/s10664-016-9461-5 |
format | Article |
fullrecord | Record type: article; Source type: Aggregation Database; Publisher: New York: Springer US; Rights: Springer Science+Business Media New York 2016, Copyright Springer Science & Business Media 2017; Total pages: 33; Peer reviewed |
fulltext | fulltext |
identifier | ISSN: 1382-3256 |
ispartof | Empirical software engineering : an international journal, 2017-06, Vol.22 (3), p.1405-1437 |
issn | 1382-3256 (ISSN); 1573-7616 (EISSN) |
language | eng |
recordid | cdi_proquest_journals_2063633448 |
source | Springer Nature - Complete Springer Journals |
subjects | Compilers; Computer Science; Datasets; Downloading; Freeware; Interpreters; Metadata; Open data; Open source software; Programming Languages; Software Engineering/Programming and Operating Systems; Source code |
title | The Debsources Dataset: two decades of free and open source software |
url | https://sfx.bib-bvb.de/sfx_tum?ctx_ver=Z39.88-2004&ctx_enc=info:ofi/enc:UTF-8&ctx_tim=2025-01-26T21%3A56%3A22IST&url_ver=Z39.88-2004&url_ctx_fmt=infofi/fmt:kev:mtx:ctx&rfr_id=info:sid/primo.exlibrisgroup.com:primo3-Article-proquest_cross&rft_val_fmt=info:ofi/fmt:kev:mtx:journal&rft.genre=article&rft.atitle=The%20Debsources%20Dataset:%20two%20decades%20of%20free%20and%20open%20source%20software&rft.jtitle=Empirical%20software%20engineering%20:%20an%20international%20journal&rft.au=Caneill,%20Matthieu&rft.date=2017-06-01&rft.volume=22&rft.issue=3&rft.spage=1405&rft.epage=1437&rft.pages=1405-1437&rft.issn=1382-3256&rft.eissn=1573-7616&rft_id=info:doi/10.1007/s10664-016-9461-5&rft_dat=%3Cproquest_cross%3E2063633448%3C/proquest_cross%3E%3Curl%3E%3C/url%3E&disable_directlink=true&sfx.directlink=off&sfx.report_link=0&rft_id=info:oai/&rft_pqid=2063633448&rft_id=info:pmid/&rfr_iscdi=true |
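
The description above mentions file-level checksums, size metrics, and deduplication of source files by their SHA1 checksum. A minimal Python sketch of that idea follows. It is not the authors' tooling: the function names `file_metadata` and `deduplicate` are hypothetical, the line count is a plain newline-based count, and TLSH is left out because it is not part of the Python standard library.

```python
import hashlib
from pathlib import Path


def file_metadata(path):
    """Return checksums and simple size metrics for one file.

    Hypothetical helper: the field names are illustrative and are not
    taken from the dataset's database schema.
    """
    data = Path(path).read_bytes()
    return {
        "sha1": hashlib.sha1(data).hexdigest(),
        "sha256": hashlib.sha256(data).hexdigest(),
        "size_bytes": len(data),           # disk-usage style metric
        "lines": len(data.splitlines()),   # naive line count
    }


def deduplicate(paths):
    """Group file paths by the SHA1 of their content.

    Each key is one unique content; the value lists every path that
    carries that content.
    """
    unique = {}
    for p in paths:
        digest = hashlib.sha1(Path(p).read_bytes()).hexdigest()
        unique.setdefault(digest, []).append(str(p))
    return unique
```

Keying on the content hash means identical files shipped by several packages or releases are stored once, which is the property the description attributes to the source tarballs.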