Categorizing the Content of GitHub README Files

README files play an essential role in shaping a developer’s first impression of a software repository and in documenting the software project that the repository hosts. Yet, we lack a systematic understanding of the content of a typical README file as well as tools that can process these files auto...

Full description

Bibliographic details
Published in: Empirical software engineering : an international journal 2019-06, Vol.24 (3), p.1296-1327
Main authors: Prana, Gede Artha Azriadi, Treude, Christoph, Thung, Ferdian, Atapattu, Thushari, Lo, David
Format: Article
Language: eng
Subjects:
Online access: Full text
container_end_page 1327
container_issue 3
container_start_page 1296
container_title Empirical software engineering : an international journal
container_volume 24
creator Prana, Gede Artha Azriadi
Treude, Christoph
Thung, Ferdian
Atapattu, Thushari
Lo, David
description README files play an essential role in shaping a developer’s first impression of a software repository and in documenting the software project that the repository hosts. Yet, we lack a systematic understanding of the content of a typical README file as well as tools that can process these files automatically. To close this gap, we conduct a qualitative study involving the manual annotation of 4,226 README file sections from 393 randomly sampled GitHub repositories and we design and evaluate a classifier and a set of features that can categorize these sections automatically. We find that information discussing the ‘What’ and ‘How’ of a repository is very common, while many README files lack information regarding the purpose and status of a repository. Our multi-label classifier which can predict eight different categories achieves an F1 score of 0.746. To evaluate the usefulness of the classification, we used the automatically determined classes to label sections in GitHub README files using badges and showed files with and without these badges to twenty software professionals. The majority of participants perceived the automated labeling of sections based on our classifier to ease information discovery. This work enables the owners of software repositories to improve the quality of their documentation and it has the potential to make it easier for the software development community to discover relevant information in GitHub README files.
doi_str_mv 10.1007/s10664-018-9660-3
format Article
fullrecord <record><control><sourceid>proquest_cross</sourceid><recordid>TN_cdi_proquest_journals_2242862334</recordid><sourceformat>XML</sourceformat><sourcesystem>PC</sourcesystem><sourcerecordid>2242862334</sourcerecordid><originalsourceid>FETCH-LOGICAL-c430t-924e6eccaee83dc6268f631fbe71b8260347c6125f4564b1afef3740c25fddcf3</originalsourceid><addsrcrecordid>eNp1kE1Lw0AQhhdRsFZ_gLeA57X7ldnkWGI_hIogel6S7WxMqUnd3R7017slgidPMwzP-w48hNxyds8Z07PAGYCijBe0BGBUnpEJz7WkGjicp10WgkqRwyW5CmHHGCu1yidkVtUR28F3313fZvEds2roI_YxG1y26uL62GQvi_nD0yJbdnsM1-TC1fuAN79zSt6Wi9dqTTfPq8dqvqFWSRZpKRQCWlsjFnJrQUDhQHLXoOZNIYBJpS1wkTuVg2p47dBJrZhNl-3WOjkld2PvwQ-fRwzR7Iaj79NLI4QSBQgpVaL4SFk_hODRmYPvPmr_ZTgzJy9m9GKSF3PyYmTKiDETEtu36P-a_w_9AHt3Y1g</addsrcrecordid><sourcetype>Aggregation Database</sourcetype><iscdi>true</iscdi><recordtype>article</recordtype><pqid>2242862334</pqid></control><display><type>article</type><title>Categorizing the Content of GitHub README Files</title><source>SpringerLink Journals</source><creator>Prana, Gede Artha Azriadi ; Treude, Christoph ; Thung, Ferdian ; Atapattu, Thushari ; Lo, David</creator><creatorcontrib>Prana, Gede Artha Azriadi ; Treude, Christoph ; Thung, Ferdian ; Atapattu, Thushari ; Lo, David</creatorcontrib><description>README files play an essential role in shaping a developer’s first impression of a software repository and in documenting the software project that the repository hosts. Yet, we lack a systematic understanding of the content of a typical README file as well as tools that can process these files automatically. To close this gap, we conduct a qualitative study involving the manual annotation of 4,226 README file sections from 393 randomly sampled GitHub repositories and we design and evaluate a classifier and a set of features that can categorize these sections automatically. 
We find that information discussing the ‘What’ and ‘How’ of a repository is very common, while many README files lack information regarding the purpose and status of a repository. Our multi-label classifier which can predict eight different categories achieves an F1 score of 0.746. To evaluate the usefulness of the classification, we used the automatically determined classes to label sections in GitHub README files using badges and showed files with and without these badges to twenty software professionals. The majority of participants perceived the automated labeling of sections based on our classifier to ease information discovery. This work enables the owners of software repositories to improve the quality of their documentation and it has the potential to make it easier for the software development community to discover relevant information in GitHub README files.</description><identifier>ISSN: 1382-3256</identifier><identifier>EISSN: 1573-7616</identifier><identifier>DOI: 10.1007/s10664-018-9660-3</identifier><language>eng</language><publisher>New York: Springer US</publisher><subject>Annotations ; Classifiers ; Compilers ; Computer Science ; Interpreters ; Programming Languages ; Repositories ; Software development ; Software Engineering/Programming and Operating Systems</subject><ispartof>Empirical software engineering : an international journal, 2019-06, Vol.24 (3), p.1296-1327</ispartof><rights>Springer Science+Business Media, LLC, part of Springer Nature 2018</rights><rights>Copyright Springer Nature B.V. 
2019</rights><lds50>peer_reviewed</lds50><oa>free_for_read</oa><woscitedreferencessubscribed>false</woscitedreferencessubscribed><citedby>FETCH-LOGICAL-c430t-924e6eccaee83dc6268f631fbe71b8260347c6125f4564b1afef3740c25fddcf3</citedby><cites>FETCH-LOGICAL-c430t-924e6eccaee83dc6268f631fbe71b8260347c6125f4564b1afef3740c25fddcf3</cites><orcidid>0000-0003-3759-5661</orcidid></display><links><openurl>$$Topenurl_article</openurl><openurlfulltext>$$Topenurlfull_article</openurlfulltext><thumbnail>$$Tsyndetics_thumb_exl</thumbnail><linktopdf>$$Uhttps://link.springer.com/content/pdf/10.1007/s10664-018-9660-3$$EPDF$$P50$$Gspringer$$H</linktopdf><linktohtml>$$Uhttps://link.springer.com/10.1007/s10664-018-9660-3$$EHTML$$P50$$Gspringer$$H</linktohtml><link.rule.ids>314,776,780,27903,27904,41467,42536,51297</link.rule.ids></links><search><creatorcontrib>Prana, Gede Artha Azriadi</creatorcontrib><creatorcontrib>Treude, Christoph</creatorcontrib><creatorcontrib>Thung, Ferdian</creatorcontrib><creatorcontrib>Atapattu, Thushari</creatorcontrib><creatorcontrib>Lo, David</creatorcontrib><title>Categorizing the Content of GitHub README Files</title><title>Empirical software engineering : an international journal</title><addtitle>Empir Software Eng</addtitle><description>README files play an essential role in shaping a developer’s first impression of a software repository and in documenting the software project that the repository hosts. Yet, we lack a systematic understanding of the content of a typical README file as well as tools that can process these files automatically. To close this gap, we conduct a qualitative study involving the manual annotation of 4,226 README file sections from 393 randomly sampled GitHub repositories and we design and evaluate a classifier and a set of features that can categorize these sections automatically. 
We find that information discussing the ‘What’ and ‘How’ of a repository is very common, while many README files lack information regarding the purpose and status of a repository. Our multi-label classifier which can predict eight different categories achieves an F1 score of 0.746. To evaluate the usefulness of the classification, we used the automatically determined classes to label sections in GitHub README files using badges and showed files with and without these badges to twenty software professionals. The majority of participants perceived the automated labeling of sections based on our classifier to ease information discovery. This work enables the owners of software repositories to improve the quality of their documentation and it has the potential to make it easier for the software development community to discover relevant information in GitHub README files.</description><subject>Annotations</subject><subject>Classifiers</subject><subject>Compilers</subject><subject>Computer Science</subject><subject>Interpreters</subject><subject>Programming Languages</subject><subject>Repositories</subject><subject>Software development</subject><subject>Software Engineering/Programming and Operating Systems</subject><issn>1382-3256</issn><issn>1573-7616</issn><fulltext>true</fulltext><rsrctype>article</rsrctype><creationdate>2019</creationdate><recordtype>article</recordtype><recordid>eNp1kE1Lw0AQhhdRsFZ_gLeA57X7ldnkWGI_hIogel6S7WxMqUnd3R7017slgidPMwzP-w48hNxyds8Z07PAGYCijBe0BGBUnpEJz7WkGjicp10WgkqRwyW5CmHHGCu1yidkVtUR28F3313fZvEds2roI_YxG1y26uL62GQvi_nD0yJbdnsM1-TC1fuAN79zSt6Wi9dqTTfPq8dqvqFWSRZpKRQCWlsjFnJrQUDhQHLXoOZNIYBJpS1wkTuVg2p47dBJrZhNl-3WOjkld2PvwQ-fRwzR7Iaj79NLI4QSBQgpVaL4SFk_hODRmYPvPmr_ZTgzJy9m9GKSF3PyYmTKiDETEtu36P-a_w_9AHt3Y1g</recordid><startdate>20190601</startdate><enddate>20190601</enddate><creator>Prana, Gede Artha Azriadi</creator><creator>Treude, Christoph</creator><creator>Thung, Ferdian</creator><creator>Atapattu, Thushari</creator><creator>Lo, 
David</creator><general>Springer US</general><general>Springer Nature B.V</general><scope>AAYXX</scope><scope>CITATION</scope><scope>7SC</scope><scope>8FD</scope><scope>JQ2</scope><scope>L7M</scope><scope>L~C</scope><scope>L~D</scope><orcidid>https://orcid.org/0000-0003-3759-5661</orcidid></search><sort><creationdate>20190601</creationdate><title>Categorizing the Content of GitHub README Files</title><author>Prana, Gede Artha Azriadi ; Treude, Christoph ; Thung, Ferdian ; Atapattu, Thushari ; Lo, David</author></sort><facets><frbrtype>5</frbrtype><frbrgroupid>cdi_FETCH-LOGICAL-c430t-924e6eccaee83dc6268f631fbe71b8260347c6125f4564b1afef3740c25fddcf3</frbrgroupid><rsrctype>articles</rsrctype><prefilter>articles</prefilter><language>eng</language><creationdate>2019</creationdate><topic>Annotations</topic><topic>Classifiers</topic><topic>Compilers</topic><topic>Computer Science</topic><topic>Interpreters</topic><topic>Programming Languages</topic><topic>Repositories</topic><topic>Software development</topic><topic>Software Engineering/Programming and Operating Systems</topic><toplevel>peer_reviewed</toplevel><toplevel>online_resources</toplevel><creatorcontrib>Prana, Gede Artha Azriadi</creatorcontrib><creatorcontrib>Treude, Christoph</creatorcontrib><creatorcontrib>Thung, Ferdian</creatorcontrib><creatorcontrib>Atapattu, Thushari</creatorcontrib><creatorcontrib>Lo, David</creatorcontrib><collection>CrossRef</collection><collection>Computer and Information Systems Abstracts</collection><collection>Technology Research Database</collection><collection>ProQuest Computer Science Collection</collection><collection>Advanced Technologies Database with Aerospace</collection><collection>Computer and Information Systems Abstracts – Academic</collection><collection>Computer and Information Systems Abstracts Professional</collection><jtitle>Empirical software engineering : an international journal</jtitle></facets><delivery><delcategory>Remote Search 
Resource</delcategory><fulltext>fulltext</fulltext></delivery><addata><au>Prana, Gede Artha Azriadi</au><au>Treude, Christoph</au><au>Thung, Ferdian</au><au>Atapattu, Thushari</au><au>Lo, David</au><format>journal</format><genre>article</genre><ristype>JOUR</ristype><atitle>Categorizing the Content of GitHub README Files</atitle><jtitle>Empirical software engineering : an international journal</jtitle><stitle>Empir Software Eng</stitle><date>2019-06-01</date><risdate>2019</risdate><volume>24</volume><issue>3</issue><spage>1296</spage><epage>1327</epage><pages>1296-1327</pages><issn>1382-3256</issn><eissn>1573-7616</eissn><abstract>README files play an essential role in shaping a developer’s first impression of a software repository and in documenting the software project that the repository hosts. Yet, we lack a systematic understanding of the content of a typical README file as well as tools that can process these files automatically. To close this gap, we conduct a qualitative study involving the manual annotation of 4,226 README file sections from 393 randomly sampled GitHub repositories and we design and evaluate a classifier and a set of features that can categorize these sections automatically. We find that information discussing the ‘What’ and ‘How’ of a repository is very common, while many README files lack information regarding the purpose and status of a repository. Our multi-label classifier which can predict eight different categories achieves an F1 score of 0.746. To evaluate the usefulness of the classification, we used the automatically determined classes to label sections in GitHub README files using badges and showed files with and without these badges to twenty software professionals. The majority of participants perceived the automated labeling of sections based on our classifier to ease information discovery. 
This work enables the owners of software repositories to improve the quality of their documentation and it has the potential to make it easier for the software development community to discover relevant information in GitHub README files.</abstract><cop>New York</cop><pub>Springer US</pub><doi>10.1007/s10664-018-9660-3</doi><tpages>32</tpages><orcidid>https://orcid.org/0000-0003-3759-5661</orcidid><oa>free_for_read</oa></addata></record>
fulltext fulltext
identifier ISSN: 1382-3256
ispartof Empirical software engineering : an international journal, 2019-06, Vol.24 (3), p.1296-1327
issn 1382-3256
1573-7616
language eng
recordid cdi_proquest_journals_2242862334
source SpringerLink Journals
subjects Annotations
Classifiers
Compilers
Computer Science
Interpreters
Programming Languages
Repositories
Software development
Software Engineering/Programming and Operating Systems
title Categorizing the Content of GitHub README Files
url https://sfx.bib-bvb.de/sfx_tum?ctx_ver=Z39.88-2004&ctx_enc=info:ofi/enc:UTF-8&ctx_tim=2025-01-27T03%3A36%3A07IST&url_ver=Z39.88-2004&url_ctx_fmt=infofi/fmt:kev:mtx:ctx&rfr_id=info:sid/primo.exlibrisgroup.com:primo3-Article-proquest_cross&rft_val_fmt=info:ofi/fmt:kev:mtx:journal&rft.genre=article&rft.atitle=Categorizing%20the%20Content%20of%20GitHub%20README%20Files&rft.jtitle=Empirical%20software%20engineering%20:%20an%20international%20journal&rft.au=Prana,%20Gede%20Artha%20Azriadi&rft.date=2019-06-01&rft.volume=24&rft.issue=3&rft.spage=1296&rft.epage=1327&rft.pages=1296-1327&rft.issn=1382-3256&rft.eissn=1573-7616&rft_id=info:doi/10.1007/s10664-018-9660-3&rft_dat=%3Cproquest_cross%3E2242862334%3C/proquest_cross%3E%3Curl%3E%3C/url%3E&disable_directlink=true&sfx.directlink=off&sfx.report_link=0&rft_id=info:oai/&rft_pqid=2242862334&rft_id=info:pmid/&rfr_iscdi=true