Studying the Impact of Noises in Build Breakage Data

Much research has investigated the common reasons for build breakages. However, prior research has paid little attention to builds that may break due to reasons that are unlikely to be related to development activities. For example, Continuous Integration (CI) builds may break due to timeout or conn...

Full description

Bibliographic details
Published in: IEEE transactions on software engineering 2021-09, Vol.47 (9), p.1998-2011
Main authors: Ghaleb, Taher Ahmed, da Costa, Daniel Alencar, Zou, Ying, Hassan, Ahmed E.
Format: Article
Language: eng
Subjects:
Online access: Order full text
container_end_page 2011
container_issue 9
container_start_page 1998
container_title IEEE transactions on software engineering
container_volume 47
creator Ghaleb, Taher Ahmed
da Costa, Daniel Alencar
Zou, Ying
Hassan, Ahmed E.
description Much research has investigated the common reasons for build breakages. However, prior research has paid little attention to builds that may break due to reasons that are unlikely to be related to development activities. For example, Continuous Integration (CI) builds may break due to timeout or connection errors while generating the build. Such kinds of build breakages potentially introduce noises to build breakage data. Not considering such noises may lead to misleading results when studying CI builds. In this paper, we propose three criteria to identify build breakages that can potentially introduce noises to build breakage data. We apply these criteria to a dataset of 350,246 builds from 153 GitHub projects that are linked with Travis CI. Our results reveal that 33 percent of the build breakages are due to environmental factors (e.g., errors in CI servers), 29 percent are due to (unfixed) errors in previous builds, and 9 percent are due to build jobs that were later deemed by developers as noisy (there is an overlap of 17 percent between these three types of breakages). We measure the impact of noises in build breakage data on modeling build breakages. We observe that models that use uncleaned build breakage data can lead to misleading associations between build breakages and development activities (e.g., the role of developer). However, such associations could not be observed after eliminating noisy build breakages. Moreover, we replicate a prior study that investigates the association between build breakages and development activities using data from 14 GitHub projects. We observe that some observations reported by the prior study (e.g., pull requests cause more breakages) do not hold after eliminating the noises from build breakage data.
doi_str_mv 10.1109/TSE.2019.2941880
format Article
publisher New York: IEEE
rights Copyright IEEE Computer Society 2021
date 2021-09-01
eissn 1939-3520
coden IESEDJ
ieee_id 8839858
pqid 2573570401
tpages 14
orcid https://orcid.org/0000-0001-9336-7298
https://orcid.org/0000-0003-4525-3266
toplevel peer_reviewed
online_resources
collection IEEE All-Society Periodicals Package (ASPP) 2005-present
IEEE All-Society Periodicals Package (ASPP) 1998-Present
IEEE Electronic Library (IEL)
CrossRef
ProQuest Computer Science Collection
ProQuest Health & Medical Complete (Alumni)
linktohtml https://ieeexplore.ieee.org/document/8839858
fulltext fulltext_linktorsrc
identifier ISSN: 0098-5589
ispartof IEEE transactions on software engineering, 2021-09, Vol.47 (9), p.1998-2011
issn 0098-5589
1939-3520
language eng
recordid cdi_crossref_primary_10_1109_TSE_2019_2941880
source IEEE Electronic Library (IEL)
subjects CI build breakages
Continuous integration
Criteria
Data models
empirical software engineering
Environmental factors
Indexes
mining software repositories
Noise measurement
noisy data
Servers
Software
title Studying the Impact of Noises in Build Breakage Data
url https://sfx.bib-bvb.de/sfx_tum?ctx_ver=Z39.88-2004&ctx_enc=info:ofi/enc:UTF-8&ctx_tim=2025-01-20T14%3A12%3A35IST&url_ver=Z39.88-2004&url_ctx_fmt=infofi/fmt:kev:mtx:ctx&rfr_id=info:sid/primo.exlibrisgroup.com:primo3-Article-proquest_RIE&rft_val_fmt=info:ofi/fmt:kev:mtx:journal&rft.genre=article&rft.atitle=Studying%20the%20Impact%20of%20Noises%20in%20Build%20Breakage%20Data&rft.jtitle=IEEE%20transactions%20on%20software%20engineering&rft.au=Ghaleb,%20Taher%20Ahmed&rft.date=2021-09-01&rft.volume=47&rft.issue=9&rft.spage=1998&rft.epage=2011&rft.pages=1998-2011&rft.issn=0098-5589&rft.eissn=1939-3520&rft.coden=IESEDJ&rft_id=info:doi/10.1109/TSE.2019.2941880&rft_dat=%3Cproquest_RIE%3E2573570401%3C/proquest_RIE%3E%3Curl%3E%3C/url%3E&disable_directlink=true&sfx.directlink=off&sfx.report_link=0&rft_id=info:oai/&rft_pqid=2573570401&rft_id=info:pmid/&rft_ieee_id=8839858&rfr_iscdi=true
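
The description in this record names three kinds of potentially noisy build breakages: environmental failures, breakages carried over from unfixed errors in previous builds, and jobs that developers later deemed noisy. The Python sketch below illustrates how such labels might be attached to a chronological list of builds. It is not the authors' implementation. The `Build` fields, the keyword list, the `allowed_failure` flag, and the rule "the previous build was also broken" are all hypothetical simplifications introduced here for illustration.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional

# Hypothetical keywords: the description only mentions timeout and connection errors.
ENV_KEYWORDS = ("timeout", "timed out", "connection")


@dataclass
class Build:
    build_id: int
    broken: bool
    log: str = ""
    # Hypothetical flag: the job was later deemed noisy by developers.
    allowed_failure: bool = False


def noise_reasons(build: Build, previous: Optional[Build]) -> List[str]:
    """Return the (possibly overlapping) noise labels for one build."""
    if not build.broken:
        return []
    reasons = []
    text = build.log.lower()
    # Criterion 1 (assumed proxy): the log mentions an environmental error.
    if any(keyword in text for keyword in ENV_KEYWORDS):
        reasons.append("environmental")
    # Criterion 2 (assumed proxy): the preceding build was already broken.
    if previous is not None and previous.broken:
        reasons.append("cascading")
    # Criterion 3 (assumed proxy): developers marked the job as noisy.
    if build.allowed_failure:
        reasons.append("allowed_failure")
    return reasons


def label_builds(builds: List[Build]) -> Dict[int, List[str]]:
    """Label builds given in chronological order."""
    labels = {}
    previous = None
    for build in builds:
        labels[build.build_id] = noise_reasons(build, previous)
        previous = build
    return labels
```

A build can receive more than one label, which mirrors the overlap between the three types reported in the description. Builds with an empty label list would be the ones kept when modeling breakages on cleaned data.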