Studying the Impact of Noises in Build Breakage Data
Much research has investigated the common reasons for build breakages. However, prior research has paid little attention to builds that may break due to reasons that are unlikely to be related to development activities. For example, Continuous Integration (CI) builds may break due to timeout or conn...
Published in: | IEEE transactions on software engineering 2021-09, Vol.47 (9), p.1998-2011 |
---|---|
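Main authors: | Ghaleb, Taher Ahmed ; da Costa, Daniel Alencar ; Zou, Ying ; Hassan, Ahmed E. |
Format: | Article |
Language: | eng |
Subjects: | CI build breakages ; Continuous integration ; Criteria ; Data models ; empirical software engineering ; Environmental factors ; Indexes ; mining software repositories ; Noise measurement ; noisy data ; Servers ; Software |
Online access: | Order full text |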
container_end_page | 2011 |
---|---|
container_issue | 9 |
container_start_page | 1998 |
container_title | IEEE transactions on software engineering |
container_volume | 47 |
creator | Ghaleb, Taher Ahmed ; da Costa, Daniel Alencar ; Zou, Ying ; Hassan, Ahmed E. |
description | Much research has investigated the common reasons for build breakages. However, prior research has paid little attention to builds that may break due to reasons that are unlikely to be related to development activities. For example, Continuous Integration (CI) builds may break due to timeout or connection errors while generating the build. Such kinds of build breakages potentially introduce noises to build breakage data. Not considering such noises may lead to misleading results when studying CI builds. In this paper, we propose three criteria to identify build breakages that can potentially introduce noises to build breakage data. We apply these criteria to a dataset of 350,246 builds from 153 GitHub projects that are linked with Travis CI. Our results reveal that 33 percent of the build breakages are due to environmental factors (e.g., errors in CI servers), 29 percent are due to (unfixed) errors in previous builds, and 9 percent are due to build jobs that were later deemed by developers as noisy (there is an overlap of 17 percent between these three types of breakages). We measure the impact of noises in build breakage data on modeling build breakages. We observe that models that use uncleaned build breakage data can lead to misleading associations between build breakages and development activities (e.g., the role of developer). However, such associations could not be observed after eliminating noisy build breakages. Moreover, we replicate a prior study that investigates the association between build breakages and development activities using data from 14 GitHub projects. We observe that some observations reported by the prior study (e.g., pull requests cause more breakages) do not hold after eliminating the noises from build breakage data. |
doi_str_mv | 10.1109/TSE.2019.2941880 |
format | Article |
fullrecord | <record><control><sourceid>proquest_RIE</sourceid><recordid>TN_cdi_crossref_primary_10_1109_TSE_2019_2941880</recordid><sourceformat>XML</sourceformat><sourcesystem>PC</sourcesystem><ieee_id>8839858</ieee_id><sourcerecordid>2573570401</sourcerecordid><originalsourceid>FETCH-LOGICAL-c291t-d783d723abf88ca260056de078fdd7546e834cc2457fe467413c1376743da4f53</originalsourceid><addsrcrecordid>eNo9kEtPAjEUhRujiYjuTdw0cT3j7WvaLgVRSYguwHVT-8BBYLCdWfDvGQJxde_iO-ckH0L3BEpCQD8t5pOSAtEl1ZwoBRdoQDTTBRMULtEAQKtCCKWv0U3OKwAQUooB4vO28_t6u8TtT8DTzc66FjcRfzR1DhnXWzzq6rXHoxTsr10G_GJbe4uuol3ncHe-Q_T1OlmM34vZ59t0_DwrHNWkLbxUzEvK7HdUylla9aOVDyBV9F4KXgXFuHOUCxkDryQnzBEm-4d5y6NgQ_R46t2l5q8LuTWrpkvbftJQIZmQwIH0FJwol5qcU4hml-qNTXtDwBzdmN6NOboxZzd95OEUqUMI_7hSTCuh2AEkwl0f</addsrcrecordid><sourcetype>Aggregation Database</sourcetype><iscdi>true</iscdi><recordtype>article</recordtype><pqid>2573570401</pqid></control><display><type>article</type><title>Studying the Impact of Noises in Build Breakage Data</title><source>IEEE Electronic Library (IEL)</source><creator>Ghaleb, Taher Ahmed ; da Costa, Daniel Alencar ; Zou, Ying ; Hassan, Ahmed E.</creator><creatorcontrib>Ghaleb, Taher Ahmed ; da Costa, Daniel Alencar ; Zou, Ying ; Hassan, Ahmed E.</creatorcontrib><description>Much research has investigated the common reasons for build breakages. However, prior research has paid little attention to builds that may break due to reasons that are unlikely to be related to development activities. For example, Continuous Integration (CI) builds may break due to timeout or connection errors while generating the build. Such kinds of build breakages potentially introduce noises to build breakage data. Not considering such noises may lead to misleading results when studying CI builds. In this paper, we propose three criteria to identify build breakages that can potentially introduce noises to build breakage data. 
We apply these criteria to a dataset of 350,246 builds from 153 GitHub projects that are linked with Travis CI . Our results reveal that 33 percent of the build breakages are due to environmental factors (e.g., errors in CI servers), 29 percent are due to (unfixed) errors in previous builds, and 9 percent are due to build jobs that were later deemed by developers as noisy (there is an overlap of 17 percent between these three types of breakages). We measure the impact of noises in build breakage data on modeling build breakages. We observe that models that use uncleaned build breakage data can lead to misleading associations between build breakages and development activities (e.g., the role of developer). However, such associations could not be observed after eliminating noisy build breakages. Moreover, we replicate a prior study that investigates the association between build breakages and development activities using data from 14 GitHub projects. We observe that some observations reported by the prior study (e.g., pull requests cause more breakages) do not hold after eliminating the noises from build breakage data.</description><identifier>ISSN: 0098-5589</identifier><identifier>EISSN: 1939-3520</identifier><identifier>DOI: 10.1109/TSE.2019.2941880</identifier><identifier>CODEN: IESEDJ</identifier><language>eng</language><publisher>New York: IEEE</publisher><subject>CI build breakages ; Continuous integration ; Criteria ; Data models ; empirical software engineering ; Environmental factors ; Indexes ; mining software repositories ; Noise measurement ; noisy data ; Servers ; Software</subject><ispartof>IEEE transactions on software engineering, 2021-09, Vol.47 (9), p.1998-2011</ispartof><rights>Copyright IEEE Computer Society 
2021</rights><lds50>peer_reviewed</lds50><woscitedreferencessubscribed>false</woscitedreferencessubscribed><citedby>FETCH-LOGICAL-c291t-d783d723abf88ca260056de078fdd7546e834cc2457fe467413c1376743da4f53</citedby><cites>FETCH-LOGICAL-c291t-d783d723abf88ca260056de078fdd7546e834cc2457fe467413c1376743da4f53</cites><orcidid>0000-0001-9336-7298 ; 0000-0003-4525-3266</orcidid></display><links><openurl>$$Topenurl_article</openurl><openurlfulltext>$$Topenurlfull_article</openurlfulltext><thumbnail>$$Tsyndetics_thumb_exl</thumbnail><linktohtml>$$Uhttps://ieeexplore.ieee.org/document/8839858$$EHTML$$P50$$Gieee$$H</linktohtml><link.rule.ids>314,777,781,793,27905,27906,54739</link.rule.ids><linktorsrc>$$Uhttps://ieeexplore.ieee.org/document/8839858$$EView_record_in_IEEE$$FView_record_in_$$GIEEE</linktorsrc></links><search><creatorcontrib>Ghaleb, Taher Ahmed</creatorcontrib><creatorcontrib>da Costa, Daniel Alencar</creatorcontrib><creatorcontrib>Zou, Ying</creatorcontrib><creatorcontrib>Hassan, Ahmed E.</creatorcontrib><title>Studying the Impact of Noises in Build Breakage Data</title><title>IEEE transactions on software engineering</title><addtitle>TSE</addtitle><description>Much research has investigated the common reasons for build breakages. However, prior research has paid little attention to builds that may break due to reasons that are unlikely to be related to development activities. For example, Continuous Integration (CI) builds may break due to timeout or connection errors while generating the build. Such kinds of build breakages potentially introduce noises to build breakage data. Not considering such noises may lead to misleading results when studying CI builds. In this paper, we propose three criteria to identify build breakages that can potentially introduce noises to build breakage data. We apply these criteria to a dataset of 350,246 builds from 153 GitHub projects that are linked with Travis CI . 
Our results reveal that 33 percent of the build breakages are due to environmental factors (e.g., errors in CI servers), 29 percent are due to (unfixed) errors in previous builds, and 9 percent are due to build jobs that were later deemed by developers as noisy (there is an overlap of 17 percent between these three types of breakages). We measure the impact of noises in build breakage data on modeling build breakages. We observe that models that use uncleaned build breakage data can lead to misleading associations between build breakages and development activities (e.g., the role of developer). However, such associations could not be observed after eliminating noisy build breakages. Moreover, we replicate a prior study that investigates the association between build breakages and development activities using data from 14 GitHub projects. We observe that some observations reported by the prior study (e.g., pull requests cause more breakages) do not hold after eliminating the noises from build breakage data.</description><subject>CI build breakages</subject><subject>Continuous integration</subject><subject>Criteria</subject><subject>Data models</subject><subject>empirical software engineering</subject><subject>Environmental factors</subject><subject>Indexes</subject><subject>mining software repositories</subject><subject>Noise measurement</subject><subject>noisy 
data</subject><subject>Servers</subject><subject>Software</subject><issn>0098-5589</issn><issn>1939-3520</issn><fulltext>true</fulltext><rsrctype>article</rsrctype><creationdate>2021</creationdate><recordtype>article</recordtype><sourceid>RIE</sourceid><recordid>eNo9kEtPAjEUhRujiYjuTdw0cT3j7WvaLgVRSYguwHVT-8BBYLCdWfDvGQJxde_iO-ckH0L3BEpCQD8t5pOSAtEl1ZwoBRdoQDTTBRMULtEAQKtCCKWv0U3OKwAQUooB4vO28_t6u8TtT8DTzc66FjcRfzR1DhnXWzzq6rXHoxTsr10G_GJbe4uuol3ncHe-Q_T1OlmM34vZ59t0_DwrHNWkLbxUzEvK7HdUylla9aOVDyBV9F4KXgXFuHOUCxkDryQnzBEm-4d5y6NgQ_R46t2l5q8LuTWrpkvbftJQIZmQwIH0FJwol5qcU4hml-qNTXtDwBzdmN6NOboxZzd95OEUqUMI_7hSTCuh2AEkwl0f</recordid><startdate>20210901</startdate><enddate>20210901</enddate><creator>Ghaleb, Taher Ahmed</creator><creator>da Costa, Daniel Alencar</creator><creator>Zou, Ying</creator><creator>Hassan, Ahmed E.</creator><general>IEEE</general><general>IEEE Computer Society</general><scope>97E</scope><scope>RIA</scope><scope>RIE</scope><scope>AAYXX</scope><scope>CITATION</scope><scope>JQ2</scope><scope>K9.</scope><orcidid>https://orcid.org/0000-0001-9336-7298</orcidid><orcidid>https://orcid.org/0000-0003-4525-3266</orcidid></search><sort><creationdate>20210901</creationdate><title>Studying the Impact of Noises in Build Breakage Data</title><author>Ghaleb, Taher Ahmed ; da Costa, Daniel Alencar ; Zou, Ying ; Hassan, Ahmed E.</author></sort><facets><frbrtype>5</frbrtype><frbrgroupid>cdi_FETCH-LOGICAL-c291t-d783d723abf88ca260056de078fdd7546e834cc2457fe467413c1376743da4f53</frbrgroupid><rsrctype>articles</rsrctype><prefilter>articles</prefilter><language>eng</language><creationdate>2021</creationdate><topic>CI build breakages</topic><topic>Continuous integration</topic><topic>Criteria</topic><topic>Data models</topic><topic>empirical software engineering</topic><topic>Environmental factors</topic><topic>Indexes</topic><topic>mining software repositories</topic><topic>Noise measurement</topic><topic>noisy 
data</topic><topic>Servers</topic><topic>Software</topic><toplevel>peer_reviewed</toplevel><toplevel>online_resources</toplevel><creatorcontrib>Ghaleb, Taher Ahmed</creatorcontrib><creatorcontrib>da Costa, Daniel Alencar</creatorcontrib><creatorcontrib>Zou, Ying</creatorcontrib><creatorcontrib>Hassan, Ahmed E.</creatorcontrib><collection>IEEE All-Society Periodicals Package (ASPP) 2005-present</collection><collection>IEEE All-Society Periodicals Package (ASPP) 1998-Present</collection><collection>IEEE Electronic Library (IEL)</collection><collection>CrossRef</collection><collection>ProQuest Computer Science Collection</collection><collection>ProQuest Health & Medical Complete (Alumni)</collection><jtitle>IEEE transactions on software engineering</jtitle></facets><delivery><delcategory>Remote Search Resource</delcategory><fulltext>fulltext_linktorsrc</fulltext></delivery><addata><au>Ghaleb, Taher Ahmed</au><au>da Costa, Daniel Alencar</au><au>Zou, Ying</au><au>Hassan, Ahmed E.</au><format>journal</format><genre>article</genre><ristype>JOUR</ristype><atitle>Studying the Impact of Noises in Build Breakage Data</atitle><jtitle>IEEE transactions on software engineering</jtitle><stitle>TSE</stitle><date>2021-09-01</date><risdate>2021</risdate><volume>47</volume><issue>9</issue><spage>1998</spage><epage>2011</epage><pages>1998-2011</pages><issn>0098-5589</issn><eissn>1939-3520</eissn><coden>IESEDJ</coden><abstract>Much research has investigated the common reasons for build breakages. However, prior research has paid little attention to builds that may break due to reasons that are unlikely to be related to development activities. For example, Continuous Integration (CI) builds may break due to timeout or connection errors while generating the build. Such kinds of build breakages potentially introduce noises to build breakage data. Not considering such noises may lead to misleading results when studying CI builds. 
In this paper, we propose three criteria to identify build breakages that can potentially introduce noises to build breakage data. We apply these criteria to a dataset of 350,246 builds from 153 GitHub projects that are linked with Travis CI . Our results reveal that 33 percent of the build breakages are due to environmental factors (e.g., errors in CI servers), 29 percent are due to (unfixed) errors in previous builds, and 9 percent are due to build jobs that were later deemed by developers as noisy (there is an overlap of 17 percent between these three types of breakages). We measure the impact of noises in build breakage data on modeling build breakages. We observe that models that use uncleaned build breakage data can lead to misleading associations between build breakages and development activities (e.g., the role of developer). However, such associations could not be observed after eliminating noisy build breakages. Moreover, we replicate a prior study that investigates the association between build breakages and development activities using data from 14 GitHub projects. We observe that some observations reported by the prior study (e.g., pull requests cause more breakages) do not hold after eliminating the noises from build breakage data.</abstract><cop>New York</cop><pub>IEEE</pub><doi>10.1109/TSE.2019.2941880</doi><tpages>14</tpages><orcidid>https://orcid.org/0000-0001-9336-7298</orcidid><orcidid>https://orcid.org/0000-0003-4525-3266</orcidid></addata></record> |
fulltext | fulltext_linktorsrc |
identifier | ISSN: 0098-5589 |
ispartof | IEEE transactions on software engineering, 2021-09, Vol.47 (9), p.1998-2011 |
issn | 0098-5589 ; 1939-3520 |
language | eng |
recordid | cdi_crossref_primary_10_1109_TSE_2019_2941880 |
source | IEEE Electronic Library (IEL) |
subjects | CI build breakages ; Continuous integration ; Criteria ; Data models ; empirical software engineering ; Environmental factors ; Indexes ; mining software repositories ; Noise measurement ; noisy data ; Servers ; Software |
title | Studying the Impact of Noises in Build Breakage Data |
url | https://sfx.bib-bvb.de/sfx_tum?ctx_ver=Z39.88-2004&ctx_enc=info:ofi/enc:UTF-8&ctx_tim=2025-01-20T14%3A12%3A35IST&url_ver=Z39.88-2004&url_ctx_fmt=infofi/fmt:kev:mtx:ctx&rfr_id=info:sid/primo.exlibrisgroup.com:primo3-Article-proquest_RIE&rft_val_fmt=info:ofi/fmt:kev:mtx:journal&rft.genre=article&rft.atitle=Studying%20the%20Impact%20of%20Noises%20in%20Build%20Breakage%20Data&rft.jtitle=IEEE%20transactions%20on%20software%20engineering&rft.au=Ghaleb,%20Taher%20Ahmed&rft.date=2021-09-01&rft.volume=47&rft.issue=9&rft.spage=1998&rft.epage=2011&rft.pages=1998-2011&rft.issn=0098-5589&rft.eissn=1939-3520&rft.coden=IESEDJ&rft_id=info:doi/10.1109/TSE.2019.2941880&rft_dat=%3Cproquest_RIE%3E2573570401%3C/proquest_RIE%3E%3Curl%3E%3C/url%3E&disable_directlink=true&sfx.directlink=off&sfx.report_link=0&rft_id=info:oai/&rft_pqid=2573570401&rft_id=info:pmid/&rft_ieee_id=8839858&rfr_iscdi=true |
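
The abstract in this record names three kinds of potentially noisy build breakages: environmental errors, unfixed errors carried over from previous builds, and jobs that developers later deemed noisy. The following is only a minimal sketch of how such labels might be assigned, not the authors' method. The `Build` fields, the `ENVIRONMENTAL_MARKERS` strings, and the rule "previous build was also broken" are all assumptions made for illustration; the paper's actual criteria are not given in this record.

```python
from dataclasses import dataclass

# Hypothetical log fragments treated as environmental errors
# (an assumption for illustration, not the paper's list).
ENVIRONMENTAL_MARKERS = ("timeout", "connection reset", "connection refused")


@dataclass
class Build:
    build_id: int
    broken: bool
    log: str = ""
    # Hypothetical flag: the job was later marked by developers as noisy.
    deemed_noisy: bool = False


def noise_labels(builds):
    """Label each broken build; `builds` must be in chronological order.

    Returns a dict mapping build_id to a set of labels. A build may carry
    several labels, which is how overlap between the categories arises.
    """
    labels = {}
    previous_broken = False
    for build in builds:
        if build.broken:
            found = set()
            text = build.log.lower()
            if any(marker in text for marker in ENVIRONMENTAL_MARKERS):
                found.add("environmental")
            # Simplification: any breakage directly after a broken build
            # is treated as inheriting an unfixed earlier error.
            if previous_broken:
                found.add("previous_build")
            if build.deemed_noisy:
                found.add("deemed_noisy")
            labels[build.build_id] = found
        previous_broken = build.broken
    return labels


history = [
    Build(1, True, "Connection reset by peer"),
    Build(2, True, "Test failed"),
    Build(3, False),
    Build(4, True, "assert error", deemed_noisy=True),
]
result = noise_labels(history)
```

A broken build with an empty label set would count as a clean breakage under these assumptions, and a build with more than one label would contribute to the overlap the abstract mentions.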