Context-aware Retrieval-based Deep Commit Message Generation

Commit messages recorded in version control systems contain valuable information for software development, maintenance, and comprehension. Unfortunately, developers often commit code with empty or poor quality commit messages. To address this issue, several studies have proposed approaches to genera...

Bibliographic Details
Published in:	ACM transactions on software engineering and methodology 2021-07, Vol.30 (4), p.1-30
Main Authors:	Wang, Haoye; Xia, Xin; Lo, David; He, Qiang; Wang, Xinyu; Grundy, John
Format:	Article
Language:	eng
Online Access:	Full text

container_end_page	30
container_issue	4
container_start_page	1
container_title	ACM transactions on software engineering and methodology
container_volume	30
creator	Wang, Haoye ; Xia, Xin ; Lo, David ; He, Qiang ; Wang, Xinyu ; Grundy, John
description	Commit messages recorded in version control systems contain valuable information for software development, maintenance, and comprehension. Unfortunately, developers often commit code with empty or poor-quality commit messages. To address this issue, several studies have proposed approaches to generate commit messages from commit diffs. Recent studies use neural machine translation algorithms to translate git diffs into commit messages and have achieved some promising results. However, these learning-based methods tend to generate high-frequency words but ignore low-frequency ones. In addition, they suffer from exposure bias, which leads to a gap between the training phase and the testing phase. In this article, we propose CoRec to address these two limitations. Specifically, we first train a context-aware encoder-decoder model that randomly selects either the previous output of the decoder or the embedding vector of a ground-truth word as context, making the model gradually aware of previous alignment choices. Given a diff for testing, the trained model is reused to retrieve the most similar diff from the training set. Finally, we use the retrieved diff to guide the probability distribution for the final generated vocabulary. Our method combines the advantages of both information retrieval and neural machine translation. We evaluate CoRec on a dataset from Liu et al. and a large-scale dataset crawled from 10K popular Java repositories on GitHub. Our experimental results show that CoRec significantly outperforms the state-of-the-art method NNGen by 19% on average in terms of BLEU.
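
The three steps named in the description (random context selection during training, retrieval of the most similar training diff, and retrieval-guided output probabilities) can be illustrated with a minimal Python sketch. This is not the authors' implementation: the function names, the use of cosine similarity over diff vectors, and the linear mixing rule with a `weight` parameter are all assumptions made for illustration, since the abstract does not specify them.

```python
import math
import random


def choose_context(prev_output_vec, ground_truth_vec, p_ground_truth, rng):
    """Randomly pick the decoder context for the next step.

    With probability p_ground_truth the ground-truth word embedding is used;
    otherwise the decoder's own previous output is used (hypothetical helper).
    """
    return ground_truth_vec if rng.random() < p_ground_truth else prev_output_vec


def cosine(a, b):
    """Cosine similarity of two equal-length vectors (assumed similarity measure)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieve_most_similar(query_vec, train_vecs):
    """Return the index of the training diff vector closest to the query."""
    return max(range(len(train_vecs)), key=lambda i: cosine(query_vec, train_vecs[i]))


def combine(p_model, p_retrieved, weight):
    """Mix two next-word distributions by linear interpolation (assumed rule).

    weight = 0 keeps only the model's distribution; weight = 1 keeps only
    the distribution derived from the retrieved diff.
    """
    vocab = set(p_model) | set(p_retrieved)
    mixed = {
        w: (1 - weight) * p_model.get(w, 0.0) + weight * p_retrieved.get(w, 0.0)
        for w in vocab
    }
    total = sum(mixed.values())
    return {w: v / total for w, v in mixed.items()}


# Toy usage with made-up vectors and probabilities.
rng = random.Random(0)
context = choose_context([0.1, 0.2], [0.9, 0.8], p_ground_truth=0.5, rng=rng)
best = retrieve_most_similar([1.0, 0.0], [[0.0, 1.0], [1.0, 0.1], [-1.0, 0.0]])
mixed = combine({"fix": 0.6, "add": 0.4}, {"fix": 0.2, "update": 0.8}, weight=0.5)
```

In this sketch, lowering `p_ground_truth` over the course of training would expose the model to more of its own predictions, which is one common way to narrow the gap between training and testing that the description attributes to exposure bias.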
doi_str_mv	10.1145/3464689
format	Article
fullrecord	<record><control><sourceid>crossref</sourceid><recordid>TN_cdi_crossref_primary_10_1145_3464689</recordid><sourceformat>XML</sourceformat><sourcesystem>PC</sourcesystem><sourcerecordid>10_1145_3464689</sourcerecordid><originalsourceid>FETCH-LOGICAL-c258t-40351faf5d6911fd43c0d459e8e26e6b8dcf1b901f3e0abb5e5b22d1292f2add3</originalsourceid><addsrcrecordid>eNotj8tKAzEUQIMoWKv4C7NzFc3NqxNwI6NWoaUgCu6Gm8mNjHRmShJ8_L2KXZ2zOnAYOwdxCaDNldJW29odsBkYs-AL5eThrwvtuFLwesxOcn4XApSQesaum2ks9FU4fmKi6olK6ukDt9xjplDdEu2qZhqGvlRryhnfqFrSSAlLP42n7CjiNtPZnnP2cn_33Dzw1Wb52NyseCdNXbgWykDEaIJ1ADFo1YmgjaOapCXr69BF8E5AVCTQe0PGSxlAOhklhqDm7OK_26Up50Sx3aV-wPTdgmj_ptv9tPoBzLFJIw</addsrcrecordid><sourcetype>Aggregation Database</sourcetype><iscdi>true</iscdi><recordtype>article</recordtype></control><display><type>article</type><title>Context-aware Retrieval-based Deep Commit Message Generation</title><source>ACM Digital Library Complete</source><creator>Wang, Haoye ; Xia, Xin ; Lo, David ; He, Qiang ; Wang, Xinyu ; Grundy, John</creator><creatorcontrib>Wang, Haoye ; Xia, Xin ; Lo, David ; He, Qiang ; Wang, Xinyu ; Grundy, John</creatorcontrib><description>Commit messages recorded in version control systems contain valuable information for software development, maintenance, and comprehension. Unfortunately, developers often commit code with empty or poor quality commit messages. To address this issue, several studies have proposed approaches to generate commit messages from commit diffs . Recent studies make use of neural machine translation algorithms to try and translate git diffs into commit messages and have achieved some promising results. However, these learning-based methods tend to generate high-frequency words but ignore low-frequency ones. In addition, they suffer from exposure bias issues, which leads to a gap between training phase and testing phase. In this article, we propose CoRec to address the above two limitations. 
Specifically, we first train a context-aware encoder-decoder model that randomly selects the previous output of the decoder or the embedding vector of a ground truth word as context to make the model gradually aware of previous alignment choices. Given a diff for testing, the trained model is reused to retrieve the most similar diff from the training set. Finally, we use the retrieval diff to guide the probability distribution for the final generated vocabulary. Our method combines the advantages of both information retrieval and neural machine translation. We evaluate CoRec on a dataset from Liu et al. and a large-scale dataset crawled from 10K popular Java repositories in Github. Our experimental results show that CoRec significantly outperforms the state-of-the-art method NNGen by 19% on average in terms of BLEU.</description><identifier>ISSN: 1049-331X</identifier><identifier>EISSN: 1557-7392</identifier><identifier>DOI: 10.1145/3464689</identifier><language>eng</language><ispartof>ACM transactions on software engineering and methodology, 2021-07, Vol.30 (4), p.1-30</ispartof><lds50>peer_reviewed</lds50><oa>free_for_read</oa><woscitedreferencessubscribed>false</woscitedreferencessubscribed><citedby>FETCH-LOGICAL-c258t-40351faf5d6911fd43c0d459e8e26e6b8dcf1b901f3e0abb5e5b22d1292f2add3</citedby><cites>FETCH-LOGICAL-c258t-40351faf5d6911fd43c0d459e8e26e6b8dcf1b901f3e0abb5e5b22d1292f2add3</cites></display><links><openurl>$$Topenurl_article</openurl><openurlfulltext>$$Topenurlfull_article</openurlfulltext><thumbnail>$$Tsyndetics_thumb_exl</thumbnail><link.rule.ids>314,776,780,27901,27902</link.rule.ids></links><search><creatorcontrib>Wang, Haoye</creatorcontrib><creatorcontrib>Xia, Xin</creatorcontrib><creatorcontrib>Lo, David</creatorcontrib><creatorcontrib>He, Qiang</creatorcontrib><creatorcontrib>Wang, Xinyu</creatorcontrib><creatorcontrib>Grundy, John</creatorcontrib><title>Context-aware Retrieval-based Deep Commit Message Generation</title><title>ACM transactions 
on software engineering and methodology</title><description>Commit messages recorded in version control systems contain valuable information for software development, maintenance, and comprehension. Unfortunately, developers often commit code with empty or poor quality commit messages. To address this issue, several studies have proposed approaches to generate commit messages from commit diffs . Recent studies make use of neural machine translation algorithms to try and translate git diffs into commit messages and have achieved some promising results. However, these learning-based methods tend to generate high-frequency words but ignore low-frequency ones. In addition, they suffer from exposure bias issues, which leads to a gap between training phase and testing phase. In this article, we propose CoRec to address the above two limitations. Specifically, we first train a context-aware encoder-decoder model that randomly selects the previous output of the decoder or the embedding vector of a ground truth word as context to make the model gradually aware of previous alignment choices. Given a diff for testing, the trained model is reused to retrieve the most similar diff from the training set. Finally, we use the retrieval diff to guide the probability distribution for the final generated vocabulary. Our method combines the advantages of both information retrieval and neural machine translation. We evaluate CoRec on a dataset from Liu et al. and a large-scale dataset crawled from 10K popular Java repositories in Github. 
Our experimental results show that CoRec significantly outperforms the state-of-the-art method NNGen by 19% on average in terms of BLEU.</description><issn>1049-331X</issn><issn>1557-7392</issn><fulltext>true</fulltext><rsrctype>article</rsrctype><creationdate>2021</creationdate><recordtype>article</recordtype><recordid>eNotj8tKAzEUQIMoWKv4C7NzFc3NqxNwI6NWoaUgCu6Gm8mNjHRmShJ8_L2KXZ2zOnAYOwdxCaDNldJW29odsBkYs-AL5eThrwvtuFLwesxOcn4XApSQesaum2ks9FU4fmKi6olK6ukDt9xjplDdEu2qZhqGvlRryhnfqFrSSAlLP42n7CjiNtPZnnP2cn_33Dzw1Wb52NyseCdNXbgWykDEaIJ1ADFo1YmgjaOapCXr69BF8E5AVCTQe0PGSxlAOhklhqDm7OK_26Up50Sx3aV-wPTdgmj_ptv9tPoBzLFJIw</recordid><startdate>20210701</startdate><enddate>20210701</enddate><creator>Wang, Haoye</creator><creator>Xia, Xin</creator><creator>Lo, David</creator><creator>He, Qiang</creator><creator>Wang, Xinyu</creator><creator>Grundy, John</creator><scope>AAYXX</scope><scope>CITATION</scope></search><sort><creationdate>20210701</creationdate><title>Context-aware Retrieval-based Deep Commit Message Generation</title><author>Wang, Haoye ; Xia, Xin ; Lo, David ; He, Qiang ; Wang, Xinyu ; Grundy, John</author></sort><facets><frbrtype>5</frbrtype><frbrgroupid>cdi_FETCH-LOGICAL-c258t-40351faf5d6911fd43c0d459e8e26e6b8dcf1b901f3e0abb5e5b22d1292f2add3</frbrgroupid><rsrctype>articles</rsrctype><prefilter>articles</prefilter><language>eng</language><creationdate>2021</creationdate><toplevel>peer_reviewed</toplevel><toplevel>online_resources</toplevel><creatorcontrib>Wang, Haoye</creatorcontrib><creatorcontrib>Xia, Xin</creatorcontrib><creatorcontrib>Lo, David</creatorcontrib><creatorcontrib>He, Qiang</creatorcontrib><creatorcontrib>Wang, Xinyu</creatorcontrib><creatorcontrib>Grundy, John</creatorcontrib><collection>CrossRef</collection><jtitle>ACM transactions on software engineering and methodology</jtitle></facets><delivery><delcategory>Remote Search Resource</delcategory><fulltext>fulltext</fulltext></delivery><addata><au>Wang, Haoye</au><au>Xia, Xin</au><au>Lo, 
David</au><au>He, Qiang</au><au>Wang, Xinyu</au><au>Grundy, John</au><format>journal</format><genre>article</genre><ristype>JOUR</ristype><atitle>Context-aware Retrieval-based Deep Commit Message Generation</atitle><jtitle>ACM transactions on software engineering and methodology</jtitle><date>2021-07-01</date><risdate>2021</risdate><volume>30</volume><issue>4</issue><spage>1</spage><epage>30</epage><pages>1-30</pages><issn>1049-331X</issn><eissn>1557-7392</eissn><abstract>Commit messages recorded in version control systems contain valuable information for software development, maintenance, and comprehension. Unfortunately, developers often commit code with empty or poor quality commit messages. To address this issue, several studies have proposed approaches to generate commit messages from commit diffs . Recent studies make use of neural machine translation algorithms to try and translate git diffs into commit messages and have achieved some promising results. However, these learning-based methods tend to generate high-frequency words but ignore low-frequency ones. In addition, they suffer from exposure bias issues, which leads to a gap between training phase and testing phase. In this article, we propose CoRec to address the above two limitations. Specifically, we first train a context-aware encoder-decoder model that randomly selects the previous output of the decoder or the embedding vector of a ground truth word as context to make the model gradually aware of previous alignment choices. Given a diff for testing, the trained model is reused to retrieve the most similar diff from the training set. Finally, we use the retrieval diff to guide the probability distribution for the final generated vocabulary. Our method combines the advantages of both information retrieval and neural machine translation. We evaluate CoRec on a dataset from Liu et al. and a large-scale dataset crawled from 10K popular Java repositories in Github. 
Our experimental results show that CoRec significantly outperforms the state-of-the-art method NNGen by 19% on average in terms of BLEU.</abstract><doi>10.1145/3464689</doi><tpages>30</tpages><oa>free_for_read</oa></addata></record>
fulltext	fulltext
identifier	ISSN: 1049-331X
ispartof	ACM transactions on software engineering and methodology, 2021-07, Vol.30 (4), p.1-30
issn	1049-331X 1557-7392
language	eng
recordid	cdi_crossref_primary_10_1145_3464689
source	ACM Digital Library Complete
title	Context-aware Retrieval-based Deep Commit Message Generation
url	https://sfx.bib-bvb.de/sfx_tum?ctx_ver=Z39.88-2004&ctx_enc=info:ofi/enc:UTF-8&ctx_tim=2025-01-29T04%3A08%3A38IST&url_ver=Z39.88-2004&url_ctx_fmt=infofi/fmt:kev:mtx:ctx&rfr_id=info:sid/primo.exlibrisgroup.com:primo3-Article-crossref&rft_val_fmt=info:ofi/fmt:kev:mtx:journal&rft.genre=article&rft.atitle=Context-aware%20Retrieval-based%20Deep%20Commit%20Message%20Generation&rft.jtitle=ACM%20transactions%20on%20software%20engineering%20and%20methodology&rft.au=Wang,%20Haoye&rft.date=2021-07-01&rft.volume=30&rft.issue=4&rft.spage=1&rft.epage=30&rft.pages=1-30&rft.issn=1049-331X&rft.eissn=1557-7392&rft_id=info:doi/10.1145/3464689&rft_dat=%3Ccrossref%3E10_1145_3464689%3C/crossref%3E%3Curl%3E%3C/url%3E&disable_directlink=true&sfx.directlink=off&sfx.report_link=0&rft_id=info:oai/&rft_id=info:pmid/&rfr_iscdi=true