Context-aware Retrieval-based Deep Commit Message Generation
Published in: | ACM transactions on software engineering and methodology 2021-07, Vol.30 (4), p.1-30 |
---|---|
Main authors: | , , , , , |
Format: | Article |
Language: | eng |
Online access: | Full text |
Abstract: | Commit messages recorded in version control systems contain valuable information for software development, maintenance, and comprehension. Unfortunately, developers often commit code with empty or poor-quality commit messages. To address this issue, several studies have proposed approaches to generate commit messages from commit diffs. Recent studies use neural machine translation algorithms to translate git diffs into commit messages and have achieved some promising results. However, these learning-based methods tend to generate high-frequency words but ignore low-frequency ones. In addition, they suffer from exposure bias, which leads to a gap between the training phase and the testing phase. In this article, we propose CoRec to address the above two limitations. Specifically, we first train a context-aware encoder-decoder model that randomly selects either the previous output of the decoder or the embedding vector of a ground-truth word as context, making the model gradually aware of previous alignment choices. Given a diff for testing, the trained model is reused to retrieve the most similar diff from the training set. Finally, we use the retrieved diff to guide the probability distribution over the vocabulary during final generation. Our method combines the advantages of both information retrieval and neural machine translation. We evaluate CoRec on a dataset from Liu et al. and a large-scale dataset crawled from 10K popular Java repositories on GitHub. Our experimental results show that CoRec significantly outperforms the state-of-the-art method NNGen by 19% on average in terms of BLEU. |
---|---|
ISSN: | 1049-331X 1557-7392 |
DOI: | 10.1145/3464689 |
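
The three steps named in the abstract (random context selection during training, retrieval of the most similar diff, and retrieval-guided generation) can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation. The abstract does not state the sampling schedule, the similarity measure, or how the two distributions are combined, so the probability `p_truth`, the use of cosine similarity, the linear mixing `weight`, and all function names below are assumptions made for the example.

```python
import math
import random


def choose_context(prev_output, truth_embedding, p_truth, rng=random):
    """Randomly pick the context fed to the next decoding step.

    With probability p_truth (an assumed parameter) the ground-truth word
    embedding is used; otherwise the decoder's own previous output is used.
    """
    return truth_embedding if rng.random() < p_truth else prev_output


def cosine(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def retrieve_most_similar(query_vec, train_vecs):
    """Return (index, score) of the training diff vector closest to query_vec.

    Cosine similarity over encoder vectors is an assumption of this sketch.
    """
    scores = [cosine(query_vec, v) for v in train_vecs]
    best = max(range(len(scores)), key=scores.__getitem__)
    return best, scores[best]


def blend(p_input, p_retrieved, weight):
    """Mix two next-word distributions over the same vocabulary.

    weight is the share given to the retrieved diff's distribution; a linear
    mix is an assumption, chosen only to show the idea of "guiding".
    """
    mixed = [(1 - weight) * a + weight * b for a, b in zip(p_input, p_retrieved)]
    total = sum(mixed)
    return [m / total for m in mixed]


# Toy encoder vectors for three training diffs and one test diff.
train_vecs = [[1.0, 0.0], [0.6, 0.8], [0.0, 1.0]]
best_index, best_score = retrieve_most_similar([0.5, 0.9], train_vecs)

# Toy next-word distributions over a three-word vocabulary.
guided = blend([0.7, 0.2, 0.1], [0.1, 0.8, 0.1], weight=0.5)
```

In this toy setup the retrieved diff shifts probability mass toward the second vocabulary entry, which the input diff alone ranked low. That is the kind of effect the abstract attributes to combining retrieval with neural generation.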