Predicting Duplicate in Bug Report Using Topic-Based Duplicate Learning With Fine Tuning-Based BERT Algorithm

Bibliographic details
Published in: IEEE Access 2022, Vol. 10, p. 129666-129675
Main authors: Kim, Taemin; Yang, Geunseok
Format: Article
Language: English
Description
Abstract: As software usage and coverage grow, more functional improvements are requested and more bugs occur. The Eclipse and Mozilla open-source projects receive around 300 or more bug reports per day. Usually, when a user finds a bug, they write a bug report. The developer assigned to the bug reads the report and, if the bug has already been fixed, marks it as a duplicate bug report. However, when duplicate bug reports are submitted, the developer must identify the matching bug manually, which takes considerable effort. If redundancy in bug reports can be identified automatically, this unnecessary effort can be reduced. To address the problem, this paper predicts redundancy using the BERT (Bidirectional Encoder Representations from Transformers) algorithm and topic-based duplicate/non-duplicate feature extraction. First, bug reports are extracted from the bug repository by bug status, and a topic model is constructed for each status by applying topic modeling to its reports. Within each topic, feature selection is performed using the non-duplicate and duplicate statuses. The extracted features are then given as input to the BERT algorithm, which learns from them and predicts duplicate bug reports. The proposed model was evaluated with Precision, Recall, F-measure, and Accuracy on the Eclipse, Mozilla, Apache, and KDE open-source projects. It shows about 87.67%, 89.85%, 87.03%, and 88.95% performance on Eclipse, Mozilla, Apache, and KDE, respectively. In addition, a comparison with baselines (Naïve Bayes, Random Forest, Convolutional Neural Network (CNN), Long Short-Term Memory (LSTM), and the combined CNN-LSTM) shows improvements of about 36.33%, 44.46%, 47.77%, and 45.17% on Eclipse, Mozilla, Apache, and KDE, respectively, indicating that the proposed model is better at detecting duplicates than the baselines.
ISSN:2169-3536
DOI:10.1109/ACCESS.2022.3226238
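
The abstract names four evaluation measures (Precision, Recall, F-measure, Accuracy) without giving their definitions. As a minimal sketch, assuming the standard binary-classification definitions with label 1 meaning "duplicate" and the F-measure taken as the equally weighted harmonic mean (F1), they could be computed as follows. The function name `binary_metrics` and the sample labels are hypothetical and are not taken from the paper.

```python
def binary_metrics(y_true, y_pred):
    """Return precision, recall, F-measure (F1) and accuracy for binary labels.

    Assumption: 1 marks a duplicate bug report, 0 a non-duplicate one.
    """
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # duplicates found
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # false alarms
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # duplicates missed
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)  # correct rejections

    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f_measure = (
        2 * precision * recall / (precision + recall)
        if (precision + recall)
        else 0.0
    )
    accuracy = (tp + tn) / len(pairs) if pairs else 0.0
    return {
        "precision": precision,
        "recall": recall,
        "f_measure": f_measure,
        "accuracy": accuracy,
    }


# Hypothetical labels for eight bug reports (illustration only).
y_true = [1, 1, 1, 0, 0, 0, 1, 0]
y_pred = [1, 1, 0, 0, 0, 1, 1, 0]
metrics = binary_metrics(y_true, y_pred)
```

With these sample labels there are three true positives, one false positive, one false negative, and three true negatives, so all four measures come out to 0.75.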