SiameseQAT: A Semantic Context-Based Duplicate Bug Report Detection Using Replicated Cluster Information

Bibliographic Details
Published in: IEEE Access, 2021, Vol. 9, pp. 44610-44630
Authors: Rocha, Thiago Marques; Carvalho, Andre Luiz Da Costa
Format: Article
Language: English
Online access: Full text
Abstract
In large-scale software development environments, defect reports are maintained through bug tracking systems (BTS) and analyzed by domain experts. Different users may create bug reports in a non-standard manner and may describe a particular problem with a particular set of words due to stylistic choices and writing patterns. Therefore, the same defect can be reported with very different descriptions, generating non-trivial duplicates. To avoid redundant work for the development team, an expert needs to examine every new report and try to label possible duplicates. However, this approach is neither trivial nor scalable and directly impacts bug-fixing time. Recent efforts to find duplicate bug reports tend to focus on deep neural approaches that consider hybrid representations of bug reports, using both structured and unstructured information. Unfortunately, these approaches ignore that a single bug can have multiple previously identified duplicates and, therefore, multiple textual descriptions, titles, and categorical information. In this work, we propose SiameseQAT, a duplicate bug report detection method that considers information on individual bugs as well as information extracted from bug clusters. SiameseQAT combines context and semantic learning on structured, unstructured, and corpus topic extraction-based features, with a novel loss function called Quintet Loss, which considers the centroid of duplicate clusters and their contextual information. We validated our approach on the well-known open-source software repositories Eclipse, NetBeans, and Open Office, comprising more than 500 thousand bug reports. We evaluated both the retrieval and classification of duplicates, reporting a mean Recall@25 of 85% for retrieval and an AUROC of 84% for classification, results significantly superior to those of previous works.
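
The abstract does not reproduce the Quintet Loss formulation, so as a purely illustrative aid the sketch below shows one way a centroid-based metric loss of this flavor could look: a triplet-style margin loss (in PyTorch) that pulls a bug-report embedding toward the centroid of its duplicate cluster and away from a non-duplicate. The function name centroid_margin_loss, the margin value, and the 128-dimensional toy embeddings are hypothetical stand-ins, not the authors' implementation.

import torch
import torch.nn.functional as F

def centroid_margin_loss(anchor, duplicate_cluster, negative, margin=1.0):
    # Illustrative sketch only: a triplet-style margin loss that treats the
    # centroid of a duplicate-bug cluster as the positive target.
    #   anchor:            (d,)   embedding of the incoming bug report
    #   duplicate_cluster: (k, d) embeddings of its previously identified duplicates
    #   negative:          (d,)   embedding of an unrelated bug report
    centroid = duplicate_cluster.mean(dim=0)       # cluster centroid ("virtual" bug)
    pos_dist = torch.norm(anchor - centroid, p=2)  # distance to the duplicate cluster
    neg_dist = torch.norm(anchor - negative, p=2)  # distance to the non-duplicate
    # Pull the anchor toward its cluster centroid and push it away from the
    # negative, up to the margin.
    return F.relu(pos_dist - neg_dist + margin)

# Toy usage with random 128-d embeddings standing in for a Siamese encoder's output.
anchor = torch.randn(128)
cluster = torch.randn(5, 128)
negative = torch.randn(128)
loss = centroid_margin_loss(anchor, cluster, negative)
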
ISSN: 2169-3536
DOI: 10.1109/ACCESS.2021.3066283