Improve cross-project just-in-time defect prediction with dynamic transfer learning


Bibliographic details
Published in: The Journal of systems and software 2025-01, Vol.219, p.112214, Article 112214
Main authors: Dai, Hongming; Xi, Jianqing; Dai, Hong-Liang
Format: Article
Language: English
Description
Summary:
•Introduction of the kernel variance matching method to address variations in marginal probability distributions between the source and target projects
•Utilization of the CatBoost algorithm for just-in-time software defect prediction model construction
•Introduction of the improved CORAL method for formulating the model's loss function
•Introduction of the KCC method to handle various marginal and conditional probability distributions of features for cross-project just-in-time defect prediction

Cross-project just-in-time software defect prediction (CP-JIT-SDP) is a prominent research topic in the field of software engineering. This approach is characterized by its immediacy, accuracy, real-time feedback, and traceability, enabling it to effectively address the challenges of defect prediction in new projects or projects with limited training data. However, CP-JIT-SDP faces significant challenges due to the differences in the feature distribution between the source and target projects. To address this issue, researchers have proposed methods for adjusting marginal or conditional probability distributions. This study introduces a transfer-learning approach that integrates dynamic distribution adaptation. The kernel variance matching (KVM) method is proposed to adjust the disparity in the marginal probability distribution by recalculating the variance of the source and target projects within the reproducing kernel Hilbert space (RKHS) to minimize the variance disparity. The categorical boosting (CatBoost) algorithm is used to construct models, while the improved CORrelation ALignment (CORAL) method is applied to develop the loss function to address the difference in the conditional probability distribution. This method is abbreviated as KCC, where K stands for KVM, the first C for CatBoost, and the second C for the improved CORAL.
The KCC method aims to optimize the joint probability distribution of the source project so that it closely agrees with that of the target project through iterative and dynamic integration. Six well-known open-source projects were used to evaluate the effectiveness of the proposed method. The empirical findings indicate that the KCC method exhibited significant improvements over the baseline methods. In particular, the KCC method demonstrated an average increase of 18% in the geometric mean (G-mean), 105.4% in the Matthews correlation coefficient (MCC), 25.6% in the F1-score, and 16.9% in the area under the r
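
The summary reports gains in G-mean, MCC, and F1-score. As a rough illustration of how these three metrics are commonly computed from a binary confusion matrix, here is a minimal sketch. It assumes label 1 marks a defect-inducing change and uses the usual definition of G-mean as the geometric mean of recall and specificity; the record does not state the paper's exact formulas, and the function names and sample labels below are hypothetical.

```python
import math


def confusion_counts(y_true, y_pred):
    """Count TP, TN, FP, FN for binary labels (1 = defective, 0 = clean)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    return tp, tn, fp, fn


def g_mean(tp, tn, fp, fn):
    """Geometric mean of recall and specificity (assumed definition)."""
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    specificity = tn / (tn + fp) if (tn + fp) else 0.0
    return math.sqrt(recall * specificity)


def mcc(tp, tn, fp, fn):
    """Matthews correlation coefficient; 0.0 when the denominator vanishes."""
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return (tp * tn - fp * fn) / denom if denom else 0.0


def f1_score(tp, tn, fp, fn):
    """Harmonic mean of precision and recall; tn is unused but kept for a uniform signature."""
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


# Hypothetical labels for eight changes, for illustration only.
y_true = [1, 1, 1, 0, 0, 0, 0, 0]
y_pred = [1, 1, 0, 0, 0, 0, 1, 0]
counts = confusion_counts(y_true, y_pred)
print(round(g_mean(*counts), 4), round(mcc(*counts), 4), round(f1_score(*counts), 4))
```

G-mean and MCC are often preferred over plain accuracy in defect prediction because defect-inducing changes are usually a small minority of all changes, and both metrics penalize a model that ignores the minority class.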
ISSN:0164-1212
DOI:10.1016/j.jss.2024.112214