Self-Admitted Technical Debt in Commit Messages: Comparing Java, Python, and R


Bibliographic details
Main authors: Fard, Fatemeh; Codabux, Zadia
Format: Dataset
Language: English
Description
Summary: The folders and the datasets within each are organized as follows.

Collection folder: contains the original scraped dataset. User names and email addresses have been removed to protect users' privacy.

RQ1 folder: has three subfolders:
❖ Manual Training: contains the manually labeled data used to initially train the classifiers. Columns A-O in this dataset are all extracted from GitHub's API; column O (heading "message") is the commit message itself. Columns P and Q (headings "author_a" and "author_b") hold the final classifications, on which Cohen's kappa was calculated. Column R (heading "notes") contains comments on specific cases that may be meaningful.
❖ Predicted: contains the results of the automatic classifiers (both 1st and 2nd rounds). The additional columns are generated by the classifiers.
❖ Verifications: contains the manually labeled data used in the 1st and 2nd verification rounds. This is a simplified dataset with the commit's sha and the parsed message. The authors classified columns E and F independently and individually. The labels stated here are those that the authors agreed on (without having access to column D). Column D was added afterward by another author, by sha-matching, to calculate Cohen's kappa. Rows highlighted in yellow are those with disagreements.

RQ2_RQ3 folder: contains the manually labeled dataset for RQ2 and RQ3 (SATD Types and Activities).

NOTE: Many messages and classifications span multiple lines, so cells must be expanded to read all of the text they contain.
DOI: 10.5281/zenodo.12761214
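
The description states that Cohen's kappa was calculated from two independently assigned label columns (for example "author_a" and "author_b") and that disagreeing rows are highlighted. The record does not say which tool the authors used or what the label values look like, so the following is only a minimal sketch of the standard kappa formula, with hypothetical labels ("SATD" / "not-SATD") and hypothetical function names.

```python
from collections import Counter


def cohen_kappa(labels_a, labels_b):
    """Cohen's kappa for two equally long sequences of categorical labels."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two non-empty label lists of equal length")
    n = len(labels_a)
    # Observed agreement: share of rows where both raters chose the same label.
    p_observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement: sum over labels of the product of marginal frequencies.
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    p_expected = sum(freq_a[k] * freq_b[k] for k in freq_a) / (n * n)
    if p_expected == 1.0:
        return 1.0  # both raters used one identical label throughout
    return (p_observed - p_expected) / (1 - p_expected)


def disagreements(labels_a, labels_b):
    """Zero-based row indices where the two raters differ."""
    return [i for i, (a, b) in enumerate(zip(labels_a, labels_b)) if a != b]


# Hypothetical labels standing in for the "author_a" / "author_b" columns;
# the real label values are not given in this record.
author_a = ["SATD", "SATD", "not-SATD", "not-SATD", "SATD", "not-SATD"]
author_b = ["SATD", "not-SATD", "not-SATD", "not-SATD", "SATD", "not-SATD"]

kappa = cohen_kappa(author_a, author_b)
differing_rows = disagreements(author_a, author_b)
print(round(kappa, 3), differing_rows)
```

With real data, the two lists would be read from the corresponding spreadsheet columns, and the indices returned by `disagreements` would correspond to the rows marked as disagreements.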