Self-Admitted Technical Debt in Commit Messages: Comparing Java, Python, and R
Main authors: | , |
---|---|
Format: | Dataset |
Language: | eng |
Summary: | The folders and the datasets within each are organized as follows:
Collection Folder: Contains the original dataset that we scraped. We removed user names and email addresses to protect users’ privacy.
RQ1 Folder has three subfolders:
❖ Manual Training: Contains the manually labeled data we used to initially train the classifiers. Columns A-O of this dataset are all extracted from GitHub’s API; column O (heading “message”) is the commit message itself. Columns P and Q (headings “author_a” and “author_b”) hold the final classifications, on which Cohen’s kappa was calculated. Column R (heading “notes”) contains comments on specific cases that may be meaningful.
❖ Predicted: Contains the results of the automatic classifiers for both the 1st and 2nd rounds. The additional columns are generated by the classifiers.
❖ Verifications: Contains the manually labeled data that we used in the 1st and 2nd verification rounds. This is a simplified dataset with the commit’s sha and the parsed message. The authors classified columns E and F independently. The labels stated here are those that the authors agreed on, without having access to column D. Column D was added afterward by another author, through sha-matching, to calculate Cohen’s kappa. The yellow rows are those with disagreements.
RQ2_RQ3 Folder contains the manually labeled dataset for RQ2 and RQ3 (SATD Types and Activities).
NOTE: Many messages and classifications span multiple lines, so cells must be expanded to read all of the text they contain. |
---|---|
DOI: | 10.5281/zenodo.12761214 |
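
The summary states that Cohen's kappa was calculated on the two final-classification columns (“author_a” and “author_b”). As a rough illustration of what that agreement measure computes, here is a minimal sketch in plain Python. The function name `cohen_kappa` and the label values below are assumptions made for this example; they are not taken from the dataset, and the authors' actual tooling is not described in the record.

```python
from collections import Counter


def cohen_kappa(labels_a, labels_b):
    """Cohen's kappa for two raters who labeled the same items."""
    if len(labels_a) != len(labels_b):
        raise ValueError("both raters must label the same items")
    n = len(labels_a)
    if n == 0:
        raise ValueError("no items to compare")

    # Observed agreement: share of items on which both raters chose the same label.
    p_o = sum(a == b for a, b in zip(labels_a, labels_b)) / n

    # Chance agreement, from each rater's marginal label frequencies.
    count_a, count_b = Counter(labels_a), Counter(labels_b)
    p_e = sum(count_a[k] * count_b[k] for k in count_a) / (n * n)

    if p_e == 1.0:
        # Both raters used one identical label throughout; treat as full agreement.
        return 1.0
    return (p_o - p_e) / (1 - p_e)


# Hypothetical labels for six commit messages (illustrative only).
author_a = ["SATD", "SATD", "not-SATD", "not-SATD", "SATD", "not-SATD"]
author_b = ["SATD", "not-SATD", "not-SATD", "not-SATD", "SATD", "not-SATD"]

kappa = cohen_kappa(author_a, author_b)
print(f"kappa = {kappa:.3f}")
```

To apply this to the real files, the two lists would be read from the “author_a” and “author_b” columns of the Manual Training data, using a reader that handles multiline cells.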