PyBugHive: A Comprehensive Database of Manually Validated, Reproducible Python Bugs


Bibliographic Details
Published in:	IEEE Access 2024, Vol. 12, p. 123739-123756
Main authors:	Antal, Gabor; Vandor, Norbert; Kollath, Istvan; Mosolygo, Balazs; Hegedus, Peter; Ferenc, Rudolf
Format:	Article
Language:	English
Subjects:	benchmark; Benchmark testing; Bug database; bug dataset; Codes; Computer bugs; Datasets; Java; Large language models; Line interfaces; manually curated bugs; Python; real bugs; reproducibility; Reproducibility of results; Software development management; Source coding
Online access:	Full text

Description
Summary:	Python is currently the number one language in the TIOBE index and has been the second most popular language on GitHub for years. But so far, there are only a few bug databases that contain bugs for Python projects and even fewer in which bugs can be reproduced. In this paper, we present a manually curated database of reproducible Python bugs called PyBugHive. The initial version of PyBugHive is a benchmark of 149 real, manually validated bugs from 11 Python projects. Each entry in our database contains the summary of the bug report, the corresponding patch, and the test cases that expose the given bug. PyBugHive features a rich command line interface for accessing both the buggy and fixed versions of the programs and provides the abstraction for executing the corresponding test cases. The interface facilitates highly reproducible empirical research and tool comparisons in fields such as testing, automated program repair, or bug prediction. The usage of our database is demonstrated through a use case involving a large language model, GPT-3.5. First, we evaluated the bug detection capabilities of the model with the help of the bug repository. Using multiple prompts, we found out that GPT-3.5 was able to detect 67 out of 149 bugs (45%). Furthermore, we leveraged the constructed bug dataset in assessing the automatic program repair capabilities of GPT-3.5 by comparing the generated fixes with the real patches contained in the dataset. However, its performance was far worse in this task compared to bug detection, as it was able to fix only one of the detected issues.
ISSN:	2169-3536
DOI:	10.1109/ACCESS.2024.3449106
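
As an illustration of the entry structure described in the summary (bug report summary, patch, and bug-exposing test cases), the following is a minimal Python sketch. The class name `BugEntry`, its field names, the sample values, and the helper `detection_rate` are all hypothetical and chosen for illustration only; they are not PyBugHive's actual schema or interface. The last lines recompute the reported detection rate of 67 out of 149 bugs.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BugEntry:
    """Hypothetical record shape; not PyBugHive's real schema."""

    project: str                      # name of the Python project
    report_summary: str               # summary of the bug report
    patch: str                        # the fix, e.g. as a unified diff
    failing_tests: tuple[str, ...] = ()  # tests that expose the bug


def detection_rate(detected: int, total: int) -> float:
    """Return the fraction of bugs detected, validating the inputs."""
    if total <= 0:
        raise ValueError("total must be positive")
    if not 0 <= detected <= total:
        raise ValueError("detected must lie between 0 and total")
    return detected / total


# Illustrative values only; no real dataset entry is reproduced here.
entry = BugEntry(
    project="example-project",
    report_summary="Example summary of a bug report",
    patch="example diff text",
    failing_tests=("tests/test_example.py::test_case",),
)

# Figures taken from the summary: 67 of 149 bugs detected.
rate = detection_rate(67, 149)
print(f"{rate:.0%}")  # prints 45%
```

Keeping the record immutable (`frozen=True`) reflects the idea of a fixed, reproducible benchmark entry; this is a design assumption of the sketch, not a statement about the database itself.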