Can Large Language Models Write Good Property-Based Tests?
Saved in:
Main authors: | , , , |
---|---|
Format: | Article |
Language: | English |
Subjects: | |
Online access: | Order full text |
Summary: | Property-based testing (PBT), while an established technique in the software
testing research community, is still relatively underused in real-world
software. Pain points in writing property-based tests include implementing
diverse random input generators and thinking of meaningful properties to test.
Developers, however, are more amenable to writing documentation; plenty of
library API documentation is available and can be used as natural language
specifications for PBTs. As large language models (LLMs) have recently shown
promise in a variety of coding tasks, we investigate using modern LLMs to
automatically synthesize PBTs using two prompting techniques. A key challenge
is to rigorously evaluate the LLM-synthesized PBTs. We propose a methodology to
do so considering several properties of the generated tests: (1) validity, (2)
soundness, and (3) property coverage, a novel metric that measures the ability
of the PBT to detect property violations through generation of property
mutants. In our evaluation on 40 Python library API methods across three models
(GPT-4, Gemini-1.5-Pro, Claude-3-Opus), we find that with the best model and
prompting approach, a valid and sound PBT can be synthesized in 2.4 samples on
average. We additionally find that our metric for determining soundness of a
PBT is aligned with human judgment of property assertions, achieving a
precision of 100% and recall of 97%. Finally, we evaluate the property coverage
of LLMs across all API methods and find that the best model (GPT-4) is able to
automatically synthesize correct PBTs for 21% of properties extractable from
API documentation. |
---|---|
DOI: | 10.48550/arxiv.2307.04346 |
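To make concrete the two pain points the abstract names — writing a random input generator and stating a meaningful property — the following is a minimal, standard-library sketch of a property-based test (libraries such as Hypothesis automate the generator part; this example, including the `gen_int_list` and `run_pbt` names, is illustrative and not taken from the paper):

```python
import random
from collections import Counter

def gen_int_list(rng, max_len=20):
    # Random input generator: lists of ints of varying length.
    # Writing diverse generators like this is one pain point the paper cites.
    return [rng.randint(-100, 100) for _ in range(rng.randint(0, max_len))]

def check_sorted_properties(xs):
    # Properties derivable from the documented contract of sorted():
    # (1) the result is in nondecreasing order;
    # (2) the result is a permutation of the input.
    ys = sorted(xs)
    assert all(a <= b for a, b in zip(ys, ys[1:]))
    assert Counter(ys) == Counter(xs)

def run_pbt(trials=200, seed=0):
    # Drive the property check over many randomly generated inputs.
    rng = random.Random(seed)
    for _ in range(trials):
        check_sorted_properties(gen_int_list(rng))
    return trials

print(run_pbt())  # prints 200 if every trial passes
```

A sound PBT in the paper's sense would be one whose assertions never fail on a correct implementation; property coverage asks whether such assertions would catch a mutated (violating) implementation, e.g. one that drops elements while sorting.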