Can Large Language Models Write Good Property-Based Tests?

Property-based testing (PBT), while an established technique in the software testing research community, is still relatively underused in real-world software. Pain points in writing property-based tests include implementing diverse random input generators and thinking of meaningful properties to test. Developers, however, are more amenable to writing documentation; plenty of library API documentation is available and can be used as natural language specifications for PBTs. As large language models (LLMs) have recently shown promise in a variety of coding tasks, we investigate using modern LLMs to automatically synthesize PBTs using two prompting techniques. A key challenge is to rigorously evaluate the LLM-synthesized PBTs. We propose a methodology to do so considering several properties of the generated tests: (1) validity, (2) soundness, and (3) property coverage, a novel metric that measures the ability of the PBT to detect property violations through generation of property mutants. In our evaluation on 40 Python library API methods across three models (GPT-4, Gemini-1.5-Pro, Claude-3-Opus), we find that with the best model and prompting approach, a valid and sound PBT can be synthesized in 2.4 samples on average. We additionally find that our metric for determining soundness of a PBT is aligned with human judgment of property assertions, achieving a precision of 100% and recall of 97%. Finally, we evaluate the property coverage of LLMs across all API methods and find that the best model (GPT-4) is able to automatically synthesize correct PBTs for 21% of properties extractable from API documentation.
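The record itself contains no code; as a purely illustrative sketch of what the abstract calls a property-based test (a hypothetical example using the Hypothesis library for Python, not taken from the paper), a PBT pairs a random input generator with assertions that must hold for every generated input:

    # Hypothetical property-based test for Python's built-in sorted(), written
    # with the Hypothesis library; illustrative only, not taken from the paper.
    from collections import Counter
    from hypothesis import given, strategies as st

    @given(st.lists(st.integers()))  # random input generator
    def test_sorted_is_ordered_and_a_permutation(xs):
        result = sorted(xs)
        # Property 1: the output is in non-decreasing order.
        assert all(a <= b for a, b in zip(result, result[1:]))
        # Property 2: the output is a permutation of the input (same multiset).
        assert Counter(result) == Counter(xs)

Run under pytest, Hypothesis draws many random integer lists and reports any input for which an assertion fails; the abstract's pain points correspond to writing the generator (the @given strategy) and choosing meaningful properties (the assertions).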

Bibliographic Details
Main Authors: Vikram, Vasudev; Lemieux, Caroline; Sunshine, Joshua; Padhye, Rohan
Format: Article
Language: English
Subjects: Computer Science - Software Engineering
Published: 2023-07-10
DOI: 10.48550/arxiv.2307.04346
Source: arXiv.org
Rights: http://creativecommons.org/licenses/by/4.0
Online Access: https://arxiv.org/abs/2307.04346
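As a second, equally hypothetical illustration of the abstract's property-coverage metric (which measures whether a PBT can detect property violations; the sketch below is not the paper's actual mutant-generation procedure): an implementation that violates the permutation property would be caught by the second assertion in the test sketched above.

    # Hypothetical buggy variant of sorted() that drops duplicates and therefore
    # violates the permutation property checked in the sketch above.
    from collections import Counter

    def buggy_sorted(xs):
        return sorted(set(xs))  # loses duplicate elements

    # On an input with duplicates, the test's permutation assertion would fail:
    assert Counter(buggy_sorted([1, 1])) != Counter([1, 1])

In the abstract's terms, a PBT whose assertions fail on such violating variants covers that property.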