Empirical evaluation of tools for hairy requirements engineering tasks



Bibliographic details
Published in: Empirical Software Engineering: An International Journal, 2021-11, Vol. 26 (6), Article 111
Author: Berry, Daniel M.
Format: Article
Language: English
Online access: Full text
Description
Abstract:

Context: A hairy requirements engineering (RE) task involving natural language (NL) documents is (1) a non-algorithmic task to find all relevant answers in a set of documents, which is (2) not inherently difficult for NL-understanding humans on a small scale, but is (3) unmanageable at a large scale. In performing a hairy RE task, humans need more help finding all the relevant answers than they do in recognizing that an answer is irrelevant. Therefore, a hairy RE task demands the assistance of a tool that focuses more on achieving high recall, i.e., finding all relevant answers, than on achieving high precision, i.e., finding only relevant answers. Recall as close to 100% as possible is needed, particularly when the task is applied to the development of a high-dependability system. In this case, a hairy-RE-task tool that falls short of close to 100% recall may even be useless, because to find the missing information, a human has to do the entire task manually anyway. On the other hand, too much imprecision, i.e., too many irrelevant answers in the tool’s output, means that manually vetting the tool’s output to eliminate the irrelevant answers may be too burdensome. The reality is that all that can realistically be expected and validated is that the recall of a hairy-RE-task tool is higher than the recall of a human doing the task manually.

Objective: Therefore, the evaluation of any hairy-RE-task tool must consider the context in which the tool is used, and it must compare the performance of a human applying the tool to do the task with the performance of a human doing the task entirely manually, in the same context. The context of a hairy-RE-task tool includes the characteristics of the documents being subjected to the task and the purposes of subjecting the documents to the task. Traditionally, however, many a hairy-RE-task tool has been evaluated by considering only (1) how high its precision is, or (2) how high its F-measure is, which weights recall and precision equally, ignoring the context and possibly leading to incorrect, often underestimated, conclusions about how effective the tool is.

Method: To evaluate a hairy-RE-task tool, this article offers an empirical procedure that takes into account not only (1) the performance of the tool, but also (2) the context in which the task is being done, (3) the performance of humans doing the task manually, and (4) the performance of those vetting the tool’s output. The empirical procedure uses (I) on one hand, (1) the re
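To make the abstract's distinction concrete, here is a minimal sketch (not from the article; the answer sets and numbers are invented for illustration) of how recall, precision, and the F-measure are computed, and how a general F-beta measure can weight recall more heavily than precision, as the abstract argues is appropriate for a hairy-RE-task tool:

```python
# Illustrative sketch: recall, precision, and F-measures for a hypothetical
# hairy-RE-task tool. The answer sets below are invented, not from the article.

def recall(found, relevant):
    """Fraction of all truly relevant answers that the tool found."""
    return len(found & relevant) / len(relevant)

def precision(found, relevant):
    """Fraction of the tool's returned answers that are actually relevant."""
    return len(found & relevant) / len(found)

def f_beta(p, r, beta=1.0):
    """General F-measure: beta = 1 weights recall and precision equally (F1);
    beta > 1 weights recall more, reflecting a recall-oriented evaluation."""
    return (1 + beta**2) * p * r / (beta**2 * p + r)

# Hypothetical output: 95 of 100 relevant answers found, plus 40 irrelevant ones.
relevant = set(range(100))
found = set(range(95)) | set(range(100, 140))

p = precision(found, relevant)   # 95/135, about 0.704
r = recall(found, relevant)      # 95/100 = 0.95
print(f"precision={p:.3f} recall={r:.3f}")
print(f"F1={f_beta(p, r):.3f}  F2={f_beta(p, r, beta=2):.3f}")
```

Here the F2 score exceeds F1 because the tool's recall is higher than its precision; an evaluation using only F1 would, as the abstract notes, understate the value of a high-recall, modest-precision tool.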
ISSN: 1382-3256, 1573-7616
DOI: 10.1007/s10664-021-09986-0