Quantifying confidence shifts in a BERT-based question answering system evaluated on perturbed instances


Bibliographic details
Published in:	PloS one 2023-12, Vol.18 (12), p.e0295925-e0295925
Main authors:	Shen, Ke; Kejriwal, Mayank
Format:	Article
Language:	English
Subjects:	Analysis; Benchmarks; Biology and Life Sciences; Computational linguistics; Computer and Information Sciences; Decision making; Engineering and Technology; Human impact; Human influences; Humans; Information management; Information processing; Information Storage and Retrieval; Language; Language processing; Linguistics; Multiple choice; Natural language; Natural language interfaces; Natural Language Processing; Neural networks; Neural Networks, Computer; Physical Sciences; Question-answering systems; Questions; Semantics; Social Sciences; Transformers

Description
Summary:	Recent work on transformer-based neural networks has led to impressive advances on multiple-choice natural language processing (NLP) problems, such as Question Answering (QA) and abductive reasoning. Despite these advances, there is still limited work on systematically evaluating such models in ambiguous situations where, for example, no correct answer exists for a given prompt among the provided set of choices. Such ambiguous situations are not infrequent in real-world applications. We design and conduct an experimental study of this phenomenon using three probes that aim to 'confuse' the model by perturbing QA instances in a consistent and well-defined manner. Using a detailed set of results from an established transformer-based multiple-choice QA system on two established benchmark datasets, we show that the model's confidence in its results is very different from that of an expected model that is 'agnostic' to all choices that are incorrect. Our results suggest that high performance on idealized QA instances should not be used to infer or extrapolate similarly high performance on more ambiguous instances. Auxiliary results suggest that the model may not be able to distinguish between these two situations with sufficient certainty. Stronger testing protocols and benchmarking may hence be necessary before such models are deployed in front-facing systems or in ambiguous decision making with significant human impact.
ISSN:	1932-6203
DOI:	10.1371/journal.pone.0295925
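
The summary contrasts the model's confidence on perturbed instances with that of an expected model that is 'agnostic' to all incorrect choices. As a rough illustration only, the sketch below assumes one possible reading of that idea: a probe that removes the correct choice, and an agnostic baseline that spreads confidence uniformly (1/k) over the k remaining choices. The function names, the probe, the gap measure, and the logits are all hypothetical and are not taken from the article.

```python
import math


def softmax(logits):
    """Turn raw scores into confidences that sum to 1."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def remove_correct_choice(choices, correct_index):
    """Hypothetical probe: drop the correct answer so only incorrect choices remain."""
    return [c for i, c in enumerate(choices) if i != correct_index]


def agnostic_gap(confidences):
    """Top confidence minus the uniform level 1/k (assumed agnostic baseline)."""
    k = len(confidences)
    return max(confidences) - 1.0 / k


choices = ["A", "B", "C", "D"]
perturbed = remove_correct_choice(choices, correct_index=2)

# Made-up logits standing in for a QA model's scores on the perturbed instance.
made_up_logits = [2.0, 0.1, -0.3]
conf = softmax(made_up_logits)
gap = agnostic_gap(conf)
print(perturbed, [round(c, 3) for c in conf], round(gap, 3))
```

Under this assumed measure, a gap near zero would mean the model treats the remaining incorrect choices about equally, while a large gap would mean it still commits strongly to one of them.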