Detecting ChatGPT-generated essays in a large-scale writing assessment: Is there a bias against non-native English speakers?

Bibliographic Details
Published in: Computers and Education, 2024-08, Vol. 217, p. 105070, Article 105070
Authors: Jiang, Yang; Hao, Jiangang; Fauss, Michael; Li, Chen
Format: Article
Language: English
Online Access: Full text
Description
Abstract: With the prevalence of generative AI tools like ChatGPT, automated detectors of AI-generated texts have been increasingly used in education to detect the misuse of these tools (e.g., cheating in assessments). Recently, the responsible use of these detectors has attracted a lot of attention. Research has shown that publicly available detectors are more likely to misclassify essays written by non-native English speakers as AI-generated than those written by native English speakers. In this study, we address these concerns by leveraging carefully sampled large-scale data from the Graduate Record Examinations (GRE) writing assessment. We developed multiple detectors of ChatGPT-generated essays based on linguistic features from the ETS e-rater engine and text perplexity features, and investigated their performance and potential bias. Results showed that our carefully constructed detectors not only achieved near-perfect detection accuracy, but also showed no evidence of bias disadvantaging non-native English speakers. Findings of this study contribute to the ongoing debates surrounding the formulation of policies for utilizing AI-generated content detectors in education.

Highlights:
• We study the potential bias in detecting ChatGPT-generated essays in a large-scale assessment.
• Detectors based on linguistic features showed near-perfect detection performance.
• Detectors built using well-sampled data from GRE do not show bias against non-native English speakers.
• Findings shed light on the fairness in applying automated LLM detectors in education.
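The abstract describes detectors built from e-rater linguistic features and text-perplexity features. As a rough illustration of that general recipe only (not the authors' actual pipeline or the ETS e-rater feature set), the sketch below computes perplexity under an open GPT-2 model as a proxy, pairs it with two toy surface-level features, and fits a logistic-regression classifier; every feature choice, model name, and function in it is an assumption made for illustration.

```python
# Minimal sketch: perplexity + simple surface features -> binary detector.
# All specifics (GPT-2 as the scoring LM, the toy features, logistic regression)
# are illustrative assumptions, not the detectors described in the paper.
import math

import torch
from transformers import GPT2LMHeadModel, GPT2TokenizerFast
from sklearn.linear_model import LogisticRegression

tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")
model.eval()


def perplexity(text: str) -> float:
    """Perplexity of the essay under GPT-2 (a stand-in language model)."""
    enc = tokenizer(text, return_tensors="pt", truncation=True, max_length=1024)
    with torch.no_grad():
        out = model(**enc, labels=enc["input_ids"])
    return math.exp(out.loss.item())


def surface_features(text: str) -> list[float]:
    """Toy stand-ins for linguistic features: essay length and type-token ratio."""
    tokens = text.split()
    types = {t.lower() for t in tokens}
    return [float(len(tokens)), len(types) / max(len(tokens), 1)]


def featurize(text: str) -> list[float]:
    return [perplexity(text)] + surface_features(text)


def train_detector(essays: list[str], labels: list[int]) -> LogisticRegression:
    """Fit a classifier where labels are 1 = AI-generated, 0 = human-written."""
    X = [featurize(e) for e in essays]
    clf = LogisticRegression(max_iter=1000)
    clf.fit(X, labels)
    return clf
```

In this kind of setup, a fairness check like the one reported in the abstract would compare the detector's false-positive rates on human-written essays from native and non-native English speakers; the sketch above does not include that evaluation step.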
ISSN: 0360-1315, 1873-782X
DOI: 10.1016/j.compedu.2024.105070