Can AI grade your essays? A comparative analysis of large language models and teacher ratings in multidimensional essay scoring
The manual assessment and grading of student writing is a time-consuming yet critical task for teachers. Recent developments in generative AI, such as large language models, offer potential solutions to facilitate essay-scoring tasks for teachers. In our study, we evaluate the performance and reliab...
Gespeichert in:
Veröffentlicht in: | arXiv.org 2024-11 |
---|---|
Hauptverfasser: | , , , |
Format: | Artikel |
Sprache: | eng |
Schlagworte: | |
Online-Zugang: | Volltext |
Tags: |
Tag hinzufügen
Keine Tags, Fügen Sie den ersten Tag hinzu!
|
Zusammenfassung: | The manual assessment and grading of student writing is a time-consuming yet critical task for teachers. Recent developments in generative AI, such as large language models, offer potential solutions to facilitate essay-scoring tasks for teachers. In our study, we evaluate the performance and reliability of both open-source and closed-source LLMs in assessing German student essays, comparing their evaluations to those of 37 teachers across 10 pre-defined criteria (i.e., plot logic, expression). A corpus of 20 real-world essays from Year 7 and 8 students was analyzed using five LLMs: GPT-3.5, GPT-4, o1, LLaMA 3-70B, and Mixtral 8x7B, aiming to provide in-depth insights into LLMs' scoring capabilities. Closed-source GPT models outperform open-source models in both internal consistency and alignment with human ratings, particularly excelling in language-related criteria. The novel o1 model outperforms all other LLMs, achieving Spearman's \(r = .74\) with human assessments in the overall score, and an internal consistency of \(ICC=.80\). These findings indicate that LLM-based assessment can be a useful tool to reduce teacher workload by supporting the evaluation of essays, especially with regard to language-related criteria. However, due to their tendency for higher scores, the models require further refinement to better capture aspects of content quality. |
---|---|
ISSN: | 2331-8422 |