Fine-tuning ChatGPT for Automatic Scoring of Written Scientific Explanations in Chinese
Saved in:
Main authors: | , , , |
---|---|
Format: | Article |
Language: | eng |
Subjects: | |
Online access: | Order full text |
Abstract: | The development of explanations for scientific phenomena is essential in
science assessment, but scoring student-written explanations remains
challenging and resource-intensive. Large language models (LLMs) have shown
promise in addressing this issue, particularly in alphabetic languages like
English. However, their applicability to logographic languages is less
explored. This study investigates the potential of fine-tuning ChatGPT, a
leading LLM, to automatically score scientific explanations written in Chinese.
Student responses to seven scientific explanation tasks were collected and
automatically scored, with scoring accuracy examined in relation to reasoning
complexity using the Kendall correlation. A qualitative analysis explored how
linguistic features influenced scoring accuracy. The results show that
domain-specific adaptation enables ChatGPT to score Chinese scientific
explanations with accuracy. However, scoring accuracy correlates with reasoning
complexity: a negative correlation for lower-level responses and a positive one
for higher-level responses. The model overrates complex reasoning in low-level
responses with intricate sentence structures and underrates high-level
responses using concise causal reasoning. These correlations stem from
linguistic features--simplicity and clarity enhance accuracy for lower-level
responses, while comprehensiveness improves accuracy for higher-level ones.
Simpler, shorter responses tend to score more accurately at lower levels,
whereas longer, information-rich responses yield better accuracy at higher
levels. These findings demonstrate the effectiveness of LLMs in automatic
scoring within a Chinese context and emphasize the importance of linguistic
features and reasoning complexity in fine-tuning scoring models for educational
assessments. |
---|---|
DOI: | 10.48550/arxiv.2501.06704 |
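
The abstract describes fine-tuning ChatGPT on student responses scored by humans. The paper does not publish its data pipeline; the sketch below only illustrates how scored responses could be packaged in OpenAI's chat fine-tuning JSONL format. The prompt wording, the 0-4 rubric, the example Chinese responses, and the file name are all hypothetical, not taken from the paper.

```python
"""Minimal sketch, assuming OpenAI's chat fine-tuning JSONL format.
All example data, prompts, and the rubric scale are hypothetical."""
import json

# Hypothetical scored responses: (student explanation in Chinese, human-assigned level)
examples = [
    ("因为金属导热快，手摸上去感觉更冷。", 2),
    ("两者温度相同，但金属导热系数大，带走手上热量更快，所以感觉更冷。", 4),
]

with open("train.jsonl", "w", encoding="utf-8") as f:
    for text, level in examples:
        record = {
            "messages": [
                {"role": "system",
                 "content": "Score the Chinese scientific explanation on a 0-4 rubric."},
                {"role": "user", "content": text},
                {"role": "assistant", "content": str(level)},
            ]
        }
        f.write(json.dumps(record, ensure_ascii=False) + "\n")
```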
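The abstract also reports examining scoring accuracy in relation to reasoning complexity using the Kendall correlation. A minimal sketch of that kind of check is shown below, using SciPy's `kendalltau`; how per-response accuracy is operationalised (here, a simple match/no-match flag against the human score) and the complexity levels are assumptions for illustration only.

```python
"""Minimal sketch, not the authors' analysis: relate per-response scoring
accuracy to reasoning-complexity level with Kendall's tau."""
from scipy.stats import kendalltau

# Hypothetical data: one entry per student response.
complexity_level = [1, 1, 2, 2, 3, 3, 4, 4]   # assumed reasoning-complexity levels
scored_correctly = [1, 1, 1, 0, 0, 1, 1, 1]   # 1 = model score matched human score

tau, p_value = kendalltau(complexity_level, scored_correctly)
print(f"Kendall tau = {tau:.2f}, p = {p_value:.3f}")
```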