Andrea Horbach, Daniel Mora Melanchthon, Nils-Jonathan Schaller, Stefan Keller, Jennifer Meyer, Thorben Jansen
Empirical Study: One Model to Score Them All? On Suitability, Stability, and Synergy in Automated Essay Evaluation
Writing is a key educational competence whose development depends on feedback. Automated Essay Scoring (AES) is increasingly used to support feedback generation, yet most prior research has neglected score stability across repeated runs, even though inconsistent scoring can undermine trust in AES and limit its educational value. We evaluate feature-based logistic regression models, transformer-based neural models, and generative large language models (LLMs) on 4,593 EFL essays from the MEWS dataset. Beyond accuracy, we analyze prediction variability across multiple runs and examine agreement patterns within and across model families. Results show that feature-based models remain competitive, while LLMs achieve high accuracy. Despite strong average performance, GPT-5 exhibits substantial variability across runs. Across models, agreement patterns reveal that different families succeed and fail on different item subsets. Our findings underline stability as a crucial dimension for deploying AES models in educational contexts and highlight the need for careful model selection and potentially model combination.
| Bibliography | Andrea Horbach / Daniel Mora Melanchthon / Nils-Jonathan Schaller / Stefan Keller / Jennifer Meyer / Thorben Jansen: Empirical Study: One Model to Score Them All? On Suitability, Stability, and Synergy in Automated Essay Evaluation. 14 pages. |
|---|---|
| Pages | 14 |
| Item number | PEU20260305 |
| Authors | Andrea Horbach, Daniel Mora Melanchthon, Nils-Jonathan Schaller, Stefan Keller, Jennifer Meyer, Thorben Jansen |
| Publication date | 1 July 2026 |