Background

Background: The paper addresses the difficulty of giving Chinese high school students timely and reliable feedback, especially on writing practice, given the high student-teacher ratio. Automated systems that can provide accurate feedback and assessment are therefore particularly valuable in this setting.
Existing Work: Automated Essay Scoring (AES) systems give students immediate and consistent feedback and simplify grading for educators. Deploying AES systems in real-world educational settings remains difficult, however, mainly because exercise contexts vary widely and scoring rubrics are inherently ambiguous.

Core Contributions

The paper studies the GPT-3.5 and GPT-4 Large Language Models (LLMs) as tools for Automated Essay Scoring (AES), addressing two challenges:
- Challenge 1: Adapting AES systems to varied scoring environments. The authors applied OpenAI's GPT-3.5 and GPT-4 and improved grading accuracy across diverse scoring settings through prompt engineering and fine-tuning. The fine-tuned GPT-3.5 outperformed traditional grading models on several datasets, offering higher accuracy, consistency, generality, and explainability.
- Challenge 2: Using LLMs to help human evaluators grade more efficiently and consistently. In human-computer collaboration experiments, LLM-generated feedback helped novice graders reach accuracy comparable to that of experts, and helped experts become more efficient and consistent in their assessments. This underscores the potential of LLMs as educational technology tools and opens new avenues for human-AI collaboration.

Implementation and Deployment

A series of experiments on both public and private datasets compared LLM-based AES systems with traditional grading models. The main evaluation metric was Cohen's Quadratic Weighted Kappa (QWK), which measures the level of agreement between predicted and actual scores. Fine-tuned GPT-3.5 models outperformed the BERT baseline on six out of eight datasets, and this advantage was further confirmed on the private datasets. Beyond the model performance evaluations, the paper also presents examples of how performance was improved through prompt engineering and joint training of GPT-3.5.
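To make the metric concrete, here is a minimal sketch of how QWK can be computed for integer scores. It is not the authors' evaluation code: the function name `quadratic_weighted_kappa`, its parameters, and the handling of the degenerate single-score case are assumptions made for illustration.

```python
def quadratic_weighted_kappa(y_true, y_pred, min_score, max_score):
    """Cohen's kappa with quadratic weights for integer scores.

    Hypothetical helper for illustration; names and signature are assumptions.
    """
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("score lists must be non-empty and of equal length")
    n = max_score - min_score + 1
    if n < 2:
        raise ValueError("need at least two possible score values")

    # Observed confusion matrix: rows are actual scores, columns are predictions.
    observed = [[0] * n for _ in range(n)]
    for t, p in zip(y_true, y_pred):
        observed[t - min_score][p - min_score] += 1

    total = len(y_true)
    hist_true = [sum(row) for row in observed]
    hist_pred = [sum(observed[i][j] for i in range(n)) for j in range(n)]

    weighted_observed = 0.0
    weighted_expected = 0.0
    for i in range(n):
        for j in range(n):
            # Quadratic penalty grows with the squared distance between scores.
            weight = (i - j) ** 2 / (n - 1) ** 2
            # Expected count if actual and predicted scores were independent.
            expected = hist_true[i] * hist_pred[j] / total
            weighted_observed += weight * observed[i][j]
            weighted_expected += weight * expected

    if weighted_expected == 0:
        # Both lists contain one identical score; treated here as perfect
        # agreement (an assumption, other implementations may differ).
        return 1.0
    return 1.0 - weighted_observed / weighted_expected
```

A value of 1 indicates perfect agreement, 0 indicates agreement no better than chance, and negative values indicate systematic disagreement. Because the penalty is quadratic, a prediction that is off by two points costs four times as much as one that is off by a single point.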

Summary

This research demonstrates the potential of large language models in education, especially within AES systems. LLMs can not only automate scoring but also improve the performance of human graders through the feedback they generate. These findings offer useful insights for AI-assisted education and for efficient collaboration between AI and humans.
