2404.04941.md

Background

  • Background The paper discusses the potential of Automated Essay Scoring (AES) as an alternative to costly, labor-intensive human assessment. AES promises to address rater fatigue and inconsistency, allowing essays to be graded efficiently at scale. However, existing AES systems are usually tailored to specific writing prompts and rely on labeled essays, whose collection demands significant cost and expertise; as a result, these models often struggle with essays written for unseen writing prompts.

  • Existing Work Prior work has focused on matching human ratings with supervised, prompt-specific models, which have difficulty with essays written for unseen prompts. In addition, although a growing number of studies use Large Language Models (LLMs) as evaluation metrics for machine-generated text, applying LLMs to zero-shot Automated Essay Scoring remains largely unexplored.

Core Contributions

  • Introduced a zero-shot essay scoring framework (Multi Trait Specialization, MTS)
    • Challenge 1: How to utilize LLMs to score essays for unseen writing prompts MTS addresses this by decomposing writing proficiency into distinct traits, generating scoring criteria for each trait with ChatGPT, and scoring the traits over several rounds of conversation (see the sketch after this list). This approach goes beyond single-step holistic scoring and aims for better alignment with human raters.

    • Challenge 2: How to ensure the accuracy of LLM evaluations The conversational framework instructs the LLM to retrieve and refer back to the essay's text (a Quote Retrieval and Scoring mechanism), grounding the evaluation in the essay's content before scores are allocated against the provided criteria. The final overall score is then obtained by averaging the trait scores and applying min-max scaling combined with an outlier clipping mechanism (see the aggregation sketch below).
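
A minimal sketch of how such a multi-round, trait-wise conversation might be driven is shown below. The trait names, prompt wording, and the `chat_complete` helper are illustrative assumptions, not the paper's exact prompts or criteria.

```python
TRAITS = ["organization", "vocabulary", "sentence fluency", "conventions"]  # assumed trait set


def chat_complete(messages: list[dict]) -> str:
    """Placeholder for a chat LLM call (e.g. ChatGPT or Llama2-13b-chat)."""
    raise NotImplementedError("plug in your LLM client here")


def score_trait(essay: str, prompt: str, trait: str, criteria: str) -> float:
    """One conversational round: quote retrieval, then trait scoring."""
    messages = [
        {"role": "system", "content": f"You are an essay rater focusing on {trait}."},
        {"role": "user", "content": f"Writing prompt:\n{prompt}\n\nEssay:\n{essay}"},
        # Quote Retrieval: anchor the evaluation in the essay's own text.
        {"role": "user", "content": f"Quote the passages most relevant to {trait}."},
    ]
    messages.append({"role": "assistant", "content": chat_complete(messages)})
    # Scoring: rate the trait against its generated criteria on a fixed scale.
    messages.append({"role": "user", "content":
                     f"Using these criteria:\n{criteria}\nRate the {trait} from 1 to 5."})
    return float(chat_complete(messages).strip().split()[0])  # naive numeric parse


def score_essay(essay: str, prompt: str, criteria: dict[str, str]) -> dict[str, float]:
    """Score every trait in its own conversation and collect the results."""
    return {t: score_trait(essay, prompt, t, criteria[t]) for t in TRAITS}
```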

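The aggregation step can be illustrated with a short sketch as well. The percentile-based clipping bounds, the rounding, and the example score range are assumptions for illustration and may differ from the paper's exact procedure.

```python
import numpy as np


def aggregate_scores(avg_trait_scores, target_min, target_max, clip_pct=(5, 95)):
    """Map per-essay averaged trait scores onto a prompt's score range.

    Sketch only: outliers are clipped at assumed percentiles, then min-max
    scaling rescales the clipped values to [target_min, target_max].
    """
    s = np.asarray(avg_trait_scores, dtype=float)
    lo, hi = np.percentile(s, clip_pct)           # outlier clipping bounds
    s = np.clip(s, lo, hi)
    scaled = (s - lo) / (hi - lo + 1e-9)          # min-max scaling to [0, 1]
    return np.rint(target_min + scaled * (target_max - target_min)).astype(int)


# Example: averaged trait scores for a batch of essays, mapped to a 2-12 rubric.
print(aggregate_scores([2.8, 3.5, 4.1, 1.2, 4.9], target_min=2, target_max=12))
```
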
Implementation and Deployment

Experimental results on the ASAP and TOEFL11 benchmark datasets show that MTS consistently surpasses direct prompting (Vanilla) in average Quadratic Weighted Kappa (QWK) across all LLMs and datasets, with maximum gains of 0.437 on TOEFL11 and 0.355 on ASAP. Moreover, with MTS the small Llama2-13b-chat model significantly outperforms ChatGPT, enabling effective real-world deployment. MTS trails the supervised state of the art by an average QWK of only 0.171 on TOEFL11 and 0.257 on ASAP, offering a potent zero-shot alternative to supervised models.
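
As a reference point, QWK is the quadratically weighted variant of Cohen's kappa and can be computed with scikit-learn; the scores below are invented placeholders, not results from the paper.

```python
from sklearn.metrics import cohen_kappa_score

# Placeholder human vs. model scores (not data from the paper).
human = [8, 9, 7, 10, 6, 8, 9, 7]
model = [8, 8, 7, 9, 6, 9, 9, 6]
print(f"QWK = {cohen_kappa_score(human, model, weights='quadratic'):.3f}")
```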

Summary

The study presents MTS, a novel zero-shot LLM framework for essay scoring that rates essays on distinct writing traits through multi-round conversations and derives the final score via min-max scaling with an outlier clipping mechanism. MTS substantially improves accuracy over direct prompting and allows a small open model to outperform ChatGPT, offering a practical zero-shot alternative to supervised approaches.