Research Article

Optimizing GPT-4 for Automated Short Answer Grading in Educational Assessments

by Augustine O. Ugbari, Clement Ndeekor, Echebiri Wobidi
International Journal of Computer Applications
Foundation of Computer Science (FCS), NY, USA
Volume 187 - Issue 21
Published: July 2025
DOI: 10.5120/ijca2025925255

Augustine O. Ugbari, Clement Ndeekor, Echebiri Wobidi. Optimizing GPT-4 for Automated Short Answer Grading in Educational Assessments. International Journal of Computer Applications. 187, 21 (July 2025), 32-36. DOI=10.5120/ijca2025925255

@article{10.5120/ijca2025925255,
  author    = { Augustine O. Ugbari and Clement Ndeekor and Echebiri Wobidi },
  title     = { Optimizing GPT-4 for Automated Short Answer Grading in Educational Assessments },
  journal   = { International Journal of Computer Applications },
  year      = { 2025 },
  volume    = { 187 },
  number    = { 21 },
  pages     = { 32-36 },
  doi       = { 10.5120/ijca2025925255 },
  publisher = { Foundation of Computer Science (FCS), NY, USA }
}

%0 Journal Article
%D 2025
%A Augustine O. Ugbari
%A Clement Ndeekor
%A Echebiri Wobidi
%T Optimizing GPT-4 for Automated Short Answer Grading in Educational Assessments
%J International Journal of Computer Applications
%V 187
%N 21
%P 32-36
%R 10.5120/ijca2025925255
%I Foundation of Computer Science (FCS), NY, USA
Abstract

Automated Short Answer Grading Systems (ASAGS) have advanced significantly with the integration of large language models (LLMs), particularly GPT-4. This paper explores methodologies for optimizing GPT-4 to grade short answer questions in educational assessments, focusing on aligning the model’s natural language processing capabilities with human grading rubrics to improve accuracy, consistency, and fairness. We examine techniques including prompt engineering, rubric-based scoring, and fine-tuning strategies. The research also assesses the model’s performance across various subject domains, evaluates inter-rater reliability against human graders, and addresses concerns related to bias, explainability, and scalability. Finally, the paper proposes a framework that deploys GPT-4 as a co-grader with human-in-the-loop moderation to improve educational outcomes.
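
As an illustration of the rubric-based prompting approach the abstract describes, the sketch below shows one way a short answer could be scored with GPT-4 through the OpenAI Chat Completions API. The rubric text, prompt wording, model identifier, and the grade_short_answer helper are assumptions made for this sketch, not the authors' actual implementation; in keeping with the proposed co-grader framework, a human moderator would still review each returned score.

# Minimal sketch of rubric-based short answer grading with GPT-4.
# The rubric, prompt wording, and helper name are assumptions made for
# illustration; they are not the prompts reported in the paper.
import json
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

RUBRIC = (
    "Award 0-3 points: 3 = fully correct and complete; "
    "2 = mostly correct with a minor omission; "
    "1 = partially correct; 0 = incorrect or off-topic."
)

def grade_short_answer(question: str, reference_answer: str, student_answer: str) -> dict:
    """Ask GPT-4 to score one student response against the rubric and return JSON."""
    prompt = (
        f"Question: {question}\n"
        f"Reference answer: {reference_answer}\n"
        f"Student answer: {student_answer}\n\n"
        f"Rubric: {RUBRIC}\n\n"
        'Reply with JSON only: {"score": <0-3>, "rationale": "<one sentence>"}'
    )
    response = client.chat.completions.create(
        model="gpt-4",   # assumed model identifier
        temperature=0,   # deterministic scoring improves consistency across runs
        messages=[
            {"role": "system", "content": "You are a careful, rubric-bound grading assistant."},
            {"role": "user", "content": prompt},
        ],
    )
    # Assumes the model follows the instruction and returns bare JSON.
    return json.loads(response.choices[0].message.content)

# Example call (a human grader would moderate this score before it is recorded):
# result = grade_short_answer(
#     "What does photosynthesis produce?",
#     "Glucose and oxygen",
#     "It makes sugar and oxygen for the plant.",
# )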

References
  • Anderson, L. W., & Krathwohl, D. R. (2001). A taxonomy for learning, teaching, and assessing: A revision of Bloom’s taxonomy of educational objectives. Longman.
  • Brown, T. B., Mann, B., Ryder, N., Subbiah, M., Kaplan, J., Dhariwal, P., ... & Amodei, D. (2020). Language models are few-shot learners. Advances in Neural Information Processing Systems, 33, 1877–1901.
  • Bubeck, S., Chandrasekaran, V., Eldan, R., Gehrke, J., Horvitz, E., Kamar, E., ... & Zhang, Y. (2023). Sparks of artificial general intelligence: Early experiments with GPT-4. arXiv preprint. https://doi.org/10.48550/arXiv.2303.12712
  • Burrows, S., Gurevych, I., & Stein, B. (2015). The efficacy of machine learning for automated essay grading. IEEE Transactions on Learning Technologies, 9(4), 532–544.
  • Clark, E., Tafjord, O., & Richardson, K. B. (2021). What can large language models do with syntax? arXiv preprint arXiv:2103.08505.
  • Devlin, J., Chang, M. W., Lee, K., & Toutanova, K. (2019). BERT: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805.
  • Dzikovska, M., Heilman, M., Collins, A., & Core, M. (2013). BEA: A large corpus of learner essays. In Proceedings of the Eighth Workshop on Innovative Use of NLP for Building Educational Applications (pp. 1–9).
  • Floridi, L., Cowls, J., Beltrametti, M., Chatila, R., Chazerand, P., Dignum, V., ... & Vayena, E. (2018). AI4People—An ethical framework for a good AI society: Opportunities, risks, principles, and recommendations. Minds and Machines, 28(4), 689–707. https://doi.org/10.1007/s11023-018-9482-5
  • Guidotti, R., Monreale, A., Rossi, F., Giannotti, F., & Pedreschi, D. (2018). A survey of methods for explaining black box models. ACM Computing Surveys (CSUR), 51(5), 1–42.
  • Kaggle. (2012). Automated student assessment prize (ASAP). https://www.kaggle.com/c/asap-aes
  • Kasneci, E., Sessler, K., Küchenhoff, L., Bannert, M., Dementieva, D., Fischer, F., ... & Kasneci, G. (2023). ChatGPT for good? On opportunities and challenges of large language models for education. Learning and Individual Differences, 103, 102274. https://doi.org/10.1016/j.lindif.2023.102274
  • Mohler, J., Bunescu, R., & Mihalcea, R. (2011). Lexical methods for measuring the semantic content similarity of text. In Proceedings of the conference on empirical methods in natural language processing (pp. 1416–1426).
  • OpenAI. (2023). GPT-4 Technical Report.
  • Ouyang, L., Wu, J., Jiang, X., Almeida, D., Wainwright, C. J., Sutskever, I., ... & Lowe, R. (2022). Training language models to follow instructions with human feedback. Advances in Neural Information Processing Systems, 35, 27730–27744.
  • Riordan, B., Xue, Z., Cruz, N., & Warschauer, M. (2017). Assessing automated scoring of student-written short answers using deep learning. Journal of Educational Data Mining, 9(1), 25–47.
  • Shermis, M. D., & Burstein, J. (Eds.). (2013). Handbook of automated essay evaluation: Current applications and new directions. Routledge.
  • Sukkarieh, J. Z., & Pulman, S. G. (2005). Issues in the automated evaluation of reading comprehension exercises. In Proceedings of the ACL student research workshop (pp. 9–16).
  • Wang, Y., Liang, N., She, D., Liu, K., Xiao, X., & Zhu, J. (2023). Large language models are few-shot graders for multi-aspect feedback. arXiv preprint arXiv:2305.10775.
  • Wei, J., Wang, X., Schuurmans, D., Bosma, M., Ichter, B., Xia, F., ... & Zhou, D. (2022). Chain-of-thought prompting elicits reasoning in large language models. arXiv preprint arXiv:2201.11903.
  • Zupanc, B., & Bosnić, Z. (2015). Text similarity based on latent semantic analysis. Informatica, 39(3).
Index Terms
Computer Science
Information Sciences
Keywords

Automated Short Answer Grading Systems (ASAGS), Large Language Models (LLMs), GPT-4, Short Answer Questions (SAQs), Prompt Engineering, Rubric-Based Scoring, Few-Shot Learning, Fine-Tuning, Inter-Rater Reliability, Natural Language Processing (NLP), Chain-of-Thought Prompting, Feedback Generation.
