Research Article

Optimizing GPT-4 for Automated Short Answer Grading in Educational Assessments

by Augustine O. Ugbari, Clement Ndeekor, Echebiri Wobidi
International Journal of Computer Applications
Foundation of Computer Science (FCS), NY, USA
Volume 187 - Issue 21
Published: July 2025
DOI: 10.5120/ijca2025925255

Augustine O. Ugbari, Clement Ndeekor, Echebiri Wobidi. Optimizing GPT-4 for Automated Short Answer Grading in Educational Assessments. International Journal of Computer Applications. 187, 21 (July 2025), 32-36. DOI=10.5120/ijca2025925255

@article{10.5120/ijca2025925255,
  author    = { Augustine O. Ugbari and Clement Ndeekor and Echebiri Wobidi },
  title     = { Optimizing GPT-4 for Automated Short Answer Grading in Educational Assessments },
  journal   = { International Journal of Computer Applications },
  year      = { 2025 },
  volume    = { 187 },
  number    = { 21 },
  pages     = { 32-36 },
  doi       = { 10.5120/ijca2025925255 },
  publisher = { Foundation of Computer Science (FCS), NY, USA }
}

%0 Journal Article
%D 2025
%A Augustine O. Ugbari
%A Clement Ndeekor
%A Echebiri Wobidi
%T Optimizing GPT-4 for Automated Short Answer Grading in Educational Assessments
%J International Journal of Computer Applications
%V 187
%N 21
%P 32-36
%R 10.5120/ijca2025925255
%I Foundation of Computer Science (FCS), NY, USA
Abstract

Automated Short Answer Grading Systems (ASAGS) have advanced significantly with the integration of large language models (LLMs), particularly GPT-4. This paper explores methodologies for optimizing GPT-4 to grade short answer questions in educational assessments, focusing on aligning the model’s natural language processing capabilities with human grading rubrics to improve accuracy, consistency, and fairness. We examine techniques including prompt engineering, rubric-based scoring, and fine-tuning strategies. The research also assesses the model’s performance across various subject domains, evaluates inter-rater reliability against human graders, and addresses concerns related to bias, explainability, and scalability. Finally, the paper proposes a framework that deploys GPT-4 as a co-grader with human-in-the-loop moderation to improve educational outcomes.
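
As an illustration of the rubric-based prompting approach the abstract describes, the sketch below shows one way a short answer could be scored with GPT-4 through the OpenAI Chat Completions API. The rubric text, prompt wording, model identifier, and the grade_short_answer helper are assumptions made for this sketch, not the authors' actual implementation; in keeping with the proposed co-grader framework, a human moderator would still review each returned score.

# Minimal sketch of rubric-based short answer grading with GPT-4.
# The rubric, prompt wording, and helper name are assumptions made for
# illustration; they are not the prompts reported in the paper.
import json
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

RUBRIC = (
    "Award 0-3 points: 3 = fully correct and complete; "
    "2 = mostly correct with a minor omission; "
    "1 = partially correct; 0 = incorrect or off-topic."
)

def grade_short_answer(question: str, reference_answer: str, student_answer: str) -> dict:
    """Ask GPT-4 to score one student response against the rubric and return JSON."""
    prompt = (
        f"Question: {question}\n"
        f"Reference answer: {reference_answer}\n"
        f"Student answer: {student_answer}\n\n"
        f"Rubric: {RUBRIC}\n\n"
        'Reply with JSON only: {"score": <0-3>, "rationale": "<one sentence>"}'
    )
    response = client.chat.completions.create(
        model="gpt-4",   # assumed model identifier
        temperature=0,   # deterministic scoring improves consistency across runs
        messages=[
            {"role": "system", "content": "You are a careful, rubric-bound grading assistant."},
            {"role": "user", "content": prompt},
        ],
    )
    # Assumes the model follows the instruction and returns bare JSON.
    return json.loads(response.choices[0].message.content)

# Example call (a human grader would moderate this score before it is recorded):
# result = grade_short_answer(
#     "What does photosynthesis produce?",
#     "Glucose and oxygen",
#     "It makes sugar and oxygen for the plant.",
# )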

References
  • Anderson, L. W., & Krathwohl, D. R. (2001). A taxonomy for learning, teaching, and assessing: A revision of Bloom’s taxonomy of educational objectives. Longman.
  • Brown, T. B., Mann, B., Ryder, N., Subbiah, M., Kaplan, J., Dhariwal, P., ... & Amodei, D. (2020). Language models are few-shot learners. Advances in Neural Information Processing Systems, 33, 1877–1901.
  • Bubeck, S., Chandrasekaran, V., Eldan, R., Gehrke, J., Horvitz, E., Kamar, E., ... & Zhang, Y. (2023). Sparks of artificial general intelligence: Early experiments with GPT-4. arXiv preprint. https://doi.org/10.48550/arXiv.2303.12712
  • Burrows, S., Gurevych, I., & Stein, B. (2015). The efficacy of machine learning for automated essay grading. IEEE Transactions on Learning Technologies, 9(4), 532–544.
  • Clark, E., Tafjord, O., & Richardson, K. B. (2021). What can large language models do with syntax? arXiv preprint arXiv:2103.08505.
  • Devlin, J., Chang, M. W., Lee, K., & Toutanova, K. (2019). BERT: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805.
  • Dzikovska, M., Heilman, M., Collins, A., & Core, M. (2013). BEA: A large corpus of learner essays. In Proceedings of the Eighth Workshop on Innovative Use of NLP for Building Educational Applications (pp. 1–9).
  • Floridi, L., Cowls, J., Beltrametti, M., Chatila, R., Chazerand, P., Dignum, V., ... & Vayena, E. (2018). AI4People—An ethical framework for a good AI society: Opportunities, risks, principles, and recommendations. Minds and Machines, 28(4), 689–707. https://doi.org/10.1007/s11023-018-9482-5
  • Guidotti, R., Monreale, A., Rossi, F., Giannotti, F., & Pedreschi, D. (2018). A survey of methods for explaining black box models. ACM Computing Surveys (CSUR), 51(5), 1–42.
  • Kaggle. (2012). Automated student assessment prize (ASAP). https://www.kaggle.com/c/asap-aes
  • Kasneci, E., Sessler, K., Küchenhoff, L., Bannert, M., Dementieva, D., Fischer, F., ... & Kasneci, G. (2023). ChatGPT for good? On opportunities and challenges of large language models for education. Learning and Individual Differences, 103, 102274. https://doi.org/10.1016/j.lindif.2023.102274
  • Mohler, J., Bunescu, R., & Mihalcea, R. (2011). Lexical methods for measuring the semantic content similarity of text. In Proceedings of the conference on empirical methods in natural language processing (pp. 1416–1426).
  • OpenAI. (2023). GPT-4 Technical Report.
  • Ouyang, L., Wu, J., Jiang, X., Almeida, D., Wainwright, C. J., Sutskever, I., ... & Lowe, R. (2022). Training language models to follow instructions with human feedback. Advances in Neural Information Processing Systems, 35, 27730–27744.
  • Riordan, B., Xue, Z., Cruz, N., & Warschauer, M. (2017). Assessing automated scoring of student-written short answers using deep learning. Journal of Educational Data Mining, 9(1), 25–47.
  • Shermis, M. D., & Burstein, J. (Eds.). (2013). Handbook of automated essay evaluation: Current applications and new directions. Routledge.
  • Sukkarieh, J. Z., & Pulman, S. G. (2005). Issues in the automated evaluation of reading comprehension exercises. In Proceedings of the ACL student research workshop (pp. 9–16).
  • Wang, Y., Liang, N., She, D., Liu, K., Xiao, X., & Zhu, J. (2023). Large language models are few-shot graders for multi-aspect feedback. arXiv preprint arXiv:2305.10775.
  • Wei, J., Wang, X., Schuurmans, D., Bosma, M., Ichter, B., Xia, F., ... & Zhou, D. (2022). Chain-of-thought prompting elicits reasoning in large language models. arXiv preprint arXiv:2201.11903.
  • Zupanc, B., & Bosnić, Z. (2015). Text similarity based on latent semantic analysis. Informatica, 39(3).
Index Terms
Computer Science
Information Sciences
Keywords

Automated Short Answer Grading Systems (ASAGS), Large Language Models (LLMs), GPT-4, Short Answer Questions (SAQs), Prompt Engineering, Rubric-Based Scoring, Few-Shot Learning, Fine-Tuning, Inter-Rater Reliability, Natural Language Processing (NLP), Chain-of-Thought Prompting, Feedback Generation.
