International Journal of Computer Applications
Foundation of Computer Science (FCS), NY, USA
Volume 187 - Issue 29
Published: August 2025
Authors: Abhishek Palavancha
Abhishek Palavancha . A Conversational Multi‑Agent Framework for Prompt Evaluation across Large Language Models. International Journal of Computer Applications. 187, 29 (August 2025), 57-60. DOI=10.5120/ijca2025925510
@article{ 10.5120/ijca2025925510, author = { Abhishek Palavancha }, title = { A Conversational Multi‑Agent Framework for Prompt Evaluation across Large Language Models }, journal = { International Journal of Computer Applications }, year = { 2025 }, volume = { 187 }, number = { 29 }, pages = { 57-60 }, doi = { 10.5120/ijca2025925510 }, publisher = { Foundation of Computer Science (FCS), NY, USA } }
%0 Journal Article %D 2025 %A Abhishek Palavancha %T A Conversational Multi‑Agent Framework for Prompt Evaluation across Large Language Models %J International Journal of Computer Applications %V 187 %N 29 %P 57-60 %R 10.5120/ijca2025925510 %I Foundation of Computer Science (FCS), NY, USA
Large Language Models (LLMs) are increasingly deployed in diverse applications, yet designing effective prompts that generalize across multiple LLMs remains challenging. This paper proposes a conversational multi-agent framework for testing and evaluating AI prompts using multiple LLMs (ChatGPT, Claude, Google Gemini) in a collaborative setup. The framework introduces a multi-agent architecture where AI agents powered by different LLMs interact under an orchestrator to process user prompts and evaluate responses collaboratively. A dynamic conversational interface enables prompt refinement and testing in real-time, providing immediate feedback on prompt efficacy. Key evaluation metrics include fluency, task success rate, response diversity, coherence, and groundedness to systematically assess prompt outcomes. Comprehensive experiments across 12 diverse datasets and 8 prompt categories demonstrate that multi-LLM collaboration surfaces strengths and weaknesses of prompts more effectively than single-model testing, with statistical significance (p<0.05). This work contributes a novel interactive approach to prompt engineering by leveraging multi-agent conversations to ensure prompts elicit high-quality, coherent, and factual responses across leading LLMs.
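The orchestrator pattern the abstract describes — fanning one prompt out to agents backed by different LLMs and scoring the responses on shared metrics — can be sketched as follows. This is a minimal illustration, not the paper's implementation: the `Agent`, `score_response`, and `orchestrate` names are hypothetical, the LLM backends are stubbed with plain callables (a real system would call the ChatGPT, Claude, and Gemini APIs), and the toy metrics only gesture at the fluency and diversity measures the paper uses.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

@dataclass
class Agent:
    """One agent wrapping one LLM backend (stubbed here as a callable)."""
    name: str
    generate: Callable[[str], str]  # prompt -> response text

def score_response(response: str) -> Dict[str, float]:
    """Toy stand-ins for the paper's metrics. Real fluency/coherence/
    groundedness scoring would use learned or reference-based evaluators."""
    words = response.split()
    return {
        "fluency": min(1.0, len(words) / 20),               # length proxy (toy)
        "diversity": len(set(words)) / max(1, len(words)),  # type-token ratio
    }

def orchestrate(prompt: str, agents: List[Agent]) -> Dict[str, Dict[str, float]]:
    """Orchestrator: send the prompt to every agent, score each response,
    and return a per-agent metric table so prompts can be compared across models."""
    return {a.name: score_response(a.generate(prompt)) for a in agents}

# Stub backends standing in for real LLM API calls.
agents = [
    Agent("gpt-stub", lambda p: f"Answer about {p} with some detail."),
    Agent("claude-stub", lambda p: f"{p} explained briefly."),
]
report = orchestrate("prompt caching", agents)
print(report)
```

A conversational front end would loop this: show the per-model metric table to the user, let them refine the prompt, and re-run `orchestrate` until the scores converge across backends.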