Research Article

A Conversational Multi‑Agent Framework for Prompt Evaluation across Large Language Models

by Abhishek Palavancha
International Journal of Computer Applications
Foundation of Computer Science (FCS), NY, USA
Volume 187 - Issue 29
Published: August 2025
Authors: Abhishek Palavancha
10.5120/ijca2025925510

Abhishek Palavancha . A Conversational Multi‑Agent Framework for Prompt Evaluation across Large Language Models. International Journal of Computer Applications. 187, 29 (August 2025), 57-60. DOI=10.5120/ijca2025925510

@article{10.5120/ijca2025925510,
  author    = {Abhishek Palavancha},
  title     = {A Conversational Multi‑Agent Framework for Prompt Evaluation across Large Language Models},
  journal   = {International Journal of Computer Applications},
  year      = {2025},
  volume    = {187},
  number    = {29},
  pages     = {57-60},
  doi       = {10.5120/ijca2025925510},
  publisher = {Foundation of Computer Science (FCS), NY, USA}
}
%0 Journal Article
%D 2025
%A Abhishek Palavancha
%T A Conversational Multi‑Agent Framework for Prompt Evaluation across Large Language Models
%J International Journal of Computer Applications
%V 187
%N 29
%P 57-60
%R 10.5120/ijca2025925510
%I Foundation of Computer Science (FCS), NY, USA
Abstract

Large Language Models (LLMs) are increasingly deployed in diverse applications, yet designing effective prompts that generalize across multiple LLMs remains challenging. This paper proposes a conversational multi-agent framework for testing and evaluating AI prompts using multiple LLMs (ChatGPT, Claude, Google Gemini) in a collaborative setup. The framework introduces a multi-agent architecture where AI agents powered by different LLMs interact under an orchestrator to process user prompts and evaluate responses collaboratively. A dynamic conversational interface enables prompt refinement and testing in real-time, providing immediate feedback on prompt efficacy. Key evaluation metrics include fluency, task success rate, response diversity, coherence, and groundedness to systematically assess prompt outcomes. Comprehensive experiments across 12 diverse datasets and 8 prompt categories demonstrate that multi-LLM collaboration surfaces strengths and weaknesses of prompts more effectively than single-model testing, with statistical significance (p<0.05). This work contributes a novel interactive approach to prompt engineering by leveraging multi-agent conversations to ensure prompts elicit high-quality, coherent, and factual responses across leading LLMs.
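The orchestrator-centered flow the abstract describes, where agents backed by different LLMs each answer the same prompt and their responses are scored on shared metrics, can be sketched as follows. This is a minimal illustration, not the paper's implementation: the agent names, the `evaluate` scoring (a word-count and lexical-diversity placeholder standing in for metrics such as fluency, coherence, and groundedness, which would require model-based judges), and the `orchestrate` function are all hypothetical.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

@dataclass
class Agent:
    """One LLM-backed agent; `generate` stands in for a real model API call."""
    name: str
    generate: Callable[[str], str]

def evaluate(response: str) -> Dict[str, float]:
    # Placeholder scoring: response length and lexical diversity.
    # The paper's metrics (fluency, task success, coherence, groundedness)
    # would be computed by judge models or task-specific checks instead.
    words = response.split()
    return {
        "length": float(len(words)),
        "diversity": len(set(words)) / max(len(words), 1),
    }

def orchestrate(prompt: str, agents: List[Agent]) -> Dict[str, Dict[str, float]]:
    # Fan the prompt out to every agent, score each response, and return
    # per-agent metric dictionaries for side-by-side comparison.
    return {agent.name: evaluate(agent.generate(prompt)) for agent in agents}

if __name__ == "__main__":
    # Stub agents that echo the prompt, in place of real ChatGPT/Claude/Gemini calls.
    agents = [
        Agent("model_a", lambda p: f"Answer A: {p}"),
        Agent("model_b", lambda p: f"Answer B: {p} {p}"),
    ]
    report = orchestrate("Summarize the plot of Hamlet.", agents)
    for name, scores in report.items():
        print(name, scores)
```

Comparing the per-agent metric dictionaries is what lets a multi-LLM setup surface prompt weaknesses that a single-model test would miss, e.g. a prompt that one model answers fluently but another answers with low diversity.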

Index Terms
Computer Science
Information Sciences
Keywords

Prompt engineering, large language models, multi-agent systems, conversational AI, evaluation metrics, orchestrator architecture, collaborative AI
