Research Article

A Conversational Multi‑Agent Framework for Prompt Evaluation across Large Language Models

by Abhishek Palavancha
International Journal of Computer Applications
Foundation of Computer Science (FCS), NY, USA
Volume 187 - Issue 29
Published: August 2025
Authors: Abhishek Palavancha
10.5120/ijca2025925510

Abhishek Palavancha . A Conversational Multi‑Agent Framework for Prompt Evaluation across Large Language Models. International Journal of Computer Applications. 187, 29 (August 2025), 57-60. DOI=10.5120/ijca2025925510

@article{10.5120/ijca2025925510,
  author    = {Abhishek Palavancha},
  title     = {A Conversational Multi‑Agent Framework for Prompt Evaluation across Large Language Models},
  journal   = {International Journal of Computer Applications},
  year      = {2025},
  volume    = {187},
  number    = {29},
  pages     = {57-60},
  doi       = {10.5120/ijca2025925510},
  publisher = {Foundation of Computer Science (FCS), NY, USA}
}
%0 Journal Article
%D 2025
%A Abhishek Palavancha
%T A Conversational Multi‑Agent Framework for Prompt Evaluation across Large Language Models
%J International Journal of Computer Applications
%V 187
%N 29
%P 57-60
%R 10.5120/ijca2025925510
%I Foundation of Computer Science (FCS), NY, USA
Abstract

Large Language Models (LLMs) are increasingly deployed in diverse applications, yet designing effective prompts that generalize across multiple LLMs remains challenging. This paper proposes a conversational multi-agent framework for testing and evaluating AI prompts using multiple LLMs (ChatGPT, Claude, Google Gemini) in a collaborative setup. The framework introduces a multi-agent architecture where AI agents powered by different LLMs interact under an orchestrator to process user prompts and evaluate responses collaboratively. A dynamic conversational interface enables prompt refinement and testing in real-time, providing immediate feedback on prompt efficacy. Key evaluation metrics include fluency, task success rate, response diversity, coherence, and groundedness to systematically assess prompt outcomes. Comprehensive experiments across 12 diverse datasets and 8 prompt categories demonstrate that multi-LLM collaboration surfaces strengths and weaknesses of prompts more effectively than single-model testing, with statistical significance (p<0.05). This work contributes a novel interactive approach to prompt engineering by leveraging multi-agent conversations to ensure prompts elicit high-quality, coherent, and factual responses across leading LLMs.
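The orchestrator-centered flow the abstract describes, where agents backed by different LLMs each answer the same prompt and their responses are scored on shared metrics, can be sketched as follows. This is a minimal illustration, not the paper's implementation: the agent names, the `evaluate` scoring (a word-count and lexical-diversity placeholder standing in for metrics such as fluency, coherence, and groundedness, which would require model-based judges), and the `orchestrate` function are all hypothetical.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

@dataclass
class Agent:
    """One LLM-backed agent; `generate` stands in for a real model API call."""
    name: str
    generate: Callable[[str], str]

def evaluate(response: str) -> Dict[str, float]:
    # Placeholder scoring: response length and lexical diversity.
    # The paper's metrics (fluency, task success, coherence, groundedness)
    # would be computed by judge models or task-specific checks instead.
    words = response.split()
    return {
        "length": float(len(words)),
        "diversity": len(set(words)) / max(len(words), 1),
    }

def orchestrate(prompt: str, agents: List[Agent]) -> Dict[str, Dict[str, float]]:
    # Fan the prompt out to every agent, score each response, and return
    # per-agent metric dictionaries for side-by-side comparison.
    return {agent.name: evaluate(agent.generate(prompt)) for agent in agents}

if __name__ == "__main__":
    # Stub agents that echo the prompt, in place of real ChatGPT/Claude/Gemini calls.
    agents = [
        Agent("model_a", lambda p: f"Answer A: {p}"),
        Agent("model_b", lambda p: f"Answer B: {p} {p}"),
    ]
    report = orchestrate("Summarize the plot of Hamlet.", agents)
    for name, scores in report.items():
        print(name, scores)
```

Comparing the per-agent metric dictionaries is what lets a multi-LLM setup surface prompt weaknesses that a single-model test would miss, e.g. a prompt that one model answers fluently but another answers with low diversity.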

Index Terms
Computer Science
Information Sciences
Keywords

Prompt engineering, large language models, multi-agent systems, conversational AI, evaluation metrics, orchestrator architecture, collaborative AI
