Evals System Architecture
The Evals system in MCPJam Inspector is a comprehensive testing framework designed to evaluate MCP (Model Context Protocol) server implementations. This guide provides a deep dive into the architecture, data flows, and key components to help you contribute effectively.
Overview
The Evals system allows developers to:
- Run automated tests against MCP servers to validate tool implementations
- Generate test cases using AI based on available server tools
- Track results in real-time with detailed metrics and analytics
- Compare expected vs actual behavior using agentic LLM loops
 
Key Features
- Multi-step wizard UI for test configuration
 - Support for multiple LLM providers (OpenAI, Anthropic, DeepSeek, Ollama)
 - Real-time result tracking via MCPJamBackend
 - AI-powered test case generation
 - Agentic execution with up to 20 conversation turns
 - Token usage and performance metrics
 
Architecture Overview
The Evals system is composed of three main layers.
System Components
1. Client Layer (UI)
EvalRunner Component (client/src/components/evals/eval-runner.tsx)
The primary UI for configuring and launching evaluation runs.
Architecture: 4-Step Wizard
Step Details:
1. Select Servers: Choose from connected MCP servers
   - Filters: Only shows connected servers
   - Validation: At least one server required
2. Choose Model: Select LLM provider and model
   - Providers: OpenAI, Anthropic, DeepSeek, Ollama, MCPJam
   - Credential check: Validates API keys via hasToken()
3. Define Tests: Create or generate test cases
   - Manual entry: Title, query, expected tool calls, number of runs
   - AI generation: Click “Generate Tests” to create 6 test cases (2 easy, 2 medium, 2 hard)
4. Review & Run: Confirm and execute
   - Displays summary of configuration
   - POST to /api/mcp/evals/run
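As a rough illustration of what the final step submits, the sketch below shows a hypothetical request to /api/mcp/evals/run. The field names (serverIds, llmConfig, tests) are assumptions for illustration and may not match the actual payload shape.

```typescript
// Hypothetical request from the wizard's final step.
// Field names are illustrative assumptions, not the exact payload shape.
async function startEvalRun() {
  const body = {
    serverIds: ["my-mcp-server"], // step 1: selected servers
    llmConfig: { provider: "anthropic", model: "example-model-id" }, // step 2
    tests: [
      {
        title: "Fetch weather",
        query: "What is the weather in Paris?",
        expectedToolCalls: ["get_weather"],
        runs: 3,
      },
    ], // step 3: manual or AI-generated test cases
  };

  const res = await fetch("/api/mcp/evals/run", {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify(body),
  });
  if (!res.ok) throw new Error(`Eval run failed: ${res.status}`);
  return res.json();
}
```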
 
Results Components (client/src/components/evals/*)
Real-time display of evaluation results.
Component Hierarchy:
2. Server Layer (API)
Evals Routes (server/routes/mcp/evals.ts)
HTTP API endpoints for eval execution and test generation.
Endpoint: POST /api/mcp/evals/run
Request Schema:
Key helpers:
- resolveServerIdsOrThrow(): Case-insensitive server ID matching
- transformServerConfigsToEnvironment(): Converts server manager format to CLI format
- transformLLMConfigToLlmsConfig(): Routes API keys appropriately
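The sketch below approximates how the handler ties these helpers together. The request schema fields and the helper signatures are assumptions; only the endpoint path and helper names come from the actual code.

```typescript
import { z } from "zod";

// Helper signatures are assumed for illustration; the real ones live in
// server/routes/mcp/evals.ts and server/utils/eval-transformer.ts.
declare function resolveServerIdsOrThrow(ids: string[]): unknown[];
declare function transformServerConfigsToEnvironment(servers: unknown[]): unknown;
declare function transformLLMConfigToLlmsConfig(config: {
  provider: string;
  model: string;
}): unknown;
declare function runEvals(args: {
  environment: unknown;
  llms: unknown;
  tests: unknown[];
}): Promise<unknown>;

// Approximation of the request schema (field names are assumptions).
const runEvalsRequest = z.object({
  serverIds: z.array(z.string()).min(1),
  llmConfig: z.object({ provider: z.string(), model: z.string() }),
  tests: z.array(
    z.object({
      title: z.string(),
      query: z.string(),
      expectedToolCalls: z.array(z.string()),
      runs: z.number().int().positive().default(1),
    }),
  ),
});

async function handleRunEvals(rawBody: unknown) {
  const body = runEvalsRequest.parse(rawBody);

  // Case-insensitive matching of requested IDs against connected servers.
  const servers = resolveServerIdsOrThrow(body.serverIds);

  // Convert server-manager configs into the CLI's environment format.
  const environment = transformServerConfigsToEnvironment(servers);

  // Route API keys / auth tokens to the appropriate provider entry.
  const llms = transformLLMConfigToLlmsConfig(body.llmConfig);

  return runEvals({ environment, llms, tests: body.tests });
}
```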
Endpoint: POST /api/mcp/evals/generate-tests
Request Schema:
Test Generation Agent (server/services/eval-agent.ts)
Generates test cases using backend LLM.
Algorithm:
1. Groups tools by server ID
2. Creates system prompt with MCP agent instructions
3. Creates user prompt with tool definitions and requirements
4. Calls backend LLM (meta-llama/llama-3.3-70b-instruct)
5. Parses JSON response
6. Returns 6 test cases (2 easy, 2 medium, 2 hard)
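A condensed sketch of this flow is shown below. The prompt wording, the LLM helper, and the parsing details are assumptions; only the overall sequence (group tools, prompt the backend model, parse JSON, return six cases) reflects the algorithm above.

```typescript
// Hypothetical condensation of the test generation flow.
interface GeneratedTest {
  title: string;
  query: string;
  expectedToolCalls: string[];
  difficulty: "easy" | "medium" | "hard";
}

async function generateTests(
  tools: { serverId: string; name: string; description: string }[],
  callBackendLlm: (system: string, user: string) => Promise<string>, // assumed helper
): Promise<GeneratedTest[]> {
  // 1. Group tool definitions by server ID.
  const byServer = new Map<string, typeof tools>();
  for (const tool of tools) {
    const group = byServer.get(tool.serverId) ?? [];
    group.push(tool);
    byServer.set(tool.serverId, group);
  }

  // 2-3. Build system and user prompts (wording here is illustrative only).
  const system = "You are an MCP eval agent. Return test cases as JSON.";
  const user =
    `Tools by server: ${JSON.stringify(Object.fromEntries(byServer))}\n` +
    "Return exactly 6 test cases: 2 easy, 2 medium, 2 hard.";

  // 4. Call the backend LLM (meta-llama/llama-3.3-70b-instruct).
  const raw = await callBackendLlm(system, user);

  // 5-6. Parse the JSON response into the six test cases.
  return JSON.parse(raw) as GeneratedTest[];
}
```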
 
3. CLI Layer (Execution Engine)
Runner (evals-cli/src/evals/runner.ts)
The core orchestrator that executes evaluation tests.
Entry Points:
- runEvalsWithApiKey(): CLI mode with API key authentication
- runEvalsWithAuth(): UI mode with Convex authentication

Execution behavior:
- Max 20 conversation turns to prevent infinite loops
- Token usage tracking (prompt + completion)
- Duration measurement
- Tool call recording
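A simplified sketch of that agentic loop is shown below. The message and result shapes are assumptions (the real runner is built on the Vercel AI SDK and the Mastra MCP client); the 20-turn cap, token tracking, duration measurement, and tool call recording mirror the behavior listed above.

```typescript
// Rough sketch of the agentic execution loop; shapes are assumptions.
interface TurnResult {
  toolCalls: { toolName: string; args: unknown }[];
  usage: { promptTokens: number; completionTokens: number };
}

const MAX_TURNS = 20; // hard cap to prevent infinite loops

async function runIteration(
  query: string,
  // `step` is assumed to run one LLM turn and append the assistant reply
  // and any tool results to `messages`.
  step: (messages: unknown[]) => Promise<TurnResult>,
) {
  const start = Date.now();
  const messages: unknown[] = [{ role: "user", content: query }];
  const recordedToolCalls: string[] = [];
  let promptTokens = 0;
  let completionTokens = 0;

  for (let turn = 0; turn < MAX_TURNS; turn++) {
    const result = await step(messages);

    // Token usage tracking (prompt + completion).
    promptTokens += result.usage.promptTokens;
    completionTokens += result.usage.completionTokens;

    // Tool call recording for the evaluator.
    recordedToolCalls.push(...result.toolCalls.map((c) => c.toolName));

    // No further tool calls means the model produced its final answer.
    if (result.toolCalls.length === 0) break;
  }

  return {
    toolCalls: recordedToolCalls,
    tokens: { promptTokens, completionTokens },
    durationMs: Date.now() - start, // duration measurement
  };
}
```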
 
Evaluator (evals-cli/src/evals/evaluator.ts)
Compares expected vs actual tool calls to determine pass/fail status.
Logic:
- ✅ All expected tools must be called
 - ⚠️ Additional unexpected tools are allowed (marked but don’t fail)
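In code, the comparison amounts to a set check along these lines (a sketch, not the actual implementation):

```typescript
// Sketch of the pass/fail comparison between expected and actual tool calls.
function evaluate(expected: string[], actual: string[]) {
  const called = new Set(actual);

  // Expected tools that were never called cause the test to fail.
  const missing = expected.filter((tool) => !called.has(tool));

  // Extra tools are recorded for visibility but do not fail the test.
  const unexpected = actual.filter((tool) => !expected.includes(tool));

  return { passed: missing.length === 0, missing, unexpected };
}
```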
 
RunRecorder (evals-cli/src/db/tests.ts)
Database interface for persisting evaluation results.
Two Modes:
- API Key Mode (createRunRecorder): Uses CLI-based database client
- Auth Mode (createRunRecorderWithAuth): Uses Convex HTTP client
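A hypothetical shape for the recorder is sketched below. Only the two factory names come from the file above; the interface methods are assumptions.

```typescript
// Hypothetical recorder shape. Only createRunRecorder and
// createRunRecorderWithAuth are real names from evals-cli/src/db/tests.ts;
// the interface methods are assumptions.
interface RunRecorder {
  startSuite(config: unknown): Promise<string>;
  recordIteration(result: unknown): Promise<void>;
  finishSuite(summary: unknown): Promise<void>;
}

declare function createRunRecorder(apiKey: string): RunRecorder;
declare function createRunRecorderWithAuth(convexToken: string): RunRecorder;

// The runner chooses a recorder based on how it was invoked.
function getRecorder(opts: { apiKey?: string; convexToken?: string }): RunRecorder {
  if (opts.convexToken) return createRunRecorderWithAuth(opts.convexToken);
  if (opts.apiKey) return createRunRecorder(opts.apiKey);
  throw new Error("Either an API key or a Convex auth token is required");
}
```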
Data Models
Database Schema
TypeScript Interfaces
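The authoritative definitions live in the evals-cli validators and the Convex schema. As a rough orientation only, the persisted shapes look approximately like this (field names are assumptions drawn from descriptions elsewhere in this guide):

```typescript
// Approximate data shapes; field names are assumptions drawn from this guide.
interface EvalTestCase {
  title: string;
  query: string;
  expectedToolCalls: string[];
  runs: number; // number of iterations to execute for this test case
}

interface EvalIteration {
  testCaseId: string;
  status: "pass" | "fail";
  actualToolCalls: string[];
  missingTools: string[]; // expected but not called (causes failure)
  unexpectedTools: string[]; // called but not expected (logged only)
  tokens: { prompt: number; completion: number };
  durationMs: number;
}
```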
Integration Points
LLM Providers
The system supports multiple execution paths based on the selected model.
Provider Configuration:
MCP Server Integration
Connection Workflow:
Transport Support:
- STDIO: Command execution with stdin/stdout
- HTTP/SSE: Server-Sent Events
- Streamable HTTP: Custom streaming protocol
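To make the transport options concrete, a server definition passed to the CLI might look roughly like one of the following. The field names are assumptions, not the exact validator schema:

```typescript
// Illustrative server definitions for the three supported transports.
// Field names are assumptions; see evals-cli/src/utils/validators.ts for the real schema.
type ServerDefinition =
  | { type: "stdio"; command: string; args?: string[]; env?: Record<string, string> }
  | { type: "sse"; url: string; headers?: Record<string, string> }
  | { type: "streamable-http"; url: string; headers?: Record<string, string> };

const exampleServers: Record<string, ServerDefinition> = {
  localServer: { type: "stdio", command: "node", args: ["./server.js"] },
  remoteSse: { type: "sse", url: "https://example.com/sse" },
  remoteStreamable: { type: "streamable-http", url: "https://example.com/mcp" },
};
```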
 
MCPJam Backend
Database Actions:
Contributing Guide
Adding a New LLM Provider
1. Update LLM config schema in evals-cli/src/utils/validators.ts
2. Add provider case in evals-cli/src/evals/runner.ts
3. Add to UI model list in client/src/hooks/use-chat.tsx
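As a rough guide (the enum values, factory name, and file contents below are assumptions), the first two steps usually amount to something like:

```typescript
import { z } from "zod";

// 1. evals-cli/src/utils/validators.ts -- extend the provider enum (sketch).
export const llmConfigSchema = z.object({
  provider: z.enum(["openai", "anthropic", "deepseek", "ollama", "my-new-provider"]),
  model: z.string(),
  apiKey: z.string().optional(),
});

// 2. evals-cli/src/evals/runner.ts -- add a provider case (sketch).
declare function createMyNewProviderModel(model: string, apiKey?: string): unknown; // hypothetical factory

function createModel(config: z.infer<typeof llmConfigSchema>) {
  switch (config.provider) {
    // ...cases for the existing providers...
    case "my-new-provider":
      return createMyNewProviderModel(config.model, config.apiKey);
    default:
      throw new Error(`Unsupported provider: ${config.provider}`);
  }
}

// 3. client/src/hooks/use-chat.tsx -- add the new models to the UI list (not shown).
```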
Adding a New MCP Transport
1. Update server definition schema in evals-cli/src/utils/validators.ts
2. Implement transport in Mastra MCP client (external library)
3. Update config transformer in server/utils/eval-transformer.ts
Debugging Evals
Enable verbose logging:
Testing Changes
Run evals locally:
1. Start development server: npm run dev
2. Navigate to the “Run evals” tab
3. Configure and execute a test
4. Check the browser console for errors
5. View results in the “Eval results” tab
 
Common Issues
Issue: Test cases are not created
- Check Convex auth token validity
- Verify CONVEX_URL and CONVEX_HTTP_URL environment variables
- Inspect browser network tab for failed requests

- Verify server connection status in ClientManager
- Check tool definitions in listTools() response
- Ensure tool names match exactly (case-sensitive)

- Confirm /streaming endpoint is accessible
- Check Convex auth token in request headers
- Verify model ID format (@mcpjam/...)
Performance Considerations
Optimization Strategies
- Parallel Execution: Run multiple test cases concurrently
- Tool Batching: Execute independent tools in parallel
- Database Batching: Batch iteration updates
- Caching: Cache tool definitions between iterations
 
Metrics
Key performance indicators:
- Average iteration duration: Time from start to finish
 - Token usage per iteration: Prompt + completion tokens
 - Tool execution time: Time spent in MCP calls
 - Database write time: Time to persist results
 - LLM response time: Time for each model call
 
These metrics are aggregated by helper functions in helpers.ts.
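As an illustration of the kind of aggregation those helpers perform (the function and field names here are assumptions):

```typescript
// Sketch of a metrics aggregation helper; names and shapes are illustrative.
interface IterationMetrics {
  durationMs: number;
  promptTokens: number;
  completionTokens: number;
}

function summarize(iterations: IterationMetrics[]) {
  const count = iterations.length || 1; // guard against division by zero
  const totalTokens = iterations.reduce(
    (sum, it) => sum + it.promptTokens + it.completionTokens,
    0,
  );
  return {
    averageDurationMs:
      iterations.reduce((sum, it) => sum + it.durationMs, 0) / count,
    averageTokensPerIteration: totalTokens / count,
    totalTokens,
  };
}
```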
Security Considerations
API Key Management
- Never commit API keys to version control
 - Store keys in localStorage (client) or environment variables (CLI)
 - Use Convex auth tokens for backend models (no API key exposure)
 
Input Validation
All inputs are validated with Zod schemas.
Error Handling
- Never expose internal errors to the client
 - Sanitize error messages before logging
 - Catch all exceptions in async functions
 - Validate all external inputs (LLM responses, tool results)
 
Future Enhancements
Potential areas for contribution:
- Parallel Test Execution: Run multiple test cases simultaneously
 - Custom Evaluators: Support for user-defined pass/fail criteria
 - Retry Logic: Automatic retry on transient failures
 - Result Comparison: Compare results across different models
 - Historical Analysis: Trend analysis of eval performance over time
 - Export Results: Download results as CSV/JSON
 - Shareable Suites: Share test configurations with team members
 - Scheduling: Run evals on a schedule (cron-like)
 
Glossary
| Term | Definition | 
|---|---|
| Eval Suite | A collection of test cases executed together | 
| Test Case | A single test with a query and expected tool calls | 
| Iteration | One execution of a test case (test cases can have multiple runs) | 
| Agentic Loop | Iterative LLM conversation with tool calling | 
| Tool Call | Invocation of an MCP server tool by the LLM | 
| Expected Tools | Tools that should be called for a test to pass | 
| Actual Tools | Tools that were actually called during execution | 
| Missing Tools | Expected tools that were not called (causes failure) | 
| Unexpected Tools | Tools called but not expected (logged, doesn’t fail) | 
| RunRecorder | Interface for persisting eval results to database | 
| MCPClient | Mastra client for communicating with MCP servers | 
Resources
- MCP Specification: https://spec.modelcontextprotocol.io
 - Mastra MCP Client: https://mastra.dev
 - Convex Database: https://convex.dev
 - Vercel AI SDK: https://sdk.vercel.ai
 
Questions?
If you have questions or need help contributing:
- Check the GitHub Issues
 - Join our Discord community
 - Read the main Contributing Guide
 

