Testing Strategies for AI Agents
A practical guide to testing AI agents including unit testing tools, integration testing agent loops, evaluation frameworks, and mock LLM strategies used in Klivvr Agent.
Testing AI agents is fundamentally different from testing deterministic software. An agent's behavior depends on an LLM that can produce different outputs for the same input. Tool calls happen in sequences that vary by conversation. And the correctness of a response is often subjective — a helpful answer phrased differently is still correct. Despite these challenges, agents need rigorous testing. In fintech, an untested agent is an unacceptable risk.
This article covers the testing strategies used in Klivvr Agent, organized as a testing pyramid: unit tests at the base, integration tests in the middle, and evaluation tests at the top.
Unit Testing Tools
Tools are deterministic — given the same input, they produce the same output. This makes them the easiest component to test and the most important to test thoroughly, since tool bugs directly affect agent reliability.
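The tests below exercise a `createLookupCustomerTool` factory whose implementation is not shown in this article. As a point of reference, a minimal sketch might look like the following — the types, validation style, and result shape are assumptions, not the production code:

```typescript
// Hypothetical sketch of the factory under test. The real tool
// likely validates parameters with a schema library; here the
// check is done by hand to keep the sketch dependency-free.
type LookupParams = { identifier: string; identifierType: "id" | "email" };
type ToolResult = {
  success: boolean;
  data?: Record<string, unknown>;
  error?: string;
};

interface CustomerService {
  findBy(type: string, value: string): Promise<Record<string, unknown> | null>;
}

function createLookupCustomerTool(service: CustomerService) {
  return {
    name: "lookup_customer",
    async execute(params: LookupParams): Promise<ToolResult> {
      // Reject parameters that do not match the declared schema
      if (params.identifierType !== "id" && params.identifierType !== "email") {
        return { success: false, error: "invalid identifierType" };
      }
      try {
        const customer = await service.findBy(
          params.identifierType,
          params.identifier
        );
        if (!customer) {
          // A missing customer is a successful lookup with found: false,
          // not an error — the agent can phrase a helpful reply from it
          return { success: true, data: { found: false } };
        }
        return {
          success: true,
          data: {
            found: true,
            name: customer.fullName,
            recentTransactions: customer.transactions,
          },
        };
      } catch (err) {
        // Surface service failures as structured errors; never throw
        return { success: false, error: (err as Error).message };
      }
    },
  };
}
```

Note the contract this sketch assumes: the tool never throws, and every failure mode is encoded in the `success`/`error` fields so the agent loop can feed it back to the LLM.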
import { describe, it, expect, vi } from "vitest";
// createLookupCustomerTool is the factory under test, imported from the tool module
describe("lookup_customer tool", () => {
const mockCustomerService = {
findBy: vi.fn(),
};
const tool = createLookupCustomerTool(mockCustomerService);
it("returns customer data for valid ID", async () => {
mockCustomerService.findBy.mockResolvedValue({
id: "cust_123",
fullName: "Ahmed Hassan",
email: "ahmed@example.com",
status: "active",
kycLevel: "verified",
formattedBalance: "5,230.00 EGP",
transactions: [
{ id: "txn_1", amount: 100, date: "2025-07-01" },
{ id: "txn_2", amount: 250, date: "2025-07-02" },
],
});
const result = await tool.execute({
identifier: "cust_123",
identifierType: "id",
});
expect(result.success).toBe(true);
expect(result.data?.found).toBe(true);
expect(result.data?.name).toBe("Ahmed Hassan");
expect(result.data?.recentTransactions).toHaveLength(2);
});
it("returns not_found for missing customer", async () => {
mockCustomerService.findBy.mockResolvedValue(null);
const result = await tool.execute({
identifier: "nonexistent@email.com",
identifierType: "email",
});
expect(result.success).toBe(true);
expect(result.data?.found).toBe(false);
});
it("rejects invalid parameter schemas", async () => {
const result = await tool.execute({
identifier: "cust_123",
identifierType: "invalid_type",
} as any);
expect(result.success).toBe(false);
expect(result.error).toContain("invalid");
});
it("handles service errors gracefully", async () => {
mockCustomerService.findBy.mockRejectedValue(
new Error("Database connection timeout")
);
const result = await tool.execute({
identifier: "cust_123",
identifierType: "id",
});
expect(result.success).toBe(false);
expect(result.error).toContain("timeout");
});
});
Mock LLM for Deterministic Testing
Integration tests require a mock LLM that returns predictable responses. The mock LLM uses scripted responses keyed on conversation patterns.
class MockLLMClient implements LLMClient {
private responses: Array<{
match: (messages: Message[]) => boolean;
response: Message;
}> = [];
addResponse(
matcher: (messages: Message[]) => boolean,
response: Message
): void {
this.responses.push({ match: matcher, response });
}
addToolCallResponse(
toolName: string,
args: Record<string, unknown>
): void {
this.responses.push({
match: () => true,
response: {
role: "assistant",
content: "",
toolCalls: [
{
id: `call_${Date.now()}`,
name: toolName,
arguments: args,
},
],
},
});
}
addTextResponse(text: string): void {
this.responses.push({
match: () => true,
response: { role: "assistant", content: text },
});
}
async chat(params: {
model: string;
messages: Message[];
tools: ToolDefinition[];
temperature: number;
maxTokens: number;
}): Promise<Message> {
for (let i = 0; i < this.responses.length; i++) {
const { match, response } = this.responses[i];
if (match(params.messages)) {
// Remove the matched response so the next call gets the next one
this.responses.splice(i, 1);
return response;
}
}
// Default: return a completion (no tool calls)
return { role: "assistant", content: "I cannot help with that." };
}
}
Scenario-Based Integration Tests
Integration tests verify that the agent handles complete conversation scenarios correctly — from user input through tool calls to final response.
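Both scenarios below rely on a `createMockTools` helper that is not shown in the article. A minimal sketch, assuming tools with canned results so integration tests never touch real services, might look like:

```typescript
// Hypothetical createMockTools helper: each tool returns a fixed
// result, so the only variable under test is the agent loop itself.
type ToolResult = { success: boolean; data?: Record<string, unknown> };

function createMockTools() {
  return [
    {
      name: "lookup_customer",
      description: "Look up a customer by id or email",
      async execute(_args: Record<string, unknown>): Promise<ToolResult> {
        return {
          success: true,
          data: {
            found: true,
            name: "Ahmed Hassan",
            formattedBalance: "5,230.00 EGP",
          },
        };
      },
    },
    {
      name: "issue_refund",
      description: "Issue a refund for a transaction",
      async execute(_args: Record<string, unknown>): Promise<ToolResult> {
        return { success: true, data: { refundId: "ref_1" } };
      },
    },
  ];
}
```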
describe("customer balance inquiry scenario", () => {
it("looks up customer and reports balance", async () => {
const mockLLM = new MockLLMClient();
const mockTools = createMockTools();
// Script the LLM responses
mockLLM.addToolCallResponse("lookup_customer", {
identifier: "ahmed@example.com",
identifierType: "email",
});
mockLLM.addTextResponse(
"Your current balance is 5,230.00 EGP. Is there anything else?"
);
const agent = new Agent(
{
model: "claude-sonnet-4-6",
systemPrompt: "You are a banking support agent.",
tools: mockTools,
maxSteps: 10,
maxTokens: 4096,
temperature: 0,
},
mockLLM,
new ToolExecutor(mockTools)
);
const result = await agent.run(
"What is my balance? My email is ahmed@example.com"
);
expect(result.status).toBe("completed");
expect(result.toolResults).toHaveLength(1);
expect(result.toolResults[0].toolName).toBe("lookup_customer");
const lastMessage = result.messages[result.messages.length - 1];
expect(lastMessage.content).toContain("5,230");
});
});
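The next scenario uses a `createAgentWithGuardrails` helper, which the article does not show. One plausible shape for its core check — a pre-flight guardrail that blocks over-limit refunds before the tool ever executes — is sketched below; the limit constant and function names are assumptions:

```typescript
// Hypothetical guardrail check. A createAgentWithGuardrails helper
// could run this before every tool execution and, on a block,
// return the rejection to the LLM instead of calling the tool.
const AUTO_APPROVAL_LIMIT_EGP = 500;

type ToolCall = { name: string; arguments: Record<string, unknown> };

function checkGuardrails(call: ToolCall): { allowed: boolean; reason?: string } {
  if (call.name === "issue_refund") {
    const amount = Number(call.arguments.amount ?? 0);
    if (amount > AUTO_APPROVAL_LIMIT_EGP) {
      return { allowed: false, reason: "refund exceeds auto-approval limit" };
    }
  }
  return { allowed: true };
}
```

Because the check runs in the tool-execution layer rather than in the prompt, the test below can assert a hard guarantee: the refund tool was never invoked, regardless of what the LLM asked for.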
describe("refund with guardrail scenario", () => {
it("blocks refund exceeding auto-approval limit", async () => {
const mockLLM = new MockLLMClient();
const mockTools = createMockTools();
mockLLM.addToolCallResponse("issue_refund", {
transactionId: "txn_large",
reason: "customer_request",
amount: 1000, // Exceeds 500 EGP limit
});
mockLLM.addTextResponse(
"I apologize, but this refund requires manager approval. " +
"Let me escalate this for you."
);
const agent = createAgentWithGuardrails(mockLLM, mockTools);
const result = await agent.run("I need a refund of 1000 EGP for txn_large");
// The refund tool should NOT have been executed
const refundResults = result.toolResults.filter(
(r) => r.toolName === "issue_refund"
);
expect(refundResults).toHaveLength(0);
});
});
Evaluation Framework
Beyond unit and integration tests, evaluation tests measure agent quality across a dataset of real-world scenarios. Each scenario has an expected behavior that is scored.
interface EvalCase {
id: string;
input: string;
expectedToolCalls?: string[];
expectedTopics?: string[];
mustNotContain?: string[];
maxSteps?: number;
}
interface EvalResult {
caseId: string;
passed: boolean;
scores: {
toolAccuracy: number; // Did it call the right tools?
topicCoverage: number; // Did it address the right topics?
safety: number; // Did it avoid forbidden content?
efficiency: number; // Did it complete in few steps?
};
details: string;
}
class AgentEvaluator {
async evaluate(
agent: Agent,
cases: EvalCase[]
): Promise<EvalResult[]> {
const results: EvalResult[] = [];
for (const evalCase of cases) {
const agentResult = await agent.run(evalCase.input);
const toolsUsed = agentResult.toolResults.map((r) => r.toolName);
const responseText =
agentResult.messages[agentResult.messages.length - 1]?.content ?? "";
// Score tool accuracy
const toolAccuracy = evalCase.expectedToolCalls
? this.calculateOverlap(toolsUsed, evalCase.expectedToolCalls)
: 1;
// Score topic coverage
const topicCoverage = evalCase.expectedTopics
? evalCase.expectedTopics.filter((t) =>
responseText.toLowerCase().includes(t.toLowerCase())
).length / evalCase.expectedTopics.length
: 1;
// Score safety
const safety = evalCase.mustNotContain
? evalCase.mustNotContain.every(
(term) => !responseText.toLowerCase().includes(term.toLowerCase())
)
? 1
: 0
: 1;
// Score efficiency
const efficiency = evalCase.maxSteps
? agentResult.stepCount <= evalCase.maxSteps
? 1
: 0.5
: 1;
results.push({
caseId: evalCase.id,
passed: toolAccuracy >= 0.8 && safety === 1,
scores: { toolAccuracy, topicCoverage, safety, efficiency },
details: `Tools: [${toolsUsed.join(", ")}], Steps: ${agentResult.stepCount}`,
});
}
return results;
}
private calculateOverlap(actual: string[], expected: string[]): number {
const matched = expected.filter((e) => actual.includes(e));
return matched.length / expected.length;
}
}
Continuous Evaluation in CI
Evaluation tests run in CI on every pull request that modifies agent code, tools, or system prompts. Regressions are caught before they reach production.
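The cases loaded by the runner below live as JSON files under `eval/cases/`. For illustration, one case matching the `EvalCase` shape defined above might look like this — the id, topics, and forbidden terms are invented for the example:

```typescript
// Hypothetical contents of one eval case file
// (e.g. eval/cases/balance-inquiry.json), shown as an object literal.
const balanceInquiryCase = {
  id: "balance-inquiry-001",
  input: "What is my balance? My email is ahmed@example.com",
  expectedToolCalls: ["lookup_customer"],
  expectedTopics: ["balance"],
  // The agent must never surface credentials or card details
  mustNotContain: ["password", "card number"],
  maxSteps: 3,
};
```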
// eval/run-eval.ts
const evaluator = new AgentEvaluator();
const cases = loadEvalCases("eval/cases/*.json");
const results = await evaluator.evaluate(agent, cases);
const passRate = results.filter((r) => r.passed).length / results.length;
const avgToolAccuracy =
results.reduce((sum, r) => sum + r.scores.toolAccuracy, 0) / results.length;
console.log(`Pass rate: ${(passRate * 100).toFixed(1)}%`);
console.log(`Avg tool accuracy: ${(avgToolAccuracy * 100).toFixed(1)}%`);
// Fail CI if quality drops below threshold
if (passRate < 0.9) {
console.error("Evaluation pass rate below 90% threshold");
process.exit(1);
}
Conclusion
Testing AI agents requires a layered approach: unit tests for tools and guardrails, integration tests for conversation scenarios, and evaluation tests for overall quality. Mock LLMs make integration tests deterministic and fast. Evaluation frameworks measure quality across diverse scenarios. And CI integration ensures that quality does not regress. In Klivvr Agent, this testing pyramid gives us confidence that the agent will behave correctly, safely, and efficiently in production — even as tools, prompts, and models evolve.
Related Articles
AI Agents in Fintech Operations
How AI agents automate fintech operational workflows including compliance monitoring, fraud detection, dispute resolution, and regulatory reporting — with insights from Klivvr Agent deployments.
Human-in-the-Loop Patterns for AI Agents
How to design effective human-in-the-loop workflows for AI agents, covering escalation policies, approval workflows, the autonomy ladder, and trust-building strategies.
Multi-Agent Systems in TypeScript
Architecture patterns for multi-agent systems including supervisor topologies, agent-to-agent communication, task delegation, and shared state management in Klivvr Agent.