Testing Strategies for AI Agents
A practical guide to testing AI agents including unit testing tools, integration testing agent loops, evaluation frameworks, and mock LLM strategies used in Klivvr Agent.
Testing AI agents is fundamentally different from testing deterministic software. An agent's behavior depends on an LLM that can produce different outputs for the same input. Tool calls happen in sequences that vary by conversation. And the correctness of a response is often subjective — a helpful answer phrased differently is still correct. Despite these challenges, agents need rigorous testing. In fintech, an untested agent is an unacceptable risk.
This article covers the testing strategies used in Klivvr Agent, organized as a testing pyramid: unit tests at the base, integration tests in the middle, and evaluation tests at the top.
Unit Testing Tools
Tools are deterministic — given the same input, they produce the same output. This makes them the easiest component to test and the most important to test thoroughly, since tool bugs directly affect agent reliability.
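The tests below exercise a `createLookupCustomerTool` factory whose implementation is not shown in this article. As a point of reference, a minimal sketch might look like the following — the types, validation style, and result shape are assumptions, not the production code:

```typescript
// Hypothetical sketch of the factory under test. The real tool
// likely validates parameters with a schema library; here the
// check is done by hand to keep the sketch dependency-free.
type LookupParams = { identifier: string; identifierType: "id" | "email" };
type ToolResult = {
  success: boolean;
  data?: Record<string, unknown>;
  error?: string;
};

interface CustomerService {
  findBy(type: string, value: string): Promise<Record<string, unknown> | null>;
}

function createLookupCustomerTool(service: CustomerService) {
  return {
    name: "lookup_customer",
    async execute(params: LookupParams): Promise<ToolResult> {
      // Reject parameters that do not match the declared schema
      if (params.identifierType !== "id" && params.identifierType !== "email") {
        return { success: false, error: "invalid identifierType" };
      }
      try {
        const customer = await service.findBy(
          params.identifierType,
          params.identifier
        );
        if (!customer) {
          // A missing customer is a successful lookup with found: false,
          // not an error — the agent can phrase a helpful reply from it
          return { success: true, data: { found: false } };
        }
        return {
          success: true,
          data: {
            found: true,
            name: customer.fullName,
            recentTransactions: customer.transactions,
          },
        };
      } catch (err) {
        // Surface service failures as structured errors; never throw
        return { success: false, error: (err as Error).message };
      }
    },
  };
}
```

Note the contract this sketch assumes: the tool never throws, and every failure mode is encoded in the `success`/`error` fields so the agent loop can feed it back to the LLM.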
import { describe, it, expect, vi } from "vitest";
// createLookupCustomerTool is the factory under test, imported from the tool module
describe("lookup_customer tool", () => {
const mockCustomerService = {
findBy: vi.fn(),
};
const tool = createLookupCustomerTool(mockCustomerService);
it("returns customer data for valid ID", async () => {
mockCustomerService.findBy.mockResolvedValue({
id: "cust_123",
fullName: "Ahmed Hassan",
email: "ahmed@example.com",
status: "active",
kycLevel: "verified",
formattedBalance: "5,230.00 EGP",
transactions: [
{ id: "txn_1", amount: 100, date: "2025-07-01" },
{ id: "txn_2", amount: 250, date: "2025-07-02" },
],
});
const result = await tool.execute({
identifier: "cust_123",
identifierType: "id",
});
expect(result.success).toBe(true);
expect(result.data?.found).toBe(true);
expect(result.data?.name).toBe("Ahmed Hassan");
expect(result.data?.recentTransactions).toHaveLength(2);
});
it("returns not_found for missing customer", async () => {
mockCustomerService.findBy.mockResolvedValue(null);
const result = await tool.execute({
identifier: "nonexistent@email.com",
identifierType: "email",
});
expect(result.success).toBe(true);
expect(result.data?.found).toBe(false);
});
it("rejects invalid parameter schemas", async () => {
const result = await tool.execute({
identifier: "cust_123",
identifierType: "invalid_type",
} as any);
expect(result.success).toBe(false);
expect(result.error).toContain("invalid");
});
it("handles service errors gracefully", async () => {
mockCustomerService.findBy.mockRejectedValue(
new Error("Database connection timeout")
);
const result = await tool.execute({
identifier: "cust_123",
identifierType: "id",
});
expect(result.success).toBe(false);
expect(result.error).toContain("timeout");
});
});
Mock LLM for Deterministic Testing
Integration tests require a mock LLM that returns predictable responses. The mock LLM uses scripted responses keyed on conversation patterns.
class MockLLMClient implements LLMClient {
private responses: Array<{
match: (messages: Message[]) => boolean;
response: Message;
}> = [];
addResponse(
matcher: (messages: Message[]) => boolean,
response: Message
): void {
this.responses.push({ match: matcher, response });
}
addToolCallResponse(
toolName: string,
args: Record<string, unknown>
): void {
this.responses.push({
match: () => true,
response: {
role: "assistant",
content: "",
toolCalls: [
{
id: `call_${Date.now()}`,
name: toolName,
arguments: args,
},
],
},
});
}
addTextResponse(text: string): void {
this.responses.push({
match: () => true,
response: { role: "assistant", content: text },
});
}
async chat(params: {
model: string;
messages: Message[];
tools: ToolDefinition[];
temperature: number;
maxTokens: number;
}): Promise<Message> {
for (let i = 0; i < this.responses.length; i++) {
const { match, response } = this.responses[i];
if (match(params.messages)) {
// Remove the matched response so the next call gets the next one
this.responses.splice(i, 1);
return response;
}
}
// Default: return a completion (no tool calls)
return { role: "assistant", content: "I cannot help with that." };
}
}
Scenario-Based Integration Tests
Integration tests verify that the agent handles complete conversation scenarios correctly — from user input through tool calls to final response.
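Both scenarios below rely on a `createMockTools` helper that is not shown in the article. A minimal sketch, assuming tools with canned results so integration tests never touch real services, might look like:

```typescript
// Hypothetical createMockTools helper: each tool returns a fixed
// result, so the only variable under test is the agent loop itself.
type ToolResult = { success: boolean; data?: Record<string, unknown> };

function createMockTools() {
  return [
    {
      name: "lookup_customer",
      description: "Look up a customer by id or email",
      async execute(_args: Record<string, unknown>): Promise<ToolResult> {
        return {
          success: true,
          data: {
            found: true,
            name: "Ahmed Hassan",
            formattedBalance: "5,230.00 EGP",
          },
        };
      },
    },
    {
      name: "issue_refund",
      description: "Issue a refund for a transaction",
      async execute(_args: Record<string, unknown>): Promise<ToolResult> {
        return { success: true, data: { refundId: "ref_1" } };
      },
    },
  ];
}
```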
describe("customer balance inquiry scenario", () => {
it("looks up customer and reports balance", async () => {
const mockLLM = new MockLLMClient();
const mockTools = createMockTools();
// Script the LLM responses
mockLLM.addToolCallResponse("lookup_customer", {
identifier: "ahmed@example.com",
identifierType: "email",
});
mockLLM.addTextResponse(
"Your current balance is 5,230.00 EGP. Is there anything else?"
);
const agent = new Agent(
{
model: "claude-sonnet-4-6",
systemPrompt: "You are a banking support agent.",
tools: mockTools,
maxSteps: 10,
maxTokens: 4096,
temperature: 0,
},
mockLLM,
new ToolExecutor(mockTools)
);
const result = await agent.run(
"What is my balance? My email is ahmed@example.com"
);
expect(result.status).toBe("completed");
expect(result.toolResults).toHaveLength(1);
expect(result.toolResults[0].toolName).toBe("lookup_customer");
const lastMessage = result.messages[result.messages.length - 1];
expect(lastMessage.content).toContain("5,230");
});
});
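The next scenario uses a `createAgentWithGuardrails` helper, which the article does not show. One plausible shape for its core check — a pre-flight guardrail that blocks over-limit refunds before the tool ever executes — is sketched below; the limit constant and function names are assumptions:

```typescript
// Hypothetical guardrail check. A createAgentWithGuardrails helper
// could run this before every tool execution and, on a block,
// return the rejection to the LLM instead of calling the tool.
const AUTO_APPROVAL_LIMIT_EGP = 500;

type ToolCall = { name: string; arguments: Record<string, unknown> };

function checkGuardrails(call: ToolCall): { allowed: boolean; reason?: string } {
  if (call.name === "issue_refund") {
    const amount = Number(call.arguments.amount ?? 0);
    if (amount > AUTO_APPROVAL_LIMIT_EGP) {
      return { allowed: false, reason: "refund exceeds auto-approval limit" };
    }
  }
  return { allowed: true };
}
```

Because the check runs in the tool-execution layer rather than in the prompt, the test below can assert a hard guarantee: the refund tool was never invoked, regardless of what the LLM asked for.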
describe("refund with guardrail scenario", () => {
it("blocks refund exceeding auto-approval limit", async () => {
const mockLLM = new MockLLMClient();
const mockTools = createMockTools();
mockLLM.addToolCallResponse("issue_refund", {
transactionId: "txn_large",
reason: "customer_request",
amount: 1000, // Exceeds 500 EGP limit
});
mockLLM.addTextResponse(
"I apologize, but this refund requires manager approval. " +
"Let me escalate this for you."
);
const agent = createAgentWithGuardrails(mockLLM, mockTools);
const result = await agent.run("I need a refund of 1000 EGP for txn_large");
// The refund tool should NOT have been executed
const refundResults = result.toolResults.filter(
(r) => r.toolName === "issue_refund"
);
expect(refundResults).toHaveLength(0);
});
});
Evaluation Framework
Beyond unit and integration tests, evaluation tests measure agent quality across a dataset of real-world scenarios. Each scenario has an expected behavior that is scored.
interface EvalCase {
id: string;
input: string;
expectedToolCalls?: string[];
expectedTopics?: string[];
mustNotContain?: string[];
maxSteps?: number;
}
interface EvalResult {
caseId: string;
passed: boolean;
scores: {
toolAccuracy: number; // Did it call the right tools?
topicCoverage: number; // Did it address the right topics?
safety: number; // Did it avoid forbidden content?
efficiency: number; // Did it complete in few steps?
};
details: string;
}
class AgentEvaluator {
async evaluate(
agent: Agent,
cases: EvalCase[]
): Promise<EvalResult[]> {
const results: EvalResult[] = [];
for (const evalCase of cases) {
const agentResult = await agent.run(evalCase.input);
const toolsUsed = agentResult.toolResults.map((r) => r.toolName);
const responseText =
agentResult.messages[agentResult.messages.length - 1]?.content ?? "";
// Score tool accuracy
const toolAccuracy = evalCase.expectedToolCalls
? this.calculateOverlap(toolsUsed, evalCase.expectedToolCalls)
: 1;
// Score topic coverage
const topicCoverage = evalCase.expectedTopics
? evalCase.expectedTopics.filter((t) =>
responseText.toLowerCase().includes(t.toLowerCase())
).length / evalCase.expectedTopics.length
: 1;
// Score safety
const safety = evalCase.mustNotContain
? evalCase.mustNotContain.every(
(term) => !responseText.toLowerCase().includes(term.toLowerCase())
)
? 1
: 0
: 1;
// Score efficiency
const efficiency = evalCase.maxSteps
? agentResult.stepCount <= evalCase.maxSteps
? 1
: 0.5
: 1;
results.push({
caseId: evalCase.id,
passed: toolAccuracy >= 0.8 && safety === 1,
scores: { toolAccuracy, topicCoverage, safety, efficiency },
details: `Tools: [${toolsUsed.join(", ")}], Steps: ${agentResult.stepCount}`,
});
}
return results;
}
private calculateOverlap(actual: string[], expected: string[]): number {
const matched = expected.filter((e) => actual.includes(e));
return matched.length / expected.length;
}
}
Continuous Evaluation in CI
Evaluation tests run in CI on every pull request that modifies agent code, tools, or system prompts. Regressions are caught before they reach production.
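The cases loaded by the runner below live as JSON files under `eval/cases/`. For illustration, one case matching the `EvalCase` shape defined above might look like this — the id, topics, and forbidden terms are invented for the example:

```typescript
// Hypothetical contents of one eval case file
// (e.g. eval/cases/balance-inquiry.json), shown as an object literal.
const balanceInquiryCase = {
  id: "balance-inquiry-001",
  input: "What is my balance? My email is ahmed@example.com",
  expectedToolCalls: ["lookup_customer"],
  expectedTopics: ["balance"],
  // The agent must never surface credentials or card details
  mustNotContain: ["password", "card number"],
  maxSteps: 3,
};
```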
// eval/run-eval.ts
const evaluator = new AgentEvaluator();
const cases = loadEvalCases("eval/cases/*.json");
const results = await evaluator.evaluate(agent, cases);
const passRate = results.filter((r) => r.passed).length / results.length;
const avgToolAccuracy =
results.reduce((sum, r) => sum + r.scores.toolAccuracy, 0) / results.length;
console.log(`Pass rate: ${(passRate * 100).toFixed(1)}%`);
console.log(`Avg tool accuracy: ${(avgToolAccuracy * 100).toFixed(1)}%`);
// Fail CI if quality drops below threshold
if (passRate < 0.9) {
console.error("Evaluation pass rate below 90% threshold");
process.exit(1);
}
Conclusion
Testing AI agents requires a layered approach: unit tests for tools and guardrails, integration tests for conversation scenarios, and evaluation tests for overall quality. Mock LLMs make integration tests deterministic and fast. Evaluation frameworks measure quality across diverse scenarios. And CI integration ensures that quality does not regress. In Klivvr Agent, this testing pyramid gives us confidence that the agent will behave correctly, safely, and efficiently in production — even as tools, prompts, and models evolve.
Related Articles
AI Agents in Fintech Operations
How AI agents automate fintech operational workflows including compliance monitoring, fraud detection, dispute resolution, and regulatory reporting — with insights from Klivvr Agent deployments.
Human-in-the-Loop Patterns for AI Agents
How to design effective human-in-the-loop workflows for AI agents, covering escalation policies, approval workflows, the autonomy ladder, and trust-building strategies.
Multi-Agent Systems in TypeScript
Architecture patterns for multi-agent systems including supervisor topologies, agent-to-agent communication, task delegation, and shared state management in Klivvr Agent.