Evaluating AI Agent Performance: Metrics and Benchmarks

A framework for measuring AI agent quality across retrieval accuracy, answer correctness, latency, and cost, with Python implementations for automated evaluation pipelines.

Technical · 11 min read · By Klivvr Engineering

You cannot improve what you cannot measure, and measuring AI agent quality is harder than measuring traditional software quality. A unit test asserts that a function returns an exact expected value. An AI agent returns natural language that can be correct in many different phrasings, partially correct, or subtly wrong in ways that are difficult to detect programmatically. Without a rigorous evaluation framework, you are flying blind. Every prompt change, model upgrade, or retrieval tweak might improve one dimension of quality while silently degrading another.

Data Whispal Agent serves analytics queries where accuracy is non-negotiable. A user asking "What was our churn rate last quarter?" needs a precise number from the data, not a plausible-sounding guess. We invested heavily in building an evaluation pipeline that measures every dimension of agent quality and runs automatically on every code change. This article describes the metrics we track, the evaluation methods we use, and the Python code that automates the entire process.

The Four Dimensions of Agent Quality

We evaluate Data Whispal Agent along four dimensions: retrieval quality, answer correctness, response quality, and operational performance. Each dimension has its own metrics, and they interact in important ways.

Retrieval quality measures whether the system fetches the right documents before the LLM ever sees them. If retrieval fails, no amount of prompt engineering can save the answer.

Answer correctness measures whether the final response contains accurate information. This is the dimension users care about most.

Response quality measures the structure, clarity, and completeness of the answer. A correct but incoherent response is still a failure.

Operational performance measures latency, cost, and reliability. A perfect answer that takes 30 seconds and costs a dollar per query is not viable.

from dataclasses import dataclass, field
from enum import Enum
 
class Dimension(str, Enum):
    RETRIEVAL = "retrieval"
    CORRECTNESS = "correctness"
    RESPONSE_QUALITY = "response_quality"
    OPERATIONAL = "operational"
 
@dataclass
class EvaluationResult:
    query_id: str
    query: str
    metrics: dict[str, float] = field(default_factory=dict)
    dimension_scores: dict[Dimension, float] = field(default_factory=dict)
    metadata: dict = field(default_factory=dict)
 
@dataclass
class EvaluationSuite:
    cases: list[dict]
    name: str
    version: str
 
    def summary(self, results: list[EvaluationResult]) -> dict:
        summary = {}
        all_metrics = set()
        for r in results:
            all_metrics.update(r.metrics.keys())
        for metric in all_metrics:
            values = [
                r.metrics[metric] for r in results if metric in r.metrics
            ]
            summary[metric] = {
                "mean": sum(values) / len(values) if values else 0,
                "min": min(values) if values else 0,
                "max": max(values) if values else 0,
            }
        return summary

Retrieval Evaluation

Retrieval evaluation compares the documents retrieved for each query against a ground-truth set of relevant documents. The core metrics are precision, recall, and Mean Reciprocal Rank (MRR).

Precision@k measures what fraction of the top-k retrieved documents are relevant. Recall@k measures what fraction of all relevant documents appear in the top-k results. MRR measures the rank position of the first relevant document, rewarding systems that place relevant results higher.

import numpy as np
 
def precision_at_k(retrieved_ids: list[str], relevant_ids: set[str], k: int) -> float:
    """Fraction of top-k results that are relevant."""
    top_k = retrieved_ids[:k]
    relevant_in_top_k = sum(1 for doc_id in top_k if doc_id in relevant_ids)
    return relevant_in_top_k / k
 
def recall_at_k(retrieved_ids: list[str], relevant_ids: set[str], k: int) -> float:
    """Fraction of all relevant docs found in top-k results."""
    if not relevant_ids:
        return 1.0
    top_k = set(retrieved_ids[:k])
    found = len(top_k & relevant_ids)
    return found / len(relevant_ids)
 
def mean_reciprocal_rank(retrieved_ids: list[str], relevant_ids: set[str]) -> float:
    """Reciprocal of the rank of the first relevant document."""
    for rank, doc_id in enumerate(retrieved_ids, 1):
        if doc_id in relevant_ids:
            return 1.0 / rank
    return 0.0
 
def ndcg_at_k(
    retrieved_ids: list[str],
    relevance_scores: dict[str, float],
    k: int,
) -> float:
    """Normalized Discounted Cumulative Gain at k."""
    dcg = 0.0
    for i, doc_id in enumerate(retrieved_ids[:k]):
        rel = relevance_scores.get(doc_id, 0.0)
        dcg += rel / np.log2(i + 2)  # +2 because rank starts at 1
 
    # Ideal DCG
    ideal_rels = sorted(relevance_scores.values(), reverse=True)[:k]
    idcg = sum(rel / np.log2(i + 2) for i, rel in enumerate(ideal_rels))
 
    return dcg / idcg if idcg > 0 else 0.0
 
def evaluate_retrieval(
    retriever,
    eval_cases: list[dict],
    k_values: tuple[int, ...] = (1, 3, 5, 10),
) -> dict:
    """Run full retrieval evaluation across all cases."""
    results = {f"precision@{k}": [] for k in k_values}
    results.update({f"recall@{k}": [] for k in k_values})
    results["mrr"] = []
 
    for case in eval_cases:
        query = case["query"]
        relevant_ids = set(case["relevant_doc_ids"])
 
        retrieved_docs = retriever.invoke(query)
        retrieved_ids = [doc.metadata["id"] for doc in retrieved_docs]
 
        results["mrr"].append(mean_reciprocal_rank(retrieved_ids, relevant_ids))
        for k in k_values:
            results[f"precision@{k}"].append(
                precision_at_k(retrieved_ids, relevant_ids, k)
            )
            results[f"recall@{k}"].append(
                recall_at_k(retrieved_ids, relevant_ids, k)
            )
 
    return {
        metric: round(np.mean(values), 4)
        for metric, values in results.items()
    }
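
The definitions above are easy to sanity-check by hand on a toy ranking. This standalone sketch computes the same three metrics inline rather than calling the functions above:

```python
# A toy ranking: two relevant documents, one of them in the top 3.
retrieved = ["d1", "d2", "d3", "d4"]
relevant = {"d2", "d4"}
k = 3

top_k = retrieved[:k]
precision = sum(d in relevant for d in top_k) / k         # 1 of 3 -> 1/3
recall = len(set(top_k) & relevant) / len(relevant)       # 1 of 2 -> 0.5
mrr = next(
    (1.0 / rank for rank, d in enumerate(retrieved, 1) if d in relevant),
    0.0,
)  # first relevant doc sits at rank 2 -> 0.5
```

Note how precision and recall can disagree: retrieving more documents raises recall while diluting precision, which is why we track both at several values of k.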

Building the ground-truth dataset is the hardest part of retrieval evaluation. We use a combination of manual labeling and LLM-assisted annotation. Domain experts label a seed set of 50 queries, and we use an LLM to extend this to 200+ queries with human review of borderline cases.
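
Concretely, a single labeled case might look like this. The values are illustrative, but the field names mirror those consumed by the evaluation code in this article (`query`, `relevant_doc_ids`, `context`, `expected_answer`, `expected_facts`):

```python
# One entry of the eval dataset (illustrative values).
eval_case = {
    "query": "What was our churn rate last quarter?",
    "relevant_doc_ids": ["doc_churn_q3", "doc_retention_summary"],
    "context": "Q3 churn rate: 3.2% ...",
    "expected_answer": "Churn was 3.2% in Q3.",
    "expected_facts": [
        {"value": 3.2, "tolerance": 0.01},  # the number the answer must contain
    ],
}

# Retrieval evaluation needs at least the query and its relevant docs.
assert {"query", "relevant_doc_ids"} <= set(eval_case)
```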

Answer Correctness Evaluation

Answer correctness is the most important and most difficult dimension to evaluate. We use three complementary approaches: exact fact matching, LLM-as-judge scoring, and human evaluation on a rotating sample.

Exact fact matching checks whether specific expected data points appear in the answer. This is the most reliable automated method but only works for factual queries with known answers.

import re
 
def extract_numbers(text: str) -> list[float]:
    """Extract all numerical values from text.

    A single alternation pattern is used so each token matches at most
    once. Running separate patterns over the text would double-count:
    "$1.2M" would match both the currency and plain-number patterns.
    """
    pattern = (
        r'\$[\d,]+\.?\d*[MBK]?'   # Currency: $1.2M, $500K
        r'|[\d,]+\.?\d*%'         # Percentages: 15.3%
        r'|[\d,]+\.?\d*'          # Plain numbers: 1,234.56
    )
    numbers = []
    for match in re.findall(pattern, text):
        cleaned = match.replace('$', '').replace('%', '').replace(',', '')
        multiplier = 1
        if cleaned.endswith('M'):
            multiplier = 1_000_000
            cleaned = cleaned[:-1]
        elif cleaned.endswith('B'):
            multiplier = 1_000_000_000
            cleaned = cleaned[:-1]
        elif cleaned.endswith('K'):
            multiplier = 1_000
            cleaned = cleaned[:-1]
        try:
            numbers.append(float(cleaned) * multiplier)
        except ValueError:
            continue
    return numbers
 
def fact_check_answer(
    answer: str,
    expected_facts: list[dict],
) -> dict:
    """Check if expected facts appear in the answer.
 
    Each expected fact has:
    - 'value': the expected value (string or number)
    - 'tolerance': acceptable deviation for numbers (default 0.01)
    """
    results = {"total": len(expected_facts), "found": 0, "missing": []}
 
    for fact in expected_facts:
        value = fact["value"]
        found = False
 
        if isinstance(value, (int, float)):
            tolerance = fact.get("tolerance", 0.01)
            answer_numbers = extract_numbers(answer)
            for num in answer_numbers:
                if abs(num - value) / max(abs(value), 1e-10) <= tolerance:
                    found = True
                    break
        else:
            found = str(value).lower() in answer.lower()
 
        if found:
            results["found"] += 1
        else:
            results["missing"].append(fact)
 
    results["score"] = results["found"] / results["total"] if results["total"] > 0 else 1.0
    return results
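
The numeric comparison at the heart of this fact check is a relative-error test, which is easy to verify by hand:

```python
# Is a reported 1.19M close enough to an expected 1.2M at 1% tolerance?
expected, reported, tolerance = 1_200_000, 1_190_000, 0.01

relative_error = abs(reported - expected) / max(abs(expected), 1e-10)
within = relative_error <= tolerance  # 10,000 / 1,200,000 ~ 0.83% -> True
```

Relative rather than absolute tolerance matters here: a deviation of 10,000 is negligible on a revenue figure but would be catastrophic on a churn percentage.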

LLM-as-judge uses a separate LLM call to evaluate whether the generated answer correctly addresses the question given the available context. This catches subtle errors that string matching misses: incorrect causal claims, over-extrapolation, and misattribution.

from langchain_openai import ChatOpenAI
from pydantic import BaseModel, Field
 
class JudgeScore(BaseModel):
    correctness: int = Field(
        description="1-5 score for factual accuracy", ge=1, le=5
    )
    completeness: int = Field(
        description="1-5 score for answer completeness", ge=1, le=5
    )
    faithfulness: int = Field(
        description="1-5 score: does the answer stay within the context?",
        ge=1, le=5,
    )
    reasoning: str = Field(description="Brief explanation of the scores")
 
JUDGE_PROMPT = """You are an expert evaluator for a data analytics AI agent.
Evaluate the following answer on three criteria:
 
1. CORRECTNESS (1-5): Are the facts, numbers, and conclusions accurate
   given the context?
2. COMPLETENESS (1-5): Does the answer fully address the question?
3. FAITHFULNESS (1-5): Does the answer only use information from the
   provided context, without hallucinating facts?
 
Context provided to the agent:
{context}
 
User question:
{question}
 
Agent's answer:
{answer}
 
Expected answer (reference):
{expected_answer}
 
Score each criterion from 1 to 5 and provide brief reasoning."""
 
async def judge_answer(
    question: str,
    context: str,
    answer: str,
    expected_answer: str,
) -> JudgeScore:
    llm = ChatOpenAI(model="gpt-4o", temperature=0)
    structured_llm = llm.with_structured_output(JudgeScore)
 
    prompt = JUDGE_PROMPT.format(
        context=context,
        question=question,
        answer=answer,
        expected_answer=expected_answer,
    )
    return await structured_llm.ainvoke(prompt)

A critical design choice: we use a different model for judging than for generation. If the agent uses GPT-4o, the judge should ideally use a different provider or at minimum a different configuration. Using the same model to judge itself introduces systematic blind spots.
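
One cheap safeguard is to encode that separation as an assertion in the evaluation setup. A minimal sketch, where the provider and model names are placeholders rather than our actual configuration:

```python
# Illustrative guard: refuse to run evaluation if the judge is the
# exact model that generated the answers. Names are placeholders.
generator_cfg = {"provider": "openai", "model": "gpt-4o"}
judge_cfg = {"provider": "another-provider", "model": "judge-model"}

assert (generator_cfg["provider"], generator_cfg["model"]) != (
    judge_cfg["provider"],
    judge_cfg["model"],
), "LLM judge must not be the same model that produced the answers"
```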

Building the Evaluation Pipeline

Individual metrics are useful, but the real power comes from an automated pipeline that runs on every pull request and produces a comprehensive quality report.

import asyncio
import json
from datetime import datetime
from pathlib import Path
 
class EvaluationPipeline:
    def __init__(self, agent, retriever, eval_dataset_path: str):
        self.agent = agent
        self.retriever = retriever
        self.eval_cases = self._load_dataset(eval_dataset_path)
 
    def _load_dataset(self, path: str) -> list[dict]:
        with open(path) as f:
            return json.load(f)
 
    async def run_full_evaluation(self) -> dict:
        """Execute all evaluation dimensions."""
        results = {
            "timestamp": datetime.utcnow().isoformat(),
            "num_cases": len(self.eval_cases),
            "retrieval": {},
            "correctness": {},
            "operational": {},
        }
 
        # Retrieval evaluation
        results["retrieval"] = evaluate_retrieval(
            self.retriever, self.eval_cases
        )
 
        # Answer evaluation (agent runs sequentially; judge calls run in parallel)
        fact_scores = []
        judge_tasks = []
        latencies = []
        token_counts = []

        for case in self.eval_cases:
            start = asyncio.get_event_loop().time()
            answer, metadata = await self.agent.aquery(case["query"])
            elapsed = asyncio.get_event_loop().time() - start

            latencies.append(elapsed)
            token_counts.append(metadata.get("total_tokens", 0))

            # Fact checking (synchronous, scored immediately)
            if "expected_facts" in case:
                fact_result = fact_check_answer(
                    answer, case["expected_facts"]
                )
                fact_scores.append(fact_result["score"])

            # LLM judge (collect coroutines, awaited together below)
            if "expected_answer" in case:
                judge_tasks.append(judge_answer(
                    question=case["query"],
                    context=case.get("context", ""),
                    answer=answer,
                    expected_answer=case["expected_answer"],
                ))

        judge_scores = await asyncio.gather(*judge_tasks)

        # Aggregate correctness metrics
        results["correctness"] = {}
        if fact_scores:
            results["correctness"]["fact_accuracy"] = round(
                float(np.mean(fact_scores)), 4
            )
        for criterion in ("correctness", "completeness", "faithfulness"):
            scores = [getattr(s, criterion) for s in judge_scores]
            if scores:
                results["correctness"][criterion] = round(
                    float(np.mean(scores)), 4
                )

        # Aggregate operational metrics
        results["operational"] = {
            "p50_latency_ms": round(
                np.percentile(latencies, 50) * 1000
            ),
            "p95_latency_ms": round(
                np.percentile(latencies, 95) * 1000
            ),
            "p99_latency_ms": round(
                np.percentile(latencies, 99) * 1000
            ),
            "avg_tokens": round(np.mean(token_counts)),
            "estimated_cost_per_query": round(
                np.mean(token_counts) * 0.000005, 6  # Example rate
            ),
        }
 
        return results
 
    def compare_with_baseline(
        self,
        current: dict,
        baseline: dict,
        regression_threshold: float = 0.02,
    ) -> dict:
        """Compare current results with a baseline and flag regressions."""
        comparisons = {}
        for dimension in ["retrieval", "correctness"]:
            for metric, current_val in current.get(dimension, {}).items():
                baseline_val = baseline.get(dimension, {}).get(metric)
                if baseline_val is not None:
                    delta = current_val - baseline_val
                    if delta < -regression_threshold:
                        status = "regression"
                    elif delta > regression_threshold:
                        status = "improvement"
                    else:
                        status = "ok"
                    comparisons[f"{dimension}.{metric}"] = {
                        "current": current_val,
                        "baseline": baseline_val,
                        "delta": round(delta, 4),
                        "status": status,
                    }
        return comparisons

Hallucination Detection

Hallucination is the most dangerous failure mode for an analytics agent. A hallucinated revenue number that looks plausible can mislead business decisions. We run a dedicated hallucination detection pass that checks every numerical claim in the answer against the provided context.

async def detect_hallucinations(
    answer: str,
    context: str,
    question: str,
) -> dict:
    """Detect potential hallucinations in the agent's answer."""
    answer_numbers = set(extract_numbers(answer))
    context_numbers = set(extract_numbers(context))
 
    # Numbers in the answer but not in the context
    novel_numbers = answer_numbers - context_numbers
 
    # Some novel numbers are legitimate (calculated from context numbers)
    # Use an LLM to verify if calculations are valid
    verification_results = []
 
    if novel_numbers:
        llm = ChatOpenAI(model="gpt-4o-mini", temperature=0)
        for number in novel_numbers:
            prompt = (
                f"The context contains these numbers: {context_numbers}\n"
                f"The answer contains the number {number} which is not "
                f"directly in the context.\n"
                f"Question: {question}\n"
                f"Answer excerpt around this number: "
                # str(5.0) -> "5.0" rarely appears verbatim; strip the ".0"
                f"{_extract_surrounding_text(answer, str(number).removesuffix('.0'))}\n\n"
                f"Is this number a valid calculation from the context "
                f"numbers? Answer YES or NO with a brief explanation."
            )
            response = await llm.ainvoke(prompt)
            is_valid = response.content.strip().upper().startswith("YES")
            verification_results.append({
                "number": number,
                "verified": is_valid,
                "explanation": response.content,
            })
 
    hallucinated = [
        v for v in verification_results if not v["verified"]
    ]
 
    return {
        "total_numbers_in_answer": len(answer_numbers),
        "numbers_from_context": len(answer_numbers - novel_numbers),
        "novel_numbers": len(novel_numbers),
        "verified_calculations": len(
            [v for v in verification_results if v["verified"]]
        ),
        "potential_hallucinations": hallucinated,
        "hallucination_rate": (
            len(hallucinated) / len(answer_numbers)
            if answer_numbers else 0.0
        ),
    }
 
def _extract_surrounding_text(text: str, target: str, window: int = 100) -> str:
    """Return roughly `window` characters of context around `target`.

    Best-effort: normalized values (e.g. 1200000.0 parsed from "$1.2M")
    may not appear verbatim in the text, in which case "" is returned.
    """
    idx = text.find(target)
    if idx == -1:
        return ""
    start = max(0, idx - window)
    end = min(len(text), idx + len(target) + window)
    return text[start:end]
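
The core of the detector is the set difference between answer numbers and context numbers. A simplified standalone version (using a plain-decimal regex in place of extract_numbers) shows the idea:

```python
import re

def simple_numbers(text: str) -> set[float]:
    # Simplified stand-in for extract_numbers: plain decimals only.
    return {float(m) for m in re.findall(r"\d+(?:\.\d+)?", text)}

answer = "Churn rose from 4.1 to 5.2, a 26.8 percent increase."
context = "Q1 churn: 4.1. Q2 churn: 5.2."

novel = simple_numbers(answer) - simple_numbers(context)
# 26.8 is not in the context, so it must be verified as a derived
# value (5.2 / 4.1 - 1 ~ 26.8%) rather than assumed hallucinated.
```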

Setting Quality Gates

Metrics are only useful if they gate deployment. We define minimum acceptable thresholds for each dimension and block deployments that regress below them:

QUALITY_GATES = {
    "retrieval.recall@5": {"min": 0.82, "target": 0.90},
    "retrieval.mrr": {"min": 0.75, "target": 0.85},
    "correctness.fact_accuracy": {"min": 0.90, "target": 0.95},
    "correctness.faithfulness": {"min": 4.0, "target": 4.5},  # Out of 5
    "operational.p95_latency_ms": {"max": 5000, "target": 3000},
    "operational.hallucination_rate": {"max": 0.05, "target": 0.02},
}
 
def check_quality_gates(results: dict) -> tuple[bool, list[str]]:
    """Check if evaluation results pass all quality gates."""
    failures = []
 
    for metric_path, thresholds in QUALITY_GATES.items():
        dimension, metric = metric_path.split(".")
        value = results.get(dimension, {}).get(metric)
        if value is None:
            continue
 
        if "min" in thresholds and value < thresholds["min"]:
            failures.append(
                f"FAIL: {metric_path} = {value:.4f} "
                f"(minimum: {thresholds['min']})"
            )
        if "max" in thresholds and value > thresholds["max"]:
            failures.append(
                f"FAIL: {metric_path} = {value:.4f} "
                f"(maximum: {thresholds['max']})"
            )
 
    passed = len(failures) == 0
    return passed, failures
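
In CI, the gate check ultimately reduces to an exit code. A minimal standalone sketch, with toy gates and metrics standing in for QUALITY_GATES and a real evaluation run:

```python
import sys

# Toy gates and observed metrics (illustrative values).
gates = {"recall@5": {"min": 0.82}, "p95_latency_ms": {"max": 5000}}
observed = {"recall@5": 0.86, "p95_latency_ms": 4100}

failures = [
    name for name, t in gates.items()
    if ("min" in t and observed[name] < t["min"])
    or ("max" in t and observed[name] > t["max"])
]

if failures:
    print("Quality gates failed:", failures)
    sys.exit(1)  # a non-zero exit blocks the merge or deployment
```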

Conclusion

Evaluating AI agents is a fundamentally different challenge from evaluating traditional software. The nondeterministic nature of LLM outputs, the multi-dimensional definition of quality, and the subtlety of failure modes all demand specialized tooling. The framework described here, spanning retrieval metrics, fact-checking, LLM-as-judge scoring, hallucination detection, and automated quality gates, gives us confidence that changes to Data Whispal Agent improve quality rather than degrade it.

The most valuable investment is the evaluation dataset itself. Two hundred carefully labeled question-context-answer triples, with expected facts and relevance judgments, are worth more than any amount of clever evaluation code. Start building your evaluation dataset on day one, grow it continuously from production query logs, and treat it as a first-class engineering artifact alongside your code.
