Safety Guardrails for AI Agents

An AI agent that can query databases, process refunds, and send notifications has real power — and real risk. Without guardrails, a prompt injection attack could trick the agent into issuing unauthorized refunds. A hallucinated tool argument could send a notification to the wrong customer. An overly enthusiastic agent could execute ten expensive API calls when one would suffice.

Guardrails are the safety boundaries that constrain an agent's behavior to acceptable limits. This article covers the guardrail patterns implemented in Klivvr Agent, from input validation to output filtering to action budgets.

The Guardrail Pipeline

Klivvr Agent applies guardrails at four checkpoints in the agent loop: input (before the user message reaches the LLM), pre-tool (before a tool executes), post-tool (after a tool returns), and output (before the response reaches the user).

interface Guardrail {
  name: string;
  checkpoint: "input" | "pre_tool" | "post_tool" | "output";
  check: (context: GuardrailContext) => Promise<GuardrailResult>;
}
 
interface GuardrailContext {
  message?: Message;
  toolCall?: ToolCall;
  toolResult?: ToolResult;
  agentState: AgentState;
  sessionMetadata: Record<string, unknown>;
}
 
interface GuardrailResult {
  allowed: boolean;
  reason?: string;
  modified?: Message | ToolCall;  // Optionally modify instead of block
}
 
class GuardrailPipeline {
  private guardrails: Map<string, Guardrail[]> = new Map();
 
  register(guardrail: Guardrail): void {
    const existing = this.guardrails.get(guardrail.checkpoint) ?? [];
    existing.push(guardrail);
    this.guardrails.set(guardrail.checkpoint, existing);
  }
 
  async check(
    checkpoint: Guardrail["checkpoint"],
    context: GuardrailContext
  ): Promise<GuardrailResult> {
    const guards = this.guardrails.get(checkpoint) ?? [];
 
    for (const guard of guards) {
      const result = await guard.check(context);
      if (!result.allowed) {
        await this.logViolation(guard.name, checkpoint, context, result);
        return result;
      }
    }
 
    return { allowed: true };
  }
 
  private async logViolation(
    guardName: string,
    checkpoint: string,
    context: GuardrailContext,
    result: GuardrailResult
  ): Promise<void> {
    console.warn(
      `[Guardrail] ${guardName} blocked at ${checkpoint}: ${result.reason}`
    );
    await auditLog.write({
      type: "guardrail_violation",
      guard: guardName,
      checkpoint,
      reason: result.reason,
      timestamp: new Date(),
    });
  }
}

Input Guardrails: Prompt Injection Detection

Prompt injection is the most significant security risk for AI agents. An attacker crafts input that causes the LLM to ignore its instructions and execute unauthorized actions. Input guardrails detect and block these attempts.

const promptInjectionGuard: Guardrail = {
  name: "prompt_injection_detector",
  checkpoint: "input",
  check: async (context) => {
    const message = context.message?.content ?? "";
 
    // Pattern-based detection
    const injectionPatterns = [
      /ignore\s+(all\s+)?(previous|prior|above)\s+(instructions|prompts)/i,
      /you\s+are\s+now\s+(a|an)\s+/i,
      /system\s*:\s*/i,
      /\[INST\]/i,
      /\<\|im_start\|\>/i,
      /forget\s+(everything|all|your)\s+(you|instructions|rules)/i,
      /override\s+(your|the|all)\s+(rules|instructions|guidelines)/i,
    ];
 
    for (const pattern of injectionPatterns) {
      if (pattern.test(message)) {
        return {
          allowed: false,
          reason: "Potential prompt injection detected",
        };
      }
    }
 
    // Semantic similarity check against known injection templates
    const injectionScore = await classifyInjection(message);
    if (injectionScore > 0.85) {
      return {
        allowed: false,
        reason: `High injection probability: ${injectionScore}`,
      };
    }
 
    return { allowed: true };
  },
};

Pre-Tool Guardrails: Action Limits

Pre-tool guardrails validate tool calls before execution. They enforce action budgets, parameter policies, and authorization checks.

const actionBudgetGuard: Guardrail = {
  name: "action_budget",
  checkpoint: "pre_tool",
  check: async (context) => {
    const state = context.agentState;
    const toolName = context.toolCall?.name ?? "";
 
    // Global step limit
    if (state.stepCount >= 15) {
      return {
        allowed: false,
        reason: "Maximum agent steps exceeded (15)",
      };
    }
 
    // Per-tool call limits
    const toolCallCounts = state.toolResults.reduce(
      (acc, r) => {
        acc[r.toolName] = (acc[r.toolName] ?? 0) + 1;
        return acc;
      },
      {} as Record<string, number>
    );
 
    const toolLimits: Record<string, number> = {
      issue_refund: 3,
      initiate_transfer: 2,
      send_notification: 5,
      lookup_customer: 5,
    };
 
    const limit = toolLimits[toolName] ?? 10;
    const currentCount = toolCallCounts[toolName] ?? 0;
 
    if (currentCount >= limit) {
      return {
        allowed: false,
        reason: `Tool ${toolName} call limit reached (${limit})`,
      };
    }
 
    return { allowed: true };
  },
};
 
const financialLimitGuard: Guardrail = {
  name: "financial_limit",
  checkpoint: "pre_tool",
  check: async (context) => {
    const toolCall = context.toolCall;
    if (!toolCall) return { allowed: true };
 
    // Check refund amounts
    if (toolCall.name === "issue_refund") {
      const amount = toolCall.arguments.amount as number | undefined;
      const maxAutoRefund = 500;  // EGP
 
      if (amount && amount > maxAutoRefund) {
        return {
          allowed: false,
          reason:
            `Refund amount ${amount} EGP exceeds auto-approval limit ` +
            `of ${maxAutoRefund} EGP. Requires human approval.`,
        };
      }
    }
 
    // Check transfer amounts
    if (toolCall.name === "initiate_transfer") {
      const amount = toolCall.arguments.amount as number;
      const maxAutoTransfer = 10000;
 
      if (amount > maxAutoTransfer) {
        return {
          allowed: false,
          reason:
            `Transfer amount ${amount} EGP exceeds auto-approval limit. ` +
            `Escalating to human agent.`,
        };
      }
    }
 
    return { allowed: true };
  },
};

Output Guardrails: Content Safety

Output guardrails ensure the agent's response does not contain sensitive information, hallucinated claims, or inappropriate content.

const piiRedactionGuard: Guardrail = {
  name: "pii_redaction",
  checkpoint: "output",
  check: async (context) => {
    const content = context.message?.content ?? "";
 
    // Detect and redact potential PII in agent responses
    const piiPatterns: Array<{ pattern: RegExp; replacement: string }> = [
      {
        pattern: /\b[A-Z]{2}\d{2}[A-Z0-9]{4}\d{7}[A-Z0-9]{0,16}\b/g,
        replacement: "[IBAN REDACTED]",
      },
      {
        pattern: /\b\d{14,16}\b/g,
        replacement: "[CARD NUMBER REDACTED]",
      },
      {
        pattern: /\b\d{3}-?\d{2}-?\d{4}\b/g,
        replacement: "[SSN REDACTED]",
      },
      {
        pattern: /\b[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Z|a-z]{2,}\b/g,
        replacement: "[EMAIL REDACTED]",
      },
    ];
 
    let redactedContent = content;
    let wasRedacted = false;
 
    for (const { pattern, replacement } of piiPatterns) {
      if (pattern.test(redactedContent)) {
        redactedContent = redactedContent.replace(pattern, replacement);
        wasRedacted = true;
      }
    }
 
    if (wasRedacted) {
      return {
        allowed: true,
        modified: { ...context.message!, content: redactedContent },
      };
    }
 
    return { allowed: true };
  },
};
 
const halluccinationGuard: Guardrail = {
  name: "hallucination_detector",
  checkpoint: "output",
  check: async (context) => {
    const content = context.message?.content ?? "";
    const state = context.agentState;
 
    // Check if agent claims to have taken actions it did not take
    const actionClaims = [
      { pattern: /refund.*processed/i, tool: "issue_refund" },
      { pattern: /transfer.*initiated/i, tool: "initiate_transfer" },
      { pattern: /account.*closed/i, tool: "close_account" },
    ];
 
    const executedTools = new Set(
      state.toolResults.map((r) => r.toolName)
    );
 
    for (const claim of actionClaims) {
      if (claim.pattern.test(content) && !executedTools.has(claim.tool)) {
        return {
          allowed: false,
          reason:
            `Agent claims "${claim.pattern.source}" but tool ` +
            `"${claim.tool}" was never executed`,
        };
      }
    }
 
    return { allowed: true };
  },
};

Circuit Breaker

When guardrails detect repeated violations or the agent enters a loop, a circuit breaker stops the agent and escalates to a human.

class AgentCircuitBreaker {
  private violations: Array<{ guard: string; timestamp: Date }> = [];
  private threshold: number;
  private windowMs: number;
 
  constructor(threshold: number = 3, windowMs: number = 60000) {
    this.threshold = threshold;
    this.windowMs = windowMs;
  }
 
  recordViolation(guardName: string): void {
    this.violations.push({ guard: guardName, timestamp: new Date() });
  }
 
  isTripped(): boolean {
    const now = Date.now();
    const recentViolations = this.violations.filter(
      (v) => now - v.timestamp.getTime() < this.windowMs
    );
    return recentViolations.length >= this.threshold;
  }
 
  getEscalationReason(): string {
    const recent = this.violations.slice(-this.threshold);
    const guards = [...new Set(recent.map((v) => v.guard))];
    return `Circuit breaker tripped: ${this.threshold} guardrail violations ` +
      `in ${this.windowMs / 1000}s from guards: ${guards.join(", ")}`;
  }
}

Conclusion

Guardrails are not optional for production AI agents — they are the safety net that prevents the agent from causing harm. Input guardrails block malicious prompts before they reach the model. Pre-tool guardrails enforce action limits and authorization policies. Output guardrails redact sensitive information and detect hallucinated claims. And circuit breakers stop runaway agents before they cause cascading damage. In Klivvr Agent, every tool call passes through this pipeline, ensuring that the agent operates within defined boundaries regardless of what the user asks or what the model generates.

Safety Guardrails for AI Agents

The Guardrail Pipeline

Input Guardrails: Prompt Injection Detection

Pre-Tool Guardrails: Action Limits

Output Guardrails: Content Safety

Circuit Breaker

Conclusion

Related Articles

AI Agents in Fintech Operations

Human-in-the-Loop Patterns for AI Agents

Multi-Agent Systems in TypeScript

Related Articles

AI Agents in Fintech Operations
October 20, 20257 min read

Human-in-the-Loop Patterns for AI Agents
September 5, 20257 min read

Multi-Agent Systems in TypeScript
August 14, 20256 min read