Safety Guardrails for AI Agents
How to implement safety guardrails for AI agents including input validation, output filtering, action limits, and prompt injection detection — with patterns from Klivvr Agent.
An AI agent that can query databases, process refunds, and send notifications has real power — and real risk. Without guardrails, a prompt injection attack could trick the agent into issuing unauthorized refunds. A hallucinated tool argument could send a notification to the wrong customer. An overly enthusiastic agent could execute ten expensive API calls when one would suffice.
Guardrails are the safety boundaries that constrain an agent's behavior to acceptable limits. This article covers the guardrail patterns implemented in Klivvr Agent, from input validation to output filtering to action budgets.
The Guardrail Pipeline
Klivvr Agent applies guardrails at four checkpoints in the agent loop: input (before the user message reaches the LLM), pre-tool (before a tool executes), post-tool (after a tool returns), and output (before the response reaches the user).
interface Guardrail {
name: string;
checkpoint: "input" | "pre_tool" | "post_tool" | "output";
check: (context: GuardrailContext) => Promise<GuardrailResult>;
}
interface GuardrailContext {
message?: Message;
toolCall?: ToolCall;
toolResult?: ToolResult;
agentState: AgentState;
sessionMetadata: Record<string, unknown>;
}
interface GuardrailResult {
allowed: boolean;
reason?: string;
modified?: Message | ToolCall; // Optionally modify instead of block
}
class GuardrailPipeline {
private guardrails: Map<string, Guardrail[]> = new Map();
register(guardrail: Guardrail): void {
const existing = this.guardrails.get(guardrail.checkpoint) ?? [];
existing.push(guardrail);
this.guardrails.set(guardrail.checkpoint, existing);
}
async check(
checkpoint: Guardrail["checkpoint"],
context: GuardrailContext
): Promise<GuardrailResult> {
const guards = this.guardrails.get(checkpoint) ?? [];
for (const guard of guards) {
const result = await guard.check(context);
if (!result.allowed) {
await this.logViolation(guard.name, checkpoint, context, result);
return result;
}
}
return { allowed: true };
}
private async logViolation(
guardName: string,
checkpoint: string,
context: GuardrailContext,
result: GuardrailResult
): Promise<void> {
console.warn(
`[Guardrail] ${guardName} blocked at ${checkpoint}: ${result.reason}`
);
await auditLog.write({
type: "guardrail_violation",
guard: guardName,
checkpoint,
reason: result.reason,
timestamp: new Date(),
});
}
}Input Guardrails: Prompt Injection Detection
Prompt injection is the most significant security risk for AI agents. An attacker crafts input that causes the LLM to ignore its instructions and execute unauthorized actions. Input guardrails detect and block these attempts.
const promptInjectionGuard: Guardrail = {
name: "prompt_injection_detector",
checkpoint: "input",
check: async (context) => {
const message = context.message?.content ?? "";
// Pattern-based detection
const injectionPatterns = [
/ignore\s+(all\s+)?(previous|prior|above)\s+(instructions|prompts)/i,
/you\s+are\s+now\s+(a|an)\s+/i,
/system\s*:\s*/i,
/\[INST\]/i,
/\<\|im_start\|\>/i,
/forget\s+(everything|all|your)\s+(you|instructions|rules)/i,
/override\s+(your|the|all)\s+(rules|instructions|guidelines)/i,
];
for (const pattern of injectionPatterns) {
if (pattern.test(message)) {
return {
allowed: false,
reason: "Potential prompt injection detected",
};
}
}
// Semantic similarity check against known injection templates
const injectionScore = await classifyInjection(message);
if (injectionScore > 0.85) {
return {
allowed: false,
reason: `High injection probability: ${injectionScore}`,
};
}
return { allowed: true };
},
};Pre-Tool Guardrails: Action Limits
Pre-tool guardrails validate tool calls before execution. They enforce action budgets, parameter policies, and authorization checks.
const actionBudgetGuard: Guardrail = {
name: "action_budget",
checkpoint: "pre_tool",
check: async (context) => {
const state = context.agentState;
const toolName = context.toolCall?.name ?? "";
// Global step limit
if (state.stepCount >= 15) {
return {
allowed: false,
reason: "Maximum agent steps exceeded (15)",
};
}
// Per-tool call limits
const toolCallCounts = state.toolResults.reduce(
(acc, r) => {
acc[r.toolName] = (acc[r.toolName] ?? 0) + 1;
return acc;
},
{} as Record<string, number>
);
const toolLimits: Record<string, number> = {
issue_refund: 3,
initiate_transfer: 2,
send_notification: 5,
lookup_customer: 5,
};
const limit = toolLimits[toolName] ?? 10;
const currentCount = toolCallCounts[toolName] ?? 0;
if (currentCount >= limit) {
return {
allowed: false,
reason: `Tool ${toolName} call limit reached (${limit})`,
};
}
return { allowed: true };
},
};
const financialLimitGuard: Guardrail = {
name: "financial_limit",
checkpoint: "pre_tool",
check: async (context) => {
const toolCall = context.toolCall;
if (!toolCall) return { allowed: true };
// Check refund amounts
if (toolCall.name === "issue_refund") {
const amount = toolCall.arguments.amount as number | undefined;
const maxAutoRefund = 500; // EGP
if (amount && amount > maxAutoRefund) {
return {
allowed: false,
reason:
`Refund amount ${amount} EGP exceeds auto-approval limit ` +
`of ${maxAutoRefund} EGP. Requires human approval.`,
};
}
}
// Check transfer amounts
if (toolCall.name === "initiate_transfer") {
const amount = toolCall.arguments.amount as number;
const maxAutoTransfer = 10000;
if (amount > maxAutoTransfer) {
return {
allowed: false,
reason:
`Transfer amount ${amount} EGP exceeds auto-approval limit. ` +
`Escalating to human agent.`,
};
}
}
return { allowed: true };
},
};Output Guardrails: Content Safety
Output guardrails ensure the agent's response does not contain sensitive information, hallucinated claims, or inappropriate content.
const piiRedactionGuard: Guardrail = {
name: "pii_redaction",
checkpoint: "output",
check: async (context) => {
const content = context.message?.content ?? "";
// Detect and redact potential PII in agent responses
const piiPatterns: Array<{ pattern: RegExp; replacement: string }> = [
{
pattern: /\b[A-Z]{2}\d{2}[A-Z0-9]{4}\d{7}[A-Z0-9]{0,16}\b/g,
replacement: "[IBAN REDACTED]",
},
{
pattern: /\b\d{14,16}\b/g,
replacement: "[CARD NUMBER REDACTED]",
},
{
pattern: /\b\d{3}-?\d{2}-?\d{4}\b/g,
replacement: "[SSN REDACTED]",
},
{
pattern: /\b[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Z|a-z]{2,}\b/g,
replacement: "[EMAIL REDACTED]",
},
];
let redactedContent = content;
let wasRedacted = false;
for (const { pattern, replacement } of piiPatterns) {
if (pattern.test(redactedContent)) {
redactedContent = redactedContent.replace(pattern, replacement);
wasRedacted = true;
}
}
if (wasRedacted) {
return {
allowed: true,
modified: { ...context.message!, content: redactedContent },
};
}
return { allowed: true };
},
};
const halluccinationGuard: Guardrail = {
name: "hallucination_detector",
checkpoint: "output",
check: async (context) => {
const content = context.message?.content ?? "";
const state = context.agentState;
// Check if agent claims to have taken actions it did not take
const actionClaims = [
{ pattern: /refund.*processed/i, tool: "issue_refund" },
{ pattern: /transfer.*initiated/i, tool: "initiate_transfer" },
{ pattern: /account.*closed/i, tool: "close_account" },
];
const executedTools = new Set(
state.toolResults.map((r) => r.toolName)
);
for (const claim of actionClaims) {
if (claim.pattern.test(content) && !executedTools.has(claim.tool)) {
return {
allowed: false,
reason:
`Agent claims "${claim.pattern.source}" but tool ` +
`"${claim.tool}" was never executed`,
};
}
}
return { allowed: true };
},
};Circuit Breaker
When guardrails detect repeated violations or the agent enters a loop, a circuit breaker stops the agent and escalates to a human.
class AgentCircuitBreaker {
private violations: Array<{ guard: string; timestamp: Date }> = [];
private threshold: number;
private windowMs: number;
constructor(threshold: number = 3, windowMs: number = 60000) {
this.threshold = threshold;
this.windowMs = windowMs;
}
recordViolation(guardName: string): void {
this.violations.push({ guard: guardName, timestamp: new Date() });
}
isTripped(): boolean {
const now = Date.now();
const recentViolations = this.violations.filter(
(v) => now - v.timestamp.getTime() < this.windowMs
);
return recentViolations.length >= this.threshold;
}
getEscalationReason(): string {
const recent = this.violations.slice(-this.threshold);
const guards = [...new Set(recent.map((v) => v.guard))];
return `Circuit breaker tripped: ${this.threshold} guardrail violations ` +
`in ${this.windowMs / 1000}s from guards: ${guards.join(", ")}`;
}
}Conclusion
Guardrails are not optional for production AI agents — they are the safety net that prevents the agent from causing harm. Input guardrails block malicious prompts before they reach the model. Pre-tool guardrails enforce action limits and authorization policies. Output guardrails redact sensitive information and detect hallucinated claims. And circuit breakers stop runaway agents before they cause cascading damage. In Klivvr Agent, every tool call passes through this pipeline, ensuring that the agent operates within defined boundaries regardless of what the user asks or what the model generates.
Related Articles
AI Agents in Fintech Operations
How AI agents automate fintech operational workflows including compliance monitoring, fraud detection, dispute resolution, and regulatory reporting — with insights from Klivvr Agent deployments.
Human-in-the-Loop Patterns for AI Agents
How to design effective human-in-the-loop workflows for AI agents, covering escalation policies, approval workflows, the autonomy ladder, and trust-building strategies.
Multi-Agent Systems in TypeScript
Architecture patterns for multi-agent systems including supervisor topologies, agent-to-agent communication, task delegation, and shared state management in Klivvr Agent.