How to Build Secure AI Assistants with User Interaction Guardrails
AI assistants are moving from chatbots to production systems that handle sensitive data, make decisions, and interact with users in complex ways. The challenge isn't just building a smart assistant; it's building one that doesn't leak data, amplify biases, or make dangerous decisions. This tutorial walks through building a production-grade AI assistant with explicit guardrails for user interaction and security, using Python, LangChain, and FastAPI.
Why Guardrails Matter More Than Intelligence
In 2025, a major financial institution had to take down its customer service AI after it accidentally revealed another user's account balance during a conversation. This isn't a hypothetical: it's the reality of deploying AI without proper interaction controls. The DeBiasMe paper on arXiv highlights how metacognitive interventions can reduce bias in human-AI interactions, but these interventions need to be baked into the architecture, not bolted on later.
The core problem: LLM calls are stateless by default, but user interactions are deeply contextual. A user might ask "what's my balance?" and the assistant needs to know which user, which account, and whether the previous conversation established identity. Without guardrails, the assistant either trusts everything (dangerous) or trusts nothing (useless).
Architecture Overview: The Three-Layer Guardrail System
We're building a system with three distinct guardrail layers:
- Input Guardrails: Validate and sanitize user input before it reaches the LLM
- Context Guardrails: Manage conversation state and user identity securely
- Output Guardrails: Filter and validate LLM responses before returning to user
This follows patterns from the Supporting Data-Frame Dynamics paper, which shows that AI-assisted decision making requires structured interaction frameworks to prevent errors.
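The flow through the three layers can be sketched as a small pipeline. This is an illustration only, not the implementation built in the steps below: `GuardrailPipeline` and its field names are hypothetical, and the lambda layers are stubs so the sketch runs without an LLM or Redis.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class GuardrailPipeline:
    """Hypothetical wiring of the three layers; every name here is illustrative."""
    check_input: Callable[[str], str]           # layer 1: sanitize the user message
    load_context: Callable[[str], List[str]]    # layer 2: history for this session only
    generate: Callable[[str, List[str]], str]   # the LLM call (stubbed below)
    check_output: Callable[[str], bool]         # layer 3: True if the reply may be shown

    def run(self, session_id: str, message: str) -> str:
        clean = self.check_input(message)
        history = self.load_context(session_id)
        reply = self.generate(clean, history)
        if not self.check_output(reply):
            return "Sorry, I can't share that response."
        return reply


# Stub layers standing in for the real guardrail classes.
pipeline = GuardrailPipeline(
    check_input=lambda text: text.replace("123-45-6789", "[REDACTED_SSN]"),
    load_context=lambda session_id: [],
    generate=lambda text, history: f"Echo: {text}",
    check_output=lambda reply: "password" not in reply.lower(),
)

print(pipeline.run("s1", "My SSN is 123-45-6789"))
print(pipeline.run("s1", "What is the admin password?"))
```

The point of the shape is that the LLM call sits in the middle and never sees raw input or returns raw output; each layer can be swapped or tested on its own.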
Prerequisites and Environment Setup
# Create a virtual environment
python -m venv guardrail-env
source guardrail-env/bin/activate # On Windows: guardrail-env\Scripts\activate
# Install core dependencies
pip install langchain==0.3.14 langchain-openai==0.2.14 fastapi==0.115.6 uvicorn==0.34.0
pip install pydantic==2.10.3 python-multipart==0.0.18
pip install redis==5.2.1 # For session management
pip install presidio-analyzer==2.2.351 presidio-anonymizer==2.2.351 # PII detection
pip install spacy==3.8.2 # For NLP-based guardrails
# Download spaCy model
python -m spacy download en_core_web_sm
You'll need an OpenAI API key (or any LLM provider). Set it as an environment variable:
export OPENAI_API_KEY="sk-your-key-here"
export REDIS_URL="redis://localhost:6379/0"
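It can help to fail fast at startup when these variables are missing, rather than on the first request. A minimal sketch, assuming a hypothetical `load_settings` helper (not part of any library) and the same local Redis default used in this tutorial:

```python
import os
from typing import Dict, Mapping


def load_settings(env: Mapping[str, str] = os.environ) -> Dict[str, str]:
    """Read the two settings used in this tutorial (hypothetical helper)."""
    api_key = env.get("OPENAI_API_KEY", "")
    if not api_key:
        # Refuse to start without a key instead of failing on the first LLM call.
        raise RuntimeError("OPENAI_API_KEY is not set")
    return {
        "openai_api_key": api_key,
        # Assumption: fall back to a local Redis instance when REDIS_URL is unset.
        "redis_url": env.get("REDIS_URL", "redis://localhost:6379/0"),
    }


settings = load_settings({"OPENAI_API_KEY": "sk-your-key-here"})
print(settings["redis_url"])
```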
Core Implementation: Building the Guardrail System
Step 1: Input Guardrails with PII Detection
The first line of defense is preventing sensitive data from reaching the LLM. We use Microsoft's Presidio for PII detection:
# guardrails/input_guardrails.py
import re
from typing import Dict

from presidio_analyzer import AnalyzerEngine
from presidio_anonymizer import AnonymizerEngine


class InputGuardrail:
    """
    Validates and sanitizes user input before it reaches the LLM.
    Uses Presidio for PII detection and custom regex patterns for security.
    """

    def __init__(self):
        self.analyzer = AnalyzerEngine()
        self.anonymizer = AnonymizerEngine()
        # Custom security patterns - these catch things Presidio might miss
        self.security_patterns = {
            "api_key": r'(?i)(?:api[_-]?key|secret|token)[\s:=]+["\']?[a-zA-Z0-9_\-]{16,}["\']?',
            "ssn": r'\b\d{3}-\d{2}-\d{4}\b',
            "credit_card": r'\b(?:\d[ -]*?){13,16}\b',
        }

    def analyze_input(self, text: str) -> Dict:
        """
        Analyze input text for PII and security issues.
        Returns a dict with findings and risk score.
        """
        # Run Presidio analysis
        presidio_results = self.analyzer.analyze(
            text=text,
            entities=["PHONE_NUMBER", "EMAIL_ADDRESS", "CREDIT_CARD",
                      "US_SSN", "US_BANK_NUMBER"],
            language="en"
        )

        # Run custom regex patterns
        custom_findings = []
        for pattern_name, pattern in self.security_patterns.items():
            for match in re.finditer(pattern, text):
                custom_findings.append({
                    "entity_type": pattern_name,
                    "start": match.start(),
                    "end": match.end(),
                    "score": 0.95  # High confidence for regex matches
                })

        # Calculate risk score
        total_findings = len(presidio_results) + len(custom_findings)
        risk_score = min(total_findings * 0.25, 1.0)  # Cap at 1.0

        return {
            "has_pii": total_findings > 0,
            "risk_score": risk_score,
            # Mixed list for reporting: Presidio result objects plus regex dicts
            "findings": presidio_results + custom_findings,
            # Kept separately because the anonymizer only accepts Presidio results
            "presidio_results": presidio_results,
            "requires_action": risk_score > 0.5
        }

    def sanitize_input(self, text: str) -> str:
        """
        Sanitize input by replacing PII with placeholders.
        Returns the sanitized text.
        """
        analysis = self.analyze_input(text)
        if not analysis["has_pii"]:
            return text

        # Use Presidio's anonymizer for structured PII.
        # Pass only the Presidio results; the regex dicts are handled below.
        anonymized_text = self.anonymizer.anonymize(
            text=text,
            analyzer_results=analysis["presidio_results"]
        ).text

        # Apply custom regex replacements for security patterns
        for pattern_name, pattern in self.security_patterns.items():
            anonymized_text = re.sub(
                pattern,
                f"[REDACTED_{pattern_name.upper()}]",
                anonymized_text
            )
        return anonymized_text
Why this matters: Without input guardrails, users can accidentally (or intentionally) inject sensitive data into prompts. The risk score system lets you decide whether to sanitize, block, or flag the interaction. In production, you'd log all PII detections for audit trails.
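One way to turn the risk score into a decision is a small policy function. The sketch below is an assumption, not part of the guardrail class above: `decide_action` and both thresholds are hypothetical, chosen so that one finding (score 0.25) is sanitized and three or more (score 0.75 and up) are blocked. The audit log records entity types only, never the matched values.

```python
import logging
from typing import List

logger = logging.getLogger("guardrails.audit")


def decide_action(risk_score: float,
                  sanitize_at: float = 0.25,
                  block_at: float = 0.75) -> str:
    """Map a risk score in [0, 1] to 'pass', 'sanitize', or 'block' (hypothetical thresholds)."""
    if risk_score >= block_at:
        return "block"
    if risk_score >= sanitize_at:
        return "sanitize"
    return "pass"


def audit(entity_types: List[str], action: str) -> None:
    """Record what kind of PII was seen and what we did, without logging the PII itself."""
    logger.info("pii_detected types=%s action=%s", sorted(set(entity_types)), action)


for score in (0.0, 0.25, 0.5, 0.75, 1.0):
    print(score, decide_action(score))
```

Keeping the thresholds as parameters makes it easy to tighten the policy for high-risk deployments without touching the detection code.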
Step 2: Context Guardrails with Session Management
The second layer manages conversation state and user identity. This is where most security failures happen—when the assistant confuses one user's context with another's.
# guardrails/context_guardrails.py
import redis
import json
import hashlib
from datetime import datetime, timedelta
from typing import Optional, Dict, List
from pydantic import BaseModel, Field


class SessionContext(BaseModel):
    """Pydantic model for session data with validation"""
    user_id: str
    session_id: str
    created_at: datetime = Field(default_factory=datetime.utcnow)
    last_active: datetime = Field(default_factory=datetime.utcnow)
    message_count: int = 0
    max_messages: int = 100  # Prevent infinite context growth
    context_window: List[Dict] = Field(default_factory=list, max_length=50)

    class Config:
        arbitrary_types_allowed = True


class ContextGuardrail:
    """
    Manages conversation context with security boundaries.
    Uses Redis for fast, distributed session storage.
    """

    def __init__(self, redis_url: str = "redis://localhost:6379/0"):
        self.redis_client = redis.from_url(redis_url)
        self.session_ttl = 3600  # 1 hour session timeout

    def create_session(self, user_id: str) -> SessionContext:
        """Create a new session with unique ID"""
        session_id = hashlib.sha256(
            f"{user_id}:{datetime.utcnow().isoformat()}".encode()
        ).hexdigest()[:16]
        session = SessionContext(
            user_id=user_id,
            session_id=session_id
        )
        # Store in Redis with TTL
        self.redis_client.setex(
            f"session:{session_id}",
            self.session_ttl,
            session.model_dump_json()
        )
        return session

    def get_session(self, session_id: str) -> Optional[SessionContext]:
        """Retrieve session with validation"""
        data = self.redis_client.get(f"session:{session_id}")
        if not data:
            return None
        session = SessionContext.model_validate_json(data)

        # Check if session has expired
        if datetime.utcnow() - session.last_active > timedelta(hours=1):
            self.redis_client.delete(f"session:{session_id}")
            return None

        # Check message limit
        if session.message_count >= session.max_messages:
            return None  # Session exhausted, force new session
        return session

    def update_context(self, session_id: str,
                       user_message: str,
                       assistant_response: str) -> bool:
        """Add to conversation context with size limits"""
        session = self.get_session(session_id)
        if not session:
            return False

        # Add to context window
        session.context_window.append({
            "role": "user",
            "content": user_message,
            "timestamp": datetime.utcnow().isoformat()
        })
        session.context_window.append({
            "role": "assistant",
            "content": assistant_response,
            "timestamp": datetime.utcnow().isoformat()
        })

        # Enforce context window size
        if len(session.context_window) > 50:
            # Keep only last 50 messages (25 exchanges)
            session.context_window = session.context_window[-50:]

        session.message_count += 1
        session.last_active = datetime.utcnow()

        # Update Redis
        self.redis_client.setex(
            f"session:{session_id}",
            self.session_ttl,
            session.model_dump_json()
        )
        return True
Edge case handling: The context window is limited to 50 messages to prevent token overflow and reduce hallucination risk from overly long contexts. The session TTL of 1 hour means inactive sessions are automatically cleaned up. The message count limit prevents abuse through infinite conversations.
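The two limits described above are easy to check in isolation, without Redis. The helpers below are a hypothetical, stdlib-only restatement of the trimming and expiry rules; `trim_context` and `is_expired` are illustrative names, not methods of the guardrail class.

```python
from datetime import datetime, timedelta, timezone
from typing import Dict, List

MAX_CONTEXT_MESSAGES = 50          # same limit as the context window above
SESSION_TTL = timedelta(hours=1)   # same timeout as the session TTL above


def trim_context(messages: List[Dict], limit: int = MAX_CONTEXT_MESSAGES) -> List[Dict]:
    """Keep only the most recent `limit` messages (limit must be positive)."""
    return messages[-limit:]


def is_expired(last_active: datetime, now: datetime, ttl: timedelta = SESSION_TTL) -> bool:
    """A session is expired once it has been idle for longer than the TTL."""
    return now - last_active > ttl


history = [{"role": "user", "content": f"message {i}"} for i in range(60)]
trimmed = trim_context(history)
print(len(trimmed), trimmed[0]["content"])

now = datetime(2025, 1, 1, 12, 0, tzinfo=timezone.utc)
print(is_expired(now - timedelta(minutes=59), now))
print(is_expired(now - timedelta(minutes=61), now))
```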
Step 3: Output Guardrails with Response Validation
The final layer validates what the LLM says before it reaches the user. This is critical for preventing hallucinated data, inappropriate content, or security leaks.
# guardrails/output_guardrails.py
import json
import re
from typing import Dict, Optional

from langchain_core.prompts import ChatPromptTemplate
from langchain_openai import ChatOpenAI


class OutputGuardrail:
    """
    Validates LLM responses before returning to user.
    Uses a secondary LLM call for content safety checking.
    """

    def __init__(self, model_name: str = "gpt-4o-mini"):
        # Use a cheaper model for validation to reduce costs
        self.validator_llm = ChatOpenAI(
            model=model_name,
            temperature=0.0,  # Deterministic output for validation
            max_tokens=200    # Short, but enough room for the full JSON object
        )
        self.validation_prompt = ChatPromptTemplate.from_messages([
            ("system", """You are a response validator. Analyze the following
assistant response and classify it. Return ONLY a JSON object with
these fields:
- "safe": boolean (true if response is safe to show user)
- "reason": string (why it passed or failed)
- "contains_pii": boolean
- "contains_hallucination": boolean (if response makes unverifiable claims)
- "action": "pass" | "block" | "flag"
Rules:
- Block if response contains personal data, financial info, or credentials
- Block if response makes specific claims about user data without context
- Flag if response is speculative or uses uncertain language
- Pass only if response is factual, safe, and contextually appropriate"""),
            ("human", "Response to validate: {response}")
        ])

    def validate_response(self, response: str,
                          context: Optional[Dict] = None) -> Dict:
        """
        Validate LLM response. Returns validation result with action.
        """
        # Quick regex checks before LLM validation (cheaper)
        quick_checks = self._quick_safety_checks(response)
        if quick_checks["block"]:
            return {
                "safe": False,
                "action": "block",
                "reason": quick_checks["reason"],
                "contains_pii": True,
                "contains_hallucination": False
            }

        # LLM-based validation for nuanced checks
        chain = self.validation_prompt | self.validator_llm
        try:
            result = chain.invoke({"response": response})
            validation = json.loads(result.content)
            # Escalate a "pass" to "flag" if the quick checks found issues;
            # never downgrade a "block" from the validator
            if quick_checks["flag"] and validation.get("action") != "block":
                validation["action"] = "flag"
                validation["reason"] = (
                    f"{validation.get('reason', '')} | Quick check: {quick_checks['reason']}"
                )
            return validation
        except Exception as e:
            # Covers API errors and malformed JSON alike.
            # If validation fails, block by default (fail closed)
            return {
                "safe": False,
                "action": "block",
                "reason": f"Validation system error: {str(e)}",
                "contains_pii": False,
                "contains_hallucination": False
            }

    def _quick_safety_checks(self, text: str) -> Dict:
        """
        Fast regex-based safety checks before LLM validation.
        Catches obvious issues without API calls.
        """
        block_patterns = {
            "ssn": r'\b\d{3}-\d{2}-\d{4}\b',
            "credit_card": r'\b(?:\d[ -]*?){13,16}\b',
            "api_key": r'(?i)(?:api[_-]?key|secret|token)[\s:=]+["\']?[a-zA-Z0-9_\-]{16,}["\']?',
        }
        flag_patterns = {
            "uncertainty": r'\b(maybe|perhaps|possibly|might|could be|not sure)\b',
            "speculation": r'\b(I think|I believe|I guess|probably)\b',
        }

        for pattern_name, pattern in block_patterns.items():
            if re.search(pattern, text):
                return {
                    "block": True,
                    "flag": False,
                    "reason": f"Detected {pattern_name} in response"
                }

        flags = []
        for pattern_name, pattern in flag_patterns.items():
            # Case-insensitive so sentence-initial "Maybe" or "Probably" is caught too
            if re.search(pattern, text, re.IGNORECASE):
                flags.append(pattern_name)

        return {
            "block": False,
            "flag": len(flags) > 0,
            "reason": f"Flagged patterns: {', '.join(flags)}" if flags else ""
        }
Production consideration: The output guardrail uses a two-stage approach: fast regex checks first, then LLM validation for nuanced cases. This reduces costs because any response caught by the regex layer never triggers a validator call. The "fail closed" approach means if the validator itself fails, we block the response rather than letting it through.
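The fail-closed rule can be isolated into a small wrapper that works with any validator. This is a sketch under assumptions: `validate_fail_closed` is a hypothetical helper, and the stub validators stand in for the LLM call so the example runs offline.

```python
from typing import Callable, Dict

ALLOWED_ACTIONS = {"pass", "flag", "block"}


def validate_fail_closed(validator: Callable[[str], Dict], response: str) -> Dict:
    """Run a validator; treat any error or malformed result as a block."""
    try:
        result = validator(response)
        if result.get("action") not in ALLOWED_ACTIONS:
            raise ValueError(f"unexpected action: {result.get('action')!r}")
        return result
    except Exception as exc:
        # Fail closed: an unavailable or confused validator must never let a response through.
        return {"safe": False, "action": "block", "reason": f"Validation system error: {exc}"}


def timing_out_validator(_: str) -> Dict:
    raise TimeoutError("validator timed out")


print(validate_fail_closed(timing_out_validator, "hello")["action"])
print(validate_fail_closed(lambda r: {"safe": True, "action": "pass"}, "hello")["action"])
```

The opposite choice, failing open, keeps the assistant available during validator outages, but it means an attacker who can break the validator can bypass it.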
Step 4: FastAPI Integration
Now we wire everything together into a production API:
# main.py
import os
from typing import Optional

from fastapi import FastAPI, HTTPException, Depends
from fastapi.security import HTTPBearer, HTTPAuthorizationCredentials
from pydantic import BaseModel

from guardrails.input_guardrails import InputGuardrail
from guardrails.context_guardrails import ContextGuardrail
from guardrails.output_guardrails import OutputGuardrail

app = FastAPI(title="Secure AI Assistant API")
security = HTTPBearer()

# Initialize guardrails
input_guardrail = InputGuardrail()
context_guardrail = ContextGuardrail(
    os.environ.get("REDIS_URL", "redis://localhost:6379/0")
)
output_guardrail = OutputGuardrail()


class ChatRequest(BaseModel):
    message: str
    session_id: Optional[str] = None


class ChatResponse(BaseModel):
    response: str
    session_id: str
    warning: Optional[str] = None


@app.post("/chat", response_model=ChatResponse)
async def chat(
    request: ChatRequest,
    credentials: HTTPAuthorizationCredentials = Depends(security)
):
    """
    Main chat endpoint with three-layer guardrail protection.
    """
    # Placeholder identity: the bearer token doubles as the user ID here.
    # In production, verify the token and map it to a real user ID.
    user_id = credentials.credentials

    # Step 1: Input Guardrail - sanitize user input.
    # Sanitize on any finding; risk_score only crosses 0.5 at three findings,
    # so gating on requires_action would let a single SSN through.
    input_analysis = input_guardrail.analyze_input(request.message)
    if input_analysis["has_pii"]:
        # Log the incident for security audit (never log the token or the PII itself)
        print(f"PII detected in input (risk score {input_analysis['risk_score']:.2f})")
        sanitized_message = input_guardrail.sanitize_input(request.message)
    else:
        sanitized_message = request.message

    # Step 2: Context Guardrail - manage session
    session = None
    if request.session_id:
        session = context_guardrail.get_session(request.session_id)
        # Never attach a caller to a session that belongs to another user
        if session and session.user_id != user_id:
            session = None
    if not session:
        # No session, or it was expired, invalid, or not owned by this user
        session = context_guardrail.create_session(user_id)

    # Step 3: Generate response (simplified - in production use LangChain)
    # This is where you'd call your actual LLM chain
    raw_response = f"Echo: {sanitized_message}"  # Placeholder

    # Step 4: Output Guardrail - validate response
    validation = output_guardrail.validate_response(
        raw_response,
        context={"user_id": user_id}
    )
    if validation["action"] == "block":
        # Log blocked response for review
        print(f"Blocked response for session {session.session_id}: {validation['reason']}")
        raise HTTPException(
            status_code=400,
            detail="Response blocked by safety guardrails"
        )

    # Step 5: Update context with validated response.
    # Store the sanitized message so raw PII never lands in Redis.
    context_guardrail.update_context(
        session.session_id,
        sanitized_message,
        raw_response
    )

    warning = None
    if validation["action"] == "flag":
        warning = f"Response flagged: {validation['reason']}"

    return ChatResponse(
        response=raw_response,
        session_id=session.session_id,
        warning=warning
    )


@app.get("/health")
async def health_check():
    return {"status": "healthy", "guardrails": "active"}
Pitfalls & Production Tips
1. The Token Budget Trap
Most developers set context windows too large. A 50-message context with GPT-4 costs about $0.03 per request just for the context tokens. At 1000 requests/hour, that's $30/hour in context costs alone. Keep context windows small and use summarization for older messages.
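The arithmetic behind those figures is worth keeping in a helper so you can plug in your own numbers. The inputs below are assumptions for illustration, not measurements: roughly 1,000 context tokens per request at $0.03 per 1K input tokens, which reproduces the $0.03 per request used above. Check your provider's current pricing before relying on it.

```python
def context_cost_per_request(context_tokens: int, price_per_1k_tokens: float) -> float:
    """Dollar cost of the context portion of a single request."""
    return context_tokens / 1000 * price_per_1k_tokens


def hourly_cost(cost_per_request: float, requests_per_hour: int) -> float:
    """Dollar cost per hour at a steady request rate."""
    return cost_per_request * requests_per_hour


# Assumed inputs (illustrative only): ~1,000 context tokens, $0.03 per 1K input tokens.
per_request = context_cost_per_request(context_tokens=1000, price_per_1k_tokens=0.03)
print(f"${per_request:.2f} per request")
print(f"${hourly_cost(per_request, 1000):.2f} per hour at 1000 requests/hour")

# Halving the context halves the bill, which is the case for trimming and summarizing.
print(f"${hourly_cost(context_cost_per_request(500, 0.03), 1000):.2f} per hour with half the context")
```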
2. Session Hijacking via Token Reuse
The session ID in our implementation is a hash of user_id and timestamp, so anyone who knows the user ID and the approximate creation time can guess it. In production, use a cryptographically secure random token and store it in an HTTP-only cookie, not in the request body. Sending session_id in the request body means client-side JavaScript has to hold it, which exposes it to XSS attacks.
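A minimal sketch of both recommendations using only the standard library: `secrets.token_urlsafe` for the ID and `http.cookies.SimpleCookie` to build the `Set-Cookie` value. The helper names and the cookie name `session_id` are assumptions; in a FastAPI app you would set the same attributes through the response object instead.

```python
import secrets
from http.cookies import SimpleCookie


def new_session_id() -> str:
    """Unguessable session ID: 32 random bytes, URL-safe encoded."""
    return secrets.token_urlsafe(32)


def session_cookie_header(session_id: str, max_age: int = 3600) -> str:
    """Build a Set-Cookie value that scripts cannot read and that is only sent over HTTPS."""
    cookie = SimpleCookie()
    cookie["session_id"] = session_id
    cookie["session_id"]["httponly"] = True      # not readable from JavaScript
    cookie["session_id"]["secure"] = True        # only sent over HTTPS
    cookie["session_id"]["samesite"] = "Strict"  # not sent on cross-site requests
    cookie["session_id"]["max-age"] = max_age    # matches the 1 hour session TTL
    cookie["session_id"]["path"] = "/"
    return cookie["session_id"].OutputString()


print(session_cookie_header(new_session_id()))
```

Unlike the hash of user_id and timestamp, this ID carries no information about the user and cannot be reconstructed from known inputs.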
3. The Validation Loop Problem
If your output guardrail uses an LLM, that LLM can also produce unsafe responses. We mitigate this by using a different model (gpt-4o-mini vs gpt-4) and setting temperature to 0. But this adds latency—expect 200-500ms per validation call. For high-throughput systems, consider using a dedicated safety classifier like Llama Guard instead.
4. Rate Limiting at the Wrong Layer
Don't rate limit at the API level alone. A single user can open 100 sessions and exhaust your context budget. Implement per-user rate limiting at the session creation level, and per-session rate limiting at the message level. Redis is good for this, but watch out for race conditions in distributed deployments.
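The two levels can share one limiter keyed differently. The class below is an in-memory, single-process sketch; `SlidingWindowLimiter` and the limits are hypothetical. A distributed deployment would need the check-and-record step to be atomic in a shared store, which is where the race conditions mentioned above come from.

```python
import time
from collections import defaultdict, deque
from typing import Deque, Dict, Optional


class SlidingWindowLimiter:
    """Allow at most `max_events` per key within the last `window_seconds`."""

    def __init__(self, max_events: int, window_seconds: float):
        self.max_events = max_events
        self.window_seconds = window_seconds
        self._events: Dict[str, Deque[float]] = defaultdict(deque)

    def allow(self, key: str, now: Optional[float] = None) -> bool:
        now = time.monotonic() if now is None else now
        events = self._events[key]
        # Drop timestamps that have aged out of the window.
        while events and now - events[0] >= self.window_seconds:
            events.popleft()
        if len(events) >= self.max_events:
            return False
        events.append(now)
        return True


# Hypothetical limits: 5 new sessions per user per hour, 20 messages per session per minute.
session_creation = SlidingWindowLimiter(max_events=5, window_seconds=3600)
session_messages = SlidingWindowLimiter(max_events=20, window_seconds=60)

print(all(session_creation.allow("user-1", now=t) for t in range(5)))
print(session_creation.allow("user-1", now=5))
```

Call `session_creation.allow(user_id)` before creating a session and `session_messages.allow(session_id)` before each message, rejecting the request when either returns False.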
5. The False Positive Problem
Output guardrails will block legitimate responses. Our regex for credit card numbers will block any run of 13 to 16 digits, including order IDs and other long numeric identifiers. You need a feedback loop where users can report false positives, and you need to tune your patterns regularly. Expect a 1-3% false positive rate even with good tuning.
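One common way to cut false positives on card numbers is to keep the regex as a candidate finder and confirm each match with the Luhn checksum, which real card numbers satisfy. This is a sketch of that technique, not what the guardrail above does; `luhn_valid` and `find_card_numbers` are hypothetical helpers. It reduces false positives rather than eliminating them, since about one in ten random digit strings also passes the checksum.

```python
import re
from typing import List

# Same candidate pattern as the guardrail's credit_card regex.
CARD_CANDIDATE = re.compile(r'\b(?:\d[ -]*?){13,16}\b')


def luhn_valid(number: str) -> bool:
    """Return True if the digits in `number` pass the Luhn checksum."""
    digits = [int(ch) for ch in number if ch.isdigit()]
    if not 13 <= len(digits) <= 19:
        return False
    total = 0
    # Walk from the rightmost digit, doubling every second one.
    for index, digit in enumerate(reversed(digits)):
        if index % 2 == 1:
            digit *= 2
            if digit > 9:
                digit -= 9
        total += digit
    return total % 10 == 0


def find_card_numbers(text: str) -> List[str]:
    """Regex candidates that also pass the Luhn check."""
    return [m.group() for m in CARD_CANDIDATE.finditer(text) if luhn_valid(m.group())]


# 4111 1111 1111 1111 is a widely used test card number; the order ID is made up.
print(find_card_numbers("Card 4111 1111 1111 1111 on file"))
print(find_card_numbers("Order 1234567890123456 has shipped"))
```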
6. Context Leakage Across Sessions
If you're using a vector store for RAG, make sure each user's documents are isolated. A common mistake is storing all documents in one collection and filtering by user_id in the query. This works until someone forgets the filter. Use separate collections or namespaces per user.
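The idea is to make the user boundary part of the storage structure, so a forgotten filter cannot leak anything. The toy store below uses plain substring matching in place of vector search, and `NamespacedDocumentStore` is a hypothetical class, not a real vector store API. What carries over is the shape: every read and write requires a user ID, and there is no method that searches across users.

```python
from collections import defaultdict
from typing import Dict, List


class NamespacedDocumentStore:
    """Toy document store with one isolated namespace per user."""

    def __init__(self) -> None:
        self._namespaces: Dict[str, List[str]] = defaultdict(list)

    def add(self, user_id: str, document: str) -> None:
        self._namespaces[user_id].append(document)

    def search(self, user_id: str, term: str) -> List[str]:
        # Only this user's namespace is ever read; .get avoids creating empty namespaces.
        documents = self._namespaces.get(user_id, [])
        return [doc for doc in documents if term.lower() in doc.lower()]


store = NamespacedDocumentStore()
store.add("alice", "Alice balance report")
store.add("bob", "Bob balance report")

print(store.search("alice", "balance"))
print(store.search("carol", "balance"))
```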
What's Next
This guardrail system handles the basics, but production systems need more:
- Bias detection: The DeBiasMe paper shows that metacognitive interventions can reduce bias in AI interactions. Implement a bias detection layer that flags responses for demographic bias.
- Data-frame dynamics: The Supporting Data-Frame Dynamics paper demonstrates that structured interaction frameworks improve decision quality. Consider adding explicit decision-making workflows for high-stakes queries.
- Code as interface: The Will Code Remain a Relevant User Interface paper questions whether natural language will replace code for end-user programming. For now, hybrid approaches work best—let users write code for complex tasks but validate it through guardrails.
The key insight from building this system: security isn't a feature, it's an architecture. You can't add guardrails after deployment and expect them to work. They need to be part of every interaction layer, from input to context to output. Start with the three-layer approach here, then add layers as your threat model evolves.