How to Build Secure AI Assistants with User Interaction Guardrails

AI assistants are moving from chatbots to production systems that handle sensitive data, make decisions, and interact with users in complex ways. The challenge isn't just building a smart assistant—it's building one that doesn't leak data, amplify biases, or make dangerous decisions. This tutorial walks through building a production-grade AI assistant with explicit guardrails for user interaction and security, using Python, LangChain [10], and FastAPI.

Why Guardrails Matter More Than Intelligence

In 2025, a major financial institution had to take down its customer service AI after it accidentally revealed another user's account balance during a conversation. This isn't a hypothetical; it's what happens when AI is deployed without proper interaction controls. The DeBiasMe paper on arXiv [5] highlights how metacognitive interventions can reduce bias in human-AI interactions, but these interventions need to be baked into the architecture, not bolted on later.

The core problem: LLM calls are stateless by default, but user interactions are deeply contextual. A user might ask "what's my balance?" and the assistant needs to know which user, which account, and whether the previous conversation established identity. Without guardrails, the assistant either trusts everything (dangerous) or trusts nothing (useless).

Architecture Overview: The Three-Layer Guardrail System

We're building a system with three distinct guardrail layers:

- Input Guardrails: Validate and sanitize user input before it reaches the LLM
- Context Guardrails: Manage conversation state and user identity securely
- Output Guardrails: Filter and validate LLM responses before returning to user

This follows patterns from the Supporting Data-Frame Dynamics paper, which shows that AI-assisted decision making requires structured interaction frameworks to prevent errors.
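As a rough sketch of how the layers compose, the flow can be written as a small pipeline that stops at the first refusal. The `GuardrailResult` type, `run_pipeline`, and the toy layers below are hypothetical names used only for illustration; the context layer is left out to keep the example short, and the real guardrails are built in the steps that follow.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class GuardrailResult:
    """Outcome of one guardrail layer (hypothetical type, for illustration)."""
    allowed: bool
    text: str
    reason: Optional[str] = None


Layer = Callable[[str], GuardrailResult]


def run_pipeline(message: str, input_layer: Layer,
                 generate: Callable[[str], str],
                 output_layer: Layer) -> GuardrailResult:
    """Input check -> LLM call -> output check, stopping at the first refusal."""
    checked_input = input_layer(message)
    if not checked_input.allowed:
        return checked_input
    draft = generate(checked_input.text)
    return output_layer(draft)


# Toy layers standing in for the real guardrails built later in the tutorial.
def toy_input_layer(text: str) -> GuardrailResult:
    if "password" in text.lower():
        return GuardrailResult(False, "", "input mentions a password")
    return GuardrailResult(True, text.strip())


def toy_output_layer(text: str) -> GuardrailResult:
    return GuardrailResult(True, text)


def toy_llm(prompt: str) -> str:
    return f"You said: {prompt}"


print(run_pipeline("  hello  ", toy_input_layer, toy_llm, toy_output_layer))
print(run_pipeline("my password is hunter2", toy_input_layer, toy_llm, toy_output_layer))
```

The point of the shape is that the LLM is never called when the input layer refuses, and nothing reaches the caller without passing the output layer.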

Prerequisites and Environment Setup

# Create a virtual environment
python -m venv guardrail-env
source guardrail-env/bin/activate  # On Windows: guardrail-env\Scripts\activate

# Install core dependencies
pip install langchain==0.3.14 langchain-openai==0.2.14 fastapi==0.115.6 uvicorn==0.34.0
pip install pydantic==2.10.3 python-multipart==0.0.18
pip install redis==5.2.1  # For session management
pip install presidio-analyzer==2.2.351 presidio-anonymizer==2.2.351  # PII detection
pip install spacy==3.8.2  # For NLP-based guardrails

# Download spaCy model
python -m spacy download en_core_web_sm

You'll need an OpenAI API key (or any LLM provider). Set it as an environment variable:

export OPENAI_API_KEY="sk-your-key-here"
export REDIS_URL="redis://localhost:6379/0"
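One way to consume these variables is to read them once at startup and fail fast if the key is missing. This is a minimal sketch; the `Settings` class and `load_settings` function are hypothetical helpers, not part of any library used in this tutorial.

```python
import os
from dataclasses import dataclass
from typing import Mapping


@dataclass(frozen=True)
class Settings:
    openai_api_key: str
    redis_url: str


def load_settings(env: Mapping[str, str] = os.environ) -> Settings:
    """Read required settings and fail fast if the API key is missing."""
    api_key = env.get("OPENAI_API_KEY", "")
    if not api_key:
        raise RuntimeError("OPENAI_API_KEY is not set")
    # Same default the tutorial uses for a local Redis instance.
    redis_url = env.get("REDIS_URL", "redis://localhost:6379/0")
    return Settings(openai_api_key=api_key, redis_url=redis_url)
```

Passing the environment as a parameter keeps the function easy to test without touching the real process environment.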

Core Implementation: Building the Guardrail System

Step 1: Input Guardrails with PII Detection

The first line of defense is preventing sensitive data from reaching the LLM. We use Microsoft's Presidio for PII detection:

# guardrails/input_guardrails.py
from presidio_analyzer import AnalyzerEngine
from presidio_anonymizer import AnonymizerEngine
from typing import Dict, List, Optional
import re

class InputGuardrail:
    """
    Validates and sanitizes user input before it reaches the LLM.
    Uses Presidio for PII detection and custom regex patterns for security.
    """

    def __init__(self):
        self.analyzer = AnalyzerEngine()
        self.anonymizer = AnonymizerEngine()

        # Custom security patterns - these catch things Presidio might miss
        self.security_patterns = {
            "api_key": r'(?i)(?:api[_-]?key|secret|token)[\s:=]+["\']?[a-zA-Z0-9_\-]{16,}["\']?',
            "ssn": r'\b\d{3}-\d{2}-\d{4}\b',
            "credit_card": r'\b(?:\d[ -]*?){13,16}\b',
        }

    def analyze_input(self, text: str) -> Dict:
        """
        Analyze input text for PII and security issues.
        Returns a dict with findings and risk score.
        """
        # Run Presidio analysis
        presidio_results = self.analyzer.analyze(
            text=text,
            entities=["PHONE_NUMBER", "EMAIL_ADDRESS", "CREDIT_CARD", 
                     "US_SSN", "US_BANK_NUMBER"],
            language="en"
        )

        # Run custom regex patterns
        custom_findings = []
        for pattern_name, pattern in self.security_patterns.items():
            matches = re.finditer(pattern, text)
            for match in matches:
                custom_findings.append({
                    "entity_type": pattern_name,
                    "start": match.start(),
                    "end": match.end(),
                    "score": 0.95  # High confidence for regex matches
                })

        # Calculate risk score
        total_findings = len(presidio_results) + len(custom_findings)
        risk_score = min(total_findings * 0.25, 1.0)  # Cap at 1.0

        return {
            "has_pii": total_findings > 0,
            "risk_score": risk_score,
            "findings": presidio_results + custom_findings,
            "requires_action": risk_score > 0.5
        }

    def sanitize_input(self, text: str) -> str:
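        """
        Sanitize input by replacing PII with placeholders.
        Returns the sanitized text.
        """
        analysis = self.analyze_input(text)

        if not analysis["has_pii"]:
            return text

        # "findings" mixes Presidio RecognizerResult objects with plain dicts
        # from the custom regexes; the anonymizer only accepts the former.
        presidio_findings = [
            f for f in analysis["findings"] if not isinstance(f, dict)
        ]

        # Use Presidio's anonymizer for structured PII
        anonymized_text = self.anonymizer.anonymize(
            text=text,
            analyzer_results=presidio_findings
        ).text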

        # Apply custom regex replacements for security patterns
        for pattern_name, pattern in self.security_patterns.items():
            anonymized_text = re.sub(
                pattern, 
                f"[REDACTED_{pattern_name.upper()}]", 
                anonymized_text
            )

        return anonymized_text

Why this matters: Without input guardrails, users can accidentally (or intentionally) inject sensitive data into prompts. The risk score system lets you decide whether to sanitize, block, or flag the interaction. In production, you'd log all PII detections for audit trails.
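To make the sanitize, block, or flag decision concrete, here is a small sketch that maps the risk score to an action. The `risk_score` function repeats the formula from `analyze_input`; `choose_action` and its 0.75 threshold are illustrative assumptions, not values from the implementation above.

```python
def risk_score(finding_count: int) -> float:
    """Same formula as analyze_input: 0.25 per finding, capped at 1.0."""
    return min(finding_count * 0.25, 1.0)


def choose_action(score: float, block_at: float = 0.75) -> str:
    """Map a risk score to an action. The threshold is an assumed example."""
    if score >= block_at:
        return "block"
    if score > 0.0:
        return "sanitize"
    return "allow"


for findings in (0, 1, 3):
    score = risk_score(findings)
    print(findings, score, choose_action(score))
```

With these assumed thresholds, a single detected entity is redacted and passed on, while three or more findings in one message are refused outright.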

Step 2: Context Guardrails with Session Management

The second layer manages conversation state and user identity. This is where most security failures happen—when the assistant confuses one user's context with another's.

# guardrails/context_guardrails.py
import redis
import json
import hashlib
from datetime import datetime, timedelta
from typing import Optional, Dict, List
from pydantic import BaseModel, Field

class SessionContext(BaseModel):
    """Pydantic model for session data with validation"""
    user_id: str
    session_id: str
    created_at: datetime = Field(default_factory=datetime.utcnow)
    last_active: datetime = Field(default_factory=datetime.utcnow)
    message_count: int = 0
    max_messages: int = 100  # Prevent infinite context growth
    context_window: List[Dict] = Field(default_factory=list, max_length=50)

    # Pydantic v2 style config (replaces the deprecated inner `class Config`)
    model_config = {"arbitrary_types_allowed": True}

class ContextGuardrail:
    """
    Manages conversation context with security boundaries.
    Uses Redis for fast, distributed session storage.
    """

    def __init__(self, redis_url: str = "redis://localhost:6379/0"):
        self.redis_client = redis.from_url(redis_url)
        self.session_ttl = 3600  # 1 hour session timeout

    def create_session(self, user_id: str) -> SessionContext:
        """Create a new session with unique ID"""
        session_id = hashlib.sha256(
            f"{user_id}:{datetime.utcnow().isoformat()}".encode()
        ).hexdigest()[:16]

        session = SessionContext(
            user_id=user_id,
            session_id=session_id
        )

        # Store in Redis with TTL
        self.redis_client.setex(
            f"session:{session_id}",
            self.session_ttl,
            session.model_dump_json()
        )

        return session

    def get_session(self, session_id: str) -> Optional[SessionContext]:
        """Retrieve session with validation"""
        data = self.redis_client.get(f"session:{session_id}")
        if not data:
            return None

        session = SessionContext.model_validate_json(data)

        # Check if session has expired
        if datetime.utcnow() - session.last_active > timedelta(hours=1):
            self.redis_client.delete(f"session:{session_id}")
            return None

        # Check message limit
        if session.message_count >= session.max_messages:
            return None  # Session exhausted, force new session

        return session

    def update_context(self, session_id: str, 
                      user_message: str, 
                      assistant_response: str) -> bool:
        """Add to conversation context with size limits"""
        session = self.get_session(session_id)
        if not session:
            return False

        # Add to context window
        session.context_window.append({
            "role": "user",
            "content": user_message,
            "timestamp": datetime.utcnow().isoformat()
        })
        session.context_window.append({
            "role": "assistant",
            "content": assistant_response,
            "timestamp": datetime.utcnow().isoformat()
        })

        # Enforce context window size
        if len(session.context_window) > 50:
            # Keep only last 50 messages (25 exchanges)
            session.context_window = session.context_window[-50:]

        session.message_count += 1
        session.last_active = datetime.utcnow()

        # Update Redis
        self.redis_client.setex(
            f"session:{session_id}",
            self.session_ttl,
            session.model_dump_json()
        )

        return True

Edge case handling: The context window is limited to 50 messages to prevent token overflow and reduce hallucination risk from overly long contexts. The session TTL of 1 hour means inactive sessions are automatically cleaned up. The message count limit prevents abuse through infinite conversations.
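The three limits can be exercised without Redis. The sketch below uses an in-memory `ToySession` (a hypothetical stand-in for `SessionContext`) and takes the current time as a parameter so that expiry is easy to test; the constants mirror the values used above.

```python
from dataclasses import dataclass, field
from typing import Dict, List

MAX_WINDOW = 50      # messages kept in context, as in the tutorial
MAX_MESSAGES = 100   # exchanges allowed per session
SESSION_TTL = 3600   # seconds of inactivity before expiry


@dataclass
class ToySession:
    last_active: float
    message_count: int = 0
    window: List[Dict[str, str]] = field(default_factory=list)


def is_usable(session: ToySession, now: float) -> bool:
    """A session is usable if it is neither idle too long nor exhausted."""
    if now - session.last_active > SESSION_TTL:
        return False
    return session.message_count < MAX_MESSAGES


def record_exchange(session: ToySession, user_msg: str,
                    reply: str, now: float) -> None:
    """Append one exchange, then keep only the newest MAX_WINDOW messages."""
    session.window.append({"role": "user", "content": user_msg})
    session.window.append({"role": "assistant", "content": reply})
    session.window = session.window[-MAX_WINDOW:]
    session.message_count += 1
    session.last_active = now


session = ToySession(last_active=0.0)
for i in range(30):
    record_exchange(session, f"question {i}", f"answer {i}", now=float(i))

print(len(session.window))            # 50
print(session.window[0]["content"])   # question 5
```

After 30 exchanges (60 messages) the oldest five exchanges have been dropped, which is the behavior the 50-message cap is meant to produce.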

Step 3: Output Guardrails with Response Validation

The final layer validates what the LLM says before it reaches the user. This is critical for preventing hallucinated data, inappropriate content, or security leaks.

# guardrails/output_guardrails.py
from langchain_core.prompts import ChatPromptTemplate
from langchain_openai import ChatOpenAI
from typing import Dict, Optional
import json  # needed by validate_response to parse the validator's output
import re

class OutputGuardrail:
    """
    Validates LLM responses before returning to user.
    Uses a secondary LLM call for content safety checking.
    """

    def __init__(self, model_name: str = "gpt-4o-mini"):
        # Use a cheaper model for validation to reduce costs
        self.validator_llm = ChatOpenAI(
            model=model_name,
            temperature=0.0,  # Deterministic output for validation
            max_tokens=50  # Short responses for classification
        )

        self.validation_prompt = ChatPromptTemplate.from_messages([
            ("system", """You are a response validator. Analyze the following 
            assistant response and classify it. Return ONLY a JSON object with 
            these fields:
            - "safe": boolean (true if response is safe to show user)
            - "reason": string (why it passed or failed)
            - "contains_pii": boolean
            - "contains_hallucination": boolean (if response makes unverifiable claims)
            - "action": "pass" | "block" | "flag"

            Rules:
            - Block if response contains personal data, financial info, or credentials
            - Block if response makes specific claims about user data without context
            - Flag if response is speculative or uses uncertain language
            - Pass only if response is factual, safe, and contextually appropriate"""),
            ("human", "Response to validate: {response}")
        ])

    def validate_response(self, response: str, 
                         context: Optional[Dict] = None) -> Dict:
        """
        Validate LLM response. Returns validation result with action.
        """
        # Quick regex checks before LLM validation (cheaper)
        quick_checks = self._quick_safety_checks(response)
        if quick_checks["block"]:
            return {
                "safe": False,
                "action": "block",
                "reason": quick_checks["reason"],
                "contains_pii": True,
                "contains_hallucination": False
            }

        # LLM-based validation for nuanced checks
        chain = self.validation_prompt | self.validator_llm
        try:
            result = chain.invoke({"response": response})
            validation = json.loads(result.content)

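            # Escalate a "pass" to "flag" if the quick checks found uncertain
            # language; never downgrade a "block" from the validator
            if quick_checks["flag"] and validation.get("action") == "pass":
                validation["action"] = "flag"
                validation["reason"] = (
                    f"{validation.get('reason', '')} | Quick check: {quick_checks['reason']}"
                )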

            return validation

        except Exception as e:  # includes json.JSONDecodeError
            # If validation fails, block by default (fail closed)
            return {
                "safe": False,
                "action": "block",
                "reason": f"Validation system error: {str(e)}",
                "contains_pii": False,
                "contains_hallucination": False
            }

    def _quick_safety_checks(self, text: str) -> Dict:
        """
        Fast regex-based safety checks before LLM validation.
        Catches obvious issues without API calls.
        """
        block_patterns = {
            "ssn": r'\b\d{3}-\d{2}-\d{4}\b',
            "credit_card": r'\b(?:\d[ -]*?){13,16}\b',
            "api_key": r'(?i)(?:api[_-]?key|secret|token)[\s:=]+["\']?[a-zA-Z0-9_\-]{16,}["\']?',
        }

        flag_patterns = {
            "uncertainty": r'\b(maybe|perhaps|possibly|might|could be|not sure)\b',
            "speculation": r'\b(I think|I believe|I guess|probably)\b',
        }

        for pattern_name, pattern in block_patterns.items():
            if re.search(pattern, text):
                return {
                    "block": True,
                    "flag": False,
                    "reason": f"Detected {pattern_name} in response"
                }

        flags = []
        for pattern_name, pattern in flag_patterns.items():
            if re.search(pattern, text, re.IGNORECASE):  # also match "Maybe", "Probably"
                flags.append(pattern_name)

        return {
            "block": False,
            "flag": len(flags) > 0,
            "reason": f"Flagged patterns: {', '.join(flags)}" if flags else ""
        }

Production consideration: The output guardrail uses a two-stage approach: fast regex checks first, then LLM validation for nuanced cases. This reduces costs because any response that trips a block pattern is rejected before the validator is called; everything else still costs one extra LLM call. The "fail closed" approach means that if the validator itself fails, we block the response rather than letting it through.
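The two-stage, fail-closed behavior can be checked in isolation by swapping the LLM for a stub. In this sketch, `validate`, `good_classifier`, and `broken_classifier` are hypothetical names; the classifier is any callable that returns the validator's raw text, so no API call is made.

```python
import json
import re
from typing import Callable, Dict

SSN_PATTERN = re.compile(r"\b\d{3}-\d{2}-\d{4}\b")


def validate(response: str, classify: Callable[[str], str]) -> Dict[str, str]:
    """Stage 1: cheap regex. Stage 2: classifier. Any classifier failure blocks."""
    if SSN_PATTERN.search(response):
        return {"action": "block", "reason": "regex: ssn"}
    try:
        verdict = json.loads(classify(response))
        action = verdict["action"]
        if action not in {"pass", "flag", "block"}:
            raise ValueError(f"unexpected action: {action!r}")
        return {"action": action, "reason": str(verdict.get("reason", ""))}
    except Exception as exc:  # fail closed on bad JSON, missing keys, timeouts
        return {"action": "block", "reason": f"validator error: {exc}"}


def good_classifier(text: str) -> str:
    """Stub that behaves like a well-formed validator reply."""
    return json.dumps({"action": "pass", "reason": "looks fine"})


def broken_classifier(text: str) -> str:
    """Stub that returns prose instead of JSON, as LLMs sometimes do."""
    return "Sure! Here is the JSON you asked for..."


print(validate("Your order has shipped.", good_classifier))
print(validate("Your order has shipped.", broken_classifier))
print(validate("Your SSN is 123-45-6789.", good_classifier))
```

The second call shows why failing closed matters: a validator that answers in prose instead of JSON results in a block, not a silent pass.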

Step 4: FastAPI Integration

Now we wire everything together into a production API:

# main.py
from fastapi import FastAPI, HTTPException, Depends
from fastapi.security import HTTPBearer, HTTPAuthorizationCredentials
from pydantic import BaseModel
from typing import Optional
import uuid

from guardrails.input_guardrails import InputGuardrail
from guardrails.context_guardrails import ContextGuardrail
from guardrails.output_guardrails import OutputGuardrail

app = FastAPI(title="Secure AI Assistant API")
security = HTTPBearer()

# Initialize guardrails
input_guardrail = InputGuardrail()
context_guardrail = ContextGuardrail()
output_guardrail = OutputGuardrail()

class ChatRequest(BaseModel):
    message: str
    session_id: Optional[str] = None

class ChatResponse(BaseModel):
    response: str
    session_id: str
    warning: Optional[str] = None

@app.post("/chat", response_model=ChatResponse)
async def chat(
    request: ChatRequest,
    credentials: HTTPAuthorizationCredentials = Depends(security)
):
    """
    Main chat endpoint with three-layer guardrail protection.
    """
    # Step 1: Input Guardrail - sanitize user input
    input_analysis = input_guardrail.analyze_input(request.message)

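    if input_analysis["has_pii"]:
        # Log the incident for the security audit trail. Never log the raw
        # bearer token or the message itself; the risk score is enough here.
        print(f"PII detected in input (risk score {input_analysis['risk_score']})")

        # Sanitize on any finding; "requires_action" only marks high-risk
        # inputs that may deserve blocking or manual review.
        sanitized_message = input_guardrail.sanitize_input(request.message)
    else:
        sanitized_message = request.message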

    # Step 2: Context Guardrail - manage session
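    if request.session_id:
        session = context_guardrail.get_session(request.session_id)
        # Treat a session owned by a different caller the same as a missing one
        if not session or session.user_id != credentials.credentials:
            # Session expired, invalid, or not owned by this caller: create a new one
            session = context_guardrail.create_session(credentials.credentials)
    else:
        session = context_guardrail.create_session(credentials.credentials)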

    # Step 3: Generate response (simplified - in production use LangChain)
    # This is where you'd call your actual LLM chain
    raw_response = f"Echo: {sanitized_message}"  # Placeholder

    # Step 4: Output Guardrail - validate response
    validation = output_guardrail.validate_response(
        raw_response,
        context={"user_id": credentials.credentials}
    )

    if validation["action"] == "block":
        # Log blocked response for review
        print(f"Blocked response for session {session.session_id}: {validation['reason']}")
        raise HTTPException(
            status_code=400,
            detail="Response blocked by safety guardrails"
        )

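    # Step 5: Update context with validated response.
    # Store the sanitized message so raw PII never lands in Redis
    # or gets replayed to the LLM in later turns.
    context_guardrail.update_context(
        session.session_id,
        sanitized_message,
        raw_response
    )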

    warning = None
    if validation["action"] == "flag":
        warning = f"Response flagged: {validation['reason']}"

    return ChatResponse(
        response=raw_response,
        session_id=session.session_id,
        warning=warning
    )

@app.get("/health")
async def health_check():
    return {"status": "healthy", "guardrails": "active"}

Pitfalls & Production Tips

1. The Token Budget Trap

Most developers set context windows too large. A 50-message context with GPT-4 costs about $0.03 per request just for the context tokens. At 1000 requests/hour, that's $30/hour in context costs alone. Keep context windows small and use summarization for older messages.
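The arithmetic behind that estimate is easy to reproduce. The figures below are assumptions chosen to match the numbers in the text: 20 tokens per message and $0.03 per 1,000 input tokens. Real messages are often much longer, and prices vary by model and change over time, so treat this as a calculator rather than a quote.

```python
def context_cost_per_request(messages: int, tokens_per_message: int,
                             usd_per_1k_input_tokens: float) -> float:
    """Input-token cost of resending the whole context with one request."""
    total_tokens = messages * tokens_per_message
    return total_tokens / 1000 * usd_per_1k_input_tokens


# Assumed figures, chosen to reproduce the estimate in the text.
per_request = context_cost_per_request(
    messages=50, tokens_per_message=20, usd_per_1k_input_tokens=0.03
)
per_hour = per_request * 1000  # 1000 requests/hour
print(f"${per_request:.2f} per request, ${per_hour:.2f} per hour")
# prints: $0.03 per request, $30.00 per hour
```

At 200 tokens per message instead of 20, the same formula gives ten times the cost, which is why trimming or summarizing older messages pays off quickly.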

2. Session Hijacking via Token Reuse

The session ID in our implementation is a truncated SHA-256 hash of user_id and a timestamp, so anyone who can guess both inputs can reconstruct it. In production, use a cryptographically secure random token and store it in an HTTP-only cookie rather than in the request body. A session_id carried in the request body has to be handled by client-side JavaScript, which means an XSS payload can read it and reuse it.
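A minimal sketch of the safer alternative uses Python's standard `secrets` module for the token and `hmac.compare_digest` for comparison. The function names `new_session_id` and `same_token` are hypothetical.

```python
import hmac
import secrets


def new_session_id() -> str:
    """Return an unguessable session ID (32 random bytes, URL-safe base64)."""
    return secrets.token_urlsafe(32)


def same_token(expected: str, presented: str) -> bool:
    """Constant-time comparison, so timing does not reveal matching prefixes."""
    return hmac.compare_digest(expected.encode(), presented.encode())


token = new_session_id()
print(len(token))                              # 43
print(same_token(token, token))                # True
print(same_token(token, new_session_id()))     # False
```

On the FastAPI side, the token would typically be set with the response object's `set_cookie` method using `httponly=True`, `secure=True`, and a strict `samesite` policy; check the Starlette documentation for the exact parameters in your version.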

3. The Validation Loop Problem

If your output guardrail uses an LLM, that LLM can also produce unsafe responses. We mitigate this by using a different model (gpt-4o-mini vs gpt-4) and setting temperature to 0. But this adds latency—expect 200-500ms per validation call. For high-throughput systems, consider using a dedicated safety classifier like Llama Guard instead.

4. Rate Limiting at the Wrong Layer

Don't rate limit at the API level alone. A single user can open 100 sessions and exhaust your context budget. Implement per-user rate limiting at the session creation level, and per-session rate limiting at the message level. Redis is good for this, but watch out for race conditions in distributed deployments.
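The two levels can share one limiter keyed differently. The sketch below is an in-memory fixed-window counter, used here only to show the logic; the class name and the limits are assumptions. A Redis-backed version would need the increment and the expiry to happen atomically, which is where the race conditions mentioned above come from.

```python
from collections import defaultdict
from typing import DefaultDict, Tuple


class FixedWindowLimiter:
    """Allow at most `limit` events per key in each `window_seconds` window."""

    def __init__(self, limit: int, window_seconds: int) -> None:
        self.limit = limit
        self.window_seconds = window_seconds
        # Old windows are never purged here; a real store would expire them.
        self._counts: DefaultDict[Tuple[str, int], int] = defaultdict(int)

    def allow(self, key: str, now: float) -> bool:
        window = int(now // self.window_seconds)
        bucket = (key, window)
        if self._counts[bucket] >= self.limit:
            return False
        self._counts[bucket] += 1
        return True


# Hypothetical limits: 5 new sessions per user per hour,
# 20 messages per session per minute.
session_creation = FixedWindowLimiter(limit=5, window_seconds=3600)
session_messages = FixedWindowLimiter(limit=20, window_seconds=60)

results = [session_creation.allow("user:alice", now=10.0) for _ in range(6)]
print(results)  # [True, True, True, True, True, False]
```

Keying the first limiter by user and the second by session ID means that opening many sessions does not multiply a user's message budget unchecked.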

5. The False Positive Problem

Output guardrails will block legitimate responses. Our regex for credit card numbers will block any run of 13 to 16 digits, including order IDs, tracking numbers, and long phone numbers. You need a feedback loop where users can report false positives, and you need to tune your patterns regularly. Expect a 1-3% false positive rate even with good tuning.
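One common way to cut false positives on card numbers is to run the Luhn checksum on each regex candidate, since real card numbers pass it and most arbitrary digit runs do not. This is a sketch of that idea layered on the tutorial's pattern; `luhn_valid` and `find_card_numbers` are hypothetical helper names, and 4111 1111 1111 1111 is a widely used test number, not a real card.

```python
import re
from typing import List

# Same candidate pattern as the guardrails above.
CARD_CANDIDATE = re.compile(r"\b(?:\d[ -]*?){13,16}\b")


def luhn_valid(digits: str) -> bool:
    """Luhn checksum: double every second digit from the right, sum, mod 10."""
    total = 0
    for index, char in enumerate(reversed(digits)):
        value = int(char)
        if index % 2 == 1:
            value *= 2
            if value > 9:
                value -= 9
        total += value
    return total % 10 == 0


def find_card_numbers(text: str) -> List[str]:
    """Return digit strings that match the pattern AND pass the Luhn check."""
    hits = []
    for match in CARD_CANDIDATE.finditer(text):
        digits = re.sub(r"\D", "", match.group())
        if luhn_valid(digits):
            hits.append(digits)
    return hits


text = "Order 1234567890123456 was paid with card 4111 1111 1111 1111."
print(find_card_numbers(text))  # ['4111111111111111']
```

The checksum does not remove false positives entirely, because roughly one in ten random digit runs passes it by chance, so a user feedback loop is still needed.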

6. Context Leakage Across Sessions

If you're using a vector store for RAG, make sure each user's documents are isolated. A common mistake is storing all documents in one collection and filtering by user_id in the query. This works until someone forgets the filter. Use separate collections or namespaces per user.
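The difference between filtering and isolating shows up in the interface. In the toy store below (a hypothetical in-memory stand-in for a vector store, with substring matching instead of embeddings), the user ID is a required argument and selects the only namespace that is ever scanned.

```python
from collections import defaultdict
from typing import DefaultDict, List


class NamespacedStore:
    """Toy document store where the user ID is part of every call signature."""

    def __init__(self) -> None:
        self._docs: DefaultDict[str, List[str]] = defaultdict(list)

    def add(self, user_id: str, document: str) -> None:
        self._docs[user_id].append(document)

    def search(self, user_id: str, term: str) -> List[str]:
        # Only this user's namespace is scanned; there is no
        # "search everything" code path that a missing filter could hit.
        return [d for d in self._docs.get(user_id, [])
                if term.lower() in d.lower()]


store = NamespacedStore()
store.add("alice", "Alice's balance is 100")
store.add("bob", "Bob's balance is 5")

print(store.search("alice", "balance"))  # ["Alice's balance is 100"]
print(store.search("carol", "balance"))  # []
```

Most hosted vector databases expose a comparable concept (collections, namespaces, or tenants); the design choice is to make the tenant a mandatory part of the query rather than an optional filter.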

What's Next

This guardrail system handles the basics, but production systems need more:

- Bias detection: The DeBiasMe paper shows that metacognitive interventions can reduce bias in AI interactions. Implement a bias detection layer that flags responses for demographic bias.
- Data-frame dynamics: The Supporting Data-Frame Dynamics paper demonstrates that structured interaction frameworks improve decision quality. Consider adding explicit decision-making workflows for high-stakes queries.
- Code as interface: The Will Code Remain a Relevant User Interface paper questions whether natural language will replace code for end-user programming. For now, hybrid approaches work best: let users write code for complex tasks but validate it through guardrails.

The key insight from building this system: security isn't a feature, it's an architecture. You can't add guardrails after deployment and expect them to work. They need to be part of every interaction layer, from input to context to output. Start with the three-layer approach here, then add layers as your threat model evolves.

References

1. GPT. Wikipedia.

2. LangChain. Wikipedia.

3. OpenAI. Wikipedia.

4. Learning Dexterous In-Hand Manipulation. arXiv.

5. DeBiasMe: De-biasing Human-AI Interactions with Metacognitiv. arXiv.

6. Significant-Gravitas/AutoGPT. GitHub.

7. langchain-ai/langchain. GitHub.

8. openai/openai-python. GitHub.

9. Shubhamsaboo/awesome-llm-apps. GitHub.

10. LangChain Pricing.
