How to Build a SOC Assistant with AI Threat Detection

How to Build a SOC Assistant with AI Threat Detection
- Why a SOC Assistant Matters in Production
- Prerequisites and Environment Setup
  - Install Required Packages
Core dependencies
ingestion.py
- Step 2: Creating the Vector Store for Threat Correlation

📺 Watch: Neural Networks Explained

Video by 3Blue1Brown

Security Operations Centers (SOCs) face an overwhelming volume of alerts daily—often exceeding 10,000 per day for mid-sized enterprises. Building an AI-powered SOC assistant isn't just about automation; it's about reducing mean time to detect (MTTD) and mean time to respond (MTTR) by intelligently triaging threats. In this tutorial, you'll construct a production-ready SOC assistant using Python, LangChain, and a vector database for real-time threat correlation.

Why a SOC Assistant Matters in Production

Modern SOC analysts spend 30-40% of their time on false positives, according to industry reports. An AI assistant can ingest logs, correlate indicators of compromise (IoCs), and generate contextual summaries. In production, this system must handle streaming data, enforce strict access controls, and maintain audit trails. We'll build a modular architecture that separates ingestion, analysis, and response—critical for scaling across multiple data sources.

Prerequisites and Environment Setup

Before writing code, ensure you have the following installed:

Python 3.10+ (3.11 recommended for performance)
pip 23.0+
Docker (for running local vector database)
At least 8GB RAM (16GB recommended for embedding models)

Install Required Packages

# Core dependencies
pip install langchain==0.1.16 langchain-community==0.0.29 langchain-openai==0.0.8
pip install chromadb==0.4.24 sentence-transformers==2.2.2
pip install fastapi==0.109.2 uvicorn==0.27.1 pydantic==2.6.1
pip install pandas==2.2.0 pyyaml==6.0.1 python-dotenv==1.0.1
pip install httpx==0.27.0 tenacity==8.2.3

Set Up Environment Variables

Create a .env file in your project root:

OPENAI_API_KEY=sk-your-key-here
LOG_LEVEL=INFO
MAX_ALERTS_PER_BATCH=100
VECTOR_DB_PATH=./data/chromadb

Edge Case: If you're using a local embedding model instead of OpenAI, set EMBEDDING_MODEL=all-MiniLM-L6-v2 and ensure you have at least 4GB free disk space for model caching.

Architecture: Ingestion, Embedding, and Response

Our SOC assistant follows a three-stage pipeline:

Ingestion Layer: Parses logs from JSON, CSV, or syslog formats
Analysis Layer: Embeds threat data into vector space and queries for similar incidents
Response Layer: Generates human-readable summaries with recommended actions

We'll use ChromaDB as our vector store because it's lightweight, supports persistent storage, and integrates natively with LangChain. For production at scale, consider migrating to Pinecone [8] or Weaviate.

Step 1: Building the Threat Ingestion Pipeline

Start with a robust log parser that handles malformed data gracefully. In production, logs often contain encoding errors or missing fields.

# ingestion.py
import json
import csv
import logging
from datetime import datetime
from typing import Dict, List, Optional
from pydantic import BaseModel, ValidationError

logger = logging.getLogger(__name__)

class ThreatEvent(BaseModel):
    timestamp: datetime
    source_ip: str
    destination_ip: str
    event_type: str  # e.g., "port_scan", "malware_detected"
    severity: int  # 1-5, where 5 is critical
    raw_log: str
    tags: List[str] = []

class LogIngestor:
    """Handles multiple log formats with error recovery."""

    def __init__(self, max_errors: int = 10):
        self.max_errors = max_errors
        self.error_count = 0

    def parse_json_log(self, file_path: str) -> List[ThreatEvent]:
        events = []
        with open(file_path, 'r') as f:
            for line_num, line in enumerate(f, 1):
                try:
                    data = json.loads(line.strip())
                    event = ThreatEvent(**data)
                    events.append(event)
                except (json.JSONDecodeError, ValidationError) as e:
                    self.error_count += 1
                    logger.warning(f"Line {line_num}: {e}")
                    if self.error_count > self.max_errors:
                        raise RuntimeError("Too many parse errors, aborting")
        return events

    def parse_csv_log(self, file_path: str) -> List[ThreatEvent]:
        events = []
        with open(file_path, 'r') as f:
            reader = csv.DictReader(f)
            for row_num, row in enumerate(reader, 2):  # Skip header
                try:
                    # Handle missing fields with defaults
                    event = ThreatEvent(
                        timestamp=datetime.fromisoformat(row.get('timestamp', '2024-01-01T00:00:00')),
                        source_ip=row.get('src_ip', '0.0.0.0'),
                        destination_ip=row.get('dst_ip', '0.0.0.0'),
                        event_type=row.get('type', 'unknown'),
                        severity=int(row.get('severity', 1)),
                        raw_log=str(row),
                        tags=row.get('tags', '').split(';') if row.get('tags') else []
                    )
                    events.append(event)
                except (ValueError, ValidationError) as e:
                    logger.error(f"Row {row_num}: {e}")
        return events

Production Consideration: The max_errors parameter prevents infinite loops on corrupted files. In a real SOC, you'd also implement backpressure—if the ingestion rate exceeds 1000 events/second, buffer to disk or use a message queue like Kafka.

Step 2: Creating the Vector Store for Threat Correlation

We'll embed threat events and store them in ChromaDB. The embedding model converts IP addresses, event types, and severity into 384-dimensional vectors. This enables semantic similarity search—finding "port scan from 10.0.0.1" when querying "reconnaissance activity."

# vector_store.py
import chromadb
from chromadb.config import Settings
from langchain.embeddings import OpenAIEmbeddings
from langchain.vectorstores import Chroma
from langchain.schema import Document
from typing import List
import hashlib

class ThreatVectorStore:
    def __init__(self, persist_directory: str = "./data/chromadb"):
        # Use persistent client for production
        self.client = chromadb.PersistentClient(
            path=persist_directory,
            settings=Settings(anonymized_telemetry=False)
        )
        self.collection_name = "threat_events"

        # Initialize embedding function
        self.embeddings = OpenAIEmbeddings(
            model="text-embedding-ada-002",
            disallowed_special=()
        )

        # Create or get collection
        try:
            self.collection = self.client.get_collection(self.collection_name)
        except ValueError:
            self.collection = self.client.create_collection(
                name=self.collection_name,
                metadata={"hnsw:space": "cosine"}  # Cosine similarity for threat matching
            )

        # Wrap in LangChain Chroma for easier querying
        self.vector_store = Chroma(
            client=self.client,
            collection_name=self.collection_name,
            embedding_function=self.embeddings
        )

    def add_threat_events(self, events: List[ThreatEvent]) -> int:
        """Add events to vector store with deduplication."""
        documents = []
        metadatas = []
        ids = []

        for event in events:
            # Create unique ID based on content hash
            content_hash = hashlib.sha256(
                f"{event.timestamp}{event.source_ip}{event.event_type}".encode()
            ).hexdigest()[:16]

            # Build document text for embedding
            doc_text = f"""
            Threat Event: {event.event_type}
            Source: {event.source_ip} -> Destination: {event.destination_ip}
            Severity: {event.severity}
            Tags: {', '.join(event.tags)}
            Raw: {event.raw_log[:500]}
            """

            documents.append(doc_text)
            metadatas.append({
                "timestamp": event.timestamp.isoformat(),
                "source_ip": event.source_ip,
                "dest_ip": event.destination_ip,
                "severity": event.severity,
                "event_type": event.event_type
            })
            ids.append(content_hash)

        # Add to ChromaDB (handles duplicates via upsert)
        self.vector_store.add_texts(
            texts=documents,
            metadatas=metadatas,
            ids=ids
        )
        return len(documents)

    def query_similar_threats(self, query: str, k: int = 5) -> List[Document]:
        """Find similar threats using semantic search."""
        results = self.vector_store.similarity_search(
            query,
            k=k,
            filter={"severity": {"$gte": 3}}  # Only high-severity matches
        )
        return results

Edge Case: The content_hash prevents duplicate embeddings when the same log is ingested twice—common in SOCs with redundant log shippers. The filter parameter in query_similar_threats ensures we only return actionable threats (severity 3+), reducing noise for analysts.

Step 3: Implementing the AI Analysis Engine

This is the core of the SOC assistant. It uses LangChain to create a chain that:

Takes a new alert
Queries the vector store for similar historical threats
Generates a contextual summary with recommended actions

# analysis_engine.py
from langchain.chains import LLMChain
from langchain.prompts import PromptTemplate
from langchain.llms import OpenAI
from langchain.callbacks import get_openai_callback
from typing import Dict, Any
import logging

logger = logging.getLogger(__name__)

class ThreatAnalyzer:
    def __init__(self, vector_store: ThreatVectorStore):
        self.vector_store = vector_store

        # Initialize LLM with conservative settings
        self.llm = OpenAI(
            model="gpt-3.5-turbo-instruct",
            temperature=0.1,  # Low temperature for deterministic output
            max_tokens=500,
            request_timeout=30
        )

        # Define prompt template
        self.prompt = PromptTemplate(
            input_variables=["alert", "similar_threats"],
            template="""
You are a senior SOC analyst. Analyze the following security alert and provide:
1. Threat classification (malware, phishing, reconnaissance, etc.)
2. Confidence level (low/medium/high)
3. Recommended immediate actions
4. Similar historical incidents for context

Current Alert:
{alert}

Similar Historical Threats:
{similar_threats}

Provide your analysis in a structured format:
"""
        )

        self.chain = LLMChain(llm=self.llm, prompt=self.prompt)

    def analyze_alert(self, alert: ThreatEvent) -> Dict[str, Any]:
        """Analyze a single threat event."""
        # Query for similar threats
        query_text = f"{alert.event_type} from {alert.source_ip} severity {alert.severity}"
        similar = self.vector_store.query_similar_threats(query_text, k=3)

        # Format similar threats for prompt
        similar_text = "\n".join([
            f"- {doc.metadata.get('event_type', 'unknown')} from {doc.metadata.get('source_ip', 'N/A')} "
            f"(severity: {doc.metadata.get('severity', 'N/A')})"
            for doc in similar
        ]) if similar else "No similar threats found in database."

        # Run LLM chain with cost tracking
        with get_openai_callback() as cb:
            result = self.chain.run(
                alert=f"Type: {alert.event_type}\nSource: {alert.source_ip}\nSeverity: {alert.severity}\nRaw: {alert.raw_log[:300]}",
                similar_threats=similar_text
            )
            logger.info(f"Analysis cost: ${cb.total_cost:.4f} (tokens: {cb.total_tokens})")

        return {
            "alert_id": hashlib.sha256(str(alert.timestamp).encode()).hexdigest()[:8],
            "analysis": result,
            "similar_threats_found": len(similar),
            "cost_usd": cb.total_cost
        }

Production Note: The temperature=0.1 setting is deliberate—in security contexts, you want deterministic, repeatable outputs. Higher temperatures can cause the model to hallucinate threat indicators, which is unacceptable in a SOC environment. The cost tracking via get_openai_callback is essential for budgeting; at $0.002 per 1K tokens, a busy SOC processing 10,000 alerts daily could spend $20-50/day on LLM calls.

Step 4: Building the FastAPI REST API

Expose the SOC assistant as a REST API for integration with existing SIEM tools like Splunk or Elastic.

# api.py
from fastapi import FastAPI, HTTPException, BackgroundTasks
from fastapi.middleware.cors import CORSMiddleware
from pydantic import BaseModel
from typing import List, Optional
import asyncio
from datetime import datetime

app = FastAPI(title="SOC Assistant API", version="1.0.0")

# Allow CORS for SIEM integrations
app.add_middleware(
    CORSMiddleware,
    allow_origins=["*"],
    allow_methods=["POST", "GET"],
    allow_headers=["*"],
)

# Initialize components (singleton pattern)
ingestor = LogIngestor()
vector_store = ThreatVectorStore()
analyzer = ThreatAnalyzer(vector_store)

class AlertRequest(BaseModel):
    logs: List[Dict[str, Any]]
    source_format: str = "json"  # json or csv

class AnalysisResponse(BaseModel):
    alert_id: str
    analysis: str
    similar_threats: int
    processing_time_ms: float

@app.post("/analyze", response_model=List[AnalysisResponse])
async def analyze_alerts(request: AlertRequest, background_tasks: BackgroundTasks):
    """
    Analyze a batch of security alerts.
    Returns analysis for each alert.
    """
    start_time = datetime.now()

    # Parse logs
    events = []
    for log_data in request.logs:
        try:
            event = ThreatEvent(**log_data)
            events.append(event)
        except ValidationError as e:
            logger.warning(f"Invalid log entry: {e}")
            continue

    if not events:
        raise HTTPException(status_code=400, detail="No valid log entries found")

    # Add to vector store in background
    background_tasks.add_task(vector_store.add_threat_events, events)

    # Analyze each alert (with concurrency limit)
    semaphore = asyncio.Semaphore(5)  # Max 5 concurrent LLM calls
    async def analyze_with_limit(event):
        async with semaphore:
            return await asyncio.to_thread(analyzer.analyze_alert, event)

    tasks = [analyze_with_limit(event) for event in events[:10]]  # Limit to 10 per request
    results = await asyncio.gather(*tasks)

    processing_time = (datetime.now() - start_time).total_seconds() * 1000

    return [
        AnalysisResponse(
            alert_id=r["alert_id"],
            analysis=r["analysis"],
            similar_threats=r["similar_threats_found"],
            processing_time_ms=processing_time / len(results)
        )
        for r in results
    ]

@app.get("/health")
async def health_check():
    """Health check endpoint for load balancers."""
    return {"status": "healthy", "timestamp": datetime.now().isoformat()}

if __name__ == "__main__":
    import uvicorn
    uvicorn.run(app, host="0.0.0.0", port=8000, log_level="info")

Edge Case Handling: The asyncio.Semaphore(5) prevents overwhelming the OpenAI API rate limits (typically 3,000 RPM for GPT-3.5). The background_tasks parameter ensures vector store updates don't block the response. In production, you'd also implement retry logic with exponential backoff for transient API failures.

Step 5: Running the SOC Assistant

Start the API server and test with sample data.

# Start the server
python api.py

# In another terminal, send test alerts
curl -X POST http://localhost:8000/analyze \
  -H "Content-Type: application/json" \
  -d '{
    "logs": [
      {
        "timestamp": "2026-06-12T10:30:00",
        "source_ip": "192.168.1.100",
        "destination_ip": "10.0.0.5",
        "event_type": "port_scan",
        "severity": 4,
        "raw_log": "Nmap scan detected from internal host to database server",
        "tags": ["reconnaissance", "internal"]
      }
    ]
  }'

Expected response (truncated for brevity):

[
  {
    "alert_id": "a1b2c3d4",
    "analysis": "Threat Classification: Reconnaissance (Port Scan)\nConfidence: High\nRecommended Actions: 1. Block source IP at firewall 2. Investigate host 192.168.1.100 for compromise 3. Review recent access logs..",
    "similar_threats": 2,
    "processing_time_ms": 1250.45
  }
]

Production Deployment Considerations

Scaling the Vector Store

ChromaDB is suitable for up to 1 million vectors. Beyond that, migrate to Pinecone or Weaviate. For 10 million+ vectors, consider using a distributed solution like Milvus [5].

Rate Limiting and Cost Control

Implement a token bucket algorithm to limit API calls:

from time import time
from collections import deque

class RateLimiter:
    def __init__(self, max_calls: int = 60, period: int = 60):
        self.max_calls = max_calls
        self.period = period
        self.calls = deque()

    def can_call(self) -> bool:
        now = time()
        while self.calls and self.calls[0] < now - self.period:
            self.calls.popleft()
        if len(self.calls) < self.max_calls:
            self.calls.append(now)
            return True
        return False

Security Hardening

Never expose the API without authentication (use API keys or OAuth2)
Sanitize all log inputs to prevent prompt injection attacks
Encrypt vector store data at rest using AES-256
Implement audit logging for all analysis requests

What's Next

Your SOC assistant is now functional, but production readiness requires additional work:

Integrate with SIEM tools: Build connectors for Splunk, Elastic, or Azure Sentinel using their REST APIs
Add real-time streaming: Replace batch processing with Apache Kafka for sub-second alert analysis
Implement feedback loops: Allow analysts to rate AI responses and retrain the model
Explore advanced models: Consider fine-tuning [1] a smaller model like Llama 3 for on-premise deployment

For deeper dives, check out our guides on building custom LangChain chains and optimizing vector search performance.

The code in this tutorial is production-ready for small to medium SOCs handling up to 10,000 alerts daily. For enterprise-scale deployments, you'll need to add horizontal scaling, database sharding, and multi-region failover. But the core architecture—ingestion, vector embedding, and LLM analysis—remains the foundation of any AI-powered security assistant.

References

1. Wikipedia - Fine-tuning. Wikipedia. [Source]

2. Wikipedia - Milvus. Wikipedia. [Source]

3. Wikipedia - Conifer cone. Wikipedia. [Source]

4. GitHub - hiyouga/LlamaFactory. Github. [Source]

5. GitHub - milvus-io/milvus. Github. [Source]

6. GitHub - pinecone-io/python-sdk. Github. [Source]

7. GitHub - chroma-core/chroma. Github. [Source]

8. Pinecone Pricing. Pricing. [Source]

How to Build a SOC Assistant with AI Threat Detection

How to Build a SOC Assistant with AI Threat Detection

Table of Contents

📺 Watch: Neural Networks Explained

Why a SOC Assistant Matters in Production

Prerequisites and Environment Setup

Install Required Packages

Set Up Environment Variables

Architecture: Ingestion, Embedding, and Response

Step 1: Building the Threat Ingestion Pipeline

Step 2: Creating the Vector Store for Threat Correlation

Step 3: Implementing the AI Analysis Engine

Step 4: Building the FastAPI REST API

Step 5: Running the SOC Assistant

Production Deployment Considerations

Scaling the Vector Store

Rate Limiting and Cost Control

Security Hardening

What's Next

References

Was this article helpful?

Related Articles

How to Build a Multimodal App with Gemini 2.0 Vision API

How to Build an AI Pentesting Assistant with LangChain

How to Build Autonomous Scientific Discovery Agents with EurekAgent