How to Build a SOC Assistant with TensorFlow and PyTorch
Practical tutorial: Detect threats with AI: building a SOC assistant
The Hybrid SOC Analyst: Building an AI Assistant with TensorFlow and PyTorch
The security operations center is drowning. Every day, organizations generate terabytes of logs—authentication attempts, network flows, endpoint telemetry, cloud API calls—far more than human analysts can meaningfully process. The industry response has been predictable: throw more bodies at the problem, or worse, rely on brittle rule-based systems that generate noise instead of signal. But there's a third path, one that's been quietly maturing in research labs and forward-leaning security teams: hybrid AI architectures that combine natural language understanding with anomaly detection.
This isn't about replacing analysts. It's about giving them a co-pilot—a SOC assistant that can ingest raw logs, extract semantic meaning, and flag behavioral deviations in real time. And thanks to the convergence of TensorFlow and PyTorch, building such a system is more accessible than ever. Let's walk through the architecture, the code, and the production considerations that make this approach viable for real-world deployments.
The Dual-Engine Architecture: NLP Meets Anomaly Detection
The fundamental insight behind this approach is that security logs contain two kinds of signal. The first is explicit: "User logged in," "Firewall rule applied," "Process terminated." This is natural language, and it demands natural language processing. The second is implicit: patterns of behavior that deviate from established baselines. A user logging in at 3 AM from an unfamiliar IP isn't just a log entry—it's an anomaly.
The architecture we'll implement mirrors this duality. On one side, a pre-trained transformer model (BERT, specifically) processes log text to extract contextual embeddings. On the other, a custom deep neural network—built with TensorFlow's Keras API—learns to distinguish normal from anomalous behavior based on those embeddings. The result is a system that understands what happened and whether it matters.
This hybrid model has shown promising results in preliminary tests conducted by various cybersecurity firms as of April 2026, particularly in reducing false positives compared to signature-based approaches. The key is that BERT's attention mechanism captures semantic relationships that rule-based systems miss—like recognizing that "WARNING: Suspicious activity detected" and "ALERT: Potential lateral movement" describe fundamentally similar events even though the wording differs.
Setting the Stage: Dependencies and Environment
Before we dive into implementation, let's address the tooling question that inevitably arises: TensorFlow or PyTorch? The honest answer is both, and that's the point. TensorFlow's Keras API provides an exceptionally clean interface for building and training feedforward networks, which is exactly what we need for the anomaly detection component. PyTorch, meanwhile, offers the dynamic computational graphs that make research and experimentation with transformer models more fluid.
The stack breaks down as follows:
- TensorFlow [5] for the anomaly detection model and production deployment
- PyTorch [7] for the NLP pipeline (via the Hugging Face Transformers library [4])
- Pandas for log ingestion and manipulation
- Scikit-Learn for preprocessing, vectorization, and feature extraction
Installation is straightforward, though GPU acceleration requires careful CUDA configuration:
pip install tensorflow torch transformers pandas scikit-learn
For production environments, consider using Docker containers with pre-configured CUDA toolkits and cuDNN libraries. The compatibility matrix between TensorFlow, PyTorch, and CUDA versions can be a source of subtle bugs, so pinning versions in your requirements.txt is strongly recommended.
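One way to produce those pins is to read the versions actually installed in a known-good environment. The sketch below uses the standard library's `importlib.metadata`; the helper name `pinned_requirements` is hypothetical, and the package list assumes the PyPI distribution names (PyTorch is published as `torch`).

```python
from importlib.metadata import PackageNotFoundError, version

# Distribution names as published on PyPI (assumed for this sketch).
PACKAGES = ["tensorflow", "torch", "transformers", "pandas", "scikit-learn"]

def pinned_requirements(packages):
    """Return 'name==version' lines for installed packages, and a
    comment line for any package that is not installed."""
    lines = []
    for name in packages:
        try:
            lines.append(f"{name}=={version(name)}")
        except PackageNotFoundError:
            lines.append(f"# {name} is not installed")
    return lines

if __name__ == "__main__":
    # Redirect this output into requirements.txt once the environment works.
    print("\n".join(pinned_requirements(PACKAGES)))
```

Running this inside the container image that passed your tests gives a requirements file that reproduces the same combination of versions.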
Building the NLP Pipeline: From Raw Logs to Semantic Features
The first challenge is transforming unstructured log text into something a neural network can consume. Raw security logs are notoriously messy—mixed formats, inconsistent timestamps, vendor-specific jargon. Our preprocessing pipeline needs to normalize this chaos into clean, vectorized inputs.
from sklearn.feature_extraction.text import TfidfVectorizer

def preprocess_logs(logs):
    # logs: iterable of raw log strings, one document per log line.
    # Lowercasing, tokenization, and English stopword removal are
    # handled by the vectorizer itself.
    vectorizer = TfidfVectorizer(stop_words='english')
    # Sparse matrix of shape (n_logs, n_terms)
    X = vectorizer.fit_transform(logs)
    return X, vectorizer
This initial pass serves two purposes. First, it reduces noise by filtering out common stopwords that carry no security-relevant meaning. Second, it creates a sparse matrix representation that captures term frequency-inverse document frequency—essentially, how important each word is relative to the entire log corpus.
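The normalization step mentioned earlier can be sketched as a small masking pass that runs before vectorization. This is a minimal illustration, not a complete parser: the placeholder tokens (`<TS>`, `<IP>`, `<HEX>`, `<NUM>`) and the function name `normalize_log` are assumptions, and the regular expressions only cover ISO-style timestamps and IPv4 addresses.

```python
import re

# Order matters: timestamps and IPs are masked before bare numbers.
_PATTERNS = [
    (re.compile(r"\b\d{4}-\d{2}-\d{2}[T ]\d{2}:\d{2}:\d{2}(?:\.\d+)?Z?"), "<TS>"),
    (re.compile(r"\b(?:\d{1,3}\.){3}\d{1,3}\b"), "<IP>"),
    (re.compile(r"\b0x[0-9a-fA-F]+\b"), "<HEX>"),
    (re.compile(r"\b\d+\b"), "<NUM>"),
]

def normalize_log(line):
    """Mask high-cardinality fields and collapse runs of whitespace."""
    for pattern, placeholder in _PATTERNS:
        line = pattern.sub(placeholder, line)
    return " ".join(line.split())

example = "2024-05-01T03:12:45Z  Failed login for admin from 10.0.0.7 port 22"
print(normalize_log(example))
# <TS> Failed login for admin from <IP> port <NUM>
```

Masking these fields keeps the TF-IDF vocabulary from filling up with one-off timestamps and addresses, so the remaining terms describe the event rather than the instance.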
But TF-IDF, while useful for initial exploration, lacks the contextual understanding that modern NLP demands. That's where BERT enters the picture. By feeding the log text itself through a pre-trained BERT model (its tokenizer expects strings, not the TF-IDF matrix), we obtain dense vector representations that encode semantic relationships:
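import torch
from transformers import BertTokenizer, BertModel

# Load once at import time; reloading on every call is slow.
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
bert_model = BertModel.from_pretrained('bert-base-uncased')
bert_model.eval()

def analyze_logs(logs):
    # logs: list of raw log strings
    inputs = tokenizer(list(logs), return_tensors='pt', padding=True, truncation=True)
    with torch.no_grad():
        outputs = bert_model(**inputs)
    # Keep the [CLS] vector per log, shape (batch, 768), as NumPy for Keras
    return outputs.last_hidden_state[:, 0, :].cpu().numpy()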
The last_hidden_state output contains contextual embeddings for each token in the input sequence. For log analysis, we typically take the [CLS] token's embedding—a fixed-size representation that captures the overall meaning of the input. This becomes our feature vector for the anomaly detection model.
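To make the pooling step concrete, here is a toy sketch in which plain nested lists stand in for the `[batch][seq_len][hidden_size]` tensor. `cls_pool` is the [CLS] selection described above; `mean_pool` is a common alternative that is not part of the original pipeline, shown only for comparison. Both function names are hypothetical.

```python
def cls_pool(hidden_states):
    """Take the first token's vector ([CLS]) from each sequence.

    hidden_states: nested lists shaped [batch][seq_len][hidden_size].
    """
    return [sequence[0] for sequence in hidden_states]

def mean_pool(hidden_states, attention_mask):
    """Average token vectors per sequence, skipping padded positions.

    attention_mask: [batch][seq_len] of 1 (real token) or 0 (padding).
    """
    pooled = []
    for sequence, mask in zip(hidden_states, attention_mask):
        kept = [vec for vec, m in zip(sequence, mask) if m == 1]
        size = len(kept[0])
        pooled.append([sum(vec[d] for vec in kept) / len(kept) for d in range(size)])
    return pooled

# Toy batch: 2 sequences, 3 token positions, hidden size 2.
hidden = [
    [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]],
    [[0.5, 0.5], [1.5, 2.5], [9.0, 9.0]],  # last position is padding
]
mask = [[1, 1, 1], [1, 1, 0]]

print(cls_pool(hidden))         # [[1.0, 2.0], [0.5, 0.5]]
print(mean_pool(hidden, mask))  # [[3.0, 4.0], [1.0, 1.5]]
```

Either way, the result is one fixed-size vector per log line, which is what the downstream classifier expects.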
This approach is particularly powerful for SOC use cases because it handles the vocabulary drift that plagues traditional NLP systems. Attack techniques evolve, and so does the language used to describe them. A model built on BERT's general-purpose embeddings can often generalize to novel attack descriptions without retraining, though heavily vendor-specific jargon may still benefit from fine-tuning.
The Anomaly Detection Engine: Training a Deep Neural Network
With semantic features extracted, we need a model that can learn the boundary between normal and anomalous behavior. This is a binary classification problem, but with an important twist: the distribution of normal vs. anomalous logs is heavily imbalanced. In a typical SOC, 99% of events are benign. Our model must be sensitive enough to catch the 1% without drowning the analyst in false positives.
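The imbalance has a practical consequence for evaluation, illustrated below with made-up labels: a "model" that calls every event benign scores 99% accuracy while catching nothing. The helper names are hypothetical and the data is synthetic.

```python
def confusion_counts(y_true, y_pred):
    """Count true/false positives and negatives for binary labels (1 = anomalous)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    return tp, fp, fn, tn

def summarize(y_true, y_pred):
    """Return accuracy, precision, and recall; undefined ratios are reported as 0.0."""
    tp, fp, fn, tn = confusion_counts(y_true, y_pred)
    return {
        "accuracy": (tp + tn) / len(y_true),
        "precision": tp / (tp + fp) if (tp + fp) else 0.0,
        "recall": tp / (tp + fn) if (tp + fn) else 0.0,
    }

# 1,000 synthetic events, 10 of them anomalous (1%).
y_true = [1] * 10 + [0] * 990
always_benign = [0] * 1000

print(summarize(y_true, always_benign))
# {'accuracy': 0.99, 'precision': 0.0, 'recall': 0.0}
```

This is why precision and recall on the anomalous class are more informative than accuracy when tuning the detector.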
The architecture we'll use is a deep feedforward network with dropout regularization:
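import tensorflow as tf

def build_anomaly_detection_model(input_shape):
    # input_shape: e.g. (768,) for bert-base [CLS] embeddings
    model = tf.keras.Sequential([
        tf.keras.Input(shape=input_shape),
        tf.keras.layers.Dense(128, activation='relu'),
        tf.keras.layers.Dropout(0.5),
        tf.keras.layers.Dense(64, activation='relu'),
        tf.keras.layers.Dropout(0.5),
        tf.keras.layers.Dense(32, activation='relu'),
        tf.keras.layers.Dense(1, activation='sigmoid')  # P(anomalous)
    ])
    # Accuracy alone is misleading on imbalanced data, so also track
    # precision, recall, and AUC.
    model.compile(optimizer=tf.keras.optimizers.Adam(),
                  loss=tf.keras.losses.BinaryCrossentropy(),
                  metrics=['accuracy',
                           tf.keras.metrics.Precision(),
                           tf.keras.metrics.Recall(),
                           tf.keras.metrics.AUC()])
    return model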
The dropout layers are critical here. At 50% dropout, each training step updates a random subnetwork, which behaves much like training an ensemble and reduces overfitting. Dropout does not correct class imbalance by itself; that is usually handled with class weights or resampling. The ReLU activations introduce non-linearity, allowing the model to learn complex decision boundaries.
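Because the benign class dominates, a common complement to dropout is to weight the rare class more heavily in the loss. The sketch below computes inverse-frequency weights with the same formula scikit-learn uses for `class_weight='balanced'`; the function name is hypothetical and the labels are synthetic.

```python
from collections import Counter

def balanced_class_weights(labels):
    """Weight each class by n_samples / (n_classes * class_count)."""
    counts = Counter(labels)
    n_samples = len(labels)
    n_classes = len(counts)
    return {cls: n_samples / (n_classes * count) for cls, count in counts.items()}

# 990 benign (0) and 10 anomalous (1) synthetic labels.
labels = [0] * 990 + [1] * 10
weights = balanced_class_weights(labels)
print(weights[1])  # 50.0
```

The resulting dictionary can be passed to Keras through the `class_weight` argument of `model.fit`, so each missed anomaly costs roughly as much as a hundred misclassified benign events in this example.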
Training requires labeled data—logs that have been manually classified as normal or anomalous. In practice, this is the most expensive part of the pipeline. Many organizations start with a small labeled dataset and use active learning to iteratively improve the model. The training loop itself is standard:
history = model.fit(X_train, y_train, epochs=10, validation_data=(X_val, y_val))
Monitor the validation loss carefully. If it diverges from the training loss, your model is overfitting—increase dropout or reduce the number of epochs. If both losses plateau early, consider increasing model capacity or adding more training data.
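The divergence check can be automated. The sketch below is a plain-Python version of a patience rule applied to a list of validation losses, such as `history.history['val_loss']` from the call above; the function names and the sample losses are illustrative.

```python
def epochs_since_best(val_losses):
    """Number of epochs elapsed since the lowest validation loss."""
    best_epoch = min(range(len(val_losses)), key=val_losses.__getitem__)
    return len(val_losses) - 1 - best_epoch

def should_stop(val_losses, patience=3):
    """True once validation loss has failed to improve for `patience` epochs."""
    return bool(val_losses) and epochs_since_best(val_losses) >= patience

# Illustrative curve: improves for three epochs, then drifts upward.
val_losses = [0.60, 0.45, 0.40, 0.42, 0.44, 0.47]
print(should_stop(val_losses, patience=3))  # True
```

In Keras the same behavior is available as a built-in callback, `tf.keras.callbacks.EarlyStopping(monitor='val_loss', patience=3, restore_best_weights=True)`, passed to `model.fit` via `callbacks=[...]`.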
Production Deployment: Batch Processing and Asynchronous Pipelines
A model that works in a Jupyter notebook is a prototype. A model that works in production is a system. The transition requires careful attention to throughput, latency, and error handling.
The most common deployment pattern for SOC assistants is batch processing. Logs arrive continuously, but processing them individually would be prohibitively expensive. Instead, we buffer logs and process them in batches:
batch_size = 1024

def process_logs_in_batches(logs, batch_size):
    # Split the buffered logs into fixed-size chunks
    batches = [logs[i:i+batch_size] for i in range(0, len(logs), batch_size)]
    results = []
    for batch in batches:
        features = analyze_logs(batch)
        predictions = model.predict(features)
        # predict() returns shape (n, 1); flatten to one score per log
        results.extend(predictions.ravel().tolist())
    return results
The batch size is a tunable parameter. Larger batches improve GPU utilization but increase memory pressure and latency. For real-time SOC operations, a batch size of 1024 with a 5-second processing window is a reasonable starting point.
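A size-or-time flush rule like the one described can be sketched as a small buffer class. `LogBuffer` and its parameters are hypothetical names; the clock is injectable so the behavior can be tested without waiting.

```python
import time

class LogBuffer:
    """Collect logs and release them when the batch is full or the window expires."""

    def __init__(self, max_size=1024, max_wait_seconds=5.0, clock=time.monotonic):
        self.max_size = max_size
        self.max_wait_seconds = max_wait_seconds
        self._clock = clock
        self._items = []
        self._first_arrival = None

    def add(self, log):
        """Add one log; return a batch (list) if it is time to flush, else None."""
        if not self._items:
            self._first_arrival = self._clock()
        self._items.append(log)
        return self.flush_if_due()

    def flush_if_due(self):
        """Return and clear the buffered logs if either limit is reached."""
        if not self._items:
            return None
        full = len(self._items) >= self.max_size
        expired = self._clock() - self._first_arrival >= self.max_wait_seconds
        if full or expired:
            batch, self._items = self._items, []
            self._first_arrival = None
            return batch
        return None
```

A caller would hand each returned batch to `process_logs_in_batches`, and call `flush_if_due()` on a timer so that a quiet period does not leave events waiting in the buffer.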
Asynchronous processing adds another layer of sophistication. By decoupling log ingestion from model inference, we can handle spikes in log volume without dropping events. Python's asyncio library is well-suited for this, though production systems often use message queues like Apache Kafka or RabbitMQ to manage the pipeline.
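The decoupling can be sketched with `asyncio.Queue` as a producer and a consumer. This is a minimal single-process illustration, not a substitute for a message broker; the function names are hypothetical, and the default scorer is a placeholder standing in for `analyze_logs` plus `model.predict`.

```python
import asyncio

async def ingest(queue, logs):
    """Producer: push logs as they arrive, then signal completion."""
    for log in logs:
        await queue.put(log)
    await queue.put(None)  # sentinel: no more logs

async def infer(queue, batch_size, score_batch):
    """Consumer: drain the queue in batches and score each batch."""
    results, batch = [], []
    while True:
        log = await queue.get()
        if log is None:
            break
        batch.append(log)
        if len(batch) >= batch_size:
            results.extend(score_batch(batch))
            batch = []
    if batch:  # score the final partial batch
        results.extend(score_batch(batch))
    return results

async def run_pipeline(logs, batch_size=2, score_batch=None):
    # Placeholder scorer (log length) used only when no scorer is supplied.
    score_batch = score_batch or (lambda batch: [len(log) for log in batch])
    queue = asyncio.Queue(maxsize=100)  # bounded queue applies backpressure
    _, results = await asyncio.gather(
        ingest(queue, logs), infer(queue, batch_size, score_batch)
    )
    return results

print(asyncio.run(run_pipeline(["aa", "b", "cccc"])))  # [2, 1, 4]
```

Note that a real scorer is CPU- or GPU-bound and would block the event loop if called directly like this; moving it to a worker thread with `asyncio.to_thread` is one option.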
Error handling is non-negotiable in a SOC context. A crashed model pipeline means blind spots in your security coverage:
import logging

logger = logging.getLogger(__name__)

# Inside the batch loop:
try:
    features = analyze_logs(batch)
    predictions = model.predict(features)
except Exception as e:
    # Log the error, alert the team, and continue processing
    logger.error(f"Model inference failed for batch: {e}")
    # Fall back to rule-based detection
    predictions = fallback_rule_engine(batch)
This try/except pattern ensures that even if the AI component fails, the system degrades gracefully rather than catastrophically.
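The article does not define `fallback_rule_engine`, so the following is only one possible shape for it: a handful of regular expressions that return scores on the same 0-to-1 scale as the model. The patterns are illustrative; a real rule set would come from your existing detections.

```python
import re

# Illustrative patterns only; not a recommended detection rule set.
FALLBACK_RULES = [
    re.compile(r"failed (password|login)", re.IGNORECASE),
    re.compile(r"privilege escalation", re.IGNORECASE),
    re.compile(r"lateral movement", re.IGNORECASE),
]

def fallback_rule_engine(batch):
    """Return 1.0 for logs matching any rule, else 0.0."""
    return [
        1.0 if any(rule.search(log) for rule in FALLBACK_RULES) else 0.0
        for log in batch
    ]

sample = [
    "Failed password for root from 10.0.0.7",
    "User logged in",
    "ALERT: Potential lateral movement",
]
print(fallback_rule_engine(sample))  # [1.0, 0.0, 1.0]
```

Keeping the fallback's output in the same format as the model's predictions means downstream alerting code does not need to know which path produced a score.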
Security Considerations and the Road Ahead
Building an AI-powered SOC assistant introduces its own security risks. The most pressing is prompt injection—if your system uses large language models, an attacker could craft log entries that manipulate the model's behavior. Input sanitization is essential: strip control characters, validate encoding, and consider using a separate model instance for untrusted inputs.
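The sanitization steps listed above can be sketched with the standard library. This covers only control characters, invisible format characters, and encoding validation; it does not by itself prevent prompt injection written in ordinary text. The function name and the `max_length` default are assumptions.

```python
import unicodedata

def sanitize_log(raw, max_length=4096):
    """Decode bytes as UTF-8, drop control/format characters, and cap the length."""
    if isinstance(raw, bytes):
        # Invalid byte sequences become U+FFFD instead of raising.
        raw = raw.decode("utf-8", errors="replace")
    cleaned = "".join(
        " " if ch in "\t\n\r" else ch
        for ch in raw
        # Cc = control characters, Cf = invisible format characters
        if ch in "\t\n\r" or unicodedata.category(ch) not in ("Cc", "Cf")
    )
    return " ".join(cleaned.split())[:max_length]

print(sanitize_log("login\x00 ok\x1b[31m\nnext"))  # login ok[31m next
```

Tabs and newlines are converted to spaces rather than dropped so that adjacent words do not run together.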
There's also the question of model poisoning. If an attacker can inject training data, they could teach the model to ignore certain attack patterns. This is why training data provenance matters. Use cryptographically signed logs for training, and monitor for data drift that might indicate tampering.
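As a minimal sketch of signature checking before training, the example below uses HMAC-SHA256 from the standard library. HMAC is a symmetric scheme chosen here for brevity; a deployment might prefer asymmetric signatures so that the training host cannot forge entries. The key value and function names are hypothetical.

```python
import hashlib
import hmac

def sign_log(key, log_line):
    """Return a hex HMAC-SHA256 signature for one log line."""
    return hmac.new(key, log_line.encode("utf-8"), hashlib.sha256).hexdigest()

def verify_log(key, log_line, signature):
    """Constant-time comparison of the expected and supplied signatures."""
    expected = sign_log(key, log_line)
    return hmac.compare_digest(expected, signature)

def filter_verified(key, signed_logs):
    """Keep only (log_line, signature) pairs whose signature checks out."""
    return [line for line, sig in signed_logs if verify_log(key, line, sig)]

key = b"demo-key-for-illustration-only"
good = "User logged in"
signature = sign_log(key, good)

print(verify_log(key, good, signature))                # True
print(verify_log(key, "User logged out", signature))   # False
```

Running `filter_verified` over the candidate training set drops any entry that was altered or injected after signing.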
Looking ahead, the next frontier is multi-modal SOC assistants that combine log analysis with network flow data, endpoint telemetry, and threat intelligence feeds. The architecture we've built here is extensible—the same BERT-plus-anomaly-detection pattern can be applied to any structured or unstructured security data.
For teams ready to take the next step, consider integrating with existing SIEM platforms and exploring open-source LLMs for log summarization. The ecosystem is maturing rapidly, and the tools are finally good enough to build something that actually helps analysts sleep better at night.