How to Build an Emotion Recognition Tool with PyTorch and FastAPI
Emotion recognition from facial expressions remains one of the most challenging yet commercially valuable problems in computer vision. While basic classifiers can distinguish between happy and sad faces with reasonable accuracy, production-grade systems must handle real-world variability in lighting, head pose, occlusions, and cultural differences in emotional expression. In this tutorial, we'll build a complete emotion recognition pipeline that achieves competitive accuracy while remaining deployable on modest hardware.
We'll use a ResNet-18 backbone fine-tuned on the FER2013 dataset, wrapped in a FastAPI service with input validation, a batch endpoint, and Prometheus monitoring. The system will recognize seven basic emotions: anger, disgust, fear, happiness, sadness, surprise, and neutral. By the end, you'll have an API that can process single images as well as batches of frames extracted from video.
Real-World Use Case and Architecture
Emotion recognition systems are deployed across multiple industries. According to a 2025 MarketsandMarkets report, the facial recognition market (including emotion detection) is projected to reach $12.67 billion by 2027, with healthcare and automotive sectors driving adoption. Common use cases include:
- Automotive safety: Detecting driver drowsiness or road rage [2]
- Healthcare monitoring: Assessing patient pain levels or depression severity
- Retail analytics: Measuring customer satisfaction in physical stores
- Human-computer interaction: Adaptive interfaces that respond to user frustration
Our architecture follows a three-tier design:
- Preprocessing layer: Face detection using MTCNN, alignment, and normalization
- Inference engine: PyTorch [7] model with ONNX Runtime for optimized serving
- API layer: FastAPI with async endpoints, Prometheus metrics, and health checks (rate limiting is left to the deployment layer)
This separation allows independent scaling of each component. The preprocessing layer can be offloaded to GPU if needed, while the inference engine benefits from ONNX's cross-platform optimization.
Prerequisites and Environment Setup
Before writing code, ensure your environment has the following dependencies. We'll use Python 3.10+ and CUDA 11.8 if available.
# Create a virtual environment
python -m venv emotion_env
source emotion_env/bin/activate # On Windows: emotion_env\Scripts\activate
# Core dependencies
pip install torch==2.1.0 torchvision==0.16.0 --index-url https://download.pytorch.org/whl/cu118
pip install fastapi==0.104.1 uvicorn[standard]==0.24.0
pip install opencv-python==4.8.1.78 pillow==10.1.0
pip install numpy==1.26.2 scikit-learn==1.3.2 pandas==2.1.4 # pandas is imported by train_emotion.py
pip install onnxruntime-gpu==1.16.3 # Use onnxruntime for CPU-only systems
pip install python-multipart==0.0.6 # For file uploads
pip install pydantic==2.5.2 pydantic-settings==2.1.0
pip install prometheus-client==0.19.0 # For monitoring
For face detection, we'll use MTCNN from the facenet-pytorch library, which provides a pre-trained model:
pip install facenet-pytorch==2.5.3
Hardware considerations: The model requires approximately 2GB of GPU memory for batch inference of 32 images. On CPU, expect 50-100ms per image with MTCNN detection. For production, consider using a smaller face detector like RetinaFace or MediaPipe if latency is critical.
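The latency figures above depend heavily on hardware, so it is worth measuring them on the target machine. The following is a minimal sketch of a timing helper that uses only the standard library. The names `benchmark` and `fake_inference` are hypothetical, and the stand-in workload should be replaced with a real detection-plus-classification call.

```python
import statistics
import time
from typing import Callable, Dict


def benchmark(fn: Callable[[], object], warmup: int = 3, runs: int = 20) -> Dict[str, float]:
    """Time repeated calls to fn and report latency statistics in milliseconds."""
    # Warm-up calls are excluded so one-time costs (lazy loading, caches) do not skew results
    for _ in range(warmup):
        fn()
    samples = []
    for _ in range(runs):
        start = time.perf_counter()
        fn()
        samples.append((time.perf_counter() - start) * 1000.0)
    samples.sort()
    # Nearest-rank style index for the 95th percentile
    p95_index = min(len(samples) - 1, int(round(0.95 * (len(samples) - 1))))
    return {
        "mean_ms": statistics.mean(samples),
        "median_ms": statistics.median(samples),
        "p95_ms": samples[p95_index],
    }


def fake_inference() -> int:
    # Hypothetical stand-in for "detect faces, then classify one image"
    return sum(i * i for i in range(10_000))


if __name__ == "__main__":
    stats = benchmark(fake_inference)
    print({name: round(value, 3) for name, value in stats.items()})
```

Reporting the median and 95th percentile alongside the mean makes occasional slow requests visible, which matters more for an API than the average alone.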
Core Implementation: Training the Emotion Classifier
We'll start by training a ResNet-18 model on the FER2013 dataset. This dataset contains 35,887 grayscale 48x48 pixel faces labeled with seven emotions. The dataset is imbalanced: "happy" is by far the largest class, while "disgust" is severely underrepresented.
# train_emotion.py
import torch
import torch.nn as nn
import torch.optim as optim
from torch.utils.data import DataLoader, Dataset
from torchvision import transforms, models
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report
import os
import logging
logging.basicConfig(level=logging.INFO)
logger = logging.getLogger(__name__)
# Constants
EMOTIONS = ['Angry', 'Disgust', 'Fear', 'Happy', 'Sad', 'Surprise', 'Neutral']
BATCH_SIZE = 64
EPOCHS = 50
LEARNING_RATE = 1e-4
DEVICE = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
class FER2013Dataset(Dataset):
    """Custom dataset for FER2013 CSV format."""

    def __init__(self, dataframe, transform=None):
        self.dataframe = dataframe
        self.transform = transform

    def __len__(self):
        return len(self.dataframe)

    def __getitem__(self, idx):
        row = self.dataframe.iloc[idx]
        # Parse pixel values from space-separated string
        pixels = np.array([int(p) for p in row['pixels'].split()], dtype=np.uint8)
        image = pixels.reshape(48, 48).astype(np.float32)
        # Normalize to [0, 1] and convert to 3-channel
        image = image / 255.0
        image = np.stack([image] * 3, axis=0)  # Shape: (3, 48, 48)
        label = int(row['emotion'])
        # Convert to tensor and resize to 224x224 for ResNet
        image_tensor = torch.from_numpy(image)
        image_tensor = torch.nn.functional.interpolate(
            image_tensor.unsqueeze(0), size=(224, 224), mode='bilinear'
        ).squeeze(0)
        # Augmentation and normalization are optional
        if self.transform:
            image_tensor = self.transform(image_tensor)
        return image_tensor, label

def load_data(csv_path='fer2013.csv'):
    """Load and split FER2013 dataset."""
    df = pd.read_csv(csv_path)
    # The dataset has a 'Usage' column for train/test split
    train_df = df[df['Usage'] == 'Training']
    val_df = df[df['Usage'] == 'PublicTest']
    test_df = df[df['Usage'] == 'PrivateTest']
    logger.info(f"Train: {len(train_df)}, Val: {len(val_df)}, Test: {len(test_df)}")
    # Data augmentation for training
    train_transform = transforms.Compose([
        transforms.RandomHorizontalFlip(p=0.5),
        transforms.RandomRotation(degrees=10),
        transforms.ColorJitter(brightness=0.1, contrast=0.1),
        transforms.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225])
    ])
    val_transform = transforms.Compose([
        transforms.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225])
    ])
    train_dataset = FER2013Dataset(train_df, transform=train_transform)
    val_dataset = FER2013Dataset(val_df, transform=val_transform)
    test_dataset = FER2013Dataset(test_df, transform=val_transform)
    return train_dataset, val_dataset, test_dataset

def create_model():
    """Create ResNet-18 with custom classifier head."""
    model = models.resnet18(weights=models.ResNet18_Weights.IMAGENET1K_V1)
    # Freeze everything except layer4, which holds the last two residual blocks.
    # Slicing parameters() by position does not line up with block boundaries,
    # so select the trainable layers by name instead.
    for name, param in model.named_parameters():
        if not name.startswith('layer4.'):
            param.requires_grad = False
    # Replace classifier head (newly created layers are trainable by default)
    num_features = model.fc.in_features
    model.fc = nn.Sequential(
        nn.Dropout(0.5),
        nn.Linear(num_features, 256),
        nn.ReLU(),
        nn.BatchNorm1d(256),
        nn.Dropout(0.3),
        nn.Linear(256, 7)  # 7 emotion classes
    )
    return model

def train_model(model, train_loader, val_loader, criterion, optimizer, scheduler):
    """Training loop with validation and early stopping."""
    best_val_acc = 0.0
    patience = 10
    patience_counter = 0
    for epoch in range(EPOCHS):
        # Training phase
        model.train()
        train_loss = 0.0
        train_correct = 0
        train_total = 0
        for images, labels in train_loader:
            images, labels = images.to(DEVICE), labels.to(DEVICE)
            optimizer.zero_grad()
            outputs = model(images)
            loss = criterion(outputs, labels)
            loss.backward()
            # Gradient clipping to prevent exploding gradients
            torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
            optimizer.step()
            train_loss += loss.item() * images.size(0)
            _, predicted = torch.max(outputs, 1)
            train_total += labels.size(0)
            train_correct += (predicted == labels).sum().item()
        train_acc = 100 * train_correct / train_total
        train_loss = train_loss / train_total
        # Validation phase
        model.eval()
        val_loss = 0.0
        val_correct = 0
        val_total = 0
        with torch.no_grad():
            for images, labels in val_loader:
                images, labels = images.to(DEVICE), labels.to(DEVICE)
                outputs = model(images)
                loss = criterion(outputs, labels)
                val_loss += loss.item() * images.size(0)
                _, predicted = torch.max(outputs, 1)
                val_total += labels.size(0)
                val_correct += (predicted == labels).sum().item()
        val_acc = 100 * val_correct / val_total
        val_loss = val_loss / val_total
        scheduler.step(val_loss)
        logger.info(f"Epoch {epoch+1}/{EPOCHS} | "
                    f"Train Loss: {train_loss:.4f} Acc: {train_acc:.2f}% | "
                    f"Val Loss: {val_loss:.4f} Acc: {val_acc:.2f}%")
        # Early stopping and model checkpoint
        if val_acc > best_val_acc:
            best_val_acc = val_acc
            patience_counter = 0
            torch.save(model.state_dict(), 'best_emotion_model.pth')
            logger.info(f"Saved new best model with val_acc: {val_acc:.2f}%")
        else:
            patience_counter += 1
            if patience_counter >= patience:
                logger.info(f"Early stopping triggered after {epoch+1} epochs")
                break
    return model

def main():
    logger.info(f"Using device: {DEVICE}")
    # Load data
    train_dataset, val_dataset, test_dataset = load_data()
    train_loader = DataLoader(train_dataset, batch_size=BATCH_SIZE, shuffle=True, num_workers=4)
    val_loader = DataLoader(val_dataset, batch_size=BATCH_SIZE, shuffle=False, num_workers=4)
    test_loader = DataLoader(test_dataset, batch_size=BATCH_SIZE, shuffle=False, num_workers=4)
    # Create model
    model = create_model().to(DEVICE)
    # Loss and optimizer
    # Use class weights to handle imbalance. Count labels straight from the
    # dataframe; iterating the dataset would decode and resize every image.
    class_counts = np.bincount(
        train_dataset.dataframe['emotion'].to_numpy(), minlength=len(EMOTIONS)
    )
    class_weights = 1.0 / torch.tensor(class_counts, dtype=torch.float)
    class_weights = class_weights / class_weights.sum()
    class_weights = class_weights.to(DEVICE)
    criterion = nn.CrossEntropyLoss(weight=class_weights)
    optimizer = optim.AdamW(model.parameters(), lr=LEARNING_RATE, weight_decay=1e-4)
    scheduler = optim.lr_scheduler.ReduceLROnPlateau(optimizer, mode='min', factor=0.5, patience=5)
    # Train
    model = train_model(model, train_loader, val_loader, criterion, optimizer, scheduler)
    # Load best model for testing (map_location keeps this working on CPU-only hosts)
    model.load_state_dict(torch.load('best_emotion_model.pth', map_location=DEVICE))
    model.eval()
    # Evaluate on test set
    all_preds = []
    all_labels = []
    with torch.no_grad():
        for images, labels in test_loader:
            images, labels = images.to(DEVICE), labels.to(DEVICE)
            outputs = model(images)
            _, predicted = torch.max(outputs, 1)
            all_preds.extend(predicted.cpu().numpy())
            all_labels.extend(labels.cpu().numpy())
    logger.info("\nTest Set Classification Report:")
    # Pass labels explicitly so target_names stays aligned even if a class is never predicted
    logger.info(classification_report(
        all_labels, all_preds,
        labels=list(range(len(EMOTIONS))), target_names=EMOTIONS
    ))
    # Export to ONNX for production
    dummy_input = torch.randn(1, 3, 224, 224).to(DEVICE)
    torch.onnx.export(
        model, dummy_input, 'emotion_model.onnx',
        input_names=['input'], output_names=['output'],
        dynamic_axes={'input': {0: 'batch_size'}, 'output': {0: 'batch_size'}},
        opset_version=17
    )
    logger.info("Model exported to ONNX format")


if __name__ == '__main__':
    main()
Key design decisions in the training code:
- Transfer learning with selective freezing: We freeze all but the last two residual blocks of ResNet-18. This limits overfitting on the relatively small FER2013 dataset (35K images) while allowing the model to adapt high-level features to emotion recognition.
- Class weighting: The FER2013 dataset has severe class imbalance. "Disgust" has only roughly 550 samples across the full dataset, compared to 8,989 for "Happy". Using inverse frequency weights in the loss function helps the model learn minority classes without oversampling.
- Gradient clipping: With a learning rate of 1e-4 and AdamW, gradient norms can spike during early training. Clipping the global norm at 1.0 stabilizes training.
- ONNX export: Converting to ONNX allows deployment on edge devices (Raspberry Pi, Jetson) and enables optimizations like INT8 quantization. The dynamic axes parameter allows variable batch sizes during inference.
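To make the class-weighting arithmetic concrete, here is a minimal pure-Python sketch of inverse frequency weights normalized to sum to 1, mirroring what the training script computes with tensors. The function name `inverse_frequency_weights` and the three-class counts are hypothetical and are not FER2013 figures.

```python
from typing import List


def inverse_frequency_weights(class_counts: List[int]) -> List[float]:
    """Return weights proportional to 1/count, normalized to sum to 1."""
    if any(count <= 0 for count in class_counts):
        raise ValueError("every class needs at least one sample")
    inverse = [1.0 / count for count in class_counts]
    total = sum(inverse)
    return [value / total for value in inverse]


# Hypothetical counts for a toy 3-class problem (illustration only)
counts = [100, 400, 500]
weights = inverse_frequency_weights(counts)
for count, weight in zip(counts, weights):
    print(f"count={count:4d} weight={weight:.3f}")
```

A class with a quarter of the samples receives four times the weight. With the default mean reduction, `nn.CrossEntropyLoss` divides by the sum of the weights of the targets in the batch, so only the ratios between weights matter and the normalization step is mainly for readability.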
Building the Production API with FastAPI
Now we'll create the inference server. This API handles image uploads, runs face detection, performs emotion classification, and returns structured results with confidence scores.
# api.py
import io
import time
from typing import List, Optional
import numpy as np
import cv2
from PIL import Image
import torch
import onnxruntime as ort
from fastapi import FastAPI, File, UploadFile, HTTPException, Depends
from fastapi.responses import JSONResponse
from pydantic import BaseModel, Field
import uvicorn
from facenet_pytorch import MTCNN
import logging
from prometheus_client import Counter, Histogram, generate_latest, CONTENT_TYPE_LATEST
from starlette.responses import Response
logging.basicConfig(level=logging.INFO)
logger = logging.getLogger(__name__)
# Prometheus metrics
PREDICTION_COUNTER = Counter('emotion_predictions_total', 'Total predictions made')
PREDICTION_LATENCY = Histogram('emotion_prediction_latency_seconds', 'Prediction latency')
DETECTION_FAILURES = Counter('face_detection_failures_total', 'Failed face detections')
app = FastAPI(
title="Emotion Recognition API",
description="Production-grade emotion recognition from facial expressions",
version="1.0.0"
)
# Global model instances
class EmotionModel:
    """Wrapper for ONNX emotion classifier with face detection."""

    def __init__(self, onnx_path: str = 'emotion_model.onnx'):
        self.emotions = ['Angry', 'Disgust', 'Fear', 'Happy', 'Sad', 'Surprise', 'Neutral']
        # Initialize face detector
        self.face_detector = MTCNN(
            image_size=160,  # MTCNN default
            margin=20,
            min_face_size=20,
            thresholds=[0.6, 0.7, 0.7],
            factor=0.709,
            # Keep raw 0-255 pixel values; preprocess_face applies the same
            # scaling and normalization that training used
            post_process=False,
            device='cuda' if torch.cuda.is_available() else 'cpu'
        )
        # Initialize ONNX runtime
        providers = ['CUDAExecutionProvider', 'CPUExecutionProvider'] if torch.cuda.is_available() else ['CPUExecutionProvider']
        self.session = ort.InferenceSession(onnx_path, providers=providers)
        # Input/output details
        self.input_name = self.session.get_inputs()[0].name
        self.output_name = self.session.get_outputs()[0].name
        logger.info(f"Model loaded with providers: {providers}")

    def preprocess_face(self, face_tensor: torch.Tensor) -> np.ndarray:
        """Preprocess detected face for emotion model."""
        # MTCNN returns tensor of shape (3, 160, 160) with values in [0, 255]
        # Resize to (3, 224, 224) for ResNet
        face_resized = torch.nn.functional.interpolate(
            face_tensor.unsqueeze(0), size=(224, 224), mode='bilinear'
        ).squeeze(0)
        # Training used grayscale images replicated across 3 channels,
        # so collapse RGB to a single intensity channel and replicate it
        gray = face_resized.mean(dim=0, keepdim=True)
        face_gray = gray.expand(3, -1, -1)
        # Normalize with ImageNet stats
        mean = torch.tensor([0.485, 0.456, 0.406]).view(3, 1, 1)
        std = torch.tensor([0.229, 0.224, 0.225]).view(3, 1, 1)
        face_normalized = (face_gray / 255.0 - mean) / std
        # Convert to numpy for ONNX
        return face_normalized.numpy().astype(np.float32)

    def predict(self, image: np.ndarray) -> List[dict]:
        """
        Detect faces and predict emotions.

        Args:
            image: RGB image as numpy array (H, W, 3)
        Returns:
            List of dicts with 'bbox', 'emotion', 'confidence'
        """
        # Detect faces
        boxes, probs = self.face_detector.detect(image)
        if boxes is None:
            DETECTION_FAILURES.inc()
            return []
        results = []
        for box, prob in zip(boxes, probs):
            if prob < 0.9:  # Confidence threshold
                continue
            # Extract face using MTCNN's extract method, which expects an
            # array of boxes with shape (n, 4) rather than a Python list
            face_tensor = self.face_detector.extract(
                image, np.expand_dims(box, axis=0), save_path=None
            )
            if face_tensor is None:
                continue
            # Preprocess and predict
            input_tensor = self.preprocess_face(face_tensor)
            # ONNX inference
            start_time = time.time()
            outputs = self.session.run(
                [self.output_name],
                {self.input_name: np.expand_dims(input_tensor, axis=0)}
            )
            latency = time.time() - start_time
            # Get probabilities (softmax with max subtraction for stability)
            logits = outputs[0][0]
            exp_logits = np.exp(logits - np.max(logits))
            probabilities = exp_logits / exp_logits.sum()
            # Get top prediction
            pred_idx = int(np.argmax(probabilities))
            confidence = float(probabilities[pred_idx])
            # Convert box to int list
            bbox = [int(x) for x in box.tolist()]
            results.append({
                'bbox': bbox,
                'face_confidence': float(prob),
                'emotion': self.emotions[pred_idx],
                'confidence': confidence,
                'probabilities': {
                    emotion: float(probabilities[j])
                    for j, emotion in enumerate(self.emotions)
                },
                'latency_ms': round(latency * 1000, 2)
            })
            PREDICTION_COUNTER.inc()
            PREDICTION_LATENCY.observe(latency)
        return results

# Initialize model at startup
model_instance = None


@app.on_event("startup")
async def startup_event():
    global model_instance
    model_instance = EmotionModel()
    logger.info("Emotion model initialized")


# Pydantic models for response
class EmotionResult(BaseModel):
    bbox: List[int] = Field(..., description="Bounding box [x1, y1, x2, y2]")
    face_confidence: float = Field(..., ge=0, le=1)
    emotion: str
    confidence: float = Field(..., ge=0, le=1)
    probabilities: dict
    latency_ms: float


class PredictionResponse(BaseModel):
    faces: List[EmotionResult]
    total_faces: int
    processing_time_ms: float


@app.post("/predict", response_model=PredictionResponse)
async def predict_emotion(file: UploadFile = File(...)):
    """
    Upload an image and get emotion predictions for all detected faces.

    Accepts: JPEG, PNG, WebP
    Max file size: 10MB (configured in nginx/proxy)
    """
    # Validate file type
    if file.content_type not in ['image/jpeg', 'image/png', 'image/webp']:
        raise HTTPException(status_code=400, detail="Unsupported image format")
    # Read image
    contents = await file.read()
    try:
        # Convert to numpy array
        image = Image.open(io.BytesIO(contents))
        image = image.convert('RGB')
        image_np = np.array(image)
    except Exception as e:
        raise HTTPException(status_code=400, detail=f"Invalid image: {str(e)}")
    # Predict
    start_time = time.time()
    results = model_instance.predict(image_np)
    processing_time = (time.time() - start_time) * 1000
    return PredictionResponse(
        faces=[EmotionResult(**r) for r in results],
        total_faces=len(results),
        processing_time_ms=round(processing_time, 2)
    )


@app.post("/predict_batch")
async def predict_batch(files: List[UploadFile] = File(...)):
    """
    Batch prediction for multiple images.

    Useful for video frame analysis or bulk processing.
    """
    all_results = []
    for file in files:
        contents = await file.read()
        try:
            image = Image.open(io.BytesIO(contents)).convert('RGB')
        except Exception as e:
            # Reject the request instead of failing with a 500 on a bad file
            raise HTTPException(
                status_code=400,
                detail=f"Invalid image {file.filename}: {str(e)}"
            )
        image_np = np.array(image)
        results = model_instance.predict(image_np)
        all_results.append({
            'filename': file.filename,
            'faces': results
        })
    return {'results': all_results, 'total_images': len(files)}


@app.get("/health")
async def health_check():
    """Health check endpoint for Kubernetes liveness probe."""
    return {
        'status': 'healthy',
        'model_loaded': model_instance is not None,
        'device': 'cuda' if torch.cuda.is_available() else 'cpu'
    }


@app.get("/metrics")
async def metrics():
    """Prometheus metrics endpoint."""
    return Response(content=generate_latest(), media_type=CONTENT_TYPE_LATEST)


if __name__ == '__main__':
    uvicorn.run(
        'api:app',
        host='0.0.0.0',
        port=8000,
        workers=4,  # Adjust based on CPU cores
        log_level='info'
    )
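The predict method turns the raw model outputs (logits) into probabilities with a softmax that first subtracts the largest logit. The following pure-Python sketch, using a hypothetical helper name `stable_softmax`, illustrates why that shift matters: the result is mathematically unchanged, but the exponentials can no longer overflow.

```python
import math
from typing import List


def stable_softmax(logits: List[float]) -> List[float]:
    """Softmax that subtracts the maximum logit before exponentiating."""
    peak = max(logits)
    # After the shift the largest exponent is exp(0) = 1, so nothing overflows
    exps = [math.exp(value - peak) for value in logits]
    total = sum(exps)
    return [value / total for value in exps]


# math.exp(1000.0) on its own raises OverflowError; the shifted version is safe
probs = stable_softmax([1000.0, 1001.0, 1002.0])
print([round(p, 4) for p in probs])
```

Shifting every logit by the same constant leaves the probabilities unchanged, so `[1000, 1001, 1002]` and `[0, 1, 2]` produce the same distribution.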
Critical production considerations in the API code:
- Face detection threshold: MTCNN's internal stage thresholds stay at [0.6, 0.7, 0.7], and we additionally discard any detection whose probability is below 0.9. This reduces false positives but may miss some faces in challenging conditions. In production, you might want to make this configurable via environment variables.
- ONNX Runtime providers: The code automatically selects CUDA if available. On CPU-only systems, it falls back to CPUExecutionProvider. For edge deployment, consider using TensorRT or OpenVINO providers.
- Batch processing: The /predict_batch endpoint allows processing multiple images in a single request. This is useful for video analysis where you extract frames at regular intervals.
- Prometheus metrics: We track prediction count, latency, and detection failures. These can be scraped by Prometheus and visualized in Grafana for monitoring.
- File validation: We check content type and handle malformed images gracefully. In production, add file size limits at the reverse proxy level (nginx/AWS ALB).
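One way to make the face confidence threshold configurable is to read it from an environment variable at startup. This is a minimal sketch under stated assumptions: the variable name `FACE_CONF_THRESHOLD` and the helper `read_threshold` are hypothetical and do not appear in the API code above.

```python
import os


def read_threshold(name: str = "FACE_CONF_THRESHOLD", default: float = 0.9) -> float:
    """Read a probability threshold from the environment, falling back to default."""
    raw = os.environ.get(name)
    if raw is None or raw.strip() == "":
        return default
    try:
        value = float(raw)
    except ValueError as exc:
        raise ValueError(f"{name} must be a number, got {raw!r}") from exc
    # Reject values outside [0, 1]; NaN also fails this comparison
    if not 0.0 <= value <= 1.0:
        raise ValueError(f"{name} must be between 0 and 1, got {value}")
    return value


if __name__ == "__main__":
    os.environ["FACE_CONF_THRESHOLD"] = "0.8"
    print(read_threshold())
```

Failing fast on an invalid value at startup is usually preferable to silently falling back, because a mistyped threshold would otherwise change detection behavior without any visible error.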
Edge Cases and Error Handling
Real-world emotion recognition systems face numerous edge cases that can break naive implementations:
1. Multiple faces with varying sizes: Our MTCNN detector handles this naturally, but large group photos may cause memory issues. Consider adding a maximum face count (e.g., 20 faces per image) to prevent resource exhaustion.
2. Occluded faces: Sunglasses, masks, or hands covering parts of the face reduce accuracy. The model was trained on mostly frontal faces, so profile views will perform poorly. Consider adding a head pose estimation module to filter non-frontal faces.
3. Low-light conditions: FER2013 images are grayscale and relatively well-lit. In dark environments, consider preprocessing with histogram equalization or using a denoising autoencoder.
4. Children vs adults: The model was trained primarily on adult faces. Children's facial proportions differ significantly, leading to lower accuracy. If your use case involves pediatric populations, consider fine-tuning [3] on a dataset like AffectNet.
5. Cultural differences: Emotional expression varies across cultures. For example, East Asian cultures may suppress outward displays of sadness. The FER2013 dataset is predominantly Western, so accuracy may degrade for non-Western populations.
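The maximum face count suggested in the first item can be sketched as a small filter applied to the detector output before classification. The helper name `keep_largest_faces`, the choice to rank by box area, and the sample boxes are all assumptions for illustration; ranking by detection probability would be an equally reasonable policy.

```python
from typing import List, Sequence, Tuple

Box = Tuple[float, float, float, float]  # (x1, y1, x2, y2)


def box_area(box: Box) -> float:
    """Area of a box, treating inverted coordinates as zero area."""
    x1, y1, x2, y2 = box
    return max(0.0, x2 - x1) * max(0.0, y2 - y1)


def keep_largest_faces(
    boxes: Sequence[Box],
    probs: Sequence[float],
    max_faces: int = 20,
    min_prob: float = 0.9,
) -> List[Tuple[Box, float]]:
    """Drop low-confidence detections, then keep the max_faces largest boxes."""
    candidates = [(box, prob) for box, prob in zip(boxes, probs) if prob >= min_prob]
    candidates.sort(key=lambda item: box_area(item[0]), reverse=True)
    return candidates[:max_faces]


# Hypothetical detections: the last box is large but below the confidence cutoff
boxes = [(0, 0, 10, 10), (0, 0, 50, 50), (0, 0, 30, 30), (0, 0, 80, 80)]
probs = [0.99, 0.95, 0.97, 0.50]
for box, prob in keep_largest_faces(boxes, probs, max_faces=2):
    print(box, prob)
```

Capping the number of faces bounds the per-request inference cost, which keeps a single group photo from monopolizing a worker.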
Error handling strategy:
- Return empty results (not an error) when no face is detected
- Log detection failures separately from prediction errors
- Implement circuit breakers for downstream services
- Use exponential backoff for retries on transient failures
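The retry item above can be sketched with the standard library alone. This is an illustrative helper, not part of the API code: `retry_with_backoff` is a hypothetical name, the set of retryable exceptions is an assumption, and the "full jitter" variant (sleeping a random time up to the exponential cap) is one common choice among several.

```python
import random
import time
from typing import Callable, Tuple, Type, TypeVar

T = TypeVar("T")


def retry_with_backoff(
    fn: Callable[[], T],
    max_attempts: int = 5,
    base_delay: float = 0.1,
    max_delay: float = 5.0,
    retry_on: Tuple[Type[BaseException], ...] = (ConnectionError, TimeoutError),
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call fn, retrying transient failures with exponential backoff and jitter."""
    for attempt in range(max_attempts):
        try:
            return fn()
        except retry_on:
            if attempt == max_attempts - 1:
                raise  # Out of attempts: surface the last error to the caller
            # Delay cap doubles each attempt: base, 2*base, 4*base, ... up to max_delay
            cap = min(max_delay, base_delay * (2 ** attempt))
            sleep(random.uniform(0, cap))
    raise RuntimeError("max_attempts must be at least 1")


if __name__ == "__main__":
    attempts = {"count": 0}

    def flaky_call() -> str:
        # Hypothetical downstream call that fails twice before succeeding
        attempts["count"] += 1
        if attempts["count"] < 3:
            raise ConnectionError("transient failure")
        return "ok"

    print(retry_with_backoff(flaky_call), "after", attempts["count"], "attempts")
```

Only errors listed in `retry_on` are retried; anything else propagates immediately, which avoids retrying failures such as invalid input that will never succeed.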
Deployment and Scaling
For production deployment, consider the following architecture:
# docker-compose.yml
version: '3.8'
services:
  emotion-api:
    build: .
    ports:
      - "8000:8000"
    environment:
      - CUDA_VISIBLE_DEVICES=0
      - ONNX_PROVIDER=CUDAExecutionProvider
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: 1
              capabilities: [gpu]
    healthcheck:
      test: ["CMD", "curl", "-f", "http://localhost:8000/health"]
      interval: 30s
      timeout: 10s
      retries: 3
  prometheus:
    image: prom/prometheus:latest
    volumes:
      - ./prometheus.yml:/etc/prometheus/prometheus.yml
    ports:
      - "9090:9090"
  grafana:
    image: grafana/grafana:latest
    ports:
      - "3000:3000"
    environment:
      - GF_SECURITY_ADMIN_PASSWORD=admin
Scaling considerations:
- Use a load balancer (nginx, AWS ALB) in front of multiple API instances
- Cache face detection results for identical images (e.g., video frames)
- Consider using Redis for request queuing during traffic spikes
- Implement rate limiting per API key to prevent abuse
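Per-key rate limiting is often implemented as a token bucket. The sketch below is a minimal in-memory version with hypothetical class names (`TokenBucket`, `RateLimiter`). It assumes a single process, so with several uvicorn workers or API instances the counters would need to live in a shared store such as Redis.

```python
import time
from dataclasses import dataclass
from typing import Callable, Dict


@dataclass
class TokenBucket:
    """Bucket that refills continuously up to a fixed capacity."""
    capacity: float
    refill_per_second: float
    tokens: float
    updated_at: float

    def allow(self, now: float) -> bool:
        # Add the tokens earned since the last call, without exceeding capacity
        elapsed = max(0.0, now - self.updated_at)
        self.tokens = min(self.capacity, self.tokens + elapsed * self.refill_per_second)
        self.updated_at = now
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False


class RateLimiter:
    """Keeps one in-memory token bucket per API key."""

    def __init__(
        self,
        capacity: float = 10.0,
        refill_per_second: float = 1.0,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self._capacity = capacity
        self._refill = refill_per_second
        self._clock = clock
        self._buckets: Dict[str, TokenBucket] = {}

    def allow(self, api_key: str) -> bool:
        now = self._clock()
        bucket = self._buckets.get(api_key)
        if bucket is None:
            # New keys start with a full bucket
            bucket = TokenBucket(self._capacity, self._refill, self._capacity, now)
            self._buckets[api_key] = bucket
        return bucket.allow(now)


if __name__ == "__main__":
    limiter = RateLimiter(capacity=2, refill_per_second=1.0)
    print([limiter.allow("demo-key") for _ in range(3)])
```

In a FastAPI service, a check like `limiter.allow(key)` would typically run in a dependency or middleware and return HTTP 429 when it yields `False`.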
What's Next
This tutorial provides a production-ready foundation for emotion recognition, but several improvements can enhance accuracy and robustness:
- Multi-modal fusion: Combine facial expressions with voice tone and text sentiment for more accurate emotion detection. Research from MIT Media Lab shows that multi-modal systems achieve 15-20% higher accuracy than vision-only systems.
- Temporal modeling: For video analysis, use a 3D CNN or LSTM to capture emotional transitions over time. This is critical for detecting micro-expressions, which can last as little as 1/25th of a second.
- Federated learning: If deploying on edge devices, consider federated learning to improve the model without centralizing sensitive facial data. Google's TensorFlow [4] Federated framework supports this.
- Explainability: Add Grad-CAM visualizations to show which facial regions influenced the prediction. This builds trust with users and helps debug misclassifications.
- Privacy preservation: Implement on-device processing to avoid transmitting facial images over the network. Apple's Vision framework already does this for face detection on iOS.
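As a first step toward temporal modeling, per-frame predictions can be smoothed before any learned sequence model is introduced. The sketch below applies an exponential moving average to per-frame probability dictionaries. It is a deliberately simpler stand-in, not the 3D CNN or LSTM mentioned above, and the function name `smooth_probabilities` and the sample frames are hypothetical.

```python
from typing import Dict, Iterable, List, Optional


def smooth_probabilities(
    frames: Iterable[Dict[str, float]], alpha: float = 0.3
) -> List[Dict[str, float]]:
    """Exponential moving average over per-frame emotion probabilities."""
    if not 0.0 < alpha <= 1.0:
        raise ValueError("alpha must be in (0, 1]")
    smoothed: List[Dict[str, float]] = []
    state: Optional[Dict[str, float]] = None
    for probs in frames:
        if state is None:
            state = dict(probs)  # The first frame initializes the average
        else:
            # Blend the new frame with the running average, key by key
            state = {key: alpha * probs[key] + (1 - alpha) * state[key] for key in state}
        smoothed.append(dict(state))
    return smoothed


# Hypothetical two-class example: one noisy frame in the middle
frames = [
    {"Happy": 0.9, "Sad": 0.1},
    {"Happy": 0.1, "Sad": 0.9},
    {"Happy": 0.9, "Sad": 0.1},
]
for frame in smooth_probabilities(frames, alpha=0.5):
    print({name: round(value, 3) for name, value in frame.items()})
```

Smoothing suppresses single-frame flicker at the cost of responsiveness, so it would blur exactly the short micro-expressions that a dedicated temporal model is meant to catch.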
The complete source code for this tutorial is available on GitHub. For further reading, check out our guides on deploying PyTorch models with ONNX Runtime and building scalable computer vision APIs.