How to Monitor LLM Apps with LangSmith and Weights & Biases
Practical tutorial: Monitor LLM apps with LangSmith and Weights & Biases
Building production LLM applications is fundamentally different from prototyping in a notebook. When you move from a single prompt test to a system handling thousands of requests, you need observability—not just for debugging, but for continuous improvement. This tutorial walks through integrating two complementary monitoring platforms: LangSmith for trace-level debugging and Weights & Biases (W&B) for aggregate performance tracking.
We'll build a real-time monitoring pipeline for a RAG (Retrieval-Augmented Generation) application, covering trace collection, latency analysis, cost tracking, and automated evaluation. By the end, you'll have a production-ready monitoring setup that catches regressions before they reach users.
Why Dual Monitoring Matters in Production LLM Systems
A single monitoring tool cannot solve all LLM observability problems. LangSmith excels at per-request tracing—showing you exactly which prompt, retrieval step, or model call caused a failure. W&B provides the macro view: latency distributions over time, cost trends across model versions, and experiment tracking for prompt iterations.
Consider a common production scenario: your RAG pipeline suddenly starts returning irrelevant answers. LangSmith traces show the retriever is returning empty chunks. W&B dashboards reveal this correlates with a recent embedding model update. Without both views, you'd waste hours guessing whether the issue is prompt drift, model degradation, or data pipeline changes.
According to LangChain's documentation [8], LangSmith supports tracing for over 40 LLM providers and frameworks as of early 2026. W&B's LLM monitoring features, provided through its Weave library, offer automatic token counting and latency tracking for OpenAI, Anthropic, and open-source models.
Prerequisites and Environment Setup
Before writing code, set up your environment. You'll need Python 3.10+ and API keys for both platforms.
# Create a virtual environment
python -m venv llm-monitor-env
source llm-monitor-env/bin/activate
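# Install core dependencies
# (langchain-community provides the Weaviate vector store used below;
#  weaviate.Client in the pipeline code is the v3 client API, hence the pin)
pip install langchain langchain-community langchain-openai langsmith wandb "weaviate-client<4" fastapi uvicorn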
# For this tutorial, we'll use OpenAI as our LLM provider
# Install the OpenAI SDK
pip install openai
# Set environment variables
export OPENAI_API_KEY="your-openai-key"
export LANGCHAIN_API_KEY="your-langsmith-key"
export LANGCHAIN_TRACING_V2="true"
export LANGCHAIN_PROJECT="llm-monitoring-tutorial"
export WANDB_API_KEY="your-wandb-key"
The LANGCHAIN_TRACING_V2=true environment variable is critical—it enables automatic trace collection for all LangChain runs. Without it, LangSmith won't capture any data.
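Because a missing or misspelled variable fails silently (the app runs, but no traces appear), it can help to check the environment at startup. The following is a minimal sketch, not part of either SDK: `check_monitoring_env` and `REQUIRED_VARS` are hypothetical names, and the check only covers the variables exported above.

```python
import os

# Variables this tutorial exports; adjust the tuple if your setup differs.
REQUIRED_VARS = (
    "OPENAI_API_KEY",
    "LANGCHAIN_API_KEY",
    "LANGCHAIN_TRACING_V2",
    "WANDB_API_KEY",
)


def check_monitoring_env(env=None):
    """Return a list of problems; an empty list means the setup looks complete."""
    env = os.environ if env is None else env
    problems = [f"{name} is not set" for name in REQUIRED_VARS if not env.get(name)]
    tracing = env.get("LANGCHAIN_TRACING_V2", "")
    # The value is present but not "true": tracing would stay off.
    if tracing and tracing.lower() != "true":
        problems.append('LANGCHAIN_TRACING_V2 should be "true" to enable tracing')
    return problems


# Print warnings instead of raising, so a misconfigured monitor never blocks startup.
for problem in check_monitoring_env():
    print("WARNING:", problem)
```

Passing a plain dict instead of `os.environ` makes the check easy to unit test.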
Building the RAG Pipeline with Instrumentation
Let's create a RAG application that answers questions about technical documentation. We'll instrument it from the ground up with both LangSmith and W&B monitoring.
# rag_pipeline.py
import time
from dataclasses import dataclass, field
from typing import Any, Dict, Optional

import wandb
import weaviate
from langchain_community.vectorstores import Weaviate
from langchain_core.output_parsers import StrOutputParser
from langchain_core.prompts import ChatPromptTemplate
from langchain_core.runnables import RunnableLambda, RunnablePassthrough
from langchain_openai import ChatOpenAI, OpenAIEmbeddings


@dataclass
class MonitorMetrics:
    """Container for per-request monitoring data."""
    latency_ms: float = 0.0
    tokens_used: int = 0
    retrieval_count: int = 0
    error: Optional[str] = None
    model: str = "gpt-4"
    timestamp: float = field(default_factory=time.time)
class MonitoredRAGPipeline:
    """
    Production RAG pipeline with dual monitoring.
    Tracks every request through LangSmith traces and W&B metrics.
    """

    def __init__(self, weaviate_url: str = "http://localhost:8080"):
        # Initialize Weaviate vector store (weaviate.Client is the v3 client API)
        self.client = weaviate.Client(weaviate_url)
        self.embeddings = OpenAIEmbeddings(model="text-embedding-3-small")
        self.vectorstore = Weaviate(
            client=self.client,
            index_name="Documentation",
            text_key="content",
            embedding=self.embeddings,
            attributes=["source", "title"]
        )
        # Initialize LLM with monitoring
        self.llm = ChatOpenAI(
            model="gpt-4",
            temperature=0.1,
            # LangSmith automatically captures these parameters
            model_kwargs={"user": "monitoring-tutorial"}
        )
        # Create the RAG prompt
        self.prompt = ChatPromptTemplate.from_messages([
            ("system", """You are a technical documentation assistant.
Answer questions based on the provided context.
If the context doesn't contain the answer, say so clearly.
Context: {context}"""),
            ("human", "{question}")
        ])
        # Initialize W&B run for this session
        self.wandb_run = wandb.init(
            project="llm-monitoring-tutorial",
            config={
                "model": "gpt-4",
                "embedding_model": "text-embedding-3-small",
                "vector_store": "weaviate",
                "chunk_size": 1000
            }
        )
        # Build the chain with monitoring hooks
        self.chain = self._build_monitored_chain()
    def _build_monitored_chain(self):
        """
        Construct the RAG chain with monitoring callbacks.
        Each step is wrapped to capture metrics.
        """

        def retrieve_docs(inputs: Dict[str, Any]) -> Dict[str, Any]:
            """Retrieve relevant documents with timing."""
            # RunnablePassthrough.assign passes the whole input dict to this step
            question = inputs["question"]
            start = time.perf_counter()
            try:
                # This function is wrapped in a RunnableLambda below, so it
                # appears in the LangSmith trace under the name "retrieve_docs"
                docs = self.vectorstore.similarity_search(question, k=4)
                latency = (time.perf_counter() - start) * 1000
                # Log retrieval metrics to W&B
                wandb.log({
                    "retrieval_latency_ms": latency,
                    "retrieved_docs_count": len(docs)
                })
                return {
                    "context": "\n\n".join([d.page_content for d in docs]),
                    "sources": [d.metadata.get("source", "unknown") for d in docs],
                    "retrieval_latency_ms": latency
                }
            except Exception as e:
                wandb.log({"retrieval_error": str(e)})
                raise

        def format_response(response: str, metadata: Dict) -> Dict:
            """Format the final response with metadata."""
            return {
                "answer": response,
                "sources": metadata.get("sources", []),
                "retrieval_latency_ms": metadata.get("retrieval_latency_ms", 0)
            }

        # The prompt -> LLM -> parser sub-chain produces the answer text
        answer_chain = self.prompt | self.llm | StrOutputParser()

        # Build the chain using LangChain Expression Language.
        # Each assign() adds a key to the running dict, so the retrieval
        # metadata is still available when the response is formatted.
        chain = (
            RunnablePassthrough.assign(
                retrieved=RunnableLambda(retrieve_docs)
            )
            | RunnablePassthrough.assign(
                context=lambda x: x["retrieved"]["context"]
            )
            | RunnablePassthrough.assign(answer=answer_chain)
            | RunnableLambda(
                lambda x: format_response(x["answer"], x["retrieved"])
            )
        )
        return chain
    def query(self, question: str,
              config: Dict[str, Any] | None = None) -> Dict[str, Any]:
        """
        Execute a query with full monitoring.
        Returns both the answer and monitoring metrics.
        `config` may carry extra LangSmith tags, e.g. {"tags": ["user:42"]}.
        """
        metrics = MonitorMetrics()
        start = time.perf_counter()
        extra_tags = (config or {}).get("tags", [])
        try:
            # LangSmith automatically creates a trace for this run
            result = self.chain.invoke(
                {"question": question},
                config={
                    "run_name": f"query_{int(time.time())}",
                    "tags": ["production", "rag", *extra_tags]
                }
            )
            # Calculate total latency
            metrics.latency_ms = (time.perf_counter() - start) * 1000
            # Log aggregate metrics to W&B
            wandb.log({
                "total_latency_ms": metrics.latency_ms,
                "retrieval_latency_ms": result.get("retrieval_latency_ms", 0),
                "query_timestamp": metrics.timestamp
            })
            return {
                "answer": result["answer"],
                "sources": result["sources"],
                "metrics": {
                    "total_latency_ms": metrics.latency_ms,
                    "retrieval_latency_ms": result["retrieval_latency_ms"]
                }
            }
        except Exception as e:
            metrics.error = str(e)
            wandb.log({"query_error": str(e), "error_timestamp": time.time()})
            raise

    def close(self):
        """Clean up monitoring connections."""
        self.wandb_run.finish()
This pipeline captures three critical monitoring dimensions:
- Per-request traces in LangSmith show the exact sequence of retrieval, prompt construction, and LLM generation for every query.
- Real-time metrics in W&B track latency distributions and error rates as they happen.
- Session-level configuration in W&B records the exact model versions and parameters used.
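The `tokens_used` field on `MonitorMetrics` is declared but never filled, so cost is not yet one of the tracked dimensions. The sketch below shows one way to turn token counts into a cost metric. It assumes the prompt and completion token counts come from the provider's usage data, and the prices in `EXAMPLE_PRICE` are placeholders, not real rates. `TokenPrice` and `estimate_cost_usd` are hypothetical helpers.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TokenPrice:
    """USD price per 1,000 tokens. Values are placeholders, not real rates."""
    prompt_per_1k: float
    completion_per_1k: float


def estimate_cost_usd(prompt_tokens: int, completion_tokens: int,
                      price: TokenPrice) -> float:
    """Estimate the cost of one request from its token counts."""
    return (
        (prompt_tokens / 1000) * price.prompt_per_1k
        + (completion_tokens / 1000) * price.completion_per_1k
    )


# Hypothetical prices for illustration only; look up your model's current rates.
EXAMPLE_PRICE = TokenPrice(prompt_per_1k=0.03, completion_per_1k=0.06)

cost = estimate_cost_usd(prompt_tokens=1500, completion_tokens=500,
                         price=EXAMPLE_PRICE)

# A dict in this shape could be passed to wandb.log next to the latency metrics.
cost_metrics = {
    "prompt_tokens": 1500,
    "completion_tokens": 500,
    "tokens_used": 2000,
    "estimated_cost_usd": round(cost, 6),
}
```

Logging the estimate per request lets a W&B panel chart cost per query alongside latency.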
Deploying the Monitoring API with FastAPI
To make this production-ready, wrap the pipeline in a FastAPI server with health checks and proper error handling.
# api_server.py
import logging
import time
from contextlib import asynccontextmanager
from typing import Optional

from fastapi import FastAPI, HTTPException
from pydantic import BaseModel, Field

from rag_pipeline import MonitoredRAGPipeline

# Configure structured logging
logging.basicConfig(
    level=logging.INFO,
    format='%(asctime)s - %(name)s - %(levelname)s - %(message)s'
)
logger = logging.getLogger(__name__)


class QueryRequest(BaseModel):
    question: str = Field(..., min_length=1, max_length=5000)
    user_id: str = Field(default="anonymous", max_length=100)


class QueryResponse(BaseModel):
    answer: str
    sources: list[str]
    metrics: dict


# Global pipeline instance, created in the lifespan handler
pipeline: Optional[MonitoredRAGPipeline] = None


@asynccontextmanager
async def lifespan(app: FastAPI):
    """Initialize and cleanup the RAG pipeline."""
    global pipeline
    logger.info("Initializing RAG pipeline with monitoring...")
    pipeline = MonitoredRAGPipeline()
    yield
    logger.info("Shutting down monitoring connections...")
    pipeline.close()


app = FastAPI(
    title="Monitored RAG API",
    version="1.0.0",
    lifespan=lifespan
)


@app.get("/health")
async def health_check():
    """Health endpoint for monitoring uptime."""
    return {
        "status": "healthy",
        "pipeline_initialized": pipeline is not None,
        "timestamp": time.time()
    }


# Declared with plain `def` so FastAPI runs the blocking pipeline call
# in its thread pool instead of stalling the event loop.
@app.post("/query", response_model=QueryResponse)
def query_endpoint(request: QueryRequest):
    """
    Process a RAG query with full monitoring.
    LangSmith captures the trace automatically.
    W&B logs aggregate metrics.
    """
    if not pipeline:
        raise HTTPException(status_code=503, detail="Pipeline not initialized")
    try:
        # Add user_id as a LangSmith tag for user-level monitoring
        result = pipeline.query(
            request.question,
            config={"tags": [f"user:{request.user_id}"]}
        )
        logger.info(
            f"Query processed - latency: {result['metrics']['total_latency_ms']:.2f}ms"
        )
        return QueryResponse(**result)
    except Exception as e:
        logger.error(f"Query failed: {str(e)}")
        raise HTTPException(status_code=500, detail=str(e)) from e


if __name__ == "__main__":
    import uvicorn
    uvicorn.run(
        "api_server:app",
        host="0.0.0.0",
        port=8000,
        log_level="info"
    )
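Start the server with the command below. The --reload flag restarts the server on code changes and is meant for development; drop it in production.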
uvicorn api_server:app --host 0.0.0.0 --port 8000 --reload
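Once the server is running, a small client is enough to exercise the /query endpoint and confirm that traces and metrics show up. This sketch uses only the standard library and assumes the server listens on http://localhost:8000 as in the command above. `build_query_request` and `ask` are hypothetical helper names.

```python
import json
import urllib.request


def build_query_request(question: str, user_id: str = "anonymous",
                        base_url: str = "http://localhost:8000") -> urllib.request.Request:
    """Build a POST request matching the QueryRequest schema (question, user_id)."""
    payload = json.dumps({"question": question, "user_id": user_id}).encode("utf-8")
    return urllib.request.Request(
        f"{base_url}/query",
        data=payload,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def ask(question: str, user_id: str = "anonymous", timeout: float = 60.0) -> dict:
    """Send the request and decode the JSON body (answer, sources, metrics)."""
    request = build_query_request(question, user_id)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))


# Example usage (requires the server to be running):
#   result = ask("How do I configure the retriever?", user_id="demo-user")
#   print(result["answer"], result["metrics"])
```

Each call should produce one trace in LangSmith tagged with the user ID and one set of latency points in W&B.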
Setting Up Automated Evaluation and Alerting
Monitoring without evaluation is just data collection. Let's add automated quality checks that trigger alerts when metrics degrade.
# evaluation_monitor.py
import asyncio
import time
from datetime import datetime, timedelta, timezone
from typing import Dict

import wandb
from langsmith import Client as LangSmithClient


class QualityMonitor:
    """
    Automated quality monitoring that checks for regressions
    and sends alerts when metrics cross their thresholds.
    """

    def __init__(self,
                 latency_threshold_ms: float = 5000,
                 error_rate_threshold: float = 0.05,
                 evaluation_interval_minutes: int = 15):
        self.latency_threshold = latency_threshold_ms
        self.error_rate_threshold = error_rate_threshold
        self.interval = evaluation_interval_minutes
        self.langsmith_client = LangSmithClient()
        self.wandb_api = wandb.Api()
        # wandb.log needs an active run, so start one for the monitor itself
        self.wandb_run = wandb.init(
            project="llm-monitoring-tutorial",
            job_type="monitoring"
        )
        # Load reference metrics from last good deployment
        self.reference_metrics = self._load_reference_metrics()

    def _load_reference_metrics(self) -> Dict:
        """Load baseline metrics from W&B for comparison."""
        try:
            # Get the most recent production run's metrics
            runs = self.wandb_api.runs(
                "llm-monitoring-tutorial",
                filters={"config.deployment": "production"},
                order="-created_at"
            )
            if runs:
                last_run = runs[0]
                return {
                    "avg_latency_ms": last_run.summary.get("avg_latency_ms", 1000),
                    "p95_latency_ms": last_run.summary.get("p95_latency_ms", 3000),
                    "error_rate": last_run.summary.get("error_rate", 0.01)
                }
        except Exception as e:
            print(f"Could not load reference metrics: {e}")
        # Fall back to default baselines
        return {
            "avg_latency_ms": 1000,
            "p95_latency_ms": 3000,
            "error_rate": 0.01
        }

    async def check_current_metrics(self):
        """
        Fetch recent metrics from LangSmith and W&B,
        then compare against thresholds.
        """
        # Get metrics from last interval
        since = datetime.now(timezone.utc) - timedelta(minutes=self.interval)
        # LangSmith provides per-run metrics; is_root=True keeps only
        # top-level traces so nested chain steps are not double-counted
        recent_runs = self.langsmith_client.list_runs(
            project_name="llm-monitoring-tutorial",
            start_time=since,
            run_type="chain",
            is_root=True
        )
        # Calculate aggregate metrics
        latencies = []
        errors = 0
        total = 0
        # list_runs returns a regular (synchronous) iterator
        for run in recent_runs:
            total += 1
            if run.error:
                errors += 1
            # Derive latency from the run's start and end timestamps
            if run.start_time and run.end_time:
                latencies.append(
                    (run.end_time - run.start_time).total_seconds() * 1000
                )
        if total == 0:
            return {"status": "no_data", "message": "No runs in interval"}
        # Calculate statistics
        avg_latency = sum(latencies) / len(latencies) if latencies else 0
        error_rate = errors / total if total > 0 else 0
        # Sort for percentile calculation
        sorted_latencies = sorted(latencies)
        p95_index = int(len(sorted_latencies) * 0.95)
        p95_latency = sorted_latencies[p95_index] if sorted_latencies else 0
        # Log to W&B for historical tracking
        wandb.log({
            "monitoring/avg_latency_ms": avg_latency,
            "monitoring/p95_latency_ms": p95_latency,
            "monitoring/error_rate": error_rate,
            "monitoring/total_requests": total,
            "monitoring/check_timestamp": time.time()
        })
        # Check thresholds
        alerts = []
        if avg_latency > self.latency_threshold:
            alerts.append(
                f"High average latency: {avg_latency:.0f}ms "
                f"(threshold: {self.latency_threshold}ms)"
            )
        if p95_latency > self.latency_threshold * 2:
            alerts.append(
                f"High P95 latency: {p95_latency:.0f}ms "
                f"(threshold: {self.latency_threshold * 2}ms)"
            )
        if error_rate > self.error_rate_threshold:
            alerts.append(
                f"High error rate: {error_rate:.2%} "
                f"(threshold: {self.error_rate_threshold:.2%})"
            )
        # Log alerts to W&B as a table for dashboard visibility
        if alerts:
            alert_table = wandb.Table(
                columns=["timestamp", "alert", "avg_latency", "error_rate"],
                data=[[time.time(), alert, avg_latency, error_rate]
                      for alert in alerts]
            )
            wandb.log({"alerts": alert_table})
        return {
            "status": "alert" if alerts else "ok",
            "metrics": {
                "avg_latency_ms": avg_latency,
                "p95_latency_ms": p95_latency,
                "error_rate": error_rate,
                "total_requests": total
            },
            "alerts": alerts
        }

    async def run_continuously(self):
        """Run monitoring checks at regular intervals."""
        print(f"Starting quality monitor (checking every {self.interval} minutes)")
        while True:
            try:
                result = await self.check_current_metrics()
                if result["status"] == "alert":
                    print(f"ALERTS DETECTED: {result['alerts']}")
                    # In production, send to PagerDuty, Slack, etc.
                elif result["status"] == "ok":
                    print(f"Health check passed: {result['metrics']}")
                else:
                    print(result["message"])
            except Exception as e:
                print(f"Monitoring check failed: {e}")
                wandb.log({"monitoring_error": str(e)})
            await asyncio.sleep(self.interval * 60)


# Run the monitor
if __name__ == "__main__":
    monitor = QualityMonitor()
    asyncio.run(monitor.run_continuously())
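This evaluation monitor runs as a background service, checking metrics every 15 minutes by default. It compares the current average latency, P95 latency, and error rate against the configured thresholds and logs alerts as structured W&B tables for dashboard visualization. The baseline it loads from W&B is stored in reference_metrics and is available for relative comparisons.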
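Fixed thresholds catch outright failures, but a slow drift away from the last good deployment needs a relative check. The sketch below is one way to use the baseline dictionary for that. `find_regressions` and the 20% tolerance are assumptions for illustration, not part of the monitor above.

```python
from typing import Dict, List


def find_regressions(current: Dict[str, float],
                     baseline: Dict[str, float],
                     tolerance: float = 0.2) -> List[str]:
    """Return a message for each metric that exceeds its baseline by more than `tolerance`.

    All metrics are treated as "lower is better" (latency, error rate).
    """
    regressions = []
    for name, base_value in baseline.items():
        value = current.get(name)
        if value is None:
            continue  # metric not measured in this interval
        limit = base_value * (1 + tolerance)
        if value > limit:
            regressions.append(
                f"{name}: {value:.4g} exceeds baseline {base_value:.4g} "
                f"by more than {tolerance:.0%}"
            )
    return regressions


baseline = {"avg_latency_ms": 1000, "p95_latency_ms": 3000, "error_rate": 0.01}
current = {"avg_latency_ms": 1100, "p95_latency_ms": 4000, "error_rate": 0.01}

# Only P95 latency is more than 20% above its baseline here.
for message in find_regressions(current, baseline):
    print(message)
```

The returned messages have the same shape as the threshold alerts, so they could be appended to the same alerts list.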
Handling Edge Cases and Production Considerations
Production monitoring systems fail in predictable ways. Here are the critical edge cases to handle:
Quotas and Rate Limits: Both LangSmith and W&B meter usage. The figures used for this tutorial are 10,000 traces per month on LangSmith's free tier and 100 GB of logged data on W&B's free tier, with paid plans starting at $99/month for LangSmith teams and $50/user/month for W&B Teams. Plans and prices change often, so confirm the current numbers on each vendor's pricing page before budgeting.
Trace Sampling: At high throughput, tracing every request becomes expensive. Implement sampling:
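import random

from langchain_core.tracers import LangChainTracer


def should_trace(sample_rate: float = 0.1) -> bool:
    """Sample traces to reduce monitoring costs."""
    return random.random() < sample_rate


# In your pipeline: leave LANGCHAIN_TRACING_V2 unset so tracing is opt-in,
# then attach a tracer only to the sampled requests.
config = {"tags": ["production", "rag"]}
if should_trace(0.1):  # Trace 10% of requests
    config["callbacks"] = [LangChainTracer(project_name="llm-monitoring-tutorial")]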
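Picking the sample rate is simple arithmetic once you know your traffic and how many traces you are willing to store per month. A minimal sketch, where `sample_rate_for_budget` is a hypothetical helper and the traffic and budget figures are made-up examples:

```python
def sample_rate_for_budget(requests_per_day: float,
                           monthly_trace_budget: int,
                           days_per_month: int = 30) -> float:
    """Return the fraction of requests to trace so the monthly budget is not exceeded."""
    if requests_per_day <= 0:
        return 1.0  # no traffic, so tracing everything costs nothing
    expected_requests = requests_per_day * days_per_month
    # Never return more than 1.0: a generous budget means "trace everything".
    return min(1.0, monthly_trace_budget / expected_requests)


# Example: 50,000 requests/day against a budget of 150,000 traces/month.
rate = sample_rate_for_budget(requests_per_day=50_000, monthly_trace_budget=150_000)
print(f"Trace {rate:.0%} of requests")
```

The result can be passed straight to `should_trace`. Recompute it when traffic changes rather than hard-coding the rate.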
Memory Management: W&B runs accumulate data in memory. For long-running services, periodically finish and restart runs:
import time
from datetime import datetime, timezone

import wandb


class ManagedWandbRun:
    """Automatically rotates W&B runs to prevent memory leaks."""

    def __init__(self, rotation_hours: int = 24):
        self.rotation_hours = rotation_hours
        self.run = None
        self.start_time = None
        self._initialize_run()

    def _initialize_run(self):
        # Finish the previous run (if any) before starting a fresh one
        if self.run:
            self.run.finish()
        self.run = wandb.init(
            project="llm-monitoring-tutorial",
            config={"rotation_start": datetime.now(timezone.utc).isoformat()}
        )
        self.start_time = time.time()

    def log(self, data: dict):
        # Rotate once the current run is older than the rotation window
        if time.time() - self.start_time > self.rotation_hours * 3600:
            self._initialize_run()
        self.run.log(data)
Error Recovery: If W&B or LangSmith APIs are unreachable, your application should degrade gracefully, not crash:
import functools
import logging

import wandb

logger = logging.getLogger(__name__)


def safe_monitoring(func):
    """Decorator that catches monitoring failures without crashing the app."""
    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        try:
            return func(*args, **kwargs)
        except Exception as e:
            logger.warning(f"Monitoring call failed: {e}")
            return None
    return wrapper


# Usage
@safe_monitoring
def log_to_wandb(metrics):
    wandb.log(metrics)
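Swallowing errors keeps requests alive, but during a long outage every request still pays for a failed monitoring call. One possible extension is to pause monitoring for a while after several consecutive failures. The sketch below is an assumption about how that could look, not an existing API: `safe_monitoring_with_cooldown` and its defaults are hypothetical, and the shared state is not thread-safe.

```python
import functools
import logging
import time

logger = logging.getLogger("monitoring")


def safe_monitoring_with_cooldown(max_failures: int = 3, cooldown_s: float = 60.0):
    """Skip the wrapped call for `cooldown_s` seconds after `max_failures` consecutive errors."""
    def decorator(func):
        state = {"failures": 0, "skip_until": 0.0}

        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            if time.monotonic() < state["skip_until"]:
                return None  # cooling down: do not touch the backend at all
            try:
                result = func(*args, **kwargs)
            except Exception as exc:
                state["failures"] += 1
                logger.warning("Monitoring call failed (%d in a row): %s",
                               state["failures"], exc)
                if state["failures"] >= max_failures:
                    # Too many consecutive failures: pause and reset the counter
                    state["skip_until"] = time.monotonic() + cooldown_s
                    state["failures"] = 0
                return None
            state["failures"] = 0  # a success ends the failure streak
            return result
        return wrapper
    return decorator


# Demonstration with a sink that always fails (stands in for an unreachable API).
attempts = []


@safe_monitoring_with_cooldown(max_failures=2, cooldown_s=1000.0)
def flaky_sink(metrics):
    attempts.append(metrics)
    raise RuntimeError("backend unreachable")


for i in range(5):
    flaky_sink({"step": i})  # never raises; calls after the 2nd are skipped
```

After the second failure the sink is no longer called, so the remaining requests skip the monitoring call until the cooldown expires.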
What's Next
You now have a production-ready monitoring setup that combines LangSmith's per-request traces with W&B's aggregate dashboards. The system automatically detects latency regressions and error rate spikes; scoring answer quality is the natural next step.
To extend this further:
- Add LLM-as-judge evaluation: Use a separate LLM to score response quality and log those scores to W&B alongside latency metrics.
- Implement A/B testing: Route traffic between model versions and compare metrics in W&B dashboards.
- Build custom dashboards: W&B's reporting features let you create executive summaries showing cost per query, user satisfaction trends, and model drift over time.
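For the A/B testing idea, the routing step can be as small as a deterministic hash of the user ID, so that each user consistently sees the same variant and their metrics can be grouped by it. This is a sketch under stated assumptions: `assign_variant`, the variant names, and the salt are hypothetical, and hashing is only one of several ways to split traffic.

```python
import hashlib


def assign_variant(user_id: str, candidate_share: float = 0.5,
                   salt: str = "prompt-v2-test") -> str:
    """Deterministically map a user to "control" or "candidate".

    The salt names the experiment, so a new experiment reshuffles users.
    """
    digest = hashlib.sha256(f"{salt}:{user_id}".encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big")    # integer in [0, 2**64)
    threshold = int(candidate_share * 2**64)      # share of the range for "candidate"
    return "candidate" if bucket < threshold else "control"


# The same user always lands in the same variant.
variant = assign_variant("user-123", candidate_share=0.1)
print(variant)
```

Adding the variant as a tag on each request (for example `variant:candidate`) and as a key in the logged metrics makes it possible to compare the two groups in both tools.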
The key insight from this implementation is that monitoring is not a separate concern—it's embedded in every request path. By instrumenting at the pipeline level, you capture data that helps you understand not just what happened, but why it happened. In production LLM systems, that distinction is the difference between hours of debugging and immediate resolution.