How to Monitor LLM Apps with LangSmith and Weights & Biases
Practical tutorial: Monitor LLM apps with LangSmith and Weights & Biases
Building production LLM applications is fundamentally different from prototyping in a notebook. When you move from a single prompt test to serving thousands of users, you need observability into every aspect of your pipeline: prompt quality, model latency, token consumption, and response correctness. Two platforms have emerged as the standard tools for this task: LangSmith for tracing and debugging, and Weights & Biases (W&B) for experiment tracking and artifact management.
In this tutorial, you will build a production-grade monitoring system that integrates both platforms into a FastAPI application serving a RAG (Retrieval-Augmented Generation) pipeline. You will learn how to instrument every request, log traces to LangSmith, log metrics and artifacts to W&B, and handle the edge cases that break naive monitoring setups.
Why Dual Monitoring Matters in Production
A single monitoring tool cannot cover the full lifecycle of an LLM application. LangSmith excels at per-request tracing—showing you exactly which prompt was sent, which documents were retrieved, and how the model responded. W&B excels at aggregate observability—tracking how your system performs over time, comparing prompt versions, and storing evaluation results.
According to LangChain's documentation [7], LangSmith supports tracing for LangChain, OpenAI, and custom applications via the OpenTelemetry standard. As of June 2026, LangSmith's free tier includes 10,000 traced requests per month, with paid plans starting at $99/month for teams. W&B's free tier includes unlimited personal projects with 100 GB of artifact storage, with team plans starting at $50/user/month.
The architecture we will build looks like this:
User Request → FastAPI → LangChain RAG Pipeline
↓ ↓
LangSmith Trace W&B Metrics
(per-request) (aggregate + artifacts)
Prerequisites and Environment Setup
Before writing any code, set up your environment with the required dependencies. You will need Python 3.10 or later, a LangSmith API key, and a W&B API key.
# Create a virtual environment
python -m venv llm-monitor-env
source llm-monitor-env/bin/activate
# Install core dependencies
pip install fastapi uvicorn langchain langchain-openai langchain-chroma langsmith wandb chromadb pydantic python-dotenv
# Verify installations
python -c "import langsmith; print(langsmith.__version__)"
python -c "import wandb; print(wandb.__version__)"
Create a .env file with your credentials:
# .env
OPENAI_API_KEY=sk-your-openai-key
LANGSMITH_API_KEY=ls-your-langsmith-key
LANGSMITH_PROJECT=llm-monitor-tutorial
LANGSMITH_TRACING_V2=true
WANDB_API_KEY=your-wandb-api-key
WANDB_PROJECT=llm-monitor-tutorial
Edge case: If you are behind a corporate proxy, configure the proxy via environment variables (for example HTTPS_PROXY), and set LANGSMITH_TRACING_V2=false to switch tracing off until the connection works. LangSmith uses HTTPS on port 443, which most proxies allow, but W&B uses WebSockets for live metrics, so ensure your network permits WebSocket connections to api.wandb.ai.
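Before moving on, it can help to confirm that the credential and project variables from the .env file are actually visible to the process. The following is a minimal sketch using only the standard library; the helper name missing_env_vars is hypothetical and not part of LangSmith or W&B.

```python
import os

# Credential and project variable names taken from the .env file above.
REQUIRED_VARS = (
    "OPENAI_API_KEY",
    "LANGSMITH_API_KEY",
    "LANGSMITH_PROJECT",
    "WANDB_API_KEY",
    "WANDB_PROJECT",
)


def missing_env_vars(env=None):
    """Return the required variable names that are unset or blank.

    Hypothetical helper: `env` defaults to os.environ but accepts any
    mapping, which keeps the check easy to test.
    """
    env = os.environ if env is None else env
    return [name for name in REQUIRED_VARS if not str(env.get(name, "")).strip()]
```

One way to use it is to call load_dotenv() first and then fail fast at startup if missing_env_vars() returns a non-empty list, so a missing key surfaces before the first request rather than as a silent gap in the traces.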
Instrumenting the RAG Pipeline with LangSmith Tracing
LangSmith provides automatic tracing for LangChain applications. When you set the environment variable LANGSMITH_TRACING_V2=true, every LangChain call is automatically logged. However, for production systems, you need explicit control over what gets traced and how traces are organized.
Create a file called rag_pipeline.py that defines your RAG chain with explicit LangSmith instrumentation:
# rag_pipeline.py
import os

from dotenv import load_dotenv
from langchain_chroma import Chroma
from langchain_core.prompts import ChatPromptTemplate
from langchain_openai import ChatOpenAI, OpenAIEmbeddings
from langsmith import Client as LangSmithClient
from langsmith.run_helpers import trace, traceable

load_dotenv()

MODEL_NAME = "gpt-4o-mini"

# Initialize LangSmith client for manual tracing
langsmith_client = LangSmithClient(
    api_key=os.getenv("LANGSMITH_API_KEY"),
    api_url="https://api.smith.langchain.com"
)

# Initialize embedding model and vector store
embeddings = OpenAIEmbeddings(model="text-embedding-3-small")
vector_store = Chroma(
    collection_name="docs",
    embedding_function=embeddings,
    persist_directory="./chroma_db"
)
retriever = vector_store.as_retriever(search_kwargs={"k": 3})

# Define the RAG prompt template
prompt_template = ChatPromptTemplate.from_messages([
    ("system", "You are a helpful assistant. Answer the question based on the provided context."),
    ("human", "Context: {context}\n\nQuestion: {question}")
])

# Initialize the LLM
llm = ChatOpenAI(
    model=MODEL_NAME,
    temperature=0,
    max_tokens=1024
)


def format_docs(docs):
    """Format retrieved documents into a single context string."""
    return "\n\n".join(doc.page_content for doc in docs)


# Build the RAG chain with explicit tracing.
# @traceable creates the parent run; the `trace` context manager creates child runs.
@traceable(
    client=langsmith_client,
    project_name=os.getenv("LANGSMITH_PROJECT", "llm-monitor-tutorial"),
    name="rag_chain",
    run_type="chain"
)
def rag_chain(question: str, user_id: str = "anonymous") -> dict:
    """
    Execute the RAG pipeline with full LangSmith tracing.

    Args:
        question: The user's question
        user_id: Identifier for the user making the request

    Returns:
        dict containing the answer, retrieved documents, and metadata
    """
    # Step 1: Retrieve documents (traceable is a decorator, so the
    # context-manager form uses `trace` instead)
    with trace(
        name="retrieve_docs",
        run_type="retriever",
        inputs={"question": question}
    ) as retrieve_trace:
        docs = retriever.invoke(question)
        retrieve_trace.add_metadata({
            "num_docs": len(docs),
            "doc_sources": [doc.metadata.get("source", "unknown") for doc in docs]
        })
        retrieve_trace.end(outputs={"num_docs": len(docs)})

    # Step 2: Format context
    context = format_docs(docs)

    # Step 3: Generate response
    with trace(
        name="generate_response",
        run_type="llm",
        inputs={"question": question, "context": context}
    ) as llm_trace:
        messages = prompt_template.format_messages(context=context, question=question)
        response = llm.invoke(messages)
        # Read the token usage once and reuse it below
        token_usage = response.response_metadata.get("token_usage", {})
        llm_trace.add_metadata({
            "model": MODEL_NAME,
            "temperature": 0,
            "max_tokens": 1024,
            "input_tokens": token_usage.get("prompt_tokens", 0),
            "output_tokens": token_usage.get("completion_tokens", 0)
        })
        llm_trace.end(outputs={"answer": response.content})

    return {
        "answer": response.content,
        "source_documents": [doc.page_content for doc in docs],
        "metadata": {
            "user_id": user_id,
            "model": MODEL_NAME,
            "total_tokens": token_usage.get("total_tokens", 0)
        }
    }
Key design decisions:
- Explicit @traceable decorators: While LangSmith can auto-trace, explicit decorators give you control over run names and metadata. This is critical when you need to filter traces by operation type in the LangSmith UI.
- Nested traces: The rag_chain function creates a parent trace, and the retrieval and generation steps create child traces. This preserves the hierarchical relationship, making it easy to see which documents were used for each response.
- Metadata injection: Adding metadata like num_docs, doc_sources, and token counts allows you to create dashboards in LangSmith that show average retrieval size or token consumption per user.
Edge case: If the vector store is empty or the retriever returns zero documents, the format_docs function returns an empty string. The LLM will still generate a response, but it will lack context. Your monitoring should flag this: the retrieve_trace metadata records num_docs, so a value of 0 marks an empty retrieval that you can filter and alert on.
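One way to make empty retrievals easier to alert on is to record an explicit boolean next to the count. The sketch below is an illustration under assumptions: retrieval_metadata and the empty_retrieval key are hypothetical names, and FakeDoc is only a stand-in for a LangChain document so the example runs without any external service.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, List


@dataclass
class FakeDoc:
    """Stand-in for a retrieved document; only used to keep the sketch self-contained."""
    page_content: str
    metadata: Dict[str, Any] = field(default_factory=dict)


def retrieval_metadata(docs: List[Any]) -> Dict[str, Any]:
    """Build trace metadata for a retrieval step, with an explicit empty flag."""
    return {
        "num_docs": len(docs),
        # Hypothetical key: a boolean is simpler to filter on than num_docs == 0.
        "empty_retrieval": len(docs) == 0,
        "doc_sources": [doc.metadata.get("source", "unknown") for doc in docs],
    }
```

The returned dictionary has the same shape as the metadata attached to the retrieval step, so it could be passed to add_metadata in place of the inline dictionary.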
Logging Metrics and Artifacts to Weights & Biases
LangSmith handles per-request tracing, but you need W&B for aggregate metrics over time. Create a file called wandb_monitor.py that logs performance metrics and evaluation results:
# wandb_monitor.py
import json
import os
from datetime import datetime
from typing import Dict, List, Optional

import wandb
from dotenv import load_dotenv

load_dotenv()


class WBDMonitor:
    """
    Production-grade monitor for logging LLM metrics to Weights & Biases.

    Handles:
    - Real-time metric logging
    - Artifact versioning for prompts and evaluation results
    - Failure recovery with local buffering
    """

    def __init__(self, project_name: Optional[str] = None, run_name: Optional[str] = None):
        self.project = project_name or os.getenv("WANDB_PROJECT", "llm-monitor-tutorial")
        self.run_name = run_name or f"run_{datetime.now().strftime('%Y%m%d_%H%M%S')}"
        self.run = None
        self.buffer_file = "./wandb_buffer.json"
        # Buffer for failed log attempts; reload entries left on disk by a crashed process
        self.local_buffer = self._load_buffer()

    def start_run(self, config: Optional[Dict] = None):
        """Initialize a W&B run with optional configuration."""
        self.run = wandb.init(
            project=self.project,
            name=self.run_name,
            config=config or {
                "model": "gpt-4o-mini",
                "embedding_model": "text-embedding-3-small",
                "retrieval_k": 3,
                "temperature": 0
            },
            # Resume if run was interrupted
            resume="allow",
            # Save code to W&B for reproducibility
            save_code=True
        )
        # Flush any buffered logs from previous failed attempts
        self._flush_buffer()

    def log_request_metrics(self, metrics: Dict):
        """
        Log per-request metrics to W&B.

        Args:
            metrics: Dictionary containing latency, tokens, etc.
        """
        try:
            if self.run:
                self.run.log(metrics)
            else:
                self._buffer_log(metrics)
        except Exception as e:
            print(f"W&B log failed, buffering: {e}")
            self._buffer_log(metrics)

    def log_evaluation_results(self, eval_results: List[Dict], dataset_name: str):
        """
        Log evaluation results as a W&B Table artifact.

        Args:
            eval_results: List of dictionaries with question, answer, score
            dataset_name: Name for the evaluation dataset
        """
        if not self.run:
            self.start_run()

        # Create a W&B Table for structured evaluation data
        table = wandb.Table(
            columns=["question", "answer", "expected_answer", "score", "latency_ms"]
        )
        for result in eval_results:
            table.add_data(
                result.get("question", ""),
                result.get("answer", ""),
                result.get("expected_answer", ""),
                result.get("score", 0.0),
                result.get("latency_ms", 0)
            )

        # Log the table as an artifact
        artifact = wandb.Artifact(
            name=f"eval_{dataset_name}_{datetime.now().strftime('%Y%m%d')}",
            type="evaluation_results",
            description=f"Evaluation results for {dataset_name} on {datetime.now().strftime('%Y-%m-%d')}"
        )
        artifact.add(table, "results")
        self.run.log_artifact(artifact)

        # Also log aggregate metrics
        avg_score = sum(r.get("score", 0) for r in eval_results) / len(eval_results) if eval_results else 0
        avg_latency = sum(r.get("latency_ms", 0) for r in eval_results) / len(eval_results) if eval_results else 0
        self.run.log({
            f"eval/{dataset_name}/avg_score": avg_score,
            f"eval/{dataset_name}/avg_latency_ms": avg_latency,
            f"eval/{dataset_name}/num_samples": len(eval_results)
        })

    def log_prompt_version(self, prompt_text: str, version: str, metadata: Optional[Dict] = None):
        """
        Version a prompt template as a W&B artifact.

        Args:
            prompt_text: The prompt template string
            version: Semantic version string (e.g., "1.2.0")
            metadata: Additional metadata about the prompt
        """
        if not self.run:
            self.start_run()

        artifact = wandb.Artifact(
            name=f"prompt_template_v{version}",
            type="prompt",
            description=f"Prompt template version {version}"
        )
        # Save prompt as a text file in the artifact
        with artifact.new_file("prompt.txt", "w") as f:
            f.write(prompt_text)
        if metadata:
            with artifact.new_file("metadata.json", "w") as f:
                json.dump(metadata, f)
        self.run.log_artifact(artifact)

    def _load_buffer(self) -> List[Dict]:
        """Load metrics persisted by a previous process, if the buffer file exists."""
        if not os.path.exists(self.buffer_file):
            return []
        try:
            with open(self.buffer_file) as f:
                return json.load(f)
        except (OSError, json.JSONDecodeError):
            # An unreadable buffer file should not prevent startup
            return []

    def _buffer_log(self, metrics: Dict):
        """Buffer metrics locally when W&B is unavailable."""
        self.local_buffer.append({
            "timestamp": datetime.now().isoformat(),
            "metrics": metrics
        })
        # Persist buffer to disk for crash recovery
        with open(self.buffer_file, "w") as f:
            json.dump(self.local_buffer, f)

    def _flush_buffer(self):
        """Flush buffered logs to W&B."""
        if not self.local_buffer or not self.run:
            return
        try:
            for entry in self.local_buffer:
                self.run.log(entry["metrics"])
            self.local_buffer = []
            # Clear the buffer file
            if os.path.exists(self.buffer_file):
                os.remove(self.buffer_file)
        except Exception as e:
            print(f"Failed to flush buffer: {e}")

    def finish_run(self):
        """Properly close the W&B run."""
        if self.run:
            self._flush_buffer()
            self.run.finish()
Critical design patterns:
- Local buffering with crash recovery: W&B API calls can fail due to network issues or rate limits. The _buffer_log method saves metrics to a local JSON file. On the next successful start_run, _flush_buffer replays those logs. This prevents data loss during transient failures.
- Artifact versioning: Prompts and evaluation results are stored as versioned artifacts. This allows you to roll back to a previous prompt version if a deployment causes quality degradation. W&B artifacts support aliases like "production" and "staging" for easy reference.
- Structured evaluation tables: Instead of logging raw numbers, use wandb.Table to store question-answer pairs with scores. This enables interactive querying in the W&B UI, where you can filter by score range to find failure cases.
Edge case: W&B rate limits free accounts to 100 requests per minute. If you are logging per-request metrics for a high-traffic application, batch metrics every 10-100 requests instead of logging each one individually. As written, log_request_metrics sends one record per call; it could be extended to accept a list of metrics and compute aggregates before sending.
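The batching idea can be sketched as a small wrapper that sits in front of the logging call. This is an illustration rather than part of the W&B SDK: MetricBatcher, its sink parameter, and the "/mean" and "batch/size" key names are all hypothetical choices, and the default batch size of 50 is simply a value inside the 10-100 range mentioned above.

```python
from typing import Callable, Dict, List


class MetricBatcher:
    """Collect numeric per-request metrics and emit one averaged record per batch."""

    def __init__(self, sink: Callable[[Dict[str, float]], None], batch_size: int = 50):
        if batch_size < 1:
            raise ValueError("batch_size must be at least 1")
        self.sink = sink              # Called once per batch with the summary dict
        self.batch_size = batch_size
        self._pending: List[Dict[str, float]] = []

    def add(self, metrics: Dict[str, object]) -> None:
        """Queue one request's metrics and flush when the batch is full."""
        # Keep numeric values only; strings such as user IDs cannot be averaged.
        numeric = {
            key: float(value)
            for key, value in metrics.items()
            if isinstance(value, (int, float)) and not isinstance(value, bool)
        }
        self._pending.append(numeric)
        if len(self._pending) >= self.batch_size:
            self.flush()

    def flush(self) -> None:
        """Send the mean of each metric over the pending records, then reset."""
        if not self._pending:
            return
        totals: Dict[str, float] = {}
        counts: Dict[str, int] = {}
        for record in self._pending:
            for key, value in record.items():
                totals[key] = totals.get(key, 0.0) + value
                counts[key] = counts.get(key, 0) + 1
        summary = {f"{key}/mean": totals[key] / counts[key] for key in totals}
        summary["batch/size"] = float(len(self._pending))
        self._pending = []
        self.sink(summary)
```

A plausible wiring is MetricBatcher(monitor.log_request_metrics, batch_size=50), with flush() called on shutdown so a partial batch is not lost. Averaging hides tail latency, so you may prefer to also track a maximum or percentile per batch.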
Building the FastAPI Application with Integrated Monitoring
Now combine LangSmith tracing and W&B monitoring into a FastAPI application. Create app.py:
# app.py
import os
import time
import uuid
from contextlib import asynccontextmanager
from datetime import datetime
from typing import Optional

from dotenv import load_dotenv
from fastapi import FastAPI, HTTPException, Request
from fastapi.middleware.cors import CORSMiddleware
from pydantic import BaseModel, Field

# Import our custom modules
from rag_pipeline import rag_chain
from wandb_monitor import WBDMonitor

load_dotenv()

# Initialize W&B monitor
wandb_monitor = WBDMonitor(
    project_name=os.getenv("WANDB_PROJECT", "llm-monitor-tutorial")
)


# Application lifecycle management
@asynccontextmanager
async def lifespan(app: FastAPI):
    """Start W&B run on application startup and finish on shutdown."""
    wandb_monitor.start_run(config={
        "model": "gpt-4o-mini",
        "embedding_model": "text-embedding-3-small",
        "retrieval_k": 3,
        "temperature": 0,
        "deployment_date": datetime.now().isoformat()
    })
    yield
    wandb_monitor.finish_run()


app = FastAPI(
    title="LLM Monitoring Demo",
    version="1.0.0",
    lifespan=lifespan
)

# CORS for production deployments
app.add_middleware(
    CORSMiddleware,
    allow_origins=["*"],  # Restrict in production
    allow_credentials=True,
    allow_methods=["*"],
    allow_headers=["*"],
)


# Request/Response models
class QueryRequest(BaseModel):
    question: str = Field(..., min_length=1, max_length=2000, description="User question")
    user_id: Optional[str] = Field(default="anonymous", max_length=100)
    session_id: Optional[str] = Field(default=None)


class QueryResponse(BaseModel):
    answer: str
    source_documents: list[str]
    metadata: dict
    request_id: str
    latency_ms: float


@app.post("/query", response_model=QueryResponse)
async def query_endpoint(request: QueryRequest, http_request: Request):
    """
    Main query endpoint with integrated monitoring.

    Logs to:
    - LangSmith: Full trace of the RAG pipeline
    - W&B: Latency, token usage, and error rates
    """
    request_id = str(uuid.uuid4())
    session_id = request.session_id or request_id
    start_time = time.time()
    try:
        # Execute the RAG pipeline (traced by LangSmith)
        result = rag_chain(
            question=request.question,
            user_id=request.user_id,
            # Attach the IDs to the LangSmith trace so it can be matched to W&B logs
            langsmith_extra={
                "metadata": {"request_id": request_id, "session_id": session_id}
            }
        )
        latency_ms = (time.time() - start_time) * 1000

        # Log metrics to W&B
        wandb_monitor.log_request_metrics({
            "request/latency_ms": latency_ms,
            "request/total_tokens": result["metadata"].get("total_tokens", 0),
            "request/num_source_docs": len(result["source_documents"]),
            "request/request_id": request_id,
            "request/user_id": request.user_id,
            "request/session_id": session_id,
            "request/timestamp": datetime.now().isoformat()
        })

        return QueryResponse(
            answer=result["answer"],
            source_documents=result["source_documents"],
            metadata=result["metadata"],
            request_id=request_id,
            latency_ms=latency_ms
        )
    except Exception as e:
        latency_ms = (time.time() - start_time) * 1000

        # Log error to W&B
        wandb_monitor.log_request_metrics({
            "request/latency_ms": latency_ms,
            "request/error": str(e)[:200],  # Truncate long errors
            "request/error_type": type(e).__name__,
            "request/request_id": request_id,
            "request/user_id": request.user_id,
            "request/session_id": session_id
        })
        raise HTTPException(
            status_code=500,
            detail={
                "error": "Internal server error",
                "request_id": request_id,
                "message": str(e)
            }
        ) from e


@app.get("/health")
async def health_check():
    """Health check endpoint for load balancers."""
    return {
        "status": "healthy",
        "timestamp": datetime.now().isoformat(),
        "wandb_run": wandb_monitor.run.id if wandb_monitor.run else "not_started"
    }


@app.post("/evaluate")
async def evaluate_endpoint():
    """
    Run evaluation on a test dataset and log results to W&B.

    This endpoint is typically called by CI/CD pipelines after deployment.
    """
    # Example evaluation dataset
    eval_data = [
        {"question": "What is RAG?", "expected_answer": "Retrieval-Augmented Generation"},
        {"question": "How does LangSmith work?", "expected_answer": "It traces LLM calls"},
    ]

    eval_results = []
    for item in eval_data:
        start = time.time()
        try:
            result = rag_chain(question=item["question"], user_id="eval")
            latency = (time.time() - start) * 1000
            # Simple substring-match scoring (replace with semantic similarity in production)
            score = 1.0 if item["expected_answer"].lower() in result["answer"].lower() else 0.0
            eval_results.append({
                "question": item["question"],
                "answer": result["answer"],
                "expected_answer": item["expected_answer"],
                "score": score,
                "latency_ms": latency
            })
        except Exception as e:
            eval_results.append({
                "question": item["question"],
                "answer": f"ERROR: {str(e)}",
                "expected_answer": item["expected_answer"],
                "score": 0.0,
                "latency_ms": (time.time() - start) * 1000
            })

    # Log evaluation results to W&B
    wandb_monitor.log_evaluation_results(eval_results, "production_test")

    return {
        "status": "completed",
        "num_samples": len(eval_results),
        "avg_score": sum(r["score"] for r in eval_results) / len(eval_results)
    }


if __name__ == "__main__":
    import uvicorn

    uvicorn.run(
        "app:app",
        host="0.0.0.0",
        port=8000,
        reload=True,  # Disable in production
        log_level="info"
    )
Production considerations:
- Request ID propagation: Every request gets a UUID that appears in both LangSmith traces and W&B logs. This allows you to cross-reference a specific user request across both platforms.
- Error handling with monitoring: When an exception occurs, the error is logged to W&B with the error type and truncated message. This creates a time-series of error rates that you can alert on.
- Session tracking: The session_id field allows you to group multiple requests from the same user session. In LangSmith, you can filter traces by session to debug multi-turn conversations.
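The error fields logged above are strings, so an error rate still has to be computed before it can be charted or alerted on. One possible approach, sketched here under assumptions, is a sliding window over recent request outcomes; RollingErrorRate and its default window of 100 requests are hypothetical, not part of either platform.

```python
from collections import deque
from typing import Deque


class RollingErrorRate:
    """Track the fraction of failed requests over the most recent `window` requests."""

    def __init__(self, window: int = 100):
        if window < 1:
            raise ValueError("window must be at least 1")
        # deque with maxlen drops the oldest outcome automatically
        self._outcomes: Deque[bool] = deque(maxlen=window)

    def record(self, is_error: bool) -> float:
        """Record one request outcome and return the updated error rate."""
        self._outcomes.append(is_error)
        return self.rate

    @property
    def rate(self) -> float:
        """Error rate in [0, 1]; 0.0 before any request has been recorded."""
        if not self._outcomes:
            return 0.0
        return sum(self._outcomes) / len(self._outcomes)
```

The value returned by record() could be logged as a numeric metric alongside latency, for example under a key such as "request/error_rate" (an illustrative name), which gives a plottable series to set thresholds on.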
Edge case: If the OpenAI API returns a 429 rate limit error, the rag_chain function will raise an exception. The error handler logs this to W&B, but the user receives a 500 error. In production, implement retry logic with exponential backoff in the rag_chain function, and log the number of retries as metadata.
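The retry logic described above can be sketched as a generic helper. This is a minimal illustration, not the tutorial's implementation: call_with_backoff and its parameters are hypothetical, the sleep argument exists only so the delay can be swapped out in tests, and the exception types to retry are left to the caller (with the OpenAI Python SDK, openai.RateLimitError is the likely candidate).

```python
import random
import time
from typing import Callable, Tuple, Type, TypeVar

T = TypeVar("T")


def call_with_backoff(
    func: Callable[[], T],
    retryable: Tuple[Type[BaseException], ...],
    max_retries: int = 3,
    base_delay: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> Tuple[T, int]:
    """Call func, retrying on the given exception types with exponential backoff.

    Returns (result, retries_used) so the retry count can be logged as metadata.
    """
    for attempt in range(max_retries + 1):
        try:
            return func(), attempt
        except retryable:
            if attempt == max_retries:
                raise  # Out of retries: let the caller's error handler log it
            # Delay doubles on each attempt; jitter spreads out concurrent retries
            delay = base_delay * (2 ** attempt) + random.uniform(0, base_delay / 2)
            sleep(delay)
    raise RuntimeError("unreachable")  # The loop always returns or raises
```

Inside rag_chain, the LLM call could be wrapped as call_with_backoff(lambda: llm.invoke(messages), (openai.RateLimitError,)), with the returned retry count added to the trace metadata. ChatOpenAI also exposes a max_retries setting, so check whether the built-in retries already cover your case before layering another loop on top.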
Running and Verifying the Monitoring Setup
Start the application and verify that both monitoring systems are receiving data:
# Start the FastAPI server
python app.py
# In another terminal, send test requests
curl -X POST http://localhost:8000/query \
-H "Content-Type: application/json" \
-d '{"question": "What is LangSmith used for?", "user_id": "test_user_1"}'
# Run evaluation
curl -X POST http://localhost:8000/evaluate
# Check health
curl http://localhost:8000/health
Verification steps:
- LangSmith UI: Go to smith.langchain.com and select your project. You should see traces for each request, with nested spans for retrieval and generation. Click on a trace to see the exact prompt sent to OpenAI and the response received.
- W&B UI: Go to wandb.ai and select your project. You should see:
  - A run named run_YYYYMMDD_HHMMSS with latency and token metrics
  - An artifact named eval_production_test_YYYYMMDD containing the evaluation table
  - Time-series charts showing average latency and error rate over time
Edge case: If you see no traces in LangSmith, check that LANGSMITH_TRACING_V2=true is set in your environment. The @traceable decorator requires this environment variable to be set, even when using the explicit client. If you are using a self-hosted LangSmith instance, set LANGSMITH_ENDPOINT to your server URL.
What's Next
You now have a production-grade monitoring system that combines LangSmith's per-request tracing with W&B's aggregate observability. To extend this setup:
- Add alerting: Configure W&B to send Slack or email alerts when latency exceeds 5 seconds or error rate exceeds 1%. Use W&B's "Alerts" feature in the project settings.
- Implement A/B testing: Run two prompt versions simultaneously, log the version ID in both LangSmith metadata and W&B metrics, and compare performance using W&B's parallel coordinates charts.
- Integrate with CI/CD: Add the /evaluate endpoint to your deployment pipeline. Before promoting a new model version to production, run the evaluation suite and check that the average score does not decrease by more than 5%.
- Add user feedback: Extend the QueryResponse model to include a feedback_url that users can click to rate responses. Log this feedback to W&B as a separate metric, enabling you to correlate user satisfaction with specific prompt versions.
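The CI/CD check from the list above can be expressed as a small gate function. This sketch assumes the 5% threshold is a relative drop against a stored baseline score; passes_regression_gate and its parameter names are hypothetical, and where the baseline comes from (a previous evaluation run, a pinned artifact) is left open.

```python
def passes_regression_gate(
    baseline_score: float,
    candidate_score: float,
    max_relative_drop: float = 0.05,
) -> bool:
    """Return True if the candidate score has not dropped by more than the allowed fraction.

    Assumption: the drop is measured relative to the baseline, so a baseline
    of 0.80 with max_relative_drop=0.05 tolerates scores down to about 0.76.
    """
    if baseline_score <= 0:
        # No meaningful baseline to compare against: accept any non-negative score
        return candidate_score >= 0
    drop = (baseline_score - candidate_score) / baseline_score
    return drop <= max_relative_drop
```

In a pipeline, the candidate score would be the avg_score value returned by the /evaluate endpoint, and a False result would fail the deployment step.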
The combination of LangSmith and W&B gives you both the microscope to debug individual failures and the telescope to see system-wide trends. In production LLM applications, you need both perspectives to maintain quality at scale.
References