How to Monitor LLM Apps with LangSmith and Weights & Biases

Building production LLM applications is fundamentally different from prototyping in a notebook. When you move from a single prompt test to a system handling thousands of requests, you need observability—not just for debugging, but for continuous improvement. This tutorial walks through integrating two complementary monitoring platforms: LangSmith for trace-level debugging and Weights & Biases (W&B) for aggregate performance tracking.

We'll build a real-time monitoring pipeline for a RAG (Retrieval-Augmented Generation) application, covering trace collection, latency analysis, cost tracking, and automated evaluation. By the end, you'll have a production-ready monitoring setup that catches regressions before they reach users.

Why Dual Monitoring Matters in Production LLM Systems

A single monitoring tool cannot solve all LLM observability problems. LangSmith excels at per-request tracing—showing you exactly which prompt, retrieval step, or model call caused a failure. W&B provides the macro view: latency distributions over time, cost trends across model versions, and experiment tracking for prompt iterations.

Consider a common production scenario: your RAG pipeline suddenly starts returning irrelevant answers. LangSmith traces show the retriever is returning empty chunks. W&B dashboards reveal this correlates with a recent embedding model update. Without both views, you'd waste hours guessing whether the issue is prompt drift, model degradation, or data pipeline changes.

According to LangChain's documentation [8], LangSmith supports tracing for over 40 LLM providers and frameworks as of early 2026. W&B's LLM monitoring features, documented in their Weave library, provide automatic token counting and latency tracking for OpenAI, Anthropic, and open-source models.

Prerequisites and Environment Setup

Before writing code, set up your environment. You'll need Python 3.10+ and API keys for both platforms.

# Create a virtual environment
python -m venv llm-monitor-env
source llm-monitor-env/bin/activate

# Install core dependencies
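# langchain-community provides the Weaviate vector store used below;
# that integration relies on the v3 weaviate.Client API, so pin weaviate-client below 4
pip install langchain langchain-community langchain-openai langsmith wandb "weaviate-client<4" fastapi uvicorn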

# For this tutorial, we'll use OpenAI as our LLM provider
# Install the OpenAI SDK
pip install openai

# Set environment variables
export OPENAI_API_KEY="your-openai-key"
export LANGCHAIN_API_KEY="your-langsmith-key"
export LANGCHAIN_TRACING_V2="true"
export LANGCHAIN_PROJECT="llm-monitoring-tutorial"
export WANDB_API_KEY="your-wandb-key"

The LANGCHAIN_TRACING_V2=true environment variable is critical—it enables automatic trace collection for all LangChain runs. Without it, LangSmith won't capture any data.
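Because a missing variable fails silently (the app runs, but no traces appear), it can help to check the environment at startup. The following is a minimal sketch, not part of either SDK: the helper names `missing_env_vars` and `tracing_enabled` are hypothetical, and the variable list simply mirrors the export commands above.

```python
import os
from typing import Iterable, Mapping

# Variable names taken from the export commands in this tutorial.
REQUIRED_VARS = (
    "OPENAI_API_KEY",
    "LANGCHAIN_API_KEY",
    "LANGCHAIN_TRACING_V2",
    "WANDB_API_KEY",
)


def missing_env_vars(
    required: Iterable[str] = REQUIRED_VARS,
    env: Mapping[str, str] = os.environ,
) -> list[str]:
    """Return the names of required variables that are unset or blank."""
    return [name for name in required if not env.get(name, "").strip()]


def tracing_enabled(env: Mapping[str, str] = os.environ) -> bool:
    """True when LANGCHAIN_TRACING_V2 is set to 'true' (case-insensitive)."""
    return env.get("LANGCHAIN_TRACING_V2", "").strip().lower() == "true"
```

Calling `missing_env_vars()` before constructing the pipeline, and refusing to start when the list is non-empty or `tracing_enabled()` is false, turns a silent misconfiguration into an immediate error.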

Building the RAG Pipeline with Instrumentation

Let's create a RAG application that answers questions about technical documentation. We'll instrument it from the ground up with both LangSmith and W&B monitoring.

# rag_pipeline.py
import os
import time
from typing import List, Dict, Any
from dataclasses import dataclass, field

from langchain_openai import OpenAIEmbeddings, ChatOpenAI
from langchain_community.vectorstores import Weaviate
from langchain_core.documents import Document
from langchain_core.prompts import ChatPromptTemplate
from langchain_core.runnables import RunnablePassthrough, RunnableLambda
from langchain_core.output_parsers import StrOutputParser
import weaviate
import wandb

@dataclass
class MonitorMetrics:
    """Container for per-request monitoring data."""
    latency_ms: float = 0.0
    tokens_used: int = 0
    retrieval_count: int = 0
    error: str | None = None
    model: str = "gpt-4"
    timestamp: float = field(default_factory=time.time)

class MonitoredRAGPipeline:
    """
    Production RAG pipeline with dual monitoring.
    Tracks every request through LangSmith traces and W&B metrics.
    """

    def __init__(self, weaviate_url: str = "http://localhost:8080"):
        # Initialize Weaviate vector store
        self.client = weaviate.Client(weaviate_url)
        self.embeddings = OpenAIEmbeddings(model="text-embedding-3-small")
        self.vectorstore = Weaviate(
            client=self.client,
            index_name="Documentation",
            text_key="content",
            embedding=self.embeddings,
            attributes=["source", "title"]
        )

        # Initialize LLM with monitoring
        self.llm = ChatOpenAI(
            model="gpt-4",
            temperature=0.1,
            # LangSmith automatically captures these parameters
            model_kwargs={"user": "monitoring-tutorial"}
        )

        # Create the RAG prompt
        self.prompt = ChatPromptTemplate.from_messages([
            ("system", """You are a technical documentation assistant. 
            Answer questions based on the provided context. 
            If the context doesn't contain the answer, say so clearly.

            Context: {context}"""),
            ("human", "{question}")
        ])

        # Initialize W&B run for this session
        self.wandb_run = wandb.init(
            project="llm-monitoring-tutorial",
            config={
                "model": "gpt-4",
                "embedding_model": "text-embedding-3-small",
                "vector_store": "weaviate",
                "chunk_size": 1000
            }
        )

        # Build the chain with monitoring hooks
        self.chain = self._build_monitored_chain()

    def _build_monitored_chain(self):
        """
        Construct the RAG chain with monitoring callbacks.
        Each step is wrapped to capture metrics.
        """
        def retrieve_docs(inputs: Dict[str, Any]) -> Dict[str, Any]:
            """Retrieve relevant documents with timing."""
            start = time.perf_counter()
            try:
                # The chain passes a dict, so read the question from it.
                # Wrapping this function in RunnableLambda makes it a traced step.
                docs = self.vectorstore.similarity_search(
                    inputs["question"],
                    k=4
                )
                latency = (time.perf_counter() - start) * 1000

                # Log retrieval metrics to W&B
                wandb.log({
                    "retrieval_latency_ms": latency,
                    "retrieved_docs_count": len(docs)
                })

                return {
                    "context": "\n\n".join([d.page_content for d in docs]),
                    "sources": [d.metadata.get("source", "unknown") for d in docs],
                    "retrieval_latency_ms": latency
                }
            except Exception as e:
                wandb.log({"retrieval_error": str(e)})
                raise

        def format_response(response: str, metadata: Dict) -> Dict:
            """Format the final response with metadata."""
            return {
                "answer": response,
                "sources": metadata.get("sources", []),
                "retrieval_latency_ms": metadata.get("retrieval_latency_ms", 0)
            }

        # Build the chain using LangChain Expression Language.
        # Each assign() adds a key to the running dict, so the retrieval
        # metadata is still available after the LLM call.
        chain = (
            RunnablePassthrough.assign(
                retrieved=RunnableLambda(retrieve_docs)
            )
            | RunnablePassthrough.assign(
                context=lambda x: x["retrieved"]["context"]
            )
            | RunnablePassthrough.assign(
                answer=self.prompt | self.llm | StrOutputParser()
            )
            | RunnableLambda(
                lambda x: format_response(x["answer"], x["retrieved"])
            )
        )

        return chain

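    def query(self, question: str, config: Dict[str, Any] | None = None) -> Dict[str, Any]:
        """
        Execute a query with full monitoring.
        Returns both the answer and monitoring metrics.
        `config` may carry extra LangChain run settings, such as tags.
        """
        metrics = MonitorMetrics()
        start = time.perf_counter()

        # Merge caller-supplied tags with the pipeline defaults
        extra_tags = (config or {}).get("tags", [])
        run_config = {
            **(config or {}),
            "run_name": f"query_{int(time.time())}",
            "tags": ["production", "rag", *extra_tags]
        }

        try:
            # LangSmith automatically creates a trace for this run
            result = self.chain.invoke(
                {"question": question},
                config=run_config
            )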

            # Calculate total latency
            metrics.latency_ms = (time.perf_counter() - start) * 1000

            # Log aggregate metrics to W&B
            wandb.log({
                "total_latency_ms": metrics.latency_ms,
                "retrieval_latency_ms": result.get("retrieval_latency_ms", 0),
                "query_timestamp": metrics.timestamp
            })

            return {
                "answer": result["answer"],
                "sources": result["sources"],
                "metrics": {
                    "total_latency_ms": metrics.latency_ms,
                    "retrieval_latency_ms": result["retrieval_latency_ms"]
                }
            }

        except Exception as e:
            metrics.error = str(e)
            wandb.log({"query_error": str(e), "error_timestamp": time.time()})
            raise

    def close(self):
        """Clean up monitoring connections."""
        self.wandb_run.finish()

This pipeline captures three critical monitoring dimensions:

Per-request traces in LangSmith show the exact sequence of retrieval, prompt construction, and LLM generation for every query.
Real-time metrics in W&B track latency distributions and error rates as they happen.
Session-level configuration in W&B records the exact model versions and parameters used.
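The MonitorMetrics container declares a tokens_used field, but the pipeline above never fills it in, so cost is not yet tracked. As a rough sketch of how a per-request cost estimate could be derived once token counts are available (for example from the model response's usage metadata), consider the helper below. Everything in it is an assumption for illustration: the `ModelPrice` type, the `estimate_cost_usd` and `cost_metrics` names, the `cost/` metric keys, and especially the prices, which are placeholders rather than real rates.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ModelPrice:
    """USD price per 1,000 tokens. Values used below are placeholders."""
    input_per_1k: float
    output_per_1k: float


# Hypothetical price table; replace with your provider's current rates.
PRICES = {"example-model": ModelPrice(input_per_1k=0.01, output_per_1k=0.03)}


def estimate_cost_usd(
    model: str,
    prompt_tokens: int,
    completion_tokens: int,
    prices: dict[str, ModelPrice] = PRICES,
) -> float:
    """Estimate the cost of one request from its token counts."""
    if prompt_tokens < 0 or completion_tokens < 0:
        raise ValueError("token counts must be non-negative")
    price = prices[model]  # a KeyError for unknown models is intentional
    input_cost = (prompt_tokens / 1000) * price.input_per_1k
    output_cost = (completion_tokens / 1000) * price.output_per_1k
    return input_cost + output_cost


def cost_metrics(model: str, prompt_tokens: int, completion_tokens: int) -> dict:
    """Build a flat metrics dict suitable for passing to a metrics logger."""
    return {
        "cost/prompt_tokens": prompt_tokens,
        "cost/completion_tokens": completion_tokens,
        "cost/total_tokens": prompt_tokens + completion_tokens,
        "cost/estimated_usd": estimate_cost_usd(model, prompt_tokens, completion_tokens),
    }
```

The returned dict could be passed to wandb.log alongside the latency metrics, which would let a dashboard chart cost per query next to latency.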

Deploying the Monitoring API with FastAPI

To make this production-ready, wrap the pipeline in a FastAPI server with health checks and proper error handling.

# api_server.py
from fastapi import FastAPI, HTTPException
from pydantic import BaseModel, Field
from contextlib import asynccontextmanager
import logging
import time
from rag_pipeline import MonitoredRAGPipeline

# Configure structured logging
logging.basicConfig(
    level=logging.INFO,
    format='%(asctime)s - %(name)s - %(levelname)s - %(message)s'
)
logger = logging.getLogger(__name__)

class QueryRequest(BaseModel):
    question: str = Field(..., min_length=1, max_length=5000)
    user_id: str = Field(default="anonymous", max_length=100)

class QueryResponse(BaseModel):
    answer: str
    sources: list[str]
    metrics: dict

# Global pipeline instance
pipeline: MonitoredRAGPipeline = None

@asynccontextmanager
async def lifespan(app: FastAPI):
    """Initialize and cleanup the RAG pipeline."""
    global pipeline
    logger.info("Initializing RAG pipeline with monitoring...")
    pipeline = MonitoredRAGPipeline()
    yield
    logger.info("Shutting down monitoring connections...")
    pipeline.close()

app = FastAPI(
    title="Monitored RAG API",
    version="1.0.0",
    lifespan=lifespan
)

@app.get("/health")
async def health_check():
    """Health endpoint for monitoring uptime."""
    return {
        "status": "healthy",
        "pipeline_initialized": pipeline is not None,
        "timestamp": time.time()
    }

@app.post("/query", response_model=QueryResponse)
async def query_endpoint(request: QueryRequest):
    """
    Process a RAG query with full monitoring.
    LangSmith captures the trace automatically.
    W&B logs aggregate metrics.
    """
    if not pipeline:
        raise HTTPException(status_code=503, detail="Pipeline not initialized")

    try:
        # Add user_id as a LangSmith tag for user-level monitoring
        result = pipeline.query(
            request.question,
            config={"tags": [f"user:{request.user_id}"]}
        )

        logger.info(
            f"Query processed - latency: {result['metrics']['total_latency_ms']:.2f}ms"
        )

        return QueryResponse(**result)

    except Exception as e:
        logger.error(f"Query failed: {str(e)}")
        raise HTTPException(status_code=500, detail=str(e))

if __name__ == "__main__":
    import uvicorn
    uvicorn.run(
        "api_server:app",
        host="0.0.0.0",
        port=8000,
        log_level="info"
    )

Start the server with:

uvicorn api_server:app --host 0.0.0.0 --port 8000 --reload

Setting Up Automated Evaluation and Alerting

Monitoring without evaluation is just data collection. Let's add automated quality checks that trigger alerts when metrics degrade.

# evaluation_monitor.py
import asyncio
import time
from datetime import datetime, timedelta
from typing import List, Dict
import wandb
from langsmith import Client as LangSmithClient

class QualityMonitor:
    """
    Automated quality monitoring that checks for regressions
    and sends alerts when metrics fall below thresholds.
    """

    def __init__(self, 
                 latency_threshold_ms: float = 5000,
                 error_rate_threshold: float = 0.05,
                 evaluation_interval_minutes: int = 15):

        self.latency_threshold = latency_threshold_ms
        self.error_rate_threshold = error_rate_threshold
        self.interval = evaluation_interval_minutes

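        self.langsmith_client = LangSmithClient()
        self.wandb_api = wandb.Api()

        # wandb.log() requires an active run, so start one for the monitor
        self.wandb_run = wandb.init(
            project="llm-monitoring-tutorial",
            job_type="monitoring"
        )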

        # Load reference metrics from last good deployment
        self.reference_metrics = self._load_reference_metrics()

    def _load_reference_metrics(self) -> Dict:
        """Load baseline metrics from W&B for comparison."""
        try:
            # Get the last successful run's metrics
            runs = self.wandb_api.runs(
                "llm-monitoring-tutorial",
                {"config.deployment": "production"}
            )
            if runs:
                last_run = runs[0]
                return {
                    "avg_latency_ms": last_run.summary.get("avg_latency_ms", 1000),
                    "p95_latency_ms": last_run.summary.get("p95_latency_ms", 3000),
                    "error_rate": last_run.summary.get("error_rate", 0.01)
                }
        except Exception as e:
            print(f"Could not load reference metrics: {e}")

        return {
            "avg_latency_ms": 1000,
            "p95_latency_ms": 3000,
            "error_rate": 0.01
        }

    async def check_current_metrics(self):
        """
        Fetch recent metrics from LangSmith and W&B,
        then compare against thresholds.
        """
        # Get metrics from last interval
        since = datetime.utcnow() - timedelta(minutes=self.interval)

        # LangSmith provides per-run metrics
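        # is_root=True keeps only top-level traces, so nested chain steps
        # are not counted as separate requests
        recent_runs = self.langsmith_client.list_runs(
            project_name="llm-monitoring-tutorial",
            start_time=since,
            run_type="chain",
            is_root=True
        )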

        # Calculate aggregate metrics
        latencies = []
        errors = 0
        total = 0

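        # list_runs returns a regular (synchronous) iterator
        for run in recent_runs:
            total += 1
            if run.error:
                errors += 1
            # Derive latency from the run's start and end timestamps
            if run.start_time and run.end_time:
                latencies.append(
                    (run.end_time - run.start_time).total_seconds() * 1000
                )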

        if total == 0:
            return {"status": "no_data", "message": "No runs in interval"}

        # Calculate statistics
        avg_latency = sum(latencies) / len(latencies) if latencies else 0
        error_rate = errors / total if total > 0 else 0

        # Sort for percentile calculation
        sorted_latencies = sorted(latencies)
        p95_index = int(len(sorted_latencies) * 0.95)
        p95_latency = sorted_latencies[p95_index] if sorted_latencies else 0

        # Log to W&B for historical tracking
        wandb.log({
            "monitoring/avg_latency_ms": avg_latency,
            "monitoring/p95_latency_ms": p95_latency,
            "monitoring/error_rate": error_rate,
            "monitoring/total_requests": total,
            "monitoring/check_timestamp": time.time()
        })

        # Check thresholds
        alerts = []

        if avg_latency > self.latency_threshold:
            alerts.append(
                f"High average latency: {avg_latency:.0f}ms "
                f"(threshold: {self.latency_threshold}ms)"
            )

        if p95_latency > self.latency_threshold * 2:
            alerts.append(
                f"High P95 latency: {p95_latency:.0f}ms "
                f"(threshold: {self.latency_threshold * 2}ms)"
            )

        if error_rate > self.error_rate_threshold:
            alerts.append(
                f"High error rate: {error_rate:.2%} "
                f"(threshold: {self.error_rate_threshold:.2%})"
            )

        # Log alerts to W&B as a table for dashboard visibility
        if alerts:
            alert_table = wandb.Table(
                columns=["timestamp", "alert", "avg_latency", "error_rate"],
                data=[[time.time(), alert, avg_latency, error_rate] 
                      for alert in alerts]
            )
            wandb.log({"alerts": alert_table})

        return {
            "status": "alert" if alerts else "ok",
            "metrics": {
                "avg_latency_ms": avg_latency,
                "p95_latency_ms": p95_latency,
                "error_rate": error_rate,
                "total_requests": total
            },
            "alerts": alerts
        }

    async def run_continuously(self):
        """Run monitoring checks at regular intervals."""
        print(f"Starting quality monitor (checking every {self.interval} minutes)")

        while True:
            try:
                result = await self.check_current_metrics()

                if result["status"] == "alert":
                    print(f"ALERTS DETECTED: {result['alerts']}")
                    # In production, send to PagerDuty, Slack, etc.
                else:
                    print(f"Health check passed: {result['metrics']}")

            except Exception as e:
                print(f"Monitoring check failed: {e}")
                wandb.log({"monitoring_error": str(e)})

            await asyncio.sleep(self.interval * 60)

# Run the monitor
if __name__ == "__main__":
    monitor = QualityMonitor()
    asyncio.run(monitor.run_continuously())

This evaluation monitor runs as a background service, checking metrics every 15 minutes. It compares current performance against fixed latency and error-rate thresholds, loads baseline metrics from W&B for reference, and logs alerts as structured tables for dashboard visualization.
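The monitor loads reference_metrics in its constructor, but the checks in check_current_metrics only use the fixed thresholds. One way a baseline comparison could be added is sketched below. The function name `find_regressions` and the 20% default tolerance are assumptions for illustration, not part of the original monitor; the metric names match the keys returned by _load_reference_metrics.

```python
def find_regressions(
    current: dict[str, float],
    reference: dict[str, float],
    tolerance: float = 0.2,
) -> list[str]:
    """Flag metrics that exceed their baseline by more than `tolerance`.

    `tolerance` is relative: 0.2 allows a metric to be up to 20% worse
    than the baseline before it is reported. All metrics are assumed to
    be "lower is better" (latency, error rate).
    """
    if tolerance < 0:
        raise ValueError("tolerance must be non-negative")

    messages = []
    for name, baseline in reference.items():
        value = current.get(name)
        if value is None:
            continue  # metric not measured in this interval
        limit = baseline * (1 + tolerance)
        if value > limit:
            messages.append(
                f"{name} regressed: {value:.4g} vs baseline {baseline:.4g} "
                f"(limit {limit:.4g})"
            )
    return messages
```

Inside check_current_metrics, the returned messages could be appended to the alerts list, so that a slow drift away from the last good deployment is reported even while absolute thresholds are still met. Note that a baseline of zero makes any positive value count as a regression.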

Handling Edge Cases and Production Considerations

Production monitoring systems fail in predictable ways. Here are the critical edge cases to handle:

API Rate Limits: Both LangSmith and W&B have rate limits. LangSmith's free tier allows 10,000 traces per month, while W&B's free tier allows 100 GB of logged data. For production, budget for paid tiers: LangSmith starts at $99/month for teams, and W&B Teams starts at $50/user/month.
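When a monitoring API does reject a call because of rate limiting, a common response is to retry with exponential backoff rather than drop the data point. The helper below is a generic sketch of that technique and is not taken from either SDK; the name `call_with_backoff`, its parameters, and the default delays are all assumptions. The `sleep` argument is injectable so the function can be exercised without real waiting.

```python
import random
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def call_with_backoff(
    func: Callable[[], T],
    max_attempts: int = 5,
    base_delay: float = 0.5,
    max_delay: float = 30.0,
    sleep: Callable[[float], None] = time.sleep,
    retry_on: tuple[type[BaseException], ...] = (Exception,),
) -> T:
    """Call `func`, retrying with exponential backoff on failure.

    Delays double on each attempt (0.5s, 1s, 2s, ...) up to `max_delay`,
    with up to 10% random jitter added so that many clients do not
    retry in lockstep. The last failure is re-raised.
    """
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")

    for attempt in range(1, max_attempts + 1):
        try:
            return func()
        except retry_on:
            if attempt == max_attempts:
                raise
            delay = min(max_delay, base_delay * 2 ** (attempt - 1))
            sleep(delay + random.uniform(0, delay * 0.1))

    raise AssertionError("unreachable")  # loop always returns or raises
```

A call such as `call_with_backoff(lambda: wandb.log(metrics))` would then retry a failed log call a few times. In a request path, keep `max_attempts` and `max_delay` small, or run the retries in a background thread, so monitoring never delays the user-facing response.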

Trace Sampling: At high throughput, tracing every request becomes expensive. Implement sampling:

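import random

from langchain_core.tracers import LangChainTracer

def should_trace(sample_rate: float = 0.1) -> bool:
    """Sample traces to reduce monitoring costs."""
    return random.random() < sample_rate

# In your pipeline. This assumes LANGCHAIN_TRACING_V2 is unset, so a request
# is traced only when the tracer callback is attached explicitly.
config = {}
if should_trace(0.1):  # Trace 10% of requests
    config["callbacks"] = [LangChainTracer()]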
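Random sampling makes an independent decision on every call, so a retried request or a multi-step session can end up only partly traced. A variation worth considering is deterministic sampling keyed on a stable identifier. This is a sketch under the assumption that each request carries some ID (the `request_id` parameter and the function name `should_trace_request` are hypothetical):

```python
import hashlib


def should_trace_request(request_id: str, sample_rate: float = 0.1) -> bool:
    """Decide whether to trace a request, based on a hash of its ID.

    The same ID always yields the same decision, so every step of a
    request (or every request of a user, if a user ID is passed) is
    either fully traced or not traced at all.
    """
    if not 0.0 <= sample_rate <= 1.0:
        raise ValueError("sample_rate must be between 0 and 1")

    digest = hashlib.sha256(request_id.encode("utf-8")).digest()
    # Map the first 8 bytes of the hash to a float in [0, 1)
    bucket = int.from_bytes(digest[:8], "big") / 2**64
    return bucket < sample_rate
```

Because the decision depends only on the ID and the rate, separate services can agree on which requests are sampled without coordinating, provided they use the same rate.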

Memory Management: W&B runs accumulate data in memory. For long-running services, periodically finish and restart runs:

import time
from datetime import datetime

import wandb

class ManagedWandbRun:
    """Automatically rotates W&B runs to prevent memory leaks."""

    def __init__(self, rotation_hours: int = 24):
        self.rotation_hours = rotation_hours
        self.run = None
        self.start_time = None
        self._initialize_run()

    def _initialize_run(self):
        if self.run:
            self.run.finish()
        self.run = wandb.init(
            project="llm-monitoring-tutorial",
            config={"rotation_start": datetime.utcnow().isoformat()}
        )
        self.start_time = time.time()

    def log(self, data: dict):
        if time.time() - self.start_time > self.rotation_hours * 3600:
            self._initialize_run()
        self.run.log(data)

Error Recovery: If W&B or LangSmith APIs are unreachable, your application should degrade gracefully, not crash:

import functools
import logging

import wandb

logger = logging.getLogger(__name__)

def safe_monitoring(func):
    """Decorator that catches monitoring failures without crashing the app."""
    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        try:
            return func(*args, **kwargs)
        except Exception as e:
            logger.warning(f"Monitoring call failed: {e}")
            return None
    return wrapper

# Usage
@safe_monitoring
def log_to_wandb(metrics):
    wandb.log(metrics)

What's Next

You now have a production-ready monitoring setup that combines LangSmith's per-request traces with W&B's aggregate dashboards. The system automatically detects latency regressions and error rate spikes; answer quality is not yet scored, which is the first extension below.

To extend this further:

Add LLM-as-judge evaluation: Use a separate LLM to score response quality and log those scores to W&B alongside latency metrics.
Implement A/B testing: Route traffic between model versions and compare metrics in W&B dashboards.
Build custom dashboards: W&B's reporting features let you create executive summaries showing cost per query, user satisfaction trends, and model drift over time.
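As a starting point for the LLM-as-judge idea, the sketch below shows one possible shape: build a rubric prompt, send it to a judge model, and parse a numeric score that can be logged next to the latency metrics. The prompt wording, the 1 to 5 scale, the `judge/` metric names, and the function names are all assumptions. The `judge` argument is any callable that takes a prompt string and returns the model's reply, so the example does not depend on a particular SDK.

```python
import re
from typing import Callable, Optional

# Hypothetical rubric prompt; adjust the wording and scale to your use case.
JUDGE_PROMPT = (
    "Rate how well the answer addresses the question on a scale of 1 to 5.\n"
    "Reply with the number only.\n\n"
    "Question: {question}\n"
    "Answer: {answer}\n"
    "Score:"
)


def parse_score(reply: str) -> Optional[int]:
    """Extract the first standalone digit from 1 to 5 in the judge's reply."""
    match = re.search(r"\b([1-5])\b", reply)
    return int(match.group(1)) if match else None


def judge_answer(question: str, answer: str, judge: Callable[[str], str]) -> dict:
    """Score one answer and return a flat metrics dict.

    `judge` sends a prompt to an LLM and returns its text reply.
    """
    reply = judge(JUDGE_PROMPT.format(question=question, answer=answer))
    score = parse_score(reply)
    return {
        "judge/score": score,
        "judge/parse_failed": score is None,
    }
```

Tracking the parse-failure flag separately matters in practice: a judge that stops following the output format would otherwise look like missing data rather than a monitoring problem.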

The key insight from this implementation is that monitoring is not a separate concern—it's embedded in every request path. By instrumenting at the pipeline level, you capture data that helps you understand not just what happened, but why it happened. In production LLM systems, that distinction is the difference between hours of debugging and immediate resolution.

References

1. Wikipedia - GPT.

2. Wikipedia - OpenAI.

3. Wikipedia - LangChain.

4. arXiv - Learning Dexterous In-Hand Manipulation.

5. arXiv - OpenAI o1 System Card.

6. GitHub - Significant-Gravitas/AutoGPT.

7. GitHub - openai/openai-python.

8. GitHub - langchain-ai/langchain.

9. GitHub - fighting41love/funNLP.

10. OpenAI Pricing.
