How to Monitor LLM Apps with LangSmith and Weights & Biases

BlogIA Academy · June 12, 2026 · 15 min read · 2,863 words


Building production LLM applications is fundamentally different from prototyping in a notebook. When you move from a single prompt test to serving thousands of users, you need observability into every aspect of your pipeline: prompt quality, model latency, token consumption, and response correctness. Two platforms have emerged as the standard tools for this task: LangSmith for tracing and debugging, and Weights & Biases (W&B) for experiment tracking and artifact management.

In this tutorial, you will build a production-grade monitoring system that integrates both platforms into a FastAPI application serving a RAG (Retrieval-Augmented Generation) pipeline. You will learn how to instrument every request, log traces to LangSmith, log metrics and artifacts to W&B, and handle the edge cases that break naive monitoring setups.

Why Dual Monitoring Matters in Production

A single monitoring tool cannot cover the full lifecycle of an LLM application. LangSmith excels at per-request tracing—showing you exactly which prompt was sent, which documents were retrieved, and how the model responded. W&B excels at aggregate observability—tracking how your system performs over time, comparing prompt versions, and storing evaluation results.

According to LangChain's documentation [7], LangSmith supports tracing for LangChain, OpenAI, and custom applications via the OpenTelemetry standard. As of June 2026, LangSmith's free tier includes 10,000 traced requests per month, with paid plans starting at $99/month for teams. W&B's free tier includes unlimited personal projects with 100 GB of artifact storage, with team plans starting at $50/user/month.

The architecture we will build looks like this:

User Request → FastAPI → LangChain RAG Pipeline
                ↓                    ↓
           LangSmith Trace      W&B Metrics
           (per-request)        (aggregate + artifacts)

Prerequisites and Environment Setup

Before writing any code, set up your environment with the required dependencies. You will need Python 3.10 or later, a LangSmith API key, and a W&B API key.

# Create a virtual environment
python -m venv llm-monitor-env
source llm-monitor-env/bin/activate

# Install core dependencies
pip install fastapi uvicorn langchain langchain-openai langchain-chroma langsmith wandb chromadb pydantic python-dotenv

# Verify installations
python -c "import langsmith; print(langsmith.__version__)"
python -c "import wandb; print(wandb.__version__)"

Create a .env file with your credentials:

# .env
OPENAI_API_KEY=sk-your-openai-key
LANGSMITH_API_KEY=ls-your-langsmith-key
LANGSMITH_PROJECT=llm-monitor-tutorial
LANGSMITH_TRACING_V2=true
WANDB_API_KEY=your-wandb-api-key
WANDB_PROJECT=llm-monitor-tutorial
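A missing credential tends to surface late, as a failed trace or a silent W&B login prompt. The following is a minimal sketch of a startup check, assuming the variable names from the .env file above; the helper name `missing_env_vars` is hypothetical and not part of either SDK.

```python
import os

# Variable names mirror the .env file above; adjust if yours differ.
REQUIRED_VARS = (
    "OPENAI_API_KEY",
    "LANGSMITH_API_KEY",
    "LANGSMITH_PROJECT",
    "WANDB_API_KEY",
    "WANDB_PROJECT",
)


def missing_env_vars(env=None, required=REQUIRED_VARS):
    """Return the required variable names that are unset or blank."""
    env = os.environ if env is None else env
    return [name for name in required if not str(env.get(name, "")).strip()]
```

Calling `missing_env_vars()` right after `load_dotenv()` and exiting when the list is non-empty would make a misconfigured deployment fail fast instead of running without monitoring.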

Edge case: If you are behind a corporate proxy, configure it via the standard HTTP_PROXY/HTTPS_PROXY environment variables, and set LANGSMITH_TRACING_V2=false if you need to disable tracing while you verify connectivity. LangSmith uses HTTPS on port 443, which most proxies allow, but W&B uses WebSockets for live metrics, so ensure your network permits WebSocket connections to api.wandb.ai.

Instrumenting the RAG Pipeline with LangSmith Tracing

LangSmith provides automatic tracing for LangChain applications. When you set the environment variable LANGSMITH_TRACING_V2=true, every LangChain call is automatically logged. However, for production systems, you need explicit control over what gets traced and how traces are organized.

Create a file called rag_pipeline.py that defines your RAG chain with explicit LangSmith instrumentation:

# rag_pipeline.py
import os
from dotenv import load_dotenv
from langchain_openai import ChatOpenAI, OpenAIEmbeddings
from langchain_chroma import Chroma
from langchain_core.prompts import ChatPromptTemplate
from langsmith import Client as LangSmithClient
# `traceable` is a decorator; `trace` is the context manager used for child runs
from langsmith.run_helpers import trace, traceable

load_dotenv()

# Initialize LangSmith client for manual tracing
langsmith_client = LangSmithClient(
    api_key=os.getenv("LANGSMITH_API_KEY"),
    api_url="https://api.smith.langchain.com"
)

# Initialize embedding model and vector store
embeddings = OpenAIEmbeddings(model="text-embedding-3-small")
vector_store = Chroma(
    collection_name="docs",
    embedding_function=embeddings,
    persist_directory="./chroma_db"
)
retriever = vector_store.as_retriever(search_kwargs={"k": 3})

# Define the RAG prompt template
prompt_template = ChatPromptTemplate.from_messages([
    ("system", "You are a helpful assistant. Answer the question based on the provided context."),
    ("human", "Context: {context}\n\nQuestion: {question}")
])

# Initialize the LLM
llm = ChatOpenAI(
    model="gpt-4o-mini",
    temperature=0,
    max_tokens=1024
)

def format_docs(docs):
    """Format retrieved documents into a single context string."""
    return "\n\n".join(doc.page_content for doc in docs)

# Build the RAG chain with explicit tracing
@traceable(
    project_name=os.getenv("LANGSMITH_PROJECT", "llm-monitor-tutorial"),
    name="rag_chain",
    run_type="chain"
)
def rag_chain(question: str, user_id: str = "anonymous") -> dict:
    """
    Execute the RAG pipeline with full LangSmith tracing.

    Args:
        question: The user's question
        user_id: Identifier for the user making the request

    Returns:
        dict containing the answer, retrieved documents, and metadata
    """
    # Step 1: Retrieve documents (child run of rag_chain)
    with trace(
        name="retrieve_docs",
        run_type="retriever",
        inputs={"question": question},
        project_name=os.getenv("LANGSMITH_PROJECT")
    ) as retrieve_trace:
        docs = retriever.invoke(question)
        retrieve_trace.add_metadata({
            "num_docs": len(docs),
            "doc_sources": [doc.metadata.get("source", "unknown") for doc in docs]
        })
        retrieve_trace.end(outputs={"num_docs": len(docs)})

    # Step 2: Format context
    context = format_docs(docs)

    # Step 3: Generate response (child run of rag_chain)
    with trace(
        name="generate_response",
        run_type="llm",
        inputs={"question": question, "context": context},
        project_name=os.getenv("LANGSMITH_PROJECT")
    ) as llm_trace:
        messages = prompt_template.format_messages(context=context, question=question)
        response = llm.invoke(messages)
        token_usage = response.response_metadata.get("token_usage", {})
        llm_trace.add_metadata({
            "model": "gpt-4o-mini",
            "temperature": 0,
            "max_tokens": 1024,
            "input_tokens": token_usage.get("prompt_tokens", 0),
            "output_tokens": token_usage.get("completion_tokens", 0)
        })
        llm_trace.end(outputs={"answer": response.content})

    return {
        "answer": response.content,
        "source_documents": [doc.page_content for doc in docs],
        "metadata": {
            "user_id": user_id,
            "model": "gpt-4o-mini",
            "total_tokens": response.response_metadata.get("token_usage", {}).get("total_tokens", 0)
        }
    }

Key design decisions:

  1. Explicit @traceable decorators: While LangSmith can auto-trace, explicit decorators give you control over run names and metadata. This is critical when you need to filter traces by operation type in the LangSmith UI.

  2. Nested traces: The rag_chain function creates a parent trace, and the retrieval and generation steps create child traces. This preserves the hierarchical relationship, making it easy to see which documents were used for each response.

  3. Metadata injection: Adding metadata like num_docs, doc_sources, and token counts allows you to create dashboards in LangSmith that show average retrieval size or token consumption per user.

Edge case: If the vector store is empty or the retriever returns zero documents, the format_docs function returns an empty string. The LLM will still generate a response, but it will lack context. Your monitoring should flag this—add a check in the retrieve_trace metadata to record num_docs: 0 so you can alert on empty retrievals.
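One way to make empty retrievals easy to filter is to attach an explicit boolean next to the count. The sketch below builds the metadata dictionary in a standalone helper; the function name and the `empty_retrieval` key are hypothetical, and it only assumes that each document exposes a `metadata` dictionary, as the documents in the pipeline above do.

```python
from typing import Any, Dict, List, Sequence


def retrieval_metadata(docs: Sequence[Any]) -> Dict[str, Any]:
    """Build trace metadata for a retrieval step, flagging empty results."""
    sources: List[str] = [
        (getattr(doc, "metadata", None) or {}).get("source", "unknown")
        for doc in docs
    ]
    return {
        "num_docs": len(docs),
        "doc_sources": sources,
        # Hypothetical key: a boolean is simpler to filter on than num_docs == 0
        "empty_retrieval": len(docs) == 0,
    }
```

The returned dictionary could be passed to the retrieval trace's metadata call in place of the inline dictionary, so the flag is recorded on every request.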

Logging Metrics and Artifacts to Weights & Biases

LangSmith handles per-request tracing, but you need W&B for aggregate metrics over time. Create a file called wandb_monitor.py that logs performance metrics and evaluation results:

# wandb_monitor.py
import os
import time
import json
from datetime import datetime
from typing import Dict, List, Optional
import wandb
from dotenv import load_dotenv

load_dotenv()

class WBDMonitor:
    """
    Production-grade monitor for logging LLM metrics to Weights & Biases.

    Handles:
    - Real-time metric logging
    - Artifact versioning for prompts and evaluation results
    - Failure recovery with local buffering
    """

    def __init__(self, project_name: str = None, run_name: str = None):
        self.project = project_name or os.getenv("WANDB_PROJECT", "llm-monitor-tutorial")
        self.run_name = run_name or f"run_{datetime.now().strftime('%Y%m%d_%H%M%S')}"
        self.run = None
        self.local_buffer = []  # Buffer for failed log attempts
        self.buffer_file = "./wandb_buffer.json"

    def start_run(self, config: Dict = None):
        """Initialize a W&B run with optional configuration."""
        self.run = wandb.init(
            project=self.project,
            name=self.run_name,
            config=config or {
                "model": "gpt-4o-mini",
                "embedding_model": "text-embedding-3-small",
                "retrieval_k": 3,
                "temperature": 0
            },
            # Resume if run was interrupted
            resume="allow",
            # Save code to W&B for reproducibility
            save_code=True
        )

        # Flush any buffered logs from previous failed attempts
        self._flush_buffer()

    def log_request_metrics(self, metrics: Dict):
        """
        Log per-request metrics to W&B.

        Args:
            metrics: Dictionary containing latency, tokens, etc.
        """
        try:
            if self.run:
                self.run.log(metrics)
            else:
                self._buffer_log(metrics)
        except Exception as e:
            print(f"W&B log failed, buffering: {e}")
            self._buffer_log(metrics)

    def log_evaluation_results(self, eval_results: List[Dict], dataset_name: str):
        """
        Log evaluation results as a W&B Table artifact.

        Args:
            eval_results: List of dictionaries with question, answer, score
            dataset_name: Name for the evaluation dataset
        """
        if not self.run:
            self.start_run()

        # Create a W&B Table for structured evaluation data
        table = wandb.Table(
            columns=["question", "answer", "expected_answer", "score", "latency_ms"]
        )

        for result in eval_results:
            table.add_data(
                result.get("question", ""),
                result.get("answer", ""),
                result.get("expected_answer", ""),
                result.get("score", 0.0),
                result.get("latency_ms", 0)
            )

        # Log the table as an artifact
        artifact = wandb.Artifact(
            name=f"eval_{dataset_name}_{datetime.now().strftime('%Y%m%d')}",
            type="evaluation_results",
            description=f"Evaluation results for {dataset_name} on {datetime.now().strftime('%Y-%m-%d')}"
        )
        artifact.add(table, "results")

        self.run.log_artifact(artifact)

        # Also log aggregate metrics
        avg_score = sum(r.get("score", 0) for r in eval_results) / len(eval_results) if eval_results else 0
        avg_latency = sum(r.get("latency_ms", 0) for r in eval_results) / len(eval_results) if eval_results else 0

        self.run.log({
            f"eval/{dataset_name}/avg_score": avg_score,
            f"eval/{dataset_name}/avg_latency_ms": avg_latency,
            f"eval/{dataset_name}/num_samples": len(eval_results)
        })

    def log_prompt_version(self, prompt_text: str, version: str, metadata: Dict = None):
        """
        Version a prompt template as a W&B artifact.

        Args:
            prompt_text: The prompt template string
            version: Semantic version string (e.g., "1.2.0")
            metadata: Additional metadata about the prompt
        """
        if not self.run:
            self.start_run()

        artifact = wandb.Artifact(
            name=f"prompt_template_v{version}",
            type="prompt",
            description=f"Prompt template version {version}"
        )

        # Save prompt as a text file in the artifact
        with artifact.new_file("prompt.txt", "w") as f:
            f.write(prompt_text)

        if metadata:
            with artifact.new_file("metadata.json", "w") as f:
                json.dump(metadata, f)

        self.run.log_artifact(artifact)

    def _buffer_log(self, metrics: Dict):
        """Buffer metrics locally when W&B is unavailable."""
        self.local_buffer.append({
            "timestamp": datetime.now().isoformat(),
            "metrics": metrics
        })
        # Persist buffer to disk for crash recovery
        with open(self.buffer_file, "w") as f:
            json.dump(self.local_buffer, f)

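    def _flush_buffer(self):
        """Flush buffered logs to W&B, including any persisted by a previous process."""
        # Recover entries written to disk before a crash or restart
        if not self.local_buffer and os.path.exists(self.buffer_file):
            try:
                with open(self.buffer_file) as f:
                    self.local_buffer = json.load(f)
            except (OSError, json.JSONDecodeError) as e:
                print(f"Could not read buffer file: {e}")

        if not self.local_buffer or not self.run:
            return

        try:
            for entry in self.local_buffer:
                self.run.log(entry["metrics"])
            self.local_buffer = []
            # Clear the buffer file
            if os.path.exists(self.buffer_file):
                os.remove(self.buffer_file)
        except Exception as e:
            print(f"Failed to flush buffer: {e}")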

    def finish_run(self):
        """Properly close the W&B run."""
        if self.run:
            self._flush_buffer()
            self.run.finish()

Critical design patterns:

  1. Local buffering with crash recovery: W&B API calls can fail due to network issues or rate limits. The _buffer_log method saves metrics to a local JSON file. On the next successful start_run, _flush_buffer replays those logs. This prevents data loss during transient failures.

  2. Artifact versioning: Prompts and evaluation results are stored as versioned artifacts. This allows you to roll back to a previous prompt version if a deployment causes quality degradation. W&B artifacts support aliases like "production" and "staging" for easy reference.

  3. Structured evaluation tables: Instead of logging raw numbers, use wandb.Table to store question-answer pairs with scores. This enables interactive querying in the W&B UI—you can filter by score range to find failure cases.

Edge case: W&B rate limits free accounts to 100 requests per minute. If you are logging per-request metrics for a high-traffic application, batch metrics every 10-100 requests instead of logging each one individually. As written, log_request_metrics sends one dictionary per call; to batch, extend it to accept a list of metrics and compute aggregates before sending.
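The batching idea can be sketched as a small wrapper that sits in front of the monitor. This is an illustration under stated assumptions rather than part of the W&B SDK: the class name `MetricBatcher`, the `/mean` and `/max` key suffixes, and the default batch size are all hypothetical, and the class is not thread-safe as written.

```python
from typing import Callable, Dict, List


def _is_number(value) -> bool:
    # bool is a subclass of int, so exclude it explicitly
    return isinstance(value, (int, float)) and not isinstance(value, bool)


class MetricBatcher:
    """Collect per-request metrics and emit one aggregated record per batch."""

    def __init__(self, sink: Callable[[Dict], None], batch_size: int = 50):
        if batch_size < 1:
            raise ValueError("batch_size must be at least 1")
        self.sink = sink  # e.g. wandb_monitor.log_request_metrics
        self.batch_size = batch_size
        self._pending: List[Dict] = []

    def add(self, metrics: Dict) -> None:
        """Queue one request's metrics; flush when the batch is full."""
        self._pending.append(metrics)
        if len(self._pending) >= self.batch_size:
            self.flush()

    def flush(self) -> None:
        """Send mean and max of every numeric metric, then reset the batch."""
        if not self._pending:
            return
        batch, self._pending = self._pending, []
        summary: Dict = {"batch/size": len(batch)}
        numeric_keys = sorted(
            {k for m in batch for k, v in m.items() if _is_number(v)}
        )
        for key in numeric_keys:
            values = [m[key] for m in batch if _is_number(m.get(key))]
            summary[f"{key}/mean"] = sum(values) / len(values)
            summary[f"{key}/max"] = max(values)
        self.sink(summary)
```

Non-numeric fields such as user IDs are dropped from the aggregate, which is usually acceptable because LangSmith already keeps the per-request detail. Calling `flush()` during shutdown avoids losing a partial batch.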

Building the FastAPI Application with Integrated Monitoring

Now combine LangSmith tracing and W&B monitoring into a FastAPI application. Create app.py:

# app.py
import os
import time
import uuid
from datetime import datetime
from typing import Optional
from contextlib import asynccontextmanager

from fastapi import FastAPI, HTTPException, Request
from fastapi.middleware.cors import CORSMiddleware
from pydantic import BaseModel, Field
from dotenv import load_dotenv

# Import our custom modules
from rag_pipeline import rag_chain
from wandb_monitor import WBDMonitor

load_dotenv()

# Initialize W&B monitor
wandb_monitor = WBDMonitor(
    project_name=os.getenv("WANDB_PROJECT", "llm-monitor-tutorial")
)

# Application lifecycle management
@asynccontextmanager
async def lifespan(app: FastAPI):
    """Start W&B run on application startup and finish on shutdown."""
    wandb_monitor.start_run(config={
        "model": "gpt-4o-mini",
        "embedding_model": "text-embedding-3-small",
        "retrieval_k": 3,
        "temperature": 0,
        "deployment_date": datetime.now().isoformat()
    })
    yield
    wandb_monitor.finish_run()

app = FastAPI(
    title="LLM Monitoring Demo",
    version="1.0.0",
    lifespan=lifespan
)

# CORS for production deployments
app.add_middleware(
    CORSMiddleware,
    allow_origins=["*"],  # Restrict in production
    allow_credentials=True,
    allow_methods=["*"],
    allow_headers=["*"],
)

# Request/Response models
class QueryRequest(BaseModel):
    question: str = Field(..., min_length=1, max_length=2000, description="User question")
    user_id: Optional[str] = Field(default="anonymous", max_length=100)
    session_id: Optional[str] = Field(default=None)

class QueryResponse(BaseModel):
    answer: str
    source_documents: list[str]
    metadata: dict
    request_id: str
    latency_ms: float

@app.post("/query", response_model=QueryResponse)
async def query_endpoint(request: QueryRequest, http_request: Request):
    """
    Main query endpoint with integrated monitoring.

    Logs to:
    - LangSmith: Full trace of the RAG pipeline
    - W&B: Latency, token usage, and error rates
    """
    request_id = str(uuid.uuid4())
    session_id = request.session_id or request_id
    start_time = time.time()

    try:
        # Execute the RAG pipeline (traced by LangSmith).
        # langsmith_extra attaches the IDs to the trace so it can be matched to W&B logs.
        result = rag_chain(
            question=request.question,
            user_id=request.user_id,
            langsmith_extra={
                "metadata": {"request_id": request_id, "session_id": session_id}
            }
        )

        latency_ms = (time.time() - start_time) * 1000

        # Log metrics to W&B
        wandb_monitor.log_request_metrics({
            "request/request_id": request_id,
            "request/latency_ms": latency_ms,
            "request/total_tokens": result["metadata"].get("total_tokens", 0),
            "request/num_source_docs": len(result["source_documents"]),
            "request/user_id": request.user_id,
            "request/session_id": session_id,
            "request/timestamp": datetime.now().isoformat()
        })

        return QueryResponse(
            answer=result["answer"],
            source_documents=result["source_documents"],
            metadata=result["metadata"],
            request_id=request_id,
            latency_ms=latency_ms
        )

    except Exception as e:
        latency_ms = (time.time() - start_time) * 1000

        # Log error to W&B
        wandb_monitor.log_request_metrics({
            "request/request_id": request_id,
            "request/latency_ms": latency_ms,
            "request/error": str(e)[:200],  # Truncate long errors
            "request/error_type": type(e).__name__,
            "request/user_id": request.user_id,
            "request/session_id": session_id
        })

        raise HTTPException(
            status_code=500,
            detail={
                "error": "Internal server error",
                "request_id": request_id,
                "message": str(e)
            }
        )

@app.get("/health")
async def health_check():
    """Health check endpoint for load balancers."""
    return {
        "status": "healthy",
        "timestamp": datetime.now().isoformat(),
        "wandb_run": wandb_monitor.run.id if wandb_monitor.run else "not_started"
    }

@app.post("/evaluate")
async def evaluate_endpoint():
    """
    Run evaluation on a test dataset and log results to W&B.

    This endpoint is typically called by CI/CD pipelines after deployment.
    """
    # Example evaluation dataset
    eval_data = [
        {"question": "What is RAG?", "expected_answer": "Retrieval-Augmented Generation"},
        {"question": "How does LangSmith work?", "expected_answer": "It traces LLM calls"},
    ]

    eval_results = []
    for item in eval_data:
        start = time.time()
        try:
            result = rag_chain(question=item["question"], user_id="eval")
            latency = (time.time() - start) * 1000
            # Simple exact-match scoring (replace with semantic similarity in production)
            score = 1.0 if item["expected_answer"].lower() in result["answer"].lower() else 0.0
            eval_results.append({
                "question": item["question"],
                "answer": result["answer"],
                "expected_answer": item["expected_answer"],
                "score": score,
                "latency_ms": latency
            })
        except Exception as e:
            eval_results.append({
                "question": item["question"],
                "answer": f"ERROR: {str(e)}",
                "expected_answer": item["expected_answer"],
                "score": 0.0,
                "latency_ms": (time.time() - start) * 1000
            })

    # Log evaluation results to W&B
    wandb_monitor.log_evaluation_results(eval_results, "production_test")

    return {
        "status": "completed",
        "num_samples": len(eval_results),
        "avg_score": sum(r["score"] for r in eval_results) / len(eval_results)
    }

if __name__ == "__main__":
    import uvicorn
    uvicorn.run(
        "app:app",
        host="0.0.0.0",
        port=8000,
        reload=True,  # Disable in production
        log_level="info"
    )

Production considerations:

  1. Request ID propagation: Every request gets a UUID that appears in both LangSmith traces and W&B logs. This allows you to cross-reference a specific user request across both platforms.

  2. Error handling with monitoring: When an exception occurs, the error is logged to W&B with the error type and truncated message. This creates a time-series of error rates that you can alert on.

  3. Session tracking: The session_id field allows you to group multiple requests from the same user session. In LangSmith, you can filter traces by session to debug multi-turn conversations.
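To turn logged errors into a rate you can chart or alert on, one option is to keep a rolling window of recent outcomes in the application and log the resulting rate alongside the other metrics. The class below is a hypothetical helper, not a W&B feature, and the window size is an arbitrary assumption.

```python
from collections import deque


class RollingErrorRate:
    """Track the error rate over the most recent `window` requests."""

    def __init__(self, window: int = 100):
        # True means the request failed; old outcomes fall off automatically
        self._outcomes = deque(maxlen=window)

    def record(self, failed: bool) -> float:
        """Record one request outcome and return the updated rate."""
        self._outcomes.append(bool(failed))
        return self.rate

    @property
    def rate(self) -> float:
        if not self._outcomes:
            return 0.0
        return sum(self._outcomes) / len(self._outcomes)
```

In the endpoint, `record(False)` would go on the success path and `record(True)` in the exception handler, with the returned value logged under a key such as `request/error_rate`.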

Edge case: If the OpenAI API returns a 429 rate limit error, the rag_chain function will raise an exception. The error handler logs this to W&B, but the user receives a 500 error. In production, implement retry logic with exponential backoff in the rag_chain function, and log the number of retries as metadata.
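A minimal sketch of retry with exponential backoff is shown below, using only the standard library. The function name, defaults, and jitter amount are assumptions for illustration; which exception types count as retryable (for example the OpenAI client's rate limit error) is left to the caller.

```python
import random
import time
from typing import Callable, Tuple, Type, TypeVar

T = TypeVar("T")


def call_with_backoff(
    fn: Callable[[], T],
    retry_on: Tuple[Type[BaseException], ...] = (Exception,),
    max_retries: int = 3,
    base_delay: float = 0.5,
    sleep: Callable[[float], None] = time.sleep,
) -> Tuple[T, int]:
    """Call fn, retrying on the given exceptions.

    Returns (result, retries_used) so the retry count can be logged as metadata.
    """
    for attempt in range(max_retries + 1):
        try:
            return fn(), attempt
        except retry_on:
            if attempt == max_retries:
                raise  # retries exhausted: let the caller's error handler log it
            # Delay doubles each attempt (base, 2x, 4x, ...) plus up to 10% jitter
            delay = base_delay * (2 ** attempt)
            sleep(delay + random.uniform(0, delay * 0.1))
    raise AssertionError("unreachable")
```

Inside rag_chain, the model call could be wrapped as `call_with_backoff(lambda: llm.invoke(messages), retry_on=(...))`, and the returned retry count added to the generation trace metadata. ChatOpenAI also exposes a `max_retries` setting, which may be enough if you do not need to record the count yourself.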

Running and Verifying the Monitoring Setup

Start the application and verify that both monitoring systems are receiving data:

# Start the FastAPI server
python app.py

# In another terminal, send test requests
curl -X POST http://localhost:8000/query \
  -H "Content-Type: application/json" \
  -d '{"question": "What is LangSmith used for?", "user_id": "test_user_1"}'

# Run evaluation
curl -X POST http://localhost:8000/evaluate

# Check health
curl http://localhost:8000/health

Verification steps:

  1. LangSmith UI: Go to smith.langchain.com and select your project. You should see traces for each request, with nested spans for retrieval and generation. Click on a trace to see the exact prompt sent to OpenAI and the response received.

  2. W&B UI: Go to wandb.ai and select your project. You should see:

    • A run named run_YYYYMMDD_HHMMSS with latency and token metrics
    • An artifact named eval_production_test_YYYYMMDD containing the evaluation table
    • Time-series charts showing average latency and error rate over time

Edge case: If you see no traces in LangSmith, check that LANGSMITH_TRACING_V2=true is set in your environment. The @traceable decorator requires this environment variable to be set, even when using the explicit client. If you are using a self-hosted LangSmith instance, set LANGSMITH_ENDPOINT to your server URL.

What's Next

You now have a production-grade monitoring system that combines LangSmith's per-request tracing with W&B's aggregate observability. To extend this setup:

  1. Add alerting: Configure W&B to send Slack or email alerts when latency exceeds 5 seconds or error rate exceeds 1%. Use W&B's "Alerts" feature in the project settings.

  2. Implement A/B testing: Run two prompt versions simultaneously, log the version ID in both LangSmith metadata and W&B metrics, and compare performance using W&B's parallel coordinates charts.

  3. Integrate with CI/CD: Add the /evaluate endpoint to your deployment pipeline. Before promoting a new model version to production, run the evaluation suite and check that the average score does not decrease by more than 5%.

  4. Add user feedback: Extend the QueryResponse model to include a feedback_url that users can click to rate responses. Log this feedback to W&B as a separate metric, enabling you to correlate user satisfaction with specific prompt versions.
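The CI/CD check in step 3 can be sketched as a small gate function. This assumes the 5% threshold is a relative drop against a baseline score you have stored from a previous evaluation run; the function name and the handling of a zero baseline are hypothetical choices.

```python
def passes_quality_gate(
    new_score: float, baseline_score: float, max_drop: float = 0.05
) -> bool:
    """Return True if new_score has not dropped more than max_drop (relative)."""
    if baseline_score <= 0:
        # No meaningful baseline to regress against, so do not block the deploy
        return True
    return new_score >= baseline_score * (1 - max_drop)
```

A pipeline step could call the /evaluate endpoint, read `avg_score` from the JSON response, and fail the build when this function returns False.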

The combination of LangSmith and W&B gives you both the microscope to debug individual failures and the telescope to see system-wide trends. In production LLM applications, you need both perspectives to maintain quality at scale.


References

1. ChromaDB. Wikipedia.
2. GPT. Wikipedia.
3. Rag. Wikipedia.
4. chroma-core/chroma. GitHub.
5. Significant-Gravitas/AutoGPT. GitHub.
6. Shubhamsaboo/awesome-llm-apps. GitHub.
7. langchain-ai/langchain. GitHub.
8. ChromaDB Pricing.