
How to Monitor LLM Apps with LangSmith and Weights & Biases

Alexia Torres · April 3, 2026 · 9 min read · 1,774 words
This article was generated by Daily Neural Digest's autonomous neural pipeline: multi-source verified, fact-checked, and quality-scored.

The Dual-Scope Approach: Monitoring LLM Applications with LangSmith and Weights & Biases

The year is 2026, and large language models have quietly become the backbone of modern software infrastructure. From customer service chatbots that actually understand nuance to research tools that synthesize scientific literature, LLMs are no longer experimental curiosities—they're production systems handling real traffic, real money, and real consequences. But with that shift comes an uncomfortable truth: monitoring these systems is fundamentally different from monitoring traditional software.

When a REST API returns a 500 error, you know something broke. When an LLM returns a plausible-sounding but completely fabricated answer, the failure is silent, insidious, and potentially catastrophic. This is the monitoring paradox of the AI era—we've built systems that can generate human-quality text, but we're still figuring out how to tell when they're lying to us.

This is where the combination of LangSmith and Weights & Biases (W&B) enters the picture. LangSmith was built for LLM workflows, and W&B has grown from ML experiment tracking to cover them; together they offer a dual-lens approach to monitoring that addresses both immediate, real-time diagnostics and the long-term trend tracking that production AI demands. In this deep dive, we'll build a monitoring architecture that doesn't just log errors: it covers the lifecycle of your LLM application, from individual request behavior to system-wide performance trends.

The Architecture of Observability: Why Two Tools Are Better Than One

Before we write a single line of code, let's understand why this dual-tool approach matters. Traditional monitoring stacks—think Prometheus and Grafana, or the ELK stack—were designed for deterministic systems. They excel at tracking request rates, error codes, and latency percentiles. But LLMs introduce a new class of metrics that these tools struggle to capture: semantic accuracy, prompt injection detection, response coherence, and model drift.

LangSmith, built by the team behind LangChain, was designed specifically for this new reality. It provides granular, per-request logging with rich metadata about prompts, responses, and intermediate chain steps. Think of it as your real-time diagnostic tool, the equivalent of a debugger that lets you step through every prompt, response, and intermediate step of a single request.

Weights & Biases, on the other hand, has evolved from its roots in traditional machine learning experiment tracking into a comprehensive platform for LLM lifecycle management. It excels at aggregating metrics across thousands of runs, tracking model versions, and visualizing performance trends over time. If LangSmith is your microscope, W&B is your satellite view.

The architecture we'll build connects these two tools in a complementary loop. Every request to your LLM application gets logged to LangSmith for immediate inspection and debugging, while aggregated performance metrics flow into W&B for long-term analysis and experiment tracking. This isn't just redundancy—it's a deliberate layering of observability that gives you both the tactical and strategic views of your system's health.

Setting the Stage: Prerequisites and the Tools That Matter

Getting started requires minimal setup, but the choices you make here will echo through your entire monitoring pipeline. First, ensure you have Python and the necessary libraries installed:

pip install langsmith wandb

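You'll also need API keys for both services. For LangSmith, create one under Settings > API Keys at https://smith.langchain.com [7]. For W&B, copy your key from https://wandb.ai/authorize. Store both in environment variables (LANGSMITH_API_KEY and WANDB_API_KEY) rather than in source code.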

Why these tools over alternatives like TensorBoard or Prometheus? The answer lies in specialization. TensorBoard was designed for visualizing training metrics, not production inference logs. Prometheus excels at infrastructure monitoring but has no notion of prompts, responses, or chains. LangSmith and W&B, by contrast, are built around the artifacts LLM work produces: per-step traces, prompt and model versions, token usage, and evaluation scores you can use to flag likely hallucinations. They're not just monitoring tools; they're tools that model what an LLM request looks like.

Building the Monitoring Pipeline: From Initialization to Production

Let's walk through the implementation, starting with the foundation and building up to a production-ready system.

Initializing Your W&B Project

The first step is establishing your experiment tracking infrastructure. W&B's init function creates a new run within a project, giving you a container for all the metrics and logs you'll generate:

import wandb

wandb.login()  # Uses WANDB_API_KEY or a saved login; otherwise prompts for your key

# Start a new run in the "llm-monitoring" project (created if it doesn't exist)
wandb.init(project="llm-monitoring", name="initial-run")

This simple call sets up the scaffolding for everything that follows. The project parameter organizes your runs into logical groups—perhaps one per application or per deployment environment. The name parameter labels this specific run, making it easy to find later when you're comparing performance across different model versions or configuration changes.
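Run names are easier to compare later if they follow a fixed convention. The sketch below is one hypothetical scheme (the `make_run_name` helper and its field order are assumptions, not part of W&B) that encodes the application, environment, and model version in a sortable name, which could then be passed as the `name` argument to `wandb.init`.

```python
from datetime import datetime, timezone
from typing import Optional


def make_run_name(app: str, env: str, model_version: str,
                  now: Optional[datetime] = None) -> str:
    """Build a sortable run name; the layout is a hypothetical convention."""
    stamp = (now or datetime.now(timezone.utc)).strftime("%Y%m%d-%H%M%S")
    return f"{app}-{env}-{model_version}-{stamp}"


# A fixed timestamp keeps the example reproducible.
example_name = make_run_name(
    "support-bot", "prod", "v3",
    datetime(2026, 4, 3, 12, 0, 0, tzinfo=timezone.utc),
)
print(example_name)  # support-bot-prod-v3-20260403-120000
```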

Capturing Performance Metrics

With the project initialized, you can start logging the metrics that matter. The key performance indicators for LLM applications go beyond simple response times. You'll want to track accuracy, latency, token usage, and potentially more nuanced metrics like response coherence or factual consistency:

def log_performance_metrics(response_time: float, accuracy: float):
    wandb.log({"response_time": response_time, "accuracy": accuracy})

This function is deliberately simple, but it's the foundation for a much richer monitoring system. In production, you might extend this to log token counts, model confidence scores, or even the results of automated fact-checking pipelines.
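As a rough illustration of that extension, the sketch below wraps a model call, measures wall-clock latency, and builds a metrics dictionary with approximate token counts. The names `timed_llm_call`, `llm_fn`, and `log_fn` are hypothetical; in practice `log_fn` could be `wandb.log`. Whitespace splitting is only a stand-in for a real tokenizer.

```python
import time
from typing import Callable, Dict, List


def timed_llm_call(llm_fn: Callable[[str], str], prompt: str,
                   log_fn: Callable[[Dict[str, float]], None]) -> str:
    """Call the model, then hand latency and rough token counts to log_fn."""
    start = time.perf_counter()
    response = llm_fn(prompt)
    elapsed = time.perf_counter() - start
    log_fn({
        "response_time": elapsed,
        # Whitespace word counts: an approximation, not real token counts.
        "prompt_tokens_approx": float(len(prompt.split())),
        "response_tokens_approx": float(len(response.split())),
    })
    return response


# Usage with a stub model and a list standing in for the metrics backend.
captured: List[Dict[str, float]] = []
reply = timed_llm_call(lambda p: "Sunny and mild",
                       "What is the weather today?", captured.append)
```

Passing the logger in as an argument keeps the timing code independent of any one backend and easy to test.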

Integrating LangSmith for Real-Time Diagnostics

While W&B handles the aggregate view, LangSmith gives you the ability to drill down into individual requests. This is where the real debugging power lies:

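from datetime import datetime, timezone
from langsmith import Client

client = Client()  # reads LANGSMITH_API_KEY from the environment

def log_request_to_langsmith(request_data: dict):
    client.create_run(  # one prompt/response pair becomes one "llm" run
        name="llm_request", run_type="llm",
        inputs={"prompt": request_data["prompt"]},
        outputs={"response": request_data["response"]},
        end_time=datetime.now(timezone.utc),  # marks the run as finished
    )

Each call to create_run sends the prompt and response to LangSmith as a single run, where you can inspect it alongside other requests. Treat the field layout as a minimal sketch and check the create_run signature in your installed langsmith version. This per-request view is invaluable for diagnosing issues like prompt injection attacks, unexpected model behavior, or performance anomalies that only appear under specific conditions.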
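A per-request record is more useful when it carries timing and metadata as well as the prompt and response. The sketch below builds such a record as a plain dictionary. The `build_run_record` helper is hypothetical, and the field names are an assumption based on the arguments LangSmith's `Client.create_run` is understood to accept (`name`, `run_type`, `inputs`, `outputs`, `start_time`, `end_time`, `extra`); verify them against your SDK version before unpacking the record into a real call.

```python
from datetime import datetime, timedelta, timezone
from typing import Any, Dict


def build_run_record(prompt: str, response: str, started_at: datetime,
                     latency_s: float, model: str = "unknown-model") -> Dict[str, Any]:
    """Shape one request as a run record (field names are assumptions)."""
    return {
        "name": "llm_request",
        "run_type": "llm",
        "inputs": {"prompt": prompt},
        "outputs": {"response": response},
        "start_time": started_at,
        "end_time": started_at + timedelta(seconds=latency_s),
        "extra": {"metadata": {"model": model}},
    }


record = build_run_record(
    "What is the weather today?", "Sunny",
    datetime(2026, 4, 3, 12, 0, 0, tzinfo=timezone.utc), 0.5,
)
```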

The Unified Monitoring Function

The real power emerges when you combine both tools into a single monitoring function:

def monitor_llm_request(request_data: dict):
    response_time = measure_response_time()  # Assume this function measures the time taken to process the request.
    accuracy = calculate_accuracy()          # Assume this function calculates model accuracy based on the response.

    log_performance_metrics(response_time, accuracy)
    log_request_to_langsmith(request_data)

# Example usage
request_data = {"prompt": "What is the weather today?", "response": "Sunny"}
monitor_llm_request(request_data)

This unified function ensures that every request is captured in both systems. LangSmith gets the raw data for immediate analysis, while W&B receives the numeric metrics it aggregates for long-term trend tracking. It's a simple pattern, but it creates a powerful feedback loop: when you notice a performance degradation in W&B, you can jump to LangSmith to inspect the requests from the same period.
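Jumping from a metric to the matching request is easier when both systems share an identifier. The sketch below, under the assumption that each backend accepts arbitrary extra fields, stamps one UUID onto both records. `monitor_with_request_id`, `metrics_sink`, and `trace_sink` are hypothetical names; the sinks stand in for the W&B and LangSmith logging functions.

```python
import uuid
from typing import Any, Callable, Dict, List

Sink = Callable[[Dict[str, Any]], None]


def monitor_with_request_id(request_data: Dict[str, Any], response_time: float,
                            metrics_sink: Sink, trace_sink: Sink) -> str:
    """Send metrics and the raw request to separate sinks under one shared ID."""
    request_id = str(uuid.uuid4())
    metrics_sink({"response_time": response_time, "request_id": request_id})
    trace_sink({**request_data, "request_id": request_id})
    return request_id


# Lists stand in for the two backends.
metrics: List[Dict[str, Any]] = []
traces: List[Dict[str, Any]] = []
rid = monitor_with_request_id(
    {"prompt": "What is the weather today?", "response": "Sunny"},
    0.42, metrics.append, traces.append,
)
```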

Production Optimization: Scaling Your Monitoring Without Breaking It

A monitoring system that works for a single developer's laptop may fall apart under production load. Here's how to harden your setup for real-world deployment.

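Exploring Configurations with W&B Sweeps

When you're running experiments at scale, W&B's sweep feature becomes invaluable. It allows you to systematically explore different configurations:

sweep_config = {
    "method": "grid",  # try every combination of the values below
    "parameters": {
        "batch_size": {"values": [16, 32, 64]},
        "response_time_threshold": {"values": [0.5, 1.0, 1.5]}
    }
}

# Register the sweep; this returns an ID but does not run anything yet.
sweep_id = wandb.sweep(sweep_config, project="llm-monitoring")

This configuration defines a grid over batch sizes and response time thresholds, nine combinations in all. Registering the sweep only creates it; to execute the trials, pass sweep_id and a function that reads wandb.config to wandb.agent. Comparing the resulting runs helps you find the balance between throughput and latency.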
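To see what a grid search over this configuration covers, the sketch below expands the same dictionary into its individual combinations using only the standard library. The `expand_grid` helper is hypothetical and only mimics the enumeration; it is not how W&B schedules trials internally.

```python
from itertools import product
from typing import Any, Dict, List

sweep_config = {
    "method": "grid",
    "parameters": {
        "batch_size": {"values": [16, 32, 64]},
        "response_time_threshold": {"values": [0.5, 1.0, 1.5]},
    },
}


def expand_grid(config: Dict[str, Any]) -> List[Dict[str, Any]]:
    """List every parameter combination a grid search would visit."""
    params = config["parameters"]
    names = list(params)
    value_lists = [params[name]["values"] for name in names]
    return [dict(zip(names, combo)) for combo in product(*value_lists)]


combos = expand_grid(sweep_config)
print(len(combos))  # 9
print(combos[0])    # {'batch_size': 16, 'response_time_threshold': 0.5}
```

Grid size is the product of the value counts, so adding a third parameter with three values would triple the number of trials.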

Asynchronous Processing for High Throughput

In production, your monitoring system must not become a bottleneck. Asynchronous processing ensures that logging doesn't block request handling:

import asyncio

async def async_monitor_llm_request(request_data: dict):
    # Run the blocking logging calls in a worker thread (Python 3.9+).
    await asyncio.to_thread(monitor_llm_request, request_data)

This pattern offloads the monitoring work to a thread pool, so the event loop stays free to serve other requests while logging runs. The awaiting coroutine itself still waits for logging to finish; to avoid even that, schedule the call with asyncio.create_task instead of awaiting it inline.
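Another way to keep logging off the request path is a bounded queue drained by a background thread. The sketch below is a minimal version under stated assumptions: `BackgroundMonitor` is a hypothetical class, the handler would be a function such as `monitor_llm_request`, and records are dropped rather than blocking when the queue is full.

```python
import queue
import threading
from typing import Any, Callable, Dict, List


class BackgroundMonitor:
    """Hand records to a worker thread so request handlers never wait on logging."""

    def __init__(self, handler: Callable[[Dict[str, Any]], None], maxsize: int = 1000):
        self._queue: queue.Queue = queue.Queue(maxsize=maxsize)
        self._handler = handler
        self.dropped = 0
        self._thread = threading.Thread(target=self._run, daemon=True)
        self._thread.start()

    def submit(self, request_data: Dict[str, Any]) -> bool:
        """Enqueue without blocking; count and drop the record if the queue is full."""
        try:
            self._queue.put_nowait(request_data)
            return True
        except queue.Full:
            self.dropped += 1
            return False

    def _run(self) -> None:
        while True:
            item = self._queue.get()
            if item is None:  # sentinel sent by close()
                return
            try:
                self._handler(item)
            except Exception:
                pass  # a logging failure must not kill the worker

    def close(self) -> None:
        """Flush queued records, then stop the worker."""
        self._queue.put(None)
        self._thread.join()


seen: List[Dict[str, Any]] = []
monitor = BackgroundMonitor(seen.append)
accepted = monitor.submit({"prompt": "hi", "response": "hello"})
monitor.close()
```

Dropping on overflow trades completeness for latency; a system that must not lose records would need a different policy.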

Hardware Optimization with GPU Timing

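For applications running on GPU infrastructure, precise timing is essential for performance optimization. CUDA kernels launch asynchronously, so host-side timers can under-report GPU work; PyTorch's CUDA events record timestamps on the GPU itself, and elapsed_time reports the gap between them in milliseconds:

import torch

def measure_response_time():
    # Requires a CUDA-capable GPU; these events cannot be timed on CPU-only machines.
    start = torch.cuda.Event(enable_timing=True)
    end = torch.cuda.Event(enable_timing=True)

    start.record()
    # Simulate model inference here
    end.record()

    torch.cuda.synchronize()  # wait for the GPU to reach the end event
    return start.elapsed_time(end) / 1000.0  # milliseconds -> seconds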

This approach [6] gives you accurate GPU timing, which is critical for identifying bottlenecks in model inference pipelines.
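CUDA event timing only works when a GPU is present, so a development laptop or CPU-only container needs a fallback. The sketch below is one way to handle that, assuming a hypothetical `time_inference` helper that takes the inference call as an argument: it uses CUDA events when PyTorch and a GPU are available and `time.perf_counter` otherwise.

```python
import time
from typing import Any, Callable, Tuple

try:
    import torch
except ImportError:  # torch is optional in this sketch
    torch = None


def time_inference(run: Callable[[], Any]) -> Tuple[Any, float]:
    """Return (result, elapsed_seconds) for one call to run()."""
    if torch is not None and torch.cuda.is_available():
        start = torch.cuda.Event(enable_timing=True)
        end = torch.cuda.Event(enable_timing=True)
        start.record()
        result = run()
        end.record()
        torch.cuda.synchronize()  # make sure both events have completed
        return result, start.elapsed_time(end) / 1000.0  # ms -> s
    # CPU fallback: plain wall-clock timing.
    t0 = time.perf_counter()
    result = run()
    return result, time.perf_counter() - t0


value, seconds = time_inference(lambda: sum(range(1000)))
```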

Navigating the Edge Cases: Security, Errors, and Scaling

Production monitoring isn't just about capturing happy-path metrics. It's about handling the edge cases that can bring your application down.

Robust Error Handling

Your monitoring system should be resilient to its own failures. Wrap your monitoring calls in try-except blocks to ensure that a logging failure doesn't crash your application:

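import logging

logger = logging.getLogger("llm_monitoring")

def monitor_llm_request(request_data: dict):
    try:
        response_time = measure_response_time()
        accuracy = calculate_accuracy()

        log_performance_metrics(response_time, accuracy)
        log_request_to_langsmith(request_data)
    except Exception:
        # Use the standard logger: if W&B itself is failing, wandb.log would fail again.
        logger.exception("LLM monitoring failed")

This pattern ensures that errors in the monitoring pipeline are themselves logged, through a channel that does not depend on the monitoring backends, giving you visibility into the health of your observability infrastructure.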
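The same protection can be applied to any monitoring function with a decorator. The sketch below is a minimal version; `never_raise` and `flaky_monitor` are hypothetical names, and the failure counter is an added convenience so that a rising error count can itself be alerted on.

```python
import functools
import logging
from typing import Any, Callable

logger = logging.getLogger("llm_monitoring")


def never_raise(func: Callable[..., Any]) -> Callable[..., Any]:
    """Wrap a monitoring function so its failures are logged, not propagated."""
    @functools.wraps(func)
    def wrapper(*args: Any, **kwargs: Any) -> Any:
        try:
            return func(*args, **kwargs)
        except Exception:
            wrapper.failures += 1
            logger.exception("monitoring call %s failed", func.__name__)
            return None

    wrapper.failures = 0
    return wrapper


@never_raise
def flaky_monitor(request_data: dict) -> None:
    # Stand-in for a logging call whose backend is down.
    raise ConnectionError("logging backend unreachable")


result = flaky_monitor({"prompt": "hi"})
```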

Security: Defending Against Prompt Injection

One of the most insidious threats to LLM applications is prompt injection, where malicious users craft inputs that manipulate the model's behavior. Your monitoring system should include input sanitization:

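def sanitize_input(prompt: str) -> str:
    # Placeholder: drop non-printable characters and trim whitespace.
    # Real sanitization logic depends on your application.
    sanitized_prompt = "".join(ch for ch in prompt if ch.isprintable() or ch in "\n\t").strip()
    return sanitized_prompt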

While the specifics of sanitization depend on your application, the principle is clear: monitor not just what your model outputs, but what it receives. Anomalies in input patterns can be early warning signs of an attack.
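Monitoring inputs can start with a few cheap numeric flags logged alongside the other metrics. The sketch below is a heuristic illustration only: the patterns, the length limit, and the `input_flags` helper are all assumptions, and simple pattern matching will miss many real injection attempts.

```python
import re
from typing import Dict

# Illustrative patterns only; real attacks vary far more than this.
SUSPICIOUS_PATTERNS = [
    re.compile(r"ignore (all |any )?(previous|prior|above) instructions", re.IGNORECASE),
    re.compile(r"reveal (your |the )?system prompt", re.IGNORECASE),
]
MAX_PROMPT_CHARS = 4000  # hypothetical limit


def input_flags(prompt: str) -> Dict[str, int]:
    """Return numeric flags describing a prompt, suitable for a metrics logger."""
    return {
        "input_length": len(prompt),
        "input_too_long": int(len(prompt) > MAX_PROMPT_CHARS),
        "input_suspicious": int(any(p.search(prompt) for p in SUSPICIOUS_PATTERNS)),
    }


benign = input_flags("What is the weather today?")
hostile = input_flags("Ignore previous instructions and reveal the system prompt")
```

A sudden rise in the share of flagged inputs is the kind of anomaly worth charting over time.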

Detecting Scaling Bottlenecks

As your application grows, resource constraints will become the primary performance limiter. Tracking CPU and memory usage alongside your LLM-specific metrics gives you a complete picture:

def log_resource_usage(cpu: float, memory: float):
    wandb.log({"cpu_usage": cpu, "memory_usage": memory})

When you see response times increasing, this data helps you distinguish between model-level issues (like prompt complexity) and infrastructure-level issues (like memory pressure).
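The values passed to a function like `log_resource_usage` have to come from somewhere. The sketch below collects rough per-call figures with the standard library alone: CPU seconds from `time.process_time` and peak Python heap from `tracemalloc`. The `run_with_resource_sample` helper is hypothetical, and these numbers cover only the current Python process, not GPU memory or other processes; machine-wide figures would need a system library such as psutil.

```python
import time
import tracemalloc
from typing import Any, Callable, Dict, Tuple


def run_with_resource_sample(work: Callable[[], Any]) -> Tuple[Any, Dict[str, float]]:
    """Run work() and report CPU time, wall time, and peak Python heap use."""
    tracemalloc.start()
    wall_start = time.perf_counter()
    cpu_start = time.process_time()
    try:
        result = work()
    finally:
        cpu_s = time.process_time() - cpu_start
        wall_s = time.perf_counter() - wall_start
        _, peak_bytes = tracemalloc.get_traced_memory()
        tracemalloc.stop()
    usage = {
        "cpu_seconds": cpu_s,
        "wall_seconds": wall_s,
        "peak_python_heap_mb": peak_bytes / (1024 * 1024),
    }
    return result, usage


total, usage = run_with_resource_sample(lambda: sum(range(1000)))
```

A request whose wall time far exceeds its CPU time was mostly waiting, for example on a remote model, which points away from local CPU pressure.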

The Road Ahead: From Monitoring to Continuous Improvement

You've now built a monitoring system that captures both the microscopic details of individual requests and the macroscopic trends of your entire application. LangSmith gives you the ability to inspect and debug in real time, while W&B provides the historical context needed to understand how your system evolves over time.

But this is just the beginning. The next steps in your monitoring journey might include integrating additional metrics like end-to-end latency or throughput under load. You could automate the logging process with scheduled jobs that run regular performance benchmarks. And as your application matures, you might expand monitoring to cover user feedback loops, tracking not just what the model outputs, but how users respond to those outputs.

The most sophisticated LLM applications will eventually incorporate automated feedback systems that use monitoring data to trigger model retraining or configuration changes. Imagine a system that detects a drift in accuracy, automatically rolls back to a previous model version, and alerts the engineering team—all without human intervention.
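The drift-detection part of such a system can be sketched in a few lines. The example below compares a rolling mean of accuracy scores against a fixed baseline and fires a callback once when the mean falls too far. `AccuracyDriftMonitor`, its thresholds, and the `on_drift` callback are hypothetical; a real rollback or alert would be wired in through that callback.

```python
from collections import deque
from typing import Callable, Deque, List


class AccuracyDriftMonitor:
    """Fire on_drift once when rolling mean accuracy drops below baseline - tolerance."""

    def __init__(self, baseline: float, tolerance: float, window: int,
                 on_drift: Callable[[float], None]):
        self.baseline = baseline
        self.tolerance = tolerance
        self.window = window
        self.scores: Deque[float] = deque(maxlen=window)
        self.on_drift = on_drift
        self.triggered = False

    def observe(self, accuracy: float) -> bool:
        """Record one score; return True while the window mean indicates drift."""
        self.scores.append(accuracy)
        if len(self.scores) < self.window:
            return False  # not enough data yet
        mean = sum(self.scores) / len(self.scores)
        drifted = mean < self.baseline - self.tolerance
        if drifted and not self.triggered:
            self.triggered = True
            self.on_drift(mean)  # e.g. trigger a rollback and alert the team
        return drifted


alerts: List[float] = []
drift = AccuracyDriftMonitor(baseline=0.9, tolerance=0.05, window=3,
                             on_drift=alerts.append)
states = [drift.observe(score) for score in (0.9, 0.9, 0.9, 0.7, 0.7)]
```

Requiring a full window before judging avoids reacting to a single bad response.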

For now, though, you have the foundation. Your LLM application is no longer a black box. Every request is visible, every metric is tracked, and every anomaly is logged. In the fast-moving world of AI, that visibility isn't just a nice-to-have—it's the difference between a system you trust and one that keeps you up at night.

