
How to Monitor LLM Apps with LangSmith and Weights & Biases


IA Academy · April 3, 2026 · 6 min read · 1,140 words
This article was generated by Daily Neural Digest's autonomous neural pipeline: multi-source verified, fact-checked, and quality-scored.



📺 Watch: Intro to Large Language Models (video by Andrej Karpathy)


Introduction & Architecture

In 2026, large language models (LLMs) have become integral components of numerous applications ranging from customer service chatbots to advanced research tools. Monitoring these systems is crucial for maintaining performance, ensuring security, and optimizing resource usage. This tutorial will guide you through setting up a robust monitoring system using LangSmith and Weights & Biases (W&B), two powerful tools designed specifically for LLMs.

LangSmith provides an extensive suite of features tailored to the unique needs of language models, including detailed logs, performance metrics, and interactive debugging capabilities. W&B offers comprehensive tracking of experiments, model versions, and deployment status, making it easier to manage multiple iterations and configurations efficiently.

The architecture we will build involves integrating LangSmith for real-time monitoring and diagnostics of LLM applications, while leveraging [1] W&B for long-term experiment tracking and version control. This dual approach ensures that developers can both respond quickly to immediate issues and maintain a clear record of model evolution over time.

Prerequisites & Setup

To follow this tutorial, you need Python installed on your system along with the necessary libraries. Ensure you have the latest stable versions of LangSmith and Weights & Biases:

pip install langsmith wandb

Additionally, you will require an API key for both services, which you can generate from each account's settings page.
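Both SDKs can pick up their keys from environment variables when none is passed explicitly. The variable names below are the conventional ones (an assumption on my part; confirm them against each platform's current docs):

```python
import os

# Assumed variable names (WANDB_API_KEY, LANGSMITH_API_KEY); verify them
# in each platform's settings page before relying on them.
os.environ.setdefault("WANDB_API_KEY", "your-wandb-key")
os.environ.setdefault("LANGSMITH_API_KEY", "your-langsmith-key")
```

Setting keys in the environment keeps them out of source control, which matters once this code leaves your laptop.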

These tools are chosen over alternatives like TensorBoard or Prometheus due to their specialized features and ease of integration with LLMs. They provide a more streamlined experience for developers working specifically on language models, offering tailored metrics and logging capabilities that generic monitoring solutions lack.

Core Implementation: Step-by-Step

Initializing W&B Project

First, initialize your W&B project:

import wandb

wandb.login()  # Ensure you have logged in with your API key

# Initialize a new or existing W&B project
wandb.init(project="llm-monitoring", name="initial-run")

This sets up the basic infrastructure for tracking experiments. The project parameter specifies which W&B project to use, and name labels this specific run.

Logging Model Performance Metrics

Next, integrate performance metrics logging:

def log_performance_metrics(response_time: float, accuracy: float):
    wandb.log({"response_time": response_time, "accuracy": accuracy})

This function logs key performance indicators such as response time and model accuracy to W&B. These metrics are critical for understanding how the LLM performs under different conditions.

Integrating LangSmith

Now, integrate LangSmith for detailed monitoring:

from langsmith import Client

client = Client(api_key="your_api_key")  # or set LANGSMITH_API_KEY in the environment

def log_request_to_langsmith(request_data: dict):
    # Record the request as a run so it appears in the LangSmith UI.
    # Minimal example; create_run accepts many more fields.
    client.create_run(
        name="llm-request",
        run_type="llm",
        inputs={"prompt": request_data.get("prompt")},
        outputs={"response": request_data.get("response")},
    )

Here, log_request_to_langsmith sends each request to the LangSmith API, allowing you to track and analyze requests in real-time. This is invaluable for diagnosing issues as they occur.

Combining W&B and LangSmith

Finally, combine both tools by logging a comprehensive set of metrics:

def monitor_llm_request(request_data: dict):
    response_time = measure_response_time()  # Assume this function measures the time taken to process the request.
    accuracy = calculate_accuracy()          # Assume this function calculates model accuracy based on the response.

    log_performance_metrics(response_time, accuracy)
    log_request_to_langsmith(request_data)

# Example usage
request_data = {"prompt": "What is the weather today?", "response": "Sunny"}
monitor_llm_request(request_data)

This setup ensures that every request to your LLM application is logged both in W&B for long-term analysis and in LangSmith for immediate feedback.
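The helpers measure_response_time and calculate_accuracy were assumed above. A minimal, self-contained sketch of both (wall-clock timing plus exact-match accuracy against reference answers; the signatures are my own, not from any SDK) might look like:

```python
import time

def measure_response_time(fn, *args) -> float:
    # Wall-clock seconds for a single call; swap in your model call.
    start = time.perf_counter()
    fn(*args)
    return time.perf_counter() - start

def calculate_accuracy(responses: list[str], references: list[str]) -> float:
    # Exact-match accuracy; real LLM evaluation usually needs fuzzier scoring.
    matches = sum(r.strip() == ref.strip() for r, ref in zip(responses, references))
    return matches / len(references) if references else 0.0
```

These take arguments, unlike the zero-argument placeholders above; adapt whichever shape fits your pipeline.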

Configuration & Production Optimization

To take this monitoring system from a script to production, consider the following configurations:

Batch Processing with W&B

To tune batch-related settings such as batch size and latency thresholds across runs, use W&B's sweep feature for hyperparameter search:

sweep_config = {
    "method": "grid",
    "parameters": {
        "batch_size": {"values": [16, 32, 64]},
        "response_time_threshold": {"values": [0.5, 1.0, 1.5]}
    }
}

sweep_id = wandb.sweep(sweep_config, project="llm-monitoring")
wandb.agent(sweep_id, function=run_experiment)  # run_experiment is your own entry point that reads wandb.config

This allows you to experiment with different batch sizes and response time thresholds efficiently.
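A grid sweep enumerates the full Cartesian product of the parameter values, so the config above yields nine trials. You can verify that locally without touching the W&B service:

```python
from itertools import product

batch_sizes = [16, 32, 64]
thresholds = [0.5, 1.0, 1.5]

# The same combinations a grid sweep would schedule: 3 x 3 = 9 trials.
trials = [
    {"batch_size": b, "response_time_threshold": t}
    for b, t in product(batch_sizes, thresholds)
]
```

Knowing the trial count up front helps you budget compute before launching agents.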

Asynchronous Processing

To handle asynchronous processing of requests:

import asyncio

async def async_monitor_llm_request(request_data: dict):
    # Offload the blocking monitor call to a worker thread.
    loop = asyncio.get_running_loop()
    await loop.run_in_executor(None, monitor_llm_request, request_data)

This ensures that your monitoring system can scale to handle a high volume of concurrent requests without blocking.
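With the executor wrapper in place, many requests can be monitored concurrently via asyncio.gather. The sketch below uses a stand-in blocking function (my own placeholder, so it runs without W&B or LangSmith credentials) in place of monitor_llm_request:

```python
import asyncio

def blocking_monitor(request_data: dict) -> str:
    # Stand-in for monitor_llm_request; any blocking I/O goes here.
    return request_data["prompt"].upper()

async def monitor_many(requests: list[dict]) -> list[str]:
    # Fan each blocking call out to the default thread pool, then gather.
    loop = asyncio.get_running_loop()
    tasks = [loop.run_in_executor(None, blocking_monitor, r) for r in requests]
    return await asyncio.gather(*tasks)

results = asyncio.run(monitor_many([{"prompt": "hi"}, {"prompt": "bye"}]))
```

gather preserves input order, so results line up with the submitted requests.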

Hardware Optimization

For hardware optimization, consider using GPUs for model inference:

import torch

def measure_response_time():
    # Requires a CUDA-capable GPU; guard with torch.cuda.is_available().
    start = torch.cuda.Event(enable_timing=True)
    end = torch.cuda.Event(enable_timing=True)

    start.record()
    # Simulate model inference here
    end.record()

    torch.cuda.synchronize()
    return start.elapsed_time(end) / 1000.0  # elapsed_time is in ms; convert to seconds

This example uses PyTorch [6] to measure response times on a GPU, which is crucial for optimizing performance in production environments.

Advanced Tips & Edge Cases (Deep Dive)

Error Handling

Implement robust error handling:

def monitor_llm_request(request_data: dict):
    try:
        response_time = measure_response_time()
        accuracy = calculate_accuracy()

        log_performance_metrics(response_time, accuracy)
        log_request_to_langsmith(request_data)
    except Exception as e:
        wandb.log({"error": str(e)})  # the exception is swallowed here; re-raise if callers must see failures

This ensures that any issues encountered during monitoring are logged and can be reviewed later.
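Calls to W&B or LangSmith can also fail transiently in production. A simple hedge is retry with exponential backoff; this is a generic sketch of my own, not a feature of either SDK:

```python
import time

def with_retries(fn, attempts: int = 3, base_delay: float = 0.05):
    # Retry a callable with exponential backoff; re-raise after the
    # final attempt so hard failures still surface to the caller.
    for i in range(attempts):
        try:
            return fn()
        except Exception:
            if i == attempts - 1:
                raise
            time.sleep(base_delay * (2 ** i))
```

Wrap individual logging calls, e.g. with_retries(lambda: log_request_to_langsmith(request_data)), rather than the whole monitor function, so one flaky sink does not block the other.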

Security Risks

Be aware of potential security risks such as prompt injection:

def sanitize_input(prompt: str) -> str:
    # Minimal example: strip non-printable characters and cap length.
    # Filtering alone does not stop prompt injection; treat it as one layer.
    sanitized_prompt = "".join(ch for ch in prompt if ch.isprintable())[:4096]
    return sanitized_prompt

Sanitizing user input is critical for preventing attacks that could compromise your LLM's integrity or performance.

Scaling Bottlenecks

Monitor for scaling bottlenecks by tracking resource usage:

def log_resource_usage(cpu: float, memory: float):
    wandb.log({"cpu_usage": cpu, "memory_usage": memory})

Regularly logging CPU and memory usage helps identify when additional resources are needed to maintain performance.
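The cpu and memory values can come from wherever your platform exposes them (psutil is the portable choice). A stdlib-only sketch for Unix hosts, using names of my own choosing, is:

```python
import os
import resource

def sample_resource_usage() -> dict:
    # Unix-only: 1-minute load average plus this process's peak RSS.
    # Note ru_maxrss is kilobytes on Linux but bytes on macOS.
    load_1m, _, _ = os.getloadavg()
    peak_rss = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss
    return {"cpu_load_1m": load_1m, "peak_rss": peak_rss}
```

Feed the returned dict straight into log_resource_usage-style wandb.log calls on a timer.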

Results & Next Steps

By following this tutorial, you have set up a comprehensive monitoring system for your LLM applications using LangSmith and W&B. This setup allows you to track both real-time performance metrics and long-term experiment data efficiently.

Next steps could include:

  • Integrating additional metrics such as latency or throughput.
  • Automating the logging process with scheduled jobs.
  • Expanding monitoring to cover more aspects of your application, like user feedback loops.

For further optimization, refer to the official documentation for LangSmith and W&B for advanced configurations and best practices.


References

1. Wikipedia: Retrieval-augmented generation.
2. Wikipedia: LangChain.
3. Wikipedia: PyTorch.
4. GitHub: Shubhamsaboo/awesome-llm-apps.
5. GitHub: langchain-ai/langchain.
6. GitHub: pytorch/pytorch.
7. LangChain Pricing.