How to Monitor LLM Apps with LangSmith and Weights & Biases
Table of Contents
- Introduction & Architecture
- Prerequisites & Setup
- Core Implementation: Step-by-Step
- Configuration & Production Optimization
- Advanced Tips & Edge Cases (Deep Dive)
- Results & Next Steps
📺 Watch: Intro to Large Language Models (video by Andrej Karpathy)
Introduction & Architecture
Large language models (LLMs) have become integral components of applications ranging from customer service chatbots to advanced research tools. Monitoring these systems is crucial for maintaining performance, ensuring security, and optimizing resource usage. This tutorial walks through setting up a robust monitoring system using LangSmith and Weights & Biases (W&B), two tools well suited to LLM workflows.
LangSmith provides an extensive suite of features tailored to the unique needs of language models, including detailed logs, performance metrics, and interactive debugging capabilities. W&B offers comprehensive tracking of experiments, model versions, and deployment status, making it easier to manage multiple iterations and configurations efficiently.
The architecture we will build integrates LangSmith for real-time monitoring and diagnostics of LLM applications while leveraging W&B for long-term experiment tracking and version control. This dual approach lets developers respond quickly to immediate issues while maintaining a clear record of model evolution over time.
Prerequisites & Setup
To follow this tutorial, you need Python installed on your system along with the necessary libraries. Ensure you have the latest stable versions of LangSmith and Weights & Biases:
pip install langsmith wandb
Additionally, you will require an API key for both services:
- For LangSmith: Obtain from https://app.langchain.com/settings/api_keys.
- For W&B: Generate at https://wandb.ai/authorize.
These tools are chosen over alternatives like TensorBoard or Prometheus due to their specialized features and ease of integration with LLMs. They provide a more streamlined experience for developers working specifically on language models, offering tailored metrics and logging capabilities that generic monitoring solutions lack.
Core Implementation: Step-by-Step
Initializing W&B Project
First, initialize your W&B project:
import wandb
wandb.login() # Ensure you have logged in with your API key
# Initialize a new or existing W&B project
wandb.init(project="llm-monitoring", name="initial-run")
This sets up the basic infrastructure for tracking experiments. The project parameter specifies which W&B project to use, and name labels this specific run.
Logging Model Performance Metrics
Next, integrate performance metrics logging:
def log_performance_metrics(response_time: float, accuracy: float):
    wandb.log({"response_time": response_time, "accuracy": accuracy})
This function logs key performance indicators such as response time and model accuracy to W&B. These metrics are critical for understanding how the LLM performs under different conditions.
Integrating LangSmith
Now, integrate LangSmith for detailed monitoring:
from langsmith import Client

client = Client(api_key="your_api_key")

def log_request_to_langsmith(request_data: dict):
    # create_run records a trace entry; "llm" is the run type for model calls
    client.create_run(name="llm-request", run_type="llm", inputs=request_data)
Here, log_request_to_langsmith sends each request to the LangSmith API, allowing you to track and analyze requests in real-time. This is invaluable for diagnosing issues as they occur.
Combining W&B and LangSmith
Finally, combine both tools by logging a comprehensive set of metrics:
def monitor_llm_request(request_data: dict):
    response_time = measure_response_time()  # assumed helper: time taken to process the request
    accuracy = calculate_accuracy()  # assumed helper: model accuracy based on the response
    log_performance_metrics(response_time, accuracy)
    log_request_to_langsmith(request_data)
# Example usage
request_data = {"prompt": "What is the weather today?", "response": "Sunny"}
monitor_llm_request(request_data)
This setup ensures that every request to your LLM application is logged both in W&B for long-term analysis and in LangSmith for immediate feedback.
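The two helpers above are assumed rather than defined. A minimal CPU-side sketch of what they might look like follows; the exact-match accuracy metric and the default arguments are illustrative choices, and a GPU-based timer appears later in this tutorial:

```python
import time

def measure_response_time(fn=lambda: None):
    # Wall-clock duration of one call to the inference callable `fn`.
    start = time.perf_counter()
    fn()
    return time.perf_counter() - start

def calculate_accuracy(predictions=("Sunny",), references=("Sunny",)):
    # Fraction of responses that exactly match a reference answer.
    if not references:
        return 0.0
    matches = sum(p == r for p, r in zip(predictions, references))
    return matches / len(references)

print(calculate_accuracy(("a", "b"), ("a", "c")))  # → 0.5
```

In a real deployment you would replace the lambda with your model call and swap exact-match for a task-appropriate metric.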
Configuration & Production Optimization
To take this monitoring system from a script to production, consider the following configurations:
Batch Processing with W&B
To tune batch-processing parameters, use W&B's sweep feature, which runs a hyperparameter search over the configurations you list:
sweep_config = {
    "method": "grid",
    "parameters": {
        "batch_size": {"values": [16, 32, 64]},
        "response_time_threshold": {"values": [0.5, 1.0, 1.5]},
    },
}

sweep_id = wandb.sweep(sweep_config, project="llm-monitoring")
This allows you to experiment with different batch sizes and response time thresholds efficiently.
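The grid method enumerates every combination of the listed values, with the W&B agent launching one run per combination. Conceptually, the coverage is just a Cartesian product:

```python
from itertools import product

batch_sizes = [16, 32, 64]
thresholds = [0.5, 1.0, 1.5]

# A grid sweep covers the Cartesian product: one run per combination.
combinations = list(product(batch_sizes, thresholds))
print(len(combinations))  # → 9
```

So this sweep configuration schedules nine runs; switching "method" to "random" or "bayes" samples the space instead of exhausting it.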
Asynchronous Processing
To handle asynchronous processing of requests:
import asyncio
async def async_monitor_llm_request(request_data: dict):
    loop = asyncio.get_running_loop()
    await loop.run_in_executor(None, monitor_llm_request, request_data)
This ensures that your monitoring system can scale to handle a high volume of concurrent requests without blocking.
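To see the fan-out in action without the real logging backends, here is a self-contained sketch in which a stub stands in for monitor_llm_request; asyncio.gather dispatches the blocking calls across the default thread pool concurrently:

```python
import asyncio

def monitor_llm_request(request_data: dict) -> str:
    # Stub for the real monitoring call (blocking I/O in practice).
    return request_data["prompt"]

async def async_monitor_llm_request(request_data: dict) -> str:
    # Run the blocking monitor in the default thread-pool executor.
    loop = asyncio.get_running_loop()
    return await loop.run_in_executor(None, monitor_llm_request, request_data)

async def main() -> list:
    requests = [{"prompt": f"q{i}"} for i in range(3)]
    # gather fans the calls out concurrently and preserves order.
    return await asyncio.gather(*(async_monitor_llm_request(r) for r in requests))

results = asyncio.run(main())
print(results)  # → ['q0', 'q1', 'q2']
```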
Hardware Optimization
For hardware optimization, consider using GPUs for model inference:
import torch
def measure_response_time():
    # Requires a CUDA-capable GPU; events time work on the current stream.
    start = torch.cuda.Event(enable_timing=True)
    end = torch.cuda.Event(enable_timing=True)
    start.record()
    # Model inference happens here
    end.record()
    torch.cuda.synchronize()
    return start.elapsed_time(end) / 1000.0  # elapsed_time is in milliseconds
This example uses PyTorch to measure response times on a GPU, which is crucial for optimizing performance in production environments.
Advanced Tips & Edge Cases (Deep Dive)
Error Handling
Implement robust error handling:
def monitor_llm_request(request_data: dict):
    try:
        response_time = measure_response_time()
        accuracy = calculate_accuracy()
        log_performance_metrics(response_time, accuracy)
        log_request_to_langsmith(request_data)
    except Exception as e:
        wandb.log({"error": str(e)})
This ensures that any issues encountered during monitoring are logged and can be reviewed later.
Security Risks
Be aware of potential security risks such as prompt injection:
def sanitize_input(prompt: str):
    # Implement sanitization logic here to prevent malicious inputs
    sanitized_prompt = prompt  # placeholder: apply your own filtering rules
    return sanitized_prompt
Sanitizing user input is critical for preventing attacks that could compromise your LLM's integrity or performance.
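As one illustrative starting point (real prompt-injection defenses require more than string filtering, such as instruction/data separation and output validation), a sanitizer might strip control characters, normalize whitespace, and cap length; the limit below is an assumed value:

```python
import re

MAX_PROMPT_LENGTH = 2000  # assumed limit for this sketch

def sanitize_input(prompt: str) -> str:
    # Strip non-printing control characters that can smuggle hidden text.
    cleaned = re.sub(r"[\x00-\x08\x0b\x0c\x0e-\x1f]", "", prompt)
    # Collapse runs of whitespace and enforce a length cap.
    cleaned = " ".join(cleaned.split())
    return cleaned[:MAX_PROMPT_LENGTH]

print(sanitize_input("What is\x00 the   weather?"))  # → What is the weather?
```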
Scaling Bottlenecks
Monitor for scaling bottlenecks by tracking resource usage:
def log_resource_usage(cpu: float, memory: float):
    wandb.log({"cpu_usage": cpu, "memory_usage": memory})
Regularly logging CPU and memory usage helps identify when additional resources are needed to maintain performance.
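The cpu and memory arguments have to come from somewhere. On Unix, the standard library's resource module can supply rough process-level numbers (psutil is a common cross-platform alternative for live CPU percentages):

```python
import resource

def get_resource_usage():
    # Cumulative CPU seconds and peak memory for this process (Unix only).
    usage = resource.getrusage(resource.RUSAGE_SELF)
    cpu_seconds = usage.ru_utime + usage.ru_stime
    peak_mb = usage.ru_maxrss / 1024  # ru_maxrss is KiB on Linux, bytes on macOS
    return cpu_seconds, peak_mb

cpu, mem = get_resource_usage()
print(cpu, mem)
```

These values can then be passed straight to log_resource_usage on a timer or per request.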
Results & Next Steps
By following this tutorial, you have set up a comprehensive monitoring system for your LLM applications using LangSmith and W&B. This setup allows you to track both real-time performance metrics and long-term experiment data efficiently.
Next steps could include:
- Integrating additional metrics such as latency or throughput.
- Automating the logging process with scheduled jobs.
- Expanding monitoring to cover more aspects of your application, like user feedback loops.
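For the latency metric mentioned above, a nearest-rank percentile is a common summary to log alongside raw response times; a small self-contained sketch:

```python
import math

def percentile(values, pct):
    # Nearest-rank percentile: value at rank ceil(pct/100 * n) in sorted order.
    ordered = sorted(values)
    k = math.ceil(pct / 100 * len(ordered))
    return ordered[max(k - 1, 0)]

latencies = [0.12, 0.31, 0.08, 0.95, 0.22, 0.40, 0.18, 0.27, 0.33, 0.51]
print(percentile(latencies, 95))  # → 0.95
```

Logging p50 and p95 together (for example via wandb.log) exposes tail latency that an average would hide.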
For further optimization, refer to the official documentation for LangSmith and W&B for advanced configurations and best practices.