How to Run Large Language Models Locally with Ollama
Practical tutorial: It introduces a new way to run large language models locally, which is useful for developers and researchers.
Large language models (LLMs) have transformed how we interact with AI, but most developers rely on cloud APIs that introduce latency, privacy concerns, and recurring costs. Running LLMs locally gives you complete control over your data, eliminates API dependencies, and enables offline experimentation. Ollama is described as a developer-tools platform that lets you "run large language models locally" with a "simple CLI to download and run LLMs on your machine." As of May 20, 2026, Ollama has 171.8k stars on GitHub, with its latest version at 0.6.2 and a last commit on 2026-05-20, indicating active maintenance. This tutorial walks you through setting up Ollama, downloading models, building a production-grade REST API, and handling edge cases like memory constraints and concurrent requests.
Why Local LLMs Matter in Production
Running LLMs locally isn't just about avoiding cloud costs—it's about architectural flexibility. A large language model is a neural network trained on vast text data for natural language generation tasks. When you run these models locally, you eliminate network latency, ensure data sovereignty (critical for healthcare, finance, or legal applications), and can fine-tune models without sharing proprietary data. Ollama abstracts away the complexity of model quantization, GPU acceleration, and memory management, allowing you to focus on building applications. With a rating of 4.6 and open-source pricing, it's a compelling choice for developers who need production-ready local inference.
Prerequisites and Environment Setup
Before diving into code, ensure your system meets the minimum requirements. You'll need a machine with at least 8GB of RAM for small models (like Llama 3.2 1B) and 16GB+ for larger models (like Mistral 7B). A GPU with CUDA support is optional but significantly improves inference speed.
Installing Ollama
Ollama provides a straightforward installation script for Linux and macOS. For Windows, use the official installer from https://ollama.ai.
# Linux/macOS installation
curl -fsSL https://ollama.ai/install.sh | sh
# Verify installation
ollama --version
# Expected output: ollama version 0.6.2
After installation, start the Ollama server in the background:
# Start Ollama service (runs on localhost:11434 by default)
ollama serve &
# Check if the server is running
curl http://localhost:11434/api/tags
# Should return an empty list initially
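The same check can be scripted, which is handy in setup scripts or CI. The sketch below uses only the standard library and assumes the default address shown above; the helper names `parse_model_names` and `installed_models` are illustrative, not part of Ollama.

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434"  # default address used in this tutorial


def parse_model_names(payload: str) -> list[str]:
    """Extract model names from the JSON body returned by /api/tags."""
    data = json.loads(payload)
    return [m.get("name", "") for m in data.get("models", [])]


def installed_models(base_url: str = OLLAMA_URL, timeout: float = 2.0):
    """Return the installed model names, or None if the server is unreachable."""
    try:
        with urllib.request.urlopen(f"{base_url}/api/tags", timeout=timeout) as resp:
            return parse_model_names(resp.read().decode("utf-8"))
    except OSError:  # covers connection refused, timeouts, and URLError
        return None
```

Calling `installed_models()` returns `None` when nothing is listening on the port and an empty list on a fresh install with no models pulled yet.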
Python Environment Setup
We'll build a Python application using FastAPI for the REST API and the official ollama Python library. Create a virtual environment and install dependencies:
python3 -m venv venv
source venv/bin/activate
# Install core dependencies
pip install ollama==0.6.2 fastapi==0.115.0 uvicorn==0.30.0 pydantic==2.9.0
# For advanced features (optional)
pip install langchain==0.3.0 langchain-ollama==0.2.0
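Before moving on, it can help to confirm that the packages actually landed in the active virtual environment. This is a small sketch using the standard library's `importlib.metadata`; the function names are illustrative.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_version(package: str):
    """Return the installed version string, or None if the package is missing."""
    try:
        return version(package)
    except PackageNotFoundError:
        return None


def report(packages):
    """Map each package name to its installed version (or None)."""
    return {name: installed_version(name) for name in packages}


if __name__ == "__main__":
    # Package names taken from the pip command above
    for name, ver in report(["ollama", "fastapi", "uvicorn", "pydantic"]).items():
        print(f"{name}: {ver or 'NOT INSTALLED'}")
```

If any line prints NOT INSTALLED, the virtual environment is probably not activated in the current shell.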
Downloading and Managing Models
Ollama supports a wide range of open-weight models. You can pull models like Llama 3.2, Llama 3.1, and Mistral directly from Ollama's registry. Models in the registry are already quantized, so the CLI downloads a ready-to-run variant: pulling a name without a tag fetches that model's default variant, and a tag (for example llama3.2:1b) selects a specific one.
Pulling a Model
Let's download Llama 3.2, a lightweight model suitable for most development machines:
# Pull the model (approximately 2-4 GB download)
ollama pull llama3.2
# List downloaded models
ollama list
# Output will show model name, size, and modification date
For production, you might want multiple models for different tasks. Pull Mistral for code generation and Llama 3.1 for general chat:
ollama pull mistral
ollama pull llama3.1
Model Management Best Practices
Ollama stores models in ~/.ollama/models/ by default. Monitor disk usage with:
# Check disk usage of Ollama models
du -sh ~/.ollama/models/
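If you want the same number inside a monitoring script, a rough Python equivalent is sketched below. It assumes the default model directory mentioned above and sums apparent file sizes, so the result can differ slightly from what du reports for on-disk blocks.

```python
from pathlib import Path


def directory_size_bytes(root: Path) -> int:
    """Sum the sizes of all regular files under root (0 if root is missing)."""
    if not root.exists():
        return 0
    return sum(p.stat().st_size for p in root.rglob("*") if p.is_file())


def format_gib(num_bytes: int) -> str:
    """Render a byte count as GiB with two decimals."""
    return f"{num_bytes / 1024 ** 3:.2f} GiB"


if __name__ == "__main__":
    models_dir = Path.home() / ".ollama" / "models"  # default location
    print(models_dir, format_gib(directory_size_bytes(models_dir)))
```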
To remove unused models and free space:
# Remove a specific model
ollama rm llama3.2
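# Remove all models (caution: irreversible)
# ollama rm expects explicit model names, so pass it every name from ollama list
ollama list | awk 'NR>1 {print $1}' | xargs -n1 ollama rm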
Building a Production-Grade REST API
Now we'll create a FastAPI application that exposes Ollama's capabilities through a well-documented REST API. This architecture is suitable for integrating local LLMs into microservices, chatbots, or data pipelines.
Core API Implementation
Create a file named app.py with the following production-ready code:
import logging
from typing import Optional, List, Dict, Any
from datetime import datetime
import ollama
from fastapi import FastAPI, HTTPException, BackgroundTasks
from pydantic import BaseModel, Field, field_validator
import uvicorn
# Configure logging for production observability
logging.basicConfig(
    level=logging.INFO,
    format="%(asctime)s - %(name)s - %(levelname)s - %(message)s"
)
logger = logging.getLogger(__name__)

# Initialize FastAPI with metadata
app = FastAPI(
    title="Local LLM Inference API",
    description="Production-ready API for running LLMs locally via Ollama",
    version="1.0.0"
)

# --- Pydantic Models for Request/Response Validation ---
class ChatMessage(BaseModel):
    # Field(...) with three dots (Ellipsis) marks the field as required
    role: str = Field(..., description="Message role: 'user', 'assistant', or 'system'")
    content: str = Field(..., description="Message content")

    @field_validator("role")
    @classmethod
    def validate_role(cls, v: str) -> str:
        allowed_roles = {"user", "assistant", "system"}
        if v not in allowed_roles:
            raise ValueError(f"Role must be one of {allowed_roles}")
        return v

class ChatRequest(BaseModel):
    model: str = Field(default="llama3.2", description="Model name to use")
    messages: List[ChatMessage] = Field(..., description="Conversation history")
    temperature: float = Field(default=0.7, ge=0.0, le=2.0, description="Sampling temperature")
    max_tokens: Optional[int] = Field(default=None, ge=1, description="Maximum tokens to generate")
    stream: bool = Field(default=False, description="Enable streaming response")

class ChatResponse(BaseModel):
    model: str
    message: ChatMessage
    total_duration: float
    tokens_per_second: float
    timestamp: str
# --- In-memory model cache for performance ---
model_cache: Dict[str, Dict[str, Any]] = {}

def get_model_info(model_name: str) -> Dict[str, Any]:
    """Fetch model metadata with caching to avoid repeated API calls."""
    if model_name not in model_cache:
        try:
            # Ollama's show command returns model details
            model_info = ollama.show(model_name)
            model_cache[model_name] = {
                "name": model_name,
                "size": model_info.get("size", 0),
                "modified_at": model_info.get("modified_at", ""),
                "details": model_info.get("details", {})
            }
            logger.info(f"Cached model info for {model_name}")
        except Exception as e:
            logger.error(f"Failed to fetch model info for {model_name}: {e}")
            raise HTTPException(status_code=404, detail=f"Model '{model_name}' not found")
    return model_cache[model_name]
# --- API Endpoints ---
# Used to report the installed client version (can live with the top-level imports)
from importlib.metadata import version as package_version

@app.get("/health", tags=["System"])
async def health_check():
    """Health check endpoint for monitoring."""
    try:
        # Verify Ollama server is running
        ollama.list()
        return {
            "status": "healthy",
            # Version of the installed ollama Python package, not of the server
            "ollama_version": package_version("ollama"),
            "timestamp": datetime.utcnow().isoformat()
        }
    except Exception as e:
        logger.error(f"Health check failed: {e}")
        raise HTTPException(status_code=503, detail="Ollama server not reachable")

@app.get("/models", tags=["Models"])
async def list_models():
    """List all available models with metadata."""
    try:
        models = ollama.list()
        return {
            "models": [
                {
                    # Recent ollama client releases expose the name under "model"
                    "name": model["model"],
                    "size": model["size"],
                    "modified_at": model["modified_at"]
                }
                for model in models.get("models", [])
            ],
            "count": len(models.get("models", []))
        }
    except Exception as e:
        logger.error(f"Failed to list models: {e}")
        raise HTTPException(status_code=500, detail="Failed to retrieve models")
@app.post("/chat", response_model=ChatResponse, tags=["Inference"])
async def chat_completion(request: ChatRequest, background_tasks: BackgroundTasks):
    """
    Generate a chat completion using a local LLM.
    This endpoint handles conversation history, temperature control, and
    returns performance metrics for monitoring.
    """
    # Validate model exists and warm up cache
    model_info = get_model_info(request.model)
    # Prepare messages for Ollama API
    messages = [{"role": msg.role, "content": msg.content} for msg in request.messages]
    start_time = datetime.utcnow()
    try:
        # Call Ollama's chat API
        response = ollama.chat(
            model=request.model,
            messages=messages,
            options={
                "temperature": request.temperature,
                "num_predict": request.max_tokens if request.max_tokens else -1
            }
        )
        end_time = datetime.utcnow()
        duration_ms = (end_time - start_time).total_seconds() * 1000
        # Extract response content
        assistant_message = response["message"]["content"]
        total_tokens = response.get("eval_count", 0)
        # Calculate tokens per second for performance monitoring
        tokens_per_second = total_tokens / (duration_ms / 1000) if duration_ms > 0 else 0
        logger.info(
            f"Chat completion - model: {request.model}, "
            f"tokens: {total_tokens}, "
            f"duration: {duration_ms:.2f}ms, "
            f"tokens/sec: {tokens_per_second:.2f}"
        )
        # Schedule cache cleanup in background (optional)
        background_tasks.add_task(cleanup_old_cache_entries)
        return ChatResponse(
            model=request.model,
            message=ChatMessage(role="assistant", content=assistant_message),
            total_duration=duration_ms,
            tokens_per_second=tokens_per_second,
            timestamp=end_time.isoformat()
        )
    except ollama.ResponseError as e:
        logger.error(f"Ollama API error: {e.status_code} - {e.error}")
        raise HTTPException(status_code=e.status_code, detail=e.error)
    except Exception as e:
        logger.error(f"Unexpected error during chat: {e}")
        raise HTTPException(status_code=500, detail="Internal inference error")
# Request body for /generate. Pydantic's Field() belongs on model attributes,
# not on endpoint function parameters, so the inputs are grouped in a model.
class GenerateRequest(BaseModel):
    prompt: str = Field(..., description="Input prompt")
    model: str = Field(default="llama3.2", description="Model name")
    temperature: float = Field(default=0.7, ge=0.0, le=2.0)

@app.post("/generate", tags=["Inference"])
async def generate_text(request: GenerateRequest):
    """
    Simple text generation endpoint for non-conversational tasks.
    Useful for summarization, translation, or data extraction.
    """
    try:
        response = ollama.generate(
            model=request.model,
            prompt=request.prompt,
            options={"temperature": request.temperature}
        )
        return {
            "model": request.model,
            "response": response["response"],
            "eval_count": response.get("eval_count", 0),
            "eval_duration": response.get("eval_duration", 0)
        }
    except Exception as e:
        logger.error(f"Generation failed: {e}")
        raise HTTPException(status_code=500, detail="Generation failed")
# --- Background Tasks ---
def cleanup_old_cache_entries():
    """Periodically clean model cache to prevent memory leaks."""
    global model_cache
    # Simple cleanup: keep only models that were accessed recently
    # In production, use TTL-based caching with Redis
    logger.debug(f"Cache size before cleanup: {len(model_cache)} entries")
    # No-op for now; extend with TTL logic as needed

# --- Main Entry Point ---
if __name__ == "__main__":
    uvicorn.run(
        "app:app",
        host="0.0.0.0",
        port=8000,
        reload=True,  # Disable in production
        log_level="info"
    )
Running the API
Start the server and test the endpoints:
# Start the FastAPI server
python app.py
# In another terminal, test health endpoint
curl http://localhost:8000/health
# Test chat completion
curl -X POST http://localhost:8000/chat \
-H "Content-Type: application/json" \
-d '{
"model": "llama3.2",
"messages": [
{"role": "user", "content": "Explain quantum computing in simple terms"}
],
"temperature": 0.7
}'
# List available models
curl http://localhost:8000/models
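The chat request can also be sent from Python, for example from a test script. The sketch below mirrors the curl call using only the standard library; it assumes the API is listening on localhost:8000 as configured in app.py, and the helper names are illustrative.

```python
import json
import urllib.request

API_URL = "http://localhost:8000"  # where app.py listens in this tutorial


def build_chat_request(prompt: str, model: str = "llama3.2",
                       temperature: float = 0.7) -> urllib.request.Request:
    """Build (but do not send) a POST /chat request matching ChatRequest."""
    body = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "temperature": temperature,
    }
    return urllib.request.Request(
        f"{API_URL}/chat",
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def send(request: urllib.request.Request, timeout: float = 120.0) -> dict:
    """Send the request and decode the JSON response body."""
    with urllib.request.urlopen(request, timeout=timeout) as resp:
        return json.loads(resp.read().decode("utf-8"))
```

With the server running, `send(build_chat_request("Explain quantum computing in simple terms"))["message"]["content"]` would return the assistant's reply. The generous timeout is a guess; the first request to a model can be slow while it loads.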
Handling Edge Cases and Production Concerns
Memory Management
Local LLMs consume significant RAM. Ollama uses model quantization (e.g., 4-bit, 8-bit) to reduce memory footprint, but you must still monitor usage. Implement a memory guard in your API:
import psutil

def check_memory_available(required_gb: float = 4.0) -> bool:
    """Check if sufficient memory is available before loading a model."""
    memory = psutil.virtual_memory()
    available_gb = memory.available / (1024 ** 3)
    if available_gb < required_gb:
        logger.warning(f"Low memory: {available_gb:.2f}GB available, need {required_gb}GB")
        return False
    return True
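To pick a sensible value for required_gb, a back-of-the-envelope estimate is often enough: the weights take roughly parameters times bits per weight divided by 8 bytes. The sketch below applies that arithmetic; the 1.2 overhead factor is an assumption standing in for the KV cache and runtime buffers, not a number from Ollama.

```python
def estimate_model_memory_gb(params_billion: float, bits_per_weight: int,
                             overhead: float = 1.2) -> float:
    """Rough RAM estimate in GiB for a quantized model.

    weights (bytes) = parameters * bits_per_weight / 8
    overhead is an assumed multiplier for KV cache and runtime buffers.
    """
    weight_bytes = params_billion * 1e9 * bits_per_weight / 8
    return weight_bytes * overhead / (1024 ** 3)


if __name__ == "__main__":
    # A 7B model at 4-bit vs. 16-bit weights
    print(f"7B @ 4-bit:  {estimate_model_memory_gb(7, 4):.2f} GiB")
    print(f"7B @ 16-bit: {estimate_model_memory_gb(7, 16):.2f} GiB")
```

Treat the result as a lower bound to sanity-check against; long contexts and concurrent requests raise real usage.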
Concurrent Request Handling
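Depending on its configuration (see the OLLAMA_NUM_PARALLEL setting), Ollama may process requests for a model one at a time, and a blocking ollama.chat call inside an async endpoint stalls FastAPI's event loop in the meantime. To keep the API responsive under concurrent requests, run inference in a pool of worker threads:
from concurrent.futures import ThreadPoolExecutor
from functools import partial
import asyncio

# Create a thread pool for concurrent inference
executor = ThreadPoolExecutor(max_workers=4)

async def async_chat(model: str, messages: list, options: dict):
    """Run Ollama chat in a separate thread to avoid blocking."""
    loop = asyncio.get_running_loop()
    # run_in_executor forwards positional arguments only, and options is not
    # the third positional parameter of ollama.chat, so bind keywords first
    call = partial(ollama.chat, model=model, messages=messages, options=options)
    return await loop.run_in_executor(executor, call)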
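A thread pool keeps the event loop free, but it does not by itself stop too many inference calls from piling onto the machine. One possible refinement, sketched here with a stand-in for the blocking call, is to cap in-flight requests with an asyncio.Semaphore. The cap of 2 and the fake_inference function are assumptions made for the demo.

```python
import asyncio
import time

MAX_CONCURRENT = 2  # hypothetical cap; tune to your hardware


def fake_inference(prompt: str) -> str:
    """Stand-in for a blocking call such as ollama.chat (demo assumption)."""
    time.sleep(0.05)
    return prompt.upper()


async def limited_call(semaphore: asyncio.Semaphore, func, *args):
    """Run a blocking function in a worker thread while holding the semaphore."""
    async with semaphore:
        return await asyncio.to_thread(func, *args)


async def run_all(prompts):
    """Process all prompts with at most MAX_CONCURRENT running at once."""
    semaphore = asyncio.Semaphore(MAX_CONCURRENT)
    tasks = [limited_call(semaphore, fake_inference, p) for p in prompts]
    return await asyncio.gather(*tasks)  # results keep the input order


if __name__ == "__main__":
    print(asyncio.run(run_all(["a", "b", "c"])))
```

Requests beyond the cap wait at the semaphore instead of competing for memory, which tends to degrade more gracefully than letting every request start at once.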
Error Handling for Model Loading
Models can fail to load due to corruption or version mismatches. Implement retry logic:
# Requires: pip install tenacity
from tenacity import retry, stop_after_attempt, wait_exponential

@retry(stop=stop_after_attempt(3), wait=wait_exponential(multiplier=1, min=2, max=10))
def load_model_with_retry(model_name: str):
    """Check that a model is available, retrying with exponential backoff."""
    try:
        # ollama.show fetches model metadata; it fails if the model is missing
        ollama.show(model_name)
        logger.info(f"Model {model_name} is available")
    except Exception as e:
        logger.error(f"Failed to load model {model_name}: {e}")
        raise
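If you would rather not add a dependency, the same idea fits in a few lines of standard-library Python. This is a simplified sketch with its own delay schedule (base delay doubled each attempt, capped at a maximum), not a reimplementation of tenacity; the sleep parameter is injectable so the function can be tested without waiting.

```python
import time


def retry_with_backoff(func, attempts: int = 3, base_delay: float = 2.0,
                       max_delay: float = 10.0, sleep=time.sleep):
    """Call func(); on failure wait base_delay * 2**n (capped), then retry."""
    for attempt in range(attempts):
        try:
            return func()
        except Exception:
            if attempt == attempts - 1:
                raise  # out of attempts: surface the last error
            sleep(min(base_delay * 2 ** attempt, max_delay))
```

For example, `retry_with_backoff(lambda: ollama.show("llama3.2"))` would retry the metadata lookup up to three times, assuming the ollama package is imported.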
Performance Optimization and Monitoring
Benchmarking Inference Speed
Measure tokens per second to compare models and hardware configurations:
import time

def benchmark_model(model_name: str, prompt: str, iterations: int = 5):
    """Benchmark a model's inference speed."""
    times = []
    for i in range(iterations):
        start = time.time()
        response = ollama.generate(model=model_name, prompt=prompt)
        elapsed = time.time() - start
        tokens = response.get("eval_count", 0)
        times.append(tokens / elapsed if elapsed > 0 else 0)
    avg_tps = sum(times) / len(times)
    logger.info(f"Model {model_name}: Average {avg_tps:.2f} tokens/second")
    return avg_tps
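Wall-clock timing includes model loading and prompt processing, so the first iteration in particular can look much slower than steady-state generation. Ollama's responses also report eval_count and eval_duration (in nanoseconds), which isolate the generation phase. The sketch below computes throughput from those two fields; the helper names are illustrative.

```python
def tokens_per_second(eval_count: int, eval_duration_ns: int) -> float:
    """Generation throughput from Ollama's counters (duration in nanoseconds)."""
    if eval_duration_ns <= 0:
        return 0.0
    return eval_count / (eval_duration_ns / 1e9)


def summarize(samples) -> float:
    """Mean tokens/second over a list of response-like dicts."""
    rates = [
        tokens_per_second(s.get("eval_count", 0), s.get("eval_duration", 0))
        for s in samples
    ]
    return sum(rates) / len(rates) if rates else 0.0
```

Passing the responses collected in a benchmark loop to `summarize` gives a generation-only figure to compare against the wall-clock number.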
Logging and Monitoring
Integrate with structured logging for production observability:
import json
from pythonjsonlogger import jsonlogger
# Configure JSON logging for log aggregation tools
handler = logging.StreamHandler()
formatter = jsonlogger.JsonFormatter(
fmt="%(asctime)s %(name)s %(levelname)s %(message)s"
)
handler.setFormatter(formatter)
logger.addHandler(handler)
What's Next
You now have a production-ready local LLM inference system using Ollama. This setup eliminates cloud dependencies, reduces latency, and gives you full control over model selection and data privacy. To extend this tutorial:
- Add streaming support: Modify the /chat endpoint to use Server-Sent Events (SSE) for real-time token-by-token output.
- Implement model hot-swapping: Build an endpoint that dynamically loads/unloads models based on demand.
- Integrate with vector databases: Combine local LLMs with embeddings for Retrieval-Augmented Generation (RAG) pipelines.
- Explore fine-tuning: Use Ollama's Modelfile to create custom models with LoRA adapters for domain-specific tasks.
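As a starting point for the streaming item, here is a minimal sketch of the SSE wire format: each event is a "data:" line followed by a blank line. The payload keys "token" and "done" are hypothetical choices for this example. In the API, the generator could be wrapped in FastAPI's StreamingResponse with media_type="text/event-stream" and fed from ollama.chat(..., stream=True).

```python
import json
from typing import Iterable, Iterator


def sse_event(payload: dict) -> str:
    """Encode one Server-Sent Event: a 'data:' line plus a blank line."""
    return f"data: {json.dumps(payload)}\n\n"


def sse_stream(chunks: Iterable[str]) -> Iterator[str]:
    """Turn text chunks into SSE events, ending with a completion marker."""
    for chunk in chunks:
        yield sse_event({"token": chunk})
    yield sse_event({"done": True})


if __name__ == "__main__":
    for event in sse_stream(["Hello", ", ", "world"]):
        print(event, end="")
```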
For further reading, check out our guides on building RAG applications and comparing open-weight models. The Ollama ecosystem continues to evolve—as of its latest commit on 2026-05-20, the project maintains 3252 open issues, reflecting active community development. With 171.8k stars and a 4.6 rating, Ollama is a solid foundation for local LLM deployment in 2026.