How to Run LLMs Locally with Ollama
Practical tutorial: Ollama simplifies running large language models locally, which is useful for developers and researchers.
Large language models have transformed how we interact with software, but relying on cloud APIs introduces latency, privacy concerns, and recurring costs. Running models locally addresses these issues, and Ollama has emerged as the leading open-source platform for this task. As of May 25, 2026, Ollama has accumulated 172,200 stars on GitHub [1] and is actively maintained with its latest commit on 2026-05-25 [3]. The current stable release is version 0.6.2 [4], built primarily in Go [17]. This tutorial will guide you through setting up Ollama, deploying production-grade models, and integrating them into a Python application with a REST API.
Understanding Ollama's Architecture and Production Use Cases
Ollama is a software platform for running and managing large language models on local computers and through hosted cloud models [5]. It provides a command-line interface, a local REST API, model-management tools, and integrations for using open-weight models with coding assistants and other applications [5]. The platform supports models like Llama 3.2, Llama 3.1, and Mistral [7][8][9], all sourced from Ollama's model library.
In production environments, Ollama excels in scenarios requiring data privacy, low latency, or offline operation. For example, a healthcare application processing patient records can run models locally to avoid transmitting sensitive data over the internet. Similarly, a customer support chatbot for a manufacturing plant with intermittent connectivity can keep working without cloud dependencies. The platform is open source [11] and rated 4.6 [12], which makes it accessible for both prototyping and deployment.
The architecture is straightforward: Ollama runs as a background service (daemon) that exposes a REST API on localhost:11434. You interact with it via the CLI or HTTP requests. Models are downloaded and cached locally, and the service handles inference with GPU acceleration when available. This design allows you to swap models without changing your application code, as long as you maintain consistent API contracts.
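As a rough illustration of that contract, the sketch below builds a request for the local /api/generate endpoint using only the standard library. The helper names build_generate_request and ask are hypothetical, and ask assumes an Ollama service is listening on localhost:11434 with the named model already pulled. Switching models only means changing the model argument.

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434/api/generate"  # default local endpoint


def build_generate_request(model: str, prompt: str) -> urllib.request.Request:
    """Build (but do not send) a non-streaming generate request."""
    body = json.dumps({"model": model, "prompt": prompt, "stream": False})
    return urllib.request.Request(
        OLLAMA_URL,
        data=body.encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def ask(model: str, prompt: str, timeout: float = 60.0) -> str:
    """Send the request and return the generated text (needs a running service)."""
    request = build_generate_request(model, prompt)
    with urllib.request.urlopen(request, timeout=timeout) as resp:
        return json.loads(resp.read())["response"]


# Example (requires a running service and a pulled model):
# print(ask("llama3.2", "Say hello in five words."))
```

Because the model name is just a field in the JSON body, the same ask function could serve llama3.2 or mistral without other changes.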
Prerequisites and Environment Setup
Before diving into implementation, ensure your system meets these requirements:
- Operating System: Linux (Ubuntu 20.04+ recommended), macOS 12+, or Windows 10/11 (native installer; WSL2 also works)
- Hardware: Minimum 8GB RAM (16GB+ recommended for 7B+ parameter models), GPU with CUDA support optional but beneficial
- Software: Python 3.10+, pip, curl, and Git
Installing Ollama
The installation process varies by platform. On Linux, use the official install script:
curl -fsSL https://ollama.com/install.sh | sh
On macOS and Windows, download the installer from https://ollama.com/download and run it. After installation, verify that the CLI is available:
ollama --version
# Example output: ollama version is 0.6.2
If the command isn't found, add Ollama to your PATH or restart your terminal. The service should start automatically. You can check its status:
systemctl status ollama # Linux
# or
ps aux | grep ollama # macOS/Linux
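On platforms where systemctl or ps is not convenient (Windows, for example), a port probe is one possible cross-platform check. The sketch below is a minimal example with a hypothetical helper name, is_port_open; it only shows that something accepts connections on the default port 11434, not that the listener is Ollama.

```python
import socket


def is_port_open(host: str = "127.0.0.1", port: int = 11434, timeout: float = 1.0) -> bool:
    """Return True if something accepts TCP connections on host:port."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Refused, timed out, or unreachable: treat all as "not listening"
        return False
```

Calling is_port_open() with no arguments probes the default Ollama address; a False result suggests starting the service with ollama serve.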
Setting Up the Python Environment
Create a virtual environment and install the required packages:
python3 -m venv ollama_env
source ollama_env/bin/activate # On Windows: ollama_env\Scripts\activate
pip install requests fastapi uvicorn pydantic
We'll use requests for HTTP communication with Ollama's API, fastapi and uvicorn for building our own REST API, and pydantic for data validation.
Downloading and Running Models
Ollama's model management is handled through the CLI. Let's pull a few models for different use cases:
# Pull Llama 3.2 (3B parameters) - fast, suitable for simple tasks
ollama pull llama3.2
# Pull Mistral (7B parameters) - balanced performance and quality
ollama pull mistral
# Pull Llama 3.1 (8B parameters) - higher quality, more resource-intensive
ollama pull llama3.1
Each command downloads the model weights and configuration. The download size varies: Llama 3.2 is approximately 2GB, Mistral is 4.1GB, and Llama 3.1 is 4.7GB. Ensure you have sufficient disk space (at least 20GB free for multiple models).
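If you want to check disk space before pulling, a small script along these lines may help. It is only a sketch: the sizes are the approximate figures quoted above, the 2GB headroom is an arbitrary assumption, and the path argument should point at whichever directory your Ollama installation uses for model storage.

```python
import shutil

# Approximate download sizes quoted in this tutorial, in GB
MODEL_SIZES_GB = {"llama3.2": 2.0, "mistral": 4.1, "llama3.1": 4.7}


def required_gb(models, headroom_gb: float = 2.0) -> float:
    """Sum the quoted sizes for the chosen models plus some headroom."""
    return sum(MODEL_SIZES_GB[m] for m in models) + headroom_gb


def has_room(models, path: str = ".") -> bool:
    """Compare the requirement with free space on the filesystem holding `path`."""
    free_gb = shutil.disk_usage(path).free / (1024 ** 3)
    return free_gb >= required_gb(models)
```

For the three models above, required_gb returns roughly 12.8 with the default headroom, comfortably inside the 20GB recommendation.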
To verify the models are available:
ollama list
# Output example:
# NAME               ID           SIZE      MODIFIED
# llama3.2:latest    123abc...    2.0 GB    5 minutes ago
# mistral:latest     456def...    4.1 GB    10 minutes ago
# llama3.1:latest    789ghi...    4.7 GB    2 hours ago
Running a Model Interactively
Test a model directly from the terminal:
ollama run llama3.2
This opens an interactive session. Type prompts and see responses in real-time. Exit with /bye or Ctrl+D.
Building a Production-Grade Python API with Ollama
Now we'll create a FastAPI application that wraps Ollama's API, adding error handling, rate limiting, and structured responses. This is suitable for production deployments where you need to serve multiple clients.
Core Implementation
Create a file named ollama_api.py:
import requests
import json
import time
from typing import Optional, List, Dict, Any
from fastapi import FastAPI, HTTPException, BackgroundTasks
from pydantic import BaseModel, Field
import uvicorn
# Ollama API configuration
OLLAMA_BASE_URL = "http://localhost:11434"
OLLAMA_GENERATE_ENDPOINT = f"{OLLAMA_BASE_URL}/api/generate"
OLLAMA_CHAT_ENDPOINT = f"{OLLAMA_BASE_URL}/api/chat"
OLLAMA_LIST_ENDPOINT = f"{OLLAMA_BASE_URL}/api/tags"
# Initialize FastAPI app
app = FastAPI(
    title="Local LLM API",
    description="Production-grade API for running LLMs locally via Ollama",
    version="1.0.0",
)
# Request models
class GenerateRequest(BaseModel):
    model: str = Field(..., description="Model name (e.g., llama3.2, mistral)")
    prompt: str = Field(..., description="Input prompt for the model")
    system: Optional[str] = Field(None, description="System prompt to set context")
    temperature: Optional[float] = Field(0.7, ge=0.0, le=2.0, description="Sampling temperature")
    max_tokens: Optional[int] = Field(512, ge=1, le=4096, description="Maximum tokens to generate")
    stream: Optional[bool] = Field(False, description="Whether to stream the response")


class ChatMessage(BaseModel):
    role: str = Field(..., pattern="^(system|user|assistant)$")
    content: str = Field(...)


class ChatRequest(BaseModel):
    model: str = Field(..., description="Model name")
    messages: List[ChatMessage] = Field(..., min_length=1, description="Chat messages")
    temperature: Optional[float] = Field(0.7, ge=0.0, le=2.0)
    max_tokens: Optional[int] = Field(512, ge=1, le=4096)
    stream: Optional[bool] = Field(False)


# Response models
class GenerateResponse(BaseModel):
    model: str
    response: str
    tokens_generated: int
    total_duration_ms: float


class ChatResponse(BaseModel):
    model: str
    message: ChatMessage
    tokens_generated: int
    total_duration_ms: float


class ModelInfo(BaseModel):
    name: str
    size_bytes: int
    modified_at: str
# Helper function to call Ollama API
def call_ollama(endpoint: str, payload: Dict[str, Any]) -> Dict[str, Any]:
    """
    Make a request to Ollama's API with error handling and timeout.

    Args:
        endpoint: Full URL to Ollama endpoint
        payload: JSON payload for the request

    Returns:
        Parsed JSON response from Ollama

    Raises:
        HTTPException: If Ollama is unreachable or returns an error
    """
    streaming = bool(payload.get("stream", False))
    try:
        response = requests.post(
            endpoint,
            json=payload,
            timeout=60,  # 60-second timeout for model inference
            stream=streaming,  # read chunks as they arrive when streaming
        )
        response.raise_for_status()

        if not streaming:
            return response.json()

        # Streaming responses arrive as one JSON object per line; collect them all
        full_response = ""
        last_chunk: Dict[str, Any] = {}
        for line in response.iter_lines():
            if not line:
                continue
            chunk = json.loads(line.decode("utf-8"))
            # /api/generate puts text in "response"; /api/chat puts it in message.content
            full_response += chunk.get("response", "")
            full_response += chunk.get("message", {}).get("content", "")
            last_chunk = chunk
            if chunk.get("done", False):
                break
        return {
            "model": last_chunk.get("model", payload["model"]),
            "response": full_response,
            "eval_count": last_chunk.get("eval_count", 0),
            "total_duration": last_chunk.get("total_duration", 0),
        }
    except requests.exceptions.ConnectionError:
        raise HTTPException(
            status_code=503,
            detail="Ollama service is not running. Start it with 'ollama serve'.",
        )
    except requests.exceptions.Timeout:
        raise HTTPException(
            status_code=504,
            detail="Model inference timed out. Consider using a smaller model or increasing timeout.",
        )
    except requests.exceptions.RequestException as e:
        raise HTTPException(
            status_code=500,
            detail=f"Ollama API error: {str(e)}",
        )
# Health check endpoint
@app.get("/health")
def health_check():
    """Check if Ollama service is running and responsive."""
    try:
        tags = requests.get(OLLAMA_LIST_ENDPOINT, timeout=5)
        tags.raise_for_status()
        models = tags.json().get("models", [])
        # Ask the running server for its version instead of hardcoding one
        version = requests.get(f"{OLLAMA_BASE_URL}/api/version", timeout=5)
        version.raise_for_status()
        return {
            "status": "healthy",
            "ollama_version": version.json().get("version", "unknown"),
            "models_available": len(models),
            "timestamp": time.time(),
        }
    except Exception as e:
        raise HTTPException(status_code=503, detail=f"Ollama unreachable: {str(e)}")


# List available models
@app.get("/models", response_model=List[ModelInfo])
def list_models():
    """List all models available in Ollama."""
    try:
        response = requests.get(OLLAMA_LIST_ENDPOINT, timeout=5)
        response.raise_for_status()
        models_data = response.json().get("models", [])
        return [
            ModelInfo(
                name=m["name"],
                size_bytes=m["size"],
                modified_at=m["modified_at"],
            )
            for m in models_data
        ]
    except Exception as e:
        raise HTTPException(status_code=500, detail=f"Failed to list models: {str(e)}")
# Generate text (non-chat)
# Declared with def rather than async def so FastAPI runs the blocking
# requests call in its thread pool instead of stalling the event loop.
@app.post("/generate", response_model=GenerateResponse)
def generate_text(request: GenerateRequest):
    """
    Generate text using a specified model.
    This endpoint is ideal for completion tasks, summarization, and extraction.
    """
    payload = {
        "model": request.model,
        "prompt": request.prompt,
        "stream": request.stream,
        "options": {
            "temperature": request.temperature,
            "num_predict": request.max_tokens,
        },
    }
    if request.system:
        payload["system"] = request.system

    start_time = time.time()
    result = call_ollama(OLLAMA_GENERATE_ENDPOINT, payload)
    elapsed_ms = (time.time() - start_time) * 1000

    return GenerateResponse(
        model=result.get("model", request.model),
        response=result.get("response", ""),
        tokens_generated=result.get("eval_count", 0),
        total_duration_ms=elapsed_ms,
    )
# Chat completion
# Declared with def rather than async def so FastAPI runs the blocking
# requests call in its thread pool instead of stalling the event loop.
@app.post("/chat", response_model=ChatResponse)
def chat_completion(request: ChatRequest):
    """
    Chat completion endpoint supporting multi-turn conversations.
    This endpoint is ideal for chatbots and interactive applications.
    """
    messages = [{"role": msg.role, "content": msg.content} for msg in request.messages]
    payload = {
        "model": request.model,
        "messages": messages,
        "stream": request.stream,
        "options": {
            "temperature": request.temperature,
            "num_predict": request.max_tokens,
        },
    }

    start_time = time.time()
    result = call_ollama(OLLAMA_CHAT_ENDPOINT, payload)
    elapsed_ms = (time.time() - start_time) * 1000

    # Ollama returns 'message' for chat endpoint
    response_content = result.get("message", {}).get("content", result.get("response", ""))
    return ChatResponse(
        model=result.get("model", request.model),
        message=ChatMessage(role="assistant", content=response_content),
        tokens_generated=result.get("eval_count", 0),
        total_duration_ms=elapsed_ms,
    )
# Run the server
if __name__ == "__main__":
    uvicorn.run(
        "ollama_api:app",
        host="0.0.0.0",
        port=8000,
        reload=True,  # Disable in production
        log_level="info",
    )
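The streaming branch of call_ollama can be hard to follow inside the larger function. Viewed in isolation, it reduces to joining newline-delimited JSON chunks until one is marked done. The sketch below shows that idea with a hypothetical helper, collect_stream, and hand-written sample chunks standing in for real server output.

```python
import json
from typing import Iterable


def collect_stream(lines: Iterable[bytes]) -> dict:
    """Join newline-delimited JSON chunks into one response dict."""
    text = []
    final = {}
    for raw in lines:
        if not raw:  # skip blank keep-alive lines
            continue
        chunk = json.loads(raw.decode("utf-8"))
        text.append(chunk.get("response", ""))
        final = chunk
        if chunk.get("done"):
            break
    return {"response": "".join(text), "eval_count": final.get("eval_count", 0)}


# Hand-written chunks that imitate the shape of a streamed reply
sample = [
    b'{"response": "Hel", "done": false}',
    b"",
    b'{"response": "lo", "done": false}',
    b'{"response": "", "done": true, "eval_count": 2}',
]
print(collect_stream(sample))  # {'response': 'Hello', 'eval_count': 2}
```

The token count is read from the final chunk because, in this sketch, only that chunk is assumed to carry summary statistics.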
Running the API Server
Start the FastAPI server:
python ollama_api.py
The server will start on http://0.0.0.0:8000. You can access the interactive documentation at http://localhost:8000/docs.
Testing the API
Use curl to test the endpoints:
# Health check
curl http://localhost:8000/health
# List models
curl http://localhost:8000/models
# Generate text
curl -X POST http://localhost:8000/generate \
  -H "Content-Type: application/json" \
  -d '{
    "model": "llama3.2",
    "prompt": "Explain quantum computing in simple terms.",
    "temperature": 0.7,
    "max_tokens": 200
  }'
# Chat completion
curl -X POST http://localhost:8000/chat \
  -H "Content-Type: application/json" \
  -d '{
    "model": "mistral",
    "messages": [
      {"role": "system", "content": "You are a helpful assistant."},
      {"role": "user", "content": "What is the capital of France?"}
    ]
  }'
Handling Edge Cases and Production Considerations
Memory Management
Large language models consume significant RAM. A 7B parameter model like Mistral requires approximately 4-6GB of RAM at 4-bit quantization. If you're running multiple models or have limited memory, consider these strategies:
- Unload unused models: Use ollama stop <model_name> to free memory.
- Use smaller models: Llama 3.2 (3B) uses about 2GB RAM.
- Monitor memory usage: Implement a background task to check memory and warn users.
# Requires: pip install psutil
import psutil


def check_memory_usage(threshold_gb: float = 8.0):
    """Warn if available memory is below threshold; return available GB."""
    memory = psutil.virtual_memory()
    available_gb = memory.available / (1024 ** 3)
    if available_gb < threshold_gb:
        print(f"Warning: Only {available_gb:.1f}GB RAM available. Consider stopping unused models.")
    return available_gb
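The RAM figures quoted in this section follow from simple arithmetic: parameters times bits per weight, divided by eight, gives the size of the weights in bytes. The sketch below works that out. The 1.2 overhead multiplier for the KV cache and runtime buffers is an illustrative assumption, not a measured value, and real usage varies with context length.

```python
def estimate_weights_gb(params_billion: float, bits_per_weight: float) -> float:
    """Weights-only footprint: parameters x bits / 8, in decimal GB."""
    return params_billion * 1e9 * bits_per_weight / 8 / 1e9


def estimate_total_gb(params_billion: float, bits_per_weight: float, overhead: float = 1.2) -> float:
    """Apply a guessed multiplier for KV cache and runtime buffers (assumption)."""
    return estimate_weights_gb(params_billion, bits_per_weight) * overhead


print(f"7B at 4-bit, weights only: {estimate_weights_gb(7, 4):.1f} GB")  # 3.5 GB
print(f"7B at 4-bit, with overhead: {estimate_total_gb(7, 4):.1f} GB")
```

A 7B model at 4 bits therefore needs about 3.5GB for weights alone, which is consistent with the 4-6GB range above once overhead is included.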
Error Handling for Model Unavailability
When a requested model isn't pulled, Ollama returns a 404 error. Our API should handle this gracefully:
# In call_ollama, add this check before response.raise_for_status():
if response.status_code == 404:
    model = payload.get("model")
    raise HTTPException(
        status_code=404,
        detail=f"Model '{model}' not found. Pull it with 'ollama pull {model}'.",
    )
Rate Limiting and Concurrency
In production, you'll want to prevent abuse. FastAPI can integrate with slowapi for rate limiting:
pip install slowapi
Add to your app:
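from fastapi import Request
from slowapi import Limiter, _rate_limit_exceeded_handler
from slowapi.errors import RateLimitExceeded
from slowapi.util import get_remote_address

limiter = Limiter(key_func=get_remote_address)
app.state.limiter = limiter
app.add_exception_handler(RateLimitExceeded, _rate_limit_exceeded_handler)


# slowapi needs the raw Starlette request as a parameter named "request",
# so the JSON body moves to a second parameter.
@app.post("/generate")
@limiter.limit("10/minute")
def generate_text(request: Request, body: GenerateRequest):
    ...  # existing code, reading fields from body instead of request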
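To make a limit such as "10/minute" concrete, here is a minimal sliding-window limiter written with the standard library. It is not how slowapi is implemented; the class name SlidingWindowLimiter and its parameters are hypothetical, and the sketch keeps state in memory for a single process only.

```python
import time
from collections import defaultdict, deque


class SlidingWindowLimiter:
    """Allow at most `limit` calls per `window` seconds for each key."""

    def __init__(self, limit: int = 10, window: float = 60.0, clock=time.monotonic):
        self.limit = limit
        self.window = window
        self.clock = clock  # injectable so the behavior can be tested
        self.hits = defaultdict(deque)

    def allow(self, key: str) -> bool:
        now = self.clock()
        q = self.hits[key]
        # Drop timestamps that have aged out of the window
        while q and now - q[0] >= self.window:
            q.popleft()
        if len(q) >= self.limit:
            return False
        q.append(now)
        return True
```

Each client key (an IP address, for example) gets its own queue of timestamps, so one busy client does not use up another's allowance.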
GPU Acceleration
Ollama automatically uses a GPU if one is available. While a model is loaded, verify with:
ollama ps
# The PROCESSOR column shows whether the model is running on GPU, CPU, or a split
If the GPU isn't detected, ensure the NVIDIA drivers are installed and that your Ollama build includes CUDA support. On Linux, you can check:
nvidia-smi
# Should show GPU utilization when running a model
Advanced Integration: Building a RAG Pipeline
Let's extend our API to support Retrieval-Augmented Generation (RAG), which combines local LLMs with a vector database [3] for context-aware responses. This is a common production pattern for question-answering systems.
First, install additional dependencies:
pip install sentence-transformers chromadb
Create a file rag_pipeline.py:
import chromadb
from sentence_transformers import SentenceTransformer
import requests
import json
from typing import List, Dict, Any
class LocalRAGPipeline:
    """
    RAG pipeline using local embeddings and Ollama for generation.
    """

    def __init__(self, embedding_model: str = "all-MiniLM-L6-v2"):
        self.embedder = SentenceTransformer(embedding_model)
        self.chroma_client = chromadb.Client()
        # get_or_create avoids an error if the collection already exists
        self.collection = self.chroma_client.get_or_create_collection(
            name="documents",
            embedding_function=None,  # We'll provide embeddings manually
        )
        self.ollama_url = "http://localhost:11434/api/generate"

    def add_documents(self, documents: List[str], ids: List[str]):
        """
        Add documents to the vector store.

        Args:
            documents: List of text documents
            ids: Unique identifiers for each document
        """
        embeddings = self.embedder.encode(documents).tolist()
        self.collection.add(
            embeddings=embeddings,
            documents=documents,
            ids=ids,
        )
    def query(self, question: str, model: str = "llama3.2", top_k: int = 3) -> str:
        """
        Answer a question using RAG.

        Args:
            question: User's question
            model: Ollama model to use for generation
            top_k: Number of relevant documents to retrieve

        Returns:
            Generated answer with context
        """
        # Embed the question
        question_embedding = self.embedder.encode([question]).tolist()[0]

        # Retrieve relevant documents
        results = self.collection.query(
            query_embeddings=[question_embedding],
            n_results=top_k,
        )

        # Build context from retrieved documents
        context = "\n\n".join(results["documents"][0])

        # Create prompt with context
        prompt = f"""Use the following context to answer the question. If the context doesn't contain relevant information, say "I don't have enough information to answer."
Context:
{context}
Question: {question}
Answer:"""

        # Generate response using Ollama
        payload = {
            "model": model,
            "prompt": prompt,
            "stream": False,
            "options": {
                "temperature": 0.3,  # Lower temperature for factual answers
                "num_predict": 512,
            },
        }
        response = requests.post(self.ollama_url, json=payload, timeout=60)
        response.raise_for_status()
        return response.json()["response"]
# Example usage
if __name__ == "__main__":
    rag = LocalRAGPipeline()

    # Add some documents
    rag.add_documents(
        documents=[
            "Ollama is an open-source platform for running LLMs locally.",
            "The latest version of Ollama is 0.6.2 as of May 2026.",
            "Ollama supports models like Llama 3.2, Mistral, and Llama 3.1.",
        ],
        ids=["doc1", "doc2", "doc3"],
    )

    # Query
    answer = rag.query("What is Ollama and what models does it support?")
    print(f"Answer: {answer}")
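The retrieval step is the part of RAG that tends to feel opaque. Conceptually, the vector store ranks stored embeddings by similarity to the question embedding and returns the closest few. The sketch below shows that idea as a brute-force cosine-similarity search over toy 2-D vectors; Chroma itself uses an index and a configurable distance metric, so treat this as an illustration rather than a description of its internals. The names cosine and top_k are hypothetical.

```python
import math
from typing import List, Sequence, Tuple


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity of two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def top_k(query: Sequence[float], docs: List[Tuple[str, Sequence[float]]], k: int = 3) -> List[str]:
    """Return the ids of the k documents most similar to the query."""
    ranked = sorted(docs, key=lambda d: cosine(query, d[1]), reverse=True)
    return [doc_id for doc_id, _ in ranked[:k]]


# Toy 2-D vectors standing in for real embeddings
docs = [("doc1", [1.0, 0.0]), ("doc2", [0.0, 1.0]), ("doc3", [0.7, 0.7])]
print(top_k([1.0, 0.1], docs, k=2))  # ['doc1', 'doc3']
```

The query vector points mostly along the first axis, so doc1 ranks first and the diagonal doc3 second, while doc2 is left out.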
Monitoring and Logging
For production deployments, implement structured logging:
import logging
from logging.handlers import RotatingFileHandler
# Configure logging
logger = logging.getLogger("ollama_api")
logger.setLevel(logging.INFO)
handler = RotatingFileHandler("ollama_api.log", maxBytes=10*1024*1024, backupCount=5)
formatter = logging.Formatter('%(asctime)s - %(name)s - %(levelname)s - %(message)s')
handler.setFormatter(formatter)
logger.addHandler(handler)
# Add logging to endpoints
@app.post("/generate")
def generate_text(request: GenerateRequest):
    logger.info(f"Generate request: model={request.model}, prompt_length={len(request.prompt)}")
    # ... existing code that builds `result` and `elapsed_ms`
    logger.info(f"Generate response: tokens={result.get('eval_count', 0)}, duration_ms={elapsed_ms:.0f}")
    # ... return the GenerateResponse as before
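If the logs are going to be parsed by another tool, emitting one JSON object per line is a common approach. The sketch below is one possible formatter built on the standard logging module; the class name JsonFormatter and the convention of passing extra values under a "fields" key are assumptions made for this example.

```python
import json
import logging


class JsonFormatter(logging.Formatter):
    """Render each record as one JSON object per line."""

    def format(self, record: logging.LogRecord) -> str:
        entry = {
            "time": self.formatTime(record),
            "logger": record.name,
            "level": record.levelname,
            "message": record.getMessage(),
        }
        # Values passed via logger.info(..., extra={"fields": {...}})
        entry.update(getattr(record, "fields", {}))
        return json.dumps(entry)


demo_logger = logging.getLogger("ollama_api.demo")
demo_logger.setLevel(logging.INFO)
demo_handler = logging.StreamHandler()
demo_handler.setFormatter(JsonFormatter())
demo_logger.addHandler(demo_handler)
demo_logger.info("generate", extra={"fields": {"model": "llama3.2", "tokens": 42}})
```

The same formatter could be attached to the RotatingFileHandler configured above in place of the plain-text formatter.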
Performance Optimization Tips
- Model Quantization: Use quantized model tags for faster inference and lower memory usage. Exact tag names are listed on each model's page in the Ollama library, so confirm the one you need there; for example:
ollama pull llama3.2:3b-instruct-q4_K_M
- Batch Processing: For multiple prompts, batch them to reduce overhead:
# Instead of individual requests, combine prompts
prompts = ["Prompt 1", "Prompt 2", "Prompt 3"]
for prompt in prompts:
    # Process sequentially or in parallel with asyncio
    pass
- Connection Pooling: Reuse HTTP connections:
import requests
from requests.adapters import HTTPAdapter
from urllib3.util.retry import Retry

session = requests.Session()
retries = Retry(total=3, backoff_factor=1)
session.mount('http://', HTTPAdapter(max_retries=retries, pool_connections=10, pool_maxsize=10))
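For the batch-processing tip, one simple way to run several prompts side by side is a thread pool. The sketch below uses a stand-in function, fake_generate, in place of a real call to the API; run_batch is a hypothetical helper. Whether concurrent requests actually speed things up depends on how the Ollama server is configured and on the available hardware.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, List


def run_batch(prompts: List[str], generate: Callable[[str], str], max_workers: int = 4) -> List[str]:
    """Run `generate` over prompts concurrently, preserving input order."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(generate, prompts))


# Stand-in for a function that would POST each prompt to the API
def fake_generate(prompt: str) -> str:
    return prompt.upper()


print(run_batch(["prompt 1", "prompt 2", "prompt 3"], fake_generate))
# ['PROMPT 1', 'PROMPT 2', 'PROMPT 3']
```

Because pool.map returns results in input order, each answer lines up with the prompt that produced it.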
Conclusion
Ollama provides a robust, open-source platform for running large language models locally, with 172,200 GitHub stars and active development as of May 2026 [1]. Its version 0.6.2 [4] offers a stable foundation for building production applications that require data privacy, low latency, or offline capabilities. By wrapping Ollama's API with FastAPI, you gain structured error handling, rate limiting, and monitoring capabilities essential for production deployments.
The RAG pipeline example demonstrates how to combine local embeddings with Ollama for context-aware question answering, a pattern increasingly used in enterprise applications. As research continues on topics like forecasting downstream performance of LLMs with proxy metrics [30][31] and understanding biases in multimodal LLMs [35][36], the ability to run models locally will become even more valuable for experimentation and fine-tuning.
What's Next
- Explore Ollama's Modelfile to create custom models with specific system prompts and parameters
- Integrate with LangChain for more complex agent workflows
- Set up monitoring with Prometheus and Grafana for production metrics
- Experiment with different quantization levels to balance performance and quality
- Consider contributing to Ollama's open-source project on GitHub
For further reading, check out our guides on model optimization techniques and building production LLM applications.
References