
How to Run Large Language Models Locally with Ollama

A practical tutorial on running large language models locally with Ollama, aimed at developers and researchers.

BlogIA Academy · May 20, 2026 · 11 min read · 2,152 words
This article was generated by Daily Neural Digest's autonomous neural pipeline: multi-source verified, fact-checked, and quality-scored.

Large language models (LLMs) have transformed how we interact with AI, but most developers rely on cloud APIs that introduce latency, privacy concerns, and recurring costs. Running LLMs locally gives you complete control over your data, eliminates API dependencies, and enables offline experimentation. Ollama is described as a developer-tools platform that lets you "run large language models locally" with a "simple CLI to download and run LLMs on your machine." As of May 20, 2026, Ollama has 171.8k stars on GitHub, its latest version is 0.6.2, and its last commit landed on 2026-05-20, indicating active maintenance. This tutorial walks you through setting up Ollama, downloading models, building a production-grade REST API, and handling edge cases like memory constraints and concurrent requests.

Why Local LLMs Matter in Production

Running LLMs locally isn't just about avoiding cloud costs; it's about architectural flexibility. A large language model is a neural network trained on vast text data for natural language generation tasks. When you run these models locally, you eliminate network latency, ensure data sovereignty (critical for healthcare, finance, or legal applications), and can fine-tune models without sharing proprietary data. Ollama abstracts away the complexity of model quantization, GPU acceleration, and memory management, allowing you to focus on building applications. With a 4.6 rating and free, open-source distribution, it's a compelling choice for developers who need production-ready local inference.

Prerequisites and Environment Setup

Before diving into code, ensure your system meets the minimum requirements. You'll need a machine with at least 8GB of RAM for small models (like Llama 3.2 1B) and 16GB+ for larger models (like Mistral 7B). A GPU with CUDA support is optional but significantly improves inference speed.
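As a rough sanity check on those RAM figures, weight memory scales with parameter count times bits per weight. The sketch below uses a hypothetical helper, `estimate_weight_gb`, counts 1 GB as 10^9 bytes, and ignores the KV cache and runtime overhead, so real usage will be higher than these numbers.

```python
def estimate_weight_gb(num_params: float, bits_per_weight: int) -> float:
    """Approximate memory needed for model weights alone, in GB (10^9 bytes)."""
    bytes_total = num_params * bits_per_weight / 8
    return bytes_total / 1e9


# Illustrative parameter counts only; check each model card for exact sizes
for name, params in [("1B model", 1e9), ("7B model", 7e9)]:
    for bits in (4, 8, 16):
        print(f"{name} at {bits}-bit: ~{estimate_weight_gb(params, bits):.1f} GB")
```

This is why a 7B model in 4-bit form can fit on a 16GB machine while the same model at 16-bit would leave little headroom.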

Installing Ollama

Ollama provides a straightforward installation script for Linux and macOS. For Windows, use the official installer from https://ollama.ai.

# Linux/macOS installation
curl -fsSL https://ollama.ai/install.sh | sh

# Verify installation
ollama --version
# Expected output (format may vary by release): ollama version is 0.6.2

After installation, start the Ollama server in the background:

# Start Ollama service (runs on localhost:11434 by default)
ollama serve &

# Check if the server is running
curl http://localhost:11434/api/tags
# Should return {"models":[]} until you pull a model
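The same check can be done from Python using only the standard library. This is a minimal sketch: `parse_model_names` and `fetch_model_names` are hypothetical helper names, and the sample body is hand-written for illustration rather than captured from a real server.

```python
import json
import urllib.request


def parse_model_names(body: str) -> list[str]:
    """Extract model names from a /api/tags JSON response body."""
    payload = json.loads(body)
    return [m.get("name", "") for m in payload.get("models", [])]


def fetch_model_names(base_url: str = "http://localhost:11434") -> list[str]:
    """Query a running Ollama server; raises URLError if it is unreachable."""
    with urllib.request.urlopen(f"{base_url}/api/tags", timeout=5) as resp:
        return parse_model_names(resp.read().decode("utf-8"))


# Offline demonstration with hand-written sample bodies
sample = '{"models": [{"name": "llama3.2:latest"}]}'
print(parse_model_names(sample))
print(parse_model_names('{"models": []}'))
```

Call `fetch_model_names()` once the server is running; separating parsing from the network call keeps the parsing logic testable without a server.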

Python Environment Setup

We'll build a Python application using FastAPI for the REST API and the official ollama Python library. Create a virtual environment and install dependencies:

python3 -m venv venv
source venv/bin/activate

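# Install core dependencies
# Note: the ollama Python client is versioned separately from the Ollama server;
# if a pinned version is not available on PyPI, install the latest release instead
pip install ollama==0.6.2 fastapi==0.115.0 uvicorn==0.30.0 pydantic==2.9.0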

# For advanced features (optional)
pip install langchain==0.3.0 langchain-ollama==0.2.0

Downloading and Managing Models

Ollama supports a wide range of open-weight models. You can pull models like Llama 3.2, Llama 3.1, and Mistral directly from Ollama's registry. Models in the registry are typically distributed in quantized form, so you don't need to quantize them yourself; a pull downloads the model's default tag unless you name a specific variant.

Pulling a Model

Let's download Llama 3.2, a lightweight model suitable for most development machines:

# Pull the model (approximately 2-4 GB download)
ollama pull llama3.2

# List downloaded models
ollama list
# Output will show model name, size, and modification date

For production, you might want multiple models for different tasks. Pull Mistral for code generation and Llama 3.1 for general chat:

ollama pull mistral
ollama pull llama3.1
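One simple way to use several models is a lookup table that routes each task to a model name. The mapping below is a hypothetical sketch: the task labels, `MODEL_FOR_TASK`, and `pick_model` are illustrative names, and the model names follow the pulls above.

```python
# Hypothetical task-to-model routing table (names match the models pulled above)
MODEL_FOR_TASK = {
    "code": "mistral",
    "chat": "llama3.1",
}
DEFAULT_MODEL = "llama3.2"  # lightweight fallback for unknown tasks


def pick_model(task: str) -> str:
    """Return the model assigned to a task, or the default if none is assigned."""
    return MODEL_FOR_TASK.get(task.strip().lower(), DEFAULT_MODEL)


print(pick_model("code"))
print(pick_model("summarize"))
```

The returned name can be passed as the `model` argument of whichever inference call you use.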

Model Management Best Practices

Ollama stores models in ~/.ollama/models/ by default. Monitor disk usage with:

# Check disk usage of Ollama models
du -sh ~/.ollama/models/
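If you want the same figure from inside a Python service, for example to expose it on a monitoring endpoint, a standard-library sketch could look like this. `directory_size_bytes` is a hypothetical helper, and the path assumes the default location mentioned above.

```python
from pathlib import Path


def directory_size_bytes(root: Path) -> int:
    """Sum the sizes of all regular files under root (0 if root is missing)."""
    if not root.exists():
        return 0
    return sum(p.stat().st_size for p in root.rglob("*") if p.is_file())


# Assumes the default model directory; adjust if you relocated it
models_dir = Path.home() / ".ollama" / "models"
size_gib = directory_size_bytes(models_dir) / (1024 ** 3)
print(f"{models_dir}: {size_gib:.2f} GiB")
```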

To remove unused models and free space:

# Remove a specific model
ollama rm llama3.2

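# Remove all models (caution: irreversible). ollama rm takes model names,
# so pass it every name printed by ollama list (skipping the header row)
ollama list | tail -n +2 | awk '{print $1}' | xargs -n 1 ollama rm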

Building a Production-Grade REST API

Now we'll create a FastAPI application that exposes Ollama's capabilities through a well-documented REST API. This architecture is suitable for integrating local LLMs into microservices, chatbots, or data pipelines.

Core API Implementation

Create a file named app.py with the following production-ready code:

import logging
from typing import Optional, List, Dict, Any
from datetime import datetime

import ollama
from fastapi import FastAPI, HTTPException, BackgroundTasks
from pydantic import BaseModel, Field, field_validator
import uvicorn

# Configure logging for production observability
logging.basicConfig(
    level=logging.INFO,
    format="%(asctime)s - %(name)s - %(levelname)s - %(message)s"
)
logger = logging.getLogger(__name__)

# Initialize FastAPI with metadata
app = FastAPI(
    title="Local LLM Inference API",
    description="Production-ready API for running LLMs locally via Ollama",
    version="1.0.0"
)

# --- Pydantic Models for Request/Response Validation ---

class ChatMessage(BaseModel):
    role: str = Field(..., description="Message role: 'user', 'assistant', or 'system'")
    content: str = Field(..., description="Message content")

    @field_validator("role")
    @classmethod
    def validate_role(cls, v: str) -> str:
        allowed_roles = {"user", "assistant", "system"}
        if v not in allowed_roles:
            raise ValueError(f"Role must be one of {allowed_roles}")
        return v

class ChatRequest(BaseModel):
    model: str = Field(default="llama3.2", description="Model name to use")
    messages: List[ChatMessage] = Field(..., description="Conversation history")
    temperature: float = Field(default=0.7, ge=0.0, le=2.0, description="Sampling temperature")
    max_tokens: Optional[int] = Field(default=None, ge=1, description="Maximum tokens to generate")
    stream: bool = Field(default=False, description="Enable streaming response")

class ChatResponse(BaseModel):
    model: str
    message: ChatMessage
    total_duration: float
    tokens_per_second: float
    timestamp: str

# --- In-memory model cache for performance ---
model_cache: Dict[str, Dict[str, Any]] = {}

def get_model_info(model_name: str) -> Dict[str, Any]:
    """Fetch model metadata with caching to avoid repeated API calls."""
    if model_name not in model_cache:
        try:
            # Ollama's show command returns model details
            model_info = ollama.show(model_name)
            model_cache[model_name] = {
                "name": model_name,
                "size": model_info.get("size", 0),
                "modified_at": model_info.get("modified_at", ""),
                "details": model_info.get("details", {})
            }
            logger.info(f"Cached model info for {model_name}")
        except Exception as e:
            logger.error(f"Failed to fetch model info for {model_name}: {e}")
            raise HTTPException(status_code=404, detail=f"Model '{model_name}' not found")
    return model_cache[model_name]

# --- API Endpoints ---

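@app.get("/health", tags=["System"])
async def health_check():
    """Health check endpoint for monitoring."""
    # Local import keeps this endpoint self-contained
    from importlib.metadata import version as package_version
    try:
        # Verify Ollama server is running
        ollama.list()
        return {
            "status": "healthy",
            # Version of the installed ollama Python client, not the server
            "ollama_version": package_version("ollama"),
            "timestamp": datetime.utcnow().isoformat()
        }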
    except Exception as e:
        logger.error(f"Health check failed: {e}")
        raise HTTPException(status_code=503, detail="Ollama server not reachable")

@app.get("/models", tags=["Models"])
async def list_models():
    """List all available models with metadata."""
    try:
        models = ollama.list()
        return {
            "models": [
                {
                    # Recent ollama clients expose the model name under "model"
                    "name": model["model"],
                    "size": model["size"],
                    "modified_at": model["modified_at"]
                }
                for model in models.get("models", [])
            ],
            "count": len(models.get("models", []))
        }
    except Exception as e:
        logger.error(f"Failed to list models: {e}")
        raise HTTPException(status_code=500, detail="Failed to retrieve models")

@app.post("/chat", response_model=ChatResponse, tags=["Inference"])
async def chat_completion(request: ChatRequest, background_tasks: BackgroundTasks):
    """
    Generate a chat completion using a local LLM.

    This endpoint handles conversation history, temperature control, and
    returns performance metrics for monitoring.
    """
    # Validate model exists and warm up cache
    model_info = get_model_info(request.model)

    # Prepare messages for Ollama API
    messages = [{"role": msg.role, "content": msg.content} for msg in request.messages]

    start_time = datetime.utcnow()

    try:
        # Call Ollama's chat API
        response = ollama.chat(
            model=request.model,
            messages=messages,
            options={
                "temperature": request.temperature,
                "num_predict": request.max_tokens if request.max_tokens else -1
            }
        )

        end_time = datetime.utcnow()
        duration_ms = (end_time - start_time).total_seconds() * 1000

        # Extract response content
        assistant_message = response["message"]["content"]
        total_tokens = response.get("eval_count", 0)

        # Calculate tokens per second for performance monitoring
        tokens_per_second = total_tokens / (duration_ms / 1000) if duration_ms > 0 else 0

        logger.info(
            f"Chat completion - model: {request.model}, "
            f"tokens: {total_tokens}, "
            f"duration: {duration_ms:.2f}ms, "
            f"tokens/sec: {tokens_per_second:.2f}"
        )

        # Schedule cache cleanup in background (optional)
        background_tasks.add_task(cleanup_old_cache_entries)

        return ChatResponse(
            model=request.model,
            message=ChatMessage(role="assistant", content=assistant_message),
            total_duration=duration_ms,
            tokens_per_second=tokens_per_second,
            timestamp=end_time.isoformat()
        )

    except ollama.ResponseError as e:
        logger.error(f"Ollama API error: {e.status_code} - {e.error}")
        raise HTTPException(status_code=e.status_code, detail=e.error)
    except Exception as e:
        logger.error(f"Unexpected error during chat: {e}")
        raise HTTPException(status_code=500, detail="Internal inference error")

class GenerateRequest(BaseModel):
    # Request body for /generate; Field belongs on model attributes,
    # not on endpoint function parameters
    prompt: str = Field(..., description="Input prompt")
    model: str = Field(default="llama3.2", description="Model name")
    temperature: float = Field(default=0.7, ge=0.0, le=2.0)

@app.post("/generate", tags=["Inference"])
async def generate_text(request: GenerateRequest):
    """
    Simple text generation endpoint for non-conversational tasks.
    Useful for summarization, translation, or data extraction.
    """
    try:
        response = ollama.generate(
            model=request.model,
            prompt=request.prompt,
            options={"temperature": request.temperature}
        )
        return {
            "model": request.model,
            "response": response["response"],
            "eval_count": response.get("eval_count", 0),
            "eval_duration": response.get("eval_duration", 0)
        }
    except Exception as e:
        logger.error(f"Generation failed: {e}")
        raise HTTPException(status_code=500, detail="Generation failed")

# --- Background Tasks ---

def cleanup_old_cache_entries():
    """Periodically clean model cache to prevent memory leaks."""
    global model_cache
    # Simple cleanup: keep only models that were accessed recently
    # In production, use TTL-based caching with Redis
    logger.debug(f"Cache size before cleanup: {len(model_cache)} entries")
    # No-op for now; extend with TTL logic as needed

# --- Main Entry Point ---

if __name__ == "__main__":
    uvicorn.run(
        "app:app",
        host="0.0.0.0",
        port=8000,
        reload=True,  # Disable in production
        log_level="info"
    )
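The cleanup_old_cache_entries function in app.py is left as a no-op. One way to add the TTL logic it mentions is a small expiring cache like the sketch below. `TTLCache` is a hypothetical class, not part of Ollama or FastAPI, and the optional `now` argument exists only to make expiry easy to test.

```python
import time
from typing import Any, Dict, Optional, Tuple


class TTLCache:
    """Tiny in-memory cache whose entries expire after ttl_seconds."""

    def __init__(self, ttl_seconds: float) -> None:
        self.ttl_seconds = ttl_seconds
        self._entries: Dict[str, Tuple[float, Any]] = {}

    def set(self, key: str, value: Any, now: Optional[float] = None) -> None:
        stamp = time.monotonic() if now is None else now
        self._entries[key] = (stamp, value)

    def get(self, key: str, now: Optional[float] = None) -> Optional[Any]:
        current = time.monotonic() if now is None else now
        entry = self._entries.get(key)
        if entry is None:
            return None
        stamp, value = entry
        if current - stamp > self.ttl_seconds:
            del self._entries[key]  # expired: drop it and report a miss
            return None
        return value

    def cleanup(self, now: Optional[float] = None) -> int:
        """Remove all expired entries and return how many were dropped."""
        current = time.monotonic() if now is None else now
        expired = [
            key for key, (stamp, _) in self._entries.items()
            if current - stamp > self.ttl_seconds
        ]
        for key in expired:
            del self._entries[key]
        return len(expired)


cache = TTLCache(ttl_seconds=10)
cache.set("llama3.2", {"size": 1}, now=0.0)
print(cache.get("llama3.2", now=5.0))   # still fresh
print(cache.get("llama3.2", now=11.0))  # expired, so this is a miss
```

In the API, `get_model_info` could read from such a cache and the background task could call `cleanup()`. For multi-process deployments an external store such as Redis, as the code comment suggests, is the more robust choice.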

Running the API

Start the server and test the endpoints:

# Start the FastAPI server
python app.py

# In another terminal, test health endpoint
curl http://localhost:8000/health

# Test chat completion
curl -X POST http://localhost:8000/chat \
  -H "Content-Type: application/json" \
  -d '{
    "model": "llama3.2",
    "messages": [
      {"role": "user", "content": "Explain quantum computing in simple terms"}
    ],
    "temperature": 0.7
  }'

# List available models
curl http://localhost:8000/models
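The chat request can also be issued from Python. The sketch below builds the same payload as the curl example using only the standard library. `build_chat_request` is a hypothetical helper, and the base URL assumes the server started above; the request is only constructed here, so the snippet runs without a server.

```python
import json
import urllib.request


def build_chat_request(base_url: str, model: str, user_text: str,
                       temperature: float = 0.7) -> urllib.request.Request:
    """Build (but do not send) a POST request for the /chat endpoint."""
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": user_text}],
        "temperature": temperature,
    }
    return urllib.request.Request(
        url=f"{base_url}/chat",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


req = build_chat_request("http://localhost:8000", "llama3.2",
                         "Explain quantum computing in simple terms")
print(req.get_method(), req.full_url)

# To send it against a running server:
# with urllib.request.urlopen(req, timeout=120) as resp:
#     print(json.loads(resp.read())["message"]["content"])
```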

Handling Edge Cases and Production Concerns

Memory Management

Local LLMs consume significant RAM. Ollama runs quantized models (e.g., 4-bit, 8-bit) to reduce the memory footprint, but you must still monitor usage. Implement a memory guard in your API (this uses the third-party psutil package, installed with pip install psutil):

import psutil

def check_memory_available(required_gb: float = 4.0) -> bool:
    """Check if sufficient memory is available before loading a model."""
    memory = psutil.virtual_memory()
    available_gb = memory.available / (1024 ** 3)
    if available_gb < required_gb:
        logger.warning(f"Low memory: {available_gb:.2f}GB available, need {required_gb}GB")
        return False
    return True

Concurrent Request Handling

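Depending on its configuration and available memory, Ollama may process requests for a model one at a time (parallelism is controlled by server settings such as the OLLAMA_NUM_PARALLEL environment variable). In addition, the synchronous ollama.chat call blocks FastAPI's event loop when invoked from an async endpoint. To keep the API responsive under concurrent requests, run inference calls in a thread pool: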

import asyncio
import functools
from concurrent.futures import ThreadPoolExecutor

# Create a thread pool for concurrent inference
executor = ThreadPoolExecutor(max_workers=4)

async def async_chat(model: str, messages: list, options: dict):
    """Run Ollama chat in a separate thread to avoid blocking the event loop."""
    loop = asyncio.get_running_loop()
    # run_in_executor forwards positional arguments only, so bind keywords with partial
    call = functools.partial(ollama.chat, model=model, messages=messages, options=options)
    return await loop.run_in_executor(executor, call)
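A thread pool keeps the event loop free, but it does not by itself cap how many requests reach the model at once. One option is an asyncio.Semaphore in front of the blocking call. The sketch below is an illustration under stated assumptions: `fake_inference` is a stand-in for a blocking call such as ollama.chat, and `MAX_CONCURRENT` is an arbitrary limit.

```python
import asyncio
import threading
import time

MAX_CONCURRENT = 2  # assumed limit; tune to your hardware


def fake_inference(prompt: str, tracker: dict) -> str:
    """Stand-in for a blocking inference call (hypothetical workload)."""
    with tracker["lock"]:
        tracker["active"] += 1
        tracker["peak"] = max(tracker["peak"], tracker["active"])
    time.sleep(0.05)  # simulate model latency
    with tracker["lock"]:
        tracker["active"] -= 1
    return f"echo: {prompt}"


async def bounded_call(sem: asyncio.Semaphore, prompt: str, tracker: dict) -> str:
    # The semaphore caps in-flight calls; to_thread keeps the event loop free
    async with sem:
        return await asyncio.to_thread(fake_inference, prompt, tracker)


async def run_batch(prompts: list[str]) -> tuple[list[str], int]:
    sem = asyncio.Semaphore(MAX_CONCURRENT)
    tracker = {"lock": threading.Lock(), "active": 0, "peak": 0}
    results = await asyncio.gather(*(bounded_call(sem, p, tracker) for p in prompts))
    return list(results), tracker["peak"]


results, peak = asyncio.run(run_batch([f"q{i}" for i in range(6)]))
print(results, "peak concurrency:", peak)
```

Requests beyond the limit wait for a slot instead of piling onto the model, which trades some latency for predictable memory use.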

Error Handling for Model Loading

Models can fail to load due to corruption or version mismatches. Retry logic helps with transient failures; the example below uses the third-party tenacity package (pip install tenacity):

from tenacity import retry, stop_after_attempt, wait_exponential

@retry(stop=stop_after_attempt(3), wait=wait_exponential(multiplier=1, min=2, max=10))
def load_model_with_retry(model_name: str):
    """Check that a model is available, retrying with exponential backoff.

    Note: ollama.show only fetches metadata; it does not load weights into memory.
    """
    try:
        ollama.show(model_name)
        logger.info(f"Model {model_name} is available")
    except Exception as e:
        logger.error(f"Failed to load model {model_name}: {e}")
        raise
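If you would rather avoid an extra dependency, the same idea can be sketched with the standard library. This is a simplified stand-in, not a reimplementation of tenacity: `retry_with_backoff` is a hypothetical helper, and the injectable `sleep` parameter exists so the demo runs instantly.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def retry_with_backoff(func: Callable[[], T], attempts: int = 3,
                       base_delay: float = 2.0, max_delay: float = 10.0,
                       sleep: Callable[[float], None] = time.sleep) -> T:
    """Call func until it succeeds or attempts are exhausted."""
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    for attempt in range(1, attempts + 1):
        try:
            return func()
        except Exception:
            if attempt == attempts:
                raise  # out of attempts: surface the last error
            # Delay doubles each time, capped at max_delay
            sleep(min(base_delay * 2 ** (attempt - 1), max_delay))
    raise AssertionError("unreachable")


# Demo: a function that fails twice, then succeeds
calls = {"n": 0}
delays: list[float] = []


def flaky() -> str:
    calls["n"] += 1
    if calls["n"] < 3:
        raise ConnectionError("simulated failure")
    return "ok"


result = retry_with_backoff(flaky, sleep=delays.append)
print(result, delays)
```

In real use you would pass something like `lambda: ollama.show(model_name)` as `func` and leave `sleep` at its default.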

Performance Optimization and Monitoring

Benchmarking Inference Speed

Measure tokens per second to compare models and hardware configurations:

import time

def benchmark_model(model_name: str, prompt: str, iterations: int = 5):
    """Benchmark a model's inference speed."""
    times = []
    for i in range(iterations):
        start = time.time()
        response = ollama.generate(model=model_name, prompt=prompt)
        elapsed = time.time() - start
        tokens = response.get("eval_count", 0)
        times.append(tokens / elapsed if elapsed > 0 else 0)

    avg_tps = sum(times) / len(times)
    logger.info(f"Model {model_name}: Average {avg_tps:.2f} tokens/second")
    return avg_tps
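Wall-clock timing as above includes model load time and prompt processing. Ollama responses also carry eval_count and eval_duration fields, which the /generate endpoint earlier already returns; assuming eval_duration is reported in nanoseconds, generation-only throughput can be computed as in this sketch. `tokens_per_second` is a hypothetical helper and the sample values are made up.

```python
def tokens_per_second(eval_count: int, eval_duration_ns: int) -> float:
    """Generation throughput from eval_count and eval_duration (nanoseconds)."""
    if eval_duration_ns <= 0:
        return 0.0
    return eval_count / (eval_duration_ns / 1e9)


# Hand-written sample values, not a real measurement
sample = {"eval_count": 100, "eval_duration": 2_000_000_000}
rate = tokens_per_second(sample["eval_count"], sample["eval_duration"])
print(f"{rate:.1f} tokens/s")
```

Comparing this figure with the wall-clock number shows how much of each request is spent outside token generation.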

Logging and Monitoring

Integrate with structured logging for production observability. This example uses the third-party python-json-logger package (pip install python-json-logger):

from pythonjsonlogger import jsonlogger

# Configure JSON logging for log aggregation tools
handler = logging.StreamHandler()
formatter = jsonlogger.JsonFormatter(
    fmt="%(asctime)s %(name)s %(levelname)s %(message)s"
)
handler.setFormatter(formatter)
logger.addHandler(handler)
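For comparison, a dependency-free version can be sketched with the standard logging module. `SimpleJsonFormatter` and the `json_demo` logger name are hypothetical, and the field names are one possible choice rather than a standard.

```python
import json
import logging


class SimpleJsonFormatter(logging.Formatter):
    """Render each log record as one JSON object per line."""

    def format(self, record: logging.LogRecord) -> str:
        entry = {
            "time": self.formatTime(record),
            "name": record.name,
            "level": record.levelname,
            "message": record.getMessage(),
        }
        if record.exc_info:
            entry["exception"] = self.formatException(record.exc_info)
        return json.dumps(entry)


demo_logger = logging.getLogger("json_demo")
demo_handler = logging.StreamHandler()
demo_handler.setFormatter(SimpleJsonFormatter())
demo_logger.addHandler(demo_handler)
demo_logger.setLevel(logging.INFO)
demo_logger.info("model loaded: %s", "llama3.2")
```

One JSON object per line is the shape most log aggregation tools expect, which is the main reason to prefer it over free-form text in production.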

What's Next

You now have a production-ready local LLM inference system using Ollama. This setup eliminates cloud dependencies, reduces latency, and gives you full control over model selection and data privacy. To extend this tutorial:

  1. Add streaming support: Modify the /chat endpoint to use Server-Sent Events (SSE) for real-time token-by-token output.
  2. Implement model hot-swapping: Build an endpoint that dynamically loads/unloads models based on demand.
  3. Integrate with vector databases: Combine local LLMs with embeddings for Retrieval-Augmented Generation (RAG) pipelines.
  4. Explore customization: Use Ollama's Modelfile to package custom models, for example by applying a LoRA adapter trained elsewhere for domain-specific tasks.
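As a starting point for the streaming item in the list above, the wire format of Server-Sent Events is simple: each event is a `data:` line followed by a blank line. The sketch below only shows the framing. `sse_events` is a hypothetical generator, and the `[DONE]` sentinel is an assumed convention, not something Ollama or FastAPI requires.

```python
import json
from typing import Iterable, Iterator


def sse_events(tokens: Iterable[str]) -> Iterator[str]:
    """Wrap each token in a Server-Sent Events data frame."""
    for token in tokens:
        payload = json.dumps({"token": token})
        yield f"data: {payload}\n\n"
    yield "data: [DONE]\n\n"  # assumed end-of-stream marker


frames = list(sse_events(["Hello", ",", " world"]))
print("".join(frames), end="")
```

In a FastAPI endpoint, a generator like this could be returned through a StreamingResponse with media type text/event-stream, with the tokens coming from a streaming model call instead of a fixed list.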

For further reading, check out our guides on building RAG applications and comparing open-weight models. The Ollama ecosystem continues to evolve: as of its latest commit on 2026-05-20, the project has 3,252 open issues, a sign of an active community. With 171.8k stars and a 4.6 rating, Ollama is a solid foundation for local LLM deployment in 2026.


References

1. Wikipedia - LangChain. Wikipedia.
2. Wikipedia - Llama. Wikipedia.
3. Wikipedia - Ollama. Wikipedia.
4. GitHub - langchain-ai/langchain. GitHub.
5. GitHub - meta-llama/llama. GitHub.
6. GitHub - ollama/ollama. GitHub.
7. GitHub - milvus-io/milvus. GitHub.
8. LangChain Pricing. Pricing.
9. LlamaIndex Pricing. Pricing.