
How to Run LLMs Locally with Ollama

Practical tutorial: Ollama simplifies running large language models locally, which makes it a useful tool for developers and researchers.

BlogIA Academy · May 25, 2026 · 14 min read · 2,754 words



Large language models have transformed how we interact with software, but relying on cloud APIs introduces latency, privacy concerns, and recurring costs. Running models locally addresses these issues, and Ollama has emerged as the leading open-source platform for this task. As of May 25, 2026, Ollama has accumulated 172,200 stars on GitHub [1] and is actively maintained with its latest commit on 2026-05-25 [3]. The current stable release is version 0.6.2 [4], built primarily in Go [17]. This tutorial will guide you through setting up Ollama, deploying production-grade models, and integrating them into a Python application with a REST API.

Understanding Ollama's Architecture and Production Use Cases

Ollama is a software platform for running and managing large language models on local computers and through hosted cloud models [5]. It provides a command-line interface, a local REST API, model-management tools, and integrations for using open-weight models with coding assistants and other applications [5]. The platform supports models like Llama 3.2, Llama 3.1, and Mistral [7][8][9], all sourced from Ollama's model library.

In production environments, Ollama excels in scenarios requiring data privacy, low latency, or offline operation. For example, a healthcare application processing patient records can run models locally to avoid transmitting sensitive data over the internet. Similarly, a customer support chatbot for a manufacturing plant with intermittent connectivity can keep working without cloud dependencies. The platform is open source [11] and holds a rating of 4.6 [12], which makes it accessible for both prototyping and deployment.

The architecture is straightforward: Ollama runs as a background service (daemon) that exposes a REST API on localhost:11434. You interact with it via the CLI or HTTP requests. Models are downloaded and cached locally, and the service handles inference with GPU acceleration when available. This design allows you to swap models without changing your application code, as long as you maintain consistent API contracts.
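To make the point about swapping models concrete, here is a minimal sketch that uses only Python's standard library. The helper names build_generate_request and generate are hypothetical; the /api/generate path and the model, prompt, and stream fields are the ones used later in this tutorial.

```python
import json
import urllib.request

OLLAMA_BASE_URL = "http://localhost:11434"  # default address of the local Ollama service


def build_generate_request(model: str, prompt: str) -> urllib.request.Request:
    """Build (but do not send) a POST request for the /api/generate endpoint."""
    payload = {"model": model, "prompt": prompt, "stream": False}
    return urllib.request.Request(
        f"{OLLAMA_BASE_URL}/api/generate",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def generate(model: str, prompt: str, timeout: float = 60.0) -> str:
    """Send the request and return the generated text (needs a running Ollama service)."""
    with urllib.request.urlopen(build_generate_request(model, prompt), timeout=timeout) as resp:
        return json.loads(resp.read().decode("utf-8"))["response"]


# Usage, once the service is running and the models are pulled:
#   generate("llama3.2", "Say hello in five words.")
#   generate("mistral", "Say hello in five words.")
# Only the model name changes; the request shape stays the same.
```

Because the model is just a string in the payload, application code that builds requests this way does not need to change when you switch models.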

Prerequisites and Environment Setup

Before diving into implementation, ensure your system meets these requirements:

  • Operating System: Linux (Ubuntu 20.04+ recommended), macOS 12+, or Windows 10/11 with WSL2
  • Hardware: Minimum 8GB RAM (16GB+ recommended for 7B+ parameter models), GPU with CUDA support optional but beneficial
  • Software: Python 3.10+, pip, curl, and Git

Installing Ollama

The installation process varies by platform. On Linux, use the official install script:

curl -fsSL https://ollama.ai/install.sh | sh

On macOS and Windows, download the installer from https://ollama.ai/download and run it. After installation, verify that the CLI is available:

ollama --version
# Output: ollama version is 0.6.2

If the command isn't found, add Ollama to your PATH or restart your terminal. The service should start automatically. You can check its status:

systemctl status ollama  # Linux
# or
ps aux | grep ollama     # macOS/Linux
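The same check can be done from Python, which is handy in startup scripts. This is a minimal sketch using only the standard library; the function name is_ollama_running is hypothetical, and it probes the /api/tags endpoint that the API code later in this tutorial also uses.

```python
import urllib.error
import urllib.request


def is_ollama_running(base_url: str = "http://localhost:11434", timeout: float = 2.0) -> bool:
    """Return True if the Ollama HTTP API answers with status 200, False otherwise."""
    try:
        with urllib.request.urlopen(f"{base_url}/api/tags", timeout=timeout) as resp:
            return resp.status == 200
    except (urllib.error.URLError, OSError):
        # Connection refused, timeout, DNS failure, or an HTTP error status
        return False


# Usage:
#   if not is_ollama_running():
#       print("Ollama is not reachable on localhost:11434")
```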

Setting Up the Python Environment

Create a virtual environment and install the required packages:

python3 -m venv ollama_env
source ollama_env/bin/activate  # On Windows: ollama_env\Scripts\activate
pip install requests fastapi uvicorn pydantic

We'll use requests for HTTP communication with Ollama's API, fastapi and uvicorn for building our own REST API, and pydantic for data validation.

Downloading and Running Models

Ollama's model management is handled through the CLI. Let's pull a few models for different use cases:

# Pull Llama 3.2 (3B parameters) - fast, suitable for simple tasks
ollama pull llama3.2

# Pull Mistral (7B parameters) - balanced performance and quality
ollama pull mistral

# Pull Llama 3.1 (8B parameters) - higher quality, more resource-intensive
ollama pull llama3.1

Each command downloads the model weights and configuration. The download size varies: Llama 3.2 is approximately 2GB, Mistral is 4.1GB, and Llama 3.1 is 4.7GB. Ensure you have sufficient disk space (at least 20GB free for multiple models).
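As a quick sanity check on the disk-space figures above, the arithmetic can be scripted. This is a sketch under stated assumptions: the sizes are the approximate values quoted in this section, the 5GB headroom is an arbitrary illustrative margin, and the function names are hypothetical.

```python
import shutil
from typing import List

# Approximate download sizes in GB, taken from the figures quoted above
MODEL_SIZES_GB = {"llama3.2": 2.0, "mistral": 4.1, "llama3.1": 4.7}


def required_space_gb(models: List[str]) -> float:
    """Sum the approximate download sizes of the requested models."""
    return sum(MODEL_SIZES_GB[name] for name in models)


def has_enough_space(models: List[str], free_gb: float, headroom_gb: float = 5.0) -> bool:
    """Check that free disk space covers the downloads plus some headroom."""
    return free_gb >= required_space_gb(models) + headroom_gb


def free_disk_gb(path: str = ".") -> float:
    """Free space on the filesystem containing `path`, in GB (10**9 bytes)."""
    return shutil.disk_usage(path).free / 1e9


wanted = ["llama3.2", "mistral", "llama3.1"]
print(f"Downloads need about {required_space_gb(wanted):.1f} GB")
print("Enough space here:", has_enough_space(wanted, free_disk_gb()))
```

The three models together come to roughly 10.8GB, so the 20GB recommendation leaves room for one or two more models of similar size.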

To verify the models are available:

ollama list
# Output example:
# NAME            ID              SIZE      MODIFIED
# llama3.2:latest 123abc..       2.0 GB    5 minutes ago
# mistral:latest  456def..       4.1 GB    10 minutes ago
# llama3.1:latest 789ghi..       4.7 GB    2 hours ago

Running a Model Interactively

Test a model directly from the terminal:

ollama run llama3.2

This opens an interactive session. Type prompts and see responses in real-time. Exit with /bye or Ctrl+D.

Building a Production-Grade Python API with Ollama

Now we'll create a FastAPI application that wraps Ollama's API, adding error handling, rate limiting, and structured responses. This is suitable for production deployments where you need to serve multiple clients.

Core Implementation

Create a file named ollama_api.py:

import requests
import json
import time
from typing import Optional, List, Dict, Any
from fastapi import FastAPI, HTTPException, BackgroundTasks
from pydantic import BaseModel, Field
import uvicorn

# Ollama API configuration
OLLAMA_BASE_URL = "http://localhost:11434"
OLLAMA_GENERATE_ENDPOINT = f"{OLLAMA_BASE_URL}/api/generate"
OLLAMA_CHAT_ENDPOINT = f"{OLLAMA_BASE_URL}/api/chat"
OLLAMA_LIST_ENDPOINT = f"{OLLAMA_BASE_URL}/api/tags"

# Initialize FastAPI app
app = FastAPI(
    title="Local LLM API",
    description="Production-grade API for running LLMs locally via Ollama",
    version="1.0.0"
)

# Request models
class GenerateRequest(BaseModel):
    model: str = Field(..., description="Model name (e.g., llama3.2, mistral)")
    prompt: str = Field(..., description="Input prompt for the model")
    system: Optional[str] = Field(None, description="System prompt to set context")
    temperature: Optional[float] = Field(0.7, ge=0.0, le=2.0, description="Sampling temperature")
    max_tokens: Optional[int] = Field(512, ge=1, le=4096, description="Maximum tokens to generate")
    stream: Optional[bool] = Field(False, description="Whether to stream the response")

class ChatMessage(BaseModel):
    role: str = Field(..., pattern="^(system|user|assistant)$")
    content: str = Field(...)

class ChatRequest(BaseModel):
    model: str = Field(..., description="Model name")
    messages: List[ChatMessage] = Field(..., min_length=1, description="Chat messages")
    temperature: Optional[float] = Field(0.7, ge=0.0, le=2.0)
    max_tokens: Optional[int] = Field(512, ge=1, le=4096)
    stream: Optional[bool] = Field(False)

# Response models
class GenerateResponse(BaseModel):
    model: str
    response: str
    tokens_generated: int
    total_duration_ms: float

class ChatResponse(BaseModel):
    model: str
    message: ChatMessage
    tokens_generated: int
    total_duration_ms: float

class ModelInfo(BaseModel):
    name: str
    size_bytes: int
    modified_at: str

# Helper function to call Ollama API
def call_ollama(endpoint: str, payload: Dict[str, Any]) -> Dict[str, Any]:
    """
    Make a request to Ollama's API with error handling and timeout.

    Args:
        endpoint: Full URL to Ollama endpoint
        payload: JSON payload for the request

    Returns:
        Parsed JSON response from Ollama

    Raises:
        HTTPException: If Ollama is unreachable or returns an error
    """
    try:
        response = requests.post(
            endpoint,
            json=payload,
            timeout=60  # 60-second timeout for model inference
        )
        response.raise_for_status()

        # Handle streaming responses (we collect all chunks)
        if payload.get("stream", False):
            full_response = ""
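            for line in response.iter_lines():
                if line:
                    chunk = json.loads(line.decode('utf-8'))
                    # /api/generate streams text in 'response'; /api/chat streams it in 'message.content'
                    full_response += chunk.get('response') or chunk.get('message', {}).get('content', '')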
                    if chunk.get('done', False):
                        return {
                            "model": chunk.get("model", payload["model"]),
                            "response": full_response,
                            "eval_count": chunk.get("eval_count", 0),
                            "total_duration": chunk.get("total_duration", 0)
                        }
            return {"model": payload["model"], "response": full_response, "eval_count": 0, "total_duration": 0}
        else:
            return response.json()

    except requests.exceptions.ConnectionError:
        raise HTTPException(
            status_code=503,
            detail="Ollama service is not running. Start it with 'ollama serve'."
        )
    except requests.exceptions.Timeout:
        raise HTTPException(
            status_code=504,
            detail="Model inference timed out. Consider using a smaller model or increasing timeout."
        )
    except requests.exceptions.RequestException as e:
        raise HTTPException(
            status_code=500,
            detail=f"Ollama API error: {str(e)}"
        )

# Health check endpoint
@app.get("/health")
async def health_check():
    """Check if Ollama service is running and responsive."""
    try:
        response = requests.get(f"{OLLAMA_BASE_URL}/api/tags", timeout=5)
        response.raise_for_status()
        models = response.json().get("models", [])
        # Ask the running server for its version instead of hardcoding it
        version_response = requests.get(f"{OLLAMA_BASE_URL}/api/version", timeout=5)
        version_response.raise_for_status()
        return {
            "status": "healthy",
            "ollama_version": version_response.json().get("version", "unknown"),
            "models_available": len(models),
            "timestamp": time.time()
        }
    except Exception as e:
        raise HTTPException(status_code=503, detail=f"Ollama unreachable: {str(e)}")

# List available models
@app.get("/models", response_model=List[ModelInfo])
async def list_models():
    """List all models available in Ollama."""
    try:
        response = requests.get(OLLAMA_LIST_ENDPOINT, timeout=5)
        response.raise_for_status()
        models_data = response.json().get("models", [])
        return [
            ModelInfo(
                name=m["name"],
                size_bytes=m["size"],
                modified_at=m["modified_at"]
            )
            for m in models_data
        ]
    except Exception as e:
        raise HTTPException(status_code=500, detail=f"Failed to list models: {str(e)}")

# Generate text (non-chat)
@app.post("/generate", response_model=GenerateResponse)
async def generate_text(request: GenerateRequest):
    """
    Generate text using a specified model.

    This endpoint is ideal for completion tasks, summarization, and extraction.
    """
    payload = {
        "model": request.model,
        "prompt": request.prompt,
        "stream": request.stream,
        "options": {
            "temperature": request.temperature,
            "num_predict": request.max_tokens
        }
    }

    if request.system:
        payload["system"] = request.system

    start_time = time.time()
    result = call_ollama(OLLAMA_GENERATE_ENDPOINT, payload)
    elapsed_ms = (time.time() - start_time) * 1000

    return GenerateResponse(
        model=result.get("model", request.model),
        response=result.get("response", ""),
        tokens_generated=result.get("eval_count", 0),
        total_duration_ms=elapsed_ms
    )

# Chat completion
@app.post("/chat", response_model=ChatResponse)
async def chat_completion(request: ChatRequest):
    """
    Chat completion endpoint supporting multi-turn conversations.

    This endpoint is ideal for chatbots and interactive applications.
    """
    messages = [{"role": msg.role, "content": msg.content} for msg in request.messages]

    payload = {
        "model": request.model,
        "messages": messages,
        "stream": request.stream,
        "options": {
            "temperature": request.temperature,
            "num_predict": request.max_tokens
        }
    }

    start_time = time.time()
    result = call_ollama(OLLAMA_CHAT_ENDPOINT, payload)
    elapsed_ms = (time.time() - start_time) * 1000

    # Ollama returns 'message' for chat endpoint
    response_content = result.get("message", {}).get("content", result.get("response", ""))

    return ChatResponse(
        model=result.get("model", request.model),
        message=ChatMessage(role="assistant", content=response_content),
        tokens_generated=result.get("eval_count", 0),
        total_duration_ms=elapsed_ms
    )

# Run the server
if __name__ == "__main__":
    uvicorn.run(
        "ollama_api:app",
        host="0.0.0.0",
        port=8000,
        reload=True,  # Disable in production
        log_level="info"
    )
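The chunk-collection logic in call_ollama can be tried without a running server. The sketch below reimplements it as a standalone function (the name collect_stream is hypothetical) and feeds it hand-written sample chunks shaped like the newline-delimited JSON that /api/generate streams; the values are illustrative, not real model output.

```python
import json
from typing import Any, Dict, Iterable


def collect_stream(lines: Iterable[bytes], default_model: str) -> Dict[str, Any]:
    """Merge newline-delimited JSON chunks into one result dict."""
    text = ""
    for line in lines:
        if not line:
            continue  # skip blank lines between chunks
        chunk = json.loads(line.decode("utf-8"))
        text += chunk.get("response", "")
        if chunk.get("done", False):
            # The final chunk carries the generation statistics
            return {
                "model": chunk.get("model", default_model),
                "response": text,
                "eval_count": chunk.get("eval_count", 0),
                "total_duration": chunk.get("total_duration", 0),
            }
    # Stream ended without a final "done" chunk
    return {"model": default_model, "response": text, "eval_count": 0, "total_duration": 0}


sample = [
    b'{"model": "llama3.2", "response": "Hello", "done": false}',
    b"",
    b'{"model": "llama3.2", "response": " world", "done": false}',
    b'{"model": "llama3.2", "response": "", "done": true, "eval_count": 2, "total_duration": 1500000}',
]
result = collect_stream(sample, default_model="llama3.2")
print(result["response"])    # Hello world
print(result["eval_count"])  # 2
```

Note that this wrapper still returns a single JSON body to the client even when stream is true; it only changes how the response is read from Ollama.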

Running the API Server

Start the FastAPI server:

python ollama_api.py

The server will start on http://0.0.0.0:8000. You can access the interactive documentation at http://localhost:8000/docs.

Testing the API

Use curl to test the endpoints:

# Health check
curl http://localhost:8000/health

# List models
curl http://localhost:8000/models

# Generate text
curl -X POST http://localhost:8000/generate \
  -H "Content-Type: application/json" \
  -d '{
    "model": "llama3.2",
    "prompt": "Explain quantum computing in simple terms.",
    "temperature": 0.7,
    "max_tokens": 200
  }'

# Chat completion
curl -X POST http://localhost:8000/chat \
  -H "Content-Type: application/json" \
  -d '{
    "model": "mistral",
    "messages": [
      {"role": "system", "content": "You are a helpful assistant."},
      {"role": "user", "content": "What is the capital of France?"}
    ]
  }'

Handling Edge Cases and Production Considerations

Memory Management

Large language models consume significant RAM. A 7B parameter model like Mistral requires approximately 4-6GB of RAM at 4-bit quantization. If you're running multiple models or have limited memory, consider these strategies:

  1. Unload unused models: Use ollama stop <model_name> to free memory.
  2. Use smaller models: Llama 3.2 (3B) uses about 2GB RAM.
  3. Monitor memory usage: Implement a background task to check memory and warn users. The example below uses psutil (pip install psutil):

import psutil

def check_memory_usage(threshold_gb: float = 8.0):
    """Check if available memory is below threshold."""
    memory = psutil.virtual_memory()
    available_gb = memory.available / (1024 ** 3)
    if available_gb < threshold_gb:
        print(f"Warning: Only {available_gb:.1f}GB RAM available. Consider stopping unused models.")
    return available_gb
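The 4-6GB figure for a 7B model can be reproduced with back-of-the-envelope arithmetic: weights take roughly parameters × bits per weight ÷ 8 bytes, plus some allowance for the KV cache and runtime buffers. The sketch below assumes a 30% overhead, which is an illustrative rule of thumb and not a measured value; the function names are hypothetical.

```python
def estimate_weights_gb(params_billion: float, bits_per_weight: float) -> float:
    """Size of the weights alone, in GB (10**9 bytes)."""
    return params_billion * 1e9 * bits_per_weight / 8 / 1e9


def estimate_runtime_gb(params_billion: float, bits_per_weight: float, overhead: float = 0.3) -> float:
    """Weights plus a fractional allowance for KV cache and runtime buffers.

    The default 30% overhead is an assumption for illustration only.
    """
    return estimate_weights_gb(params_billion, bits_per_weight) * (1 + overhead)


print(f"7B at 4-bit, weights only:  {estimate_weights_gb(7, 4):.2f} GB")   # 3.50 GB
print(f"7B at 4-bit, with overhead: {estimate_runtime_gb(7, 4):.2f} GB")   # 4.55 GB
print(f"7B at 16-bit, weights only: {estimate_weights_gb(7, 16):.2f} GB")  # 14.00 GB
```

Actual usage also depends on context length and the specific quantization format, so treat these numbers as a lower bound when sizing a machine.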

Error Handling for Model Unavailability

When a requested model isn't pulled, Ollama returns a 404 error. Our API should handle this gracefully:

# In call_ollama, add this check before response.raise_for_status():
if response.status_code == 404:
    raise HTTPException(
        status_code=404,
        detail=f"Model '{payload.get('model')}' not found. Pull it with 'ollama pull {payload.get('model')}'."
    )
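An alternative is to check availability up front against the model list before sending the request. This is a minimal sketch: the helper names are hypothetical, and the sample dictionary only mimics the "models" and "name" fields that the /models endpoint above already reads from /api/tags.

```python
from typing import Any, Dict


def normalize_model_name(name: str) -> str:
    """Append the default ':latest' tag when the name has no explicit tag."""
    return name if ":" in name else f"{name}:latest"


def is_model_available(name: str, tags_response: Dict[str, Any]) -> bool:
    """Check a model name against the JSON body returned by GET /api/tags."""
    installed = {m["name"] for m in tags_response.get("models", [])}
    return normalize_model_name(name) in installed


# Illustrative /api/tags body containing only the fields used here
tags = {"models": [{"name": "llama3.2:latest"}, {"name": "mistral:latest"}]}
print(is_model_available("llama3.2", tags))  # True
print(is_model_available("llama3.1", tags))  # False
```

A pre-flight check like this lets the API return a clear 404 before any inference time is spent, at the cost of one extra request to Ollama.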

Rate Limiting and Concurrency

In production, you'll want to prevent abuse. FastAPI can integrate with slowapi for rate limiting:

pip install slowapi

Add to your app:

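from fastapi import Request
from slowapi import Limiter, _rate_limit_exceeded_handler
from slowapi.errors import RateLimitExceeded
from slowapi.util import get_remote_address

limiter = Limiter(key_func=get_remote_address)
app.state.limiter = limiter
app.add_exception_handler(RateLimitExceeded, _rate_limit_exceeded_handler)

@app.post("/generate", response_model=GenerateResponse)
@limiter.limit("10/minute")
async def generate_text(request: Request, body: GenerateRequest):
    # slowapi needs a parameter named `request` holding the Starlette Request,
    # so the JSON payload moves to `body`
    ...  # existing code, reading fields from `body` instead of `request`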
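To show what a limit such as "10/minute" means in practice, here is a simplified sliding-window limiter built on the standard library. It is an illustration of the idea only, not slowapi's actual implementation; the class name and the injectable clock are assumptions made so the example is deterministic.

```python
import time
from collections import defaultdict, deque
from typing import Callable, Deque, Dict


class SlidingWindowLimiter:
    """Allow at most `limit` requests per `window_seconds` for each client key."""

    def __init__(self, limit: int, window_seconds: float,
                 clock: Callable[[], float] = time.monotonic):
        self.limit = limit
        self.window = window_seconds
        self.clock = clock
        self._hits: Dict[str, Deque[float]] = defaultdict(deque)

    def allow(self, key: str) -> bool:
        now = self.clock()
        hits = self._hits[key]
        # Drop timestamps that have fallen out of the window
        while hits and now - hits[0] >= self.window:
            hits.popleft()
        if len(hits) >= self.limit:
            return False
        hits.append(now)
        return True


# Demo with a fake clock: 11 requests at t=0, then one more at t=60
fake_now = [0.0]
limiter_demo = SlidingWindowLimiter(limit=10, window_seconds=60, clock=lambda: fake_now[0])
results = [limiter_demo.allow("203.0.113.5") for _ in range(11)]
print(results.count(True), results[-1])    # 10 False
fake_now[0] = 60.0
print(limiter_demo.allow("203.0.113.5"))   # True
```

The key is the client address, which mirrors what get_remote_address provides; behind a reverse proxy you would need to key on the forwarded client address instead.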

GPU Acceleration

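Ollama automatically uses GPU if available. To verify, load a model and then list the running models:

ollama run llama3.2
# In a second terminal, while the model is loaded:
ollama ps
# The PROCESSOR column shows the CPU/GPU split, for example "100% GPU"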

If GPU isn't detected, ensure CUDA drivers are installed and Ollama was built with CUDA support. On Linux, you can check:

nvidia-smi
# Should show GPU utilization when running a model

Advanced Integration: Building a RAG Pipeline

Let's extend our API to support Retrieval-Augmented Generation (RAG), which combines local LLMs with a vector database [3] for context-aware responses. This is a common production pattern for question-answering systems.

First, install additional dependencies:

pip install sentence-transformers chromadb

Create a file rag_pipeline.py:

import chromadb
from sentence_transformers import SentenceTransformer
import requests
import json
from typing import List, Dict, Any

class LocalRAGPipeline:
    """
    RAG pipeline using local embeddings and Ollama for generation.
    """

    def __init__(self, embedding_model: str = "all-MiniLM-L6-v2"):
        self.embedder = SentenceTransformer(embedding_model)
        self.chroma_client = chromadb.Client()
        self.collection = self.chroma_client.create_collection(
            name="documents",
            embedding_function=None  # We'll provide embeddings manually
        )
        self.ollama_url = "http://localhost:11434/api/generate"

    def add_documents(self, documents: List[str], ids: List[str]):
        """
        Add documents to the vector store.

        Args:
            documents: List of text documents
            ids: Unique identifiers for each document
        """
        embeddings = self.embedder.encode(documents).tolist()
        self.collection.add(
            embeddings=embeddings,
            documents=documents,
            ids=ids
        )

    def query(self, question: str, model: str = "llama3.2", top_k: int = 3) -> str:
        """
        Answer a question using RAG.

        Args:
            question: User's question
            model: Ollama model to use for generation
            top_k: Number of relevant documents to retrieve

        Returns:
            Generated answer with context
        """
        # Embed the question
        question_embedding = self.embedder.encode([question]).tolist()[0]

        # Retrieve relevant documents
        results = self.collection.query(
            query_embeddings=[question_embedding],
            n_results=top_k
        )

        # Build context from retrieved documents
        context = "\n\n".join(results["documents"][0])

        # Create prompt with context
        prompt = f"""Use the following context to answer the question. If the context doesn't contain relevant information, say "I don't have enough information to answer."

Context:
{context}

Question: {question}

Answer:"""

        # Generate response using Ollama
        payload = {
            "model": model,
            "prompt": prompt,
            "stream": False,
            "options": {
                "temperature": 0.3,  # Lower temperature for factual answers
                "num_predict": 512
            }
        }

        response = requests.post(self.ollama_url, json=payload, timeout=60)
        response.raise_for_status()

        return response.json()["response"]

# Example usage
if __name__ == "__main__":
    rag = LocalRAGPipeline()

    # Add some documents
    rag.add_documents(
        documents=[
            "Ollama is an open-source platform for running LLMs locally.",
            "The latest version of Ollama is 0.6.2 as of May 2026.",
            "Ollama supports models like Llama 3.2, Mistral, and Llama 3.1."
        ],
        ids=["doc1", "doc2", "doc3"]
    )

    # Query
    answer = rag.query("What is Ollama and what models does it support?")
    print(f"Answer: {answer}")
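The retrieval step is the part of the pipeline that the vector store hides. The sketch below shows the underlying idea with plain Python and toy 3-dimensional vectors; real embeddings from all-MiniLM-L6-v2 have 384 dimensions. Cosine similarity is used here for illustration, and the distance metric Chroma actually applies depends on how the collection is configured. The function names are hypothetical.

```python
import math
from typing import Dict, List, Tuple


def cosine_similarity(a: List[float], b: List[float]) -> float:
    """Cosine of the angle between two vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def top_k(query: List[float], store: Dict[str, List[float]], k: int = 3) -> List[Tuple[str, float]]:
    """Return the k stored vectors most similar to the query, best first."""
    scored = [(doc_id, cosine_similarity(query, vec)) for doc_id, vec in store.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:k]


# Toy "embeddings" standing in for encoded documents
store = {
    "doc1": [1.0, 0.0, 0.0],
    "doc2": [0.0, 1.0, 0.0],
    "doc3": [0.7, 0.7, 0.0],
}
matches = top_k([0.9, 0.1, 0.0], store, k=2)
print([doc_id for doc_id, _ in matches])  # ['doc1', 'doc3']
```

The documents behind the returned ids are what the pipeline joins into the context section of the prompt.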

Monitoring and Logging

For production deployments, implement structured logging:

import logging
from logging.handlers import RotatingFileHandler

# Configure logging
logger = logging.getLogger("ollama_api")
logger.setLevel(logging.INFO)
handler = RotatingFileHandler("ollama_api.log", maxBytes=10*1024*1024, backupCount=5)
formatter = logging.Formatter('%(asctime)s - %(name)s - %(levelname)s - %(message)s')
handler.setFormatter(formatter)
logger.addHandler(handler)

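# Add logging to endpoints
@app.post("/generate", response_model=GenerateResponse)
async def generate_text(request: GenerateRequest):
    logger.info(f"Generate request: model={request.model}, prompt_length={len(request.prompt)}")
    # ... existing code that builds the payload, calls call_ollama, and sets result and elapsed_ms
    logger.info(f"Generate response: tokens={result.get('eval_count', 0)}, duration_ms={elapsed_ms:.0f}")
    return GenerateResponse(
        model=result.get("model", request.model),
        response=result.get("response", ""),
        tokens_generated=result.get("eval_count", 0),
        total_duration_ms=elapsed_ms
    )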
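If your log aggregator expects machine-readable entries, the text formatter above can be swapped for one that emits JSON. This is a minimal sketch using only the standard library; the JsonFormatter class and the convention of passing extra={"fields": {...}} are assumptions made for this example, not part of the logging module's API.

```python
import json
import logging


class JsonFormatter(logging.Formatter):
    """Render each log record as a single JSON object."""

    def format(self, record: logging.LogRecord) -> str:
        entry = {
            "time": self.formatTime(record),
            "logger": record.name,
            "level": record.levelname,
            "message": record.getMessage(),
        }
        # Fields passed via extra={"fields": {...}} are merged into the entry
        entry.update(getattr(record, "fields", {}))
        return json.dumps(entry)


json_logger = logging.getLogger("ollama_api.json")
json_logger.setLevel(logging.INFO)
json_handler = logging.StreamHandler()
json_handler.setFormatter(JsonFormatter())
json_logger.addHandler(json_handler)

# Emits one JSON line containing the message plus the model and prompt_length fields
json_logger.info("generate_request", extra={"fields": {"model": "llama3.2", "prompt_length": 42}})
```

Keeping request attributes as separate fields, instead of interpolating them into the message string, makes it possible to filter and aggregate by model or duration later.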

Performance Optimization Tips

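  1. Model Quantization: Use quantized models for faster inference and lower memory usage. The default tags in Ollama's library are typically already 4-bit quantized; other quantization levels are published under explicit tags, so check the model's tag list for the exact names:

    ollama pull llama3.2:3b-instruct-q4_K_M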
    
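  2. Batch Processing: For multiple prompts, send them from one loop or worker pool so the loaded model and HTTP connections are reused between requests:

    import requests

    prompts = ["Prompt 1", "Prompt 2", "Prompt 3"]
    answers = []
    for prompt in prompts:
        # Each prompt is still its own request to the /generate endpoint defined earlier
        r = requests.post("http://localhost:8000/generate", json={"model": "llama3.2", "prompt": prompt}, timeout=120)
        r.raise_for_status()
        answers.append(r.json()["response"])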
    
  3. Connection Pooling: Reuse HTTP connections:

    import requests
    from requests.adapters import HTTPAdapter
    from urllib3.util.retry import Retry
    
    session = requests.Session()
    retries = Retry(total=3, backoff_factor=1)
    session.mount('http://', HTTPAdapter(max_retries=retries, pool_connections=10, pool_maxsize=10))
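For the batch-processing tip, a thread pool is one simple way to issue several requests at once while keeping results in input order. The sketch below is deliberately generic: run_batch and fake_generate are hypothetical names, and fake_generate stands in for a real HTTP call so the example runs without a server. How many requests Ollama actually processes in parallel depends on its configuration and available memory.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, List


def run_batch(prompts: List[str], generate: Callable[[str], str], max_workers: int = 4) -> List[str]:
    """Run `generate` over all prompts concurrently, preserving input order."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(generate, prompts))


def fake_generate(prompt: str) -> str:
    # Stand-in for a real call such as an HTTP POST to the /generate endpoint
    return f"echo: {prompt}"


answers = run_batch(["Prompt 1", "Prompt 2", "Prompt 3"], fake_generate)
print(answers)  # ['echo: Prompt 1', 'echo: Prompt 2', 'echo: Prompt 3']
```

Threads suit this workload because each worker spends most of its time waiting on the network; to use it for real, pass a function that posts to the API in place of fake_generate.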
    

Conclusion

Ollama provides a robust, open-source platform for running large language models locally, with 172,200 GitHub stars and active development as of May 2026 [1]. Its version 0.6.2 [4] offers a stable foundation for building production applications that require data privacy, low latency, or offline capabilities. By wrapping Ollama's API with FastAPI, you gain structured error handling, rate limiting, and monitoring capabilities essential for production deployments.

The RAG pipeline example demonstrates how to combine local embeddings with Ollama for context-aware question answering, a pattern increasingly used in enterprise applications. As research continues on topics like forecasting downstream performance of LLMs with proxy metrics [30][31] and understanding biases in multimodal LLMs [35][36], the ability to run models locally will become even more valuable for experimentation and fine-tuning.

What's Next

  • Explore Ollama's Modelfile to create custom models with specific system prompts and parameters
  • Integrate with LangChain for more complex agent workflows
  • Set up monitoring with Prometheus and Grafana for production metrics
  • Experiment with different quantization levels to balance performance and quality
  • Consider contributing to Ollama's open-source project on GitHub

For further reading, check out our guides on model optimization techniques and building production LLM applications.


References

1. Wikipedia - Transformers. Wikipedia.
2. Wikipedia - Llama. Wikipedia.
3. Wikipedia - Vector database. Wikipedia.
4. GitHub - huggingface/transformers. GitHub.
5. GitHub - meta-llama/llama. GitHub.
6. GitHub - milvus-io/milvus. GitHub.
7. GitHub - ollama/ollama. GitHub.
8. LlamaIndex Pricing. Pricing.