How to Build a Local AI Inference Server with Ollama and FastAPI

Building production-ready AI inference systems locally has become increasingly practical with the maturation of tools like Ollama [6] and FastAPI. As of May 2026, running large language models on consumer hardware is no longer a novelty—it's a viable architecture for prototyping, edge deployment, and privacy-sensitive applications. This tutorial walks through constructing a complete, production-grade inference server that handles concurrent requests, manages GPU memory efficiently, and provides a RESTful API for model interaction.

Understanding the Local Inference Architecture

The shift toward local AI inference stems from three critical requirements: data privacy, latency control, and cost predictability. When you run models locally, sensitive data never leaves your hardware, response times aren't subject to network variability, and you avoid per-token API costs. According to Ollama's official documentation, the platform supports over 80 models including Llama 3, Mistral [7], and Gemma, with automatic GPU acceleration via CUDA or Metal.

Our architecture follows a clean separation of concerns:

Ollama handles model loading, inference, and memory management
FastAPI provides the HTTP interface with async support
Python orchestrates communication between components

This design scales from a single laptop to multi-GPU workstations without architectural changes. The key insight is that Ollama's REST API abstracts away model-specific complexities, allowing us to focus on building robust server infrastructure.
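
To make the division of labor concrete, the sketch below builds the HTTP request that the Python layer would forward to Ollama's /api/generate endpoint, using only the standard library. The helper names build_generate_request and generate are illustrative; the base URL and payload fields mirror the ones used by the server later in this tutorial, and the model tag is whatever you have pulled locally.

```python
import json
import urllib.request

OLLAMA_BASE_URL = "http://localhost:11434"  # default address, as used later in this tutorial


def build_generate_request(prompt: str, model: str) -> urllib.request.Request:
    """Build (but do not send) a POST request for Ollama's /api/generate endpoint."""
    payload = {"model": model, "prompt": prompt, "stream": False}
    return urllib.request.Request(
        url=f"{OLLAMA_BASE_URL}/api/generate",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def generate(prompt: str, model: str) -> str:
    """Send the request and return the generated text (needs a running Ollama)."""
    with urllib.request.urlopen(build_generate_request(prompt, model)) as resp:
        return json.loads(resp.read())["response"]
```

Everything model-specific stays behind that one endpoint, which is why the FastAPI layer only has to validate input, forward JSON, and shape the response.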

Prerequisites and Environment Setup

Before writing any code, ensure your environment meets these requirements:

Hardware Requirements:

Minimum 8GB RAM (16GB+ recommended for 7B parameter models)
NVIDIA GPU with 6GB+ VRAM (optional but strongly recommended)
20GB free disk space for model storage

Software Requirements:

Python 3.10 or later
Ollama 0.1.32 or later (latest stable as of May 2026)
pip package manager

Let's set up the environment:

# Install Ollama (Linux/macOS)
curl -fsSL https://ollama.ai/install.sh | sh

# Verify installation
ollama --version

# Pull a production-ready model (Llama 3 ships in 8B and 70B sizes)
ollama pull llama3:8b

# Create Python virtual environment
python3 -m venv inference-env
source inference-env/bin/activate

# Install dependencies
pip install fastapi==0.111.0 uvicorn==0.29.0 httpx==0.27.0 pydantic==2.7.0 python-dotenv==1.0.1

The httpx library is crucial here—it provides async HTTP client capabilities that integrate seamlessly with FastAPI's async endpoints. We'll use it to communicate with Ollama's local API without blocking the server's event loop.
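
The benefit of a non-blocking client shows up when several requests are in flight at once. The sketch below uses asyncio.sleep as a stand-in for an awaited HTTP call, so it does not contact Ollama; fake_ollama_call and the 0.1 second delay are made up for illustration.

```python
import asyncio
import time


async def fake_ollama_call(request_id: int, delay: float = 0.1) -> int:
    """Stand-in for an awaited HTTP call; yields control while 'waiting'."""
    await asyncio.sleep(delay)
    return request_id


async def run_concurrently(n: int) -> tuple[list[int], float]:
    """Run n simulated calls at once and return their results and elapsed time."""
    start = time.perf_counter()
    results = await asyncio.gather(*(fake_ollama_call(i) for i in range(n)))
    return list(results), time.perf_counter() - start


results, elapsed = asyncio.run(run_concurrently(5))
print(results, f"{elapsed:.2f}s")
```

Because each call awaits instead of blocking, the five calls overlap and the total is close to one delay rather than five. A blocking call such as time.sleep in the same place would serialize them.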

Building the Core Inference Server

Now we'll construct the production inference server. This implementation handles concurrent requests, manages model loading/unloading, and provides proper error handling for edge cases.

# inference_server.py
import asyncio
import logging
import time
from contextlib import asynccontextmanager
from typing import AsyncGenerator, Optional

import httpx
from fastapi import FastAPI, HTTPException, Request
from fastapi.responses import StreamingResponse
from pydantic import BaseModel, Field, field_validator

# Configure structured logging for production observability
logging.basicConfig(
    level=logging.INFO,
    format="%(asctime)s - %(name)s - %(levelname)s - %(message)s",
)
logger = logging.getLogger(__name__)

# Constants for Ollama API interaction
OLLAMA_BASE_URL = "http://localhost:11434"
OLLAMA_GENERATE_ENDPOINT = f"{OLLAMA_BASE_URL}/api/generate"
OLLAMA_CHAT_ENDPOINT = f"{OLLAMA_BASE_URL}/api/chat"
DEFAULT_MODEL = "llama3:8b"
MAX_RETRIES = 3
RETRY_DELAY = 1.0  # seconds

# Request/Response models with validation
class GenerateRequest(BaseModel):
    prompt: str = Field(..., min_length=1, max_length=4096)
    model: str = Field(default=DEFAULT_MODEL)
    stream: bool = Field(default=False)
    temperature: float = Field(default=0.7, ge=0.0, le=2.0)
    max_tokens: int = Field(default=512, ge=1, le=4096)

    # Pydantic v2 style; the v1 `validator` decorator is deprecated in 2.x
    @field_validator("prompt")
    @classmethod
    def prompt_not_empty(cls, v: str) -> str:
        if not v.strip():
            raise ValueError("Prompt cannot be empty or whitespace only")
        return v.strip()

class GenerateResponse(BaseModel):
    response: str
    model: str
    created_at: str
    done: bool
    total_duration: Optional[int] = None
    tokens_per_second: Optional[float] = None

class HealthResponse(BaseModel):
    status: str
    model_loaded: bool
    uptime_seconds: float

# Application state management
class AppState:
    def __init__(self):
        self.start_time = time.time()
        self.client: Optional[httpx.AsyncClient] = None
        self.active_requests = 0
        self.total_requests = 0

app_state = AppState()

@asynccontextmanager
async def lifespan(app: FastAPI):
    """Manage application lifecycle - initialize and cleanup resources."""
    logger.info("Initializing inference server...")
    # Create async HTTP client with connection pooling
    app_state.client = httpx.AsyncClient(
        timeout=httpx.Timeout(300.0, connect=10.0),
        limits=httpx.Limits(max_keepalive_connections=5, max_connections=10),
    )

    # Verify Ollama is running
    try:
        response = await app_state.client.get(f"{OLLAMA_BASE_URL}/api/tags")
        response.raise_for_status()
        logger.info(f"Ollama connected. Available models: {response.json()}")
    except httpx.ConnectError:
        logger.error("Cannot connect to Ollama. Ensure it's running with 'ollama serve'")
        raise RuntimeError("Ollama service unavailable")

    yield

    # Cleanup on shutdown
    logger.info("Shutting down inference server...")
    if app_state.client:
        await app_state.client.aclose()

# Initialize FastAPI with lifespan management
app = FastAPI(
    title="Local AI Inference API",
    version="1.0.0",
    lifespan=lifespan,
)

async def query_ollama(
    payload: dict,
    stream: bool = False,
) -> dict | AsyncGenerator[str, None]:
    """
    Send request to Ollama with retry logic for transient failures.

    Args:
        payload: Dictionary with model, prompt, and generation parameters
        stream: Whether to stream the response token by token

    Returns:
        Parsed JSON response or async generator for streaming
    """
    last_exception = None

    for attempt in range(MAX_RETRIES):
        try:
            if stream:
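                # Streaming response - return an async generator.
                # Note: the request is only sent once the generator is iterated,
                # so errors raised mid-stream are not covered by the retry loop.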
                async def generate():
                    async with app_state.client.stream(
                        "POST",
                        OLLAMA_GENERATE_ENDPOINT,
                        json=payload,
                    ) as response:
                        response.raise_for_status()
                        async for line in response.aiter_lines():
                            if line.strip():
                                yield line

                return generate()
            else:
                # Non-streaming response
                response = await app_state.client.post(
                    OLLAMA_GENERATE_ENDPOINT,
                    json=payload,
                )
                response.raise_for_status()
                return response.json()

        except httpx.TimeoutException as e:
            last_exception = e
            logger.warning(f"Request timeout (attempt {attempt + 1}/{MAX_RETRIES})")
            if attempt < MAX_RETRIES - 1:
                await asyncio.sleep(RETRY_DELAY * (attempt + 1))

        except httpx.HTTPStatusError as e:
            # Don't retry on client errors (4xx)
            if 400 <= e.response.status_code < 500:
                raise HTTPException(
                    status_code=e.response.status_code,
                    detail=f"Ollama API error: {e.response.text}",
                )
            last_exception = e
            logger.warning(f"Server error (attempt {attempt + 1}/{MAX_RETRIES})")
            await asyncio.sleep(RETRY_DELAY)

    raise HTTPException(
        status_code=503,
        detail=f"Ollama service unavailable after {MAX_RETRIES} retries: {str(last_exception)}",
    )

@app.post("/generate", response_model=GenerateResponse)
async def generate_text(request: GenerateRequest):
    """
    Generate text completion from the local model.

    This endpoint handles both streaming and non-streaming responses.
    For streaming, it returns a Server-Sent Events (SSE) stream.
    """
    app_state.total_requests += 1
    app_state.active_requests += 1

    try:
        payload = {
            "model": request.model,
            "prompt": request.prompt,
            "stream": request.stream,
            "options": {
                "temperature": request.temperature,
                "num_predict": request.max_tokens,
            }
        }

        if request.stream:
            # Return streaming response
            stream_generator = await query_ollama(payload, stream=True)

            async def event_stream():
                import json  # imported once here rather than on every streamed line

                full_response = ""
                async for line in stream_generator:
                    try:
                        data = json.loads(line)
                        if "response" in data:
                            full_response += data["response"]
                            yield f"data: {json.dumps({'token': data['response']})}\n\n"
                        if data.get("done"):
                            yield f"data: {json.dumps({'done': True, 'full_response': full_response})}\n\n"
                    except json.JSONDecodeError:
                        logger.error(f"Failed to parse Ollama response: {line}")

            return StreamingResponse(
                event_stream(),
                media_type="text/event-stream",
                headers={
                    "Cache-Control": "no-cache",
                    "Connection": "keep-alive",
                    "X-Accel-Buffering": "no",
                }
            )
        else:
            # Non-streaming response
            result = await query_ollama(payload)

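            # Calculate tokens per second for monitoring.
            # eval_duration (nanoseconds) covers generation only; total_duration
            # also includes model load and prompt evaluation time.
            total_duration = result.get("total_duration", 0)
            eval_duration = result.get("eval_duration", 0)
            tokens_generated = result.get("eval_count", 0)
            tokens_per_second = None
            if eval_duration and tokens_generated:
                tokens_per_second = tokens_generated / (eval_duration / 1e9)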

            return GenerateResponse(
                response=result.get("response", ""),
                model=request.model,
                created_at=result.get("created_at", ""),
                done=result.get("done", True),
                total_duration=total_duration,
                tokens_per_second=tokens_per_second,
            )

    except HTTPException:
        raise
    except Exception as e:
        logger.exception("Unexpected error during generation")
        raise HTTPException(status_code=500, detail=f"Internal server error: {str(e)}")
    finally:
        app_state.active_requests -= 1

@app.get("/health", response_model=HealthResponse)
async def health_check():
    """Health check endpoint for monitoring and load balancers."""
    try:
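        # Verify Ollama is still responsive.
        # /api/tags lists models available on disk, so model_loaded below means
        # "pulled locally", not necessarily resident in GPU memory.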
        response = await app_state.client.get(f"{OLLAMA_BASE_URL}/api/tags")
        response.raise_for_status()
        models = response.json().get("models", [])
        model_loaded = any(m["name"] == DEFAULT_MODEL for m in models)

        return HealthResponse(
            status="healthy",
            model_loaded=model_loaded,
            uptime_seconds=time.time() - app_state.start_time,
        )
    except Exception as e:
        logger.error(f"Health check failed: {e}")
        return HealthResponse(
            status="degraded",
            model_loaded=False,
            uptime_seconds=time.time() - app_state.start_time,
        )

@app.post("/models/load")
async def load_model(model_name: str = DEFAULT_MODEL):
    """
    Explicitly load a model into GPU memory.

    This is useful for pre-loading models to avoid cold-start latency.
    """
    try:
        payload = {"model": model_name, "keep_alive": "5m"}
        response = await app_state.client.post(
            f"{OLLAMA_BASE_URL}/api/generate",
            json=payload,
        )
        response.raise_for_status()
        return {"status": "loaded", "model": model_name}
    except httpx.HTTPStatusError as e:
        raise HTTPException(
            status_code=e.response.status_code,
            detail=f"Failed to load model: {e.response.text}",
        )

@app.post("/models/unload")
async def unload_model(model_name: str = DEFAULT_MODEL):
    """
    Unload a model from GPU memory to free resources.

    Critical for managing memory in multi-model environments.
    """
    try:
        payload = {"model": model_name, "keep_alive": 0}
        response = await app_state.client.post(
            f"{OLLAMA_BASE_URL}/api/generate",
            json=payload,
        )
        response.raise_for_status()
        return {"status": "unloaded", "model": model_name}
    except httpx.HTTPStatusError as e:
        raise HTTPException(
            status_code=e.response.status_code,
            detail=f"Failed to unload model: {e.response.text}",
        )

if __name__ == "__main__":
    import uvicorn
    uvicorn.run(
        "inference_server:app",
        host="0.0.0.0",
        port=8000,
        reload=False,
        workers=1,  # Single worker to avoid GPU memory conflicts
        log_level="info",
    )

This implementation addresses several critical production concerns:

Connection Pooling: The httpx.AsyncClient with connection pooling prevents socket exhaustion under high load. The max_keepalive_connections=5 setting ensures we reuse connections to Ollama rather than opening new ones for each request.

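Retry Logic: The query_ollama function retries transient failures up to MAX_RETRIES times. After a timeout it waits longer on each attempt (1 second, then 2 seconds); after a 5xx response it waits a fixed 1 second. Strictly speaking this is linear rather than exponential backoff. Client errors (4xx) are not retried. Retries matter because Ollama may temporarily stall or reject requests while a model is loading.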
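
For comparison, a schedule that doubles the wait on every attempt can be sketched as below. The names backoff_delays and retry_async and the 30-second cap are illustrative and not part of the server above. Catching every Exception is a simplification; real code would retry only on errors known to be transient.

```python
import asyncio
from typing import Awaitable, Callable, TypeVar

T = TypeVar("T")


def backoff_delays(base: float, retries: int, cap: float = 30.0) -> list[float]:
    """Return the wait after each failed attempt: base, 2*base, 4*base, ... up to cap."""
    return [min(cap, base * (2 ** attempt)) for attempt in range(retries)]


async def retry_async(
    call: Callable[[], Awaitable[T]],
    retries: int = 3,
    base: float = 1.0,
) -> T:
    """Await call(); on failure, sleep with exponential backoff and try again."""
    for attempt, delay in enumerate(backoff_delays(base, retries)):
        try:
            return await call()
        except Exception:
            if attempt == retries - 1:
                raise  # out of attempts: surface the last error
            await asyncio.sleep(delay)
    raise ValueError("retries must be at least 1")
```

Adding a small random jitter to each delay is a common refinement when many clients might retry at the same moment.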

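Memory Management: The /models/unload endpoint allows explicit GPU memory management. By default Ollama keeps a model in memory for five minutes after its last request, and a longer keep_alive extends that, so an idle model can hold VRAM that another model needs. According to Ollama's documentation, setting keep_alive: 0 unloads the model immediately after the request completes.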

Streaming Support: The Server-Sent Events (SSE) implementation provides real-time token delivery while maintaining proper error handling. The X-Accel-Buffering: no header prevents nginx from buffering streaming responses in production deployments.
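
On the client side, the stream produced by /generate can be consumed by splitting on blank lines and decoding each data: payload. The parser below is a minimal sketch that assumes the single-line data: frames emitted by the server above; it does not handle multi-line data fields or other SSE fields such as event: or id:. The function name parse_sse_events and the sample frames are illustrative.

```python
import json


def parse_sse_events(raw: str) -> list[dict]:
    """Split an SSE body into events and decode each `data:` payload as JSON."""
    events = []
    for block in raw.split("\n\n"):
        for line in block.splitlines():
            if line.startswith("data:"):
                events.append(json.loads(line[len("data:"):].strip()))
    return events


sample = (
    'data: {"token": "Hel"}\n\n'
    'data: {"token": "lo"}\n\n'
    'data: {"done": true, "full_response": "Hello"}\n\n'
)
events = parse_sse_events(sample)
text = "".join(e["token"] for e in events if "token" in e)
print(text)  # Hello
```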

Production Deployment and Monitoring

Running this server in production requires additional considerations for reliability and observability. Let's create a deployment configuration and monitoring setup.

# config.py - Environment-based configuration
import os
from dataclasses import dataclass

@dataclass
class Settings:
    ollama_base_url: str = os.getenv("OLLAMA_BASE_URL", "http://localhost:11434")
    default_model: str = os.getenv("DEFAULT_MODEL", "llama3:8b")
    max_tokens: int = int(os.getenv("MAX_TOKENS", "4096"))
    request_timeout: int = int(os.getenv("REQUEST_TIMEOUT", "300"))
    max_concurrent_requests: int = int(os.getenv("MAX_CONCURRENT_REQUESTS", "4"))
    log_level: str = os.getenv("LOG_LEVEL", "INFO")

    # GPU memory management
    gpu_memory_fraction: float = float(os.getenv("GPU_MEMORY_FRACTION", "0.9"))
    model_keep_alive: str = os.getenv("MODEL_KEEP_ALIVE", "5m")

settings = Settings()
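
One subtlety of the dataclass above is that each os.getenv call runs once, when the class body is evaluated at import time, so variables set afterwards are not picked up. If that matters (in tests, for example), the lookups can be deferred with default_factory. The following is a sketch with a reduced set of fields; LazySettings is a hypothetical name, and the validation rule is an assumption about what a sensible minimum would be.

```python
import os
from dataclasses import dataclass, field


@dataclass
class LazySettings:
    # default_factory defers the environment lookup until instantiation
    ollama_base_url: str = field(
        default_factory=lambda: os.getenv("OLLAMA_BASE_URL", "http://localhost:11434")
    )
    max_concurrent_requests: int = field(
        default_factory=lambda: int(os.getenv("MAX_CONCURRENT_REQUESTS", "4"))
    )

    def __post_init__(self) -> None:
        # Fail fast on values that would make the server unusable
        if self.max_concurrent_requests < 1:
            raise ValueError("MAX_CONCURRENT_REQUESTS must be at least 1")
```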

For production deployment, use a process manager like Supervisor or systemd to ensure the server restarts automatically:

# /etc/systemd/system/inference-server.service
[Unit]
Description=Local AI Inference Server
After=network.target ollama.service

[Service]
Type=simple
User=inference
WorkingDirectory=/opt/inference-server
Environment="OLLAMA_BASE_URL=http://localhost:11434"
Environment="DEFAULT_MODEL=llama3:8b"
Environment="MAX_CONCURRENT_REQUESTS=4"
ExecStart=/opt/inference-server/inference-env/bin/uvicorn inference_server:app --host 0.0.0.0 --port 8000 --workers 1
Restart=always
RestartSec=10

[Install]
WantedBy=multi-user.target

Monitoring with Prometheus metrics:

# metrics.py - Add to inference_server.py (requires: pip install prometheus-client)
from prometheus_client import (
    CONTENT_TYPE_LATEST,
    Counter,
    Gauge,
    Histogram,
    generate_latest,
)
from fastapi.responses import Response

# Define metrics
REQUEST_COUNT = Counter(
    "inference_requests_total",
    "Total inference requests",
    ["model", "streaming"]
)

REQUEST_DURATION = Histogram(
    "inference_request_duration_seconds",
    "Request duration in seconds",
    ["model"],
    buckets=(0.1, 0.5, 1.0, 2.0, 5.0, 10.0, 30.0, 60.0)
)

ACTIVE_REQUESTS = Gauge(
    "inference_active_requests",
    "Currently active requests"
)

TOKENS_PER_SECOND = Histogram(
    "inference_tokens_per_second",
    "Tokens generated per second",
    ["model"],
    buckets=(1, 5, 10, 20, 50, 100)
)

@app.get("/metrics")
async def metrics():
    """Prometheus metrics endpoint for monitoring."""
    # CONTENT_TYPE_LATEST is the exposition-format content type Prometheus expects
    return Response(content=generate_latest(), media_type=CONTENT_TYPE_LATEST)
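
The metrics above are declared, but something still has to update them from the request path. One way to feed REQUEST_DURATION is a small timing context manager. The sketch below is library-agnostic: it accepts any callable that takes a number of seconds, such as a Prometheus histogram's observe method. track_duration is a hypothetical helper, not part of prometheus_client.

```python
import time
from contextlib import contextmanager
from typing import Callable, Iterator


@contextmanager
def track_duration(observe: Callable[[float], None]) -> Iterator[None]:
    """Time the enclosed block and pass the elapsed seconds to `observe`."""
    start = time.perf_counter()
    try:
        yield
    finally:
        # Runs on success and on error, so failed requests are timed too
        observe(time.perf_counter() - start)
```

Inside the endpoint this might be used as `with track_duration(REQUEST_DURATION.labels(model=request.model).observe):` around the call to Ollama, with REQUEST_COUNT incremented alongside it.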

Handling Edge Cases and Performance Optimization

Production inference servers face several edge cases that can degrade performance or cause failures. Here's how to handle them:

GPU Memory Fragmentation: Long-running servers can accumulate GPU memory fragmentation. One mitigation is to unload and reload the model periodically:

from config import settings  # Settings instance defined in config.py

async def scheduled_memory_cleanup():
    """Periodically unload and reload models to prevent memory fragmentation."""
    while True:
        await asyncio.sleep(3600)  # Every hour
        logger.info("Performing scheduled memory cleanup")
        try:
            await unload_model(settings.default_model)
            await asyncio.sleep(2)  # Allow GPU memory to be freed
            await load_model(settings.default_model)
        except Exception as e:
            logger.error(f"Memory cleanup failed: {e}")

# Add to lifespan
@asynccontextmanager
async def lifespan(app: FastAPI):
    # ... existing initialization ...
    cleanup_task = asyncio.create_task(scheduled_memory_cleanup())
    yield
    cleanup_task.cancel()
    # ... cleanup ...
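
Calling cancel() only requests cancellation; the task finishes the next time the event loop runs it. If shutdown should wait for the background job to stop before closing shared resources such as the HTTP client, the task can be awaited after cancelling. A minimal sketch, with run_periodically, stop_task, and demo as illustrative names and a short interval chosen only so the example finishes quickly:

```python
import asyncio
import contextlib
from typing import Awaitable, Callable


async def run_periodically(interval: float, job: Callable[[], Awaitable[None]]) -> None:
    """Call job() every `interval` seconds until the task is cancelled."""
    while True:
        await asyncio.sleep(interval)
        await job()


async def stop_task(task: asyncio.Task) -> None:
    """Cancel a background task and wait until it has actually finished."""
    task.cancel()
    with contextlib.suppress(asyncio.CancelledError):
        await task


async def demo() -> tuple[int, bool]:
    runs = 0

    async def job() -> None:
        nonlocal runs
        runs += 1

    task = asyncio.create_task(run_periodically(0.01, job))
    await asyncio.sleep(0.055)  # let the job fire a few times
    await stop_task(task)
    return runs, task.cancelled()
```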

Rate Limiting: Protect against abuse and resource exhaustion:

# Requires: pip install slowapi
from slowapi import Limiter, _rate_limit_exceeded_handler
from slowapi.errors import RateLimitExceeded
from slowapi.util import get_remote_address

limiter = Limiter(key_func=get_remote_address)
app.state.limiter = limiter
app.add_exception_handler(RateLimitExceeded, _rate_limit_exceeded_handler)

@app.post("/generate")
@limiter.limit("10/minute")  # Adjust based on your hardware
async def generate_text(request: Request, generate_request: GenerateRequest):
    # slowapi needs the `request: Request` parameter to identify the client
    ...  # existing implementation, reading fields from generate_request
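
To make a limit like "10/minute" per client address concrete, here is a standard-library sketch of a sliding-window limiter. It is not how slowapi is implemented; SlidingWindowLimiter and its injectable clock are hypothetical, and the state lives in one process's memory only.

```python
import time
from collections import defaultdict, deque
from typing import Callable


class SlidingWindowLimiter:
    """Allow at most `limit` calls per `window` seconds for each key."""

    def __init__(
        self,
        limit: int,
        window: float,
        clock: Callable[[], float] = time.monotonic,
    ):
        self.limit = limit
        self.window = window
        self.clock = clock
        self.hits: dict[str, deque[float]] = defaultdict(deque)

    def allow(self, key: str) -> bool:
        now = self.clock()
        hits = self.hits[key]
        # Drop timestamps that have fallen out of the window
        while hits and now - hits[0] >= self.window:
            hits.popleft()
        if len(hits) >= self.limit:
            return False
        hits.append(now)
        return True
```

A handler would call allow(client_address) and return HTTP 429 when it is False. Passing the clock in makes the window easy to test without sleeping.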

Concurrent Request Queue: When GPU memory is limited, queue requests to prevent out-of-memory errors:

import asyncio
from collections import deque

from config import settings  # Settings instance defined in config.py

class RequestQueue:
    def __init__(self, max_concurrent: int = 4):
        self.semaphore = asyncio.Semaphore(max_concurrent)
        self.queue = deque()
        self.processing = set()

    async def acquire(self, request_id: str):
        """Acquire slot with queue position tracking."""
        self.queue.append(request_id)
        try:
            await self.semaphore.acquire()
        finally:
            # Remove this request's own entry, even if the wait was cancelled
            self.queue.remove(request_id)
        self.processing.add(request_id)

    def release(self, request_id: str):
        """Release slot and allow next queued request."""
        self.processing.discard(request_id)
        self.semaphore.release()

request_queue = RequestQueue(max_concurrent=settings.max_concurrent_requests)
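
One way to wire a concurrency limit into an endpoint is an async context manager, which guarantees the slot is released even if the handler raises. ConcurrencyGate below is a hypothetical, simplified stand-in for the queue above: it tracks counts rather than request IDs, and the sleep stands in for a model call.

```python
import asyncio
from contextlib import asynccontextmanager
from typing import AsyncIterator


class ConcurrencyGate:
    """Let at most `max_concurrent` callers run the guarded section at once."""

    def __init__(self, max_concurrent: int):
        self._semaphore = asyncio.Semaphore(max_concurrent)
        self.active = 0
        self.peak = 0

    @asynccontextmanager
    async def slot(self) -> AsyncIterator[None]:
        async with self._semaphore:
            self.active += 1
            self.peak = max(self.peak, self.active)
            try:
                yield
            finally:
                self.active -= 1


async def demo(total: int = 10, limit: int = 4) -> int:
    """Start `total` handlers and report the highest concurrency observed."""
    gate = ConcurrencyGate(limit)

    async def handle() -> None:
        async with gate.slot():
            await asyncio.sleep(0.01)  # stand-in for a model call

    await asyncio.gather(*(handle() for _ in range(total)))
    return gate.peak
```

In the /generate handler the call to Ollama would sit inside `async with gate.slot():`, so excess requests wait for a slot instead of competing for GPU memory.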

Conclusion and What's Next

Building a local AI inference server with Ollama and FastAPI provides a production-ready foundation for deploying language models on your own hardware. This architecture handles concurrent requests, manages GPU memory, and provides monitoring capabilities essential for production environments.

The key takeaways from this tutorial:

Ollama abstracts model complexity while providing a clean REST API
FastAPI's async capabilities enable efficient concurrent request handling
Proper error handling and retry logic are essential for reliability
GPU memory management requires explicit model loading/unloading
Monitoring with Prometheus metrics provides operational visibility

What's Next:

Implement model caching with Redis for frequently used prompts
Add authentication with API keys or JWT tokens
Deploy behind nginx for SSL termination and load balancing
Experiment with different quantization levels (e.g., llama3:8b-instruct-q4_0) to trade output quality against speed and memory use
Implement request batching for higher throughput on GPU

For further reading, explore our guides on optimizing GPU memory for LLMs and building RAG pipelines with local models. The combination of local inference with retrieval-augmented generation creates powerful, privacy-preserving AI applications that run entirely on your infrastructure.

References

1. Wikipedia - Llama. Wikipedia.

2. Wikipedia - Rag. Wikipedia.

3. Wikipedia - Ollama. Wikipedia.

4. GitHub - meta-llama/llama. GitHub.

5. GitHub - Shubhamsaboo/awesome-llm-apps. GitHub.

6. GitHub - ollama/ollama. GitHub.

7. GitHub - mistralai/mistral-inference. GitHub.

8. LlamaIndex Pricing. Pricing.
