How to Build a Local AI Inference Server with Ollama and FastAPI
Building production-ready AI inference systems locally has become increasingly practical with the maturation of tools like Ollama [6] and FastAPI. As of May 2026, running large language models on consumer hardware is no longer a novelty—it's a viable architecture for prototyping, edge deployment, and privacy-sensitive applications. This tutorial walks through constructing a complete, production-grade inference server that handles concurrent requests, manages GPU memory efficiently, and provides a RESTful API for model interaction.
Understanding the Local Inference Architecture
The shift toward local AI inference stems from three critical requirements: data privacy, latency control, and cost predictability. When you run models locally, sensitive data never leaves your hardware, response times aren't subject to network variability, and you avoid per-token API costs. According to Ollama's official documentation, the platform supports over 80 models including Llama 3, Mistral [7], and Gemma, with automatic GPU acceleration via CUDA or Metal.
Our architecture follows a clean separation of concerns:
- Ollama handles model loading, inference, and memory management
- FastAPI provides the HTTP interface with async support
- Python orchestrates communication between components
This design scales from a single laptop to multi-GPU workstations without architectural changes. The key insight is that Ollama's REST API abstracts away model-specific complexities, allowing us to focus on building robust server infrastructure.
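To make the division of labour concrete, here is a minimal sketch of the JSON body a client sends to Ollama's /api/generate endpoint. The helper name build_generate_payload is hypothetical; the field names (model, prompt, stream, options.temperature, options.num_predict) are the ones the server code in this tutorial passes to Ollama.

```python
import json


def build_generate_payload(
    prompt: str,
    model: str = "llama3:7b",  # same default tag the tutorial uses
    temperature: float = 0.7,
    max_tokens: int = 512,
    stream: bool = False,
) -> dict:
    """Assemble the request body for Ollama's /api/generate endpoint."""
    return {
        "model": model,
        "prompt": prompt,
        "stream": stream,
        "options": {
            "temperature": temperature,
            "num_predict": max_tokens,  # Ollama's name for the output token cap
        },
    }


body = build_generate_payload("Why is the sky blue?", max_tokens=64)
print(json.dumps(body, indent=2))
```

Everything model-specific lives in this one dictionary, which is why the FastAPI layer can stay unaware of which model is running.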
Prerequisites and Environment Setup
Before writing any code, ensure your environment meets these requirements:
Hardware Requirements:
- Minimum 8GB RAM (16GB+ recommended for 7B parameter models)
- NVIDIA GPU with 6GB+ VRAM (optional but strongly recommended)
- 20GB free disk space for model storage [2]
Software Requirements:
- Python 3.10 or later
- Ollama 0.1.32 or later (latest stable as of May 2026)
- pip package manager
Let's set up the environment:
# Install Ollama (Linux/macOS)
curl -fsSL https://ollama.ai/install.sh | sh
# Verify installation
ollama --version
# Pull a model. Confirm the tag in the Ollama library first: Llama 3 is
# published in 8B and 70B sizes, so this tag may need to be llama3:8b
ollama pull llama3:7b
# Create Python virtual environment
python3 -m venv inference-env
source inference-env/bin/activate
# Install dependencies
pip install fastapi==0.111.0 uvicorn==0.29.0 httpx==0.27.0 pydantic==2.7.0 python-dotenv==1.0.1
The httpx library is crucial here—it provides async HTTP client capabilities that integrate seamlessly with FastAPI's async endpoints. We'll use it to communicate with Ollama's local API without blocking the server's event loop.
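The effect of not blocking the event loop can be seen without any network at all. The sketch below is an illustration using only the standard library: fake_ollama_call is a hypothetical stand-in for an awaited httpx request, with asyncio.sleep playing the part of the network wait.

```python
import asyncio


async def fake_ollama_call(request_id: int, events: list[str]) -> str:
    """Stand-in for an awaited HTTP call; awaiting yields to the event loop."""
    events.append(f"start {request_id}")
    await asyncio.sleep(0.01)  # simulated network and inference wait
    events.append(f"end {request_id}")
    return f"response {request_id}"


async def main() -> tuple[list[str], list[str]]:
    events: list[str] = []
    # Three requests are in flight at once on a single thread
    results = await asyncio.gather(
        *(fake_ollama_call(i, events) for i in range(3))
    )
    return list(results), events


results, events = asyncio.run(main())
print(events)
```

All three calls start before any of them finishes. A blocking client would instead run them strictly one after another, stalling every other request on the server in the meantime.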
Building the Core Inference Server
Now we'll construct the production inference server. This implementation handles concurrent requests, manages model loading/unloading, and provides proper error handling for edge cases.
# inference_server.py
import asyncio
import logging
import time
from contextlib import asynccontextmanager
from typing import AsyncGenerator, Optional

import httpx
from fastapi import FastAPI, HTTPException, Request
from fastapi.responses import StreamingResponse
from pydantic import BaseModel, ConfigDict, Field, field_validator

# Configure structured logging for production observability
logging.basicConfig(
    level=logging.INFO,
    format="%(asctime)s - %(name)s - %(levelname)s - %(message)s",
)
logger = logging.getLogger(__name__)

# Constants for Ollama API interaction
OLLAMA_BASE_URL = "http://localhost:11434"
OLLAMA_GENERATE_ENDPOINT = f"{OLLAMA_BASE_URL}/api/generate"
OLLAMA_CHAT_ENDPOINT = f"{OLLAMA_BASE_URL}/api/chat"
DEFAULT_MODEL = "llama3:7b"
MAX_RETRIES = 3
RETRY_DELAY = 1.0  # seconds


# Request/Response models with validation
class GenerateRequest(BaseModel):
    prompt: str = Field(..., min_length=1, max_length=4096)
    model: str = Field(default=DEFAULT_MODEL)
    stream: bool = Field(default=False)
    temperature: float = Field(default=0.7, ge=0.0, le=2.0)
    max_tokens: int = Field(default=512, ge=1, le=4096)

    # Pydantic v2 (pinned above as 2.7.0) uses field_validator;
    # the v1-style @validator is deprecated
    @field_validator("prompt")
    @classmethod
    def prompt_not_empty(cls, v: str) -> str:
        if not v.strip():
            raise ValueError("Prompt cannot be empty or whitespace only")
        return v.strip()


class GenerateResponse(BaseModel):
    response: str
    model: str
    created_at: str
    done: bool
    total_duration: Optional[int] = None
    tokens_per_second: Optional[float] = None


class HealthResponse(BaseModel):
    # Silence Pydantic v2's warning about field names starting with "model_"
    model_config = ConfigDict(protected_namespaces=())

    status: str
    model_loaded: bool
    uptime_seconds: float


# Application state management
class AppState:
    def __init__(self):
        self.start_time = time.time()
        self.client: Optional[httpx.AsyncClient] = None
        self.active_requests = 0
        self.total_requests = 0


app_state = AppState()
@asynccontextmanager
async def lifespan(app: FastAPI):
    """Manage application lifecycle - initialize and cleanup resources."""
    logger.info("Initializing inference server...")
    # Create async HTTP client with connection pooling
    app_state.client = httpx.AsyncClient(
        timeout=httpx.Timeout(300.0, connect=10.0),
        limits=httpx.Limits(max_keepalive_connections=5, max_connections=10),
    )
    # Verify Ollama is running
    try:
        response = await app_state.client.get(f"{OLLAMA_BASE_URL}/api/tags")
        response.raise_for_status()
        # Log only the model names rather than the full JSON body
        model_names = [m.get("name") for m in response.json().get("models", [])]
        logger.info(f"Ollama connected. Available models: {model_names}")
    except httpx.ConnectError as exc:
        logger.error("Cannot connect to Ollama. Ensure it's running with 'ollama serve'")
        # Close the client before aborting startup so the pool is not leaked
        await app_state.client.aclose()
        raise RuntimeError("Ollama service unavailable") from exc
    yield
    # Cleanup on shutdown
    logger.info("Shutting down inference server...")
    if app_state.client:
        await app_state.client.aclose()


# Initialize FastAPI with lifespan management
app = FastAPI(
    title="Local AI Inference API",
    version="1.0.0",
    lifespan=lifespan,
)
async def query_ollama(
    payload: dict,
    stream: bool = False,
) -> dict | AsyncGenerator[str, None]:
    """
    Send request to Ollama with retry logic for transient failures.

    Args:
        payload: Dictionary with model, prompt, and generation parameters
        stream: Whether to stream the response token by token

    Returns:
        Parsed JSON response, or an async generator of raw JSON lines for streaming
    """
    last_exception = None
    for attempt in range(MAX_RETRIES):
        try:
            if stream:
                # Streaming response - return async generator.
                # The request is only sent once the generator is iterated, so
                # the retry handling below does not cover streaming failures.
                async def generate():
                    async with app_state.client.stream(
                        "POST",
                        OLLAMA_GENERATE_ENDPOINT,
                        json=payload,
                    ) as response:
                        response.raise_for_status()
                        async for line in response.aiter_lines():
                            if line.strip():
                                yield line

                return generate()
            else:
                # Non-streaming response
                response = await app_state.client.post(
                    OLLAMA_GENERATE_ENDPOINT,
                    json=payload,
                )
                response.raise_for_status()
                return response.json()
        except httpx.TimeoutException as e:
            last_exception = e
            logger.warning(f"Request timeout (attempt {attempt + 1}/{MAX_RETRIES})")
            if attempt < MAX_RETRIES - 1:
                # Linear backoff: RETRY_DELAY, then 2 * RETRY_DELAY, ...
                await asyncio.sleep(RETRY_DELAY * (attempt + 1))
        except httpx.HTTPStatusError as e:
            # Don't retry on client errors (4xx)
            if 400 <= e.response.status_code < 500:
                raise HTTPException(
                    status_code=e.response.status_code,
                    detail=f"Ollama API error: {e.response.text}",
                )
            last_exception = e
            logger.warning(f"Server error (attempt {attempt + 1}/{MAX_RETRIES})")
            if attempt < MAX_RETRIES - 1:
                # No point sleeping after the final attempt
                await asyncio.sleep(RETRY_DELAY)
    raise HTTPException(
        status_code=503,
        detail=f"Ollama service unavailable after {MAX_RETRIES} retries: {str(last_exception)}",
    )
@app.post("/generate", response_model=GenerateResponse)
async def generate_text(request: GenerateRequest):
    """
    Generate text completion from the local model.

    This endpoint handles both streaming and non-streaming responses.
    For streaming, it returns a Server-Sent Events (SSE) stream.
    """
    app_state.total_requests += 1
    app_state.active_requests += 1
    try:
        payload = {
            "model": request.model,
            "prompt": request.prompt,
            "stream": request.stream,
            "options": {
                "temperature": request.temperature,
                "num_predict": request.max_tokens,
            },
        }
        if request.stream:
            # Return streaming response
            stream_generator = await query_ollama(payload, stream=True)

            async def event_stream():
                # Imported here because the module header does not import json
                import json

                full_response = ""
                async for line in stream_generator:
                    try:
                        data = json.loads(line)
                    except json.JSONDecodeError:
                        logger.error(f"Failed to parse Ollama response: {line}")
                        continue
                    if "response" in data:
                        full_response += data["response"]
                        yield f"data: {json.dumps({'token': data['response']})}\n\n"
                    if data.get("done"):
                        yield f"data: {json.dumps({'done': True, 'full_response': full_response})}\n\n"

            return StreamingResponse(
                event_stream(),
                media_type="text/event-stream",
                headers={
                    "Cache-Control": "no-cache",
                    "Connection": "keep-alive",
                    "X-Accel-Buffering": "no",
                },
            )
        else:
            # Non-streaming response
            result = await query_ollama(payload)
            # Calculate tokens per second for monitoring.
            # Ollama reports durations in nanoseconds. eval_duration covers
            # generation only; fall back to total_duration if it is absent.
            total_duration = result.get("total_duration", 0)
            tokens_generated = result.get("eval_count", 0)
            eval_duration = result.get("eval_duration") or total_duration
            tokens_per_second = None
            if eval_duration and tokens_generated:
                tokens_per_second = tokens_generated / (eval_duration / 1e9)
            return GenerateResponse(
                response=result.get("response", ""),
                model=request.model,
                created_at=result.get("created_at", ""),
                done=result.get("done", True),
                total_duration=total_duration,
                tokens_per_second=tokens_per_second,
            )
    except HTTPException:
        raise
    except Exception as e:
        logger.exception("Unexpected error during generation")
        raise HTTPException(status_code=500, detail=f"Internal server error: {str(e)}")
    finally:
        # For streaming responses this runs when the StreamingResponse is
        # returned, before the stream itself has finished
        app_state.active_requests -= 1
@app.get("/health", response_model=HealthResponse)
async def health_check():
    """Health check endpoint for monitoring and load balancers."""
    try:
        # Verify Ollama is still responsive
        response = await app_state.client.get(f"{OLLAMA_BASE_URL}/api/tags")
        response.raise_for_status()
        models = response.json().get("models", [])
        # /api/tags lists models that have been pulled, so this reports
        # availability on disk rather than residency in GPU memory
        model_loaded = any(m["name"] == DEFAULT_MODEL for m in models)
        return HealthResponse(
            status="healthy",
            model_loaded=model_loaded,
            uptime_seconds=time.time() - app_state.start_time,
        )
    except Exception as e:
        logger.error(f"Health check failed: {e}")
        return HealthResponse(
            status="degraded",
            model_loaded=False,
            uptime_seconds=time.time() - app_state.start_time,
        )
@app.post("/models/load")
async def load_model(model_name: str = DEFAULT_MODEL):
    """
    Explicitly load a model into GPU memory.

    This is useful for pre-loading models to avoid cold-start latency.
    """
    try:
        # A generate request with no prompt loads the model without producing output
        payload = {"model": model_name, "keep_alive": "5m"}
        response = await app_state.client.post(
            f"{OLLAMA_BASE_URL}/api/generate",
            json=payload,
        )
        response.raise_for_status()
        return {"status": "loaded", "model": model_name}
    except httpx.HTTPStatusError as e:
        raise HTTPException(
            status_code=e.response.status_code,
            detail=f"Failed to load model: {e.response.text}",
        )


@app.post("/models/unload")
async def unload_model(model_name: str = DEFAULT_MODEL):
    """
    Unload a model from GPU memory to free resources.

    Critical for managing memory in multi-model environments.
    """
    try:
        # keep_alive of 0 asks Ollama to release the model immediately
        payload = {"model": model_name, "keep_alive": 0}
        response = await app_state.client.post(
            f"{OLLAMA_BASE_URL}/api/generate",
            json=payload,
        )
        response.raise_for_status()
        return {"status": "unloaded", "model": model_name}
    except httpx.HTTPStatusError as e:
        raise HTTPException(
            status_code=e.response.status_code,
            detail=f"Failed to unload model: {e.response.text}",
        )
if __name__ == "__main__":
    import uvicorn

    uvicorn.run(
        "inference_server:app",
        host="0.0.0.0",
        port=8000,
        reload=False,
        workers=1,  # Single worker to avoid GPU memory conflicts
        log_level="info",
    )
This implementation addresses several critical production concerns:
Connection Pooling: The httpx.AsyncClient with connection pooling prevents socket exhaustion under high load. The max_keepalive_connections=5 setting ensures we reuse connections to Ollama rather than opening new ones for each request.
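Retry Logic: The query_ollama function retries transient failures up to MAX_RETRIES times. After a timeout it waits RETRY_DELAY multiplied by the attempt number (a linear backoff, not an exponential one), and after a 5xx response it waits a fixed RETRY_DELAY. 4xx responses are returned to the caller without a retry. This matters because Ollama can be slow to respond or temporarily reject requests while a model is loading.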
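For comparison, here is a small sketch of two common retry schedules. The query_ollama code above multiplies RETRY_DELAY by the attempt number on timeouts, which is the linear form. The exponential form doubles the delay on each attempt. The function names and the 30-second cap are illustrative assumptions, not part of the server code.

```python
def linear_backoff(attempt: int, base: float = 1.0) -> float:
    """Delay before retry number `attempt` (0-based): base, 2*base, 3*base, ..."""
    return base * (attempt + 1)


def exponential_backoff(attempt: int, base: float = 1.0, cap: float = 30.0) -> float:
    """Delay that doubles on each attempt, up to `cap`: base, 2*base, 4*base, ..."""
    return min(cap, base * (2 ** attempt))


linear = [linear_backoff(a) for a in range(4)]
exponential = [exponential_backoff(a) for a in range(4)]
print(linear)       # [1.0, 2.0, 3.0, 4.0]
print(exponential)  # [1.0, 2.0, 4.0, 8.0]
```

With only three attempts the two schedules barely differ. The exponential form becomes worthwhile if MAX_RETRIES is raised, since it backs off faster from a struggling backend.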
Memory Management: The /models/unload endpoint allows explicit GPU memory management. By default Ollama keeps a model in memory for five minutes after its last request, and a longer keep_alive extends that, so VRAM can stay occupied well after traffic stops. According to Ollama's documentation, setting keep_alive: 0 unloads the model immediately.
Streaming Support: The Server-Sent Events (SSE) implementation provides real-time token delivery while maintaining proper error handling. The X-Accel-Buffering: no header prevents nginx from buffering streaming responses in production deployments.
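The framing itself is simple enough to test in isolation. The following sketch mirrors the event_stream logic with plain synchronous generators: it turns Ollama-style newline-delimited JSON into SSE data frames and decodes them again on the client side. The names ndjson_to_sse and parse_sse are hypothetical helpers, and the sample lines are hand-written rather than captured from a real model.

```python
import json
from typing import Iterable, Iterator


def ndjson_to_sse(lines: Iterable[str]) -> Iterator[str]:
    """Convert Ollama-style NDJSON lines into SSE `data:` frames."""
    full_response = ""
    for line in lines:
        if not line.strip():
            continue
        try:
            chunk = json.loads(line)
        except json.JSONDecodeError:
            continue  # skip malformed lines rather than break the stream
        if "response" in chunk:
            full_response += chunk["response"]
            token_event = json.dumps({"token": chunk["response"]})
            yield f"data: {token_event}\n\n"
        if chunk.get("done"):
            final_event = json.dumps({"done": True, "full_response": full_response})
            yield f"data: {final_event}\n\n"


def parse_sse(frames: Iterable[str]) -> list[dict]:
    """Client side: strip the `data: ` prefix and decode each frame."""
    prefix = "data: "
    return [json.loads(f[len(prefix):]) for f in frames if f.startswith(prefix)]


sample = [
    '{"response": "Hel", "done": false}',
    '{"response": "lo", "done": false}',
    '{"response": "", "done": true}',
]
frames = list(ndjson_to_sse(sample))
events = parse_sse(frames)
print(events[-1])
```

Each frame ends with a blank line, which is what tells an SSE client that one event is complete.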
Production Deployment and Monitoring
Running this server in production requires additional considerations for reliability and observability. Let's create a deployment configuration and monitoring setup.
# config.py - Environment-based configuration
import os
from dataclasses import dataclass


@dataclass
class Settings:
    ollama_base_url: str = os.getenv("OLLAMA_BASE_URL", "http://localhost:11434")
    default_model: str = os.getenv("DEFAULT_MODEL", "llama3:7b")
    max_tokens: int = int(os.getenv("MAX_TOKENS", "4096"))
    request_timeout: int = int(os.getenv("REQUEST_TIMEOUT", "300"))
    max_concurrent_requests: int = int(os.getenv("MAX_CONCURRENT_REQUESTS", "4"))
    log_level: str = os.getenv("LOG_LEVEL", "INFO")
    # GPU memory management
    gpu_memory_fraction: float = float(os.getenv("GPU_MEMORY_FRACTION", "0.9"))
    model_keep_alive: str = os.getenv("MODEL_KEEP_ALIVE", "5m")


settings = Settings()
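One caveat with the Settings class above: the int(...) and float(...) conversions run at import time, so a malformed environment variable raises ValueError before the server starts. If a fallback is preferable to a crash, a defensive reader along these lines is one option. The helper name env_int and its minimum parameter are assumptions for illustration.

```python
import os


def env_int(name: str, default: int, minimum: int = 1) -> int:
    """Read an integer setting from the environment, falling back on bad input."""
    raw = os.getenv(name)
    if raw is None:
        return default
    try:
        value = int(raw)
    except ValueError:
        return default  # non-numeric value: keep the default
    return max(minimum, value)  # clamp nonsensical values such as 0 or -1


os.environ["MAX_CONCURRENT_REQUESTS"] = "8"
os.environ["REQUEST_TIMEOUT"] = "not-a-number"
concurrency = env_int("MAX_CONCURRENT_REQUESTS", default=4)
timeout = env_int("REQUEST_TIMEOUT", default=300)
print(concurrency, timeout)  # 8 300
```

Whether to fall back silently or fail fast is a deployment decision. Failing fast is often the safer choice when a wrong value could exhaust GPU memory.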
For production deployment, use a process manager like Supervisor or systemd to ensure the server restarts automatically:
# /etc/systemd/system/inference-server.service
[Unit]
Description=Local AI Inference Server
After=network.target ollama.service
[Service]
Type=simple
User=inference
WorkingDirectory=/opt/inference-server
Environment="OLLAMA_BASE_URL=http://localhost:11434"
Environment="DEFAULT_MODEL=llama3:7b"
Environment="MAX_CONCURRENT_REQUESTS=4"
ExecStart=/opt/inference-server/inference-env/bin/uvicorn inference_server:app --host 0.0.0.0 --port 8000 --workers 1
Restart=always
RestartSec=10
[Install]
WantedBy=multi-user.target
Monitoring with Prometheus metrics:
# metrics.py - Add to inference_server.py
# Requires: pip install prometheus-client
from prometheus_client import (
    CONTENT_TYPE_LATEST,
    Counter,
    Gauge,
    Histogram,
    generate_latest,
)
from fastapi.responses import Response

# Define metrics. These only declare the series; the request handlers still
# need to call .inc(), .observe(), and .set() for values to appear.
REQUEST_COUNT = Counter(
    "inference_requests_total",
    "Total inference requests",
    ["model", "streaming"],
)
REQUEST_DURATION = Histogram(
    "inference_request_duration_seconds",
    "Request duration in seconds",
    ["model"],
    buckets=(0.1, 0.5, 1.0, 2.0, 5.0, 10.0, 30.0, 60.0),
)
ACTIVE_REQUESTS = Gauge(
    "inference_active_requests",
    "Currently active requests",
)
TOKENS_PER_SECOND = Histogram(
    "inference_tokens_per_second",
    "Tokens generated per second",
    ["model"],
    buckets=(1, 5, 10, 20, 50, 100),
)


@app.get("/metrics")
async def metrics():
    """Prometheus metrics endpoint for monitoring."""
    # CONTENT_TYPE_LATEST carries the exposition-format version Prometheus expects
    return Response(content=generate_latest(), media_type=CONTENT_TYPE_LATEST)
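The TOKENS_PER_SECOND histogram needs a value to observe. One way to derive it, sketched here under the assumption that the response carries Ollama's eval_count and eval_duration fields (the latter in nanoseconds), is a small pure function. The function name and the sample numbers are made up for illustration.

```python
from typing import Optional


def tokens_per_second(ollama_result: dict) -> Optional[float]:
    """Derive generation throughput from an Ollama /api/generate response.

    eval_count is the number of generated tokens and eval_duration the time
    spent generating them, reported in nanoseconds.
    """
    eval_count = ollama_result.get("eval_count", 0)
    eval_duration_ns = ollama_result.get("eval_duration", 0)
    if not eval_count or not eval_duration_ns:
        return None  # fields missing or zero: nothing meaningful to report
    return eval_count / (eval_duration_ns / 1e9)


sample = {"eval_count": 120, "eval_duration": 4_000_000_000}
rate = tokens_per_second(sample)
print(rate)  # 30.0
```

In the handler, a non-None result could then be recorded with TOKENS_PER_SECOND.labels(model=...).observe(rate).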
Handling Edge Cases and Performance Optimization
Production inference servers face several edge cases that can degrade performance or cause failures. Here's how to handle them:
GPU Memory Fragmentation: Long-running servers accumulate memory fragmentation. Implement periodic model unloading:
async def scheduled_memory_cleanup():
    """Periodically unload and reload models to prevent memory fragmentation."""
    while True:
        await asyncio.sleep(3600)  # Every hour
        logger.info("Performing scheduled memory cleanup")
        try:
            await unload_model(settings.default_model)
            await asyncio.sleep(2)  # Allow GPU memory to be freed
            await load_model(settings.default_model)
        except Exception as e:
            logger.error(f"Memory cleanup failed: {e}")


# Add to lifespan
@asynccontextmanager
async def lifespan(app: FastAPI):
    # ... existing initialization ...
    cleanup_task = asyncio.create_task(scheduled_memory_cleanup())
    yield
    cleanup_task.cancel()
    # ... cleanup ...
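Calling cleanup_task.cancel() only requests cancellation; the task finishes unwinding the next time the event loop runs it. The sketch below shows the general pattern with a short interval so it completes quickly: start a periodic task, cancel it, then await it so shutdown does not race the task. The names run_periodically and demo are hypothetical.

```python
import asyncio
import contextlib


async def run_periodically(interval: float, job) -> None:
    """Call `job()` every `interval` seconds until the task is cancelled."""
    while True:
        await asyncio.sleep(interval)
        try:
            await job()
        except Exception as exc:  # keep the loop alive if one run fails
            print(f"periodic job failed: {exc}")


async def demo() -> tuple[int, bool]:
    runs = 0

    async def job() -> None:
        nonlocal runs
        runs += 1

    task = asyncio.create_task(run_periodically(0.01, job))
    await asyncio.sleep(0.1)  # let the job fire a few times
    task.cancel()
    # Await the task so cancellation completes before shutdown continues
    with contextlib.suppress(asyncio.CancelledError):
        await task
    return runs, task.cancelled()


runs, cancelled = asyncio.run(demo())
print(runs, cancelled)
```

Because asyncio.CancelledError derives from BaseException, the broad except Exception inside the loop does not swallow the cancellation.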
Rate Limiting: Protect against abuse and resource exhaustion:
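# Requires: pip install slowapi
from slowapi import Limiter, _rate_limit_exceeded_handler
from slowapi.errors import RateLimitExceeded
from slowapi.util import get_remote_address

limiter = Limiter(key_func=get_remote_address)
app.state.limiter = limiter
# Register the handler for slowapi's exception class
app.add_exception_handler(RateLimitExceeded, _rate_limit_exceeded_handler)


@app.post("/generate")
@limiter.limit("10/minute")  # Adjust based on your hardware
async def generate_text(request: Request, generate_request: GenerateRequest):
    # slowapi needs the Starlette Request argument to be named `request`;
    # the validated body is now available as `generate_request`
    ...  # existing implementation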
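To show what a per-client policy such as "10/minute" means in practice, here is a standard-library sketch of a sliding-window limiter. It is not how slowapi is implemented; it is a stand-in for illustration, and the class name, the injectable clock, and the sample address are all assumptions.

```python
import time
from collections import defaultdict, deque


class SlidingWindowLimiter:
    """Allow at most `limit` calls per `window` seconds for each client key."""

    def __init__(self, limit: int, window: float, clock=time.monotonic):
        self.limit = limit
        self.window = window
        self.clock = clock  # injectable so the behaviour can be tested
        self.hits: dict[str, deque] = defaultdict(deque)

    def allow(self, key: str) -> bool:
        now = self.clock()
        hits = self.hits[key]
        # Drop timestamps that have aged out of the window
        while hits and now - hits[0] >= self.window:
            hits.popleft()
        if len(hits) >= self.limit:
            return False
        hits.append(now)
        return True


fake_now = [0.0]
window_limiter = SlidingWindowLimiter(limit=2, window=60.0, clock=lambda: fake_now[0])
decisions = [window_limiter.allow("10.0.0.1") for _ in range(3)]
fake_now[0] = 61.0  # advance the fake clock past the window
after_window = window_limiter.allow("10.0.0.1")
print(decisions, after_window)  # [True, True, False] True
```

For an inference server the limit is best chosen from measured throughput: a client allowed more requests per minute than the GPU can serve simply moves the backlog into the queue.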
Concurrent Request Queue: When GPU memory is limited, queue requests to prevent out-of-memory errors:
import asyncio
from collections import deque


class RequestQueue:
    def __init__(self, max_concurrent: int = 4):
        self.semaphore = asyncio.Semaphore(max_concurrent)
        self.queue = deque()
        self.processing = set()

    async def acquire(self, request_id: str):
        """Acquire slot with queue position tracking."""
        self.queue.append(request_id)
        try:
            await self.semaphore.acquire()
        finally:
            # Remove this request by ID rather than popleft(), so a cancelled
            # waiter cannot evict another request's entry
            self.queue.remove(request_id)
        self.processing.add(request_id)

    def release(self, request_id: str):
        """Release slot and allow next queued request."""
        self.processing.discard(request_id)
        self.semaphore.release()


request_queue = RequestQueue(max_concurrent=settings.max_concurrent_requests)
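The effect of the semaphore can be checked with a short simulation. This sketch uses a bare asyncio.Semaphore rather than the RequestQueue class so that it runs on its own. The handler is a stand-in for the Ollama call, and the numbers are arbitrary.

```python
import asyncio


async def demo(max_concurrent: int = 2, total_requests: int = 6) -> int:
    semaphore = asyncio.Semaphore(max_concurrent)
    active = 0
    peak = 0

    async def handle(request_id: int) -> None:
        nonlocal active, peak
        async with semaphore:  # waits here when all slots are taken
            active += 1
            peak = max(peak, active)
            await asyncio.sleep(0.01)  # stand-in for the Ollama call
            active -= 1

    await asyncio.gather(*(handle(i) for i in range(total_requests)))
    return peak


peak = asyncio.run(demo())
print(peak)  # 2
```

Six requests are submitted at once, but no more than two are ever in flight. In the server, the same guarantee comes from pairing acquire with release in a try/finally block around the call to Ollama.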
Conclusion and What's Next
Building a local AI inference server with Ollama and FastAPI provides a production-ready foundation for deploying language models on your own hardware. This architecture handles concurrent requests, manages GPU memory, and provides monitoring capabilities essential for production environments.
The key takeaways from this tutorial:
- Ollama abstracts model complexity while providing a clean REST API
- FastAPI's async capabilities enable efficient concurrent request handling
- Proper error handling and retry logic are essential for reliability
- GPU memory management requires explicit model loading/unloading
- Monitoring with Prometheus metrics provides operational visibility
What's Next:
- Implement model caching with Redis for frequently used prompts
- Add authentication with API keys or JWT tokens
- Deploy behind nginx for SSL termination and load balancing
- Experiment with model quantization (e.g., llama3:7b-q4_0) for faster inference
- Implement request batching for higher throughput on GPU
For further reading, explore our guides on optimizing GPU memory for LLMs and building RAG pipelines with local models. The combination of local inference with retrieval-augmented generation creates powerful, privacy-preserving AI applications that run entirely on your infrastructure.
References