
How to Run Stable Diffusion Locally for Image Generation

Practical tutorial: Stable Diffusion is an interesting open-source model release that has significant implications for image generation tech

BlogIA Academy · June 5, 2026 · 12 min read · 2,244 words


Stable Diffusion represents a paradigm shift in how we approach image generation technology. Released in 2022 by Stability AI, this open-source text-to-image model democratized access to generative AI by allowing anyone with a decent GPU to run it locally. As of June 2026, Stable Diffusion remains the premier open-source image generation model, with a rating of 4.4 on our platform. Unlike proprietary alternatives that require API subscriptions and internet connectivity, Stable Diffusion can be deployed on your own hardware, giving you complete control over your data, costs, and generation parameters.

In this production-grade tutorial, you'll learn how to set up and run Stable Diffusion locally, optimize inference performance, handle edge cases like out-of-memory errors, and build a simple API server around your model. By the end, you'll have a fully functional local image generation pipeline that respects your privacy and scales with your hardware.

Understanding the Stable Diffusion Architecture

Before diving into code, it's critical to understand what makes Stable Diffusion tick. The model is a deep learning, text-to-image model based on diffusion techniques. At its core, it uses a latent diffusion architecture that operates in a compressed latent space rather than pixel space, dramatically reducing computational requirements.

The architecture consists of three main components:

  1. Text Encoder: A CLIP-based model that converts your text prompt into a sequence of 768-dimensional embedding [1] vectors, one per token (in the v1.x models)
  2. UNet: The denoising neural network that progressively removes noise from a latent representation
  3. VAE (Variational Autoencoder): Encodes images into latent space and decodes latents back to pixel space

The key insight is that diffusion happens in latent space: a 4-channel 64x64 tensor for 512x512 output (96x96 for 768x768). That latent holds 48x fewer values than the 512x512 RGB image, which makes Stable Diffusion far cheaper to run than models that denoise directly in pixel space, such as DALL-E 2.
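The 48x figure can be checked with a little arithmetic. The sketch below assumes the Stable Diffusion v1.x VAE: an 8x reduction per side and 4 latent channels. The function names are made up for illustration.

```python
# Assumed constants for the Stable Diffusion v1.x VAE.
VAE_DOWNSAMPLE = 8      # each spatial side shrinks by this factor
LATENT_CHANNELS = 4     # channels in the latent tensor
RGB_CHANNELS = 3        # channels in the decoded image


def latent_shape(width: int, height: int) -> tuple[int, int, int]:
    """Return (channels, latent_height, latent_width) for a given output size."""
    if width % VAE_DOWNSAMPLE or height % VAE_DOWNSAMPLE:
        raise ValueError("width and height must be multiples of 8")
    return (LATENT_CHANNELS, height // VAE_DOWNSAMPLE, width // VAE_DOWNSAMPLE)


def compression_factor(width: int, height: int) -> float:
    """Ratio of pixel-space values to latent-space values."""
    channels, lat_h, lat_w = latent_shape(width, height)
    pixel_values = RGB_CHANNELS * height * width
    latent_values = channels * lat_h * lat_w
    return pixel_values / latent_values


for size in (512, 768):
    print(size, latent_shape(size, size), compression_factor(size, size))
```

Because both the image and the latent scale with the output area, the ratio stays the same at 768x768; only the absolute tensor sizes grow.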

Prerequisites and Environment Setup

You'll need the following hardware and software:

Hardware Requirements:

  • GPU with at least 6GB VRAM (NVIDIA RTX 2060 or better)
  • 16GB system RAM
  • 20GB free disk space for model weights

Software Stack:

  • Python 3.10 or 3.11
  • CUDA 11.8+ (for NVIDIA GPUs)
  • PyTorch 2.x with CUDA support

Let's set up a clean environment:

# Create a fresh Python environment
python3.11 -m venv sd-env
source sd-env/bin/activate

# Upgrade pip and install core dependencies
pip install --upgrade pip setuptools wheel

# Install PyTorch with CUDA support (adjust CUDA version to match your system)
pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu118

# Install Hugging Face ecosystem
pip install transformers diffusers accelerate safetensors

# Install utility libraries
pip install pillow numpy fastapi uvicorn python-multipart

Verification Step: Test that CUDA is available:

import torch

print(f"CUDA Available: {torch.cuda.is_available()}")
if torch.cuda.is_available():
    # These calls raise an error when no CUDA device is present
    print(f"GPU: {torch.cuda.get_device_name(0)}")
    print(f"VRAM: {torch.cuda.get_device_properties(0).total_memory / 1e9:.1f} GB")

If CUDA is not available, you can still run Stable Diffusion on CPU, but generation times will be 50-100x slower (minutes per image instead of seconds).

Core Implementation: Building a Production-Ready Pipeline

Step 1: Loading the Model with Memory Optimization

The base Stable Diffusion v1.5 weights are roughly 4GB in full precision. Loading them naively in float32 can consume 8-10GB of VRAM during generation. We'll use memory-efficient techniques:

import gc

import torch
from diffusers import StableDiffusionPipeline

class StableDiffusionEngine:
    """Production-ready Stable Diffusion wrapper with memory optimization."""

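    def __init__(self, model_id: str = "stable-diffusion-v1-5/stable-diffusion-v1-5"):
        """
        Initialize the pipeline with memory-efficient settings.

        Args:
            model_id: Hugging Face model identifier. Default is v1.5, using the
                      community mirror of the original
                      "runwayml/stable-diffusion-v1-5" repository, which is no
                      longer hosted on the Hub.
                      For v2.1, use "stabilityai/stable-diffusion-2-1"
        """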
        self.device = "cuda" if torch.cuda.is_available() else "cpu"
        self.dtype = torch.float16 if self.device == "cuda" else torch.float32

        print(f"Loading {model_id} on {self.device} with {self.dtype} precision...")

        # Load pipeline with memory optimizations
        self.pipe = StableDiffusionPipeline.from_pretrained(
            model_id,
            torch_dtype=self.dtype,
            safety_checker=None,  # Disable safety checker for performance
            requires_safety_checker=False,
            use_safetensors=True,  # Use safetensors format for faster loading
        )

        # Move to GPU and enable memory optimizations
        self.pipe = self.pipe.to(self.device)

        if self.device == "cuda":
            # Enable attention slicing to reduce VRAM usage
            # This trades ~10% speed for ~30% memory reduction
            self.pipe.enable_attention_slicing()

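            # Enable model CPU offloading for large models.
            # If you uncomment this, also remove the .to(self.device) call above;
            # offloading manages device placement itself.
            # self.pipe.enable_model_cpu_offload()  # Useful if VRAM < 8GB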

        print("Model loaded successfully.")

    def generate(
        self,
        prompt: str,
        negative_prompt: str = "",
        num_inference_steps: int = 50,
        guidance_scale: float = 7.5,
        seed: int | None = None,
        width: int = 512,
        height: int = 512,
    ) -> "PIL.Image.Image":  # the pipeline returns PIL images, not tensors
        """
        Generate an image from a text prompt.

        Args:
            prompt: Text description of desired image
            negative_prompt: What to avoid in the image
            num_inference_steps: More steps = better quality but slower
            guidance_scale: How closely to follow the prompt (1-20)
            seed: Random seed for reproducibility
            width, height: Output dimensions (must be multiples of 8)

        Returns:
            PIL Image object
        """
        # Validate dimensions
        if width % 8 != 0 or height % 8 != 0:
            raise ValueError(f"Dimensions must be multiples of 8, got {width}x{height}")

        # Set seed for reproducibility
        generator = None
        if seed is not None:
            generator = torch.Generator(device=self.device).manual_seed(seed)

        # Clear GPU cache before generation
        if self.device == "cuda":
            torch.cuda.empty_cache()

        with torch.inference_mode():
            output = self.pipe(
                prompt=prompt,
                negative_prompt=negative_prompt,
                num_inference_steps=num_inference_steps,
                guidance_scale=guidance_scale,
                generator=generator,
                width=width,
                height=height,
            )

        return output.images[0]

    def cleanup(self):
        """Free GPU memory."""
        del self.pipe
        gc.collect()
        if self.device == "cuda":
            torch.cuda.empty_cache()

# Usage example
engine = StableDiffusionEngine()
image = engine.generate(
    prompt="a serene mountain landscape at sunset, digital art",
    negative_prompt="blurry, low quality, distorted",
    seed=42,
)
image.save("output.png")
engine.cleanup()

Key Architecture Decisions:

  1. Float16 Precision: Using half-precision (float16) on CUDA devices reduces memory usage by 50% with negligible quality loss. This is the single most impactful optimization.

  2. Attention Slicing: This technique processes attention computations in slices rather than all at once. It's essential for GPUs with less than 10GB VRAM.

  3. Safety Checker Disabled: The default safety checker (NSFW filter) adds overhead. In production, you might want to implement your own moderation pipeline separately.

  4. Safetensors Format: Using .safetensors files instead of PyTorch's default pickle format provides faster loading and better security.
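To make the attention slicing idea concrete, here is a toy pure-Python sketch. It computes attention once over the full score matrix and once in slices of query rows, then compares the results. It illustrates the memory-for-speed trade only; the library may split the work along a different axis (for example across attention heads), and all function names and input values here are hypothetical.

```python
import math


def softmax(row: list[float]) -> list[float]:
    """Numerically stable softmax over one row of attention scores."""
    peak = max(row)
    exps = [math.exp(x - peak) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def attention(q, k, v):
    """Full attention: builds the whole (len(q) x len(k)) score matrix at once."""
    scale = 1.0 / math.sqrt(len(q[0]))
    scores = [[scale * sum(a * b for a, b in zip(qi, kj)) for kj in k] for qi in q]
    weights = [softmax(row) for row in scores]
    return [[sum(w * vj[d] for w, vj in zip(row, v)) for d in range(len(v[0]))]
            for row in weights]


def sliced_attention(q, k, v, slice_size: int):
    """Same result, but only `slice_size` query rows are scored at a time."""
    out = []
    for start in range(0, len(q), slice_size):
        # Each chunk needs a (slice_size x len(k)) score matrix, not the full one.
        out.extend(attention(q[start:start + slice_size], k, v))
    return out


# Toy inputs: 6 tokens, 4-dimensional heads (hypothetical values).
q = [[0.1 * (i + j) for j in range(4)] for i in range(6)]
k = [[0.2 * (i - j) for j in range(4)] for i in range(6)]
v = [[float(i * j % 3) for j in range(4)] for i in range(6)]

full = attention(q, k, v)
sliced = sliced_attention(q, k, v, slice_size=2)
max_diff = max(abs(a - b) for ra, rb in zip(full, sliced) for a, b in zip(ra, rb))
print(f"max difference: {max_diff:.2e}")
```

The sliced version does the same arithmetic in more, smaller steps, which is why it saves peak memory at the cost of some speed.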

Step 2: Handling Edge Cases and Memory Management

Production systems must gracefully handle resource constraints. Here's a robust generation function with error handling:

import time
import logging
from typing import Optional, Tuple

logging.basicConfig(level=logging.INFO)
logger = logging.getLogger(__name__)

class RobustStableDiffusionGenerator:
    """Handles edge cases like OOM errors and long prompts."""

    MAX_PROMPT_LENGTH = 77  # CLIP token limit

    def __init__(self, engine: StableDiffusionEngine):
        self.engine = engine
        self.max_retries = 3

    def truncate_prompt(self, prompt: str) -> str:
        """Truncate prompt to fit CLIP's 77-token limit (75 tokens plus start/end markers)."""
        # Reuse the tokenizer already loaded with the pipeline instead of
        # loading a second copy on every call.
        tokenizer = self.engine.pipe.tokenizer
        # Tokenize without the start/end markers so they are neither counted
        # here nor decoded back into the prompt text.
        tokens = tokenizer.encode(prompt, add_special_tokens=False)
        max_tokens = self.MAX_PROMPT_LENGTH - 2  # leave room for special tokens

        if len(tokens) > max_tokens:
            tokens = tokens[:max_tokens]
            prompt = tokenizer.decode(tokens, skip_special_tokens=True)
            logger.warning(f"Prompt truncated to {len(tokens)} tokens")

        return prompt

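    def generate_with_fallback(
        self,
        prompt: str,
        **kwargs
    ) -> Tuple[Optional["PIL.Image.Image"], dict]:
        """
        Generate image with automatic fallback on OOM errors.

        Returns:
            Tuple of (PIL image or None, metadata dict)
        """
        # Pick a concrete seed when none is given so the result can be
        # reproduced and reported back to the caller.
        if kwargs.get("seed") is None:
            kwargs["seed"] = torch.Generator().seed() % (2**32)

        metadata = {
            "prompt": prompt,
            "seed": kwargs["seed"],
            "timestamp": time.time(),
            "attempts": 0,
            "success": False
        }

        # Truncate prompt if needed
        prompt = self.truncate_prompt(prompt)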

        for attempt in range(self.max_retries):
            metadata["attempts"] = attempt + 1

            try:
                start_time = time.time()
                image = self.engine.generate(prompt, **kwargs)
                generation_time = time.time() - start_time

                metadata["success"] = True
                metadata["generation_time"] = generation_time

                logger.info(f"Generated image in {generation_time:.2f}s (attempt {attempt + 1})")
                return image, metadata

            except torch.cuda.OutOfMemoryError:
                logger.warning(f"OOM on attempt {attempt + 1}. Reducing memory usage...")

                # Progressive fallback strategy
                if attempt == 0:
                    # First fallback: reduce image size
                    kwargs["width"] = min(kwargs.get("width", 512), 384)
                    kwargs["height"] = min(kwargs.get("height", 512), 384)
                elif attempt == 1:
                    # Second fallback: slice attention as finely as possible
                    # (this lowers peak VRAM) and cut inference steps, which
                    # does not save memory but keeps the slower retry short
                    self.engine.pipe.enable_attention_slicing("max")
                    kwargs["num_inference_steps"] = min(
                        kwargs.get("num_inference_steps", 50), 25
                    )

                # Clear cache before retry
                torch.cuda.empty_cache()
                time.sleep(1)  # Allow system to stabilize

            except Exception as e:
                logger.error(f"Generation failed: {str(e)}")
                metadata["error"] = str(e)
                break

        return None, metadata

# Usage with error handling
generator = RobustStableDiffusionGenerator(engine)
image, meta = generator.generate_with_fallback(
    prompt="a detailed oil painting of a futuristic city with flying cars",
    num_inference_steps=50,
    seed=123,
)

if meta["success"]:
    image.save("futuristic_city.png")
    print(f"Generated in {meta['generation_time']:.2f}s")
else:
    print(f"Generation failed after {meta['attempts']} attempts")

Edge Cases Handled:

  1. Out-of-Memory (OOM) Errors: The fallback strategy progressively reduces resolution and inference steps. Lower resolution is what actually cuts peak VRAM; fewer steps mainly shortens the retry. This is critical for production systems where hardware varies.

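  2. Prompt Length Limits: The CLIP text encoder accepts at most 77 tokens, including the start and end markers. Longer prompts are cut off at that limit (diffusers only logs a warning), which can drop the end of your description and produce unexpected results. We handle this explicitly.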

  3. GPU Cache Management: We clear the CUDA cache before each generation and after errors to prevent memory fragmentation.
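The retry ladder is easier to reason about in isolation. The following is a minimal simulation, not the tutorial's generator: `FakeOOM`, `fake_generate`, and the pixel budget are hypothetical stand-ins for a real GPU running out of memory.

```python
class FakeOOM(RuntimeError):
    """Stand-in for torch.cuda.OutOfMemoryError in this simulation."""


def fake_generate(width: int, height: int, budget_pixels: int) -> str:
    """Hypothetical generator: fails when the request exceeds a pixel budget."""
    if width * height > budget_pixels:
        raise FakeOOM(f"{width}x{height} exceeds budget")
    return f"image {width}x{height}"


def generate_with_ladder(width, height, budget_pixels, ladder=(384, 256)):
    """Try the requested size first, then each smaller size in `ladder`."""
    attempts = []
    sizes = [(width, height)] + [(min(width, s), min(height, s)) for s in ladder]
    for w, h in sizes:
        attempts.append((w, h))
        try:
            return fake_generate(w, h, budget_pixels), attempts
        except FakeOOM:
            continue  # fall through to the next, smaller size
    return None, attempts


result, tried = generate_with_ladder(512, 512, budget_pixels=400 * 400)
print(result, tried)
```

Returning the list of attempted sizes alongside the result mirrors the metadata dict above: the caller can see that the image it received is smaller than the one it asked for.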

Step 3: Building a FastAPI Inference Server

For production deployment, you'll want an API server. Here's a minimal but production-ready implementation:

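import base64
import logging
import time
from io import BytesIO
from typing import Optional
from fastapi import FastAPI, HTTPException
from pydantic import BaseModel, Field

# StableDiffusionEngine and RobustStableDiffusionGenerator come from Steps 1 and 2
logger = logging.getLogger(__name__)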

app = FastAPI(title="Stable Diffusion API", version="1.0.0")

# Initialize engine at startup
engine = None

class GenerationRequest(BaseModel):
    prompt: str = Field(..., min_length=1, max_length=500)
    negative_prompt: str = Field("", max_length=500)
    num_inference_steps: int = Field(50, ge=1, le=150)
    guidance_scale: float = Field(7.5, ge=1.0, le=20.0)
    seed: Optional[int] = None
    width: int = Field(512, ge=64, le=1024, multiple_of=8)
    height: int = Field(512, ge=64, le=1024, multiple_of=8)

class GenerationResponse(BaseModel):
    image_base64: str
    generation_time: float
    seed: int

@app.on_event("startup")
async def load_model():
    global engine
    engine = StableDiffusionEngine()
    logger.info("Model loaded and ready")

@app.on_event("shutdown")
async def unload_model():
    if engine:
        engine.cleanup()

@app.post("/generate", response_model=GenerationResponse)
async def generate_image(request: GenerationRequest):
    """
    Generate an image from a text prompt.

    Returns base64-encoded PNG image with metadata.
    """
    if engine is None:
        raise HTTPException(status_code=503, detail="Model not loaded")

    generator = RobustStableDiffusionGenerator(engine)

    start_time = time.time()
    image, metadata = generator.generate_with_fallback(
        prompt=request.prompt,
        negative_prompt=request.negative_prompt,
        num_inference_steps=request.num_inference_steps,
        guidance_scale=request.guidance_scale,
        seed=request.seed,
        width=request.width,
        height=request.height,
    )

    if not metadata["success"]:
        raise HTTPException(
            status_code=500,
            detail=f"Generation failed: {metadata.get('error', 'Unknown error')}"
        )

    # Convert PIL Image to base64
    buffer = BytesIO()
    image.save(buffer, format="PNG")
    image_base64 = base64.b64encode(buffer.getvalue()).decode()

    return GenerationResponse(
        image_base64=image_base64,
        generation_time=time.time() - start_time,
        seed=metadata.get("seed", 0),
    )

@app.get("/health")
async def health_check():
    """Health check endpoint."""
    return {
        "status": "healthy",
        "model_loaded": engine is not None,
        "device": engine.device if engine else "unknown"
    }

# Run with: uvicorn server:app --host 0.0.0.0 --port 8000

Production Considerations:

  1. Request Validation: Pydantic models enforce type safety and value constraints (e.g., dimensions must be multiples of 8).

  2. Graceful Startup/Shutdown: The model loads once at startup and cleans up on shutdown, preventing memory leaks.

  3. Health Checks: Essential for container orchestration (Kubernetes liveness probes).

  4. Base64 Encoding: For API responses, base64 is universally supported. For high-throughput systems, consider streaming or direct file storage.
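On the client side, the base64 field has to be decoded back into PNG bytes. Here is a small sketch using only the standard library; the payload is simulated rather than fetched from the server, and `decode_image_payload` is a hypothetical helper name.

```python
import base64

PNG_SIGNATURE = b"\x89PNG\r\n\x1a\n"  # first 8 bytes of every PNG file


def decode_image_payload(payload: dict) -> bytes:
    """Decode the `image_base64` field and check that it looks like a PNG."""
    raw = base64.b64decode(payload["image_base64"], validate=True)
    if not raw.startswith(PNG_SIGNATURE):
        raise ValueError("payload does not contain PNG data")
    return raw


# Simulated response body; a real one would come from POST /generate.
fake_png = PNG_SIGNATURE + b"placeholder-bytes"
payload = {
    "image_base64": base64.b64encode(fake_png).decode(),
    "generation_time": 3.2,
    "seed": 42,
}

png_bytes = decode_image_payload(payload)
encoded_len = len(payload["image_base64"])
print(len(png_bytes), encoded_len)
```

Base64 encodes every 3 bytes as 4 characters, so the response body is about a third larger than the raw PNG. That overhead is one reason to consider streaming or file storage for high-throughput systems.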

Performance Optimization and Benchmarking

Let's benchmark our implementation to understand real-world performance:

import time
import pandas as pd  # not installed in the setup step; run: pip install pandas

def benchmark_generation(engine, prompts, num_runs=3):
    """Benchmark generation times for different configurations."""
    results = []

    for prompt in prompts:
        for steps in [20, 50, 100]:
            for size in [(512, 512), (768, 768)]:
                times = []
                for _ in range(num_runs):
                    start = time.time()
                    engine.generate(
                        prompt=prompt,
                        num_inference_steps=steps,
                        width=size[0],
                        height=size[1],
                    )
                    times.append(time.time() - start)

                results.append({
                    "prompt": prompt[:30] + "...",
                    "steps": steps,
                    "size": f"{size[0]}x{size[1]}",
                    "avg_time": sum(times) / len(times),
                    "min_time": min(times),
                    "max_time": max(times),
                })

    return pd.DataFrame(results)

# Run benchmark
prompts = [
    "a cat wearing a hat",
    "a detailed fantasy landscape with mountains and rivers",
    "a photorealistic portrait of an elderly man with wrinkles",
]

df = benchmark_generation(engine, prompts)
print(df.to_string())

Expected Performance (NVIDIA RTX 3090):

  • 512x512, 50 steps: ~3-4 seconds
  • 768x768, 50 steps: ~8-10 seconds
  • 512x512, 20 steps: ~1.5-2 seconds (lower quality)
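Numbers like these depend heavily on hardware, and the first call often includes one-off setup cost. One way to get steadier measurements is to discard a warm-up run before timing, as in the sketch below. The helper name `time_call` is an assumption for illustration, and the workload is a stand-in so the snippet runs without a GPU.

```python
import statistics
import time


def time_call(fn, *, warmup: int = 1, runs: int = 3) -> dict:
    """Time `fn()` after discarding warm-up calls; all values are in seconds."""
    for _ in range(warmup):
        fn()  # first calls often include one-off setup cost
    samples = []
    for _ in range(runs):
        start = time.perf_counter()
        fn()
        samples.append(time.perf_counter() - start)
    return {
        "mean": statistics.mean(samples),
        "min": min(samples),
        "max": max(samples),
        "runs": runs,
    }


# Stand-in workload; replace with a call such as
# lambda: engine.generate(prompt="a cat wearing a hat", num_inference_steps=20)
stats = time_call(lambda: sum(i * i for i in range(100_000)), warmup=1, runs=3)
print(f"mean {stats['mean']:.4f}s, min {stats['min']:.4f}s, max {stats['max']:.4f}s")
```

`time.perf_counter()` is used instead of `time.time()` because it is monotonic and has higher resolution, which matters for short runs.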

What's Next

You now have a production-ready Stable Diffusion pipeline that handles memory constraints, edge cases, and provides a clean API interface. Here are your next steps:

  1. Explore Fine-Tuning [3]: Use DreamBooth or LoRA to fine-tune the model on custom datasets. Check our guide on model fine-tuning techniques.

  2. Implement Batch Processing: For generating multiple images, batch inference can improve throughput by 2-3x.

  3. Add Image-to-Image: Stable Diffusion supports img2img generation where you start from an existing image. The StableDiffusionImg2ImgPipeline is well-documented.

  4. Monitor Production: Add Prometheus metrics for request latency, memory usage, and generation counts.

  5. Consider Cloud Deployment: For scaling beyond a single GPU, explore services like Replicate, which hosts Stable Diffusion. The official Stability AI website at https://stability.ai provides enterprise deployment options.
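For the batch processing step, a natural starting point is to split a list of prompts into fixed-size groups. This is a minimal sketch; `batched` is a hypothetical helper, and the right batch size depends on available VRAM.

```python
from typing import Iterator


def batched(prompts: list[str], batch_size: int) -> Iterator[list[str]]:
    """Yield consecutive slices of `prompts` with at most `batch_size` items."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    for start in range(0, len(prompts), batch_size):
        yield prompts[start:start + batch_size]


prompts = [f"landscape study #{i}" for i in range(5)]
batches = list(batched(prompts, batch_size=2))
print([len(b) for b in batches])

# With a loaded diffusers pipeline, each batch could be passed as a list:
#     images = pipe(prompt=batch, num_inference_steps=30).images
```

Larger batches raise peak memory roughly in proportion to their size, so it is worth starting small and increasing until generation fails or stops getting faster.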

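Remember that the Stable Diffusion v1.x weights are openly released under the CreativeML OpenRAIL-M license: you can modify, distribute, and deploy them without licensing fees, subject to the license's use-based restrictions. Later checkpoints may carry different terms, so check the license of the model you deploy. This makes it an ideal foundation for building custom image generation applications that respect user privacy and data sovereignty.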

The model's rating of 4.4 reflects its strong community support and continuous improvement through community checkpoints. For a comprehensive list of community models, check rentry.org/sdmodels. As the AI landscape evolves, Stable Diffusion remains the gold standard for open-source image generation, and mastering its deployment gives you a significant advantage in building generative AI applications.


References

1. Embedding. Wikipedia.
2. Stable Diffusion. Wikipedia.
3. Fine-tuning. Wikipedia.
4. fighting41love/funNLP. GitHub.
5. hiyouga/LlamaFactory. GitHub.
6. danny-avila/LibreChat. GitHub.