How to Extract Structured Data from PDFs with Claude 3.5 Sonnet
Why PDF Data Extraction Still Fails in Production
Every engineering team eventually faces the PDF problem. You have hundreds or thousands of PDFs—invoices, contracts, research papers, or medical records—and you need structured data out of them. Traditional approaches fail consistently: regex patterns break on formatting variations, OCR introduces errors, and template-based parsers require maintenance for every new document layout.
As of June 2026, Claude 3.5 Sonnet from Anthropic [8] offers a fundamentally different approach. Instead of parsing PDFs as text blobs, you can feed the actual rendered pages as images and use the model's vision capabilities to extract structured information with high accuracy. This tutorial walks through a production-ready implementation that handles real-world edge cases: multi-page documents, tables, mixed formatting, and API rate limits.
Real-World Use Case and Architecture
Consider a medical claims processing pipeline. Your system receives PDFs from dozens of healthcare providers, each with different invoice formats. You need to extract: provider name, patient ID, service dates, CPT codes, charges, and insurance payments. A single extraction error can delay reimbursement by weeks.
The architecture we'll build uses a two-phase approach:
- PDF preprocessing: Convert PDF pages to high-quality images using pdf2image with Poppler, handling DPI, color mode, and page splitting
- Structured extraction: Send each page image to Claude 3.5 Sonnet via the Anthropic API with a carefully crafted system prompt that forces JSON output
This avoids the common pitfall of sending raw PDF text, which loses the layout information needed to understand tables and multi-column documents. According to Anthropic's documentation, Claude 3.5 Sonnet accepts images up to 8,000 pixels on the longest side. That comfortably covers standard PDF pages at 200 DPI: a US Letter page renders at 1700 × 2200 pixels.
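As a quick sanity check on those numbers, the sketch below computes the rendered pixel size of a page at a given DPI and compares the longest side with a dimension limit. The page sizes, the helper names `page_pixels` and `fits_limit`, and the 8,000-pixel default are illustrative assumptions based on the figure quoted above; they are not part of pdf2image or the Anthropic SDK.

```python
# Hypothetical helpers: page sizes (in inches) and the pixel limit are assumptions.
PAGE_SIZES_IN = {
    "letter": (8.5, 11.0),
    "a4": (8.27, 11.69),
}


def page_pixels(page: str, dpi: int) -> tuple[int, int]:
    """Return (width, height) in pixels for a named page size rendered at `dpi`."""
    width_in, height_in = PAGE_SIZES_IN[page]
    return round(width_in * dpi), round(height_in * dpi)


def fits_limit(page: str, dpi: int, max_dimension: int = 8000) -> bool:
    """True if the longest rendered side stays within `max_dimension` pixels."""
    return max(page_pixels(page, dpi)) <= max_dimension


if __name__ == "__main__":
    for name in PAGE_SIZES_IN:
        print(name, page_pixels(name, 200), fits_limit(name, 200))
```

Running it for other DPI values shows how much headroom a given page size leaves before any downscaling would be needed.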
Prerequisites and Environment Setup
You'll need Python 3.10+ and the following dependencies. Create a virtual environment first:
python -m venv pdf-extractor
source pdf-extractor/bin/activate # On Windows: pdf-extractor\Scripts\activate
Install core dependencies:
pip install anthropic pdf2image pillow pydantic python-dotenv
For pdf2image to work, you need Poppler installed on your system:
- macOS: brew install poppler
- Ubuntu/Debian: sudo apt-get install poppler-utils
- Windows: Download from poppler-windows and add to PATH
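Before running the pipeline, it can help to confirm that the Poppler command-line tools are actually reachable. The following is a minimal sketch using only the standard library; the function names are hypothetical, and the tool list assumes the usual Poppler binaries (`pdftoppm`, `pdftocairo`, `pdfinfo`).

```python
import shutil
from typing import Optional

# pdf2image shells out to Poppler's command-line tools; these ship with Poppler.
POPPLER_TOOLS = ("pdftoppm", "pdftocairo", "pdfinfo")


def find_poppler_tools() -> dict[str, Optional[str]]:
    """Map each Poppler tool name to its resolved path, or None if not on PATH."""
    return {tool: shutil.which(tool) for tool in POPPLER_TOOLS}


def missing_poppler_tools() -> list[str]:
    """Return the names of Poppler tools that could not be found on PATH."""
    return [tool for tool, path in find_poppler_tools().items() if path is None]


if __name__ == "__main__":
    missing = missing_poppler_tools()
    if missing:
        print("Missing Poppler tools:", ", ".join(missing))
    else:
        print("All Poppler tools found")
```

A non-empty result from `missing_poppler_tools()` usually means Poppler is not installed or its `bin` directory is not on PATH.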
Create a .env file with your Anthropic API key:
ANTHROPIC_API_KEY=sk-ant-..
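The key still has to reach the process environment. With python-dotenv installed, calling `load_dotenv()` at startup typically copies the `.env` entries into `os.environ`. The sketch below covers only the second half, failing fast when the key is absent; `require_api_key` is a hypothetical helper name, not part of any SDK.

```python
import os
from typing import Mapping, Optional

# In the real script you would first run:
#     from dotenv import load_dotenv
#     load_dotenv()
# so that values from .env are present in os.environ.


def require_api_key(env: Optional[Mapping[str, str]] = None) -> str:
    """Return ANTHROPIC_API_KEY from `env` (default: os.environ) or raise a clear error."""
    source = os.environ if env is None else env
    key = source.get("ANTHROPIC_API_KEY", "").strip()
    if not key:
        raise RuntimeError(
            "ANTHROPIC_API_KEY is not set; add it to .env or export it in your shell"
        )
    return key
```

Passing a plain dictionary as `env` makes the check easy to exercise in tests without touching the real environment.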
Core Implementation: Production-Grade PDF Extractor
Step 1: PDF to Image Conversion with Quality Control
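The first critical decision is image quality. If the DPI is too low, Claude misses fine print; if it is too high, every page becomes a larger upload and may exceed the per-image size limits. Page count matters too: Anthropic's vision documentation describes stricter per-image dimension limits once a request carries more than 20 images, so check the current limits before sending long documents in a single call. We'll use 200 DPI as a balance: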
import os
from pathlib import Path
from pdf2image import convert_from_path
from PIL import Image
import io
import base64


class PDFImageConverter:
    """Converts PDF pages to base64-encoded JPEG images for API consumption."""

    def __init__(self, dpi: int = 200, max_dimension: int = 8000):
        self.dpi = dpi
        self.max_dimension = max_dimension

    def convert(self, pdf_path: str) -> list[dict]:
        """
        Convert PDF to list of image data dicts.

        Args:
            pdf_path: Path to PDF file

        Returns:
            List of dicts with 'type', 'media_type', and 'data' keys
            (the base64 image source format)

        Raises:
            FileNotFoundError: If PDF doesn't exist
            ValueError: If PDF has no pages
        """
        if not os.path.exists(pdf_path):
            raise FileNotFoundError(f"PDF not found: {pdf_path}")

        # Convert PDF pages to PIL Images
        images = convert_from_path(
            pdf_path,
            dpi=self.dpi,
            fmt='jpeg',
            grayscale=False,       # Keep color for tables/highlights
            use_pdftocairo=True,   # Better quality than pdftoppm
            thread_count=4         # Parallel processing for multi-page PDFs
        )
        if not images:
            raise ValueError(f"No pages found in PDF: {pdf_path}")

        # Validate and encode each page
        encoded_pages = []
        for i, img in enumerate(images):
            # Resize if exceeds max dimension while maintaining aspect ratio
            if max(img.size) > self.max_dimension:
                ratio = self.max_dimension / max(img.size)
                new_size = (int(img.width * ratio), int(img.height * ratio))
                img = img.resize(new_size, Image.Resampling.LANCZOS)

            # Convert to JPEG bytes
            buffer = io.BytesIO()
            img.save(buffer, format='JPEG', quality=85, optimize=True)
            buffer.seek(0)

            # Base64 encode for API
            encoded = base64.b64encode(buffer.getvalue()).decode('utf-8')
            encoded_pages.append({
                'type': 'base64',
                'media_type': 'image/jpeg',
                'data': encoded
            })
        return encoded_pages
Edge case handling: The converter handles:
- Missing files with explicit FileNotFoundError
- Empty PDFs (0 pages) with ValueError
- Oversized images by downscaling while preserving aspect ratio
- Payload size via optimize=True in JPEG compression (this shrinks the encoded file, not the in-memory image)
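The downscaling rule can be isolated as a pure function, which makes it easy to test without rendering a PDF. This is a minimal sketch that mirrors the arithmetic in the converter above; `fit_within` is a hypothetical name introduced for illustration.

```python
def fit_within(width: int, height: int, max_dimension: int) -> tuple[int, int]:
    """Scale (width, height) down so the longest side is at most `max_dimension`.

    Sizes already within the limit are returned unchanged. The aspect ratio is
    preserved up to integer truncation, as in the converter's resize step.
    """
    longest = max(width, height)
    if longest <= max_dimension:
        return width, height
    ratio = max_dimension / longest
    # Guard against a zero-sized side for extremely thin images
    return max(1, int(width * ratio)), max(1, int(height * ratio))
```

For example, a 16000 × 8000 image with a limit of 8000 is halved on both sides, while a 1700 × 2200 page is left untouched.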
Step 2: Structured Extraction with Claude 3.5 Sonnet
The extraction function uses Pydantic models for type-safe output and a carefully engineered system prompt:
from anthropic import Anthropic
from pydantic import BaseModel, Field, ValidationError
from typing import Optional, Literal
import json
import time
from datetime import datetime


class ExtractedInvoice(BaseModel):
    """Structured invoice data model."""
    provider_name: str = Field(description="Healthcare provider or facility name")
    patient_id: str = Field(description="Patient identifier (MRN or account number)")
    service_date: str = Field(description="Date of service in YYYY-MM-DD format")
    cpt_codes: list[str] = Field(description="List of CPT procedure codes")
    total_charges: float = Field(description="Total billed amount")
    insurance_payment: Optional[float] = Field(default=None, description="Insurance payment amount")
    patient_balance: Optional[float] = Field(default=None, description="Remaining patient responsibility")
    confidence: Literal["high", "medium", "low"] = Field(description="Extraction confidence level")


class PDFExtractor:
    """Extracts structured data from PDF invoices using Claude 3.5 Sonnet."""

    SYSTEM_PROMPT = """You are a precise data extraction system. Extract structured information from the provided invoice PDF page images.
Rules:
1. Extract ONLY information that is visibly present in the images
2. For dates, convert to YYYY-MM-DD format
3. For monetary values, extract as numbers (no currency symbols)
4. If a field is not visible, use null (not "N/A" or empty string)
5. Set confidence to "high" if all fields are clearly visible, "medium" if some are ambiguous, "low" if multiple fields are unclear
6. Return ONLY valid JSON, no markdown formatting or additional text
Output exactly this JSON structure:
{
  "provider_name": "string or null",
  "patient_id": "string or null",
  "service_date": "YYYY-MM-DD or null",
  "cpt_codes": ["string"],
  "total_charges": number or null,
  "insurance_payment": number or null,
  "patient_balance": number or null,
  "confidence": "high|medium|low"
}"""

    def __init__(self, api_key: str, model: str = "claude-3-5-sonnet-20241022"):
        self.client = Anthropic(api_key=api_key)
        self.model = model
        self.converter = PDFImageConverter()

    def extract(self, pdf_path: str, max_retries: int = 3) -> ExtractedInvoice:
        """
        Extract structured data from a PDF invoice.

        Args:
            pdf_path: Path to PDF file
            max_retries: Number of attempts when the response cannot be parsed or validated

        Returns:
            ExtractedInvoice with parsed data

        Raises:
            RuntimeError: If extraction fails after all retries
        """
        # Convert PDF pages to base64 image sources
        images = self.converter.convert(pdf_path)

        # Build message content: a text instruction followed by one image block per page.
        # The Messages API expects each image wrapped as {"type": "image", "source": {...}}.
        content = [{"type": "text", "text": "Extract invoice data from these pages:"}]
        content.extend({"type": "image", "source": img} for img in images)

        for attempt in range(max_retries):
            try:
                response = self.client.messages.create(
                    model=self.model,
                    max_tokens=1024,
                    system=self.SYSTEM_PROMPT,
                    messages=[{
                        "role": "user",
                        "content": content
                    }]
                )
                # Parse JSON from response
                raw_text = response.content[0].text

                # Handle potential markdown code blocks
                if "```json" in raw_text:
                    raw_text = raw_text.split("```json")[1].split("```")[0]
                elif "```" in raw_text:
                    raw_text = raw_text.split("```")[1].split("```")[0]
                data = json.loads(raw_text.strip())

                # Validate with Pydantic
                return ExtractedInvoice(**data)
            except (json.JSONDecodeError, ValidationError, KeyError) as e:
                if attempt == max_retries - 1:
                    raise RuntimeError(
                        f"Failed to extract valid data after {max_retries} attempts: {e}"
                    ) from e
                time.sleep(2 ** attempt)  # Exponential backoff: 1s, 2s, 4s, ...

        raise RuntimeError("Unexpected error in extraction loop")
Key design decisions:
- System prompt engineering: The prompt explicitly forbids markdown formatting and requires null for missing fields. This discourages Claude from inventing data or returning non-JSON responses.
- Retry with exponential backoff: The loop retries when a response fails JSON parsing or Pydantic validation, sleeping 1 second after the first failure and doubling after each subsequent one (2 ** attempt). Exceptions raised by the API call itself, such as rate-limit errors, are not caught by this loop and propagate to the caller.
- Markdown stripping: Claude sometimes wraps JSON in markdown code blocks. The parser handles both the ```json and bare ``` variants.
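The markdown-stripping step is easy to pull out and test in isolation. The sketch below is one possible standalone version, not the extractor's exact logic: it takes the body of the first fenced block if one exists and otherwise returns the text unchanged. The names `strip_code_fence` and `parse_model_json` are hypothetical.

```python
import json

# The markdown fence marker, built indirectly so this listing stays easy to embed.
FENCE = "`" * 3


def strip_code_fence(raw_text: str) -> str:
    """Return the payload inside the first fenced block, or the text unchanged."""
    text = raw_text.strip()
    start = text.find(FENCE)
    if start == -1:
        return text  # No fence at all: assume the text is already bare JSON
    # Skip the opening fence line, including an optional language tag such as "json"
    newline = text.find("\n", start)
    if newline == -1:
        return text
    end = text.find(FENCE, newline)
    body = text[newline + 1:end] if end != -1 else text[newline + 1:]
    return body.strip()


def parse_model_json(raw_text: str) -> dict:
    """Parse a model response that may or may not be wrapped in a code fence."""
    return json.loads(strip_code_fence(raw_text))
```

Because it searches for the first fence rather than assuming the response starts with one, this version also tolerates a short preamble before the fenced JSON.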
Step 3: Batch Processing with Rate Limiting
Production systems process hundreds of PDFs. The Anthropic API has rate limits (as of June 2026, the standard tier allows 5 requests per second for Claude 3.5 Sonnet). Here's a batch processor with concurrency control:
import asyncio
import time
from typing import AsyncGenerator, Optional
from dataclasses import dataclass


@dataclass
class ExtractionResult:
    pdf_path: str
    data: Optional[ExtractedInvoice]
    error: Optional[str]
    processing_time: float


class BatchPDFProcessor:
    """Process multiple PDFs with rate limiting and error handling."""

    def __init__(self, api_key: str, max_concurrent: int = 5):
        self.extractor = PDFExtractor(api_key)
        self.semaphore = asyncio.Semaphore(max_concurrent)

    async def process_single(self, pdf_path: str) -> ExtractionResult:
        """Process a single PDF with semaphore-based concurrency control."""
        start = time.time()
        async with self.semaphore:
            try:
                # Run synchronous extraction in thread pool
                data = await asyncio.to_thread(self.extractor.extract, pdf_path)
                return ExtractionResult(
                    pdf_path=pdf_path,
                    data=data,
                    error=None,
                    processing_time=time.time() - start
                )
            except Exception as e:
                return ExtractionResult(
                    pdf_path=pdf_path,
                    data=None,
                    error=str(e),
                    processing_time=time.time() - start
                )

    async def process_batch(self, pdf_paths: list[str]) -> AsyncGenerator[ExtractionResult, None]:
        """Process multiple PDFs, yielding results as they complete."""
        tasks = [self.process_single(path) for path in pdf_paths]
        for coro in asyncio.as_completed(tasks):
            yield await coro
Production considerations:
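- Semaphore-based throttling: Caps the number of in-flight API calls, which reduces the likelihood of 429 errors. It bounds concurrency rather than requests per second, so tune max_concurrent against your actual rate limit
- Thread pool execution: PDFExtractor uses the synchronous Anthropic client, so we use asyncio.to_thread to avoid blocking the event loop
- Graceful error handling: Each PDF failure is captured individually without crashing the batch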
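The combination of a semaphore, `asyncio.to_thread`, and `asyncio.as_completed` can be tried without any API key. The sketch below replaces the extraction call with a short sleep and records the peak number of simultaneous calls; `ConcurrencyProbe` and `run_bounded` are hypothetical names used only for this demonstration.

```python
import asyncio
import threading
import time


class ConcurrencyProbe:
    """Records the peak number of simultaneously running blocking calls."""

    def __init__(self) -> None:
        self._lock = threading.Lock()
        self.active = 0
        self.peak = 0

    def blocking_work(self, item: int) -> int:
        with self._lock:
            self.active += 1
            self.peak = max(self.peak, self.active)
        time.sleep(0.05)  # Stand-in for a synchronous API call
        with self._lock:
            self.active -= 1
        return item * 2


async def run_bounded(items: list[int], max_concurrent: int) -> tuple[list[int], int]:
    """Run blocking work for each item, never exceeding `max_concurrent` at once."""
    probe = ConcurrencyProbe()
    semaphore = asyncio.Semaphore(max_concurrent)

    async def one(item: int) -> int:
        async with semaphore:
            # The blocking call runs in a worker thread; the event loop stays free
            return await asyncio.to_thread(probe.blocking_work, item)

    results = []
    for coro in asyncio.as_completed([one(i) for i in items]):
        results.append(await coro)  # Results arrive in completion order
    return results, probe.peak


if __name__ == "__main__":
    done, peak = asyncio.run(run_bounded(list(range(6)), max_concurrent=2))
    print(sorted(done), "peak concurrency:", peak)
```

The recorded peak never exceeds `max_concurrent`, which is the property the batch processor relies on.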
Step 4: Validation and Post-Processing
Raw extraction isn't enough. You need to validate that extracted data makes sense:
from datetime import datetime
import re


class DataValidator:
    """Validates extracted invoice data for business logic consistency."""

    # Five-digit numeric codes only; CPT codes that end in a letter will be flagged
    CPT_PATTERN = re.compile(r'^\d{5}$')

    @staticmethod
    def validate_invoice(data: ExtractedInvoice) -> list[str]:
        """Returns list of validation warnings. Empty list means valid."""
        warnings = []

        # Validate date format and range
        if data.service_date:
            try:
                dt = datetime.strptime(data.service_date, "%Y-%m-%d")
                if dt > datetime.now():
                    warnings.append(f"Service date {data.service_date} is in the future")
                if dt.year < 2020:
                    warnings.append(f"Service date {data.service_date} seems too old")
            except ValueError:
                warnings.append(f"Invalid date format: {data.service_date}")

        # Validate CPT codes
        for code in data.cpt_codes:
            if not DataValidator.CPT_PATTERN.match(code):
                warnings.append(f"Invalid CPT code format: {code}")

        # Validate monetary consistency (compare against None so 0.0 is still checked)
        if data.total_charges is not None and data.insurance_payment is not None:
            if data.insurance_payment > data.total_charges:
                warnings.append(
                    f"Insurance payment ${data.insurance_payment:.2f} exceeds "
                    f"total charges ${data.total_charges:.2f}"
                )
        return warnings
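One caveat about the CPT check: the five-digit pattern rejects codes that end in a letter, such as Category II codes (trailing F) and Category III codes (trailing T). The sketch below shows how the strict pattern behaves and how a broader one might look. The `BROADER` pattern and the `classify` helper are assumptions for illustration, not a complete CPT grammar.

```python
import re

# The strict pattern used by DataValidator: exactly five digits
FIVE_DIGIT = re.compile(r"^\d{5}$")

# Hypothetical broader pattern: four digits followed by a digit, F, or T
BROADER = re.compile(r"^\d{4}[0-9FT]$")


def classify(code: str) -> str:
    """Label a code as 'five-digit', 'alphanumeric', or 'invalid'."""
    if FIVE_DIGIT.match(code):
        return "five-digit"
    if BROADER.match(code):
        return "alphanumeric"
    return "invalid"


if __name__ == "__main__":
    for sample in ("99213", "0042T", "9921", "ABCDE"):
        print(sample, "->", classify(sample))
```

Whether letter-suffixed codes should pass depends on the claims you process, so treat the broader pattern as a starting point to adapt.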
Edge Cases and Production Gotchas
1. Multi-Page Documents
Requests with many page images run into API limits: Anthropic's vision documentation describes stricter per-image dimension limits once a request carries more than 20 images. The current implementation sends all pages at once, which works for typical invoices (1-5 pages). For longer documents, split the pages into batches, extract from each batch, and merge the results:
def extract_multi_page(self, pdf_path: str, pages_per_batch: int = 5):
    """Extract data from long PDFs by processing page batches."""
    all_images = self.converter.convert(pdf_path)
    results = []
    for i in range(0, len(all_images), pages_per_batch):
        batch = all_images[i:i + pages_per_batch]
        # Process batch and extract partial data.
        # _extract_from_images is a helper you need to write: it should send one
        # batch of images to the API and return the parsed partial result.
        partial = self._extract_from_images(batch)
        results.append(partial)
    # _merge_results is also left to you: it combines the per-batch partials
    return self._merge_results(results)
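The merging step is where most of the design decisions sit. As a minimal sketch, assuming each batch yields a plain dictionary with the same keys as the JSON structure in the system prompt, the first non-null value could win for scalar fields while CPT codes are unioned in order. The `chunk` and `merge_partials` functions and this merge policy are hypothetical; totals that only appear on the last page may call for a different rule.

```python
from typing import Any

# Scalar fields from the JSON structure requested in the system prompt
SCALAR_FIELDS = (
    "provider_name", "patient_id", "service_date",
    "total_charges", "insurance_payment", "patient_balance",
)


def chunk(items: list, size: int) -> list[list]:
    """Split `items` into consecutive batches of at most `size` elements."""
    if size < 1:
        raise ValueError("size must be at least 1")
    return [items[i:i + size] for i in range(0, len(items), size)]


def merge_partials(partials: list[dict[str, Any]]) -> dict[str, Any]:
    """Merge per-batch results: first non-null scalar wins, CPT codes are unioned."""
    merged: dict[str, Any] = {field: None for field in SCALAR_FIELDS}
    merged["cpt_codes"] = []
    for partial in partials:
        for field in SCALAR_FIELDS:
            if merged[field] is None and partial.get(field) is not None:
                merged[field] = partial[field]
        for code in partial.get("cpt_codes") or []:
            if code not in merged["cpt_codes"]:
                merged["cpt_codes"].append(code)  # Keep first-seen order
    return merged
```

A merged dictionary can then be validated with the same Pydantic model as a single-request result.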
2. Memory Management for Large PDFs
A 100-page US Letter PDF at 200 DPI decodes to roughly 1.1 GB of uncompressed RGB pixel data (1700 × 2200 pixels × 3 bytes is about 11 MB per page), because convert_from_path returns every page as an in-memory PIL image. For production, use a generator-based approach:
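def convert_streaming(self, pdf_path: str):
    """Yield pages one at a time to limit memory usage."""
    from pdf2image.pdf2image import pdfinfo_from_path

    info = pdfinfo_from_path(pdf_path)
    total_pages = info['Pages']
    for page_num in range(1, total_pages + 1):
        # Render exactly one page per call so only one image is held at a time
        images = convert_from_path(
            pdf_path,
            dpi=self.dpi,
            first_page=page_num,
            last_page=page_num
        )
        # _encode_image is a helper not shown here: it should apply the same
        # resize, JPEG, and base64 steps as the loop body in convert()
        yield self._encode_image(images[0])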
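To judge whether streaming is worth the extra calls, a back-of-envelope estimate of decoded image size helps. The sketch below assumes 8-bit RGB at 3 bytes per pixel and counts raw pixel data only; actual process memory will be higher. `decoded_rgb_bytes` is a hypothetical helper.

```python
def decoded_rgb_bytes(width_in: float, height_in: float, dpi: int, channels: int = 3) -> int:
    """Estimate raw pixel bytes for one page rendered at `dpi` (one byte per channel)."""
    width_px = round(width_in * dpi)
    height_px = round(height_in * dpi)
    return width_px * height_px * channels


if __name__ == "__main__":
    per_page = decoded_rgb_bytes(8.5, 11, 200)
    print(f"One Letter page at 200 DPI: {per_page / 1e6:.1f} MB")
    print(f"100 pages held at once:     {100 * per_page / 1e9:.2f} GB")
```

With streaming, peak usage stays near the single-page figure regardless of document length.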
3. Handling Low-Quality Scans
Scanned PDFs (not born-digital) have lower quality. Increase DPI to 300 and enable preprocessing:
from PIL import ImageFilter, ImageEnhance


def preprocess_scan(self, img: Image.Image) -> Image.Image:
    """Enhance scanned document for better OCR/vision extraction."""
    # Convert to grayscale for better contrast
    img = img.convert('L')
    # Increase contrast
    enhancer = ImageEnhance.Contrast(img)
    img = enhancer.enhance(2.0)
    # Sharpen
    img = img.filter(ImageFilter.SHARPEN)
    return img
Cost Analysis and Optimization
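As of June 2026, Claude 3.5 Sonnet pricing is $3.00 per million input tokens and $15.00 per million output tokens. Anthropic's vision documentation estimates image cost at roughly (width × height) / 750 tokens and notes that large images are scaled down to about 1,600 tokens before processing. For a typical 2-page invoice:
- Input: ~2 images at up to ~1,600 tokens each = ~3,200 tokens = ~$0.01
- Output: ~200 tokens = $0.003
- Total per invoice: ~$0.013
For 10,000 invoices/month: ~$130. Optimization strategies:
- Cache identical pages: Many invoices share header/footer layouts
- Reduce image size: 150 DPI instead of 200 DPI cuts pixel count by ~44%, which shrinks uploads; it lowers token count only for pages small enough to avoid the API's downscaling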
- Batch similar documents: Same system prompt reuse reduces overhead
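To plug in your own page sizes and volumes, the estimate can be wrapped in a small calculator. This sketch assumes the approximation from Anthropic's vision documentation (tokens ≈ width × height / 750) and models the API's downscaling as a flat cap of 1,600 tokens per image, which is a simplification. The function names and default prices are illustrative parameters, so substitute current pricing.

```python
import math
from typing import Optional


def image_tokens(width_px: int, height_px: int, cap: Optional[int] = 1600) -> int:
    """Approximate tokens for one image; `cap` stands in for API-side downscaling."""
    tokens = math.ceil(width_px * height_px / 750)
    return tokens if cap is None else min(tokens, cap)


def invoice_cost(
    image_token_counts: list[int],
    prompt_tokens: int,
    output_tokens: int,
    input_price_per_mtok: float = 3.00,    # Assumed price, USD per million input tokens
    output_price_per_mtok: float = 15.00,  # Assumed price, USD per million output tokens
) -> float:
    """Return the estimated USD cost of one extraction request."""
    input_tokens = sum(image_token_counts) + prompt_tokens
    return (input_tokens * input_price_per_mtok
            + output_tokens * output_price_per_mtok) / 1_000_000


if __name__ == "__main__":
    pages = [image_tokens(1700, 2200), image_tokens(1700, 2200)]
    per_invoice = invoice_cost(pages, prompt_tokens=300, output_tokens=200)
    print(f"Per invoice: ${per_invoice:.4f}")
    print(f"Per 10,000 invoices: ${10_000 * per_invoice:,.2f}")
```

Measured usage from the `usage` field of real API responses should replace these estimates once you have production traffic.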
Conclusion and What's Next
You now have a production-ready PDF data extraction pipeline using Claude 3.5 Sonnet that handles real-world edge cases: multi-page documents, low-quality scans, API rate limits, and validation. The key architectural decisions—image-based extraction over text parsing, Pydantic validation, and exponential backoff—make this robust enough for medical billing, legal document processing, or financial statement analysis.
What's Next:
- Implement a caching layer with Redis to avoid re-processing identical PDFs
- Add a webhook system for asynchronous processing of large batches
- Explore fine-tuning [3] a smaller model on your specific document types for reduced costs
- Set up monitoring with Prometheus metrics for extraction latency and error rates
The complete code is available on GitHub (Anthropic's official cookbook repository). For more on building AI-powered data pipelines, check out our guide on designing robust extraction systems.
References