How to Extract Structured Data from PDFs with Claude 3.5 Sonnet
Practical tutorial: Extract structured data from PDFs with Claude 3.5 Sonnet
Extracting structured data from PDFs remains one of the most persistent challenges in enterprise document processing. While traditional OCR and regex-based approaches struggle with complex layouts, tables, and varying formats, large language models like Claude 3.5 Sonnet [8] offer a fundamentally different approach: treating PDF extraction as a structured generation problem rather than a parsing one.
In this tutorial, you'll build a production-ready PDF extraction pipeline that uses Claude 3.5 Sonnet to convert unstructured PDF documents into validated JSON schemas. We'll cover the architecture decisions, handle edge cases like multi-column layouts and tables, and implement proper error handling for API rate limits and malformed inputs.
Real-World Use Case and Architecture
Consider a financial services firm processing thousands of invoice PDFs daily. Each invoice has different layouts, currencies, line items, and tax calculations. Traditional template-based extraction fails when vendors change formats, while machine learning approaches require extensive labeled training data.
Claude 3.5 Sonnet addresses this by reading document structure through its vision capabilities and generating structured output directly. According to Anthropic's documentation [8], Claude 3.5 Sonnet processes images and text together in a single request, which makes it well suited to PDF extraction where layout matters.
Our architecture uses a three-stage pipeline:
- PDF Preprocessing: Convert PDF pages to high-resolution images, handling rotation, compression, and quality optimization
- Structured Extraction: Send images to Claude 3.5 Sonnet with a carefully designed system prompt and JSON schema
- Validation and Post-processing: Validate extracted data against schemas, handle missing fields, and retry on failures
This approach handles edge cases that break traditional extraction: rotated pages, handwritten annotations, watermarks, and complex nested tables.
Prerequisites and Environment Setup
First, install the required packages. We'll use pdf2image for PDF-to-image conversion, anthropic for Claude API access, and pydantic for schema validation.
# Create a virtual environment
python -m venv pdf-extractor
source pdf-extractor/bin/activate # On Windows: pdf-extractor\Scripts\activate
# Install core dependencies
pip install anthropic==0.49.0
pip install pdf2image==1.17.0
pip install pydantic==2.10.0
pip install pillow==11.0.0
pip install tenacity==9.0.0 # For retry logic
pip install jsonschema==4.23.0
# For PDF processing, you'll need poppler-utils
# On macOS: brew install poppler
# On Ubuntu/Debian: sudo apt-get install poppler-utils
# On Windows: Download from https://github.com/oschwartz10612/poppler-windows/releases/
Set up your environment variables:
export ANTHROPIC_API_KEY="your-api-key-here"
export MAX_PDF_PAGES=10 # Limit pages to control costs
export EXTRACTION_TIMEOUT=120 # Seconds per page
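The variables above only take effect if the code reads them. A minimal sketch of one way to load them, assuming a hypothetical `Settings` dataclass and `load_settings` helper (neither is part of the tutorial's modules), with the same defaults as the exports above:

```python
import os
from dataclasses import dataclass
from typing import Mapping, Optional


@dataclass(frozen=True)
class Settings:
    """Hypothetical container for the three environment variables above."""
    api_key: str
    max_pdf_pages: int
    extraction_timeout: float


def load_settings(env: Optional[Mapping[str, str]] = None) -> Settings:
    """Read settings from the environment, failing early if the API key is missing."""
    env = os.environ if env is None else env
    api_key = env.get("ANTHROPIC_API_KEY", "")
    if not api_key:
        raise RuntimeError("ANTHROPIC_API_KEY is not set")
    return Settings(
        api_key=api_key,
        # Defaults mirror the values exported above
        max_pdf_pages=int(env.get("MAX_PDF_PAGES", "10")),
        extraction_timeout=float(env.get("EXTRACTION_TIMEOUT", "120")),
    )
```

Passing a plain dictionary instead of `os.environ` keeps the helper easy to test.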
Core Implementation: Building the Extraction Pipeline
PDF Preprocessing Module
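The preprocessing stage is critical for extraction quality. Claude 3.5 Sonnet accepts images up to 8,000 pixels on each side, but sending full-resolution PDF pages increases upload size and latency without improving results: according to Anthropic's vision documentation, images whose long edge exceeds roughly 1,568 pixels are scaled down before the model sees them. We need to balance image quality with token efficiency.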
# pdf_preprocessor.py
import io
import logging
from typing import List, Optional

from pdf2image import convert_from_path
from PIL import Image, ImageEnhance

logger = logging.getLogger(__name__)


class PDFPreprocessor:
    """
    Handles PDF-to-image conversion with quality optimization.

    Key decisions:
    - DPI: 200 DPI balances quality and size for Claude's vision
    - Format: JPEG at quality 85 is typically far smaller than PNG for page scans
    - Max dimensions: Downscale so neither side exceeds max_dimension
    """

    def __init__(
        self,
        dpi: int = 200,
        max_dimension: int = 4000,
        jpeg_quality: int = 85,
        max_pages: Optional[int] = None
    ):
        self.dpi = dpi
        self.max_dimension = max_dimension
        self.jpeg_quality = jpeg_quality
        self.max_pages = max_pages

    def convert_to_images(self, pdf_path: str) -> List[bytes]:
        """
        Convert PDF to optimized JPEG images.

        Handles edge cases:
        - Empty PDFs: Returns empty list
        - Corrupted pages: Logs warning and skips
        - Large PDFs: Respects max_pages limit
        """
        try:
            images = convert_from_path(
                pdf_path,
                dpi=self.dpi,
                fmt='jpeg',
                grayscale=False,
                # Render only the pages we will use (None means all pages)
                last_page=self.max_pages or None,
                use_pdftocairo=True  # Better quality than pdftoppm
            )
        except Exception as e:
            logger.error(f"Failed to convert PDF {pdf_path}: {e}")
            return []

        processed = []
        for i, img in enumerate(images):
            try:
                # Shrink in place, preserving aspect ratio. Passing a fixed
                # (width, height) to convert_from_path would stretch the page.
                img.thumbnail(
                    (self.max_dimension, self.max_dimension),
                    Image.Resampling.LANCZOS
                )
                if img.mode != "RGB":
                    img = img.convert("RGB")  # JPEG cannot store alpha or palettes

                # Enhance contrast for better text extraction
                enhancer = ImageEnhance.Contrast(img)
                img = enhancer.enhance(1.2)

                # Convert to bytes
                buffer = io.BytesIO()
                img.save(buffer, format='JPEG', quality=self.jpeg_quality, optimize=True)
                processed.append(buffer.getvalue())
                logger.debug(f"Processed page {i+1}: {len(buffer.getvalue())} bytes")
            except Exception as e:
                logger.warning(f"Failed to process page {i+1}: {e}")
                continue
        return processed
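To make the quality versus token trade-off concrete, here is a rough estimator. It is a sketch under stated assumptions: the constants (a long edge of about 1,568 pixels, a budget of about 1,600 tokens per image, and roughly 750 pixels per token) are taken from Anthropic's vision guide as best I understand it and may change, and the function names are hypothetical.

```python
import math
from typing import Tuple

# Assumed constants; verify against Anthropic's current vision documentation.
MAX_LONG_EDGE_PX = 1568
MAX_IMAGE_TOKENS = 1600
PIXELS_PER_TOKEN = 750


def fit_for_vision(width: int, height: int) -> Tuple[int, int]:
    """Return the size an image would be scaled down to, preserving aspect ratio."""
    edge_scale = MAX_LONG_EDGE_PX / max(width, height)
    token_scale = math.sqrt(MAX_IMAGE_TOKENS * PIXELS_PER_TOKEN / (width * height))
    scale = min(1.0, edge_scale, token_scale)  # never upscale
    return max(1, round(width * scale)), max(1, round(height * scale))


def estimate_image_tokens(width: int, height: int) -> int:
    """Approximate input tokens for one image using the pixels-per-token rule."""
    w, h = fit_for_vision(width, height)
    return math.ceil(w * h / PIXELS_PER_TOKEN)


# A US Letter page rendered at 200 DPI is 1700 x 2200 pixels
print(fit_for_vision(1700, 2200), estimate_image_tokens(1700, 2200))
```

If these assumptions hold, rendering above the fitted size mostly costs bandwidth, so a lower `max_dimension` may be worth testing against your own documents.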
Structured Extraction with Claude 3.5 Sonnet
This is the core of our pipeline. We define a Pydantic model for the expected output, pass its JSON schema to Claude in the prompt, and validate the reply. The key insight is that Claude 3.5 Sonnet can follow complex JSON schemas when prompted explicitly, although the response still has to be parsed and validated before it is trusted.
# structured_extractor.py
import base64
import json
import logging
import time
from typing import Any, Dict, List, Optional

import jsonschema
from anthropic import (
    Anthropic,
    APIConnectionError,
    InternalServerError,
    RateLimitError,
)
from pydantic import BaseModel, Field
from tenacity import (
    retry,
    retry_if_exception_type,
    stop_after_attempt,
    wait_exponential,
)

logger = logging.getLogger(__name__)


# Define our extraction schema
class InvoiceLineItem(BaseModel):
    description: str = Field(..., description="Item description")
    quantity: float = Field(..., ge=0, description="Quantity ordered")
    unit_price: float = Field(..., ge=0, description="Price per unit")
    total_price: float = Field(..., ge=0, description="Line total")
    currency: str = Field(default="USD", description="Currency code")


class InvoiceData(BaseModel):
    invoice_number: str = Field(..., description="Invoice identifier")
    vendor_name: str = Field(..., description="Company issuing invoice")
    vendor_address: Optional[str] = Field(None, description="Vendor street address")
    customer_name: str = Field(..., description="Customer receiving invoice")
    invoice_date: str = Field(..., description="Invoice date in YYYY-MM-DD format")
    due_date: Optional[str] = Field(None, description="Payment due date")
    line_items: List[InvoiceLineItem] = Field(..., min_length=1)
    subtotal: float = Field(..., ge=0)
    tax_amount: Optional[float] = Field(None, ge=0)
    total_amount: float = Field(..., ge=0)
    currency: str = Field(default="USD")


class StructuredExtractor:
    """
    Extracts structured data from PDF images using Claude 3.5 Sonnet.

    Architecture decisions:
    - Uses system prompt with explicit JSON schema
    - Implements retry with exponential backoff for rate limits
    - Validates output against the JSON schema before returning
    - Handles partial extractions gracefully
    """

    SYSTEM_PROMPT = """You are a precise document data extraction system.
Extract the requested information from the provided document image.
Rules:
1. Extract ONLY information visible in the document
2. If a field is not visible, set it to null (not empty string)
3. Convert all monetary values to numbers (remove currency symbols)
4. Dates must be in YYYY-MM-DD format
5. Line items must include quantity, unit price, and total
6. Verify that subtotal + tax = total (within rounding tolerance)
7. Do not infer or guess missing information
Return ONLY valid JSON matching the specified schema."""

    def __init__(self, api_key: str, model: str = "claude-3-5-sonnet-20241022"):
        self.client = Anthropic(api_key=api_key)
        self.model = model

    @retry(
        stop=stop_after_attempt(3),
        wait=wait_exponential(multiplier=1, min=4, max=60),
        # Retry only transient failures. Authentication or bad-request errors
        # will not succeed on a second attempt. Timeouts are a subclass of
        # APIConnectionError in the anthropic SDK.
        retry=retry_if_exception_type(
            (RateLimitError, APIConnectionError, InternalServerError, json.JSONDecodeError)
        ),
        reraise=True  # Surface the original exception after the final attempt
    )
    def extract_from_image(
        self,
        image_bytes: bytes,
        schema: Dict[str, Any],
        page_number: int = 1
    ) -> Dict[str, Any]:
        """
        Extract structured data from a single page image.

        Handles:
        - API rate limits (429 errors)
        - Timeout errors
        - Malformed JSON responses
        - Schema validation failures
        """
        start_time = time.time()
        content = ""  # Defined up front so the error log below is always safe
        try:
            response = self.client.messages.create(
                model=self.model,
                max_tokens=4096,
                temperature=0.0,  # Minimizes sampling variability for extraction
                system=self.SYSTEM_PROMPT,
                messages=[
                    {
                        "role": "user",
                        "content": [
                            {
                                "type": "image",
                                "source": {
                                    "type": "base64",
                                    "media_type": "image/jpeg",
                                    # The API expects base64 text, not hex
                                    "data": base64.standard_b64encode(image_bytes).decode("utf-8")
                                }
                            },
                            {
                                "type": "text",
                                "text": f"Extract data from page {page_number} using this schema:\n{json.dumps(schema, indent=2)}"
                            }
                        ]
                    }
                ]
            )

            # Parse the response
            content = response.content[0].text

            # Extract JSON from response (handle markdown code blocks)
            if "```json" in content:
                json_str = content.split("```json")[1].split("```")[0].strip()
            elif "```" in content:
                json_str = content.split("```")[1].split("```")[0].strip()
            else:
                json_str = content.strip()
            extracted = json.loads(json_str)

            # Validate against schema
            validated = self._validate_against_schema(extracted, schema)
            elapsed = time.time() - start_time
            logger.info(f"Page {page_number} extracted in {elapsed:.2f}s")
            return validated
        except json.JSONDecodeError as e:
            logger.error(f"Failed to parse JSON from page {page_number}: {e}")
            logger.debug(f"Raw response: {content[:500]}")
            raise
        except Exception as e:
            logger.error(f"Extraction failed for page {page_number}: {e}")
            raise

    def _validate_against_schema(
        self,
        data: Dict[str, Any],
        schema: Dict[str, Any]
    ) -> Dict[str, Any]:
        """
        Validate extracted data against JSON schema.

        Handles:
        - Missing required fields
        - Type mismatches
        - Value constraints (e.g., negative prices)
        """
        try:
            jsonschema.validate(instance=data, schema=schema)
            return data
        except jsonschema.ValidationError as e:
            logger.warning(f"Schema validation failed: {e.message}")
            # Return partial data with validation errors noted
            data["_validation_errors"] = [e.message]
            return data
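Splitting on code-fence markers works when the model wraps its answer predictably, but it can fail if the reply adds commentary before or after the JSON. As an alternative sketch, not the tutorial's own method, the standard library's `json.JSONDecoder.raw_decode` can locate the first parseable object in the text. The helper name `extract_first_json_object` is hypothetical.

```python
import json
from typing import Any, Dict


def extract_first_json_object(text: str) -> Dict[str, Any]:
    """Return the first JSON object embedded in text, ignoring surrounding prose."""
    decoder = json.JSONDecoder()
    start = text.find("{")
    while start != -1:
        try:
            # raw_decode parses one JSON value starting at the given index
            obj, _end = decoder.raw_decode(text, start)
            return obj
        except json.JSONDecodeError:
            # Not valid JSON from this brace; try the next one
            start = text.find("{", start + 1)
    raise ValueError("no JSON object found in model response")


reply = 'Here is the data:\n```json\n{"invoice_number": "INV-1", "total_amount": 12.5}\n```'
print(extract_first_json_object(reply))
```

Because the search starts at an opening brace, a successful parse is always a JSON object, which matches what the invoice schema expects.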
Multi-Page Document Processing
Real-world PDFs often span multiple pages. We need to handle cross-page references, repeated headers, and partial data spread across pages.
# document_processor.py
import logging
from concurrent.futures import ThreadPoolExecutor, as_completed
from datetime import datetime
from typing import Any, Dict, List, Optional, Tuple

from pdf_preprocessor import PDFPreprocessor
from structured_extractor import StructuredExtractor

logger = logging.getLogger(__name__)

# (page number, extracted data or None, error message or None)
PageResult = Tuple[int, Optional[Dict[str, Any]], Optional[str]]


class DocumentProcessor:
    """
    Orchestrates multi-page PDF extraction with aggregation logic.

    Handles:
    - Multi-page invoices with line items spanning pages
    - Repeated header/footer information (first occurrence wins)
    - Totals repeated across pages (last occurrence wins)
    - Partial extraction failures
    """

    def __init__(
        self,
        extractor: StructuredExtractor,
        preprocessor: PDFPreprocessor,
        max_workers: int = 3  # Keep concurrency low to stay under rate limits
    ):
        self.extractor = extractor
        self.preprocessor = preprocessor
        self.max_workers = max_workers

    def process_document(
        self,
        pdf_path: str,
        schema: Dict[str, Any]
    ) -> Dict[str, Any]:
        """
        Process entire PDF document and aggregate results.

        Returns:
        - Extracted data from all pages
        - Metadata about extraction quality
        - Any errors encountered
        """
        # Convert PDF to images
        images = self.preprocessor.convert_to_images(pdf_path)
        if not images:
            return {
                "success": False,
                "error": "No pages could be processed",
                "data": None
            }

        # Extract from each page in parallel (bounded by max_workers)
        page_results: List[PageResult] = []
        with ThreadPoolExecutor(max_workers=self.max_workers) as executor:
            future_to_page = {
                executor.submit(
                    self.extractor.extract_from_image,
                    img,
                    schema,
                    i + 1
                ): i + 1
                for i, img in enumerate(images)
            }
            for future in as_completed(future_to_page):
                page_num = future_to_page[future]
                try:
                    result = future.result()
                    page_results.append((page_num, result, None))
                except Exception as e:
                    logger.error(f"Page {page_num} failed: {e}")
                    page_results.append((page_num, None, str(e)))

        # Sort by page number so "first" and "last" occurrence are well defined
        page_results.sort(key=lambda x: x[0])

        # Aggregate results
        return self._aggregate_pages(page_results)

    def _aggregate_pages(
        self,
        page_results: List[PageResult]
    ) -> Dict[str, Any]:
        """
        Merge data from multiple pages.

        Strategy:
        - Take first occurrence for header fields (invoice number, vendor, dates)
        - Concatenate line items across pages
        - Use last occurrence for totals (most complete)
        """
        pages_processed = len([r for r in page_results if r[1] is not None])
        aggregated = {
            # A run in which every page failed is not a success
            "success": pages_processed > 0,
            "data": {},
            "pages_processed": pages_processed,
            "pages_failed": len([r for r in page_results if r[1] is None]),
            "errors": [r[2] for r in page_results if r[2] is not None],
            "extraction_time": datetime.now().isoformat()
        }
        all_line_items = []
        header_fields = {}
        total_fields = {}
        for page_num, data, error in page_results:
            if data is None:
                continue

            # Collect line items from all pages (the model may return null)
            all_line_items.extend(data.get("line_items") or [])

            # Header fields: take first occurrence
            for field in [
                "invoice_number", "vendor_name", "vendor_address",
                "customer_name", "invoice_date", "due_date", "currency"
            ]:
                if data.get(field) and field not in header_fields:
                    header_fields[field] = data[field]

            # Total fields: take last occurrence (most complete)
            for field in ["subtotal", "tax_amount", "total_amount"]:
                if data.get(field) is not None:
                    total_fields[field] = data[field]

        # Build final aggregated data
        aggregated["data"] = {
            **header_fields,
            "line_items": all_line_items,
            **total_fields,
            "_page_count": len(page_results)
        }
        return aggregated
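Taking the first occurrence of a header field silently hides disagreements between pages, for example when two invoices were scanned into one file. A small sketch of a conflict check that could run alongside the aggregation; `find_header_conflicts` is a hypothetical helper, and the input is assumed to be the list of per-page dictionaries in page order.

```python
from typing import Any, Dict, List, Sequence


def find_header_conflicts(
    pages: List[Dict[str, Any]],
    fields: Sequence[str] = ("invoice_number", "vendor_name", "customer_name"),
) -> Dict[str, List[Any]]:
    """Return header fields that have more than one distinct non-empty value."""
    conflicts: Dict[str, List[Any]] = {}
    for field in fields:
        seen: List[Any] = []
        for page in pages:
            value = page.get(field)
            if value and value not in seen:
                seen.append(value)
        if len(seen) > 1:
            conflicts[field] = seen  # Values listed in page order
    return conflicts


pages = [
    {"invoice_number": "INV-001", "vendor_name": "Acme"},
    {"invoice_number": None, "vendor_name": "Acme"},
    {"invoice_number": "INV-007", "vendor_name": "Acme"},
]
print(find_header_conflicts(pages))
```

A non-empty result is a reasonable trigger for routing the document to manual review rather than accepting the merged record.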
Production Considerations and Edge Cases
Rate Limiting and Cost Management
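Claude 3.5 Sonnet API calls cost $3.00 per million input tokens and $15.00 per million output tokens (as of January 2026, according to Anthropic's pricing page); check the pricing page for current rates. A typical invoice page rendered at 200 DPI as JPEG consumes approximately 1,500-2,000 input tokens for the image, plus the tokens for the system prompt and the JSON schema sent with every request. Because the retry logic allows up to 3 attempts per page, budget for up to three times the single-attempt cost in the worst case.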
# cost_tracker.py
class CostTracker:
    """
    Tracks API usage and costs for extraction jobs.
    """

    def __init__(self):
        self.total_input_tokens = 0
        self.total_output_tokens = 0
        self.input_cost_per_million = 3.00
        self.output_cost_per_million = 15.00

    def add_usage(self, input_tokens: int, output_tokens: int):
        self.total_input_tokens += input_tokens
        self.total_output_tokens += output_tokens

    def calculate_cost(self) -> float:
        input_cost = (self.total_input_tokens / 1_000_000) * self.input_cost_per_million
        output_cost = (self.total_output_tokens / 1_000_000) * self.output_cost_per_million
        return input_cost + output_cost
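Before running a large job, it can help to estimate the bill up front using the per-page figures quoted above. The sketch below is illustrative only: `estimate_job_cost` is a hypothetical helper, the 2,000 input tokens per page comes from the upper end of the range above, and the 800 output tokens per page is an assumption that should be replaced with measurements from your own invoices.

```python
def estimate_job_cost(
    pages: int,
    input_tokens_per_page: int = 2000,   # Upper end of the range quoted above
    output_tokens_per_page: int = 800,   # Assumption; measure on real documents
    attempts_per_page: int = 1,          # Use 3 for the worst case with retries
    input_price_per_million: float = 3.00,
    output_price_per_million: float = 15.00,
) -> float:
    """Estimate the cost in USD of extracting the given number of pages."""
    calls = pages * attempts_per_page
    input_cost = calls * input_tokens_per_page / 1_000_000 * input_price_per_million
    output_cost = calls * output_tokens_per_page / 1_000_000 * output_price_per_million
    return round(input_cost + output_cost, 4)


print(estimate_job_cost(1000))                       # One attempt per page
print(estimate_job_cost(1000, attempts_per_page=3))  # Every page retried twice
```

For actual spend rather than an estimate, the `usage` object on each API response reports `input_tokens` and `output_tokens`, which can be passed to `CostTracker.add_usage`.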
Handling Common Edge Cases
- Scanned vs Digital PDFs: Digital PDFs (text-based) can be processed faster by extracting text directly with PyMuPDF, then sending only the text to Claude. Scanned PDFs require the full image pipeline.
- Multi-language Documents: Claude 3.5 Sonnet handles many languages, but specify the document language in your system prompt for better accuracy.
- Password-Protected PDFs: The pdf2image library cannot open encrypted PDFs unless it is given the password (convert_from_path accepts a userpw argument). Alternatively, add a decryption step using pikepdf:
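# Add as a method on PDFPreprocessor
def decrypt_pdf(self, pdf_path: str, password: str) -> str:
    """Decrypt a password-protected PDF and return the path to a temporary copy.

    The caller is responsible for deleting the temporary file.
    """
    import pikepdf
    import tempfile

    # Reserve a temp path and close the handle first, so pikepdf can write
    # to it on platforms that lock open files (e.g. Windows)
    with tempfile.NamedTemporaryFile(suffix='.pdf', delete=False) as temp:
        temp_path = temp.name
    with pikepdf.open(pdf_path, password=password) as pdf:
        pdf.save(temp_path)
    return temp_path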
- Large PDFs (>50 pages): Process in batches of 10 pages, using the first batch to extract header information, then process remaining pages for line items only.
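The batching idea for large PDFs can be sketched as a page-range generator. `page_batches` is a hypothetical helper; the ranges are 1-based and inclusive so they can be passed to the `first_page` and `last_page` arguments of pdf2image's `convert_from_path`, which avoids rendering the whole document at once.

```python
from typing import List, Tuple


def page_batches(total_pages: int, batch_size: int = 10) -> List[Tuple[int, int]]:
    """Split a document into 1-based, inclusive (first_page, last_page) ranges."""
    if total_pages < 0 or batch_size < 1:
        raise ValueError("total_pages must be >= 0 and batch_size must be >= 1")
    return [
        (start, min(start + batch_size - 1, total_pages))
        for start in range(1, total_pages + 1, batch_size)
    ]


batches = page_batches(25)
header_batch, line_item_batches = batches[0], batches[1:]
print(header_batch, line_item_batches)
```

The first range is the one to run with the full invoice schema; the remaining ranges could use a reduced schema that only asks for line items.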
Validation and Quality Assurance
Implement a validation layer that checks extraction quality before accepting results:
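# quality_checker.py
from datetime import datetime
from typing import Any, Dict, List


class ExtractionQualityChecker:
    """
    Validates extraction quality using business rules.
    """

    def check_invoice_consistency(self, data: Dict[str, Any]) -> List[str]:
        warnings = []

        # Check arithmetic consistency. Line items exclude tax, so compare
        # them with the subtotal and fall back to the total only when no
        # subtotal was extracted.
        line_items = data.get("line_items") or []
        expected, label = data.get("subtotal"), "Subtotal"
        if expected is None:
            expected, label = data.get("total_amount"), "Total"
        if line_items and expected is not None:
            calculated = sum((item.get("total_price") or 0) for item in line_items)
            if abs(calculated - expected) > 0.01:
                warnings.append(
                    f"{label} mismatch: calculated {calculated}, "
                    f"extracted {expected}"
                )

        # Check that subtotal + tax = total
        subtotal, total = data.get("subtotal"), data.get("total_amount")
        if subtotal is not None and total is not None:
            tax = data.get("tax_amount") or 0
            if abs(subtotal + tax - total) > 0.01:
                warnings.append(
                    f"Total mismatch: subtotal {subtotal} + tax {tax} != {total}"
                )

        # Check date validity
        if data.get("invoice_date") is not None:
            try:
                datetime.strptime(data["invoice_date"], "%Y-%m-%d")
            except (TypeError, ValueError):
                warnings.append(f"Invalid date format: {data['invoice_date']}")

        # Check for required fields
        required = ["invoice_number", "vendor_name", "total_amount"]
        for field in required:
            if field not in data or data[field] is None:
                warnings.append(f"Missing required field: {field}")
        return warnings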
Putting It All Together
Here's the main entry point that ties everything together:
# main.py
import json
import logging
import os
from pathlib import Path

from document_processor import DocumentProcessor
from pdf_preprocessor import PDFPreprocessor
from quality_checker import ExtractionQualityChecker
from structured_extractor import InvoiceData, StructuredExtractor

logging.basicConfig(
    level=logging.INFO,
    format='%(asctime)s - %(name)s - %(levelname)s - %(message)s'
)
logger = logging.getLogger(__name__)


def main():
    # Initialize components
    preprocessor = PDFPreprocessor(
        dpi=200,
        max_dimension=4000,
        max_pages=10
    )
    extractor = StructuredExtractor(
        api_key=os.environ["ANTHROPIC_API_KEY"]
    )
    processor = DocumentProcessor(
        extractor=extractor,
        preprocessor=preprocessor,
        max_workers=3
    )

    # Define extraction schema
    invoice_schema = InvoiceData.model_json_schema()

    # Process document
    result = processor.process_document(
        pdf_path="invoice_sample.pdf",
        schema=invoice_schema
    )

    # Quality check ("data" is None when no page could be processed)
    checker = ExtractionQualityChecker()
    warnings = checker.check_invoice_consistency(result.get("data") or {})
    if warnings:
        logger.warning(f"Quality warnings: {warnings}")
        result["quality_warnings"] = warnings

    # Save results
    output_path = Path("extraction_result.json")
    with open(output_path, "w") as f:
        json.dump(result, f, indent=2, default=str)
    logger.info(f"Extraction complete. Results saved to {output_path}")


if __name__ == "__main__":
    main()
What's Next
This pipeline gives you a production-ready foundation for PDF data extraction with Claude 3.5 Sonnet. To extend it further:
- Add a caching layer using Redis to avoid re-processing identical PDFs
- Implement a feedback loop where human reviewers correct extraction errors and the prompts are refined based on correction patterns [2]
- Explore hybrid approaches combining Claude with traditional OCR (Tesseract) for low-quality scans
- Add monitoring with Prometheus metrics for extraction latency, success rates, and cost per document
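As a starting point for the caching idea, the sketch below keys results on a SHA-256 hash of the file contents plus the model name, so renamed copies of the same PDF hit the cache while a model change does not. It is a sketch under assumptions: a plain dictionary stands in for Redis, and `pdf_cache_key` and `extract_with_cache` are hypothetical names.

```python
import hashlib
from typing import Any, Callable, Dict, MutableMapping


def pdf_cache_key(pdf_path: str, model: str, chunk_size: int = 1 << 20) -> str:
    """Build a cache key from the model name and the PDF's bytes."""
    digest = hashlib.sha256()
    digest.update(model.encode("utf-8"))
    with open(pdf_path, "rb") as f:
        # Read in chunks so large PDFs are not loaded into memory at once
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return "pdf-extract:" + digest.hexdigest()


def extract_with_cache(
    pdf_path: str,
    model: str,
    extract_fn: Callable[[str], Dict[str, Any]],
    cache: MutableMapping[str, Dict[str, Any]],
) -> Dict[str, Any]:
    """Return a cached result if present; otherwise extract and store it."""
    key = pdf_cache_key(pdf_path, model)
    if key not in cache:
        cache[key] = extract_fn(pdf_path)
    return cache[key]
```

With Redis, the dictionary lookups would become get and set calls on JSON-serialized results, ideally with an expiry so that prompt or schema changes eventually invalidate old entries.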
The key insight from building this system is that Claude 3.5 Sonnet's vision capabilities make it well suited to structured data extraction from complex documents. By treating extraction as a structured generation problem with proper validation, you can process documents at machine speed while catching most errors automatically. Measure accuracy against a hand-labeled sample of your own documents before relying on the output without human review.
References