How to Extract Structured Data from PDFs with Claude 3.5 Sonnet

Practical tutorial: Extract structured data from PDFs with Claude 3.5 Sonnet

BlogIA Academy | June 1, 2026 | 14 min read | 2,601 words

Extracting structured data from PDFs remains one of the most persistent challenges in enterprise document processing. While traditional OCR and regex-based approaches struggle with complex layouts, tables, and varying formats, large language models like Claude [8] 3.5 Sonnet offer a fundamentally different approach: treating PDF extraction as a structured generation problem rather than a parsing one.

In this tutorial, you'll build a production-ready PDF extraction pipeline that uses Claude 3.5 Sonnet to convert unstructured PDF documents into JSON that is validated against a schema. We'll cover the architecture decisions, handle edge cases like multi-column layouts and tables, and implement error handling for API rate limits and malformed inputs.

Real-World Use Case and Architecture

Consider a financial services firm processing thousands of invoice PDFs daily. Each invoice has different layouts, currencies, line items, and tax calculations. Traditional template-based extraction fails when vendors change formats, while machine learning approaches require extensive labeled training data.

Claude 3.5 Sonnet solves this by understanding document structure through its vision capabilities and generating structured output directly. According to Anthropic [8]'s documentation, Claude 3.5 Sonnet processes images and text simultaneously, making it ideal for PDF extraction where layout matters.

Our architecture uses a three-stage pipeline:

  1. PDF Preprocessing: Convert PDF pages to images, tuning resolution, contrast, and JPEG compression
  2. Structured Extraction: Send images to Claude 3.5 Sonnet with a carefully designed system prompt and JSON schema
  3. Validation and Post-processing: Validate extracted data against schemas, handle missing fields, and retry on failures

This approach handles edge cases that break traditional extraction: rotated pages, handwritten annotations, watermarks, and complex nested tables.

Prerequisites and Environment Setup

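First, install the required packages. We'll use pdf2image for PDF-to-image conversion, pillow for image handling, anthropic for Claude API access, pydantic to define the output schema, jsonschema to validate responses against it, and tenacity for retries.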

# Create a virtual environment
python -m venv pdf-extractor
source pdf-extractor/bin/activate  # On Windows: pdf-extractor\Scripts\activate

# Install core dependencies
pip install anthropic==0.49.0
pip install pdf2image==1.17.0
pip install pydantic==2.10.0
pip install pillow==11.0.0
pip install tenacity==9.0.0  # For retry logic
pip install jsonschema==4.23.0

# For PDF processing, you'll need poppler-utils
# On macOS: brew install poppler
# On Ubuntu/Debian: sudo apt-get install poppler-utils
# On Windows: Download from https://github.com/oschwartz10612/poppler-windows/releases/

Set up your environment variables:

export ANTHROPIC_API_KEY="your-api-key-here"
export MAX_PDF_PAGES=10  # Limit pages to control costs
export EXTRACTION_TIMEOUT=120  # Seconds per page
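
These variables only help if the pipeline actually reads them. A minimal sketch of how they might be parsed with defaults and basic validation follows. The `Settings` dataclass and `load_settings` helper are hypothetical names introduced for illustration, not part of the modules below.

```python
import os
from dataclasses import dataclass
from typing import Mapping, Optional


@dataclass(frozen=True)
class Settings:
    """Runtime limits for the pipeline (hypothetical container, for illustration)."""
    api_key: str
    max_pdf_pages: int
    extraction_timeout: float


def load_settings(env: Optional[Mapping[str, str]] = None) -> Settings:
    """Read settings from environment variables, failing fast on bad values."""
    env = os.environ if env is None else env

    api_key = env.get("ANTHROPIC_API_KEY", "")
    if not api_key:
        raise RuntimeError("ANTHROPIC_API_KEY is not set")

    # Defaults mirror the export lines above.
    max_pages = int(env.get("MAX_PDF_PAGES", "10"))
    timeout = float(env.get("EXTRACTION_TIMEOUT", "120"))
    if max_pages < 1 or timeout <= 0:
        raise ValueError("MAX_PDF_PAGES and EXTRACTION_TIMEOUT must be positive")

    return Settings(api_key, max_pages, timeout)
```

The resulting `max_pdf_pages` could be passed as `max_pages` to the preprocessor, and `extraction_timeout` could be handed to the Anthropic client, whose constructor accepts a `timeout` argument.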

Core Implementation: Building the Extraction Pipeline

PDF Preprocessing Module

The preprocessing stage is critical for extraction quality. Claude 3.5 Sonnet accepts images up to 8,000 pixels on each dimension, but sending full-resolution PDF pages wastes tokens and increases latency. We need to balance image quality with token efficiency.

# pdf_preprocessor.py
from pdf2image import convert_from_path
from PIL import Image, ImageEnhance
import io
import logging
from typing import List, Optional

logger = logging.getLogger(__name__)

class PDFPreprocessor:
    """
    Handles PDF-to-image conversion with quality optimization.

    Key decisions:
    - DPI: 200 DPI balances quality and size for Claude's vision
    - Format: JPEG with quality 85 is much smaller than PNG for scanned pages
    - Max dimensions: Resize to fit within Claude's 8K pixel limit
    """

    def __init__(
        self,
        dpi: int = 200,
        max_dimension: int = 4000,
        jpeg_quality: int = 85,
        max_pages: Optional[int] = None
    ):
        self.dpi = dpi
        self.max_dimension = max_dimension
        self.jpeg_quality = jpeg_quality
        self.max_pages = max_pages

    def convert_to_images(self, pdf_path: str) -> List[bytes]:
        """
        Convert PDF to optimized JPEG images.

        Handles edge cases:
        - Empty PDFs: Returns empty list
        - Corrupted pages: Logs warning and skips
        - Large PDFs: Respects max_pages limit
        """
        try:
            images = convert_from_path(
                pdf_path,
                dpi=self.dpi,
                fmt='jpeg',
                grayscale=False,
                size=self.max_dimension,  # Single int: fit within a square box, keeping aspect ratio
                use_pdftocairo=True  # Better quality than pdftoppm
            )
        except Exception as e:
            logger.error(f"Failed to convert PDF {pdf_path}: {e}")
            return []

        if self.max_pages:
            images = images[:self.max_pages]

        processed = []
        for i, img in enumerate(images):
            try:
                # Enhance contrast for better text extraction
                enhancer = ImageEnhance.Contrast(img)
                img = enhancer.enhance(1.2)

                # Convert to bytes
                buffer = io.BytesIO()
                img.save(buffer, format='JPEG', quality=self.jpeg_quality, optimize=True)
                processed.append(buffer.getvalue())

                logger.debug(f"Processed page {i+1}: {len(buffer.getvalue())} bytes")
            except Exception as e:
                logger.warning(f"Failed to process page {i+1}: {e}")
                continue

        return processed
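
To see why resizing matters, it helps to estimate what a page costs in tokens. The sketch below uses the rule of thumb from Anthropic's vision documentation: roughly width × height / 750 tokens per image, with images scaled down before processing when the long edge exceeds about 1,568 pixels or the estimate exceeds about 1,600 tokens. All three constants are approximations that may change, and `fit_long_edge` and `estimate_image_tokens` are hypothetical helper names.

```python
from typing import Tuple

# Approximate constants based on Anthropic's vision guidance; verify them
# against the current documentation before relying on them.
TOKENS_PER_PIXEL_DIVISOR = 750
MAX_LONG_EDGE_PX = 1568
APPROX_MAX_IMAGE_TOKENS = 1600


def fit_long_edge(width: int, height: int, max_edge: int = MAX_LONG_EDGE_PX) -> Tuple[int, int]:
    """Scale (width, height) down so the longer side is at most max_edge."""
    long_edge = max(width, height)
    if long_edge <= max_edge:
        return width, height
    scale = max_edge / long_edge
    return max(1, round(width * scale)), max(1, round(height * scale))


def estimate_image_tokens(width: int, height: int) -> int:
    """Rough token estimate for an image after server-side downscaling."""
    w, h = fit_long_edge(width, height)
    raw_estimate = round(w * h / TOKENS_PER_PIXEL_DIVISOR)
    return min(raw_estimate, APPROX_MAX_IMAGE_TOKENS)


# US Letter rendered at 200 DPI is 1700 x 2200 pixels.
print(fit_long_edge(1700, 2200))
print(estimate_image_tokens(1700, 2200))
```

Under these assumptions, rendering pages far beyond roughly 1,568 pixels on the long edge adds upload size and latency without giving the model more detail, so a `max_dimension` well below 4000 is worth testing on your own documents.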

Structured Extraction with Claude 3.5 Sonnet

This is the core of our pipeline. We define a Pydantic schema for the expected output and pass its JSON Schema to Claude in the prompt, then validate the response against it. The key insight is that Claude 3.5 Sonnet can follow complex JSON schemas when prompted correctly.

# structured_extractor.py
from anthropic import Anthropic
from pydantic import BaseModel, Field, ValidationError
from typing import List, Optional, Dict, Any
from tenacity import retry, stop_after_attempt, wait_exponential, retry_if_exception_type
import json
import logging
import time
from datetime import datetime

logger = logging.getLogger(__name__)

# Define our extraction schema
class InvoiceLineItem(BaseModel):
    # Field(...) marks a field as required (the Ellipsis literal has three dots)
    description: str = Field(..., description="Item description")
    quantity: float = Field(..., ge=0, description="Quantity ordered")
    unit_price: float = Field(..., ge=0, description="Price per unit")
    total_price: float = Field(..., ge=0, description="Line total")
    currency: str = Field(default="USD", description="Currency code")

class InvoiceData(BaseModel):
    invoice_number: str = Field(..., description="Invoice identifier")
    vendor_name: str = Field(..., description="Company issuing invoice")
    vendor_address: Optional[str] = Field(None, description="Vendor street address")
    customer_name: str = Field(..., description="Customer receiving invoice")
    invoice_date: str = Field(..., description="Invoice date in YYYY-MM-DD format")
    due_date: Optional[str] = Field(None, description="Payment due date")
    line_items: List[InvoiceLineItem] = Field(..., min_length=1)
    subtotal: float = Field(..., ge=0)
    tax_amount: Optional[float] = Field(None, ge=0)
    total_amount: float = Field(..., ge=0)
    currency: str = Field(default="USD")

class StructuredExtractor:
    """
    Extracts structured data from PDF images using Claude 3.5 Sonnet.

    Architecture decisions:
    - Uses system prompt with explicit JSON schema
    - Implements retry with exponential backoff for rate limits
    - Validates output against Pydantic schema before returning
    - Handles partial extractions gracefully
    """

    SYSTEM_PROMPT = """You are a precise document data extraction system. 
Extract the requested information from the provided document image.

Rules:
1. Extract ONLY information visible in the document
2. If a field is not visible, set it to null (not empty string)
3. Convert all monetary values to numbers (remove currency symbols)
4. Dates must be in YYYY-MM-DD format
5. Line items must include quantity, unit price, and total
6. Verify that subtotal + tax = total (within rounding tolerance)
7. Do not infer or guess missing information

Return ONLY valid JSON matching the specified schema."""

    def __init__(self, api_key: str, model: str = "claude-3-5-sonnet-20241022"):
        self.client = Anthropic(api_key=api_key)
        self.model = model

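    @retry(
        stop=stop_after_attempt(3),
        wait=wait_exponential(multiplier=1, min=4, max=60),
        # Retries on any exception, including JSON parse failures
        retry=retry_if_exception_type((Exception,)),
        reraise=True  # Surface the original error instead of tenacity.RetryError
    )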
    def extract_from_image(
        self,
        image_bytes: bytes,
        schema: Dict[str, Any],
        page_number: int = 1
    ) -> Dict[str, Any]:
        """
        Extract structured data from a single page image.

        Handles:
        - API rate limits (429 errors)
        - Timeout errors
        - Malformed JSON responses
        - Schema validation failures
        """
        import base64  # Standard library; the API expects base64-encoded image data

        start_time = time.time()

        try:
            response = self.client.messages.create(
                model=self.model,
                max_tokens=4096,
                temperature=0.0,  # Reduce sampling randomness for extraction
                system=self.SYSTEM_PROMPT,
                messages=[
                    {
                        "role": "user",
                        "content": [
                            {
                                "type": "image",
                                "source": {
                                    "type": "base64",
                                    "media_type": "image/jpeg",
                                    # Base64-encode the raw JPEG bytes, as the API requires
                                    "data": base64.b64encode(image_bytes).decode("utf-8")
                                }
                            },
                            {
                                "type": "text",
                                "text": f"Extract data from page {page_number} using this schema:\n{json.dumps(schema, indent=2)}"
                            }
                        ]
                    }
                ]
            )

            # Parse the response
            content = response.content[0].text

            # Extract JSON from response (handle markdown code blocks)
            if "```json" in content:
                json_str = content.split("```json")[1].split("```")[0].strip()
            elif "```" in content:
                json_str = content.split("```")[1].split("```")[0].strip()
            else:
                json_str = content.strip()

            extracted = json.loads(json_str)

            # Validate against schema
            validated = self._validate_against_schema(extracted, schema)

            elapsed = time.time() - start_time
            logger.info(f"Page {page_number} extracted in {elapsed:.2f}s")

            return validated

        except json.JSONDecodeError as e:
            logger.error(f"Failed to parse JSON from page {page_number}: {e}")
            logger.debug(f"Raw response: {content[:500]}")
            raise
        except Exception as e:
            logger.error(f"Extraction failed for page {page_number}: {e}")
            raise

    def _validate_against_schema(
        self,
        data: Dict[str, Any],
        schema: Dict[str, Any]
    ) -> Dict[str, Any]:
        """
        Validate extracted data against JSON schema.

        Handles:
        - Missing required fields
        - Type mismatches
        - Value constraints (e.g., negative prices)
        """
        # Imported before the try block so the except clause can always resolve it
        import jsonschema

        try:
            jsonschema.validate(instance=data, schema=schema)
            return data
        except jsonschema.ValidationError as e:
            logger.warning(f"Schema validation failed: {e.message}")
            # Return partial data with validation errors noted
            data["_validation_errors"] = [e.message]
            return data
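
The fence-splitting logic in `extract_from_image` assumes the model returns either bare JSON or JSON wrapped in a single code fence. A slightly more defensive variant, sketched here with a hypothetical `extract_json_object` helper, also falls back to the outermost braces when the response includes surrounding prose. It uses only the standard library and is an illustration, not part of the Anthropic SDK.

```python
import json
import re
from typing import Any, Dict

# Build the three-backtick marker programmatically to keep this listing readable.
FENCE = "`" * 3

# Matches a fenced block with an optional "json" language tag.
_FENCE_RE = re.compile(
    FENCE + r"(?:json)?\s*(.*?)" + FENCE,
    re.DOTALL | re.IGNORECASE,
)


def extract_json_object(text: str) -> Dict[str, Any]:
    """Parse the first JSON object found in a model response."""
    candidates = []

    # 1. Contents of the first fenced block, if any.
    fenced = _FENCE_RE.search(text)
    if fenced:
        candidates.append(fenced.group(1).strip())

    # 2. The whole response, in case it is bare JSON.
    candidates.append(text.strip())

    # 3. Fallback: everything between the first "{" and the last "}".
    start, end = text.find("{"), text.rfind("}")
    if start != -1 and end > start:
        candidates.append(text[start:end + 1])

    for candidate in candidates:
        try:
            parsed = json.loads(candidate)
        except json.JSONDecodeError:
            continue
        if isinstance(parsed, dict):
            return parsed

    raise ValueError("No JSON object found in response")
```

Raising `ValueError` when nothing parses keeps the failure explicit, so the caller can decide whether to retry the request or record the page as failed.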

Multi-Page Document Processing

Real-world PDFs often span multiple pages. We need to handle cross-page references, repeated headers, and partial data spread across pages.

# document_processor.py
from typing import List, Dict, Any, Optional
import logging
from concurrent.futures import ThreadPoolExecutor, as_completed
from datetime import datetime

logger = logging.getLogger(__name__)

class DocumentProcessor:
    """
    Orchestrates multi-page PDF extraction with aggregation logic.

    Handles:
    - Multi-page invoices with line items spanning pages
    - Repeated header/footer information
    - Conflicting data across pages (first occurrence for headers, last for totals)
    - Partial extraction failures
    """

    def __init__(
        self,
        extractor: 'StructuredExtractor',
        preprocessor: 'PDFPreprocessor',
        max_workers: int = 3  # Rate limit safe concurrency
    ):
        self.extractor = extractor
        self.preprocessor = preprocessor
        self.max_workers = max_workers

    def process_document(
        self,
        pdf_path: str,
        schema: Dict[str, Any]
    ) -> Dict[str, Any]:
        """
        Process entire PDF document and aggregate results.

        Returns:
        - Extracted data from all pages
        - Metadata about extraction quality
        - Any errors encountered
        """
        # Convert PDF to images
        images = self.preprocessor.convert_to_images(pdf_path)

        if not images:
            return {
                "success": False,
                "error": "No pages could be processed",
                "data": None
            }

        # Extract from each page in parallel (with rate limiting)
        page_results = []
        with ThreadPoolExecutor(max_workers=self.max_workers) as executor:
            future_to_page = {
                executor.submit(
                    self.extractor.extract_from_image,
                    img,
                    schema,
                    i + 1
                ): i + 1
                for i, img in enumerate(images)
            }

            for future in as_completed(future_to_page):
                page_num = future_to_page[future]
                try:
                    result = future.result()
                    page_results.append((page_num, result, None))
                except Exception as e:
                    logger.error(f"Page {page_num} failed: {e}")
                    page_results.append((page_num, None, str(e)))

        # Sort by page number
        page_results.sort(key=lambda x: x[0])

        # Aggregate results
        aggregated = self._aggregate_pages(page_results)

        return aggregated

    def _aggregate_pages(
        self,
        page_results: List[tuple]
    ) -> Dict[str, Any]:
        """
        Merge data from multiple pages intelligently.

        Strategy:
        - Take first occurrence for header fields (invoice number, vendor)
        - Concatenate line items across pages
        - Use last occurrence for totals (most complete)
        """
        pages_processed = len([r for r in page_results if r[1] is not None])
        aggregated = {
            # Report success only if at least one page was extracted
            "success": pages_processed > 0,
            "data": {},
            "pages_processed": pages_processed,
            "pages_failed": len(page_results) - pages_processed,
            "errors": [r[2] for r in page_results if r[2] is not None],
            "extraction_time": datetime.now().isoformat()
        }

        all_line_items = []
        header_fields = {}
        total_fields = {}

        for page_num, data, error in page_results:
            if data is None:
                continue

            # Collect line items from all pages
            if "line_items" in data:
                all_line_items.extend(data["line_items"])

            # Header fields: take first occurrence
            for field in [
                "invoice_number", "vendor_name", "vendor_address",
                "customer_name", "invoice_date", "due_date", "currency",
            ]:
                if field in data and data[field] and field not in header_fields:
                    header_fields[field] = data[field]

            # Total fields: take last occurrence (most complete)
            for field in ["subtotal", "tax_amount", "total_amount"]:
                if field in data and data[field] is not None:
                    total_fields[field] = data[field]

        # Build final aggregated data
        aggregated["data"] = {
            **header_fields,
            "line_items": all_line_items,
            **total_fields,
            "_page_count": len(page_results)
        }

        return aggregated
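
To make the merge strategy concrete, here is a condensed, standalone restatement of it applied to a made-up two-page invoice. The `merge_pages` function and the sample values are illustrative only and do not replace `_aggregate_pages`.

```python
from typing import Any, Dict, List

HEADER_FIELDS = ("invoice_number", "vendor_name", "customer_name")
TOTAL_FIELDS = ("subtotal", "tax_amount", "total_amount")


def merge_pages(pages: List[Dict[str, Any]]) -> Dict[str, Any]:
    """First value wins for headers, last value wins for totals, line items concatenate."""
    merged: Dict[str, Any] = {"line_items": []}
    for page in pages:
        merged["line_items"].extend(page.get("line_items") or [])
        for field in HEADER_FIELDS:
            if page.get(field) and field not in merged:
                merged[field] = page[field]
        for field in TOTAL_FIELDS:
            if page.get(field) is not None:
                merged[field] = page[field]
    return merged


# Made-up two-page invoice: page 1 carries the header, page 2 the final totals.
page_1 = {
    "invoice_number": "INV-001",
    "vendor_name": "Acme Corp",
    "line_items": [{"description": "Widget", "total_price": 40.0}],
    "subtotal": 40.0,  # running subtotal printed on page 1
}
page_2 = {
    "invoice_number": None,  # header not repeated on page 2
    "line_items": [{"description": "Gadget", "total_price": 60.0}],
    "subtotal": 100.0,
    "total_amount": 108.0,
}

merged = merge_pages([page_1, page_2])
print(merged["invoice_number"], merged["subtotal"], len(merged["line_items"]))
# prints: INV-001 100.0 2
```

The running subtotal from page 1 is overwritten by the final subtotal on page 2, which is why pages must be sorted by page number before merging.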

Production Considerations and Edge Cases

Rate Limiting and Cost Management

Claude 3.5 Sonnet API calls cost $3.00 per million input tokens and $15.00 per million output tokens (as of January 2026, according to Anthropic's pricing page). A typical invoice page at 200 DPI JPEG consumes approximately 1,500-2,000 input tokens, plus the tokens for the system prompt and schema. Because the retry logic allows up to 3 attempts per page, budget for that worst case.

# cost_tracker.py
class CostTracker:
    """
    Tracks API usage and costs for extraction jobs.
    """

    def __init__(self):
        self.total_input_tokens = 0
        self.total_output_tokens = 0
        self.input_cost_per_million = 3.00
        self.output_cost_per_million = 15.00

    def add_usage(self, input_tokens: int, output_tokens: int):
        self.total_input_tokens += input_tokens
        self.total_output_tokens += output_tokens

    def calculate_cost(self) -> float:
        input_cost = (self.total_input_tokens / 1_000_000) * self.input_cost_per_million
        output_cost = (self.total_output_tokens / 1_000_000) * self.output_cost_per_million
        return input_cost + output_cost
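
As a worked example of the arithmetic, the sketch below prices a single page using the rates quoted above. The 500 output tokens per page is an illustrative guess rather than a measured figure, and `page_cost` is a hypothetical helper.

```python
INPUT_PRICE_PER_MTOK = 3.00    # USD per million input tokens (rate quoted above)
OUTPUT_PRICE_PER_MTOK = 15.00  # USD per million output tokens (rate quoted above)


def page_cost(input_tokens: int, output_tokens: int, attempts: int = 1) -> float:
    """Cost in USD of extracting one page, repeated `attempts` times."""
    per_attempt = (
        input_tokens / 1_000_000 * INPUT_PRICE_PER_MTOK
        + output_tokens / 1_000_000 * OUTPUT_PRICE_PER_MTOK
    )
    return per_attempt * attempts


# Assumed workload: 2,000 input tokens per page (upper end of the estimate above)
# and 500 output tokens of JSON per page (an illustrative guess).
typical = page_cost(2_000, 500)
worst_case = page_cost(2_000, 500, attempts=3)

print(f"per page: ${typical:.4f}, worst case: ${worst_case:.4f}")
print(f"per 1,000 pages: ${typical * 1000:.2f} to ${worst_case * 1000:.2f}")
```

With these assumptions a page costs about $0.0135 on the first attempt and about $0.0405 if all three attempts are used, so 1,000 pages fall between roughly $13.50 and $40.50. To track real usage rather than estimates, the token counts reported on each API response (`response.usage.input_tokens` and `response.usage.output_tokens`) can be fed into `CostTracker.add_usage`.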

Handling Common Edge Cases

  1. Scanned vs Digital PDFs: Digital PDFs (text-based) can be processed faster by extracting text directly with PyMuPDF, then sending only the text to Claude. Scanned PDFs require the full image pipeline.

  2. Multi-language Documents: Claude 3.5 Sonnet works across many languages, but specify the document language in your system prompt for better accuracy.

  3. Password-Protected PDFs: pdf2image cannot open an encrypted PDF unless it is given the password (convert_from_path accepts a userpw argument). Alternatively, add a decryption step using pikepdf:

# Add as a method of PDFPreprocessor
def decrypt_pdf(self, pdf_path: str, password: str) -> str:
    """Decrypt password-protected PDF and return temporary path."""
    import pikepdf
    import tempfile

    # Reserve a temp path and close the handle so pikepdf can write to it on any OS
    temp = tempfile.NamedTemporaryFile(suffix='.pdf', delete=False)
    temp.close()

    with pikepdf.open(pdf_path, password=password) as pdf:
        pdf.save(temp.name)

    # The caller is responsible for deleting the decrypted copy
    return temp.name

  4. Large PDFs (>50 pages): Process in batches of 10 pages, using the first batch to extract header information, then process remaining pages for line items only.
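
The batching described for large PDFs can be sketched as a small generator of page ranges. `page_batches` is a hypothetical helper, and the ranges are 1-indexed and inclusive so they could be passed to pdf2image's `first_page` and `last_page` arguments.

```python
from typing import Iterator, Tuple


def page_batches(total_pages: int, batch_size: int = 10) -> Iterator[Tuple[int, int]]:
    """Yield inclusive, 1-indexed (first_page, last_page) ranges."""
    if total_pages < 0 or batch_size < 1:
        raise ValueError("total_pages must be >= 0 and batch_size must be >= 1")
    for first in range(1, total_pages + 1, batch_size):
        yield first, min(first + batch_size - 1, total_pages)


print(list(page_batches(25)))
# prints: [(1, 10), (11, 20), (21, 25)]
```

Rendering one range at a time keeps memory bounded, since only one batch of page images is held at once. The total page count could come from pdf2image's `pdfinfo_from_path`, which reports it under the "Pages" key.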

Validation and Quality Assurance

Implement a validation layer that checks extraction quality before accepting results:

# quality_checker.py
from datetime import datetime
from typing import Dict, List

class ExtractionQualityChecker:
    """
    Validates extraction quality using business rules.
    """

    def check_invoice_consistency(self, data: Dict) -> List[str]:
        warnings = []

        # Check arithmetic consistency: line totals should add up to the
        # pre-tax subtotal (fall back to the total if no subtotal was extracted)
        expected = data.get("subtotal")
        label = "Subtotal"
        if expected is None:
            expected = data.get("total_amount")
            label = "Total"
        if data.get("line_items") and expected is not None:
            calculated = sum(
                item.get("total_price") or 0 for item in data["line_items"]
            )
            if abs(calculated - expected) > 0.01:
                warnings.append(
                    f"{label} mismatch: calculated {calculated}, "
                    f"extracted {expected}"
                )

        # Check date validity
        if data.get("invoice_date") is not None:
            try:
                datetime.strptime(data["invoice_date"], "%Y-%m-%d")
            except (ValueError, TypeError):
                warnings.append(f"Invalid date format: {data['invoice_date']}")

        # Check for required fields
        required = ["invoice_number", "vendor_name", "total_amount"]
        for field in required:
            if field not in data or data[field] is None:
                warnings.append(f"Missing required field: {field}")

        return warnings
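
The system prompt also asks the model to verify that subtotal plus tax equals the total, and it is worth repeating that check in code rather than trusting the model. A minimal sketch, assuming a hypothetical `totals_consistent` helper and a tolerance of one cent, uses `Decimal` to avoid float rounding surprises:

```python
from decimal import Decimal
from typing import Optional


def totals_consistent(
    subtotal: float,
    tax_amount: Optional[float],
    total_amount: float,
    tolerance: str = "0.01",
) -> bool:
    """Return True when subtotal + tax equals total within `tolerance`."""
    # Convert via str() so 0.1 becomes Decimal("0.1"), not its binary expansion.
    sub = Decimal(str(subtotal))
    tax = Decimal(str(tax_amount)) if tax_amount is not None else Decimal("0")
    total = Decimal(str(total_amount))
    return abs(sub + tax - total) <= Decimal(tolerance)


print(totals_consistent(100.0, 8.0, 108.0))
print(totals_consistent(100.0, 8.0, 110.0))
```

A missing tax amount is treated as zero here, which is an assumption. Invoices that include discounts or shipping would need those terms added to the sum.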

Putting It All Together

Here's the main entry point that ties everything together:

# main.py
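import json
import logging
import os
from pathlib import Path

from pdf_preprocessor import PDFPreprocessor
from structured_extractor import StructuredExtractor, InvoiceData
from document_processor import DocumentProcessor
from quality_checker import ExtractionQualityChecker

logging.basicConfig(
    level=logging.INFO,
    format='%(asctime)s - %(name)s - %(levelname)s - %(message)s'
)
logger = logging.getLogger(__name__)

def main():
    # Initialize components
    preprocessor = PDFPreprocessor(
        dpi=200,
        max_dimension=4000,
        # Honor the MAX_PDF_PAGES variable exported during setup
        max_pages=int(os.environ.get("MAX_PDF_PAGES", "10"))
    )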

    extractor = StructuredExtractor(
        api_key=os.environ["ANTHROPIC_API_KEY"]
    )

    processor = DocumentProcessor(
        extractor=extractor,
        preprocessor=preprocessor,
        max_workers=3
    )

    # Define extraction schema
    invoice_schema = InvoiceData.model_json_schema()

    # Process document
    result = processor.process_document(
        pdf_path="invoice_sample.pdf",
        schema=invoice_schema
    )

    # Quality check
    checker = ExtractionQualityChecker()
    # "data" is None when no page could be processed, so fall back to an empty dict
    warnings = checker.check_invoice_consistency(result.get("data") or {})

    if warnings:
        logger.warning(f"Quality warnings: {warnings}")
        result["quality_warnings"] = warnings

    # Save results
    output_path = Path("extraction_result.json")
    with open(output_path, "w") as f:
        json.dump(result, f, indent=2, default=str)

    logger.info(f"Extraction complete. Results saved to {output_path}")

if __name__ == "__main__":
    main()

What's Next

This pipeline gives you a production-ready foundation for PDF data extraction with Claude 3.5 Sonnet. To extend it further:

  1. Add a caching layer using Redis to avoid re-processing identical PDFs
  2. Implement a feedback loop where human reviewers correct extraction errors, fine-tuning [2] prompts based on correction patterns
  3. Explore hybrid approaches combining Claude with traditional OCR (Tesseract) for low-quality scans
  4. Add monitoring with Prometheus metrics for extraction latency, success rates, and cost per document
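
For the caching idea, the essential piece is a cache key derived from the file contents rather than the file name. The sketch below hashes the PDF bytes with SHA-256 and uses a plain dictionary as a stand-in for Redis. `pdf_cache_key`, `cached_extract`, and the key format are all hypothetical.

```python
import hashlib
from pathlib import Path
from typing import Any, Callable, Dict, MutableMapping


def pdf_cache_key(pdf_path: str, model: str, chunk_size: int = 1 << 20) -> str:
    """SHA-256 of the file contents, namespaced by model name."""
    digest = hashlib.sha256()
    with Path(pdf_path).open("rb") as f:
        # Read in chunks so large PDFs are not loaded into memory at once.
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return f"extract:{model}:{digest.hexdigest()}"


def cached_extract(
    pdf_path: str,
    model: str,
    extract_fn: Callable[[str], Dict[str, Any]],
    cache: MutableMapping[str, Dict[str, Any]],
) -> Dict[str, Any]:
    """Run extract_fn only when this exact file has not been seen for this model."""
    key = pdf_cache_key(pdf_path, model)
    if key not in cache:
        cache[key] = extract_fn(pdf_path)
    return cache[key]
```

Including the model name in the key means a model change invalidates old entries. A schema or prompt version could be added to the key in the same way.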

The key insight from building this system is that Claude 3.5 Sonnet's vision capabilities make it well suited to structured data extraction from complex documents. By treating extraction as a structured generation problem with proper validation, you can aim for accuracy comparable to manual data entry while processing documents at machine speed.


References

1. Wikipedia - Anthropic. Wikipedia.
2. Wikipedia - Fine-tuning. Wikipedia.
3. Wikipedia - Claude. Wikipedia.
4. GitHub - anthropics/anthropic-sdk-python. GitHub.
5. GitHub - hiyouga/LlamaFactory. GitHub.
6. GitHub - affaan-m/ECC. GitHub.
7. Anthropic Claude Pricing. Pricing.
8. Anthropic Claude Pricing. Pricing.