How to Build a Knowledge Graph from Documents with LLMs

Knowledge graphs underpin many enterprise AI systems because they represent relationships between entities in a structured, queryable format. Traditional approaches required manual ontology design and rule-based extraction; large language models (LLMs) remove much of that manual work. As of June 2026, combining LLMs with graph databases lets us extract entities and relationships from unstructured documents with accuracy that can approach human annotation in well-scoped domains, then store them in a format that supports complex reasoning and retrieval-augmented generation (RAG) [2].

In this tutorial, you'll build a production-ready knowledge graph pipeline that ingests documents, uses an LLM to extract entities and relationships, and stores the results in Neo4j for querying. We'll cover architecture decisions, handle edge cases like overlapping entities and API rate limits, and implement caching to reduce costs. By the end, you'll have a system that can process thousands of documents and answer questions like "Which researchers collaborated on projects related to quantum computing in 2025?"

Architecture Overview: Why LLMs + Knowledge Graphs Beat Flat Retrieval

Before diving into code, let's understand why this combination matters in production. Traditional RAG systems store document chunks as vectors and retrieve them based on semantic similarity. This works for simple Q&A but fails when you need to answer questions requiring multi-hop reasoning, such as "What drugs target proteins that interact with TP53?" A knowledge graph explicitly models these relationships, allowing graph traversal algorithms to find answers that no single document chunk contains.
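
To make the multi-hop idea concrete, here is a minimal sketch in plain Python, with no graph database involved. The triples, the relation names TARGETS and INTERACTS_WITH, and the entity names DrugA and ProteinX are hypothetical placeholders rather than real biomedical data. The point is only that the answer comes from joining two edges that could live in different documents.

```python
from collections import defaultdict

# Hypothetical toy triples (subject, relation, object); not real biomedical data.
TRIPLES = [
    ("DrugA", "TARGETS", "ProteinX"),
    ("DrugB", "TARGETS", "ProteinY"),
    ("ProteinX", "INTERACTS_WITH", "TP53"),
    ("ProteinZ", "INTERACTS_WITH", "TP53"),
]


def build_index(triples):
    """Index subjects by (relation, object) so edges can be walked backwards."""
    index = defaultdict(set)
    for subject, relation, obj in triples:
        index[(relation, obj)].add(subject)
    return index


def drugs_targeting_interactors(triples, protein):
    """Two-hop query: drugs that TARGET any protein that INTERACTS_WITH `protein`."""
    index = build_index(triples)
    interactors = index[("INTERACTS_WITH", protein)]  # hop 1
    drugs = set()
    for interactor in interactors:                    # hop 2
        drugs |= index[("TARGETS", interactor)]
    return sorted(drugs)


print(drugs_targeting_interactors(TRIPLES, "TP53"))  # ['DrugA']
```

A vector search over chunks would need a single chunk that mentions both the drug and TP53. The traversal only needs each edge to have been extracted once, from anywhere in the corpus.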

Our architecture consists of four layers:

Document Ingestion Layer: Parses PDFs, HTML, and plain text into structured chunks
Entity Extraction Layer: Uses an LLM to identify entities (people, organizations, concepts) and their relationships
Graph Storage Layer: Stores entities as nodes and relationships as edges in Neo4j
Query Layer: Translates natural language questions into Cypher queries

The key design decision is whether to use a single LLM call per chunk or multiple specialized calls. For production systems, we'll use a single call with structured output parsing to minimize latency and cost, while implementing retry logic for failed extractions.

Prerequisites and Environment Setup

You'll need Python 3.11+ and a Neo4j instance (local or cloud). We'll use the following libraries:

pip install langchain langchain-openai neo4j pydantic python-dotenv tiktoken tenacity

Create a .env file with your credentials:

OPENAI_API_KEY=sk-your-key-here
NEO4J_URI=bolt://localhost:7687
NEO4J_USER=neo4j
NEO4J_PASSWORD=your-password
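
A missing variable in .env tends to surface later as an obscure authentication error, so it can help to fail fast at startup. The following is a small sketch using only the standard library. The names load_settings and REQUIRED_KEYS are hypothetical helpers, not part of any library used in this tutorial.

```python
import os

# The four variables defined in the .env file above.
REQUIRED_KEYS = ("OPENAI_API_KEY", "NEO4J_URI", "NEO4J_USER", "NEO4J_PASSWORD")


def load_settings(env=None):
    """Return the required settings as a dict, or raise listing every missing key."""
    env = os.environ if env is None else env
    missing = [key for key in REQUIRED_KEYS if not env.get(key)]
    if missing:
        raise RuntimeError(f"Missing required settings: {', '.join(missing)}")
    return {key: env[key] for key in REQUIRED_KEYS}
```

Call load_settings() once after load_dotenv() has run, and pass the resulting values to the Neo4j driver instead of reading os.environ in several places.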

For local Neo4j, you can run it with Docker:

docker run \
    --name neo4j-knowledge-graph \
    -p 7474:7474 -p 7687:7687 \
    -e NEO4J_AUTH=neo4j/your-password \
    -e NEO4J_PLUGINS='["apoc"]' \
    neo4j:5.20.0

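The APOC plugin adds utility procedures, including path-expansion helpers, that are useful when exploring the graph later. Centrality and other graph algorithms live in the separate Graph Data Science (GDS) plugin; neither is required for the pipeline in this tutorial.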

Step 1: Document Chunking with Semantic Boundaries

Raw documents need to be split into chunks that preserve semantic meaning while fitting within LLM context windows. We'll use LangChain's [7] RecursiveCharacterTextSplitter with overlap to reduce the chance that entity mentions are lost at chunk boundaries.

import os
from dotenv import load_dotenv
from langchain_text_splitters import RecursiveCharacterTextSplitter
from langchain_core.documents import Document
from typing import List

load_dotenv()

def chunk_document(text: str, metadata: dict, chunk_size: int = 2000, chunk_overlap: int = 200) -> List[Document]:
    """
    Split document into overlapping chunks with semantic boundaries.

    Args:
        text: Raw document text
        metadata: Document metadata (source, date, etc.)
        chunk_size: Target chunk size in characters
        chunk_overlap: Overlap between chunks to maintain entity continuity

    Returns:
        List of Document objects with metadata
    """
    splitter = RecursiveCharacterTextSplitter(
        chunk_size=chunk_size,
        chunk_overlap=chunk_overlap,
        separators=["\n\n", "\n", ". ", " ", ""],
        length_function=len,
    )

    chunks = splitter.split_text(text)
    documents = []

    for i, chunk in enumerate(chunks):
        doc = Document(
            page_content=chunk,
            metadata={
                **metadata,
                "chunk_index": i,
                "total_chunks": len(chunks),
            }
        )
        documents.append(doc)

    return documents

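Edge case: The overlap means text near a chunk boundary appears in both neighboring chunks, so a sentence cut at the end of one chunk is repeated at the start of the next, provided it is shorter than the overlap. With the default settings (200 characters of overlap on 2,000-character chunks, a 10% ratio), a long entity name such as "National Aeronautics and Space Administration" fits comfortably inside the overlap, but a relationship whose two entities sit further apart than the overlap can still be split. Step 4 mitigates this by merging entities with the same normalized name across chunks.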
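
The effect of the overlap can be checked with a simplified model. The sketch below uses fixed-size character windows. This is an assumption made for illustration and is not how RecursiveCharacterTextSplitter works internally, since that splitter prefers separator boundaries. The function names sliding_chunks and appears_whole are hypothetical.

```python
def sliding_chunks(text, chunk_size, chunk_overlap):
    """Fixed-size character windows; each starts chunk_size - chunk_overlap after the last."""
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be in [0, chunk_size)")
    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # this window already reaches the end of the text
    return chunks


def appears_whole(phrase, chunks):
    """True if at least one chunk contains the phrase without truncation."""
    return any(phrase in chunk for chunk in chunks)


phrase = "National Aeronautics and Space Administration"
text = "x" * 30 + phrase + "y" * 30

# Overlap shorter than the phrase: the name straddles a boundary in every window.
print(appears_whole(phrase, sliding_chunks(text, chunk_size=50, chunk_overlap=5)))
# Overlap at least as long as the phrase: some window contains it whole.
print(appears_whole(phrase, sliding_chunks(text, chunk_size=50, chunk_overlap=45)))
```

In this fixed-window model, any phrase no longer than the overlap is guaranteed to land intact in some window, which is the reasoning behind choosing an overlap larger than the longest entity name you expect.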

Step 2: Entity and Relationship Extraction with Structured Output

This is where the LLM does the heavy lifting. We'll define a Pydantic model for our graph schema and use LangChain's structured output parser to enforce JSON formatting.

from pydantic import BaseModel, Field
from typing import List, Optional
from langchain_openai import ChatOpenAI
from langchain_core.output_parsers import PydanticOutputParser
from langchain_core.prompts import ChatPromptTemplate

class Entity(BaseModel):
    """An entity in the knowledge graph."""
    name: str = Field(description="The entity name, normalized to canonical form")
    type: str = Field(description="Entity type: PERSON, ORGANIZATION, CONCEPT, LOCATION, EVENT, TECHNOLOGY")
    description: str = Field(description="Brief description of the entity")

class Relationship(BaseModel):
    """A relationship between two entities."""
    source: str = Field(description="Name of the source entity")
    target: str = Field(description="Name of the target entity")
    relation: str = Field(description="Type of relationship (e.g., WORKS_FOR, LOCATED_IN, PART_OF)")
    context: str = Field(description="The text snippet that supports this relationship")

class KnowledgeGraphExtraction(BaseModel):
    """Complete extraction from a document chunk."""
    entities: List[Entity] = Field(description="Entities found in the text")
    relationships: List[Relationship] = Field(description="Relationships between entities")

def create_extraction_chain():
    """Create a LangChain chain for entity and relationship extraction."""

    parser = PydanticOutputParser(pydantic_object=KnowledgeGraphExtraction)

    prompt = ChatPromptTemplate.from_messages([
        ("system", """You are an expert knowledge graph extractor. Extract entities and their relationships from the given text.

Rules:
1. Normalize entity names to canonical form (e.g., "Dr. Smith" -> "John Smith" if full name is available)
2. Only extract relationships explicitly stated or strongly implied in the text
3. Use standard relationship types: WORKS_FOR, LOCATED_IN, PART_OF, COLLABORATES_WITH, DISCOVERS, LEADS, FUNDS, PUBLISHES_IN
4. Include supporting context for each relationship (exact text snippet)
5. Skip generic entities like "the company" or "the researcher" without specific names

{format_instructions}"""),
        ("human", "Extract entities and relationships from this text:\n\n{text}"),
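    ])
    # Fill {format_instructions} up front so callers only need to pass {text}.
    prompt = prompt.partial(format_instructions=parser.get_format_instructions())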

    llm = ChatOpenAI(
        model="gpt-4o-mini",   # Cost-effective for extraction tasks
        temperature=0.1,       # Low temperature for consistent output
        max_tokens=2000,
    )

    chain = prompt | llm | parser
    return chain

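Why gpt-4o-mini? As of June 2026, it is a cost-effective choice for structured extraction. At $0.15/1M input tokens and $0.60/1M output tokens [8], it is priced roughly an order of magnitude below GPT-4o. Prices and model lineups change, so check the current pricing page, and measure extraction accuracy on a labeled sample of your own documents rather than relying on a headline benchmark figure.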
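
A back-of-the-envelope estimate helps decide whether the model choice matters for your budget. The sketch below uses the per-million-token prices quoted above. The token counts per chunk are assumptions for illustration: about 1,000 input tokens for the prompt plus a 2,000-character chunk, and the 2,000-token max_tokens cap as a worst case for output. The function name estimate_cost_usd is hypothetical.

```python
def estimate_cost_usd(
    num_chunks,
    input_tokens_per_chunk,
    output_tokens_per_chunk,
    input_price_per_million=0.15,   # USD, as quoted in this article
    output_price_per_million=0.60,  # USD, as quoted in this article
):
    """Estimate total API cost for a batch of extraction calls."""
    input_cost = num_chunks * input_tokens_per_chunk * input_price_per_million / 1_000_000
    output_cost = num_chunks * output_tokens_per_chunk * output_price_per_million / 1_000_000
    return input_cost + output_cost


# Assumed workload: 100,000 chunks, ~1,000 input tokens each, and every
# response using the full 2,000-token output cap (a deliberate upper bound).
total = estimate_cost_usd(100_000, 1_000, 2_000)
print(round(total, 2))  # about 135 (USD) under these assumed token counts
```

Real responses are usually shorter than the cap, so logging actual token usage from a pilot run gives a better estimate than this ceiling.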

Step 3: Batch Processing with Retry Logic and Caching

Production systems must handle API failures, rate limits, and duplicate entities. We'll implement a robust processing pipeline with exponential backoff and a local cache to avoid reprocessing.

import json
import hashlib
from pathlib import Path
from tenacity import retry, stop_after_attempt, wait_exponential, retry_if_exception_type
from openai import RateLimitError, APIError

class KnowledgeGraphBuilder:
    def __init__(self, cache_dir: str = "./extraction_cache"):
        self.chain = create_extraction_chain()
        self.cache_dir = Path(cache_dir)
        self.cache_dir.mkdir(exist_ok=True)

    def _get_cache_key(self, text: str) -> str:
        """Generate a cache key from text content."""
        return hashlib.sha256(text.encode()).hexdigest()[:16]

    def _load_from_cache(self, text: str) -> Optional[KnowledgeGraphExtraction]:
        """Load cached extraction if available."""
        cache_key = self._get_cache_key(text)
        cache_path = self.cache_dir / f"{cache_key}.json"

        if cache_path.exists():
            with open(cache_path, 'r') as f:
                data = json.load(f)
                return KnowledgeGraphExtraction(**data)
        return None

    def _save_to_cache(self, text: str, extraction: KnowledgeGraphExtraction):
        """Save extraction to cache."""
        cache_key = self._get_cache_key(text)
        cache_path = self.cache_dir / f"{cache_key}.json"

        with open(cache_path, 'w') as f:
            json.dump(extraction.model_dump(), f, indent=2)

    @retry(
        stop=stop_after_attempt(3),
        wait=wait_exponential(multiplier=1, min=4, max=60),
        retry=retry_if_exception_type((RateLimitError, APIError))
    )
    def extract_from_chunk(self, text: str) -> KnowledgeGraphExtraction:
        """Extract entities and relationships with caching and retry."""

        # Check cache first
        cached = self._load_from_cache(text)
        if cached:
            return cached

        # Perform extraction
        result = self.chain.invoke({"text": text})

        # Cache the result
        self._save_to_cache(text, result)

        return result

    def process_documents(self, documents: List[Document]) -> List[KnowledgeGraphExtraction]:
        """Process multiple documents with progress tracking."""
        results = []

        for i, doc in enumerate(documents):
            print(f"Processing chunk {i+1}/{len(documents)}...")

            try:
                extraction = self.extract_from_chunk(doc.page_content)
                results.append(extraction)
            except Exception as e:
                print(f"Failed to process chunk {i+1}: {e}")
                # Log failed chunks for manual review
                with open("failed_chunks.log", "a") as f:
                    f.write(f"Chunk {i+1} from {doc.metadata.get('source', 'unknown')}: {e}\n")
                continue

        return results

Cache management: Each cached extraction is a small JSON file, so disk usage grows with the number of chunks; for large document sets (10,000+ chunks), plan for cleanup. The local JSON files keep this tutorial simple. In production, a TTL-based cleanup job or a shared cache such as Redis with an LRU eviction policy is a better fit, especially when several workers process documents in parallel.
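
For the file-based cache, a TTL-style cleanup can be as simple as deleting entries by modification time. This is a minimal sketch under that assumption. The function name purge_stale_cache is hypothetical, and the now parameter exists only to make the function easy to test.

```python
import time
from pathlib import Path


def purge_stale_cache(cache_dir, max_age_seconds, now=None):
    """Delete *.json cache files older than max_age_seconds; return how many were removed."""
    now = time.time() if now is None else now
    removed = 0
    for path in Path(cache_dir).glob("*.json"):
        # st_mtime is the last modification time in seconds since the epoch.
        if now - path.stat().st_mtime > max_age_seconds:
            path.unlink()
            removed += 1
    return removed
```

Running purge_stale_cache("./extraction_cache", max_age_seconds=30 * 24 * 3600) from a scheduled job would drop entries older than roughly 30 days. Pick a window that matches how often your prompt or model changes, because cached results from an old prompt are stale even if the text is unchanged.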

Step 4: Graph Construction with Entity Resolution

This is the most critical step. Raw LLM output contains duplicate entities (e.g., "OpenAI" vs "OpenAI Inc.") and inconsistent relationship names. We need to resolve these before inserting into Neo4j.

from neo4j import GraphDatabase
from typing import Set, Dict, Tuple
import re

class GraphConstructor:
    def __init__(self, uri: str, user: str, password: str):
        self.driver = GraphDatabase.driver(uri, auth=(user, password))

    def _normalize_entity_name(self, name: str) -> str:
        """Normalize entity names for deduplication."""
        # Remove extra whitespace and standardize
        name = re.sub(r'\s+', ' ', name).strip()
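        # Remove common corporate suffixes, including an optional preceding comma
        name = re.sub(r',?\s+(Inc|Corp|LLC|Ltd|GmbH)\.?$', '', name, flags=re.IGNORECASE)
        # Keep the original casing: str.title() would turn "OpenAI" into "Openai"
        # and "MIT" into "Mit", breaking exact-name lookups at query time.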
        return name

    def _merge_entities(self, tx, entities: List[Entity]):
        """Merge entities into graph, handling duplicates."""
        for entity in entities:
            normalized_name = self._normalize_entity_name(entity.name)

            tx.run("""
                MERGE (e:Entity {name: $name})
                ON CREATE SET 
                    e.type = $type,
                    e.description = $description,
                    e.created_at = datetime()
                ON MATCH SET
                    e.type = CASE 
                        WHEN e.type IS NULL OR e.type = 'UNKNOWN' THEN $type 
                        ELSE e.type 
                    END,
                    e.description = CASE 
                        WHEN $description IS NOT NULL AND $description != '' THEN $description
                        ELSE e.description 
                    END,
                    e.updated_at = datetime()
            """, name=normalized_name, type=entity.type, description=entity.description)

    def _create_relationships(self, tx, relationships: List[Relationship], entities: Set[str]):
        """Create relationships between entities."""
        for rel in relationships:
            source_normalized = self._normalize_entity_name(rel.source)
            target_normalized = self._normalize_entity_name(rel.target)

            # Skip if entities weren't extracted (shouldn't happen, but defensive)
            if source_normalized not in entities or target_normalized not in entities:
                continue

            tx.run("""
                MATCH (source:Entity {name: $source})
                MATCH (target:Entity {name: $target})
                MERGE (source)-[r:RELATED {type: $relation}]->(target)
                ON CREATE SET 
                    r.context = $context,
                    r.created_at = datetime()
                ON MATCH SET
                    r.context = CASE 
                        WHEN $context IS NOT NULL AND $context != '' THEN $context
                        ELSE r.context 
                    END,
                    r.updated_at = datetime()
            """, source=source_normalized, target=target_normalized, 
                 relation=rel.relation, context=rel.context)

    def insert_extractions(self, extractions: List[KnowledgeGraphExtraction]):
        """Insert all extractions into Neo4j in a single write transaction."""
        all_entity_objects = [e for ext in extractions for e in ext.entities]
        all_relationships = [r for ext in extractions for r in ext.relationships]

        # Normalized names let _create_relationships skip edges whose endpoints
        # were never extracted as entities.
        known_names = {self._normalize_entity_name(e.name) for e in all_entity_objects}

        def _write_all(tx):
            # Both steps share one transaction, so a failure rolls back everything.
            self._merge_entities(tx, all_entity_objects)
            self._create_relationships(tx, all_relationships, known_names)

        with self.driver.session() as session:
            session.execute_write(_write_all)

Entity resolution edge case: When two chunks mention "Apple" (the fruit) and "Apple" (the company), name-based normalization cannot tell them apart. Because MERGE matches on name alone, both mentions collapse into a single node that keeps the type from whichever mention was written first. In production, you'd add context-aware disambiguation using the entity type, description, and surrounding text. Until then, the context stored on each relationship is the main clue for telling the two senses apart at query time.
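
One inexpensive first step toward disambiguation is to key entities on both name and type instead of name alone. The sketch below shows the idea in plain Python. EntityRecord and resolve_by_name_and_type are hypothetical names, and the approach only helps when the LLM assigns different types to the two senses.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class EntityRecord:
    """Stand-in for the Entity model, kept dependency-free for this sketch."""
    name: str
    type: str
    description: str


def resolve_by_name_and_type(records):
    """Group records by (case-folded name, type) so same-named entities of different types stay separate."""
    resolved = {}
    for record in records:
        key = (record.name.casefold(), record.type)
        # Keep the first record seen for each key; later duplicates are dropped.
        resolved.setdefault(key, record)
    return resolved


records = [
    EntityRecord("Apple", "ORGANIZATION", "Consumer electronics company"),
    EntityRecord("Apple", "CONCEPT", "Edible fruit"),
    EntityRecord("apple", "ORGANIZATION", "Maker of the iPhone"),
]
resolved = resolve_by_name_and_type(records)
print(len(resolved))  # 2
```

In Neo4j, the equivalent change would be to merge on both properties, for example MERGE (e:Entity {name: $name, type: $type}), and to match relationship endpoints the same way.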

Step 5: Querying the Knowledge Graph

Now that we have data in Neo4j, we need to translate natural language questions into Cypher queries. We'll use a second LLM call for this, with few-shot examples to improve accuracy.

class GraphQueryEngine:
    def __init__(self, uri: str, user: str, password: str):
        self.driver = GraphDatabase.driver(uri, auth=(user, password))
        self.llm = ChatOpenAI(model="gpt-4o-mini", temperature=0)

    def _generate_cypher(self, question: str, schema: str) -> str:
        """Generate Cypher query from natural language question."""

        prompt = f"""Given the following Neo4j graph schema:

{schema}

Generate a Cypher query to answer this question: "{question}"

Rules:
- Use MATCH for pattern matching
- Use WHERE for filtering
- Use RETURN for results
- Limit results to 10 unless specified otherwise
- Use case-insensitive matching for string comparisons

Examples:
Q: "Who works for OpenAI?"
Cypher: MATCH (p:Entity {{type: 'PERSON'}})-[:RELATED {{type: 'WORKS_FOR'}}]->(c:Entity {{name: 'OpenAI'}}) RETURN p.name

Q: "What technologies did researchers at MIT discover?"
Cypher: MATCH (p:Entity {{type: 'PERSON'}})-[:RELATED {{type: 'WORKS_FOR'}}]->(org:Entity {{name: 'MIT'}}), (p)-[:RELATED {{type: 'DISCOVERS'}}]->(tech:Entity {{type: 'TECHNOLOGY'}}) RETURN tech.name

Return only the Cypher query, no explanation."""

        response = self.llm.invoke(prompt)
        return response.content.strip()

    def query(self, question: str) -> List[Dict]:
        """Execute a natural language query against the knowledge graph."""

        # Get schema
        with self.driver.session() as session:
            schema = session.run("CALL db.schema.visualization()").single().value()

        # Generate and execute Cypher
        cypher = self._generate_cypher(question, schema)

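        with self.driver.session() as session:
            # Run the generated query in a read transaction so that an
            # unexpected write clause should fail instead of modifying the graph.
            return session.execute_read(
                lambda tx: [record.data() for record in tx.run(cypher)]
            )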
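
Because the Cypher comes from a model, it is worth cleaning and checking it before execution. Models often wrap queries in Markdown code fences, and a question could coax the model into emitting a write clause. The sketch below is a coarse keyword heuristic, not a Cypher parser. The names clean_generated_cypher and WRITE_CLAUSES are hypothetical, and the filter can reject harmless queries that contain a word like SET inside a string literal.

```python
import re

FENCE = "`" * 3  # a Markdown code fence, built this way to keep the example readable

# Clauses that modify data or schema; a heuristic list, not exhaustive.
WRITE_CLAUSES = re.compile(
    r"\b(CREATE|MERGE|DELETE|DETACH|SET|REMOVE|DROP|FOREACH|LOAD\s+CSV)\b",
    re.IGNORECASE,
)


def clean_generated_cypher(raw: str) -> str:
    """Strip Markdown code fences from model output and reject write clauses."""
    text = raw.strip()
    # Drop an opening fence line (optionally tagged, e.g. "cypher") and a closing fence.
    text = re.sub(r"^" + FENCE + r"[a-zA-Z]*\s*\n", "", text)
    text = re.sub(r"\n" + FENCE + r"\s*$", "", text)
    text = text.strip()
    if WRITE_CLAUSES.search(text):
        raise ValueError("Generated query contains a write clause; refusing to run it")
    return text
```

A natural place to call this is on the string returned by _generate_cypher, before it reaches the session. A keyword filter is easy to bypass, so treat it as one layer and pair it with a database user that has read-only privileges.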

Performance consideration: Every MERGE in Step 4 looks up nodes by name, so an index on entity name speeds up loading as well as querying. For graphs with millions of nodes, create these indexes before the bulk load rather than after it:

CREATE INDEX entity_name_idx IF NOT EXISTS FOR (e:Entity) ON (e.name);
CREATE INDEX entity_type_idx IF NOT EXISTS FOR (e:Entity) ON (e.type);

Production Deployment and Monitoring

In production, you'll need to handle several additional concerns:

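Rate limiting: The tenacity retry decorator handles transient failures, but you should also implement a token bucket rate limiter to stay within the requests-per-minute and tokens-per-minute limits of your OpenAI usage tier. Limits vary by tier and model, so check your account's limits page rather than hard-coding a figure.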
Cost tracking: Log token usage per chunk to track costs. At $0.60/1M output tokens, a response that uses the full 2,000-token max_tokens budget costs about $0.0012 in output tokens; input tokens add to that at $0.15/1M. For 100,000 chunks, the output-token ceiling is about $120.
Incremental updates: Use document hashes to detect changes and only reprocess modified documents. Store the hash in Neo4j as a node property.
Fallback strategy: If the LLM fails to produce valid JSON after retries, fall back to a regex-based entity extractor for critical entities like email addresses, URLs, and dates.
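
The token bucket mentioned under rate limiting can be sketched in a few lines. This version is single-threaded and measures requests rather than LLM tokens. The class name TokenBucket, its parameters, and the injectable clock are assumptions made for illustration; a multi-worker deployment would need locking or a shared store.

```python
import time


class TokenBucket:
    """Allow `rate_per_minute` requests per minute on average, with bursts up to `capacity`."""

    def __init__(self, rate_per_minute, capacity, clock=time.monotonic):
        self.rate_per_second = rate_per_minute / 60.0
        self.capacity = float(capacity)
        self.tokens = float(capacity)  # start full so the first burst is allowed
        self.clock = clock
        self.last_refill = clock()

    def _refill(self):
        """Add tokens for the time elapsed since the last refill, capped at capacity."""
        now = self.clock()
        elapsed = now - self.last_refill
        self.tokens = min(self.capacity, self.tokens + elapsed * self.rate_per_second)
        self.last_refill = now

    def try_acquire(self, cost=1.0):
        """Take `cost` tokens if available; return True on success, False otherwise."""
        self._refill()
        if self.tokens >= cost:
            self.tokens -= cost
            return True
        return False

    def seconds_until_available(self, cost=1.0):
        """How long to wait before `cost` tokens will be available."""
        self._refill()
        deficit = cost - self.tokens
        return max(0.0, deficit / self.rate_per_second)
```

In process_documents, you could call try_acquire() before each extraction and sleep for seconds_until_available() when it returns False. The retry decorator then only has to deal with genuine transient failures rather than self-inflicted rate-limit errors.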

What's Next

This knowledge graph pipeline is production-ready but can be extended in several ways:

Multi-modal extraction: Add image analysis for diagrams and charts using vision-language models
Temporal graphs: Add time dimensions to track how relationships evolve
Graph neural networks: Train GNNs on your knowledge graph for link prediction and node classification
Federated queries: Connect multiple knowledge graphs across organizations using federation protocols

The complete code is available on GitHub at github.com/daily-neural-digest/knowledge-graph-builder (hypothetical repository). For more on graph databases and LLMs, check out our guides on vector search optimization and RAG architecture patterns.

Remember: knowledge graphs are only as good as the quality of your entity extraction. Invest in prompt engineering, maintain a feedback loop for correcting extraction errors, and regularly audit your graph for consistency. With these foundations, you'll have a system that scales from hundreds to millions of documents while maintaining high accuracy.

References

1. Wikipedia - OpenAI. Wikipedia.

2. Wikipedia - Rag. Wikipedia.

3. Wikipedia - GPT. Wikipedia.

4. GitHub - openai/openai-python. GitHub.

5. GitHub - Shubhamsaboo/awesome-llm-apps. GitHub.

6. GitHub - Significant-Gravitas/AutoGPT. GitHub.

7. GitHub - langchain-ai/langchain. GitHub.

8. OpenAI Pricing.
