How to Build a Knowledge Graph from Documents with LLMs
Table of Contents
- Introduction & Architecture
- Prerequisites & Setup
- Core Implementation: Step-by-Step
- Configuration & Production Optimization
- Advanced Tips & Edge Cases (Deep Dive)
- Results & Next Steps
Introduction & Architecture
In 2026, large language models (LLMs) have become indispensable tools for natural language processing tasks, including information extraction and knowledge graph construction. A knowledge graph is a structured representation of entities and their relationships, which can be used to enhance search engines, recommendation systems, and other AI applications.
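As a minimal illustration (the entity names here are invented for the example), a knowledge graph can be modeled as a set of (subject, relation, object) triples:

```python
# A knowledge graph modeled as (subject, relation, object) triples
triples = {
    ("Marie Curie", "won", "Nobel Prize"),
    ("Marie Curie", "born_in", "Warsaw"),
}

def relations_for(subject, triples):
    """Return all (relation, object) pairs recorded for a subject."""
    return sorted((r, o) for s, r, o in triples if s == subject)

print(relations_for("Marie Curie", triples))
# prints [('born_in', 'Warsaw'), ('won', 'Nobel Prize')]
```

The rest of the tutorial builds exactly this kind of structure automatically, with the LLM and spaCy producing the triples and networkx storing them.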
This tutorial will guide you through building a production-ready system that leverages LLMs to extract key facts from documents and construct a knowledge graph. We'll use the latest stable versions of Python libraries and APIs available as of March 30, 2026. The architecture involves using an LLM for text extraction, then applying natural language processing (NLP) techniques to identify entities and relationships.
The system will be designed with scalability in mind, allowing it to handle large volumes of data efficiently. We'll also cover how to optimize the pipeline for both CPU and GPU environments, ensuring that the solution is adaptable to different deployment scenarios.
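As a sketch of the CPU/GPU adaptability mentioned above (assuming PyTorch as the backend, which transformers uses by default), device selection can be as simple as:

```python
import torch

# Use a GPU when one is visible to PyTorch, otherwise fall back to CPU
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
print(f"Running on: {device}")
```

Models and input tensors can then be moved onto the selected device with `.to(device)`, so the same pipeline code runs unchanged in both environments.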
Prerequisites & Setup
To follow this tutorial, you need a Python environment set up with the necessary libraries installed. The following dependencies are required:
- transformers: For interfacing with LLMs.
- spacy: For advanced NLP tasks like entity recognition and relationship extraction.
- networkx: To build and visualize the knowledge graph.
pip install transformers spacy networkx
Ensure that you have a compatible version of spaCy installed, as it requires specific language models. You can download them using:
python -m spacy download en_core_web_sm
Core Implementation: Step-by-Step
Step 1: Load and Preprocess Documents
First, we need to load the documents from which we will extract information. This step involves cleaning the text data to ensure that it is in a suitable format for processing by the LLM.
import re
import spacy
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

# Load the small English pipeline (its NER component is needed in Step 3)
nlp = spacy.load("en_core_web_sm")

def preprocess_text(text):
    # Remove non-alphanumeric characters and collapse extra spaces
    text = re.sub(r'\W+', ' ', text)
    text = re.sub(' +', ' ', text).strip()
    # Tokenize the document using spaCy
    doc = nlp(text)
    return [token.text for token in doc]

# Example usage
document_text = "This is a sample document containing information about entities and relationships."
tokens = preprocess_text(document_text)
print(tokens)
Step 2: Extract Key Facts with LLM
Next, we use an LLM to extract key facts from the preprocessed text. This involves fine-tuning [1] or using a pretrained model for sequence-to-sequence tasks.
tokenizer = AutoTokenizer.from_pretrained("t5-small")
model = AutoModelForSeq2SeqLM.from_pretrained("t5-small")

def extract_facts(text):
    # Truncate long documents to the model's context window
    inputs = tokenizer("extract facts: " + text, return_tensors="pt",
                       truncation=True, max_length=512)
    outputs = model.generate(inputs["input_ids"], max_length=128,
                             num_beams=4, early_stopping=True)
    decoded_output = tokenizer.decode(outputs[0], skip_special_tokens=True)
    # Split the output into individual facts
    facts = [fact.strip() for fact in decoded_output.split(",")]
    return facts

# Example usage
facts = extract_facts(document_text)
print(facts)
Step 3: Identify Entities and Relationships
Now that we have extracted key facts, we can use spaCy to identify entities and their relationships within these facts.
def process_facts(facts):
    entity_relationships = []
    for fact in facts:
        doc = nlp(fact)
        # Extract named entities
        entities = [(ent.text, ent.label_) for ent in doc.ents]
        if len(entities) < 2:
            continue
        # Use the first verb lemma as the relation label, with a fallback
        verbs = [token.lemma_ for token in doc if token.pos_ == "VERB"]
        relation = verbs[0] if verbs else "related_to"
        # Link each consecutive pair of entities with that relation
        for (e1, _), (e2, _) in zip(entities, entities[1:]):
            entity_relationships.append((e1, relation, e2))
    return entity_relationships

# Example usage
relationships = process_facts(facts)
print(relationships)
Step 4: Construct the Knowledge Graph
Finally, we construct a knowledge graph using networkx to represent the extracted entities and relationships.
import networkx as nx
import matplotlib.pyplot as plt

def build_knowledge_graph(entity_relationships):
    # Directed graph: (subject, relation, object) triples have a direction
    G = nx.DiGraph()
    for entity1, relation, entity2 in entity_relationships:
        G.add_edge(entity1, entity2, label=relation)
    return G

# Example usage
G = build_knowledge_graph(relationships)
nx.draw(G, with_labels=True)
plt.show()
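Beyond drawing it, the graph can also be queried programmatically; for example (the edge data here is illustrative, standing in for whatever the pipeline extracted):

```python
import networkx as nx

G = nx.DiGraph()
G.add_edge("Marie Curie", "Radium", label="discovered")
G.add_edge("Marie Curie", "Nobel Prize", label="won")

# Print each stored triple back out of the graph
for subject, obj, data in G.edges(data=True):
    print(f"{subject} -[{data['label']}]-> {obj}")

# Look up everything directly connected to one entity
print(list(G.successors("Marie Curie")))
```

This kind of traversal is what downstream consumers (search, recommendations) would actually run against the graph.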
Configuration & Production Optimization
To scale this system for production use, consider the following optimizations:
- Batch Processing: Instead of processing documents one by one, batch them to improve throughput.
- Asynchronous Processing: Use asynchronous programming techniques to handle multiple requests concurrently.
- GPU Acceleration: For large datasets and complex models, leverage GPU acceleration.
Batch Processing Example
def process_batch(batch):
    facts = [extract_facts(text) for text in batch]
    relationships = [process_facts(fact_list) for fact_list in facts]
    # Flatten the per-document relationships into one graph for the batch
    return build_knowledge_graph([rel for sublist in relationships for rel in sublist])

# Example usage with a list of documents
batch_documents = ["Document 1", "Document 2"]
G_batched = process_batch(batch_documents)
Asynchronous Processing Example
import asyncio
from concurrent.futures import ProcessPoolExecutor

async def async_process(document):
    loop = asyncio.get_running_loop()
    # Run the CPU-bound extraction in a worker process
    with ProcessPoolExecutor() as executor:
        facts = await loop.run_in_executor(executor, extract_facts, document)
    relationships = process_facts(facts)
    return build_knowledge_graph(relationships)

async def main(documents):
    return await asyncio.gather(*(async_process(doc) for doc in documents))

# Example usage
documents = ["Document 1", "Document 2"]
graphs = asyncio.run(main(documents))
Advanced Tips & Edge Cases (Deep Dive)
Error Handling
Ensure robust error handling to manage issues like network failures or model errors gracefully.
def safe_extract_facts(text):
    try:
        return extract_facts(text)
    except Exception as e:
        print(f"Error processing text: {e}")
        return []
Security Risks
Be cautious of prompt injection attacks where malicious inputs could manipulate the LLM's output. Sanitize input data thoroughly.
def sanitize_input(input_text):
    # Minimal example: drop non-printable characters and cap the length.
    # Note that sanitization alone does not fully prevent prompt injection;
    # treat model output as untrusted as well.
    cleaned = "".join(ch for ch in input_text if ch.isprintable() or ch.isspace())
    return cleaned[:10000]
Results & Next Steps
By following this tutorial, you have built a system capable of extracting key facts from documents and constructing a knowledge graph using state-of-the-art LLMs. The next steps could include:
- Scaling: Deploy the solution on cloud platforms like AWS or Google Cloud for handling larger datasets.
- Integration: Integrate with existing data pipelines to automate the extraction process.
- Enhancements: Explore more advanced NLP techniques and models to improve accuracy.
This system can serve as a foundational component in various applications, from intelligent search engines to personalized recommendation systems.