How to Build a Knowledge Graph from Documents with LLMs
Table of Contents
- Introduction & Architecture
- Prerequisites & Setup
- Core Implementation: Step-by-Step
- Configuration & Production Optimization
- Advanced Tips & Edge Cases (Deep Dive)
- Results & Next Steps
Introduction & Architecture
In 2026, large language models (LLMs) have become indispensable tools for natural language processing tasks, including information extraction and knowledge graph construction. A knowledge graph is a structured representation of entities and their relationships, which can be used to enhance search engines, recommendation systems, and other AI applications.
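As a minimal illustration (the entity names here are invented for the example), a knowledge graph can be modeled as a set of (subject, relation, object) triples:

```python
# A knowledge graph modeled as (subject, relation, object) triples
triples = {
    ("Marie Curie", "won", "Nobel Prize"),
    ("Marie Curie", "born_in", "Warsaw"),
}

def relations_for(subject, triples):
    """Return all (relation, object) pairs recorded for a subject."""
    return sorted((r, o) for s, r, o in triples if s == subject)

print(relations_for("Marie Curie", triples))
# prints [('born_in', 'Warsaw'), ('won', 'Nobel Prize')]
```

The rest of the tutorial builds exactly this kind of structure automatically, with the LLM and spaCy producing the triples and networkx storing them.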
This tutorial will guide you through building a production-ready system that leverages LLMs to extract key facts from documents and construct a knowledge graph. We'll use the latest stable versions of Python libraries and APIs available as of March 30, 2026. The architecture involves using an LLM for text extraction, then applying natural language processing (NLP) techniques to identify entities and relationships.
The system will be designed with scalability in mind, allowing it to handle large volumes of data efficiently. We'll also cover how to optimize the pipeline for both CPU and GPU environments, ensuring that the solution is adaptable to different deployment scenarios.
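As a sketch of the CPU/GPU adaptability mentioned above (assuming PyTorch as the backend, which transformers uses by default), device selection can be as simple as:

```python
import torch

# Use a GPU when one is visible to PyTorch, otherwise fall back to CPU
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
print(f"Running on: {device}")
```

Models and input tensors can then be moved onto the selected device with `.to(device)`, so the same pipeline code runs unchanged in both environments.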
Prerequisites & Setup
To follow this tutorial, you need a Python environment set up with the necessary libraries installed. The following dependencies are required:
- transformers: For interfacing with LLMs.
- spacy: For advanced NLP tasks like entity recognition and relationship extraction.
- networkx: To build and visualize the knowledge graph.
pip install transformers spacy networkx
Ensure that you have a compatible version of spaCy installed, as it requires specific language models. You can download them using:
python -m spacy download en_core_web_sm
Core Implementation: Step-by-Step
Step 1: Load and Preprocess Documents
First, we need to load the documents from which we will extract information. This step involves cleaning the text data to ensure that it is in a suitable format for processing by the LLM.
import re
import spacy
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

# Load the small English pipeline (its NER component is needed in Step 3)
nlp = spacy.load("en_core_web_sm")

def preprocess_text(text):
    # Remove non-alphanumeric characters and collapse extra spaces
    text = re.sub(r'\W+', ' ', text)
    text = re.sub(' +', ' ', text).strip()
    # Tokenize the document using spaCy
    doc = nlp(text)
    return [token.text for token in doc]

# Example usage
document_text = "This is a sample document containing information about entities and relationships."
tokens = preprocess_text(document_text)
print(tokens)
Step 2: Extract Key Facts with LLM
Next, we use an LLM to extract key facts from the preprocessed text. This involves fine-tuning [1] or using a pretrained model for sequence-to-sequence tasks.
tokenizer = AutoTokenizer.from_pretrained("t5-small")
model = AutoModelForSeq2SeqLM.from_pretrained("t5-small")

def extract_facts(text):
    # Truncate long documents to the model's context window
    inputs = tokenizer("extract facts: " + text, return_tensors="pt",
                       truncation=True, max_length=512)
    outputs = model.generate(inputs["input_ids"], max_length=128,
                             num_beams=4, early_stopping=True)
    decoded_output = tokenizer.decode(outputs[0], skip_special_tokens=True)
    # Split the output into individual facts
    facts = [fact.strip() for fact in decoded_output.split(",")]
    return facts

# Example usage
facts = extract_facts(document_text)
print(facts)
Step 3: Identify Entities and Relationships
Now that we have extracted key facts, we can use spaCy to identify entities and their relationships within these facts.
def process_facts(facts):
    entity_relationships = []
    for fact in facts:
        doc = nlp(fact)
        # Extract named entities
        entities = [(ent.text, ent.label_) for ent in doc.ents]
        if len(entities) < 2:
            continue
        # Use the first verb lemma as the relation label, with a fallback
        verbs = [token.lemma_ for token in doc if token.pos_ == "VERB"]
        relation = verbs[0] if verbs else "related_to"
        # Link each consecutive pair of entities with that relation
        for (e1, _), (e2, _) in zip(entities, entities[1:]):
            entity_relationships.append((e1, relation, e2))
    return entity_relationships

# Example usage
relationships = process_facts(facts)
print(relationships)
Step 4: Construct the Knowledge Graph
Finally, we construct a knowledge graph using networkx to represent the extracted entities and relationships.
import networkx as nx
import matplotlib.pyplot as plt

def build_knowledge_graph(entity_relationships):
    # Directed graph: (subject, relation, object) triples have a direction
    G = nx.DiGraph()
    for entity1, relation, entity2 in entity_relationships:
        G.add_edge(entity1, entity2, label=relation)
    return G

# Example usage
G = build_knowledge_graph(relationships)
nx.draw(G, with_labels=True)
plt.show()
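Beyond drawing it, the graph can also be queried programmatically; for example (the edge data here is illustrative, standing in for whatever the pipeline extracted):

```python
import networkx as nx

G = nx.DiGraph()
G.add_edge("Marie Curie", "Radium", label="discovered")
G.add_edge("Marie Curie", "Nobel Prize", label="won")

# Print each stored triple back out of the graph
for subject, obj, data in G.edges(data=True):
    print(f"{subject} -[{data['label']}]-> {obj}")

# Look up everything directly connected to one entity
print(list(G.successors("Marie Curie")))
```

This kind of traversal is what downstream consumers (search, recommendations) would actually run against the graph.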
Configuration & Production Optimization
To scale this system for production use, consider the following optimizations:
- Batch Processing: Instead of processing documents one by one, batch them to improve throughput.
- Asynchronous Processing: Use asynchronous programming techniques to handle multiple requests concurrently.
- GPU Acceleration: For large datasets and complex models, leverage GPU acceleration.
Batch Processing Example
def process_batch(batch):
    facts = [extract_facts(text) for text in batch]
    relationships = [process_facts(fact_list) for fact_list in facts]
    # Flatten the per-document relationships into one graph for the batch
    return build_knowledge_graph([rel for sublist in relationships for rel in sublist])

# Example usage with a list of documents
batch_documents = ["Document 1", "Document 2"]
G_batched = process_batch(batch_documents)
Asynchronous Processing Example
import asyncio
from concurrent.futures import ProcessPoolExecutor

async def async_process(document):
    loop = asyncio.get_running_loop()
    # Run the CPU-bound extraction in a worker process
    with ProcessPoolExecutor() as executor:
        facts = await loop.run_in_executor(executor, extract_facts, document)
    relationships = process_facts(facts)
    return build_knowledge_graph(relationships)

async def main(documents):
    return await asyncio.gather(*(async_process(doc) for doc in documents))

# Example usage
documents = ["Document 1", "Document 2"]
graphs = asyncio.run(main(documents))
Advanced Tips & Edge Cases (Deep Dive)
Error Handling
Ensure robust error handling to manage issues like network failures or model errors gracefully.
def safe_extract_facts(text):
    try:
        return extract_facts(text)
    except Exception as e:
        print(f"Error processing text: {e}")
        return []
Security Risks
Be cautious of prompt injection attacks where malicious inputs could manipulate the LLM's output. Sanitize input data thoroughly.
def sanitize_input(input_text):
    # Minimal example: drop non-printable characters and cap the length.
    # Note that sanitization alone does not fully prevent prompt injection;
    # treat model output as untrusted as well.
    cleaned = "".join(ch for ch in input_text if ch.isprintable() or ch.isspace())
    return cleaned[:10000]
Results & Next Steps
By following this tutorial, you have built a system capable of extracting key facts from documents and constructing a knowledge graph using state-of-the-art LLMs. The next steps could include:
- Scaling: Deploy the solution on cloud platforms like AWS or Google Cloud for handling larger datasets.
- Integration: Integrate with existing data pipelines to automate the extraction process.
- Enhancements: Explore more advanced NLP techniques and models to improve accuracy.
This system can serve as a foundational component in various applications, from intelligent search engines to personalized recommendation systems.