
How to Run Local LLMs on a Laptop with Ollama and LangChain


BlogIA Academy | June 8, 2026 | 10 min read | 1,962 words



The landscape of AI development is shifting. While cloud APIs from OpenAI and Anthropic [10] remain dominant, a growing cohort of engineers and researchers is moving inference back to local hardware. This isn't just a hobbyist trend; it is a response to real production constraints: data residency requirements, latency sensitivity, and the escalating costs of API calls at scale.

As of June 2026, running a 7-8 billion parameter model locally on a consumer laptop is not only feasible but practical for many retrieval-augmented generation (RAG) pipelines and agentic workflows. This tutorial walks you through building a local RAG system using Ollama for model serving and LangChain for orchestration. We will cover the architecture decisions, memory management, and edge cases that separate a demo from a deployable tool.

Why Local Inference Matters in Production

Before we write a single line of code, we must understand the trade-offs. The paper Foundations of GenIR (arXiv) gives broader background on generative models in information retrieval; the focus here is on the practical implications of running such models on everyday devices like laptops.

The Latency Argument: Cloud APIs introduce network latency and jitter. A typical round-trip to a US-based API endpoint from Europe adds 100-200ms before any inference begins. For interactive applications (chatbots, code assistants, real-time document analysis) this overhead is noticeable on every request. Local inference removes the network round-trip entirely, although generation speed then depends on your hardware.

The Privacy Argument: The paper Competing Visions of Ethical AI: A Case Study of OpenAI (ArXiv) highlights the ethical tensions in relying on third-party AI providers. When processing sensitive legal, medical, or financial documents, sending data to a remote server is a compliance risk. Local models ensure data never leaves the device.

The Cost Argument: API costs scale linearly with usage. A team processing 10 million tokens per day for a RAG pipeline can spend hundreds to thousands of dollars monthly, depending on the model's per-token price. Local inference has a fixed hardware cost and near-zero marginal cost per query (electricity aside).
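
To make the cost argument concrete, here is a back-of-the-envelope calculation. The price per million tokens below is a placeholder assumption, not a quoted rate; substitute your provider's current pricing.

```python
def monthly_api_cost(tokens_per_day: int, usd_per_million_tokens: float, days: int = 30) -> float:
    """Estimate monthly API spend for a steady daily token volume."""
    return tokens_per_day / 1_000_000 * usd_per_million_tokens * days


# Hypothetical blended price of $5 per million tokens (an assumption for illustration).
cost = monthly_api_cost(10_000_000, usd_per_million_tokens=5.0)
print(f"${cost:,.0f} per month")  # prints "$1,500 per month"
```

Because the relationship is linear, doubling either the volume or the price doubles the bill, which is why the break-even point against a one-time hardware purchase arrives sooner for high-volume pipelines.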

The trade-off is model quality. A 7-8B parameter model running on a laptop will not match GPT-4 or Claude 3.5 Sonnet [10] on complex reasoning tasks. But for summarization, classification, and simple Q&A over a known corpus, it is often sufficient. The key is knowing when to use local vs. cloud inference, a hybrid architecture we will return to at the end.

Prerequisites and Environment Setup

We will use the following stack:

  • Ollama (v0.5.x or later) for model management and inference
  • LangChain (v0.3.x) for chain construction and RAG pipelines
  • ChromaDB for vector storage (lightweight, runs in-process)
  • Python 3.11+ with pip and venv

Step 1: Install Ollama

Ollama is one of the simplest ways to run local LLMs. It handles model downloading, serves pre-quantized model builds, and uses GPU acceleration where available (Metal on macOS, CUDA on Linux/Windows).

# macOS (Homebrew)
brew install ollama

# Linux (curl)
curl -fsSL https://ollama.com/install.sh | sh

# Windows - download from https://ollama.com/download/windows

After installation, start the Ollama server:

ollama serve

This runs a local HTTP server on http://localhost:11434.
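
Before wiring up LangChain, it helps to confirm the server is actually reachable. The sketch below uses only the standard library and assumes the default address mentioned above; a running Ollama server is expected to answer a plain GET on its root URL, but treat that as an assumption to verify against your version. The function name is our own.

```python
import urllib.error
import urllib.request


def is_ollama_up(base_url: str = "http://localhost:11434", timeout: float = 2.0) -> bool:
    """Return True if an HTTP server answers with status 200 at base_url."""
    try:
        with urllib.request.urlopen(base_url, timeout=timeout) as response:
            return response.status == 200
    except (urllib.error.URLError, OSError):
        # Connection refused, timeout, DNS failure, or an HTTP error status.
        return False


if __name__ == "__main__":
    print("Ollama reachable:", is_ollama_up())
```

Calling this at the top of your script gives a clear error message early, instead of a stack trace from deep inside the chain on the first query.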

Step 2: Pull a Model

We will use llama3.1:8b (8 billion parameters, 4-bit quantized, ~4.7GB RAM). This is the sweet spot for laptops with 16GB RAM.

ollama pull llama3.1:8b

Edge case: If you have less than 16GB RAM, use llama3.2:3b (3B parameters, ~2GB RAM). The quality drop is noticeable but still functional for simple tasks.
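
The rule of thumb above can be written down as a tiny helper. This is a sketch: pick_model and is_pulled are hypothetical names, and the sample payload only imitates the shape of the JSON that Ollama's GET /api/tags endpoint returns for installed models.

```python
def pick_model(total_ram_gb: float) -> str:
    """Apply the tutorial's rule of thumb: 8B model from 16GB up, 3B model below."""
    return "llama3.1:8b" if total_ram_gb >= 16 else "llama3.2:3b"


def is_pulled(tags_payload: dict, model: str) -> bool:
    """Check a parsed /api/tags response for a model name."""
    return any(entry.get("name") == model for entry in tags_payload.get("models", []))


# Sample payload shaped like the response of GET http://localhost:11434/api/tags
sample = {"models": [{"name": "llama3.2:3b"}]}
choice = pick_model(total_ram_gb=8)
print(choice, "already pulled:", is_pulled(sample, choice))
```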

Step 3: Python Environment

python3 -m venv local-rag-env
source local-rag-env/bin/activate
pip install langchain langchain-community chromadb pypdf sentence-transformers

We use sentence-transformers for embedding [2] generation. The all-MiniLM-L6-v2 model is a good default: 384-dimensional embeddings, ~80MB RAM, runs on CPU.

Building the Local RAG Pipeline

Our system will:

  1. Load a PDF document
  2. Chunk it into overlapping segments
  3. Embed each chunk and store in ChromaDB
  4. On query, retrieve relevant chunks and feed them to the local LLM
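
Step 2 is worth seeing in isolation. The following is a minimal fixed-window sketch of overlapping chunking; it is not what RecursiveCharacterTextSplitter does internally (that splitter also prefers to break at separators such as paragraph boundaries), and chunk_text is a hypothetical helper.

```python
from typing import List


def chunk_text(text: str, chunk_size: int, overlap: int) -> List[str]:
    """Split text into fixed-size windows where consecutive windows share `overlap` characters."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be >= 0 and smaller than chunk_size")

    step = chunk_size - overlap
    chunks = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last window already reaches the end of the text
        start += step
    return chunks


print(chunk_text("abcdefghij", chunk_size=4, overlap=2))
# ['abcd', 'cdef', 'efgh', 'ghij']
```

The overlap means a sentence that straddles a chunk boundary still appears whole in at least one chunk, at the cost of storing some text twice.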

Architecture Decisions

Why ChromaDB over LanceDB or Qdrant? For a single-user laptop application, ChromaDB's in-process mode is ideal. It requires no separate server, persists to disk automatically, and has a simple API. LanceDB is faster for large-scale vector search (millions of vectors), but for a document corpus of 100-500 pages, ChromaDB is sufficient.
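
To see why a small corpus does not need a heavyweight search engine, consider what retrieval boils down to. The sketch below does brute-force top-k search with cosine similarity in plain Python. It is an illustration only: ChromaDB uses its own index and its configured distance metric, which may differ from cosine similarity.

```python
import math
from typing import List, Sequence, Tuple


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_k(query: Sequence[float], vectors: List[Sequence[float]], k: int) -> List[Tuple[int, float]]:
    """Score every stored vector against the query and return the k best (index, score) pairs."""
    scored = [(i, cosine_similarity(query, v)) for i, v in enumerate(vectors)]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


# Toy 2-dimensional "embeddings"; real ones have hundreds of dimensions.
stored = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
print(top_k([1.0, 0.0], stored, k=2))
```

A linear scan like this grows with the number of vectors, which is fine for a few hundred chunks and is the reason dedicated indexes start to matter at millions of vectors.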

Why sentence-transformers over OpenAI embeddings? Local embeddings keep the entire pipeline offline. The all-MiniLM-L6-v2 model produces embeddings that are 384-dimensional, which is smaller than OpenAI's 1536-dimensional embeddings. This means faster search and less RAM, at the cost of slightly lower retrieval accuracy.
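
The size difference is easy to quantify. This sketch counts raw vector storage only, assuming 4-byte float32 values and ignoring index overhead and metadata; the 100,000-vector corpus is an arbitrary illustration.

```python
def raw_vector_bytes(num_vectors: int, dimensions: int, bytes_per_value: int = 4) -> int:
    """Bytes needed to store the vectors themselves (float32 assumed by default)."""
    return num_vectors * dimensions * bytes_per_value


for dims in (384, 1536):
    megabytes = raw_vector_bytes(100_000, dims) / 1_000_000
    print(f"{dims} dims: {megabytes:.1f} MB for 100,000 vectors")
# 384 dims: 153.6 MB for 100,000 vectors
# 1536 dims: 614.4 MB for 100,000 vectors
```

The 4x ratio applies to similarity computations as well, since each comparison touches every dimension.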

Core Implementation

# local_rag.py
import os
from typing import List

from langchain_community.document_loaders import PyPDFLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain_community.embeddings import HuggingFaceEmbeddings
from langchain_community.vectorstores import Chroma
from langchain_community.llms import Ollama
from langchain.chains import RetrievalQA
from langchain.prompts import PromptTemplate
from langchain_core.documents import Document

# Configuration
PDF_PATH = "documents/technical_manual.pdf"
CHROMA_PERSIST_DIR = "./chroma_db"
MODEL_NAME = "llama3.1:8b"
EMBEDDING_MODEL = "all-MiniLM-L6-v2"
CHUNK_SIZE = 1000      # Maximum characters per chunk
CHUNK_OVERLAP = 200    # Characters shared between neighboring chunks
TOP_K_RESULTS = 4      # Chunks retrieved per query

def load_and_chunk_document(pdf_path: str) -> List[Document]:
    """
    Load a PDF and split into overlapping chunks.

    Edge case: PyPDFLoader handles encrypted PDFs poorly.
    If you encounter a PDF with restrictions, use pdfminer.six instead.
    """
    if not os.path.exists(pdf_path):
        raise FileNotFoundError(f"PDF not found: {pdf_path}")

    # PyPDFLoader returns one Document per page
    loader = PyPDFLoader(pdf_path)
    documents = loader.load()

    # RecursiveCharacterTextSplitter respects paragraph boundaries
    # better than simple character split
    text_splitter = RecursiveCharacterTextSplitter(
        chunk_size=CHUNK_SIZE,
        chunk_overlap=CHUNK_OVERLAP,
        separators=["\n\n", "\n", ".", " ", ""],
        length_function=len,
    )

    chunks = text_splitter.split_documents(documents)
    print(f"Loaded {len(documents)} pages, split into {len(chunks)} chunks")
    return chunks

def create_vector_store(chunks: List[Document]) -> Chroma:
    """
    Create or load a Chroma vector store.

    Memory note: Embedding 1000 chunks with all-MiniLM-L6-v2
    takes ~30 seconds on CPU and ~200MB RAM.
    """
    embeddings = HuggingFaceEmbeddings(
        model_name=EMBEDDING_MODEL,
        model_kwargs={'device': 'cpu'},
        encode_kwargs={'normalize_embeddings': False}
    )

    # Check if persistent store exists
    if os.path.exists(CHROMA_PERSIST_DIR):
        print("Loading existing vector store..")
        vector_store = Chroma(
            persist_directory=CHROMA_PERSIST_DIR,
            embedding_function=embeddings
        )
    else:
        print("Creating new vector store..")
        vector_store = Chroma.from_documents(
            documents=chunks,
            embedding=embeddings,
            persist_directory=CHROMA_PERSIST_DIR
        )
        # Chroma 0.4+ persists automatically when persist_directory is set,
        # so no explicit persist() call is needed here.

    return vector_store

def build_qa_chain(vector_store: Chroma) -> RetrievalQA:
    """
    Build a RetrievalQA chain with a custom prompt.

    The prompt is critical: local models are more sensitive to
    prompt formatting than GPT-4. Use clear instructions.
    """
    llm = Ollama(
        model=MODEL_NAME,
        temperature=0.1,  # Low temperature for factual answers
        num_predict=512,  # Max tokens in response
        top_k=40,
        top_p=0.9,
        repeat_penalty=1.1,  # Prevent repetition in local models
    )

    # Custom prompt template
    template = """You are a technical assistant. Use the following context to answer the question.
If you don't know the answer, say "I don't have enough information to answer that."
Do not make up information.

Context:
{context}

Question: {question}

Answer:"""

    prompt = PromptTemplate(
        template=template,
        input_variables=["context", "question"]
    )

    retriever = vector_store.as_retriever(
        search_type="similarity",
        search_kwargs={"k": TOP_K_RESULTS}
    )

    chain = RetrievalQA.from_chain_type(
        llm=llm,
        chain_type="stuff",  # "stuff" puts all context into one prompt
        retriever=retriever,
        chain_type_kwargs={"prompt": prompt},
        return_source_documents=True,  # Useful for debugging
    )

    return chain

def main():
    # Step 1: Load and chunk
    chunks = load_and_chunk_document(PDF_PATH)

    # Step 2: Create vector store
    vector_store = create_vector_store(chunks)

    # Step 3: Build QA chain
    qa_chain = build_qa_chain(vector_store)

    # Step 4: Interactive loop
    print("\nLocal RAG system ready. Type 'quit' to exit.")
    while True:
        query = input("\nQuestion: ").strip()
        if query.lower() in ['quit', 'exit', 'q']:
            break

        try:
            result = qa_chain.invoke({"query": query})
            print(f"\nAnswer: {result['result']}")
            print(f"\nSources: {len(result['source_documents'])} chunks retrieved")
        except Exception as e:
            print(f"Error: {e}")

if __name__ == "__main__":
    main()
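
To demystify the "stuff" chain type used in build_qa_chain: it concatenates the retrieved chunks into one context string and fills the prompt template with it. The sketch below imitates that with plain string formatting. It is an illustration rather than LangChain's implementation, the chunk strings are made-up placeholders, and the separator is an assumption.

```python
from typing import List

TEMPLATE = (
    "Context:\n{context}\n\n"
    "Question: {question}\n\n"
    "Answer:"
)


def build_stuffed_prompt(chunks: List[str], question: str, separator: str = "\n\n") -> str:
    """Join all retrieved chunks into one context block and fill the template."""
    context = separator.join(chunks)
    return TEMPLATE.format(context=context, question=question)


# Placeholder chunk texts standing in for retrieved document segments.
prompt = build_stuffed_prompt(["first chunk text", "second chunk text"], "What does the manual say?")
print(prompt)
print(f"Prompt length: {len(prompt)} characters")
```

With TOP_K_RESULTS = 4 and CHUNK_SIZE = 1000, the context alone can approach 4,000 characters, so it is worth checking that the prompt fits the model's context window (the num_ctx setting in Ollama) before raising either value.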

Running the System

python local_rag.py

Expected output:

Loaded 42 pages, split into 187 chunks
Creating new vector store..
Local RAG system ready. Type 'quit' to exit.

Question: What is the maximum operating temperature?
Answer: According to the technical manual, the maximum operating temperature is 85°C.
Sources: 4 chunks retrieved
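
One caveat with the script as written: create_vector_store reuses ./chroma_db whenever the directory exists, so editing or replacing the PDF will not trigger re-indexing. A simple way to detect a stale index is to store a hash of the source file next to it. The helper names and the marker-file approach below are our own assumptions, not part of LangChain or ChromaDB.

```python
import hashlib
from pathlib import Path


def file_fingerprint(path: str) -> str:
    """Return the SHA-256 hex digest of a file, read in blocks to limit memory use."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for block in iter(lambda: handle.read(65536), b""):
            digest.update(block)
    return digest.hexdigest()


def needs_reindex(pdf_path: str, marker_path: str) -> bool:
    """True if no fingerprint was recorded yet or the PDF changed since it was recorded."""
    marker = Path(marker_path)
    if not marker.exists():
        return True
    return marker.read_text().strip() != file_fingerprint(pdf_path)


def record_index(pdf_path: str, marker_path: str) -> None:
    """Store the current fingerprint after a successful indexing run."""
    Path(marker_path).write_text(file_fingerprint(pdf_path))
```

In main(), you could call needs_reindex before create_vector_store, delete the old persist directory when it returns True, and call record_index once the new store is built.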

Handling Edge Cases and Production Concerns

Memory Management

Local LLMs are memory-hungry. The llama3.1:8b model uses approximately 4.7GB of RAM in 4-bit quantization. Combined with ChromaDB (another 200-500MB for embeddings) and the Python runtime, you need at least 8GB free RAM.

If you hit memory limits:

  1. Use a smaller model: ollama pull llama3.2:3b (~2GB RAM)
  2. Reduce chunk count: Increase CHUNK_SIZE to 2000 and reduce overlap, which stores fewer vectors at the cost of coarser retrieval
  3. Use memory-mapped embeddings for large collections, if your ChromaDB version supports them
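
To estimate how much the second option helps, you can approximate the chunk count from document length. The formula below assumes fixed-size sliding windows; RecursiveCharacterTextSplitter breaks at separators, so real counts will differ somewhat. The 100,000-character document is an arbitrary example.

```python
import math


def estimate_chunk_count(total_chars: int, chunk_size: int, overlap: int) -> int:
    """Approximate number of fixed-size windows needed to cover a text."""
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be >= 0 and smaller than chunk_size")
    if total_chars <= 0:
        return 0
    if total_chars <= chunk_size:
        return 1
    step = chunk_size - overlap  # how far each new window advances
    return math.ceil((total_chars - chunk_size) / step) + 1


for size, overlap in ((1000, 200), (2000, 200)):
    count = estimate_chunk_count(100_000, size, overlap)
    print(f"chunk_size={size}, overlap={overlap}: ~{count} chunks")
# chunk_size=1000, overlap=200: ~125 chunks
# chunk_size=2000, overlap=200: ~56 chunks
```

Fewer chunks mean fewer embeddings to store, but each retrieved chunk is longer, so the prompt sent to the model grows unless you also lower TOP_K_RESULTS.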

API Rate Limits and Timeouts

Ollama's local API has no rate limits, but it can become slow or unresponsive under heavy load, since every request competes for the same CPU, GPU, and memory. The num_predict parameter in our code caps response length, which prevents runaway generations from tying up the server.

Timeout handling:

# Add timeout to Ollama calls
llm = Ollama(
    model=MODEL_NAME,
    temperature=0.1,
    num_predict=512,
    timeout=30,  # Seconds before timeout
)
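
A timeout alone only turns a hang into an exception. If transient overload is the cause, a retry with exponential backoff often recovers. The wrapper below is a generic sketch, not a LangChain feature; call_with_retries is a hypothetical helper, and it retries on any exception, which you may want to narrow in real code.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def call_with_retries(
    fn: Callable[[], T],
    attempts: int = 3,
    base_delay: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call fn, retrying on failure with delays of base_delay, 2x, 4x, ..."""
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    for attempt in range(attempts):
        try:
            return fn()
        except Exception:
            if attempt == attempts - 1:
                raise  # out of retries: surface the last error
            sleep(base_delay * (2 ** attempt))
    raise RuntimeError("unreachable")


# Usage with the chain from local_rag.py (not executed here):
# result = call_with_retries(lambda: qa_chain.invoke({"query": query}))
```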

Cold Start Problem

The first inference after starting Ollama is slow (5-15 seconds) because the model must be loaded into memory. Subsequent calls are faster (1-3 seconds per response).

Mitigation: Keep Ollama running as a background service. On macOS, use launchctl to auto-start it. On Linux, use systemd.
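
A complementary option is to warm the model up explicitly when your application starts. The sketch below relies on Ollama's documented behavior of loading a model when /api/generate receives a request without a prompt, and on the keep_alive field to control how long it stays in memory. Verify both against your Ollama version; the helper names and the 30-minute value are our own choices.

```python
import json
import urllib.request


def build_warmup_request(
    model: str,
    keep_alive: str = "30m",
    base_url: str = "http://localhost:11434",
) -> urllib.request.Request:
    """Build a POST to /api/generate with no prompt, which asks Ollama to load the model."""
    payload = {"model": model, "keep_alive": keep_alive}
    return urllib.request.Request(
        f"{base_url}/api/generate",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def warm_up(model: str, timeout: float = 120.0) -> bool:
    """Send the warm-up request; return False instead of raising if the server is unavailable."""
    try:
        with urllib.request.urlopen(build_warmup_request(model), timeout=timeout) as response:
            return response.status == 200
    except OSError:
        return False
```

Calling warm_up("llama3.1:8b") once at startup moves the 5-15 second load out of the user's first query.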

Document Quality Issues

PDFs with scanned images, complex tables, or non-standard fonts will produce garbage text, and a RAG pipeline fed garbage chunks will return unreliable answers regardless of model quality. In our context, this means:

  1. Scanned PDFs: Use OCR (Tesseract + pytesseract) before loading
  2. Tables: Extract with camelot-py or tabula-py
  3. Multi-column layouts: PyPDFLoader may read columns in wrong order. Use pdfplumber for better layout parsing
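
It is cheaper to catch bad extraction before embedding than to debug strange answers later. The heuristic below flags pages that are nearly empty or dominated by non-text characters, which is typical for scanned pages and broken font encodings. The function names and both thresholds are assumptions to tune on your own documents.

```python
def text_ratio(text: str) -> float:
    """Fraction of characters that are letters, digits, or whitespace."""
    if not text:
        return 0.0
    good = sum(1 for ch in text if ch.isalnum() or ch.isspace())
    return good / len(text)


def looks_extractable(page_text: str, min_chars: int = 50, min_ratio: float = 0.8) -> bool:
    """Heuristic check that a page yielded enough plausible text to be worth indexing."""
    stripped = page_text.strip()
    return len(stripped) >= min_chars and text_ratio(stripped) >= min_ratio


print(looks_extractable("The maximum operating temperature is listed in section four of the manual."))
print(looks_extractable(""))
```

In load_and_chunk_document, you could run this on each page's page_content and route failing pages to OCR instead of silently indexing them.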

What's Next

This local RAG system is functional but basic. To move toward production readiness:

  1. Hybrid Search: Combine vector similarity with keyword search (BM25) for better retrieval on rare terms. LangChain supports this via EnsembleRetriever.

  2. Streaming Responses: Replace RetrievalQA with an LCEL chain (retriever, prompt, and LLM composed with the | operator) rather than the legacy LLMChain, and call its stream() method for real-time token generation.

  3. Multi-Document Support: Extend the system to index an entire directory of PDFs, with metadata tracking for source attribution.

  4. Evaluation Pipeline: Implement automated testing with a held-out set of Q&A pairs. Use ragas (RAG Assessment) to measure faithfulness, answer relevance, and context precision.

  5. Model Quantization: Experiment with GGUF quantizations (Q4_K_M, Q5_K_M) to balance quality and memory usage. Ollama picks a default quantization for each model tag, but many models in the Ollama library also publish tags for specific quantization levels that you can pull explicitly.
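
For the hybrid search item, the core problem is merging two ranked lists whose scores are not comparable. Reciprocal rank fusion is one common answer: each document earns 1 / (c + rank) from every list it appears in. The sketch below is a plain-Python illustration with made-up chunk IDs; check the LangChain documentation for how EnsembleRetriever weights and fuses results in your version.

```python
from collections import defaultdict
from typing import Dict, List


def reciprocal_rank_fusion(rankings: List[List[str]], c: int = 60) -> List[str]:
    """Merge ranked ID lists; documents ranked highly in several lists rise to the top."""
    scores: Dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (c + rank)
    return sorted(scores, key=lambda doc_id: scores[doc_id], reverse=True)


# Hypothetical results from a vector retriever and a BM25 keyword retriever.
vector_hits = ["chunk_a", "chunk_b", "chunk_c"]
keyword_hits = ["chunk_c", "chunk_a", "chunk_d"]
print(reciprocal_rank_fusion([vector_hits, keyword_hits]))
# ['chunk_a', 'chunk_c', 'chunk_b', 'chunk_d']
```

Because only ranks are used, the method needs no score normalization between the two retrievers.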

The future of AI development is not purely cloud-based or purely local—it is hybrid. Systems that intelligently route simple queries to local models and complex reasoning to cloud APIs will dominate. This tutorial gives you the foundation to build that hybrid architecture. Start with local, measure your failure modes, and escalate to cloud only when necessary.
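
As a starting point for that routing layer, even a crude heuristic is useful because it gives you something to measure against. The sketch below sends long or reasoning-heavy questions to the cloud and everything else to the local model. The marker words and the word limit are hypothetical placeholders; a real router should be tuned on your own logged failure cases.

```python
REASONING_MARKERS = {"compare", "why", "prove", "derive"}  # hypothetical trigger words


def route_query(query: str, max_local_words: int = 40) -> str:
    """Return 'local' or 'cloud' based on query length and simple keyword cues."""
    words = [word.strip("?.,!:;").lower() for word in query.split()]
    if len(words) > max_local_words:
        return "cloud"
    if any(word in REASONING_MARKERS for word in words):
        return "cloud"
    return "local"


print(route_query("What is the maximum operating temperature?"))        # local
print(route_query("Compare the two cooling designs and explain why."))  # cloud
```

For documents that must stay on the device, the privacy argument from earlier still applies: route those queries locally regardless of what the heuristic says.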


References

1. Anthropic. Wikipedia.
2. Embedding. Wikipedia.
3. Llama. Wikipedia.
4. Observation of the rare $B^0_s\to\mu^+\mu^-$ decay from the comb. arXiv.
5. Expected Performance of the ATLAS Experiment - Detector, Tri. arXiv.
6. anthropics/anthropic-sdk-python. GitHub.
7. fighting41love/funNLP. GitHub.
8. meta-llama/llama. GitHub.
9. huggingface/transformers. GitHub.
10. Anthropic Claude Pricing. Anthropic.