How to Build an LLM from Scratch with PyTorch

Building a large language model from scratch sounds like a task reserved for FAANG research teams with unlimited GPU budgets. But the reality is more accessible than most engineers assume. The repository "LLMs-from-scratch" by Sebastian Raschka has accumulated 87,799 stars and 13,374 forks on GitHub as of mid-2026, and it demonstrates exactly how to implement a ChatGPT-like LLM in PyTorch [4] from scratch, step by step. This tutorial walks through that implementation with production-grade code, real architecture decisions, and the edge cases that will break your model if you ignore them.

Why Build Instead of Fine-Tune

Most teams reach for Hugging Face transformers and call it a day. That works for 80% of use cases. But when you need to understand exactly how attention mechanisms propagate gradients, or when you're deploying to edge devices with custom hardware constraints, a black-box approach fails. Building from scratch gives you control over every parameter, every activation function, and every memory allocation.

The LLMs-from-scratch project is written entirely in Jupyter Notebook format, which makes it ideal for experimentation. But for production deployment, you'll want to extract the core components into standalone Python modules. This tutorial does exactly that.

Prerequisites and Environment Setup

You need a machine with at least 16GB of RAM for the small-scale models we'll build. GPU access is optional for training the tiny versions, but you'll want at least an RTX 3090 or A10G for anything above 1 billion parameters.

# Create a clean environment
python -m venv llm_scratch
source llm_scratch/bin/activate

# Core dependencies
pip install torch==2.3.0 torchvision==0.18.0 --index-url https://download.pytorch.org/whl/cu118
pip install numpy==1.26.4 tiktoken==0.7.0 wandb==0.17.0
pip install datasets==2.19.0 transformers==4.41.0

# For profiling and debugging
pip install torchinfo==1.8.0 memory-profiler==0.61.0

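The tiktoken library is OpenAI [8]'s fast BPE tokenizer implementation. We use it instead of Hugging Face's tokenizers because it ships the GPT-2 vocabulary ready to use, encodes quickly, and pulls in only a small number of dependencies. Note that tiktoken encodes and decodes with existing vocabularies; it is not designed for training a new one.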

Core Architecture: The GPT-Style Decoder-Only Model

The model architecture follows the standard decoder-only transformer pattern used by GPT-2, GPT-3, and most modern LLMs. Here's the complete implementation with every component explained.

import torch
import torch.nn as nn
import torch.nn.functional as F
import math

class MultiHeadAttention(nn.Module):
    """
    Scaled dot-product attention with multiple heads.
    This is the core mechanism that allows the model to attend to different
    positions in the input sequence simultaneously.
    """
    def __init__(self, d_model: int, n_heads: int, dropout: float = 0.1):
        super().__init__()
        assert d_model % n_heads == 0, "d_model must be divisible by n_heads"

        self.d_model = d_model
        self.n_heads = n_heads
        self.d_k = d_model // n_heads  # dimension per head

        # Separate bias-free linear projections for Q, K, V, plus the output projection
        self.w_q = nn.Linear(d_model, d_model, bias=False)
        self.w_k = nn.Linear(d_model, d_model, bias=False)
        self.w_v = nn.Linear(d_model, d_model, bias=False)
        self.w_o = nn.Linear(d_model, d_model, bias=False)

        self.dropout = nn.Dropout(dropout)

    def forward(self, x: torch.Tensor, mask: torch.Tensor = None) -> torch.Tensor:
        batch_size, seq_len, _ = x.shape

        # Project and reshape: (batch, seq, d_model) -> (batch, n_heads, seq, d_k)
        Q = self.w_q(x).view(batch_size, seq_len, self.n_heads, self.d_k).transpose(1, 2)
        K = self.w_k(x).view(batch_size, seq_len, self.n_heads, self.d_k).transpose(1, 2)
        V = self.w_v(x).view(batch_size, seq_len, self.n_heads, self.d_k).transpose(1, 2)

        # Scaled dot-product attention
        # Q * K^T / sqrt(d_k) gives attention scores
        scores = torch.matmul(Q, K.transpose(-2, -1)) / math.sqrt(self.d_k)

        # Apply causal mask to prevent attending to future tokens
        if mask is not None:
            scores = scores.masked_fill(mask == 0, float('-inf'))

        # Softmax over the last dimension (attention over sequence positions)
        attn_weights = F.softmax(scores, dim=-1)
        attn_weights = self.dropout(attn_weights)

        # Weighted sum of values
        context = torch.matmul(attn_weights, V)

        # Reshape back: (batch, n_heads, seq, d_k) -> (batch, seq, d_model)
        context = context.transpose(1, 2).contiguous().view(batch_size, seq_len, self.d_model)

        # Final projection
        output = self.w_o(context)
        return output

class FeedForward(nn.Module):
    """
    Position-wise feed-forward network with GELU activation.
    The expansion factor of 4 is standard for GPT-style models.
    """
    def __init__(self, d_model: int, d_ff: int = None, dropout: float = 0.1):
        super().__init__()
        d_ff = d_ff or 4 * d_model  # default expansion factor

        self.linear1 = nn.Linear(d_model, d_ff)
        self.linear2 = nn.Linear(d_ff, d_model)
        self.dropout = nn.Dropout(dropout)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # GELU activation - smoother than ReLU, helps with gradient flow
        return self.linear2(self.dropout(F.gelu(self.linear1(x))))

class TransformerBlock(nn.Module):
    """
    Single transformer block with pre-norm architecture.
    Pre-norm (LayerNorm before attention/FFN) is more stable during training
    than post-norm, especially for deeper models.
    """
    def __init__(self, d_model: int, n_heads: int, dropout: float = 0.1):
        super().__init__()
        self.attention = MultiHeadAttention(d_model, n_heads, dropout)
        self.feed_forward = FeedForward(d_model, dropout=dropout)
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)
        self.dropout = nn.Dropout(dropout)

    def forward(self, x: torch.Tensor, mask: torch.Tensor = None) -> torch.Tensor:
        # Pre-norm: normalize before attention, then residual connection
        attn_output = self.attention(self.norm1(x), mask)
        x = x + self.dropout(attn_output)

        # Pre-norm: normalize before FFN, then residual connection
        ff_output = self.feed_forward(self.norm2(x))
        x = x + self.dropout(ff_output)

        return x

class GPTModel(nn.Module):
    """
    Full GPT-style decoder-only transformer model.

    Architecture decisions:
    - Pre-norm: Better training stability for deep models
    - GELU activation: Smoother gradients than ReLU
    - No bias in QKV projections: Reduces parameters, typically with little or no accuracy loss
    - Learnable positional embeddings: Simpler than RoPE for small models
    """
    def __init__(
        self,
        vocab_size: int = 50257,  # GPT-2 vocabulary size
        d_model: int = 768,       # Embedding dimension
        n_heads: int = 12,        # Number of attention heads
        n_layers: int = 12,       # Number of transformer blocks
        max_seq_len: int = 1024,  # Maximum sequence length
        dropout: float = 0.1
    ):
        super().__init__()

        self.token_embedding = nn.Embedding(vocab_size, d_model)
        self.position_embedding = nn.Embedding(max_seq_len, d_model)
        self.dropout = nn.Dropout(dropout)

        # Stack of transformer blocks
        self.blocks = nn.ModuleList([
            TransformerBlock(d_model, n_heads, dropout)
            for _ in range(n_layers)
        ])

        self.final_norm = nn.LayerNorm(d_model)
        self.lm_head = nn.Linear(d_model, vocab_size, bias=False)

        # Tie weights: share embedding and output projection
        # This reduces parameters and improves training stability
        self.token_embedding.weight = self.lm_head.weight

        # Initialize weights
        self._init_weights()

    def _init_weights(self):
        """Initialize weights with GPT-2-style normal(0, 0.02) values (without GPT-2's extra residual scaling)."""
        for module in self.modules():
            if isinstance(module, nn.Linear):
                # Small std for stability
                torch.nn.init.normal_(module.weight, mean=0.0, std=0.02)
                if module.bias is not None:
                    torch.nn.init.zeros_(module.bias)
            elif isinstance(module, nn.Embedding):
                torch.nn.init.normal_(module.weight, mean=0.0, std=0.02)
            elif isinstance(module, nn.LayerNorm):
                torch.nn.init.ones_(module.weight)
                torch.nn.init.zeros_(module.bias)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        batch_size, seq_len = x.shape

        # Create causal mask directly on the input's device (avoids a CPU-to-GPU copy on every forward pass)
        mask = torch.tril(torch.ones(seq_len, seq_len, device=x.device)).view(1, 1, seq_len, seq_len)

        # Token + position embeddings
        positions = torch.arange(0, seq_len, device=x.device).unsqueeze(0)
        x = self.token_embedding(x) + self.position_embedding(positions)
        x = self.dropout(x)

        # Pass through transformer blocks
        for block in self.blocks:
            x = block(x, mask)

        # Final normalization and projection to vocabulary
        x = self.final_norm(x)
        logits = self.lm_head(x)

        return logits
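
To see what the causal mask does to the attention weights, here is a minimal sketch on random tensors. The sizes (seq_len=4, d_k=8), the seed, and the variable names are arbitrary choices for illustration and are not taken from the model configuration.

```python
import math
import torch
import torch.nn.functional as F

torch.manual_seed(0)
seq_len, d_k = 4, 8  # toy sizes, chosen only for illustration

# One batch, one head: shape (1, 1, seq_len, d_k)
Q = torch.randn(1, 1, seq_len, d_k)
K = torch.randn(1, 1, seq_len, d_k)

scores = torch.matmul(Q, K.transpose(-2, -1)) / math.sqrt(d_k)

# Lower-triangular mask: position i may attend to positions 0..i only
mask = torch.tril(torch.ones(seq_len, seq_len)).view(1, 1, seq_len, seq_len)
scores = scores.masked_fill(mask == 0, float("-inf"))

weights = F.softmax(scores, dim=-1)
print(weights[0, 0])
```

Row i has non-zero weights only in columns 0..i, and every row still sums to 1, so the first token can attend only to itself.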

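The weight tying between the token embedding and the output projection is an important optimization. With a 50,257-token vocabulary and 768-dimensional embeddings, the shared matrix holds about 38.6 million parameters, so tying cuts the total from roughly 163 million to roughly 124 million (about 24%) for the 12-layer configuration above. It also forces the model to learn consistent representations for input and output tokens.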
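
The effect of weight tying on model size can be checked with plain arithmetic. The sketch below counts parameters for the layer layout defined in this tutorial (bias-free attention projections, biased feed-forward layers, LayerNorm with weight and bias); the function name count_gpt_params is a hypothetical helper, not part of any library.

```python
def count_gpt_params(vocab_size=50257, d_model=768, n_layers=12,
                     max_seq_len=1024, tie_weights=True):
    """Count parameters for the layer layout used in this tutorial."""
    d_ff = 4 * d_model
    attention = 4 * d_model * d_model  # w_q, w_k, w_v, w_o (no biases)
    feed_forward = (d_model * d_ff + d_ff) + (d_ff * d_model + d_model)  # two Linear layers with biases
    layer_norms = 2 * (2 * d_model)    # norm1 and norm2: weight + bias each
    per_block = attention + feed_forward + layer_norms

    token_embedding = vocab_size * d_model
    position_embedding = max_seq_len * d_model
    final_norm = 2 * d_model
    # When tied, lm_head reuses the token embedding matrix and adds nothing
    lm_head = 0 if tie_weights else vocab_size * d_model

    return (n_layers * per_block + token_embedding
            + position_embedding + final_norm + lm_head)

tied = count_gpt_params(tie_weights=True)
untied = count_gpt_params(tie_weights=False)
saving = (untied - tied) / untied
print(f"tied: {tied:,}  untied: {untied:,}  saving: {saving:.1%}")
```

The saving shrinks as the model gets deeper, because the transformer blocks grow while the embedding matrix stays the same size.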

Training Loop with Gradient Accumulation

Training a language model from scratch requires careful management of memory and gradient flow. Here's a production-ready training loop that handles the common failure modes.

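import math  # used by the learning-rate schedule
import torch
import torch.nn.functional as F  # used for the cross-entropy loss
from torch.utils.data import DataLoader, Dataset
from torch.optim import AdamW
import wandb
import tiktoken
from datasets import load_dataset
# GPTModel is the class defined in the architecture section above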

class TextDataset(Dataset):
    """
    Sliding-window dataset that tokenizes all texts up front and keeps the token IDs in memory.
    For production, pre-tokenize and save to disk.
    """
    def __init__(self, texts: list, max_length: int = 1024, stride: int = 512):
        self.enc = tiktoken.get_encoding("gpt2")
        self.max_length = max_length
        self.stride = stride

        # Tokenize all texts and concatenate
        self.tokens = []
        for text in texts:
            tokens = self.enc.encode(text)
            self.tokens.extend(tokens)

        # Calculate number of samples; each one needs max_length + 1 tokens (inputs plus shifted targets)
        self.n_samples = max(0, (len(self.tokens) - max_length - 1) // stride + 1)

    def __len__(self):
        return self.n_samples

    def __getitem__(self, idx):
        start = idx * self.stride
        end = start + self.max_length + 1  # +1 for target shift

        chunk = self.tokens[start:end]
        x = torch.tensor(chunk[:-1], dtype=torch.long)
        y = torch.tensor(chunk[1:], dtype=torch.long)

        return x, y

def train_model(
    model: GPTModel,
    train_dataset: TextDataset,
    val_dataset: TextDataset,
    batch_size: int = 4,
    grad_accum_steps: int = 8,  # Effective batch size = batch_size * grad_accum_steps
    learning_rate: float = 3e-4,
    max_epochs: int = 3,
    device: str = "cuda" if torch.cuda.is_available() else "cpu"
):
    model = model.to(device)
    optimizer = AdamW(model.parameters(), lr=learning_rate, weight_decay=0.1)

    # Cosine schedule with linear warmup
    total_steps = max(1, len(train_dataset) * max_epochs // (batch_size * grad_accum_steps))
    warmup_steps = int(0.01 * total_steps)  # 1% warmup

    def lr_lambda(step):
        if step < warmup_steps:
            return step / warmup_steps
        # max(1, ...) guards against division by zero on very short runs
        progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
        return 0.5 * (1 + math.cos(math.pi * min(1.0, progress)))

    scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, lr_lambda)

    train_loader = DataLoader(
        train_dataset, 
        batch_size=batch_size, 
        shuffle=True,
        num_workers=4,
        pin_memory=True
    )

    val_loader = DataLoader(
        val_dataset,
        batch_size=batch_size,
        shuffle=False,
        num_workers=4,
        pin_memory=True
    )

    # Initialize wandb for tracking
    wandb.init(project="llm-from-scratch", config={
        "model_size": sum(p.numel() for p in model.parameters()),
        "batch_size": batch_size,
        "grad_accum_steps": grad_accum_steps,
        "learning_rate": learning_rate,
        "total_steps": total_steps
    })

    global_step = 0
    best_val_loss = float('inf')

    for epoch in range(max_epochs):
        model.train()
        total_loss = 0
        optimizer.zero_grad()

        for batch_idx, (x, y) in enumerate(train_loader):
            x, y = x.to(device), y.to(device)

            logits = model(x)
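            loss = F.cross_entropy(
                logits.view(-1, logits.size(-1)),
                y.view(-1)  # no ignore_index: the dataset has no padding, and 50256 is GPT-2's end-of-text token
            )
            total_loss += loss.item()  # accumulate the unscaled loss for the epoch average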

            # Scale loss for gradient accumulation
            loss = loss / grad_accum_steps
            loss.backward()

            if (batch_idx + 1) % grad_accum_steps == 0:
                # Gradient clipping to prevent exploding gradients;
                # clip_grad_norm_ returns the total norm measured before clipping
                grad_norm = torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)

                optimizer.step()
                scheduler.step()
                optimizer.zero_grad()

                global_step += 1

                # Logging (the norm must be captured before zero_grad clears the gradients)
                if global_step % 10 == 0:
                    wandb.log({
                        "train_loss": loss.item() * grad_accum_steps,
                        "learning_rate": scheduler.get_last_lr()[0],
                        "gradient_norm": grad_norm.item()
                    })

        # Validation at end of epoch
        model.eval()
        val_loss = 0
        with torch.no_grad():
            for x, y in val_loader:
                x, y = x.to(device), y.to(device)
                logits = model(x)
                loss = F.cross_entropy(
                    logits.view(-1, logits.size(-1)),
                    y.view(-1)  # no padding in this dataset, so no ignore_index
                )
                val_loss += loss.item()

        val_loss /= len(val_loader)
        wandb.log({"val_loss": val_loss, "epoch": epoch})

        # Save best model
        if val_loss < best_val_loss:
            best_val_loss = val_loss
            torch.save({
                'epoch': epoch,
                'model_state_dict': model.state_dict(),
                'optimizer_state_dict': optimizer.state_dict(),
                'val_loss': val_loss,
            }, 'best_model.pt')

        print(f"Epoch {epoch}: Train Loss: {total_loss/len(train_loader):.4f}, Val Loss: {val_loss:.4f}")

    wandb.finish()
    return model

The gradient accumulation parameter is critical for production training. With a batch size of 4 and 8 accumulation steps, you get an effective batch size of 32 while holding activations for only 4 sequences in memory at a time. This matters because many LLM training runs use batch sizes of 512 sequences or more, which rarely fits on a single GPU without accumulation.
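
The claim that accumulation reproduces a larger batch can be verified directly. The following sketch uses a toy linear classifier and random data (both assumptions made for illustration) and compares the gradient from one batch of 32 against 8 accumulated micro-batches of 4.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)
x = torch.randn(32, 16)          # 32 toy samples with 16 features
y = torch.randint(0, 5, (32,))   # 5 toy classes

def fresh_model() -> nn.Linear:
    torch.manual_seed(1)         # same initial weights on every call
    return nn.Linear(16, 5)

# Reference: one backward pass over the full batch of 32
full = fresh_model()
F.cross_entropy(full(x), y).backward()

# Accumulated: 8 micro-batches of 4, each loss scaled by 1/8
accum = fresh_model()
grad_accum_steps = 8
for xb, yb in zip(x.chunk(grad_accum_steps), y.chunk(grad_accum_steps)):
    loss = F.cross_entropy(accum(xb), yb) / grad_accum_steps
    loss.backward()              # gradients add up in .grad across calls

max_diff = (full.weight.grad - accum.weight.grad).abs().max().item()
print(f"max gradient difference: {max_diff:.2e}")
```

The match is exact up to floating-point error only when the micro-batches have equal size and the loss is a mean over samples. Layers with per-batch randomness or statistics, such as dropout, make the two runs differ slightly.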

Text Generation with Temperature and Top-k Sampling

Once trained, the model needs a generation function that handles the stochastic nature of language generation. Here's a production-ready implementation.

@torch.no_grad()
def generate_text(
    model: GPTModel,
    prompt: str,
    max_new_tokens: int = 100,
    temperature: float = 0.7,
    top_k: int = 50,
    top_p: float = 0.9,
    device: str = "cuda" if torch.cuda.is_available() else "cpu"
) -> str:
    """
    Generate text with temperature scaling, top-k filtering, and nucleus sampling.

    Temperature: Controls randomness. Lower = more deterministic.
    Top-k: Only sample from the k most likely tokens.
    Top-p (nucleus): Only sample from tokens with cumulative probability p.
    """
    model.eval()
    enc = tiktoken.get_encoding("gpt2")

    # Encode prompt
    input_ids = enc.encode(prompt)
    input_tensor = torch.tensor([input_ids], dtype=torch.long).to(device)

    for _ in range(max_new_tokens):
        # Truncate to max sequence length
        if input_tensor.size(1) > model.position_embedding.num_embeddings:
            input_tensor = input_tensor[:, -model.position_embedding.num_embeddings:]

        # Forward pass
        logits = model(input_tensor)
        next_token_logits = logits[:, -1, :] / temperature

        # Top-k filtering
        if top_k > 0:
            top_k_values, _ = torch.topk(next_token_logits, top_k, dim=-1)
            min_top_k = top_k_values[:, -1].unsqueeze(-1)
            next_token_logits[next_token_logits < min_top_k] = float('-inf')

        # Top-p (nucleus) filtering
        if top_p < 1.0:
            sorted_logits, sorted_indices = torch.sort(next_token_logits, descending=True)
            cumulative_probs = torch.cumsum(F.softmax(sorted_logits, dim=-1), dim=-1)

            # Remove tokens with cumulative probability above threshold
            sorted_indices_to_remove = cumulative_probs > top_p
            sorted_indices_to_remove[:, 1:] = sorted_indices_to_remove[:, :-1].clone()
            sorted_indices_to_remove[:, 0] = False

            indices_to_remove = sorted_indices_to_remove.scatter(
                1, sorted_indices, sorted_indices_to_remove
            )
            next_token_logits[indices_to_remove] = float('-inf')

        # Sample from filtered distribution
        probs = F.softmax(next_token_logits, dim=-1)
        next_token = torch.multinomial(probs, num_samples=1)

        # Append to sequence
        input_tensor = torch.cat([input_tensor, next_token], dim=-1)

        # Stop if we generate the end-of-text token
        if next_token.item() == 50256:  # EOS token for GPT-2
            break

    # Decode
    generated_ids = input_tensor[0].tolist()
    generated_text = enc.decode(generated_ids)

    return generated_text

Combining top-k and top-p filtering removes the long tail of very unlikely tokens while preserving diversity among the plausible ones. Without these filters, sampling from the full distribution occasionally picks a low-probability token that derails the rest of the output. The temperature parameter controls the creativity vs. determinism tradeoff: values below 1 sharpen the distribution, and values above 1 flatten it.
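
To make the two filters concrete, here is a small sketch that applies them to a hand-picked four-token distribution. The helper name filter_logits and the toy probabilities are assumptions for illustration; the logic mirrors the filtering steps in generate_text.

```python
import torch
import torch.nn.functional as F

def filter_logits(logits: torch.Tensor, top_k: int = 0, top_p: float = 1.0) -> torch.Tensor:
    """Return a copy of `logits` (batch x vocab) with filtered entries set to -inf."""
    logits = logits.clone()
    if top_k > 0:
        k = min(top_k, logits.size(-1))  # guard against k larger than the vocabulary
        kth_value = torch.topk(logits, k, dim=-1).values[:, -1].unsqueeze(-1)
        logits[logits < kth_value] = float("-inf")
    if top_p < 1.0:
        sorted_logits, sorted_idx = torch.sort(logits, descending=True, dim=-1)
        cum_probs = torch.cumsum(F.softmax(sorted_logits, dim=-1), dim=-1)
        remove = cum_probs > top_p
        # Shift right so the token that crosses the threshold is kept
        remove[:, 1:] = remove[:, :-1].clone()
        remove[:, 0] = False
        logits[remove.scatter(1, sorted_idx, remove)] = float("-inf")
    return logits

# Toy distribution over 4 tokens
probs = torch.tensor([[0.5, 0.3, 0.15, 0.05]])
filtered = filter_logits(probs.log(), top_k=3, top_p=0.7)
kept = torch.isfinite(filtered[0]).nonzero().flatten().tolist()
final_probs = F.softmax(filtered, dim=-1)
print(kept, final_probs)
```

Top-k first drops the least likely token. Top-p then keeps the smallest prefix whose cumulative probability reaches 0.7, which leaves tokens 0 and 1 with renormalized probabilities 0.625 and 0.375.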

Pitfalls and Production Tips

Memory Management

The single biggest issue when training LLMs from scratch is GPU memory. A 12-layer model with 768-dimensional embeddings and 12 attention heads has roughly 124 million parameters, which take about 0.5GB in float32. Gradients add another 0.5GB and the optimizer states about 1GB (AdamW stores 2 additional values per parameter), so roughly 2GB is committed before any activations are stored. Activations scale with batch size and sequence length and can push total usage to 8-12GB.
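
A back-of-the-envelope estimate of the fixed memory cost (everything except activations) can be sketched as follows. The parameter count of about 124.4 million is an assumption derived from the 12-layer, 768-dimensional configuration with tied embeddings, and training_memory_gib is a hypothetical helper.

```python
def training_memory_gib(n_params: int, bytes_per_value: int = 4) -> dict:
    """Rough float32 memory for weights, gradients, and AdamW state (activations excluded)."""
    gib = 1024 ** 3
    weights = n_params * bytes_per_value
    gradients = n_params * bytes_per_value
    adamw_state = 2 * n_params * bytes_per_value  # first and second moment estimates
    return {
        "weights": weights / gib,
        "gradients": gradients / gib,
        "adamw_state": adamw_state / gib,
        "total": (weights + gradients + adamw_state) / gib,
    }

# ~124.4M parameters: assumed count for the 12-layer, 768-dim model with tied embeddings
estimate = training_memory_gib(124_402_944)
for name, value in estimate.items():
    print(f"{name:>12}: {value:.2f} GiB")
```

Activation memory comes on top of this and depends on batch size, sequence length, and whether gradient checkpointing is used, so measure it on your own hardware.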

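Use gradient checkpointing to trade compute for memory. This recomputes activations during the backward pass instead of storing them, which can reduce activation memory on the order of 60% at the cost of roughly 20% slower training. The exact numbers depend on model depth and batch shape.

# Enable gradient checkpointing. GPTModel is a plain nn.Module, so the Hugging Face
# method gradient_checkpointing_enable() is not available; wrap each block call in GPTModel.forward instead:
from torch.utils.checkpoint import checkpoint
x = checkpoint(block, x, mask, use_reentrant=False)  # replaces: x = block(x, mask)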
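
For a plain nn.Module, one way to get gradient checkpointing is torch.utils.checkpoint. The sketch below uses a hypothetical TinyStack class as a stand-in for the stack of transformer blocks and checks that checkpointing leaves the gradients unchanged; the layer sizes are arbitrary.

```python
import torch
import torch.nn as nn
from torch.utils.checkpoint import checkpoint

class TinyStack(nn.Module):
    """Hypothetical stand-in for a stack of transformer blocks."""
    def __init__(self, d_model: int = 32, n_layers: int = 4, use_checkpoint: bool = False):
        super().__init__()
        self.use_checkpoint = use_checkpoint
        self.blocks = nn.ModuleList([
            nn.Sequential(nn.Linear(d_model, d_model), nn.GELU())
            for _ in range(n_layers)
        ])

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        for block in self.blocks:
            if self.use_checkpoint and self.training:
                # Activations inside `block` are dropped and recomputed during backward
                x = x + checkpoint(block, x, use_reentrant=False)
            else:
                x = x + block(x)
        return x

torch.manual_seed(0)
plain = TinyStack(use_checkpoint=False)
ckpt = TinyStack(use_checkpoint=True)
ckpt.load_state_dict(plain.state_dict())  # identical weights in both models

x = torch.randn(8, 32)
plain(x).sum().backward()
ckpt(x).sum().backward()

grad_plain = plain.blocks[0][0].weight.grad
grad_ckpt = ckpt.blocks[0][0].weight.grad
print(torch.allclose(grad_plain, grad_ckpt, atol=1e-6))
```

Checkpointing changes only what is kept in memory between the forward and backward pass, so the results should match. Restricting it to training mode avoids pointless recomputation bookkeeping during evaluation.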

Numerical Stability

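The softmax in attention can produce NaN values when logits become very large (for example in float16) or when every position in a row is masked to -inf. The standard safeguard is to subtract the row-wise maximum before exponentiating, which leaves the softmax output mathematically unchanged. F.softmax already applies this shift internally, so the explicit version below matters mainly if you implement softmax by hand: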

# In MultiHeadAttention.forward
scores = torch.matmul(Q, K.transpose(-2, -1)) / math.sqrt(self.d_k)
scores = scores - scores.max(dim=-1, keepdim=True).values  # Subtract max for numerical stability
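
As a quick check of why the max-subtraction helps, the sketch below compares a hand-written softmax with and without the shift. The logit values are deliberately extreme and made up for illustration.

```python
import torch
import torch.nn.functional as F

scores = torch.tensor([1000.0, 999.0, 998.0])  # deliberately extreme logits

# Naive softmax: exp(1000) overflows to inf in float32, and inf / inf is NaN
naive = torch.exp(scores) / torch.exp(scores).sum()

# Shifted softmax: the largest exponent becomes exp(0) = 1, so nothing overflows
shifted = scores - scores.max()
stable = torch.exp(shifted) / torch.exp(shifted).sum()

# PyTorch's built-in softmax for comparison
builtin = F.softmax(scores, dim=-1)
print(naive, stable, builtin)
```

The naive version returns NaN for every entry, while the shifted version agrees with F.softmax. The shift does not help when a whole row is masked to -inf; that case has to be avoided in how the mask is built.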

Learning Rate Warmup

Starting training at the full learning rate often destabilizes the first updates and can make the loss diverge. The 1% warmup in the scheduler above is a reasonable lower bound; for deeper models (24+ layers), 5-10% warmup is a safer choice.
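
The shape of a warmup-plus-cosine schedule is easy to inspect outside of a training run. This sketch reimplements the multiplier as a standalone function; the name lr_factor and the run length of 1,000 steps are assumptions for illustration.

```python
import math

def lr_factor(step: int, warmup_steps: int, total_steps: int) -> float:
    """Multiplier applied to the base learning rate at a given optimizer step."""
    if step < warmup_steps:
        return step / warmup_steps  # linear ramp from 0 to 1
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return 0.5 * (1 + math.cos(math.pi * min(1.0, progress)))  # cosine decay from 1 to 0

total_steps = 1000  # hypothetical run length
for fraction in (0.01, 0.05, 0.10):
    warmup_steps = int(fraction * total_steps)
    samples = [round(lr_factor(s, warmup_steps, total_steps), 3) for s in (0, 5, 50, 500, 1000)]
    print(f"warmup {fraction:.0%}: {samples}")
```

Multiply the returned factor by the base learning rate (3e-4 in the training loop) to get the actual rate. A longer warmup keeps the rate lower during the early steps, which is where divergence usually shows up.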

Data Quality Over Quantity

The LLMs-from-scratch repository trains on a short public-domain story for demonstration, but for production you need clean, deduplicated data. Documents that appear many times in the training set push the model to memorize and repeat them verbatim. The datasets library's filter and map methods handle basic cleaning such as dropping very short documents and stripping whitespace; deduplication is a separate step you add on top:

from datasets import load_dataset

dataset = load_dataset("your_dataset", split="train")
dataset = dataset.filter(lambda x: len(x['text']) > 100)  # Remove short documents
dataset = dataset.map(lambda x: {'text': x['text'].strip()})
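
Exact deduplication can be sketched with nothing but the standard library. The helper names normalize and deduplicate and the sample documents are hypothetical; this version catches only documents that are identical after lowercasing and whitespace normalization.

```python
import hashlib

def normalize(text: str) -> str:
    """Collapse whitespace and lowercase so trivial variants hash identically."""
    return " ".join(text.lower().split())

def deduplicate(texts):
    """Keep the first occurrence of each document, comparing normalized SHA-256 digests."""
    seen = set()
    unique = []
    for text in texts:
        digest = hashlib.sha256(normalize(text).encode("utf-8")).hexdigest()
        if digest not in seen:
            seen.add(digest)
            unique.append(text)
    return unique

docs = [
    "The quick brown fox.",
    "the quick   brown fox.",   # same document after normalization
    "A different document.",
]
print(deduplicate(docs))
```

Storing digests instead of full texts keeps memory use modest on large corpora. Near-duplicates such as lightly edited copies need a fuzzy method, which is outside the scope of this sketch.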

What's Next

The model we built here is a solid foundation, but production LLMs require additional components. The "Awesome-Knowledge-Distillation-of-LLMs" repository (1,264 stars, 71 forks) collects papers on distilling large models into smaller, faster versions. Distillation is one of the main techniques for getting strong performance out of models small enough to run on consumer hardware.

For safety monitoring, the paper "Online Safety Monitoring for LLMs" by Schirmer et al. (published July 2, 2026 on arXiv) presents techniques for detecting harmful outputs during inference. This is essential for any production deployment.

If you're interested in efficient inference, the "WattGPU" paper by Argerich et al. (also July 2, 2026) predicts power consumption and latency across different GPU architectures. This matters when you're choosing between cloud providers or designing edge deployments.

The complete code for this tutorial is available on GitHub. Start with the 12-layer, 768-dimensional model on a small dataset like TinyStories, then scale up once you've validated the training pipeline. Building an LLM from scratch is not about competing with OpenAI—it's about understanding the technology well enough to customize it for your specific use case.

References

1. Wikipedia - PyTorch. Wikipedia.

2. Wikipedia - RAG. Wikipedia.

3. Wikipedia - OpenAI. Wikipedia.

4. GitHub - pytorch/pytorch. GitHub.

5. GitHub - Shubhamsaboo/awesome-llm-apps. GitHub.

6. GitHub - openai/openai-python. GitHub.

7. GitHub - fighting41love/funNLP. GitHub.

8. OpenAI Pricing. Pricing.
