How to Build an LLM from Scratch with PyTorch
Building a large language model from scratch sounds like a task reserved for FAANG research teams with unlimited GPU budgets. But the reality is more accessible than most engineers assume. The repository "LLMs-from-scratch" by Sebastian Raschka has accumulated 87,799 stars and 13,374 forks on GitHub as of mid-2026, and it demonstrates exactly how to implement a ChatGPT-like LLM in PyTorch [4] from scratch, step by step. This tutorial walks through that implementation with production-grade code, real architecture decisions, and the edge cases that will break your model if you ignore them.
Why Build Instead of Fine-Tune
Most teams reach for Hugging Face transformers and call it a day, and that works for most use cases. But when you need to understand exactly how gradients propagate through the attention mechanism, or when you're deploying to edge devices with custom hardware constraints, a black-box approach falls short. Building from scratch gives you control over every parameter, every activation function, and every memory allocation.
The LLMs-from-scratch project is organized mainly as Jupyter notebooks, which makes it ideal for experimentation. But for production deployment, you'll want to extract the core components into standalone Python modules. This tutorial does exactly that.
Prerequisites and Environment Setup
You need a machine with at least 16GB of RAM for the small-scale models we'll build. GPU access is optional for training the tiny versions, but you'll want at least an RTX 3090 or A10G for anything above 1 billion parameters.
# Create a clean environment
python -m venv llm_scratch
source llm_scratch/bin/activate
# Core dependencies
pip install torch==2.3.0 torchvision==0.18.0 --index-url https://download.pytorch.org/whl/cu118
pip install numpy==1.26.4 tiktoken==0.7.0 wandb==0.17.0
pip install datasets==2.19.0 transformers==4.41.0
# For profiling and debugging
pip install torchinfo==1.8.0 memory-profiler==0.61.0
The tiktoken library is OpenAI [8]'s fast BPE tokenizer implementation. We use it instead of Hugging Face's tokenizers because it encodes text quickly, ships the GPT-2 vocabulary out of the box, and has a small dependency footprint. Note that tiktoken only applies an existing vocabulary; it does not train a new one.
Core Architecture: The GPT-Style Decoder-Only Model
The model architecture follows the standard decoder-only transformer pattern used by GPT-2, GPT-3, and most modern LLMs. Here's the complete implementation with every component explained.
import torch
import torch.nn as nn
import torch.nn.functional as F
import math


class MultiHeadAttention(nn.Module):
    """
    Scaled dot-product attention with multiple heads.
    This is the core mechanism that allows the model to attend to different
    positions in the input sequence simultaneously.
    """

    def __init__(self, d_model: int, n_heads: int, dropout: float = 0.1):
        super().__init__()
        assert d_model % n_heads == 0, "d_model must be divisible by n_heads"
        self.d_model = d_model
        self.n_heads = n_heads
        self.d_k = d_model // n_heads  # dimension per head
        # Separate bias-free linear projections for Q, K, V and the output
        self.w_q = nn.Linear(d_model, d_model, bias=False)
        self.w_k = nn.Linear(d_model, d_model, bias=False)
        self.w_v = nn.Linear(d_model, d_model, bias=False)
        self.w_o = nn.Linear(d_model, d_model, bias=False)
        self.dropout = nn.Dropout(dropout)

    def forward(self, x: torch.Tensor, mask: torch.Tensor = None) -> torch.Tensor:
        batch_size, seq_len, _ = x.shape
        # Project and reshape: (batch, seq, d_model) -> (batch, n_heads, seq, d_k)
        Q = self.w_q(x).view(batch_size, seq_len, self.n_heads, self.d_k).transpose(1, 2)
        K = self.w_k(x).view(batch_size, seq_len, self.n_heads, self.d_k).transpose(1, 2)
        V = self.w_v(x).view(batch_size, seq_len, self.n_heads, self.d_k).transpose(1, 2)
        # Scaled dot-product attention
        # Q * K^T / sqrt(d_k) gives attention scores
        scores = torch.matmul(Q, K.transpose(-2, -1)) / math.sqrt(self.d_k)
        # Apply causal mask to prevent attending to future tokens
        if mask is not None:
            scores = scores.masked_fill(mask == 0, float('-inf'))
        # Softmax over the last dimension (attention over sequence positions)
        attn_weights = F.softmax(scores, dim=-1)
        attn_weights = self.dropout(attn_weights)
        # Weighted sum of values
        context = torch.matmul(attn_weights, V)
        # Reshape back: (batch, n_heads, seq, d_k) -> (batch, seq, d_model)
        context = context.transpose(1, 2).contiguous().view(batch_size, seq_len, self.d_model)
        # Final projection
        output = self.w_o(context)
        return output
class FeedForward(nn.Module):
    """
    Position-wise feed-forward network with GELU activation.
    The expansion factor of 4 is standard for GPT-style models.
    """

    def __init__(self, d_model: int, d_ff: int = None, dropout: float = 0.1):
        super().__init__()
        d_ff = d_ff or 4 * d_model  # default expansion factor
        self.linear1 = nn.Linear(d_model, d_ff)
        self.linear2 = nn.Linear(d_ff, d_model)
        self.dropout = nn.Dropout(dropout)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # GELU activation - smoother than ReLU, helps with gradient flow
        return self.linear2(self.dropout(F.gelu(self.linear1(x))))


class TransformerBlock(nn.Module):
    """
    Single transformer block with pre-norm architecture.
    Pre-norm (LayerNorm before attention/FFN) is more stable during training
    than post-norm, especially for deeper models.
    """

    def __init__(self, d_model: int, n_heads: int, dropout: float = 0.1):
        super().__init__()
        self.attention = MultiHeadAttention(d_model, n_heads, dropout)
        self.feed_forward = FeedForward(d_model, dropout=dropout)
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)
        self.dropout = nn.Dropout(dropout)

    def forward(self, x: torch.Tensor, mask: torch.Tensor = None) -> torch.Tensor:
        # Pre-norm: normalize before attention, then residual connection
        attn_output = self.attention(self.norm1(x), mask)
        x = x + self.dropout(attn_output)
        # Pre-norm: normalize before FFN, then residual connection
        ff_output = self.feed_forward(self.norm2(x))
        x = x + self.dropout(ff_output)
        return x
class GPTModel(nn.Module):
    """
    Full GPT-style decoder-only transformer model.
    Architecture decisions:
    - Pre-norm: Better training stability for deep models
    - GELU activation: Smoother gradients than ReLU
    - No bias in QKV projections: Fewer parameters
    - Learnable positional embeddings: Simpler than RoPE for small models
    """

    def __init__(
        self,
        vocab_size: int = 50257,   # GPT-2 vocabulary size
        d_model: int = 768,        # Embedding dimension
        n_heads: int = 12,         # Number of attention heads
        n_layers: int = 12,        # Number of transformer blocks
        max_seq_len: int = 1024,   # Maximum sequence length
        dropout: float = 0.1
    ):
        super().__init__()
        self.token_embedding = nn.Embedding(vocab_size, d_model)
        self.position_embedding = nn.Embedding(max_seq_len, d_model)
        self.dropout = nn.Dropout(dropout)
        # Stack of transformer blocks
        self.blocks = nn.ModuleList([
            TransformerBlock(d_model, n_heads, dropout)
            for _ in range(n_layers)
        ])
        self.final_norm = nn.LayerNorm(d_model)
        self.lm_head = nn.Linear(d_model, vocab_size, bias=False)
        # Tie weights: share embedding and output projection
        # This reduces parameters and improves training stability
        self.token_embedding.weight = self.lm_head.weight
        # Initialize weights
        self._init_weights()

    def _init_weights(self):
        """Initialize weights using GPT-2's initialization scheme."""
        for module in self.modules():
            if isinstance(module, nn.Linear):
                # Small std for stability
                torch.nn.init.normal_(module.weight, mean=0.0, std=0.02)
                if module.bias is not None:
                    torch.nn.init.zeros_(module.bias)
            elif isinstance(module, nn.Embedding):
                torch.nn.init.normal_(module.weight, mean=0.0, std=0.02)
            elif isinstance(module, nn.LayerNorm):
                torch.nn.init.ones_(module.weight)
                torch.nn.init.zeros_(module.bias)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        batch_size, seq_len = x.shape
        # Create the causal mask directly on the input's device
        mask = torch.tril(torch.ones(seq_len, seq_len, device=x.device))
        mask = mask.view(1, 1, seq_len, seq_len)
        # Token + position embeddings
        positions = torch.arange(0, seq_len, device=x.device).unsqueeze(0)
        x = self.token_embedding(x) + self.position_embedding(positions)
        x = self.dropout(x)
        # Pass through transformer blocks
        for block in self.blocks:
            x = block(x, mask)
        # Final normalization and projection to vocabulary
        x = self.final_norm(x)
        logits = self.lm_head(x)
        return logits
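The causal masking used in the attention code above can be checked in isolation. The following is a minimal sketch, not part of the original tutorial: it builds the same lower-triangular mask on random tensors (the shapes are arbitrary illustrative values) and compares the result with PyTorch's fused `F.scaled_dot_product_attention`, which assumes PyTorch 2.0 or newer.

```python
import math

import torch
import torch.nn.functional as F

torch.manual_seed(0)
# Illustrative shapes only: (batch, n_heads, seq_len, d_k)
batch, n_heads, seq_len, d_k = 2, 4, 5, 8
Q = torch.randn(batch, n_heads, seq_len, d_k)
K = torch.randn(batch, n_heads, seq_len, d_k)
V = torch.randn(batch, n_heads, seq_len, d_k)

# Manual path: same steps as MultiHeadAttention.forward, without dropout
mask = torch.tril(torch.ones(seq_len, seq_len)).view(1, 1, seq_len, seq_len)
scores = Q @ K.transpose(-2, -1) / math.sqrt(d_k)
scores = scores.masked_fill(mask == 0, float("-inf"))
weights = F.softmax(scores, dim=-1)
manual = weights @ V

# Fused path: PyTorch applies the causal mask internally
fused = F.scaled_dot_product_attention(Q, K, V, is_causal=True)

print("max abs difference:", (manual - fused).abs().max().item())
print("future positions get zero weight:", bool((weights.triu(diagonal=1) == 0).all()))
```

If the two paths agree, swapping the manual attention for the fused call is a reasonable later optimization, since it avoids materializing the full score matrix on supported hardware.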
The weight tying between the token embedding and the output projection is a significant optimization. For this 768-dimensional model with a 50,257-token vocabulary it removes one 38.6M-parameter matrix, about 24% of the roughly 163M parameters the untied model would have, and it forces the model to learn consistent representations for input and output tokens.
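The effect of weight tying on model size can be worked out by hand. The sketch below counts parameters for the architecture defined above (bias-free attention projections, biased feed-forward layers, LayerNorm with weight and bias); `count_gpt_params` is a hypothetical helper written for this illustration, not a function from the repository.

```python
def count_gpt_params(vocab_size=50257, d_model=768, n_layers=12,
                     max_seq_len=1024, tied=True):
    """Count parameters of the GPTModel layout described in this tutorial."""
    d_ff = 4 * d_model
    tok_emb = vocab_size * d_model
    pos_emb = max_seq_len * d_model
    attn = 4 * d_model * d_model                      # w_q, w_k, w_v, w_o (no bias)
    ffn = (d_model * d_ff + d_ff) + (d_ff * d_model + d_model)
    norms = 2 * 2 * d_model                           # two LayerNorms per block
    block = attn + ffn + norms
    final_norm = 2 * d_model
    lm_head = 0 if tied else vocab_size * d_model     # shared with tok_emb when tied
    return tok_emb + pos_emb + n_layers * block + final_norm + lm_head


tied = count_gpt_params(tied=True)
untied = count_gpt_params(tied=False)
saving = (untied - tied) / untied

print(f"tied:   {tied:,}")
print(f"untied: {untied:,}")
print(f"saving: {saving:.1%}")
```

The same number should come out of `sum(p.numel() for p in model.parameters())` on the tied model, because PyTorch counts a shared parameter once.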
Training Loop with Gradient Accumulation
Training a language model from scratch requires careful management of memory and gradient flow. Here's a production-ready training loop that handles the common failure modes.
import math

import torch
import torch.nn.functional as F
from torch.utils.data import DataLoader, Dataset
from torch.optim import AdamW
import wandb
import tiktoken
from datasets import load_dataset


class TextDataset(Dataset):
    """
    Tokenizes all texts up front and serves fixed-length sliding windows.
    For production, pre-tokenize and save to disk.
    """

    def __init__(self, texts: list, max_length: int = 1024, stride: int = 512):
        self.enc = tiktoken.get_encoding("gpt2")
        self.max_length = max_length
        self.stride = stride
        # Tokenize all texts and concatenate, separated by <|endoftext|>
        self.tokens = []
        for text in texts:
            tokens = self.enc.encode(text)
            self.tokens.extend(tokens)
            self.tokens.append(self.enc.eot_token)
        # Each sample needs max_length + 1 tokens (inputs plus shifted targets)
        self.n_samples = max(0, (len(self.tokens) - (max_length + 1)) // stride + 1)

    def __len__(self):
        return self.n_samples

    def __getitem__(self, idx):
        start = idx * self.stride
        end = start + self.max_length + 1  # +1 for target shift
        chunk = self.tokens[start:end]
        x = torch.tensor(chunk[:-1], dtype=torch.long)
        y = torch.tensor(chunk[1:], dtype=torch.long)
        return x, y
def train_model(
    model: GPTModel,
    train_dataset: TextDataset,
    val_dataset: TextDataset,
    batch_size: int = 4,
    grad_accum_steps: int = 8,  # Effective batch size = batch_size * grad_accum_steps
    learning_rate: float = 3e-4,
    max_epochs: int = 3,
    device: str = "cuda" if torch.cuda.is_available() else "cpu"
):
    model = model.to(device)
    optimizer = AdamW(model.parameters(), lr=learning_rate, weight_decay=0.1)
    # Cosine schedule with linear warmup
    total_steps = max(1, len(train_dataset) * max_epochs // (batch_size * grad_accum_steps))
    warmup_steps = int(0.01 * total_steps)  # 1% warmup

    def lr_lambda(step):
        if step < warmup_steps:
            return step / warmup_steps
        # Guard against division by zero and against stepping past total_steps
        progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
        return 0.5 * (1 + math.cos(math.pi * min(1.0, progress)))

    scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, lr_lambda)
    train_loader = DataLoader(
        train_dataset,
        batch_size=batch_size,
        shuffle=True,
        num_workers=4,
        pin_memory=True
    )
    val_loader = DataLoader(
        val_dataset,
        batch_size=batch_size,
        shuffle=False,
        num_workers=4,
        pin_memory=True
    )
    # Initialize wandb for tracking
    wandb.init(project="llm-from-scratch", config={
        "model_size": sum(p.numel() for p in model.parameters()),
        "batch_size": batch_size,
        "grad_accum_steps": grad_accum_steps,
        "learning_rate": learning_rate,
        "total_steps": total_steps
    })
    global_step = 0
    best_val_loss = float('inf')
    for epoch in range(max_epochs):
        model.train()
        total_loss = 0.0
        optimizer.zero_grad()
        for batch_idx, (x, y) in enumerate(train_loader):
            x, y = x.to(device), y.to(device)
            logits = model(x)
            # No ignore_index: the windows contain no padding, and 50256 is
            # <|endoftext|>, which the model should learn to predict
            loss = F.cross_entropy(
                logits.view(-1, logits.size(-1)),
                y.view(-1)
            )
            total_loss += loss.item()
            # Scale loss for gradient accumulation
            (loss / grad_accum_steps).backward()
            if (batch_idx + 1) % grad_accum_steps == 0:
                # Gradient clipping to prevent exploding gradients;
                # the return value is the total norm before clipping
                grad_norm = torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
                optimizer.step()
                scheduler.step()
                optimizer.zero_grad()
                global_step += 1
                # Logging
                if global_step % 10 == 0:
                    wandb.log({
                        "train_loss": loss.item(),
                        "learning_rate": scheduler.get_last_lr()[0],
                        "gradient_norm": grad_norm.item()
                    })
        # Validation at end of epoch
        model.eval()
        val_loss = 0.0
        with torch.no_grad():
            for x, y in val_loader:
                x, y = x.to(device), y.to(device)
                logits = model(x)
                loss = F.cross_entropy(
                    logits.view(-1, logits.size(-1)),
                    y.view(-1)
                )
                val_loss += loss.item()
        val_loss /= max(1, len(val_loader))
        wandb.log({"val_loss": val_loss, "epoch": epoch})
        # Save best model
        if val_loss < best_val_loss:
            best_val_loss = val_loss
            torch.save({
                'epoch': epoch,
                'model_state_dict': model.state_dict(),
                'optimizer_state_dict': optimizer.state_dict(),
                'val_loss': val_loss,
            }, 'best_model.pt')
        train_loss = total_loss / max(1, len(train_loader))
        print(f"Epoch {epoch}: Train Loss: {train_loss:.4f}, Val Loss: {val_loss:.4f}")
    wandb.finish()
    return model
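The sliding-window sampling in `TextDataset` is easy to get wrong by one token, because each sample needs `max_length + 1` tokens to produce both inputs and shifted targets. The sketch below reproduces the windowing on a plain list of integers so the indices can be inspected directly; `make_windows` is a hypothetical stand-in written for this illustration and uses no tokenizer.

```python
def make_windows(tokens, max_length, stride):
    """Return (input, target) pairs; targets are the inputs shifted by one."""
    # Each window consumes max_length + 1 tokens
    n_samples = max(0, (len(tokens) - (max_length + 1)) // stride + 1)
    samples = []
    for idx in range(n_samples):
        start = idx * stride
        chunk = tokens[start:start + max_length + 1]
        samples.append((chunk[:-1], chunk[1:]))
    return samples


tokens = list(range(10))  # stand-in for token ids
samples = make_windows(tokens, max_length=4, stride=2)
for x, y in samples:
    print(x, "->", y)
```

With a stride smaller than `max_length` the windows overlap, which yields more samples from the same text at the cost of showing some tokens to the model more than once per epoch.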
The gradient accumulation parameter is critical when GPU memory is limited. With a batch size of 4 and 8 accumulation steps, you get an effective batch size of 32 while only holding the activations for 4 sequences in memory at a time. This matters because many LLM training setups use batch sizes of 512 sequences or more, which cannot fit on a single GPU without accumulation.
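The claim that accumulation reproduces a larger batch can be verified on a toy problem. This is a minimal sketch under simplifying assumptions: a single `nn.Linear` layer stands in for the language model, the data is random, and all micro-batches have equal size (with unequal sizes the scaled average would no longer match exactly).

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)
model = nn.Linear(10, 3)           # toy stand-in for the language model
x = torch.randn(8, 10)
y = torch.randint(0, 3, (8,))

# Gradient from one full batch of 8
model.zero_grad()
F.cross_entropy(model(x), y).backward()
full_grad = model.weight.grad.clone()

# Gradient accumulated over 4 micro-batches of 2
accum_steps = 4
model.zero_grad()
for xb, yb in zip(x.chunk(accum_steps), y.chunk(accum_steps)):
    loss = F.cross_entropy(model(xb), yb) / accum_steps  # scale before backward
    loss.backward()                                      # grads add up in .grad
accum_grad = model.weight.grad.clone()

print("max abs difference:", (full_grad - accum_grad).abs().max().item())
```

Dividing each micro-batch loss by the number of accumulation steps is what makes the summed gradients equal the full-batch mean; without the scaling the accumulated gradient would be too large by that factor.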
Text Generation with Temperature and Top-k Sampling
Once trained, the model needs a generation function that handles the stochastic nature of language generation. Here's a production-ready implementation.
@torch.no_grad()
def generate_text(
    model: GPTModel,
    prompt: str,
    max_new_tokens: int = 100,
    temperature: float = 0.7,
    top_k: int = 50,
    top_p: float = 0.9,
    device: str = "cuda" if torch.cuda.is_available() else "cpu"
) -> str:
    """
    Generate text with temperature scaling, top-k filtering, and nucleus sampling.
    Temperature: Controls randomness. Lower = more deterministic. Must be > 0.
    Top-k: Only sample from the k most likely tokens.
    Top-p (nucleus): Only sample from the smallest set of tokens whose
    cumulative probability reaches p.
    """
    model.to(device)
    model.eval()
    enc = tiktoken.get_encoding("gpt2")
    max_len = model.position_embedding.num_embeddings
    # Encode prompt
    input_ids = enc.encode(prompt)
    input_tensor = torch.tensor([input_ids], dtype=torch.long).to(device)
    for _ in range(max_new_tokens):
        # Feed at most max_len tokens to the model, but keep the full
        # sequence so the prompt is not lost from the returned text
        context = input_tensor[:, -max_len:]
        # Forward pass
        logits = model(context)
        next_token_logits = logits[:, -1, :] / temperature
        # Top-k filtering
        if top_k > 0:
            k = min(top_k, next_token_logits.size(-1))
            top_k_values, _ = torch.topk(next_token_logits, k, dim=-1)
            min_top_k = top_k_values[:, -1].unsqueeze(-1)
            next_token_logits[next_token_logits < min_top_k] = float('-inf')
        # Top-p (nucleus) filtering
        if top_p < 1.0:
            sorted_logits, sorted_indices = torch.sort(next_token_logits, descending=True)
            cumulative_probs = torch.cumsum(F.softmax(sorted_logits, dim=-1), dim=-1)
            # Remove tokens with cumulative probability above threshold
            sorted_indices_to_remove = cumulative_probs > top_p
            # Shift right so the first token that crosses the threshold is kept
            sorted_indices_to_remove[:, 1:] = sorted_indices_to_remove[:, :-1].clone()
            sorted_indices_to_remove[:, 0] = False
            indices_to_remove = sorted_indices_to_remove.scatter(
                1, sorted_indices, sorted_indices_to_remove
            )
            next_token_logits[indices_to_remove] = float('-inf')
        # Sample from filtered distribution
        probs = F.softmax(next_token_logits, dim=-1)
        next_token = torch.multinomial(probs, num_samples=1)
        # Stop before appending the end-of-text token (50256 for GPT-2)
        if next_token.item() == enc.eot_token:
            break
        # Append to sequence
        input_tensor = torch.cat([input_tensor, next_token], dim=-1)
    # Decode
    generated_ids = input_tensor[0].tolist()
    generated_text = enc.decode(generated_ids)
    return generated_text
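The combination of top-k and top-p filtering keeps the model from sampling very unlikely tokens while maintaining diversity. Without these filters, sampling from the full distribution occasionally picks a token from the long low-probability tail, and a single such token can derail the rest of the text. The temperature parameter gives you control over the creativity vs. determinism tradeoff: values near zero approach greedy decoding, which is more prone to repetitive loops.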
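To see which tokens survive each filter, the same logic can be applied to a tiny hand-made distribution. The sketch below is illustrative only: `filter_logits` is a hypothetical helper that mirrors the filtering steps of the generation function, and the five-token probability vector is invented for the example.

```python
import torch
import torch.nn.functional as F


def filter_logits(logits, top_k=0, top_p=1.0):
    """Return a copy of logits with filtered-out entries set to -inf."""
    logits = logits.clone()
    if top_k > 0:
        k = min(top_k, logits.size(-1))
        kth_value = torch.topk(logits, k, dim=-1).values[:, -1].unsqueeze(-1)
        logits[logits < kth_value] = float("-inf")
    if top_p < 1.0:
        sorted_logits, sorted_idx = torch.sort(logits, descending=True, dim=-1)
        cum_probs = torch.cumsum(F.softmax(sorted_logits, dim=-1), dim=-1)
        remove = cum_probs > top_p
        # Shift right so the token that crosses the threshold is kept
        remove[:, 1:] = remove[:, :-1].clone()
        remove[:, 0] = False
        # Map the mask from sorted order back to vocabulary order
        remove = remove.scatter(1, sorted_idx, remove)
        logits[remove] = float("-inf")
    return logits


# Invented toy distribution over a 5-token vocabulary
probs = torch.tensor([[0.5, 0.3, 0.1, 0.05, 0.05]])
logits = probs.log()

after_k = filter_logits(logits, top_k=3)
after_kp = filter_logits(logits, top_k=3, top_p=0.7)
print("kept after top-k:        ", torch.isfinite(after_k).tolist())
print("kept after top-k + top-p:", torch.isfinite(after_kp).tolist())
```

Top-k keeps the three most likely tokens; the nucleus step then renormalizes over those three and drops the third, because the first two already cover more than 70% of the remaining probability mass.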
Pitfalls and Production Tips
Memory Management
The single biggest issue when training LLMs from scratch is GPU memory. A 12-layer model with 768-dimensional embeddings and 12 attention heads has about 124M parameters, which is roughly 0.5GB in FP32 for the parameters alone. Gradients add another 0.5GB and the optimizer states (AdamW stores 2 additional values per parameter) about 1GB more. Activations come on top of that and can push total memory usage to 8-12GB, depending on batch size and sequence length.
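A back-of-the-envelope estimate of the fixed memory cost helps when sizing a GPU. The sketch below assumes FP32 everywhere and the tied 12-layer, 768-dimensional configuration from this tutorial (124,402,944 parameters); activation memory is deliberately left out because it depends on batch size and sequence length.

```python
BYTES_PER_FP32 = 4
GIB = 2 ** 30

n_params = 124_402_944  # tied 12-layer, 768-dim model from this tutorial

weights_gib = n_params * BYTES_PER_FP32 / GIB
grads_gib = n_params * BYTES_PER_FP32 / GIB      # one gradient per parameter
adamw_gib = 2 * n_params * BYTES_PER_FP32 / GIB  # first and second moment estimates
total_gib = weights_gib + grads_gib + adamw_gib

print(f"weights:    {weights_gib:.2f} GiB")
print(f"gradients:  {grads_gib:.2f} GiB")
print(f"AdamW:      {adamw_gib:.2f} GiB")
print(f"total (excluding activations): {total_gib:.2f} GiB")
```

Everything above this fixed cost is activation memory, which is the part that gradient accumulation and gradient checkpointing reduce.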
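Use gradient checkpointing to trade compute for memory. This recomputes activations during the backward pass instead of storing them. As a rough guide, expect on the order of 60% less activation memory for about 20% slower training; the exact figures depend on the model and on how many blocks you checkpoint.
# GPTModel is a plain nn.Module, so it has no gradient_checkpointing_enable() (a Hugging Face method).
# Instead, in GPTModel.forward replace `x = block(x, mask)` with:
from torch.utils.checkpoint import checkpoint
x = checkpoint(block, x, mask, use_reentrant=False)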
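For a hand-written `nn.Module`, gradient checkpointing can be sketched with `torch.utils.checkpoint.checkpoint`. The example below is a toy illustration: the small `nn.Sequential` blocks are stand-ins for transformer blocks, and dropout is omitted so that the plain and checkpointed runs are directly comparable.

```python
import torch
import torch.nn as nn
from torch.utils.checkpoint import checkpoint

torch.manual_seed(0)
# Toy stand-ins for transformer blocks (no dropout, so runs are deterministic)
blocks = nn.ModuleList(
    [nn.Sequential(nn.Linear(16, 16), nn.GELU()) for _ in range(3)]
)


def run(x, use_checkpoint):
    for block in blocks:
        if use_checkpoint:
            # Activations inside the block are recomputed during backward
            x = checkpoint(block, x, use_reentrant=False)
        else:
            x = block(x)
    return x.sum()


x = torch.randn(4, 16, requires_grad=True)

blocks.zero_grad()
run(x, use_checkpoint=False).backward()
plain_grad = blocks[0][0].weight.grad.clone()

blocks.zero_grad()
x.grad = None
run(x, use_checkpoint=True).backward()
ckpt_grad = blocks[0][0].weight.grad.clone()

print("gradients match:", torch.allclose(plain_grad, ckpt_grad))
```

The gradients are identical; only the memory and compute profile changes. In a real model it is common to apply checkpointing during training only, since it brings no benefit under `torch.no_grad()`.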
Numerical Stability
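The softmax in attention can produce NaN values when logits become very large, which is most likely in reduced precision such as FP16. The standard fix is to subtract the row-wise maximum from the scores before exponentiating, which leaves the softmax output unchanged. PyTorch's F.softmax already applies this shift internally, so the explicit subtraction is mainly a safeguard for hand-written softmax code: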
# In MultiHeadAttention.forward
scores = torch.matmul(Q, K.transpose(-2, -1)) / math.sqrt(self.d_k)
scores = scores - scores.max(dim=-1, keepdim=True).values # Subtract max for numerical stability
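Why subtracting the maximum helps is easiest to see without any tensor library. The sketch below uses only the Python standard library and two hypothetical helpers, `naive_softmax` and `stable_softmax`, to contrast the two formulations on deliberately large inputs.

```python
import math


def naive_softmax(values):
    exps = [math.exp(v) for v in values]       # overflows for large v
    total = sum(exps)
    return [e / total for e in exps]


def stable_softmax(values):
    shift = max(values)
    exps = [math.exp(v - shift) for v in values]  # largest exponent is exp(0) = 1
    total = sum(exps)
    return [e / total for e in exps]


large_logits = [1000.0, 1001.0, 1002.0]

try:
    naive_softmax(large_logits)
    naive_failed = False
except OverflowError:
    naive_failed = True

stable = stable_softmax(large_logits)
small = naive_softmax([0.0, 1.0, 2.0])  # same differences, safe magnitude

print("naive overflowed:", naive_failed)
print("stable result:   ", [round(p, 4) for p in stable])
```

Because softmax depends only on the differences between logits, the shifted version gives the same probabilities as the naive one whenever the naive one is computable.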
Learning Rate Warmup
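Starting training at the full learning rate can cause the model to diverge within the first few hundred steps, because the early gradients are large and the Adam moment estimates are still unreliable. Treat the 1% warmup in the scheduler above as a lower bound rather than a guarantee. For deeper models (24+ layers), use 5-10% warmup.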
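The shape of a warmup-plus-cosine schedule can be inspected without running any training. The sketch below is a standalone version of the multiplier used in the training loop; `lr_factor` and the step counts are illustrative choices, and the returned value is meant to be multiplied by the base learning rate.

```python
import math


def lr_factor(step, warmup_steps, total_steps):
    """Multiplier for the base learning rate: linear warmup, then cosine decay."""
    if step < warmup_steps:
        return step / warmup_steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return 0.5 * (1 + math.cos(math.pi * min(1.0, progress)))


warmup_steps, total_steps = 10, 100  # illustrative values: 10% warmup
base_lr = 3e-4

for step in (0, 5, 10, 55, 100):
    factor = lr_factor(step, warmup_steps, total_steps)
    print(f"step {step:3d}: lr = {base_lr * factor:.2e}")
```

The multiplier rises linearly from 0 to 1 over the warmup steps, then follows half a cosine wave down to 0 at the final step. Clamping the progress at 1.0 keeps the rate at zero if training runs slightly past `total_steps`.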
Data Quality Over Quantity
The LLMs-from-scratch repository trains on a very small text (the short story "The Verdict") for demonstration, but for production you need clean, deduplicated data. Documents that appear many times in the training set are far more likely to be memorized and repeated verbatim. Basic cleaning with the datasets library's filter and map looks like this; deduplication has to be layered on top:
from datasets import load_dataset
dataset = load_dataset("your_dataset", split="train")
dataset = dataset.filter(lambda x: len(x['text']) > 100) # Remove short documents
dataset = dataset.map(lambda x: {'text': x['text'].strip()})
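Exact deduplication can be sketched with hashing alone. The example below is a minimal illustration on a plain Python list rather than a `datasets` object: `dedup_exact` is a hypothetical helper, and the normalization (lowercasing and collapsing whitespace) is one assumed choice among many. It catches only exact matches after normalization; near-duplicates need techniques such as MinHash, which are out of scope here.

```python
import hashlib


def dedup_exact(texts):
    """Keep the first occurrence of each text, comparing normalized content."""
    seen = set()
    unique = []
    for text in texts:
        # Assumed normalization: collapse whitespace and lowercase
        normalized = " ".join(text.split()).lower()
        digest = hashlib.sha256(normalized.encode("utf-8")).hexdigest()
        if digest not in seen:
            seen.add(digest)
            unique.append(text)
    return unique


docs = [
    "The quick brown fox.",
    "the  quick brown fox.",   # duplicate after normalization
    "A different document.",
]
unique_docs = dedup_exact(docs)
print(f"{len(docs)} documents -> {len(unique_docs)} unique")
```

Storing digests instead of full texts keeps the memory footprint of the `seen` set roughly constant per document, which matters once the corpus no longer fits comfortably in RAM.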
What's Next
The model we built here is a solid foundation, but production LLMs require additional components. The "Awesome-Knowledge-Distillation-of-LLMs" repository (1,264 stars, 71 forks) collects papers on distilling large models into smaller, faster versions. Distillation is one of the main techniques for bringing the capabilities of large models to smaller hardware budgets.
For safety monitoring, the paper "Online Safety Monitoring for LLMs" by Schirmer et al. (published July 2, 2026 on arXiv) presents techniques for detecting harmful outputs during inference. This is essential for any production deployment.
If you're interested in efficient inference, the "WattGPU" paper by Argerich et al. (also July 2, 2026) predicts power consumption and latency across different GPU architectures. This matters when you're choosing between cloud providers or designing edge deployments.
The complete code for this tutorial is available on GitHub. Start with the 12-layer, 768-dimensional model on a small dataset like TinyStories, then scale up once you've validated the training pipeline. Building an LLM from scratch is not about competing with OpenAI—it's about understanding the technology well enough to customize it for your specific use case.
References