How to Fine-Tune Mistral Models with Unsloth
Practical tutorial: Fine-tune Mistral models on your data with Unsloth
Fine-tuning large language models has traditionally required significant computational resources and deep expertise in distributed training. Unsloth changes this paradigm by providing memory-efficient training that can run on consumer GPUs. In this tutorial, we'll walk through the complete process of fine-tuning Mistral models using Unsloth, from environment setup to production deployment considerations.
Understanding Unsloth's Architecture and Memory Optimization
Unsloth achieves its memory efficiency through several key innovations in the training pipeline. The library implements optimized kernels for attention mechanisms and uses 4-bit quantization techniques that maintain model quality while dramatically reducing memory footprint. According to the Unsloth documentation, their approach can reduce memory usage by up to 50% compared to standard fine-tuning methods.
The core architecture leverages:
- Quantized Low-Rank Adaptation (QLoRA): Combines 4-bit NormalFloat quantization with LoRA adapters
- Custom GPU kernels: Hand-written in OpenAI's Triton language for NVIDIA GPUs (CUDA capability 7.0 and newer, which includes the T4 and the RTX 20 series onward)
- Gradient checkpointing: Reduces memory by recomputing activations during backward pass
- Flash Attention v2: Efficient attention computation for longer sequences
For Mistral models specifically, Unsloth handles the architectural nuances including the sliding window attention mechanism and grouped-query attention patterns. This means you don't need to manually configure these parameters - Unsloth automatically detects and optimizes them.
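To make the memory savings concrete, the following back-of-the-envelope sketch estimates weight storage at different precisions. The parameter count of roughly 7.24 billion for Mistral 7B is an assumption supplied here, not a figure from this tutorial, and the estimate deliberately ignores quantization constants, LoRA adapters, optimizer state, and activations.

```python
def weight_memory_gib(num_params: float, bits_per_param: int) -> float:
    """Return the storage needed for the weights alone, in GiB."""
    total_bytes = num_params * bits_per_param / 8
    return total_bytes / 1024**3

# Assumption: roughly 7.24 billion parameters for Mistral 7B
MISTRAL_7B_PARAMS = 7.24e9

for bits in (16, 8, 4):
    gib = weight_memory_gib(MISTRAL_7B_PARAMS, bits)
    print(f"{bits:>2}-bit weights: {gib:.1f} GiB")
```

Under these assumptions, 4-bit weights take a quarter of the space of 16-bit weights, which is the main reason a 7B model can fit on smaller GPUs. Actual usage during training will be higher once activations and optimizer state are included.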
Prerequisites and Environment Setup
Before beginning, ensure you have the following hardware and software requirements:
Hardware Requirements:
- GPU with at least 8GB VRAM (16GB+ recommended for Mistral 7B)
- 16GB+ system RAM
- 50GB+ free disk space for model storage and datasets
Software Requirements:
- Python 3.10 or newer
- CUDA 11.8 or newer (12.1 recommended)
- PyTorch 2.1.0 or newer
Let's set up our environment:
# Create a fresh Python environment
python -m venv unsloth_env
source unsloth_env/bin/activate # On Windows: unsloth_env\Scripts\activate
# Install PyTorch with CUDA support
pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu121
# Install Unsloth and dependencies
pip install "unsloth[colab-new] @ git+https://github.com/unslothai/unsloth.git"
pip install --no-deps trl peft accelerate bitsandbytes
# Install additional utilities
pip install datasets transformers wandb
Important version compatibility note: As of May 2026, the unsloth package requires specific versions of transformers and accelerate. If you encounter import errors, try pinning them:
pip install transformers==4.36.0 accelerate==0.25.0
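Before debugging import errors, it can help to see which versions are actually installed. The sketch below uses only the standard library; the helper names `installed_version` and `environment_report` are hypothetical, and the package list simply mirrors the install commands above.

```python
import sys
from importlib import metadata

def installed_version(package: str):
    """Return the installed version string, or None if the package is missing."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None

def environment_report(packages):
    """Map each package name to its installed version (or None)."""
    return {name: installed_version(name) for name in packages}

# Packages installed by the commands above
PACKAGES = ["torch", "unsloth", "transformers", "accelerate",
            "trl", "peft", "bitsandbytes", "datasets"]

print(f"Python {sys.version_info.major}.{sys.version_info.minor}")
for name, version in environment_report(PACKAGES).items():
    print(f"{name:<14} {version or 'NOT INSTALLED'}")
```

Running this inside the activated virtual environment shows at a glance whether a package is missing or whether its version differs from the one you intended to pin.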
Preparing Your Dataset for Fine-Tuning
The quality of your fine-tuned model depends heavily on your dataset preparation. Unsloth works with the standard Hugging Face datasets library, but there are important considerations for Mistral models.
Dataset Format Requirements
Mistral models expect conversation data in a specific format. For instruction fine-tuning, use the following structure:
from datasets import Dataset
import json

def format_conversation(example):
    """
    Format conversations for Mistral's chat template.
    Mistral uses a specific token structure:
    <s>[INST] Instruction [/INST] Response</s>
    """
    # Fall back to the default when "system" is missing or None
    system_prompt = example.get("system") or "You are a helpful assistant."
    instruction = example["instruction"]
    response = example["response"]
    # Format for Mistral's chat template.
    # Note: many tokenizer configurations prepend <s> on their own; if you see
    # a doubled BOS token after tokenization, drop the literal <s> here.
    formatted = f"<s>[INST] {system_prompt}\n\n{instruction} [/INST] {response}</s>"
    return {"text": formatted}

# Load your dataset (example with custom JSON)
def load_custom_dataset(file_path):
    # Expects a JSON array of objects with "instruction" and "response" keys
    # and an optional "system" key
    with open(file_path, 'r') as f:
        data = json.load(f)
    dataset = Dataset.from_list(data)
    dataset = dataset.map(format_conversation)
    return dataset

# Example usage
dataset = load_custom_dataset("training_data.json")
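The loader above reads a single JSON array of records. As a minimal sketch of that layout, the snippet below writes and re-reads a tiny file using only the standard library. The sample records and the helper names `write_sample_dataset` and `read_records` are hypothetical; only the field names "system", "instruction", and "response" come from the formatting function.

```python
import json
import os
import tempfile

# Hypothetical sample records; replace with your own data
sample_records = [
    {
        "system": "You are a concise technical assistant.",
        "instruction": "What does LoRA stand for?",
        "response": "Low-Rank Adaptation.",
    },
    {
        # No "system" key: the formatter falls back to its default prompt
        "instruction": "Name one benefit of 4-bit quantization.",
        "response": "It reduces the memory needed to store model weights.",
    },
]

def write_sample_dataset(path, records):
    """Write records as a JSON array, the layout json.load() expects."""
    with open(path, "w", encoding="utf-8") as f:
        json.dump(records, f, indent=2, ensure_ascii=False)

def read_records(path):
    """Read the records back as a list of dicts."""
    with open(path, "r", encoding="utf-8") as f:
        return json.load(f)

# Write to a temporary directory so the demo does not touch real data
sample_path = os.path.join(tempfile.mkdtemp(), "training_data.json")
write_sample_dataset(sample_path, sample_records)
print(read_records(sample_path)[0]["instruction"])
```

A file in this shape can be passed straight to `load_custom_dataset`.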
Data Quality Checks
Before training, validate your dataset:
def validate_dataset(dataset):
    """
    Check dataset for common issues that cause training failures.
    """
    issues = []
    # Check for empty responses
    empty_responses = [i for i, ex in enumerate(dataset) if len(ex["response"].strip()) == 0]
    if empty_responses:
        issues.append(f"Found {len(empty_responses)} examples with empty responses")
    # Check sequence length distribution. Whitespace-split words are only a
    # rough proxy; the tokenizer will usually produce more tokens than words.
    lengths = [len(ex["text"].split()) for ex in dataset]
    avg_length = sum(lengths) / len(lengths) if lengths else 0
    max_length = max(lengths, default=0)
    print("Dataset statistics:")
    print(f"  Total examples: {len(dataset)}")
    print(f"  Average sequence length: {avg_length:.0f} words")
    print(f"  Maximum sequence length: {max_length} words")
    # Check for duplicate examples
    texts = [ex["text"] for ex in dataset]
    duplicates = len(texts) - len(set(texts))
    if duplicates > 0:
        issues.append(f"Found {duplicates} duplicate examples")
    return issues

# Run validation
issues = validate_dataset(dataset)
if issues:
    print("Warnings:")
    for issue in issues:
        print(f"  - {issue}")
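The checks above assume every record already has string "instruction" and "response" fields. A complementary sketch, run on the raw records before building the dataset, can catch records that lack them. The function name `find_malformed` and the sample records are hypothetical.

```python
REQUIRED_FIELDS = ("instruction", "response")

def find_malformed(records, required=REQUIRED_FIELDS):
    """Return (index, reason) pairs for records that would break formatting."""
    problems = []
    for i, record in enumerate(records):
        if not isinstance(record, dict):
            problems.append((i, "record is not an object"))
            continue
        for field in required:
            value = record.get(field)
            if value is None:
                problems.append((i, f"missing field '{field}'"))
            elif not isinstance(value, str):
                problems.append((i, f"field '{field}' is not a string"))
    return problems

records = [
    {"instruction": "Define overfitting.", "response": "Fitting noise in the training data."},
    {"instruction": "Define underfitting."},           # no response
    {"instruction": 42, "response": "Not a string."},  # wrong type
]
for index, reason in find_malformed(records):
    print(f"Record {index}: {reason}")
```

Reporting the index alongside the reason makes it straightforward to locate and fix the offending entries in the source file.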
Core Implementation: Fine-Tuning Mistral with Unsloth
Now let's implement the actual fine-tuning process. This is where Unsloth's optimizations shine.
Loading and Quantizing the Model
import torch
from unsloth import FastLanguageModel
from transformers import TrainingArguments
from trl import SFTTrainer
import wandb
# Configuration
MODEL_NAME = "unsloth/mistral-7b-bnb-4bit" # Pre-quantized Mistral 7B
MAX_SEQ_LENGTH = 4096 # Matches the 4096-token sliding attention window of Mistral 7B v0.1
LORA_RANK = 16 # LoRA rank - higher = more capacity, more memory
LORA_ALPHA = 32 # LoRA alpha scaling factor
LORA_DROPOUT = 0.1 # Dropout for regularization
# Load model with 4-bit quantization
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name=MODEL_NAME,
    max_seq_length=MAX_SEQ_LENGTH,
    dtype=None, # Auto-detect best dtype
    load_in_4bit=True, # Use 4-bit quantization
    device_map="auto", # Distribute across available GPUs
)
# Add LoRA adapters
model = FastLanguageModel.get_peft_model(
    model,
    r=LORA_RANK,
    target_modules=[
        "q_proj", "k_proj", "v_proj", "o_proj",
        "gate_proj", "up_proj", "down_proj",
    ],
    lora_alpha=LORA_ALPHA,
    lora_dropout=LORA_DROPOUT,
    bias="none", # Don't train bias terms
    use_gradient_checkpointing="unsloth", # Use Unsloth's optimized checkpointing
    random_state=42,
    use_rslora=True, # Use Rank-Stabilized LoRA
    loftq_config=None, # Disable LoftQ (not needed with 4-bit)
)
print(f"Trainable parameters: {sum(p.numel() for p in model.parameters() if p.requires_grad):,}")
print(f"Total parameters: {sum(p.numel() for p in model.parameters()):,}")
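As a sanity check on the trainable-parameter count printed above, the expected number of LoRA parameters can be worked out by hand: each adapted weight of shape (out, in) gains two low-rank matrices with rank × (in + out) parameters in total. The sketch below applies that formula to the seven target modules. The layer dimensions are assumptions about Mistral 7B that this tutorial does not state, so treat the result as an estimate to compare against, not a guaranteed value.

```python
def lora_params(in_features: int, out_features: int, rank: int) -> int:
    """LoRA adds A (rank x in) and B (out x rank) next to each adapted weight."""
    return rank * (in_features + out_features)

# Assumed Mistral 7B dimensions (not stated in this tutorial)
HIDDEN = 4096          # model width
KV_DIM = 1024          # 8 key/value heads x 128 dims (grouped-query attention)
INTERMEDIATE = 14336   # MLP width
NUM_LAYERS = 32

# (in_features, out_features) for each targeted projection in one layer
PROJECTIONS = {
    "q_proj": (HIDDEN, HIDDEN),
    "k_proj": (HIDDEN, KV_DIM),
    "v_proj": (HIDDEN, KV_DIM),
    "o_proj": (HIDDEN, HIDDEN),
    "gate_proj": (HIDDEN, INTERMEDIATE),
    "up_proj": (HIDDEN, INTERMEDIATE),
    "down_proj": (INTERMEDIATE, HIDDEN),
}

def total_lora_params(rank: int) -> int:
    """Sum LoRA parameters over all targeted projections and layers."""
    per_layer = sum(lora_params(i, o, rank) for i, o in PROJECTIONS.values())
    return per_layer * NUM_LAYERS

print(f"Expected trainable parameters at rank 16: {total_lora_params(16):,}")
```

Because the count is linear in the rank, doubling LORA_RANK doubles the number of trainable parameters and the memory their gradients and optimizer state need.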
Configuring Training Arguments
The training configuration is critical for achieving good results. Here's a production-ready setup:
# Initialize wandb for experiment tracking (optional but recommended)
wandb.init(
    project="mistral-fine-tuning",
    config={
        "model": MODEL_NAME,
        "learning_rate": 2e-4,
        "batch_size": 4,
        "gradient_accumulation": 4,
    }
)
training_args = TrainingArguments(
    output_dir="./mistral-finetuned",
    per_device_train_batch_size=4, # Adjust based on GPU memory
    gradient_accumulation_steps=4, # Effective batch size = 16
    warmup_steps=100,
    num_train_epochs=3,
    learning_rate=2e-4,
    fp16=not torch.cuda.is_bf16_supported(), # Use fp16 if bf16 not available
    bf16=torch.cuda.is_bf16_supported(), # Use bf16 if supported (Ampere+)
    logging_steps=10,
    save_steps=500,
    eval_steps=500,
    evaluation_strategy="steps", # Named eval_strategy in newer transformers releases
    save_strategy="steps",
    load_best_model_at_end=True,
    metric_for_best_model="eval_loss",
    greater_is_better=False,
    push_to_hub=False,
    report_to="wandb", # Log to wandb
    gradient_checkpointing=True,
    optim="adamw_8bit", # Use 8-bit AdamW optimizer
    weight_decay=0.01,
    max_grad_norm=0.3,
    lr_scheduler_type="cosine",
    seed=42,
)
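Settings such as warmup_steps, save_steps, and eval_steps are expressed in optimizer steps, so it is worth checking them against the number of steps your dataset will actually produce. The following sketch approximates that number; the dataset size is hypothetical, and the exact count reported by the trainer may differ slightly because of how partial batches are handled.

```python
import math

def approx_total_steps(num_examples, per_device_batch, grad_accum, epochs, num_devices=1):
    """Approximate optimizer steps: one step per effective batch, per epoch."""
    effective_batch = per_device_batch * grad_accum * num_devices
    steps_per_epoch = math.ceil(num_examples / effective_batch)
    return steps_per_epoch * epochs

# Hypothetical dataset: 1,000 examples with 10% held out for evaluation
train_examples = 900
total = approx_total_steps(train_examples, per_device_batch=4, grad_accum=4, epochs=3)
warmup_steps, save_steps = 100, 500

print(f"Approximate optimizer steps: {total}")
print(f"Warmup share of training: {warmup_steps / total:.0%}")
print(f"Checkpoints written by save_steps: {total // save_steps}")
```

For a dataset this small, a 100-step warmup would cover more than half of training and save_steps=500 would never trigger, so scaling both values to the step count is worth considering.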
Creating the Trainer and Starting Training
# Split dataset into train and validation
dataset = dataset.train_test_split(test_size=0.1, seed=42)
train_dataset = dataset["train"]
eval_dataset = dataset["test"]
# Initialize the SFTTrainer
trainer = SFTTrainer(
    model=model,
    tokenizer=tokenizer,
    train_dataset=train_dataset,
    eval_dataset=eval_dataset,
    dataset_text_field="text",
    max_seq_length=MAX_SEQ_LENGTH,
    dataset_num_proc=2, # Number of processes for dataset preprocessing
    packing=False, # Don't pack sequences (better for instruction tuning)
    args=training_args,
)
# Start training
print("Starting fine-tuning..")
trainer.train()
# Save the final model
trainer.save_model("./mistral-finetuned-final")
tokenizer.save_pretrained("./mistral-finetuned-final")
Handling Edge Cases and Production Considerations
Memory Management
When training on consumer GPUs, memory management is crucial. Here are strategies for handling common edge cases:
def optimize_memory_usage():
    """
    Apply memory optimization techniques for consumer GPUs.
    """
    import gc
    # Release unreferenced Python objects first, then clear the CUDA cache
    gc.collect()
    torch.cuda.empty_cache()
    # Report whether Flash Attention 2 is installed (this check does not enable it)
    from transformers.utils import is_flash_attn_2_available
    if is_flash_attn_2_available():
        print("Flash Attention 2 available - using for memory efficiency")
    # Monitor memory usage
    if torch.cuda.is_available():
        allocated = torch.cuda.memory_allocated() / 1024**3
        reserved = torch.cuda.memory_reserved() / 1024**3
        print(f"GPU Memory - Allocated: {allocated:.2f}GB, Reserved: {reserved:.2f}GB")

# Call before training
optimize_memory_usage()
Handling Long Sequences
Mistral 7B v0.1 uses a sliding attention window of 4096 tokens, and MAX_SEQ_LENGTH is set to match, so longer examples need to be truncated or split before training:
def truncate_or_split_long_sequences(dataset, max_length=4096):
    """
    Handle sequences that exceed the model's maximum context length.
    """
    from transformers import AutoTokenizer
    tokenizer = AutoTokenizer.from_pretrained("mistralai/Mistral-7B-v0.1")

    def process_example(example):
        # The text already contains <s> and </s>, so don't add them again
        tokens = tokenizer.encode(example["text"], add_special_tokens=False)
        if len(tokens) > max_length:
            # Option 1: Drop the middle, keeping the start of the instruction
            # and the end of the response
            half = max_length // 2
            truncated_tokens = tokens[:half] + tokens[-half:]
            example["text"] = tokenizer.decode(truncated_tokens)
            # Option 2: Split into multiple examples (for document-level tasks).
            # This needs dataset.map(..., batched=True), because a non-batched
            # map cannot return more than one row per input.
            # chunks = [tokens[i:i+max_length] for i in range(0, len(tokens), max_length)]
            # return [{"text": tokenizer.decode(chunk)} for chunk in chunks]
        return example

    return dataset.map(process_example)

# Apply truncation (run this before the train/validation split and before training)
dataset = truncate_or_split_long_sequences(dataset)
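The middle-truncation step can be tried in isolation on a plain list of token IDs, without loading a tokenizer. This is a minimal sketch with a hypothetical helper name, `truncate_middle`; unlike the `half + half` slicing above, it also uses the full budget when max_length is odd.

```python
def truncate_middle(tokens, max_length):
    """Keep the first and last tokens so the result has at most max_length items."""
    if max_length <= 0:
        return []
    if len(tokens) <= max_length:
        return list(tokens)
    tail = max_length // 2
    head = max_length - tail  # head gets the extra slot when max_length is odd
    # tokens[-0:] would return everything, so handle tail == 0 explicitly
    return list(tokens[:head]) + (list(tokens[-tail:]) if tail else [])

token_ids = list(range(10))
print(truncate_middle(token_ids, 4))  # [0, 1, 8, 9]
print(truncate_middle(token_ids, 5))  # [0, 1, 2, 8, 9]
```

Keeping both ends preserves the opening of the instruction and the closing of the response, at the cost of losing whatever sat in the middle.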
Resume Training from Checkpoint
Training can be interrupted. Here's how to resume:
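def resume_training(checkpoint_path="./mistral-finetuned/checkpoint-500"):
    """
    Resume training from a saved checkpoint.
    """
    # Rebuild the base model exactly as in the original run
    model, tokenizer = FastLanguageModel.from_pretrained(
        model_name=MODEL_NAME,
        max_seq_length=MAX_SEQ_LENGTH,
        dtype=None,
        load_in_4bit=True,
    )
    # Re-create the LoRA adapters with the same configuration; the trainer
    # restores their weights from the checkpoint
    model = FastLanguageModel.get_peft_model(
        model,
        r=LORA_RANK,
        target_modules=[
            "q_proj", "k_proj", "v_proj", "o_proj",
            "gate_proj", "up_proj", "down_proj",
        ],
        lora_alpha=LORA_ALPHA,
        lora_dropout=LORA_DROPOUT,
        bias="none",
        use_gradient_checkpointing="unsloth",
        random_state=42,
        use_rslora=True,
    )
    # Build a new trainer around the fresh model, reusing the earlier
    # datasets and training arguments
    trainer = SFTTrainer(
        model=model,
        tokenizer=tokenizer,
        train_dataset=train_dataset,
        eval_dataset=eval_dataset,
        dataset_text_field="text",
        max_seq_length=MAX_SEQ_LENGTH,
        packing=False,
        args=training_args,
    )
    # Restore optimizer, scheduler, and step count, then continue training
    trainer.train(resume_from_checkpoint=checkpoint_path)
    return trainer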
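The checkpoint path above is hard-coded to step 500. If you prefer to pick up the most recent checkpoint automatically, a small standard-library helper can scan the output directory. The sketch below assumes checkpoints are stored in subdirectories named `checkpoint-<step>`, as in the path above; the function name `latest_checkpoint` is hypothetical.

```python
import os
import re
import tempfile

CHECKPOINT_PATTERN = re.compile(r"^checkpoint-(\d+)$")

def latest_checkpoint(output_dir):
    """Return the checkpoint-<step> subdirectory with the highest step, or None."""
    if not os.path.isdir(output_dir):
        return None
    best_step, best_path = -1, None
    for name in os.listdir(output_dir):
        match = CHECKPOINT_PATTERN.match(name)
        path = os.path.join(output_dir, name)
        if match and os.path.isdir(path) and int(match.group(1)) > best_step:
            best_step, best_path = int(match.group(1)), path
    return best_path

# Demonstration with a throwaway directory standing in for ./mistral-finetuned
demo_dir = tempfile.mkdtemp()
for step in (500, 1000, 1500):
    os.makedirs(os.path.join(demo_dir, f"checkpoint-{step}"))
print(os.path.basename(latest_checkpoint(demo_dir)))
```

Comparing the step numbers as integers matters here: a plain string sort would rank checkpoint-500 above checkpoint-1500.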
Inference and Deployment
After fine-tuning, you need to properly load and use your model:
def load_finetuned_model(model_path="./mistral-finetuned-final"):
    """
    Load the fine-tuned model for inference.
    """
    # Load with 4-bit quantization for inference
    model, tokenizer = FastLanguageModel.from_pretrained(
        model_name=model_path,
        max_seq_length=MAX_SEQ_LENGTH,
        dtype=None,
        load_in_4bit=True,
    )
    # Enable faster inference
    FastLanguageModel.for_inference(model)
    return model, tokenizer

def generate_response(model, tokenizer, instruction, system_prompt="You are a helpful assistant.", max_new_tokens=512):
    """
    Generate a response using the fine-tuned model.
    """
    # Format the prompt for Mistral
    prompt = f"<s>[INST] {system_prompt}\n\n{instruction} [/INST]"
    # Tokenize; the prompt already contains <s>, so skip automatic special tokens
    inputs = tokenizer(prompt, return_tensors="pt", add_special_tokens=False).to(model.device)
    # Generate
    with torch.no_grad():
        outputs = model.generate(
            **inputs,
            max_new_tokens=max_new_tokens,
            temperature=0.7,
            top_p=0.95,
            top_k=50,
            repetition_penalty=1.1,
            do_sample=True,
            pad_token_id=tokenizer.eos_token_id,
        )
    # Decode only the newly generated tokens, skipping the echoed prompt
    prompt_length = inputs["input_ids"].shape[1]
    response = tokenizer.decode(outputs[0][prompt_length:], skip_special_tokens=True)
    return response.strip()

# Example usage
model, tokenizer = load_finetuned_model()
response = generate_response(
    model,
    tokenizer,
    "Explain the concept of gradient descent in simple terms."
)
print(response)
Performance Optimization and Monitoring
Benchmarking Your Fine-Tuned Model
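def benchmark_model(model, tokenizer, test_prompts, max_new_tokens=256):
    """
    Benchmark inference speed and memory usage.
    """
    import time
    results = []
    for prompt in test_prompts:
        if torch.cuda.is_available():
            # Reset the peak counter so each prompt is measured separately
            torch.cuda.reset_peak_memory_stats()
        # Measure inference time
        start_time = time.perf_counter()
        response = generate_response(model, tokenizer, prompt, max_new_tokens=max_new_tokens)
        inference_time = time.perf_counter() - start_time
        # Count the tokens actually generated (generation may stop early);
        # re-encoding the decoded text gives an approximate count
        generated_tokens = len(tokenizer.encode(response, add_special_tokens=False))
        # Measure memory
        if torch.cuda.is_available():
            memory_used = torch.cuda.max_memory_allocated() / 1024**3
            torch.cuda.empty_cache()
        else:
            memory_used = 0
        results.append({
            "prompt": prompt[:50] + "..",
            "response_length": len(response),
            "inference_time": inference_time,
            "memory_gb": memory_used,
            "tokens_per_second": generated_tokens / inference_time,
        })
    return results
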
# Run benchmark
test_prompts = [
    "What is machine learning?",
    "Write a Python function to sort a list.",
    "Explain quantum computing.",
]
benchmark_results = benchmark_model(model, tokenizer, test_prompts)
for result in benchmark_results:
    print(f"Prompt: {result['prompt']}")
    print(f"  Time: {result['inference_time']:.2f}s")
    print(f"  Speed: {result['tokens_per_second']:.1f} tokens/s")
    print(f"  Memory: {result['memory_gb']:.2f}GB")
    print()
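Per-prompt numbers are noisy, so it can be useful to reduce them to a few aggregate figures. The sketch below does this with the standard library, operating on dictionaries with the same keys that `benchmark_model` returns. The function name `summarize` is hypothetical and the measurements are made up purely to show the shape of the input.

```python
import statistics

def summarize(results):
    """Aggregate per-prompt measurements into a few headline numbers."""
    speeds = [r["tokens_per_second"] for r in results]
    times = [r["inference_time"] for r in results]
    return {
        "runs": len(results),
        "mean_tokens_per_second": statistics.mean(speeds),
        "median_tokens_per_second": statistics.median(speeds),
        "slowest_run_seconds": max(times),
        "peak_memory_gb": max(r["memory_gb"] for r in results),
    }

# Made-up measurements, only to show the shape of the input
fake_results = [
    {"inference_time": 8.0, "tokens_per_second": 32.0, "memory_gb": 5.5},
    {"inference_time": 10.0, "tokens_per_second": 25.0, "memory_gb": 5.75},
    {"inference_time": 6.0, "tokens_per_second": 42.0, "memory_gb": 5.25},
]
summary = summarize(fake_results)
for key, value in summary.items():
    print(f"{key}: {value}")
```

The median is less sensitive than the mean to a single slow run, such as the first call after loading the model.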
What's Next
After successfully fine-tuning your Mistral model, consider these next steps:
- Merge and Export: Use model.save_pretrained_merged() to merge the LoRA adapters into the base weights and create a standalone model for deployment
- Quantization for Production: Convert to GGUF format for llama.cpp deployment
- RLHF Alignment: Apply reinforcement learning from human feedback to further improve responses
- A/B Testing: Deploy both base and fine-tuned models to compare performance
For more advanced techniques, explore our guides on model optimization and production deployment.
The combination of Unsloth's memory-efficient training and Mistral's powerful architecture enables fine-tuning on consumer hardware that was previously only possible with expensive cloud GPUs. By following this tutorial, you've created a production-ready fine-tuning pipeline that can be adapted to various domains and use cases.
Remember to monitor your training with wandb, validate your dataset quality, and always test your model on edge cases before deployment. The techniques covered here form the foundation for building specialized language models that can transform your specific domain applications.