
How to Fine-Tune Mistral Models with Unsloth

Practical tutorial: Fine-tune Mistral models on your data with Unsloth

BlogIA Academy | May 25, 2026 | 10 min read | 1,959 words


Fine-tuning large language models has traditionally required significant computational resources and deep expertise in distributed training. Unsloth changes this paradigm by providing memory-efficient training that can run on consumer GPUs. In this tutorial, we'll walk through the complete process of fine-tuning Mistral models using Unsloth, from environment setup to production deployment considerations.

Understanding Unsloth's Architecture and Memory Optimization

Unsloth achieves its memory efficiency through several key innovations in the training pipeline. The library implements optimized kernels for attention mechanisms and uses 4-bit quantization techniques that maintain model quality while dramatically reducing memory footprint. According to the Unsloth documentation, their approach can reduce memory usage by up to 50% compared to standard fine-tuning methods.

The core architecture leverages:

  • Quantized Low-Rank Adaptation (QLoRA): Combines 4-bit NormalFloat quantization with LoRA adapters
  • Custom GPU kernels: Written in OpenAI's Triton language and tuned for NVIDIA GPUs (bf16 training needs Ampere or newer)
  • Gradient checkpointing: Reduces memory by recomputing activations during backward pass
  • Flash Attention v2: Efficient attention computation for longer sequences

For Mistral models specifically, Unsloth handles the architectural nuances, including the sliding window attention mechanism and grouped-query attention. This means you don't need to configure these parameters manually: Unsloth detects them and applies its optimizations automatically.
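
To make the effect of 4-bit quantization concrete, here is a rough back-of-the-envelope sketch of weight memory at different precisions. The parameter count (about 7.24 billion for a Mistral 7B checkpoint) is an assumption, and the estimate covers weights only: quantization constants, LoRA adapters, optimizer state, and activations all add to the real footprint.

```python
def weight_memory_gib(n_params: int, bits_per_param: float) -> float:
    """Return the memory needed to store n_params weights, in GiB."""
    return n_params * bits_per_param / 8 / 1024**3

# Assumption: roughly 7.24 billion parameters for a Mistral 7B checkpoint.
N_PARAMS = 7_240_000_000

fp16_gib = weight_memory_gib(N_PARAMS, 16)  # 16-bit weights
nf4_gib = weight_memory_gib(N_PARAMS, 4)    # 4-bit weights, ignoring quantization constants

print(f"16-bit weights: {fp16_gib:.1f} GiB")
print(f"4-bit weights:  {nf4_gib:.1f} GiB")
```

The 4x reduction in weight storage is what lets the frozen base model fit on a consumer GPU while the small LoRA adapters stay in higher precision.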

Prerequisites and Environment Setup

Before beginning, ensure you have the following hardware and software requirements:

Hardware Requirements:

  • GPU with at least 8GB VRAM (16GB+ recommended for Mistral 7B)
  • 16GB+ system RAM
  • 50GB+ free disk space for model storage and datasets

Software Requirements:

  • Python 3.10 or newer
  • CUDA 11.8 or newer (12.1 recommended)
  • PyTorch 2.1.0 or newer

Let's set up our environment:

# Create a fresh Python environment
python -m venv unsloth_env
source unsloth_env/bin/activate  # On Windows: unsloth_env\Scripts\activate

# Install PyTorch with CUDA support
pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu121

# Install Unsloth and dependencies
pip install "unsloth[colab-new] @ git+https://github.com/unslothai/unsloth.git"
pip install --no-deps trl peft accelerate bitsandbytes

# Install additional utilities (transformers is reference [4])
pip install datasets transformers wandb

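Important version compatibility note: As of May 2026, Unsloth requires specific versions of transformers and accelerate, and the compatible set changes between releases, so check the Unsloth installation notes for your release. If you encounter import errors, pinning both packages to a known-compatible pair can help, for example: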

pip install transformers==4.36.0 accelerate==0.25.0
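
After installing, a quick sanity check can save a failed training run. The following is a minimal sketch using only the standard library; the package list mirrors the install commands above, and the helper name check_environment is hypothetical. It checks that packages can be found, not that their versions are compatible.

```python
import importlib.util
import sys

REQUIRED_PYTHON = (3, 10)
# Import names of the packages installed above.
PACKAGES = ["torch", "unsloth", "transformers", "trl", "peft",
            "accelerate", "bitsandbytes", "datasets", "wandb"]

def check_environment(packages=PACKAGES, min_python=REQUIRED_PYTHON):
    """Report whether Python is new enough and which packages are missing."""
    # find_spec locates a top-level package without importing it.
    missing = [name for name in packages
               if importlib.util.find_spec(name) is None]
    return {
        "python_ok": sys.version_info >= min_python,
        "missing": missing,
    }

report = check_environment()
print(report)
```

An empty "missing" list and python_ok set to True mean the basic requirements are in place.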

Preparing Your Dataset for Fine-Tuning

The quality of your fine-tuned model depends heavily on your dataset preparation. Unsloth works with the standard Hugging Face datasets library, but there are important considerations for Mistral models.

Dataset Format Requirements

Mistral models expect conversation data in a specific format. For instruction fine-tuning, use the following structure:

from datasets import load_dataset, Dataset
import json

def format_conversation(example):
    """
    Format conversations for Mistral's chat template.
    Mistral uses a specific token structure:
    <s>[INST] Instruction [/INST] Response</s>
    """
    system_prompt = example.get("system", "You are a helpful assistant.")
    instruction = example["instruction"]
    response = example["response"]

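    # Format for Mistral's chat template.
    # Note: many tokenizers prepend <s> (BOS) on their own. If yours does,
    # drop the literal <s> here to avoid a duplicated BOS token.
    formatted = f"<s>[INST] {system_prompt}\n\n{instruction} [/INST] {response}</s>"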

    return {"text": formatted}

# Load your dataset (example with custom JSON)
def load_custom_dataset(file_path):
    with open(file_path, 'r') as f:
        data = json.load(f)

    dataset = Dataset.from_list(data)
    dataset = dataset.map(format_conversation)
    return dataset

# Example usage
dataset = load_custom_dataset("training_data.json")
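
For reference, the loader above expects training_data.json to hold a single JSON array of records. The sketch below writes and re-reads such a file with hypothetical sample records; "instruction" and "response" are required by format_conversation, while "system" is optional.

```python
import json
import tempfile
from pathlib import Path

# Hypothetical records; the "system" key is optional.
records = [
    {"instruction": "Translate 'bonjour' to English.", "response": "Hello."},
    {"system": "You are a terse assistant.",
     "instruction": "What is 2 + 2?", "response": "4"},
]

# Write the records as one JSON array, the layout json.load expects.
path = Path(tempfile.mkdtemp()) / "training_data.json"
path.write_text(json.dumps(records, indent=2), encoding="utf-8")

# Read the file back and confirm every record carries the required keys.
with open(path, "r", encoding="utf-8") as f:
    loaded = json.load(f)

required = {"instruction", "response"}
bad = [i for i, rec in enumerate(loaded) if not required.issubset(rec)]
print(f"{len(loaded)} records, {len(bad)} missing required keys")
```

Checking the keys before calling Dataset.from_list gives a clearer error than a KeyError raised halfway through dataset.map.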

Data Quality Checks

Before training, validate your dataset:

def validate_dataset(dataset):
    """
    Check dataset for common issues that cause training failures.
    """
    issues = []

    # Check for empty responses
    empty_responses = [i for i, ex in enumerate(dataset) if len(ex["response"].strip()) == 0]
    if empty_responses:
        issues.append(f"Found {len(empty_responses)} examples with empty responses")

    # Check sequence length distribution.
    # Whitespace-split word counts are only a rough proxy for token counts;
    # use the tokenizer when you need exact lengths.
    lengths = [len(ex["text"].split()) for ex in dataset]
    avg_length = sum(lengths) / max(len(lengths), 1)
    max_length = max(lengths, default=0)

    print("Dataset statistics:")
    print(f"  Total examples: {len(dataset)}")
    print(f"  Average sequence length: {avg_length:.0f} words")
    print(f"  Maximum sequence length: {max_length} words")

    # Check for duplicate examples
    texts = [ex["text"] for ex in dataset]
    duplicates = len(texts) - len(set(texts))
    if duplicates > 0:
        issues.append(f"Found {duplicates} duplicate examples")

    return issues

# Run validation
issues = validate_dataset(dataset)
if issues:
    print("Warnings:")
    for issue in issues:
        print(f"  - {issue}")
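
The validator only reports problems. One possible way to act on them, sketched here on plain Python dictionaries before they are turned into a Dataset, is to drop blank responses and exact duplicates. The helper name clean_records is hypothetical.

```python
def clean_records(records):
    """Drop records with blank responses and exact duplicates, keeping order."""
    seen = set()
    cleaned = []
    for rec in records:
        response = rec.get("response", "").strip()
        if not response:
            continue  # empty or whitespace-only response
        key = (rec.get("system", ""), rec.get("instruction", ""), response)
        if key in seen:
            continue  # exact duplicate of an earlier record
        seen.add(key)
        cleaned.append(rec)
    return cleaned

sample = [
    {"instruction": "Say hi", "response": "Hi!"},
    {"instruction": "Say hi", "response": "Hi!"},   # duplicate
    {"instruction": "Say bye", "response": "   "},  # blank response
]
print(len(clean_records(sample)))  # 1
```

Near-duplicates (same meaning, different wording) are not caught by this exact-match approach and need a separate check.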

Core Implementation: Fine-Tuning Mistral with Unsloth

Now let's implement the actual fine-tuning process. This is where Unsloth's optimizations shine.

Loading and Quantizing the Model

import torch
from unsloth import FastLanguageModel
from transformers import TrainingArguments
from trl import SFTTrainer
import wandb

# Configuration
MODEL_NAME = "unsloth/mistral-7b-bnb-4bit"  # Pre-quantized Mistral 7B
MAX_SEQ_LENGTH = 4096  # Matches the 4096-token sliding window of Mistral 7B v0.1
LORA_RANK = 16  # LoRA rank - higher = more capacity, more memory
LORA_ALPHA = 32  # LoRA alpha scaling factor
LORA_DROPOUT = 0.1  # Dropout for regularization (Unsloth's fastest LoRA path expects 0)

# Load model with 4-bit quantization
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name=MODEL_NAME,
    max_seq_length=MAX_SEQ_LENGTH,
    dtype=None,  # Auto-detect best dtype
    load_in_4bit=True,  # Use 4-bit quantization
    device_map="auto",  # Distribute across available GPUs
)

# Add LoRA adapters
model = FastLanguageModel.get_peft_model(
    model,
    r=LORA_RANK,
    target_modules=[
        "q_proj", "k_proj", "v_proj", "o_proj",
        "gate_proj", "up_proj", "down_proj",
    ],
    lora_alpha=LORA_ALPHA,
    lora_dropout=LORA_DROPOUT,
    bias="none",  # Don't train bias terms
    use_gradient_checkpointing="unsloth",  # Use Unsloth's optimized checkpointing
    random_state=42,
    use_rslora=True,  # Use Rank-Stabilized LoRA
    loftq_config=None,  # Disable LoftQ (not needed with 4-bit)
)

print(f"Trainable parameters: {sum(p.numel() for p in model.parameters() if p.requires_grad):,}")
print(f"Total parameters: {sum(p.numel() for p in model.parameters()):,}")
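
The trainable parameter count printed above can be checked by hand. Each LoRA adapter on a linear layer with shape (in, out) adds rank × (in + out) parameters. The sketch below assumes the layer shapes of the public Mistral 7B config (hidden size 4096, MLP size 14336, 8 key/value heads of dimension 128, 32 layers); adjust them if your checkpoint differs.

```python
# Assumed Mistral 7B shapes (from the public model config).
HIDDEN = 4096          # hidden size
INTERMEDIATE = 14336   # MLP intermediate size
KV_DIM = 1024          # 8 key/value heads x 128 head dim (grouped-query attention)
NUM_LAYERS = 32

# (in_features, out_features) of each targeted linear layer.
TARGET_SHAPES = {
    "q_proj": (HIDDEN, HIDDEN),
    "k_proj": (HIDDEN, KV_DIM),
    "v_proj": (HIDDEN, KV_DIM),
    "o_proj": (HIDDEN, HIDDEN),
    "gate_proj": (HIDDEN, INTERMEDIATE),
    "up_proj": (HIDDEN, INTERMEDIATE),
    "down_proj": (INTERMEDIATE, HIDDEN),
}

def lora_param_count(rank, shapes=TARGET_SHAPES, num_layers=NUM_LAYERS):
    """Count LoRA parameters: A is (rank x in) and B is (out x rank) per layer."""
    per_layer = sum(rank * (d_in + d_out) for d_in, d_out in shapes.values())
    return per_layer * num_layers

print(f"{lora_param_count(16):,}")  # 41,943,040
```

At rank 16 this is well under 1% of the base model's parameters, and the count scales linearly with the rank.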

Configuring Training Arguments

The training configuration is critical for achieving good results. The setup below is a reasonable starting point; tune the batch size, learning rate, and step counts for your dataset and GPU:

# Initialize wandb for experiment tracking (optional but recommended)
wandb.init(
    project="mistral-fine-tuning",
    config={
        "model": MODEL_NAME,
        "learning_rate": 2e-4,
        "batch_size": 4,
        "gradient_accumulation": 4,
    }
)

training_args = TrainingArguments(
    output_dir="./mistral-finetuned",
    per_device_train_batch_size=4,  # Adjust based on GPU memory
    gradient_accumulation_steps=4,  # Effective batch size = 16
    warmup_steps=100,
    num_train_epochs=3,
    learning_rate=2e-4,
    fp16=not torch.cuda.is_bf16_supported(),  # Use fp16 if bf16 not available
    bf16=torch.cuda.is_bf16_supported(),  # Use bf16 if supported (Ampere+)
    logging_steps=10,
    save_steps=500,
    eval_steps=500,
    evaluation_strategy="steps",  # Named eval_strategy in newer transformers releases
    save_strategy="steps",
    load_best_model_at_end=True,
    metric_for_best_model="eval_loss",
    greater_is_better=False,
    push_to_hub=False,
    report_to="wandb",  # Log to wandb
    gradient_checkpointing=True,
    optim="adamw_8bit",  # Use 8-bit AdamW optimizer
    weight_decay=0.01,
    max_grad_norm=0.3,
    lr_scheduler_type="cosine",
    seed=42,
)
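
Several of these values interact: the effective batch size determines how many optimizer steps the run has, which in turn decides whether warmup_steps, save_steps, and eval_steps make sense. The sketch below approximates the step count and the warmup-plus-cosine learning rate curve. It is a simplified model with hypothetical helper names, not the Trainer's exact internals, and the 9,000-example dataset size is made up.

```python
import math

def optimizer_steps(num_examples, per_device_batch=4, grad_accum=4,
                    epochs=3, num_devices=1):
    """Approximate number of optimizer updates for a training run."""
    effective_batch = per_device_batch * grad_accum * num_devices
    steps_per_epoch = math.ceil(num_examples / effective_batch)
    return steps_per_epoch * epochs

def cosine_lr(step, total_steps, base_lr=2e-4, warmup_steps=100):
    """Linear warmup followed by a half-cosine decay to zero."""
    if step < warmup_steps:
        return base_lr * step / max(1, warmup_steps)
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return base_lr * 0.5 * (1.0 + math.cos(math.pi * progress))

total = optimizer_steps(num_examples=9_000)  # hypothetical dataset size
print(f"total optimizer steps: {total}")
for step in (0, 50, 100, total // 2, total):
    print(step, f"{cosine_lr(step, total):.2e}")
```

If the total comes out below 500, the save_steps=500 and eval_steps=500 settings above never trigger during training, so lower them for small datasets.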

Creating the Trainer and Starting Training

# Split dataset into train and validation
dataset = dataset.train_test_split(test_size=0.1, seed=42)
train_dataset = dataset["train"]
eval_dataset = dataset["test"]

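# Initialize the SFTTrainer. These keyword arguments match older trl releases;
# newer ones move dataset_text_field, max_seq_length, and packing into SFTConfig.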
trainer = SFTTrainer(
    model=model,
    tokenizer=tokenizer,
    train_dataset=train_dataset,
    eval_dataset=eval_dataset,
    dataset_text_field="text",
    max_seq_length=MAX_SEQ_LENGTH,
    dataset_num_proc=2,  # Number of processes for dataset preprocessing
    packing=False,  # Don't pack sequences (better for instruction tuning)
    args=training_args,
)

# Start training
print("Starting fine-tuning...")
trainer.train()

# Save the final model
trainer.save_model("./mistral-finetuned-final")
tokenizer.save_pretrained("./mistral-finetuned-final")

Handling Edge Cases and Production Considerations

Memory Management

When training on consumer GPUs, memory management is crucial. Here are strategies for handling common edge cases:

def optimize_memory_usage():
    """
    Apply memory optimization techniques for consumer GPUs.
    """
    import gc

    # Run Python garbage collection first, then release cached CUDA blocks
    gc.collect()
    torch.cuda.empty_cache()

    # Report whether Flash Attention 2 is installed (this only checks; it does not enable it)
    from transformers.utils import is_flash_attn_2_available
    if is_flash_attn_2_available():
        print("Flash Attention 2 available - using for memory efficiency")

    # Monitor memory usage
    if torch.cuda.is_available():
        allocated = torch.cuda.memory_allocated() / 1024**3
        reserved = torch.cuda.memory_reserved() / 1024**3
        print(f"GPU Memory - Allocated: {allocated:.2f}GB, Reserved: {reserved:.2f}GB")

# Call before training
optimize_memory_usage()
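
If training still runs out of memory, a common remedy is to lower per_device_train_batch_size and raise gradient_accumulation_steps so that the effective batch size stays the same. The helper below is a small sketch of that bookkeeping; split_batch is a hypothetical name, and the largest per-device batch that actually fits your GPU has to be found by trial.

```python
def split_batch(effective_batch, max_per_device):
    """Pick (per_device_batch, grad_accum) whose product equals effective_batch.

    Chooses the largest per-device batch that is at most max_per_device
    and divides effective_batch evenly.
    """
    for per_device in range(min(max_per_device, effective_batch), 0, -1):
        if effective_batch % per_device == 0:
            return per_device, effective_batch // per_device
    raise ValueError("effective_batch and max_per_device must be positive integers")

print(split_batch(16, 4))  # (4, 4)
print(split_batch(16, 3))  # (2, 8)
print(split_batch(16, 1))  # (1, 16)
```

Smaller per-device batches reduce activation memory at the cost of more forward and backward passes per optimizer step.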

Handling Long Sequences

This tutorial caps sequences at 4096 tokens (the MAX_SEQ_LENGTH set earlier, which matches the sliding-window size of Mistral 7B v0.1), so longer examples need to be truncated or split:

def truncate_or_split_long_sequences(dataset, max_length=4096):
    """
    Handle sequences that exceed the model's maximum context length.
    """
    from transformers import AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained("mistralai/Mistral-7B-v0.1")

    def process_example(example):
        # The text already contains <s> and </s>, so skip automatic special tokens
        tokens = tokenizer.encode(example["text"], add_special_tokens=False)

        if len(tokens) > max_length:
            # Option 1: Drop the middle, keeping the start of the prompt and the end of the response
            half = max_length // 2
            truncated_tokens = tokens[:half] + tokens[-half:]
            example["text"] = tokenizer.decode(truncated_tokens)

            # Option 2: Split into multiple examples (for document-level tasks)
            # chunks = [tokens[i:i+max_length] for i in range(0, len(tokens), max_length)]
            # return [{"text": tokenizer.decode(chunk)} for chunk in chunks]

        return example

    return dataset.map(process_example)

# Apply truncation (run this before splitting the dataset and starting training)
dataset = truncate_or_split_long_sequences(dataset)
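
The two strategies are easier to see on a plain list of token IDs, without a tokenizer. This is an illustrative sketch with hypothetical helper names. Unlike the max_length // 2 slicing above, it also handles odd limits by giving the extra token to the head.

```python
def truncate_middle(tokens, max_length):
    """Keep the first and last tokens, dropping the middle, so len <= max_length."""
    if len(tokens) <= max_length:
        return list(tokens)
    head = (max_length + 1) // 2  # head gets the extra token when max_length is odd
    tail = max_length - head
    # Guard tail == 0: tokens[-0:] would return the whole list.
    return list(tokens[:head]) + (list(tokens[-tail:]) if tail else [])

def split_chunks(tokens, max_length):
    """Split tokens into consecutive chunks of at most max_length tokens."""
    return [list(tokens[i:i + max_length])
            for i in range(0, len(tokens), max_length)]

ids = list(range(10))
print(truncate_middle(ids, 4))  # [0, 1, 8, 9]
print(truncate_middle(ids, 5))  # [0, 1, 2, 8, 9]
print(split_chunks(ids, 4))     # [[0, 1, 2, 3], [4, 5, 6, 7], [8, 9]]
```

For instruction data, splitting usually separates the instruction from its response, so middle truncation or dropping the example is often the safer choice.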

Resume Training from Checkpoint

Training can be interrupted. Here's how to resume:

def resume_training(trainer, checkpoint_path="./mistral-finetuned/checkpoint-500"):
    """
    Resume training from a saved checkpoint.

    Build `trainer` exactly as for the original run (same base model, LoRA
    configuration, datasets, and TrainingArguments). Do not add LoRA adapters
    a second time: the checkpoint already contains them.
    """
    # Restores adapter weights, optimizer and scheduler state, and the step
    # counter from the checkpoint directory, then continues training.
    # Pass resume_from_checkpoint=True instead to pick up the most recent
    # checkpoint in output_dir.
    trainer.train(resume_from_checkpoint=checkpoint_path)
    return trainer
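
When you don't know which checkpoint was written last, you can look it up from the output directory. The sketch below assumes the checkpoint-<step> directory naming used above and compares step numbers numerically (a plain string sort would rank checkpoint-500 after checkpoint-1000). transformers also ships a helper of this kind, get_last_checkpoint in transformers.trainer_utils; this standalone version is only for illustration.

```python
import re
import tempfile
from pathlib import Path

CHECKPOINT_RE = re.compile(r"^checkpoint-(\d+)$")

def latest_checkpoint(output_dir):
    """Return the checkpoint-<step> subdirectory with the highest step, or None."""
    best_step, best_path = -1, None
    for child in Path(output_dir).iterdir():
        match = CHECKPOINT_RE.match(child.name)
        if child.is_dir() and match and int(match.group(1)) > best_step:
            best_step, best_path = int(match.group(1)), child
    return best_path

# Demo on a throwaway directory.
demo_dir = Path(tempfile.mkdtemp())
for step in (500, 1000, 1500):
    (demo_dir / f"checkpoint-{step}").mkdir()

print(latest_checkpoint(demo_dir).name)  # checkpoint-1500
```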

Inference and Deployment

After fine-tuning, you need to properly load and use your model:

def load_finetuned_model(model_path="./mistral-finetuned-final"):
    """
    Load the fine-tuned model for inference.
    """
    # Load with 4-bit quantization for inference
    model, tokenizer = FastLanguageModel.from_pretrained(
        model_name=model_path,
        max_seq_length=MAX_SEQ_LENGTH,
        dtype=None,
        load_in_4bit=True,
    )

    # Enable faster inference
    FastLanguageModel.for_inference(model)

    return model, tokenizer

def generate_response(model, tokenizer, instruction, system_prompt="You are a helpful assistant.", max_new_tokens=512):
    """
    Generate a response using the fine-tuned model.
    """
    # Format the prompt for Mistral
    prompt = f"<s>[INST] {system_prompt}\n\n{instruction} [/INST]"

    # Tokenize. The prompt already starts with <s>, so don't add a second BOS.
    inputs = tokenizer(prompt, return_tensors="pt", add_special_tokens=False).to(model.device)

    # Generate
    with torch.no_grad():
        outputs = model.generate(
            **inputs,
            max_new_tokens=max_new_tokens,
            temperature=0.7,
            top_p=0.95,
            top_k=50,
            repetition_penalty=1.1,
            do_sample=True,
            pad_token_id=tokenizer.eos_token_id,
        )

    # Decode only the newly generated tokens, skipping the echoed prompt
    prompt_length = inputs["input_ids"].shape[1]
    response = tokenizer.decode(outputs[0][prompt_length:], skip_special_tokens=True).strip()

    return response

# Example usage
model, tokenizer = load_finetuned_model()
response = generate_response(
    model, 
    tokenizer, 
    "Explain the concept of gradient descent in simple terms."
)
print(response)
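
The prompt used at inference time must match the prefix of the training text exactly; a stray space or a different system prompt layout can degrade quality. One way to guard against drift, sketched here with hypothetical helper names, is to build both strings from the same function and check the prefix relationship.

```python
DEFAULT_SYSTEM = "You are a helpful assistant."

def build_prompt(instruction, system_prompt=DEFAULT_SYSTEM):
    """Prompt given to the model at inference time."""
    return f"<s>[INST] {system_prompt}\n\n{instruction} [/INST]"

def build_training_text(instruction, response, system_prompt=DEFAULT_SYSTEM):
    """Full training example: the inference prompt plus the target response."""
    return f"{build_prompt(instruction, system_prompt)} {response}</s>"

def extract_response(decoded, marker="[/INST]"):
    """Return the text after the last marker, or the whole text if absent."""
    return decoded.rsplit(marker, 1)[-1].strip()

prompt = build_prompt("What is LoRA?")
text = build_training_text("What is LoRA?", "A low-rank adapter method.")

print(text.startswith(prompt))                 # True
print(extract_response(text[:-len("</s>")]))   # A low-rank adapter method.
```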

Performance Optimization and Monitoring

Benchmarking Your Fine-Tuned Model

def benchmark_model(model, tokenizer, test_prompts):
    """
    Benchmark inference speed and memory usage.
    """
    import time

    results = []

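    for prompt in test_prompts:
        # Reset the peak-memory counter so each prompt is measured on its own
        if torch.cuda.is_available():
            torch.cuda.reset_peak_memory_stats()

        # Measure inference time
        start_time = time.time()

        response = generate_response(model, tokenizer, prompt, max_new_tokens=256)

        inference_time = time.time() - start_time

        # Measure memory
        if torch.cuda.is_available():
            memory_used = torch.cuda.max_memory_allocated() / 1024**3
        else:
            memory_used = 0

        # Count generated tokens; generation may stop before max_new_tokens
        generated_tokens = len(tokenizer.encode(response, add_special_tokens=False))

        results.append({
            "prompt": prompt[:50] + "...",
            "response_length": len(response),  # in characters
            "inference_time": inference_time,
            "memory_gb": memory_used,
            "tokens_per_second": generated_tokens / inference_time,
        })

        if torch.cuda.is_available():
            torch.cuda.empty_cache()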

    return results

# Run benchmark
test_prompts = [
    "What is machine learning?",
    "Write a Python function to sort a list.",
    "Explain quantum computing.",
]

benchmark_results = benchmark_model(model, tokenizer, test_prompts)
for result in benchmark_results:
    print(f"Prompt: {result['prompt']}")
    print(f"  Time: {result['inference_time']:.2f}s")
    print(f"  Speed: {result['tokens_per_second']:.1f} tokens/s")
    print(f"  Memory: {result['memory_gb']:.2f}GB")
    print()
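
Per-prompt numbers are noisy, so it helps to aggregate them. The sketch below summarizes a list of result dictionaries shaped like the ones benchmark_model returns; the measurements in it are made up purely for illustration.

```python
import statistics

def summarize(results):
    """Aggregate per-prompt benchmark results into a few summary figures."""
    times = [r["inference_time"] for r in results]
    speeds = [r["tokens_per_second"] for r in results]
    return {
        "runs": len(results),
        "mean_time_s": statistics.mean(times),
        "median_time_s": statistics.median(times),
        "mean_tokens_per_s": statistics.mean(speeds),
        "peak_memory_gb": max(r["memory_gb"] for r in results),
    }

# Hypothetical measurements, for illustration only.
fake_results = [
    {"inference_time": 2.0, "tokens_per_second": 100.0, "memory_gb": 5.0},
    {"inference_time": 4.0, "tokens_per_second": 60.0, "memory_gb": 5.5},
    {"inference_time": 3.0, "tokens_per_second": 80.0, "memory_gb": 5.2},
]
summary = summarize(fake_results)
print(summary)
```

The first generation call often includes one-time warmup cost, so consider running a throwaway prompt before collecting numbers.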

What's Next

After successfully fine-tuning your Mistral model, consider these next steps:

  1. Merge and Export: Use model.save_pretrained_merged() to merge the LoRA adapters into the base weights and save a standalone model for deployment
  2. Quantization for Production: Convert to GGUF format for llama.cpp [8] deployment
  3. RLHF Alignment: Apply reinforcement learning from human feedback to further improve responses
  4. A/B Testing: Deploy both base and fine-tuned models to compare performance

For more advanced techniques, explore our guides on model optimization and production deployment.

The combination of Unsloth's memory-efficient training and Mistral's architecture makes it possible to fine-tune on consumer hardware, work that previously required expensive cloud GPUs. By following this tutorial, you've built a fine-tuning pipeline that can be adapted to various domains and use cases.

Remember to monitor your training with wandb, validate your dataset quality, and always test your model on edge cases before deployment. The techniques covered here form the foundation for building specialized language models that can transform your specific domain applications.


References

1. Transformers. Wikipedia.
2. Llama. Wikipedia.
3. PyTorch. Wikipedia.
4. huggingface/transformers. GitHub.
5. meta-llama/llama. GitHub.
6. pytorch/pytorch. GitHub.
7. Shubhamsaboo/awesome-llm-apps. GitHub.
8. LlamaIndex Pricing.