
How to Build CI/CD for ML with GitHub Actions, DVC, and MLflow

Practical tutorial: CI/CD for ML: GitHub Actions + DVC + MLflow

BlogIA Academy · June 3, 2026 · 9 min read · 1,695 words


Building production machine learning pipelines requires more than just training accurate models. You need automated testing, version control for data and models, experiment tracking, and reproducible deployments. This tutorial walks through implementing a complete CI/CD pipeline for ML using GitHub Actions, DVC (Data Version Control), and MLflow.

According to recent research on CI/CD security patterns, "modern CI/CD pipelines must decouple identity from access to maintain security at scale" [3]. While security is critical, this tutorial focuses on the core ML pipeline automation that makes those security patterns meaningful.

Understanding the ML CI/CD Architecture

Before writing code, let's understand what we're building. The pipeline connects three core tools:

  • GitHub Actions: Orchestrates the CI/CD workflow, running jobs on triggers like pull requests and merges
  • DVC: Tracks datasets and model files in cloud storage [1] (S3, GCS, or Azure Blob), keeping Git repositories lean
  • MLflow: Logs experiments, metrics, and model artifacts, providing a central registry

The architecture follows a standard pattern: data scientists push code changes → GitHub Actions triggers training → DVC tracks data versions → MLflow logs experiments → validated models get promoted to production.

Research on zero-trust CI/CD emphasizes that "workload identity must be established through SPIFFE-based authentication rather than static secrets" [2]. While we use GitHub secrets for simplicity, production deployments should consider these identity patterns.

Prerequisites and Environment Setup

You'll need:

  • Python 3.9+
  • GitHub account with repository access
  • AWS S3 bucket (or equivalent cloud storage)
  • MLflow tracking server (can use local or Databricks Community Edition)

Install the required packages:

pip install dvc dvc-s3 mlflow scikit-learn pandas numpy pytest

Initialize DVC in your project:

git init
dvc init
dvc remote add -d myremote s3://your-bucket/dvc-store

Configure MLflow tracking:

export MLFLOW_TRACKING_URI="http://localhost:5000"
# Or use Databricks: export MLFLOW_TRACKING_URI="databricks"

Implementing the ML Pipeline with DVC and MLflow

Let's build a complete pipeline that trains a classifier on the Iris dataset. This demonstrates the core patterns you'll adapt for real projects.

Step 1: Define the DVC Pipeline

Create dvc.yaml to define stages:

stages:
  prepare:
    cmd: python src/prepare.py
    deps:
      - data/raw/iris.csv
    outs:
      - data/processed/train.csv
      - data/processed/test.csv
  train:
    cmd: python src/train.py
    deps:
      - data/processed/train.csv
      - src/train.py
    outs:
      - models/model.pkl
    metrics:
      - metrics/train_metrics.json:
          cache: false
  evaluate:
    cmd: python src/evaluate.py
    deps:
      - models/model.pkl
      - data/processed/test.csv
    metrics:
      - metrics/eval_metrics.json:
          cache: false

Step 2: Write the Training Script with MLflow

The training script logs experiments to MLflow and saves model artifacts:

# src/train.py
import pandas as pd
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score
import mlflow
import mlflow.sklearn
import json
import os

def train_model():
    # Load processed data
    train_data = pd.read_csv('data/processed/train.csv')
    X_train = train_data.drop('target', axis=1)
    y_train = train_data['target']

    # Set MLflow experiment
    mlflow.set_experiment("iris_classifier")

    with mlflow.start_run() as run:
        # Log hyperparameters
        params = {
            "n_estimators": 100,
            "max_depth": 5,
            "random_state": 42
        }
        mlflow.log_params(params)

        # Train model
        model = RandomForestClassifier(**params)
        model.fit(X_train, y_train)

        # Evaluate on training data
        train_preds = model.predict(X_train)
        train_metrics = {
            "train_accuracy": accuracy_score(y_train, train_preds),
            "train_precision": precision_score(y_train, train_preds, average='weighted'),
            "train_recall": recall_score(y_train, train_preds, average='weighted'),
            "train_f1": f1_score(y_train, train_preds, average='weighted')
        }
        mlflow.log_metrics(train_metrics)

        # Save model
        mlflow.sklearn.log_model(model, "model")

        # Save metrics for DVC
        os.makedirs('metrics', exist_ok=True)
        with open('metrics/train_metrics.json', 'w') as f:
            json.dump(train_metrics, f)

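        # Save model locally for DVC tracking.
        # Note: save_model writes an MLflow model directory at this path (not a
        # single pickle file) and raises if the path already exists and is non-empty;
        # `dvc repro` clears stage outputs before running the command.
        os.makedirs('models', exist_ok=True)
        mlflow.sklearn.save_model(model, 'models/model.pkl')

        print(f"Run ID: {run.info.run_id}")
        print(f"Training accuracy: {train_metrics['train_accuracy']:.4f}")

        # Return the fitted model so callers (e.g. unit tests) can inspect it
        return model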

if __name__ == "__main__":
    train_model()
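
When training runs in CI, it can help to tie each MLflow run back to the commit that produced it. The sketch below collects a few of the default environment variables that GitHub Actions sets (`GITHUB_SHA`, `GITHUB_REF_NAME`, `GITHUB_RUN_ID`) into a tag dictionary. The helper name `build_run_tags` and the tag keys are hypothetical choices for illustration.

```python
import os


def build_run_tags(env=None):
    """Collect CI metadata to attach to an MLflow run as tags."""
    env = os.environ if env is None else env
    candidates = {
        "git_sha": env.get("GITHUB_SHA"),
        "git_ref": env.get("GITHUB_REF_NAME"),
        "ci_run_id": env.get("GITHUB_RUN_ID"),
    }
    # Drop anything that is unset, e.g. when running on a laptop
    return {key: value for key, value in candidates.items() if value}


# Inside train_model(), within the mlflow.start_run() block, one could call:
#     mlflow.set_tags(build_run_tags())
```

Because unset variables are dropped, the same call works locally (producing no tags) and in CI.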

Step 3: Create the Evaluation Script

# src/evaluate.py
import pandas as pd
import numpy as np
import mlflow
import mlflow.sklearn
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score
import json
import os

def evaluate_model():
    # Load test data
    test_data = pd.read_csv('data/processed/test.csv')
    X_test = test_data.drop('target', axis=1)
    y_test = test_data['target']

    # Load the model directory saved locally by the train stage
    # In production, load from the MLflow Model Registry instead
    model = mlflow.sklearn.load_model('models/model.pkl')

    # Make predictions
    test_preds = model.predict(X_test)

    # Calculate metrics
    eval_metrics = {
        "test_accuracy": accuracy_score(y_test, test_preds),
        "test_precision": precision_score(y_test, test_preds, average='weighted'),
        "test_recall": recall_score(y_test, test_preds, average='weighted'),
        "test_f1": f1_score(y_test, test_preds, average='weighted')
    }

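    # Log to MLflow in the same experiment as training, and attach the model so
    # the run carrying test metrics also has a "model" artifact to register
    mlflow.set_experiment("iris_classifier")
    with mlflow.start_run():  # Creates a new run
        mlflow.log_metrics(eval_metrics)
        mlflow.sklearn.log_model(model, "model")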

    # Save for DVC
    os.makedirs('metrics', exist_ok=True)
    with open('metrics/eval_metrics.json', 'w') as f:
        json.dump(eval_metrics, f)

    print(f"Test accuracy: {eval_metrics['test_accuracy']:.4f}")

    # Check if model meets threshold
    if eval_metrics['test_accuracy'] < 0.8:
        raise ValueError(f"Model accuracy {eval_metrics['test_accuracy']:.4f} below threshold 0.8")

    # Return the metrics so callers (e.g. unit tests) can inspect them
    return eval_metrics

if __name__ == "__main__":
    evaluate_model()

Configuring GitHub Actions for ML Workflows

The GitHub Actions workflow orchestrates the entire pipeline. It runs on pushes to main and develop and on pull requests targeting main, so code changes are checked before they can break the model.

Step 4: Create the CI/CD Workflow

Create .github/workflows/ml_pipeline.yml:

name: ML Pipeline CI/CD

on:
  push:
    branches: [main, develop]
  pull_request:
    branches: [main]

env:
  MLFLOW_TRACKING_URI: ${{ secrets.MLFLOW_TRACKING_URI }}
  AWS_ACCESS_KEY_ID: ${{ secrets.AWS_ACCESS_KEY_ID }}
  AWS_SECRET_ACCESS_KEY: ${{ secrets.AWS_SECRET_ACCESS_KEY }}

jobs:
  test-and-train:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
        with:
          lfs: true

      - name: Set up Python
        uses: actions/setup-python@v5
        with:
          python-version: '3.10'
          cache: 'pip'

      - name: Install dependencies
        run: |
          python -m pip install --upgrade pip
          pip install dvc dvc-s3 mlflow scikit-learn pandas numpy pytest

      - name: Pull DVC data
        run: |
          dvc pull --remote myremote

      - name: Run tests
        run: |
          pytest tests/ -v

      - name: Reproduce pipeline
        run: |
          dvc repro

      - name: Push updated data and models
        if: github.ref == 'refs/heads/main'
        run: |
          dvc push --remote myremote
          git config user.name github-actions
          git config user.email github-actions@github.com
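          # Pipeline outputs are recorded in dvc.lock; metrics files use cache: false, so Git tracks them
          git add dvc.lock metrics/
          # Commit only when something actually changed, otherwise git commit fails the step
          git diff --cached --quiet || git commit -m "Update DVC artifacts [skip ci]"
          git push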

      - name: Register model in MLflow
        if: github.ref == 'refs/heads/main'
        run: |
          python src/register_model.py

Step 5: Model Registration Script

Create src/register_model.py to promote validated models:

# src/register_model.py
import mlflow
from mlflow.tracking import MlflowClient

def register_best_model():
    client = MlflowClient()

    # Get the best run from the experiment
    experiment = client.get_experiment_by_name("iris_classifier")
    runs = client.search_runs(
        experiment_ids=[experiment.experiment_id],
        order_by=["metrics.test_accuracy DESC"],
        max_results=1
    )

    if not runs:
        print("No runs found")
        return

    best_run = runs[0]
    run_id = best_run.info.run_id

    # Register the model
    model_uri = f"runs:/{run_id}/model"
    model_name = "iris_classifier_prod"

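    try:
        # Create registered model if it doesn't exist
        client.create_registered_model(model_name)
    except mlflow.exceptions.MlflowException:
        # Most likely the model already exists; create_model_version below
        # will surface any real registry problem
        pass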

    # Create a new version
    result = client.create_model_version(
        name=model_name,
        source=model_uri,
        run_id=run_id
    )

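    # Transition to Staging (model stages are deprecated as of MLflow 2.9;
    # newer setups use aliases via client.set_registered_model_alias)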
    client.transition_model_version_stage(
        name=model_name,
        version=result.version,
        stage="Staging"
    )

    print(f"Registered model {model_name} version {result.version} from run {run_id}")

if __name__ == "__main__":
    register_best_model()

Handling Edge Cases and Production Considerations

Data Versioning Conflicts

When multiple team members push data changes simultaneously, DVC can encounter merge conflicts in .dvc files. Mitigate this by:

  1. Using DVC's lock file (dvc.lock) which is machine-generated
  2. Implementing branch-specific data stores
  3. Running dvc pull before dvc repro in CI
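
One way to read item 2 (branch-specific data stores) is to give each Git branch its own prefix under the shared bucket. The sketch below derives such a URL. The helper name `branch_remote_url` and the sanitizing rule are assumptions for illustration; DVC itself does not prescribe a layout.

```python
import re


def branch_remote_url(base_url, branch):
    """Return a per-branch DVC remote URL under a shared base prefix."""
    # Replace characters that are awkward in object-store keys (e.g. '/')
    safe_branch = re.sub(r"[^A-Za-z0-9._-]+", "-", branch).strip("-")
    if not safe_branch:
        raise ValueError(f"cannot derive a remote path from branch {branch!r}")
    return f"{base_url.rstrip('/')}/{safe_branch}"


print(branch_remote_url("s3://your-bucket/dvc-store", "feature/new-data"))
# prints s3://your-bucket/dvc-store/feature-new-data
```

In CI, the resulting URL could be applied with `dvc remote modify myremote url <url>` before pulling. The trade-off is that cached data is no longer shared across branches, so storage use grows with the number of active branches.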

MLflow Tracking Server Availability

If your MLflow server goes down, the pipeline should degrade gracefully:

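import json
import os

import mlflow
from mlflow.exceptions import MlflowException

# Placeholder; in the pipeline this is the metrics dict computed during training
metrics = {"train_accuracy": 0.0}

try:
    mlflow.set_tracking_uri(os.environ.get('MLFLOW_TRACKING_URI'))
    with mlflow.start_run():
        mlflow.log_metrics(metrics)
except MlflowException as e:
    print(f"MLflow unavailable: {e}")
    # Fall back to local logging so the metrics are not lost
    os.makedirs('metrics', exist_ok=True)
    with open('metrics/fallback_metrics.json', 'w') as f:
        json.dump(metrics, f)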

Model Performance Regression

Fail the pipeline when a new model underperforms the baseline, so a regressed model is never promoted:

# In evaluate.py
def check_model_regression(new_metrics, baseline_metrics):
    """Check if new model regresses by more than 5%"""
    for metric in ['test_accuracy', 'test_f1']:
        if new_metrics[metric] < baseline_metrics[metric] * 0.95:
            raise ValueError(f"Model regressed on {metric}")
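
The check above needs baseline metrics from somewhere. A minimal sketch, assuming the baseline is kept in a JSON file next to the metrics DVC already tracks: the `gate` and `load_metrics` helpers, the `tolerance` parameter, and the baseline path used in the note below are hypothetical, and the regression function is repeated so the block runs on its own.

```python
import json
from pathlib import Path


def check_model_regression(new_metrics, baseline_metrics, tolerance=0.05):
    """Raise if any tracked metric drops more than `tolerance` (relative)."""
    for metric in ["test_accuracy", "test_f1"]:
        floor = baseline_metrics[metric] * (1 - tolerance)
        if new_metrics[metric] < floor:
            raise ValueError(
                f"Model regressed on {metric}: "
                f"{new_metrics[metric]:.4f} < {floor:.4f}"
            )


def load_metrics(path):
    """Read a metrics JSON file such as metrics/eval_metrics.json."""
    return json.loads(Path(path).read_text())


def gate(new_path, baseline_path):
    """Skip the check on the first run, when no baseline exists yet."""
    if not Path(baseline_path).exists():
        return "no-baseline"
    check_model_regression(load_metrics(new_path), load_metrics(baseline_path))
    return "ok"
```

A call such as `gate("metrics/eval_metrics.json", "metrics/baseline_metrics.json")` at the end of evaluate.py would then fail the DVC stage, and with it the CI job, whenever the new model falls more than 5% below the stored baseline.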

Cost Optimization for CI/CD

Running full training on every push is expensive. Optimize by:

  1. Using GitHub Actions caching for pip dependencies
  2. Running lightweight tests on PRs, full training on merge
  3. Using spot instances for training jobs

# Cache pip dependencies
- name: Cache pip
  uses: actions/cache@v4
  with:
    path: ~/.cache/pip
    key: ${{ runner.os }}-pip-${{ hashFiles('requirements.txt') }}
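
Item 2 in the list (lightweight tests on PRs, full training on merge) can also be decided inside a script rather than in workflow YAML. The sketch below reads two default GitHub Actions environment variables, `GITHUB_EVENT_NAME` and `GITHUB_REF`. The helper name `should_run_full_training` and the choice of `main` as the only full-training branch are assumptions.

```python
import os


def should_run_full_training(env=None):
    """Full training only for pushes to main; everything else stays light."""
    env = os.environ if env is None else env
    return (
        env.get("GITHUB_EVENT_NAME") == "push"
        and env.get("GITHUB_REF") == "refs/heads/main"
    )
```

A small entry-point script could use this to choose between running only the unit tests and running `dvc repro`. It mirrors the `if: github.ref == 'refs/heads/main'` condition the workflow already applies to its push and registration steps.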

Testing the Complete Pipeline

Create unit tests for your pipeline components:

# tests/test_pipeline.py
import pytest
import pandas as pd
import numpy as np
from sklearn.ensemble import RandomForestClassifier

def test_data_preparation():
    """Test that data splits are correct"""
    from src.prepare import prepare_data
    train, test = prepare_data()
    assert len(train) > 0
    assert len(test) > 0
    assert train.shape[1] == test.shape[1]

def test_model_training():
    """Test model trains without errors"""
    from src.train import train_model
    model = train_model()
    assert hasattr(model, 'predict')
    assert hasattr(model, 'predict_proba')

def test_model_evaluation():
    """Test evaluation metrics are reasonable"""
    from src.evaluate import evaluate_model
    metrics = evaluate_model()
    assert 0 <= metrics['test_accuracy'] <= 1
    assert 0 <= metrics['test_f1'] <= 1

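Run tests locally (the --cov flag requires the pytest-cov plugin, which is not in the install command from the setup section, so add it with pip install pytest-cov):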

pytest tests/ -v --cov=src
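
The tests above need the DVC-tracked data files on disk. For a faster check that runs without any data, one option is to validate the shape of the metrics the pipeline emits. This is a sketch: `validate_metrics` is a hypothetical helper, and the metric values in the test are made-up inputs, not results from this pipeline.

```python
import json
import math


def validate_metrics(metrics, required=("test_accuracy", "test_f1")):
    """Return a list of problems found in a metrics dict (empty if fine)."""
    problems = []
    for name in required:
        value = metrics.get(name)
        if value is None:
            problems.append(f"missing metric: {name}")
        elif not isinstance(value, (int, float)) or math.isnan(value):
            problems.append(f"non-numeric metric: {name}")
        elif not 0 <= value <= 1:
            problems.append(f"out-of-range metric: {name}={value}")
    return problems


def test_metrics_file_is_sane():
    # In the real pipeline this would read metrics/eval_metrics.json
    metrics = json.loads('{"test_accuracy": 0.93, "test_f1": 0.92}')
    assert validate_metrics(metrics) == []
```

Returning a list of problems instead of raising on the first one makes the failure message in CI show everything that is wrong with the file at once.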

Monitoring and Maintenance

After deployment, monitor:

  1. Pipeline success rate: Track failed runs in GitHub Actions
  2. Model drift: Compare production metrics against baseline
  3. Data staleness: Alert when training data exceeds age threshold
  4. Infrastructure costs: Monitor S3 storage and MLflow compute
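
The data staleness check in item 3 can be as small as a timestamp comparison. A minimal sketch, assuming the pipeline records when the training data was last refreshed as a timezone-aware datetime: the 30-day default and the helper name `is_stale` are assumptions.

```python
from datetime import datetime, timedelta, timezone


def is_stale(last_updated, max_age_days=30, now=None):
    """True when the data is older than the allowed age."""
    now = datetime.now(timezone.utc) if now is None else now
    return now - last_updated > timedelta(days=max_age_days)
```

The `now` parameter exists so the check can be unit-tested with fixed dates. In a scheduled job, a `True` result could trigger the alerting step or a retraining run.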

Set up alerts in GitHub Actions:

- name: Notify on failure
  if: failure()
  uses: slackapi/slack-github-action@v1.24.0
  with:
    payload: |
      {
        "text": "ML Pipeline failed in ${{ github.repository }}"
      }
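  env:
    SLACK_WEBHOOK_URL: ${{ secrets.SLACK_WEBHOOK }}
    # Needed by v1 of this action when the secret holds an incoming-webhook URL
    SLACK_WEBHOOK_TYPE: INCOMING_WEBHOOK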

What's Next

This pipeline provides a foundation for ML CI/CD. To extend it:

  1. Add model validation gates: Require human approval before promoting to production
  2. Implement A/B testing: Deploy multiple model versions simultaneously
  3. Integrate feature stores: Use tools like Feast for feature management
  4. Add automated retraining: Schedule periodic retraining based on data freshness

For production deployments, consider the zero-trust patterns described in recent research: "decoupling identity from access through credential broker patterns enables secure, auditable CI/CD pipelines" [3]. This becomes critical when your ML pipeline accesses sensitive data or deploys models to production environments.

The pipeline we built handles the core ML workflow: data versioning, experiment tracking, automated testing, and model registration. Adapt the patterns to your specific use case, whether it's computer vision, NLP, or tabular data. The key is maintaining reproducibility and automation throughout the ML lifecycle.


References

1. Rag. Wikipedia.
2. Shubhamsaboo/awesome-llm-apps. GitHub.