How to Build CI/CD for ML with GitHub Actions, DVC, and MLflow
Practical tutorial: CI/CD for ML: GitHub Actions + DVC + MLflow
Building production machine learning pipelines requires more than just training accurate models. You need automated testing, version control for data and models, experiment tracking, and reproducible deployments. This tutorial walks through implementing a complete CI/CD pipeline for ML using GitHub Actions, DVC (Data Version Control), and MLflow.
According to recent research on CI/CD security patterns, "modern CI/CD pipelines must decouple identity from access to maintain security at scale" [3]. While security is critical, this tutorial focuses on the core ML pipeline automation that makes those security patterns meaningful.
Understanding the ML CI/CD Architecture
Before writing code, let's understand what we're building. The pipeline connects three core tools:
- GitHub Actions: Orchestrates the CI/CD workflow, running jobs on triggers like pull requests and merges
- DVC: Tracks datasets and model files in cloud storage [1] (S3, GCS, or Azure Blob), keeping Git repositories lean
- MLflow: Logs experiments, metrics, and model artifacts, providing a central registry
The architecture follows a standard pattern: data scientists push code changes → GitHub Actions triggers training → DVC tracks data versions → MLflow logs experiments → validated models get promoted to production.
Research on zero-trust CI/CD emphasizes that "workload identity must be established through SPIFFE-based authentication rather than static secrets" [2]. While we use GitHub secrets for simplicity, production deployments should consider these identity patterns.
Prerequisites and Environment Setup
You'll need:
- Python 3.9+
- GitHub account with repository access
- AWS S3 bucket (or equivalent cloud storage)
- MLflow tracking server (can use local or Databricks Community Edition)
Install the required packages:
pip install dvc dvc-s3 mlflow scikit-learn pandas numpy pytest
Initialize DVC in your project:
git init
dvc init
dvc remote add -d myremote s3://your-bucket/dvc-store
Configure MLflow tracking:
export MLFLOW_TRACKING_URI="http://localhost:5000"
# Or use Databricks: export MLFLOW_TRACKING_URI="databricks"
Implementing the ML Pipeline with DVC and MLflow
Let's build a complete pipeline that trains a classifier on the Iris dataset. This demonstrates the core patterns you'll adapt for real projects.
Step 1: Define the DVC Pipeline
Create dvc.yaml to define stages:
stages:
  prepare:
    cmd: python src/prepare.py
    deps:
      - data/raw/iris.csv
      - src/prepare.py
    outs:
      - data/processed/train.csv
      - data/processed/test.csv
  train:
    cmd: python src/train.py
    deps:
      - data/processed/train.csv
      - src/train.py
    outs:
      - models/model.pkl
    metrics:
      - metrics/train_metrics.json:
          cache: false
  evaluate:
    cmd: python src/evaluate.py
    deps:
      - models/model.pkl
      - data/processed/test.csv
      - src/evaluate.py
    metrics:
      - metrics/eval_metrics.json:
          cache: false
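The prepare stage runs src/prepare.py, which this tutorial does not list. Below is a minimal sketch of what it might look like, assuming the raw CSV already contains a target column (as train.py expects) and that a plain shuffled 80/20 split is acceptable. The function name prepare_data matches the import used in the unit tests; the parameters and their defaults are illustrative assumptions, not part of the original pipeline.

```python
# src/prepare.py (hypothetical sketch; the original tutorial does not show this file)
import os

import pandas as pd


def prepare_data(raw_path='data/raw/iris.csv', out_dir='data/processed',
                 test_fraction=0.2, seed=42):
    """Shuffle the raw CSV and write train/test splits for the later stages."""
    df = pd.read_csv(raw_path)

    # Shuffle deterministically so repeated runs produce the same split
    shuffled = df.sample(frac=1.0, random_state=seed).reset_index(drop=True)

    # The first n_test shuffled rows become the test set, the rest the train set
    n_test = int(round(len(shuffled) * test_fraction))
    test = shuffled.iloc[:n_test]
    train = shuffled.iloc[n_test:]

    os.makedirs(out_dir, exist_ok=True)
    train.to_csv(os.path.join(out_dir, 'train.csv'), index=False)
    test.to_csv(os.path.join(out_dir, 'test.csv'), index=False)
    return train, test
```

To run it as a DVC stage, the file would also need the usual entry point that calls prepare_data() when executed as a script. A stratified split may be preferable for small or imbalanced datasets; this sketch keeps the split deliberately simple.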
Step 2: Write the Training Script with MLflow
The training script logs experiments to MLflow and saves model artifacts:
# src/train.py
import json
import os
import shutil

import mlflow
import mlflow.sklearn
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score


def train_model():
    # Load processed data
    train_data = pd.read_csv('data/processed/train.csv')
    X_train = train_data.drop('target', axis=1)
    y_train = train_data['target']

    # Set MLflow experiment
    mlflow.set_experiment("iris_classifier")

    with mlflow.start_run() as run:
        # Log hyperparameters
        params = {
            "n_estimators": 100,
            "max_depth": 5,
            "random_state": 42
        }
        mlflow.log_params(params)

        # Train model
        model = RandomForestClassifier(**params)
        model.fit(X_train, y_train)

        # Evaluate on training data
        train_preds = model.predict(X_train)
        train_metrics = {
            "train_accuracy": accuracy_score(y_train, train_preds),
            "train_precision": precision_score(y_train, train_preds, average='weighted'),
            "train_recall": recall_score(y_train, train_preds, average='weighted'),
            "train_f1": f1_score(y_train, train_preds, average='weighted')
        }
        mlflow.log_metrics(train_metrics)

        # Log the model as a run artifact
        mlflow.sklearn.log_model(model, "model")

        # Save metrics for DVC
        os.makedirs('metrics', exist_ok=True)
        with open('metrics/train_metrics.json', 'w') as f:
            json.dump(train_metrics, f)

        # Save model locally for DVC tracking. save_model writes an MLflow model
        # directory at this path (not a single pickle file) and refuses to
        # overwrite existing content, so clear any previous copy first.
        os.makedirs('models', exist_ok=True)
        shutil.rmtree('models/model.pkl', ignore_errors=True)
        mlflow.sklearn.save_model(model, 'models/model.pkl')

        print(f"Run ID: {run.info.run_id}")
        print(f"Training accuracy: {train_metrics['train_accuracy']:.4f}")

    # Return the fitted model so callers (such as the unit tests) can inspect it
    return model


if __name__ == "__main__":
    train_model()
Step 3: Create the Evaluation Script
# src/evaluate.py
import json
import os

import mlflow
import mlflow.sklearn
import pandas as pd
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score


def evaluate_model():
    # Load test data
    test_data = pd.read_csv('data/processed/test.csv')
    X_test = test_data.drop('target', axis=1)
    y_test = test_data['target']

    # Load the model directory written by train.py
    # In production, load from the MLflow Model Registry instead
    model = mlflow.sklearn.load_model('models/model.pkl')

    # Make predictions
    test_preds = model.predict(X_test)

    # Calculate metrics
    eval_metrics = {
        "test_accuracy": accuracy_score(y_test, test_preds),
        "test_precision": precision_score(y_test, test_preds, average='weighted'),
        "test_recall": recall_score(y_test, test_preds, average='weighted'),
        "test_f1": f1_score(y_test, test_preds, average='weighted')
    }

    # Log the metrics and the evaluated model to one run in the same experiment,
    # so register_model.py can rank runs by test_accuracy and resolve
    # runs:/<run_id>/model from the run it selects
    mlflow.set_experiment("iris_classifier")
    with mlflow.start_run():
        mlflow.log_metrics(eval_metrics)
        mlflow.sklearn.log_model(model, "model")

    # Save for DVC
    os.makedirs('metrics', exist_ok=True)
    with open('metrics/eval_metrics.json', 'w') as f:
        json.dump(eval_metrics, f)

    print(f"Test accuracy: {eval_metrics['test_accuracy']:.4f}")

    # Fail the stage (and therefore the CI job) if the model misses the threshold
    if eval_metrics['test_accuracy'] < 0.8:
        raise ValueError(f"Model accuracy {eval_metrics['test_accuracy']:.4f} below threshold 0.8")

    # Return the metrics so callers (such as the unit tests) can check them
    return eval_metrics


if __name__ == "__main__":
    evaluate_model()
Configuring GitHub Actions for ML Workflows
The GitHub Actions workflow orchestrates the entire pipeline. It runs on pushes to main and develop and on pull requests targeting main, so a code change that breaks the pipeline or drops the model below the accuracy threshold fails the build.
Step 4: Create the CI/CD Workflow
Create .github/workflows/ml_pipeline.yml:
name: ML Pipeline CI/CD

on:
  push:
    branches: [main, develop]
  pull_request:
    branches: [main]

# The job commits an updated dvc.lock back to the repository
permissions:
  contents: write

env:
  MLFLOW_TRACKING_URI: ${{ secrets.MLFLOW_TRACKING_URI }}
  AWS_ACCESS_KEY_ID: ${{ secrets.AWS_ACCESS_KEY_ID }}
  AWS_SECRET_ACCESS_KEY: ${{ secrets.AWS_SECRET_ACCESS_KEY }}

jobs:
  test-and-train:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
        with:
          lfs: true

      - name: Set up Python
        uses: actions/setup-python@v5
        with:
          python-version: '3.10'
          cache: 'pip'  # expects a requirements.txt or pyproject.toml in the repository

      - name: Install dependencies
        run: |
          python -m pip install --upgrade pip
          pip install dvc dvc-s3 mlflow scikit-learn pandas numpy pytest

      - name: Pull DVC data
        run: |
          dvc pull --remote myremote

      - name: Run tests
        run: |
          pytest tests/ -v

      - name: Reproduce pipeline
        run: |
          dvc repro

      - name: Push updated data and models
        if: github.ref == 'refs/heads/main'
        run: |
          dvc push --remote myremote
          git config user.name github-actions
          git config user.email github-actions@github.com
          # Pipeline outputs are recorded in dvc.lock
          git add dvc.lock
          # Commit only when dvc.lock actually changed
          git diff --cached --quiet || git commit -m "Update DVC artifacts [skip ci]"
          git push

      - name: Register model in MLflow
        if: github.ref == 'refs/heads/main'
        run: |
          python src/register_model.py
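Reviewers often want to see the evaluation metrics without digging through logs. One option, sketched below, is a small script that renders metrics/eval_metrics.json as a Markdown table and appends it to the job summary via the GITHUB_STEP_SUMMARY file that GitHub Actions provides to each step. The function names and the idea of running this as an extra workflow step after dvc repro are assumptions for illustration, not part of the original workflow.

```python
import json
import os


def metrics_to_markdown(metrics):
    """Render a flat {name: float} dict as a Markdown table."""
    lines = ["| Metric | Value |", "| --- | --- |"]
    for name in sorted(metrics):
        lines.append(f"| {name} | {metrics[name]:.4f} |")
    return "\n".join(lines) + "\n"


def write_job_summary(metrics_path='metrics/eval_metrics.json'):
    """Append the metrics table to the GitHub job summary, or print it locally."""
    with open(metrics_path) as f:
        metrics = json.load(f)
    table = metrics_to_markdown(metrics)

    # GITHUB_STEP_SUMMARY is set by GitHub Actions; it is absent on a laptop
    summary_path = os.environ.get("GITHUB_STEP_SUMMARY")
    if summary_path:
        with open(summary_path, "a") as f:
            f.write(table)
    else:
        print(table)
    return table
```

Outside of GitHub Actions the table is printed to stdout instead, which makes the script easy to try locally.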
Step 5: Model Registration Script
Create src/register_model.py to promote validated models:
# src/register_model.py
from mlflow.exceptions import MlflowException
from mlflow.tracking import MlflowClient


def register_best_model():
    client = MlflowClient()

    # Get the best run from the experiment
    experiment = client.get_experiment_by_name("iris_classifier")
    if experiment is None:
        print("Experiment not found")
        return

    # Assumes the selected run has both a test_accuracy metric and a model
    # artifact logged under "model"
    runs = client.search_runs(
        experiment_ids=[experiment.experiment_id],
        order_by=["metrics.test_accuracy DESC"],
        max_results=1
    )
    if not runs:
        print("No runs found")
        return

    best_run = runs[0]
    run_id = best_run.info.run_id

    # Register the model
    model_uri = f"runs:/{run_id}/model"
    model_name = "iris_classifier_prod"
    try:
        # Create registered model if it doesn't exist
        client.create_registered_model(model_name)
    except MlflowException:
        # The registered model already exists
        pass

    # Create a new version
    result = client.create_model_version(
        name=model_name,
        source=model_uri,
        run_id=run_id
    )

    # Transition to Staging
    # Note: model stages are deprecated in recent MLflow releases (2.9+) in
    # favor of aliases; this call matches the stage-based workflow used here
    client.transition_model_version_stage(
        name=model_name,
        version=result.version,
        stage="Staging"
    )
    print(f"Registered model {model_name} version {result.version} from run {run_id}")


if __name__ == "__main__":
    register_best_model()
Handling Edge Cases and Production Considerations
Data Versioning Conflicts
When multiple team members push data changes simultaneously, DVC can encounter merge conflicts in .dvc files. Mitigate this by:
- Using DVC's lock file (dvc.lock), which is machine-generated
- Implementing branch-specific data stores
- Running dvc pull before dvc repro in CI
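As one way to picture branch-specific data stores, the sketch below derives a per-branch storage prefix from a branch name. The helper name, the sanitizing rule, and the base URL (taken from the remote configured earlier) are assumptions for illustration; DVC itself does not prescribe this layout.

```python
import re


def remote_url_for_branch(branch, base_url="s3://your-bucket/dvc-store"):
    """Return a per-branch storage prefix under base_url (hypothetical layout)."""
    # Collapse characters that are awkward in object-store keys into dashes
    safe = re.sub(r"[^A-Za-z0-9._-]+", "-", branch).strip("-")
    return f"{base_url.rstrip('/')}/{safe or 'default'}"


print(remote_url_for_branch("feature/new-data"))
```

The resulting URL could then be applied in CI with dvc remote modify myremote url followed by the computed value. The trade-off is that separate prefixes no longer share cached objects across branches, so storage use grows with the number of active branches.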
MLflow Tracking Server Availability
If your MLflow server goes down, the pipeline should degrade gracefully:
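import json
import os

import mlflow
from mlflow.exceptions import MlflowException


def log_metrics_safely(metrics, fallback_path='metrics/fallback_metrics.json'):
    """Log to MLflow; write a local JSON file if the tracking server is unreachable."""
    try:
        mlflow.set_tracking_uri(os.environ.get('MLFLOW_TRACKING_URI'))
        with mlflow.start_run():
            mlflow.log_metrics(metrics)
    except MlflowException as e:
        print(f"MLflow unavailable: {e}")
        # Fall back to local logging (fallback_path is an illustrative location)
        os.makedirs(os.path.dirname(fallback_path), exist_ok=True)
        with open(fallback_path, 'w') as f:
            json.dump(metrics, f)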
Model Performance Regression
Fail the pipeline when a new model underperforms the baseline, so that a regressed model is never promoted:
# In evaluate.py
def check_model_regression(new_metrics, baseline_metrics):
    """Raise if the new model regresses by more than 5% on any tracked metric"""
    for metric in ['test_accuracy', 'test_f1']:
        if new_metrics[metric] < baseline_metrics[metric] * 0.95:
            raise ValueError(f"Model regressed on {metric}")
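To make the 5% rule concrete, here is a small self-contained variant that reports which metrics regressed instead of raising on the first one, followed by a worked example. The function names, the tolerance parameter, and the idea of keeping baseline metrics in a JSON file are assumptions for illustration.

```python
import json


def load_metrics(path):
    """Read a flat metrics dict from a JSON file, such as metrics/eval_metrics.json."""
    with open(path) as f:
        return json.load(f)


def find_regressions(new_metrics, baseline_metrics, tolerance=0.05,
                     metric_names=("test_accuracy", "test_f1")):
    """Return the metrics that fell more than `tolerance` (relative) below baseline."""
    regressed = []
    for name in metric_names:
        # With tolerance=0.05, a baseline of 0.90 gives a floor of 0.855
        floor = baseline_metrics[name] * (1 - tolerance)
        if new_metrics[name] < floor:
            regressed.append(name)
    return regressed


baseline = {"test_accuracy": 0.90, "test_f1": 0.90}
candidate = {"test_accuracy": 0.85, "test_f1": 0.88}
print(find_regressions(candidate, baseline))  # prints ['test_accuracy']
```

Here the accuracy of 0.85 falls below the floor of 0.855 and is flagged, while the F1 score of 0.88 stays within tolerance. In CI, a non-empty result would be turned into a non-zero exit code so that the job fails.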
Cost Optimization for CI/CD
Running full training on every push is expensive. Optimize by:
- Using GitHub Actions caching for pip dependencies
- Running lightweight tests on PRs, full training on merge
- Using spot instances for training jobs
# Cache pip dependencies
- name: Cache pip
  uses: actions/cache@v4
  with:
    path: ~/.cache/pip
    key: ${{ runner.os }}-pip-${{ hashFiles('requirements.txt') }}
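The idea of running lightweight tests on pull requests and full training only on merge can be sketched as a small decision helper that reads the default environment variables GitHub Actions sets for every job (GITHUB_EVENT_NAME and GITHUB_REF). The function name and the choice of main as the only training branch are assumptions for illustration.

```python
import os


def should_run_full_training(env=None, training_ref="refs/heads/main"):
    """Return True only for push events on the training branch."""
    env = os.environ if env is None else env
    is_push = env.get("GITHUB_EVENT_NAME") == "push"
    on_training_branch = env.get("GITHUB_REF") == training_ref
    return is_push and on_training_branch


if should_run_full_training():
    print("Merge to main detected: run the full pipeline")
else:
    print("Pull request or other branch: run lightweight tests only")
```

The same split can be expressed directly in the workflow with step-level if conditions; a helper like this is mainly useful when the decision also needs to be made inside Python scripts.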
Testing the Complete Pipeline
Create unit tests for your pipeline components:
# tests/test_pipeline.py
import pytest
import pandas as pd
import numpy as np
from sklearn.ensemble import RandomForestClassifier


def test_data_preparation():
    """Test that data splits are correct"""
    from src.prepare import prepare_data
    train, test = prepare_data()
    assert len(train) > 0
    assert len(test) > 0
    assert train.shape[1] == test.shape[1]


def test_model_training():
    """Test model trains without errors"""
    from src.train import train_model
    model = train_model()
    assert hasattr(model, 'predict')
    assert hasattr(model, 'predict_proba')


def test_model_evaluation():
    """Test evaluation metrics are reasonable"""
    from src.evaluate import evaluate_model
    metrics = evaluate_model()
    assert 0 <= metrics['test_accuracy'] <= 1
    assert 0 <= metrics['test_f1'] <= 1
Run tests locally (the --cov flag requires the pytest-cov plugin, which is not in the install command above):
pytest tests/ -v --cov=src
Monitoring and Maintenance
After deployment, monitor:
- Pipeline success rate: Track failed runs in GitHub Actions
- Model drift: Compare production metrics against baseline
- Data staleness: Alert when training data exceeds age threshold
- Infrastructure costs: Monitor S3 storage and MLflow compute
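The data staleness check can be sketched as a comparison between a last-updated timestamp and an age threshold. How that timestamp is obtained is an assumption left open here (for example, a metadata file or the newest date in the dataset); file modification times are usually unreliable in CI because checkout and dvc pull reset them. The function name and the 30-day default are illustrative.

```python
from datetime import datetime, timedelta, timezone


def is_stale(last_updated, max_age=timedelta(days=30), now=None):
    """Return True when the data is older than max_age.

    last_updated and now must be timezone-aware datetimes.
    """
    now = datetime.now(timezone.utc) if now is None else now
    return now - last_updated > max_age


# Worked example with fixed dates so the result does not depend on today's date
reference = datetime(2024, 6, 1, tzinfo=timezone.utc)
print(is_stale(datetime(2024, 4, 1, tzinfo=timezone.utc), now=reference))   # prints True
print(is_stale(datetime(2024, 5, 20, tzinfo=timezone.utc), now=reference))  # prints False
```

A scheduled workflow could run a check like this and fail, or send a notification, when it returns True.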
Set up alerts in GitHub Actions:
- name: Notify on failure
  if: failure()
  uses: slackapi/slack-github-action@v1.24.0
  with:
    payload: |
      {
        "text": "ML Pipeline failed in ${{ github.repository }}"
      }
  env:
    SLACK_WEBHOOK_URL: ${{ secrets.SLACK_WEBHOOK }}
    # For a Slack incoming webhook, v1 of this action also expects:
    # SLACK_WEBHOOK_TYPE: INCOMING_WEBHOOK
What's Next
This pipeline provides a foundation for ML CI/CD. To extend it:
- Add model validation gates: Require human approval before promoting to production
- Implement A/B testing: Deploy multiple model versions simultaneously
- Integrate feature stores: Use tools like Feast for feature management
- Add automated retraining: Schedule periodic retraining based on data freshness
For production deployments, consider the zero-trust patterns described in recent research: "decoupling identity from access through credential broker patterns enables secure, auditable CI/CD pipelines" [3]. This becomes critical when your ML pipeline accesses sensitive data or deploys models to production environments.
The pipeline we built handles the core ML workflow: data versioning, experiment tracking, automated testing, and model registration. Adapt the patterns to your specific use case, whether it's computer vision, NLP, or tabular data. The key is maintaining reproducibility and automation throughout the ML lifecycle.