MLOps in Practice: Deploying and Monitoring ML Models in Production
A practical guide to production ML systems — model serving options, CI/CD for ML, experiment tracking, monitoring for data drift, and the operational practices that keep ML systems reliable.
The gap between a trained ML model and a reliable production ML system is where most ML projects stall. A model that performs well in evaluation can fail silently in production for dozens of reasons: data distribution shifts, infrastructure configuration differences, silent upstream data changes, or simply the unpredictability of real user behaviour.
MLOps is the set of engineering practices that close this gap. This guide covers what MLOps actually means in practice — not the frameworks and buzzwords, but the concrete patterns that make ML systems reliable.
What MLOps Is (and Is Not)
MLOps is not a tool. MLOps is a collection of practices adapted from DevOps and software engineering principles to the specific challenges of ML systems:
- Reproducibility: Given the same data and code, produce the same model
- Automation: Trigger training, evaluation, and deployment automatically based on defined conditions
- Monitoring: Detect when model quality degrades before users notice
- Versioning: Track models, datasets, and configurations so you can roll back
The "Ops" in MLOps is about reliability and repeatability — not about adding complexity.
The ML Production Stack
A production ML system has more components than a trained model file:
- Data pipeline — ingesting, validating, and transforming input data
- Feature pipeline — computing ML features at training and serving time
- Training pipeline — reproducing model training from code and data
- Model registry — storing and versioning trained models with metadata
- Serving infrastructure — deploying models and serving predictions
- Monitoring — tracking data quality, model performance, and system health
Most ML projects start with only the data pipeline, training pipeline, and serving infrastructure, and add the rest reactively when production problems occur. The teams that add these components proactively ship faster and have fewer production incidents.
Experiment Tracking
Experiment tracking is the foundation of reproducible ML. Without it, you cannot answer questions like: "Which dataset, hyperparameters, and code produced last month's best model?"
MLflow for experiment tracking
MLflow is the most widely adopted open-source experiment tracking tool. Core concepts:
- Runs: A single training execution. MLflow captures start/end time, parameters, metrics, and artifacts.
- Experiments: A collection of related runs (e.g., "customer churn v2").
- Artifacts: Files associated with a run — model files, confusion matrices, evaluation reports.
import mlflow

mlflow.set_experiment("churn_prediction_v2")

with mlflow.start_run():
    mlflow.log_param("learning_rate", 0.01)
    mlflow.log_param("n_estimators", 100)

    # Train model...
    model = train(X_train, y_train, lr=0.01, n=100)

    mlflow.log_metric("val_auc", evaluate(model, X_val, y_val))
    mlflow.sklearn.log_model(model, "model")
Log everything that affects model behaviour: preprocessing parameters, feature lists, train/validation split logic, random seeds. A run is only reproducible if all inputs are captured.
What to track
- Parameters: Every hyperparameter and configuration value
- Metrics: Training loss, validation metrics, and task-specific evaluation metrics at each epoch
- Artifacts: Final model, feature importance plots, confusion matrices, evaluation reports
- Data version: A hash or version identifier of the training dataset
- Code version: Git commit hash
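Capturing the data and code version can be automated with a couple of helpers. A sketch using only the standard library (the function names are illustrative, and log_lineage assumes an MLflow run is active when it is called):

```python
import hashlib
import subprocess

def dataset_fingerprint(data_bytes: bytes) -> str:
    """Short, stable identifier for a training dataset snapshot."""
    return hashlib.sha256(data_bytes).hexdigest()[:16]

def current_git_commit() -> str:
    """Git commit hash of the training code, or 'unknown' outside a repo."""
    try:
        return subprocess.check_output(
            ["git", "rev-parse", "HEAD"], text=True).strip()
    except (subprocess.CalledProcessError, FileNotFoundError):
        return "unknown"

def log_lineage(data_bytes: bytes) -> None:
    """Attach data and code versions to the active MLflow run."""
    import mlflow  # imported lazily so the helpers above stay dependency-free
    mlflow.log_param("data_version", dataset_fingerprint(data_bytes))
    mlflow.log_param("git_commit", current_git_commit())
```

Hashing the serialised dataset is a pragmatic stand-in for a full data versioning tool like DVC: it is enough to answer "did the data change?" even if it cannot reconstruct the data itself.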
CI/CD for Machine Learning
The ML pipeline trigger model
Unlike traditional software where deployments are triggered by code changes, ML pipeline triggers include:
- Code change: A change to training code, feature engineering, or model architecture
- Data change: A new batch of training data arrives above a size threshold
- Performance degradation: Monitoring detects that model accuracy has dropped below a threshold
- Scheduled retraining: Time-based trigger (weekly, monthly) for models that need fresh data
Define the trigger conditions explicitly. Teams that retrain only when someone notices a problem are always reactive.
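Making the conditions explicit can be as simple as a decision function that the orchestrator evaluates on a schedule. A minimal sketch; the thresholds and field names are illustrative, not from any specific tool:

```python
from dataclasses import dataclass

@dataclass
class PipelineState:
    """Illustrative snapshot of the signals that can trigger retraining."""
    code_changed: bool        # training code or feature engineering changed
    new_rows: int             # rows of new training data since last run
    current_auc: float        # latest monitored model quality
    days_since_training: int  # age of the production model

def should_retrain(state: PipelineState,
                   min_new_rows: int = 50_000,
                   min_auc: float = 0.80,
                   max_age_days: int = 30) -> list[str]:
    """Return the trigger conditions that fired (empty list = no retrain)."""
    reasons = []
    if state.code_changed:
        reasons.append("code change")
    if state.new_rows >= min_new_rows:
        reasons.append("data change")
    if state.current_auc < min_auc:
        reasons.append("performance degradation")
    if state.days_since_training >= max_age_days:
        reasons.append("scheduled retraining")
    return reasons
```

Returning the fired reasons (rather than a bare boolean) makes it trivial to log why each retraining run happened.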
The evaluation gate
The most important element of ML CI/CD: a gate that compares the new model against the current production model before deployment.
A model should only be promoted to production if it:
- Passes a minimum performance threshold on the evaluation dataset
- Performs no worse than the current production model (within a defined tolerance)
- Passes a regression test on critical edge cases
- Completes a latency and memory check within acceptable bounds
Without an evaluation gate, you will eventually deploy a worse model. It happens to everyone.
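The gate itself can be a short script run in CI before promotion. A sketch comparing candidate and production metric dicts; the metric names and thresholds are placeholders for your own:

```python
def passes_evaluation_gate(candidate: dict, production: dict,
                           min_auc: float = 0.75,
                           tolerance: float = 0.005,
                           max_p95_latency_ms: float = 50.0):
    """Decide whether a candidate model may replace the production model.

    Both inputs are metric dicts, e.g. {"auc": 0.82, "p95_latency_ms": 31.0}.
    Returns (promote, failed_checks) so CI can report exactly what blocked it.
    """
    failures = []
    if candidate["auc"] < min_auc:
        failures.append(f"below minimum AUC {min_auc}")
    if candidate["auc"] < production["auc"] - tolerance:
        failures.append("worse than production beyond tolerance")
    if candidate["p95_latency_ms"] > max_p95_latency_ms:
        failures.append("latency check failed")
    return (not failures, failures)
```

In a real pipeline the regression test on edge cases would be a fourth check here, reading a fixed file of critical inputs and expected outputs.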
Pipeline tools
- Apache Airflow: Mature, widely adopted, Python-based DAG orchestration. Good for complex pipelines with many dependencies.
- Kubeflow Pipelines: Kubernetes-native ML pipelines. Higher operational complexity but native integration with Kubernetes infrastructure.
- GitHub Actions / GitLab CI: Simple training pipelines on cloud compute. Good for smaller projects or pipelines triggered by code changes.
- Prefect / Dagster: More modern alternatives to Airflow with better developer experience.
Model Serving
Choosing the right serving infrastructure depends on your latency requirements, throughput, and whether you are serving traditional ML models or LLMs.
Traditional ML models (scikit-learn, XGBoost, TensorFlow, PyTorch)
FastAPI wrapper: The simplest approach — wrap the model in a FastAPI endpoint. Appropriate for low-to-moderate traffic and internal services. Easy to version with Docker.
BentoML: Framework for packaging and deploying ML models as services. Handles serialisation, versioning, and common serving patterns. Good middle ground between raw FastAPI and full platforms.
TensorFlow Serving / TorchServe: High-performance serving specifically for TensorFlow and PyTorch models. More operational complexity than FastAPI but better performance at high throughput.
MLflow Model Serving: If you are already using MLflow for tracking, its built-in model serving is a low-friction option for internal services.
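As a concrete illustration of the FastAPI approach, a minimal wrapper might look like the following. The model version tag and feature shape are placeholders, and fastapi/pydantic are assumed to be installed; keeping the scoring function pure makes it unit-testable without the web layer:

```python
from typing import Callable

MODEL_VERSION = "churn-v3"  # illustrative version tag, baked into the image

def score(model: Callable, features: list[float]) -> dict:
    """Pure scoring function: returns the prediction plus the model version."""
    return {"prediction": model(features), "model_version": MODEL_VERSION}

def create_app(model: Callable):
    """Wrap the model in a FastAPI app. Serve with: uvicorn main:app"""
    from fastapi import FastAPI  # imported here so score() stays framework-free
    from pydantic import BaseModel

    class PredictRequest(BaseModel):
        features: list[float]

    app = FastAPI()

    @app.post("/predict")
    def predict(req: PredictRequest):
        return score(model, req.features)

    return app
```

Returning the model version with every prediction pays off later: it is what makes canary comparisons and incident analysis possible.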
LLMs and transformer models
vLLM: The standard choice for serving open-source LLMs (Llama, Mistral, Falcon) at scale. Its PagedAttention memory management dramatically improves GPU utilisation and throughput, and it exposes an OpenAI-compatible API.
Triton Inference Server: NVIDIA's model serving framework. High throughput for GPU inference across TensorFlow, PyTorch, and ONNX models. More complex to configure than vLLM.
Ollama: Simple local LLM serving for development and testing. Not production-grade for high concurrency.
Model versioning and canary deployment
Never switch 100% of traffic to a new model immediately:
- Deploy the new model version alongside the existing version
- Route 5–10% of traffic to the new model
- Compare quality and performance metrics between versions over 24–48 hours
- Gradually increase traffic to the new version if metrics are acceptable
- Decommission the old version once the new version has full traffic
This pattern requires that your serving infrastructure supports traffic splitting (e.g., Kubernetes ingress rules, AWS ALB weighted routing) and that you tag predictions with the model version for analysis.
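When the ingress layer cannot split traffic, the same pattern can be done in application code. A sketch of deterministic, hash-based routing, so a given user sees the same model version for the whole canary period (version names and fraction are illustrative):

```python
import hashlib

def route_model_version(user_id: str, canary_fraction: float = 0.10,
                        stable: str = "v1", canary: str = "v2") -> str:
    """Deterministically assign a user to a model version.

    Hashing the user id (rather than random sampling per request) keeps each
    user's experience consistent across requests during the canary.
    """
    digest = hashlib.sha256(user_id.encode()).digest()
    # Map the first 8 bytes of the hash to a float in [0, 1)
    bucket = int.from_bytes(digest[:8], "big") / 2**64
    return canary if bucket < canary_fraction else stable
```

Tag every logged prediction with the returned version so the 24–48 hour comparison can be computed offline.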
Monitoring: The Hardest Part
Production monitoring is where MLOps differs most sharply from traditional software monitoring. Traditional software monitoring asks "Is the system up and responding?" ML monitoring additionally asks "Is the system giving good answers?"
Three types of ML drift
Data drift: The statistical distribution of input features has changed from what the model was trained on. A fraud detection model trained on 2023 transaction data will encounter different transaction patterns in 2025.
Detection: Compare the distribution of production input features against the training distribution using statistical tests (Kolmogorov–Smirnov test, Population Stability Index (PSI), Jensen–Shannon divergence).
Concept drift: The relationship between features and labels has changed. The model was trained on a world that no longer exists.
Detection: Requires labelled production data — you need to know the true outcome to detect concept drift. This is the hardest drift to detect because ground truth labels arrive with a delay (or not at all).
Prediction drift: The distribution of model outputs has changed. Can be detected without labels — if the model is suddenly predicting "high risk" for 80% of cases instead of the usual 20%, something has changed.
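That check from the example needs no labels at all. A minimal sketch of a rate-based prediction-drift check; the 10-point shift threshold is illustrative:

```python
def prediction_rate_shift(predictions: list[str], label: str,
                          baseline_rate: float, max_shift: float = 0.10):
    """Compare the current rate of `label` against its training-time baseline.

    Returns (current_rate, drifted) where drifted is True when the rate has
    moved more than max_shift from the baseline in either direction.
    """
    if not predictions:
        return (0.0, False)
    rate = sum(p == label for p in predictions) / len(predictions)
    return (rate, abs(rate - baseline_rate) > max_shift)
```

A rate shift does not tell you why the model changed its behaviour, only that it did; the input-drift checks below are what narrow down the cause.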
Monitoring implementation
Input monitoring: Log a sample of production inputs and compute feature statistics on a rolling basis. Compare against a baseline computed from training data.
# Example: monitoring numeric feature distributions
import numpy as np

def compute_psi(reference_dist, production_dist, n_bins=10):
    """Population Stability Index; PSI > 0.2 indicates significant drift."""
    # Bin both samples using bin edges derived from the reference data
    ref_counts, bins = np.histogram(reference_dist, bins=n_bins)
    prod_counts, _ = np.histogram(production_dist, bins=bins)

    # Convert counts to proportions so samples of different sizes are comparable
    ref_pct = ref_counts / max(ref_counts.sum(), 1)
    prod_pct = prod_counts / max(prod_counts.sum(), 1)

    # Avoid division by zero in empty bins
    ref_pct = np.where(ref_pct == 0, 1e-10, ref_pct)
    prod_pct = np.where(prod_pct == 0, 1e-10, prod_pct)

    return np.sum((prod_pct - ref_pct) * np.log(prod_pct / ref_pct))
Output monitoring: Track prediction distribution, confidence score distribution, and any business-level metrics (e.g., escalation rate for a support routing model).
Ground truth monitoring: When ground truth labels are available (often delayed), compute actual model accuracy on a rolling basis. This is the most reliable signal but requires labelling infrastructure.
Alerting thresholds: Define alert thresholds for each metric. PSI > 0.2 for any key feature triggers a retraining evaluation. Accuracy drop > 5% from baseline triggers immediate investigation.
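Encoding those thresholds in code keeps alerting reproducible rather than ad hoc. A sketch that mirrors the example thresholds above; wiring the returned alerts into a pager or Slack channel is left out:

```python
def evaluate_alerts(feature_psi: dict[str, float],
                    accuracy: float, baseline_accuracy: float,
                    psi_threshold: float = 0.2,
                    max_accuracy_drop: float = 0.05) -> list[str]:
    """Return human-readable alerts for drifted features and accuracy drops."""
    alerts = [f"PSI drift on feature '{name}' ({psi:.2f})"
              for name, psi in feature_psi.items() if psi > psi_threshold]
    if baseline_accuracy - accuracy > max_accuracy_drop:
        drop = baseline_accuracy - accuracy
        alerts.append(f"accuracy dropped {drop:.2%} from baseline")
    return alerts
```

Run this on the same rolling schedule as the statistics themselves, so an alert always refers to a specific, inspectable monitoring window.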
Model Registry and Governance
A model registry stores trained model artifacts with metadata: training metrics, data version, code version, training parameters, and deployment history.
MLflow has a built-in model registry with staging → production → archived lifecycle states. For regulated industries, the registry serves as an audit trail for model governance — you can answer "which model version was serving predictions on date X and what was it trained on?"
Key metadata to capture for every production model:
- Model version and architecture
- Training dataset version and size
- Evaluation metrics (overall and by segment)
- Training timestamp and engineer
- Deployment history (promoted to production on date X, by person Y)
Getting Started: The Minimal MLOps Stack
For a first production ML system, this is the minimum viable MLOps setup:
- Experiment tracking: MLflow, configured and used for every training run
- Model registry: MLflow registry with staging and production stages
- Evaluation gate: Script that compares new model vs current production model before promotion
- Serving: FastAPI wrapper in Docker, with versioned container images
- Basic monitoring: Log predictions to a database; compute weekly statistics; set up alerts for obvious failure modes
- Retraining trigger: Calendar-based initially (monthly or quarterly); move to performance-based triggers once monitoring is established
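The "log predictions to a database" step can start as a single table. A sketch using SQLite from the standard library; the schema is illustrative, and a real system would swap in your warehouse of choice:

```python
import json
import sqlite3
import time

def init_prediction_log(db_path: str = "predictions.db") -> sqlite3.Connection:
    """Create (if needed) and open the prediction log table."""
    conn = sqlite3.connect(db_path)
    conn.execute("""CREATE TABLE IF NOT EXISTS predictions (
        ts REAL, model_version TEXT, features TEXT, prediction TEXT)""")
    return conn

def log_prediction(conn: sqlite3.Connection, model_version: str,
                   features: dict, prediction) -> None:
    """Append one prediction with its inputs and the serving model version."""
    conn.execute("INSERT INTO predictions VALUES (?, ?, ?, ?)",
                 (time.time(), model_version, json.dumps(features), str(prediction)))
    conn.commit()
```

The weekly statistics job then reads this table to compute feature and prediction distributions, and the model_version column is what makes canary comparisons possible.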
This is not glamorous. It is maintainable, and it prevents the most common production failures.
We build and operate ML production systems for enterprise clients — from data pipeline design to model monitoring. If you are building your first production ML system or scaling an existing one, talk to our team.
