MLOps in Practice: Deploying and Monitoring ML Models in Production
A practical guide to production ML systems — model serving options, CI/CD for ML, experiment tracking, monitoring for data drift, and the operational practices that keep ML systems reliable.
The gap between a trained ML model and a reliable production ML system is where most ML projects stall. A model that performs well in evaluation can fail silently in production for dozens of reasons: data distribution shifts, infrastructure configuration differences, silent upstream data changes, or simply the unpredictability of real user behaviour.
MLOps is the set of engineering practices that close this gap. This guide covers what MLOps actually means in practice — not the frameworks and buzzwords, but the concrete patterns that make ML systems reliable.
What MLOps Is (and Is Not)
MLOps is not a tool. MLOps is a collection of practices adapted from DevOps and software engineering principles to the specific challenges of ML systems:
- Reproducibility: Given the same data and code, produce the same model
- Automation: Trigger training, evaluation, and deployment automatically based on defined conditions
- Monitoring: Detect when model quality degrades before users notice
- Versioning: Track models, datasets, and configurations so you can roll back
The "Ops" in MLOps is about reliability and repeatability — not about adding complexity.
The ML Production Stack
A production ML system has more components than a trained model file:
- Data pipeline — ingesting, validating, and transforming input data
- Feature pipeline — computing ML features at training and serving time
- Training pipeline — reproducing model training from code and data
- Model registry — storing and versioning trained models with metadata
- Serving infrastructure — deploying models and serving predictions
- Monitoring — tracking data quality, model performance, and system health
Most ML projects start with only the data pipeline, training pipeline, and serving infrastructure, and add the rest reactively when production problems occur. The teams that add these components proactively ship faster and have fewer production incidents.
Experiment Tracking
Experiment tracking is the foundation of reproducible ML. Without it, you cannot answer questions like: "Which dataset, hyperparameters, and code produced last month's best model?"
MLflow for experiment tracking
MLflow is the most widely adopted open-source experiment tracking tool. Core concepts:
- Runs: A single training execution. MLflow captures start/end time, parameters, metrics, and artifacts.
- Experiments: A collection of related runs (e.g., "customer churn v2").
- Artifacts: Files associated with a run — model files, confusion matrices, evaluation reports.
import mlflow

mlflow.set_experiment("churn_prediction_v2")

with mlflow.start_run():
    mlflow.log_param("learning_rate", 0.01)
    mlflow.log_param("n_estimators", 100)

    # Train model...
    model = train(X_train, y_train, lr=0.01, n=100)

    mlflow.log_metric("val_auc", evaluate(model, X_val, y_val))
    mlflow.sklearn.log_model(model, "model")
Log everything that affects model behaviour: preprocessing parameters, feature lists, train/validation split logic, random seeds. A run is only reproducible if all inputs are captured.
What to track
- Parameters: Every hyperparameter and configuration value
- Metrics: Training loss, validation metrics, and task-specific evaluation metrics at each epoch
- Artifacts: Final model, feature importance plots, confusion matrices, evaluation reports
- Data version: A hash or version identifier of the training dataset
- Code version: Git commit hash
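Capturing the data and code version can be automated with a couple of helpers. A sketch using only the standard library (the function names are illustrative, and log_lineage assumes an MLflow run is active when it is called):

```python
import hashlib
import subprocess

def dataset_fingerprint(data_bytes: bytes) -> str:
    """Short, stable identifier for a training dataset snapshot."""
    return hashlib.sha256(data_bytes).hexdigest()[:16]

def current_git_commit() -> str:
    """Git commit hash of the training code, or 'unknown' outside a repo."""
    try:
        return subprocess.check_output(
            ["git", "rev-parse", "HEAD"], text=True).strip()
    except (subprocess.CalledProcessError, FileNotFoundError):
        return "unknown"

def log_lineage(data_bytes: bytes) -> None:
    """Attach data and code versions to the active MLflow run."""
    import mlflow  # imported lazily so the helpers above stay dependency-free
    mlflow.log_param("data_version", dataset_fingerprint(data_bytes))
    mlflow.log_param("git_commit", current_git_commit())
```

Hashing the serialised dataset is a pragmatic stand-in for a full data versioning tool like DVC: it is enough to answer "did the data change?" even if it cannot reconstruct the data itself.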
CI/CD for Machine Learning
The ML pipeline trigger model
Unlike traditional software where deployments are triggered by code changes, ML pipeline triggers include:
- Code change: A change to training code, feature engineering, or model architecture
- Data change: A new batch of training data arrives above a size threshold
- Performance degradation: Monitoring detects that model accuracy has dropped below a threshold
- Scheduled retraining: Time-based trigger (weekly, monthly) for models that need fresh data
Define the trigger conditions explicitly. Teams that retrain only when someone notices a problem are always reactive.
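Making the conditions explicit can be as simple as a decision function that the orchestrator evaluates on a schedule. A minimal sketch; the thresholds and field names are illustrative, not from any specific tool:

```python
from dataclasses import dataclass

@dataclass
class PipelineState:
    """Illustrative snapshot of the signals that can trigger retraining."""
    code_changed: bool        # training code or feature engineering changed
    new_rows: int             # rows of new training data since last run
    current_auc: float        # latest monitored model quality
    days_since_training: int  # age of the production model

def should_retrain(state: PipelineState,
                   min_new_rows: int = 50_000,
                   min_auc: float = 0.80,
                   max_age_days: int = 30) -> list[str]:
    """Return the trigger conditions that fired (empty list = no retrain)."""
    reasons = []
    if state.code_changed:
        reasons.append("code change")
    if state.new_rows >= min_new_rows:
        reasons.append("data change")
    if state.current_auc < min_auc:
        reasons.append("performance degradation")
    if state.days_since_training >= max_age_days:
        reasons.append("scheduled retraining")
    return reasons
```

Returning the fired reasons (rather than a bare boolean) makes it trivial to log why each retraining run happened.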
The evaluation gate
The most important element of ML CI/CD: a gate that compares the new model against the current production model before deployment.
A model should only be promoted to production if it:
- Passes a minimum performance threshold on the evaluation dataset
- Performs no worse than the current production model (within a defined tolerance)
- Passes a regression test on critical edge cases
- Completes a latency and memory check within acceptable bounds
Without an evaluation gate, you will eventually deploy a worse model. It happens to everyone.
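The gate itself can be a short script run in CI before promotion. A sketch comparing candidate and production metric dicts; the metric names and thresholds are placeholders for your own:

```python
def passes_evaluation_gate(candidate: dict, production: dict,
                           min_auc: float = 0.75,
                           tolerance: float = 0.005,
                           max_p95_latency_ms: float = 50.0):
    """Decide whether a candidate model may replace the production model.

    Both inputs are metric dicts, e.g. {"auc": 0.82, "p95_latency_ms": 31.0}.
    Returns (promote, failed_checks) so CI can report exactly what blocked it.
    """
    failures = []
    if candidate["auc"] < min_auc:
        failures.append(f"below minimum AUC {min_auc}")
    if candidate["auc"] < production["auc"] - tolerance:
        failures.append("worse than production beyond tolerance")
    if candidate["p95_latency_ms"] > max_p95_latency_ms:
        failures.append("latency check failed")
    return (not failures, failures)
```

In a real pipeline the regression test on edge cases would be a fourth check here, reading a fixed file of critical inputs and expected outputs.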
Pipeline tools
- Apache Airflow: Mature, widely adopted, Python-based DAG orchestration. Good for complex pipelines with many dependencies.
- Kubeflow Pipelines: Kubernetes-native ML pipelines. Higher operational complexity but native integration with Kubernetes infrastructure.
- GitHub Actions / GitLab CI: Simple training pipelines on cloud compute. Good for smaller projects or pipelines triggered by code changes.
- Prefect / Dagster: More modern alternatives to Airflow with better developer experience.
Model Serving
Choosing the right serving infrastructure depends on your latency requirements, throughput, and whether you are serving traditional ML models or LLMs.
Traditional ML models (scikit-learn, XGBoost, TensorFlow, PyTorch)
FastAPI wrapper: The simplest approach — wrap the model in a FastAPI endpoint. Appropriate for low-to-moderate traffic and internal services. Easy to version with Docker.
BentoML: Framework for packaging and deploying ML models as services. Handles serialisation, versioning, and common serving patterns. Good middle ground between raw FastAPI and full platforms.
TensorFlow Serving / TorchServe: High-performance serving specifically for TensorFlow and PyTorch models. More operational complexity than FastAPI but better performance at high throughput.
MLflow Model Serving: If you are already using MLflow for tracking, its built-in model serving is a low-friction option for internal services.
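As a concrete illustration of the FastAPI approach, a minimal wrapper might look like the following. The model version tag and feature shape are placeholders, and fastapi/pydantic are assumed to be installed; keeping the scoring function pure makes it unit-testable without the web layer:

```python
from typing import Callable

MODEL_VERSION = "churn-v3"  # illustrative version tag, baked into the image

def score(model: Callable, features: list[float]) -> dict:
    """Pure scoring function: returns the prediction plus the model version."""
    return {"prediction": model(features), "model_version": MODEL_VERSION}

def create_app(model: Callable):
    """Wrap the model in a FastAPI app. Serve with: uvicorn main:app"""
    from fastapi import FastAPI  # imported here so score() stays framework-free
    from pydantic import BaseModel

    class PredictRequest(BaseModel):
        features: list[float]

    app = FastAPI()

    @app.post("/predict")
    def predict(req: PredictRequest):
        return score(model, req.features)

    return app
```

Returning the model version with every prediction pays off later: it is what makes canary comparisons and incident analysis possible.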
LLMs and transformer models
vLLM: The standard choice for serving open-source LLMs (Llama, Mistral, Falcon) at scale. Its PagedAttention memory management dramatically improves GPU utilisation and throughput, and it exposes an OpenAI-compatible API.
Triton Inference Server: NVIDIA's model serving framework. High throughput for GPU inference across TensorFlow, PyTorch, and ONNX models. More complex to configure than vLLM.
Ollama: Simple local LLM serving for development and testing. Not production-grade for high concurrency.
Model versioning and canary deployment
Never switch 100% of traffic to a new model immediately:
- Deploy the new model version alongside the existing version
- Route 5–10% of traffic to the new model
- Compare quality and performance metrics between versions over 24–48 hours
- Gradually increase traffic to the new version if metrics are acceptable
- Decommission the old version once the new version has full traffic
This pattern requires that your serving infrastructure supports traffic splitting (e.g., Kubernetes ingress rules, AWS ALB weighted routing) and that you tag predictions with the model version for analysis.
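When the ingress layer cannot split traffic, the same pattern can be done in application code. A sketch of deterministic, hash-based routing, so a given user sees the same model version for the whole canary period (version names and fraction are illustrative):

```python
import hashlib

def route_model_version(user_id: str, canary_fraction: float = 0.10,
                        stable: str = "v1", canary: str = "v2") -> str:
    """Deterministically assign a user to a model version.

    Hashing the user id (rather than random sampling per request) keeps each
    user's experience consistent across requests during the canary.
    """
    digest = hashlib.sha256(user_id.encode()).digest()
    # Map the first 8 bytes of the hash to a float in [0, 1)
    bucket = int.from_bytes(digest[:8], "big") / 2**64
    return canary if bucket < canary_fraction else stable
```

Tag every logged prediction with the returned version so the 24–48 hour comparison can be computed offline.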
Monitoring: The Hardest Part
Production monitoring is where MLOps differs most sharply from traditional software monitoring. Traditional software monitoring asks "Is the system up and responding?" ML monitoring additionally asks "Is the system giving good answers?"
Three types of ML drift
Data drift: The statistical distribution of input features has changed from what the model was trained on. A fraud detection model trained on 2023 transaction data will encounter different transaction patterns in 2025.
Detection: Compare the distribution of production input features against the training distribution using statistical tests (Kolmogorov–Smirnov test, Population Stability Index (PSI), Jensen–Shannon divergence).
Concept drift: The relationship between features and labels has changed. The model was trained on a world that no longer exists.
Detection: Requires labelled production data — you need to know the true outcome to detect concept drift. This is the hardest drift to detect because ground truth labels arrive with a delay (or not at all).
Prediction drift: The distribution of model outputs has changed. Can be detected without labels — if the model is suddenly predicting "high risk" for 80% of cases instead of the usual 20%, something has changed.
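That check from the example needs no labels at all. A minimal sketch of a rate-based prediction-drift check; the 10-point shift threshold is illustrative:

```python
def prediction_rate_shift(predictions: list[str], label: str,
                          baseline_rate: float, max_shift: float = 0.10):
    """Compare the current rate of `label` against its training-time baseline.

    Returns (current_rate, drifted) where drifted is True when the rate has
    moved more than max_shift from the baseline in either direction.
    """
    if not predictions:
        return (0.0, False)
    rate = sum(p == label for p in predictions) / len(predictions)
    return (rate, abs(rate - baseline_rate) > max_shift)
```

A rate shift does not tell you why the model changed its behaviour, only that it did; the input-drift checks below are what narrow down the cause.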
Monitoring implementation
Input monitoring: Log a sample of production inputs and compute feature statistics on a rolling basis. Compare against a baseline computed from training data.
# Example: monitoring numeric feature distributions
import numpy as np

def compute_psi(reference_dist, production_dist, n_bins=10):
    """Population Stability Index; PSI > 0.2 indicates significant drift."""
    # Bin both samples using bin edges derived from the reference data
    ref_counts, bins = np.histogram(reference_dist, bins=n_bins)
    prod_counts, _ = np.histogram(production_dist, bins=bins)

    # Convert counts to proportions so samples of different sizes are comparable
    ref_pct = ref_counts / max(ref_counts.sum(), 1)
    prod_pct = prod_counts / max(prod_counts.sum(), 1)

    # Avoid division by zero in empty bins
    ref_pct = np.where(ref_pct == 0, 1e-10, ref_pct)
    prod_pct = np.where(prod_pct == 0, 1e-10, prod_pct)

    return np.sum((prod_pct - ref_pct) * np.log(prod_pct / ref_pct))
Output monitoring: Track prediction distribution, confidence score distribution, and any business-level metrics (e.g., escalation rate for a support routing model).
Ground truth monitoring: When ground truth labels are available (often delayed), compute actual model accuracy on a rolling basis. This is the most reliable signal but requires labelling infrastructure.
Alerting thresholds: Define alert thresholds for each metric. PSI > 0.2 for any key feature triggers a retraining evaluation. Accuracy drop > 5% from baseline triggers immediate investigation.
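Encoding those thresholds in code keeps alerting reproducible rather than ad hoc. A sketch that mirrors the example thresholds above; wiring the returned alerts into a pager or Slack channel is left out:

```python
def evaluate_alerts(feature_psi: dict[str, float],
                    accuracy: float, baseline_accuracy: float,
                    psi_threshold: float = 0.2,
                    max_accuracy_drop: float = 0.05) -> list[str]:
    """Return human-readable alerts for drifted features and accuracy drops."""
    alerts = [f"PSI drift on feature '{name}' ({psi:.2f})"
              for name, psi in feature_psi.items() if psi > psi_threshold]
    if baseline_accuracy - accuracy > max_accuracy_drop:
        drop = baseline_accuracy - accuracy
        alerts.append(f"accuracy dropped {drop:.2%} from baseline")
    return alerts
```

Run this on the same rolling schedule as the statistics themselves, so an alert always refers to a specific, inspectable monitoring window.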
Model Registry and Governance
A model registry stores trained model artifacts with metadata: training metrics, data version, code version, training parameters, and deployment history.
MLflow has a built-in model registry with staging → production → archived lifecycle states. For regulated industries, the registry serves as an audit trail for model governance — you can answer "which model version was serving predictions on date X and what was it trained on?"
Key metadata to capture for every production model:
- Model version and architecture
- Training dataset version and size
- Evaluation metrics (overall and by segment)
- Training timestamp and engineer
- Deployment history (promoted to production on date X, by person Y)
Getting Started: The Minimal MLOps Stack
For a first production ML system, this is the minimum viable MLOps setup:
- Experiment tracking: MLflow, configured and used for every training run
- Model registry: MLflow registry with staging and production stages
- Evaluation gate: Script that compares new model vs current production model before promotion
- Serving: FastAPI wrapper in Docker, with versioned container images
- Basic monitoring: Log predictions to a database; compute weekly statistics; set up alerts for obvious failure modes
- Retraining trigger: Calendar-based initially (monthly or quarterly); move to performance-based triggers once monitoring is established
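The "log predictions to a database" step can start as a single table. A sketch using SQLite from the standard library; the schema is illustrative, and a real system would swap in your warehouse of choice:

```python
import json
import sqlite3
import time

def init_prediction_log(db_path: str = "predictions.db") -> sqlite3.Connection:
    """Create (if needed) and open the prediction log table."""
    conn = sqlite3.connect(db_path)
    conn.execute("""CREATE TABLE IF NOT EXISTS predictions (
        ts REAL, model_version TEXT, features TEXT, prediction TEXT)""")
    return conn

def log_prediction(conn: sqlite3.Connection, model_version: str,
                   features: dict, prediction) -> None:
    """Append one prediction with its inputs and the serving model version."""
    conn.execute("INSERT INTO predictions VALUES (?, ?, ?, ?)",
                 (time.time(), model_version, json.dumps(features), str(prediction)))
    conn.commit()
```

The weekly statistics job then reads this table to compute feature and prediction distributions, and the model_version column is what makes canary comparisons possible.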
This is not glamorous. It is maintainable, and it prevents the most common production failures.
We build and operate ML production systems for enterprise clients — from data pipeline design to model monitoring. If you are building your first production ML system or scaling an existing one, talk to our team.
