This repository implements a portfolio Machine Learning project for predictive maintenance on turbofan engine data. It demonstrates the full ML lifecycle end-to-end: training, deployment, monitoring, drift detection, and automated retraining — all runnable locally with Docker and Python. The focus is on lifecycle robustness and reproducibility, not squeezing the last decimals of accuracy.
- Problem: predict Remaining Useful Life (RUL) / failure risk for turbofan engines.
- Lifecycle stages: training, model serving, feedback collection, monitoring, drift detection, automated retraining.
- Production-like aspects: microservices (BentoML), metrics (Prometheus), dashboards (Grafana), experiment tracking (MLflow), Docker Compose orchestration, local demo traffic generator.
- Positioning: an end-to-end ML engineering/MLOps project, not state-of-the-art RUL prediction. The project is designed to run fast on any machine, not to produce competitive RUL scores.
- Code quality assurance: Tox, Mypy, and GitHub Actions CI/CD.
```mermaid
flowchart TD
    %% User Layer
    Demo[Demo Users<br/>continuous_predict.py]

    %% API Services Layer
    PredAPI[Prediction API<br/>:3000]
    FeedAPI[Feedback API<br/>:3001]
    DriftAPI[Drift Detection<br/>:3003]
    RetrainAPI[Retraining Service<br/>:3004]

    %% Storage Layer
    FeedStore[(Feedback Store<br/>rul_feedback.jsonl)]
    ModelStore[(Model Store<br/>models/*.joblib)]

    %% External Services
    Monitor[Prometheus :9090<br/>Grafana :3002]
    MLflow[MLflow :5000]

    %% Main Flow
    Demo -->|HTTP requests| PredAPI
    PredAPI -->|predictions| FeedAPI
    FeedAPI --> FeedStore
    FeedStore --> DriftAPI
    DriftAPI -->|trigger| RetrainAPI
    RetrainAPI --> ModelStore
    ModelStore -.->|hot reload| PredAPI

    %% Monitoring
    DriftAPI -.-> Monitor
    Monitor -.-> DriftAPI

    %% Experiment Tracking
    RetrainAPI --> MLflow

    %% Styling
    classDef service fill:#1976d2,stroke:#0d47a1,color:#ffffff
    classDef storage fill:#f57c00,stroke:#e65100,color:#ffffff
    classDef external fill:#7b1fa2,stroke:#4a148c,color:#ffffff
    classDef user fill:#388e3c,stroke:#1b5e20,color:#ffffff
    class Demo user
    class PredAPI,FeedAPI,DriftAPI,RetrainAPI service
    class FeedStore,ModelStore storage
    class Monitor,MLflow external
```
- Demo: simulates users calling the prediction endpoint concurrently.
- Serving: BentoML microservices for prediction, feedback, drift detection, and retraining.
- Feedback: predictions and ground truths stored as JSONL (a flat file, for simplicity), plus basic RUL statistics.
- Monitoring: Prometheus metrics from services + Grafana dashboards.
- Drift detection: PSI/KS style feature drift metrics and RMSE deltas with model baseline; can trigger retraining.
- Local-first: designed to run on a single machine via Docker Compose.
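The feedback store above can be sketched with nothing but the standard library. This is a minimal illustration, not the repo's actual implementation: the field names (`engine_id`, `predicted_rul`, `actual_rul`) are assumptions.

```python
import json
import statistics
from pathlib import Path


def append_feedback(path: Path, engine_id: int, predicted_rul: float, actual_rul: float) -> None:
    """Append one feedback record as a single JSON line (the JSONL convention)."""
    record = {"engine_id": engine_id, "predicted_rul": predicted_rul, "actual_rul": actual_rul}
    with path.open("a") as f:
        f.write(json.dumps(record) + "\n")


def rul_stats(path: Path) -> dict:
    """Compute basic RUL error stats (count, RMSE, mean error) over the whole file."""
    records = [json.loads(line) for line in path.read_text().splitlines() if line.strip()]
    errors = [r["predicted_rul"] - r["actual_rul"] for r in records]
    rmse = (sum(e * e for e in errors) / len(errors)) ** 0.5
    return {"count": len(records), "rmse": rmse, "mean_error": statistics.mean(errors)}
```

Append-only JSONL keeps writes cheap and crash-safe for a single-machine demo; a database would only be needed for concurrent multi-writer setups.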
- Python (pydantic, pandas, numpy, scikit-learn)
- Serving: BentoML
- Monitoring: Prometheus + Grafana (pre-provisioned dashboards)
- Experiment tracking: MLflow (for runs/metrics; not using Model Registry)
- Orchestration: Docker & Docker Compose
Shows health of each service.
The number of predictions made and of errors (generated from bad inputs in the demo script).
Each different color is a different trained model.
Vertical purple lines are retraining triggers.
Grafana dashboard (10-minute window) for drift metrics and retraining triggers.
- RMSE baseline: RMSE of the last minute of predictions vs. actual RUL, compared against the baseline of the last trained model. Crossing the warning level (orange line) triggers retraining.
- KS: Kolmogorov-Smirnov statistic for feature drift (KS > 0.1). Warning (orange line) triggers retraining.
- PSI: Percentage of features that have changed significantly (PSI > 0.1). Warning (orange line) triggers retraining.
Each orange rectangle signifies a drift signal -> a call for retraining.
Each purple line is a newly trained model (v1.0, v2.0, v3.0, etc).
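The two drift statistics above can be illustrated with small pure-Python versions. This is a sketch, not the service's code: the thresholds (0.1) follow the dashboard description, while the bin count and the small epsilon are arbitrary assumptions.

```python
import math
from bisect import bisect_right


def ks_statistic(a: list, b: list) -> float:
    """Two-sample Kolmogorov-Smirnov statistic: the largest gap between the two ECDFs."""
    a, b = sorted(a), sorted(b)

    def ecdf(xs, v):  # fraction of xs <= v (xs must be sorted)
        return bisect_right(xs, v) / len(xs)

    return max(abs(ecdf(a, v) - ecdf(b, v)) for v in sorted(set(a + b)))


def psi(expected: list, actual: list, bins: int = 10) -> float:
    """Population Stability Index over equal-width bins of the expected (baseline) data."""
    lo, hi = min(expected), max(expected)
    edges = [lo + (hi - lo) * i / bins for i in range(bins + 1)]
    edges[-1] += 1e-9  # make the last bin include the max value
    total = 0.0
    for i in range(bins):
        e = sum(edges[i] <= x < edges[i + 1] for x in expected) / len(expected)
        a = sum(edges[i] <= x < edges[i + 1] for x in actual) / len(actual)
        e, a = max(e, 1e-6), max(a, 1e-6)  # avoid log(0) on empty bins
        total += (a - e) * math.log(a / e)
    return total


# Drift check mirroring the dashboard thresholds (baseline_col / live_col are hypothetical):
# drifted = ks_statistic(baseline_col, live_col) > 0.1 or psi(baseline_col, live_col) > 0.1
```

Both statistics are zero for identical distributions and grow as the live feature distribution moves away from the training baseline.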
Dataset: NASA C-MAPSS turbofan engine degradation time series (multiple units, cycles, sensor readings).
Source: https://www.kaggle.com/datasets/behrad3d/nasa-cmaps
- Features: per-engine, time-based features computed over multiple cycles (3 operational settings, 21 sensors).
- The dataset ships with a train split and a test split; ground-truth RUL is available for both.
- Model: Random Forest for speed/simplicity and quick iterations. Model bundles and feature names are tracked and hot-reloaded by the prediction service.
- Results: RUL prediction performance sufficient to demonstrate lifecycle behaviors. For this portfolio project, the emphasis is on system behavior rather than SOTA metrics.
- Future: explore LSTM/GRU or other sequence models for improved temporal modeling.
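For reference, the standard C-MAPSS training label is derived per engine by counting down to its last observed cycle (each Train-set unit is run to failure, so RUL hits zero at the final cycle). A minimal sketch, with a hypothetical `(unit_id, cycle)` row layout:

```python
def add_rul_labels(rows: list[tuple[int, int]]) -> list[tuple[int, int, int]]:
    """rows: (unit_id, cycle) pairs; returns (unit_id, cycle, rul) where
    RUL = last observed cycle of that unit - current cycle."""
    last_cycle: dict[int, int] = {}
    for unit, cycle in rows:
        # track the final cycle seen for each engine unit
        last_cycle[unit] = max(last_cycle.get(unit, 0), cycle)
    return [(unit, cycle, last_cycle[unit] - cycle) for unit, cycle in rows]
```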
Current performance with RandomForest, simple hyperparameters, full feature engineering, trained on all train data and evaluated on all test data:

- The Train and Test sets have the same number of engine units (with different time series for each); the Test set simply has different 'flight conditions' (different sensor-data distributions over time).
- We train on a subset of the Train data and evaluate on the corresponding units of the Test set.
- The demo starts by training the first model on the first 10% of the Train set (the RMSE baseline is computed on the corresponding 10% of the Test set).
- The demo script then sends data (from the Test set) to the Prediction API and collects feedback (RUL predictions and ground truths).
- The feedback is used to compute metrics and trigger drift detection in another service (see the drift dashboard above).
- If drift is detected, the retraining service retrains on the portion of the Train set corresponding to the Test-set units seen in 'production'.
- As soon as a new model is trained, the prediction service picks up the new model bundle via hot reload.
- The demo continues until the end of the Test set, which is timed to take about 10 minutes.
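One plausible mechanism for the hot reload mentioned above is polling the model bundle's modification time. This is an illustrative sketch under that assumption, not necessarily how the service implements it; in the repo, `loader` would be something like `joblib.load` on the `models/*.joblib` bundles.

```python
from pathlib import Path
from typing import Any, Callable


class HotReloadingModel:
    """Serve a model from disk, reloading it whenever the file changes."""

    def __init__(self, path: str, loader: Callable[[Path], Any]):
        self.path = Path(path)
        self.loader = loader              # e.g. joblib.load for a *.joblib bundle
        self._mtime: float | None = None  # mtime of the currently loaded file
        self._model: Any = None

    def get(self) -> Any:
        mtime = self.path.stat().st_mtime
        if mtime != self._mtime:          # file was replaced by the retraining service
            self._model = self.loader(self.path)
            self._mtime = mtime
        return self._model
```

Calling `get()` on every request keeps the service stateless about versions: whatever the retraining service last wrote is what gets served.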
Prereqs: macOS/Linux, Python 3.13+, Docker, a Kaggle Legacy API Key (`kaggle.json`), and a populated `.env` file.
- Clone and enter the repo

```bash
git clone https://github.com/<you>/Turbofan-ML-lifecycle.git
cd Turbofan-ML-lifecycle
```

- Create `.env` from example (mandatory)

```bash
cp .env.example .env
```

- Create a venv and install deps via uv

```bash
python -m venv .venv && source .venv/bin/activate
python -m pip install uv && uv sync
```

- Configure your Kaggle "Legacy API Key", then download and prepare the data:
  - How to get the token (while logged into your Kaggle account):
    - Go to https://www.kaggle.com/settings
    - Create Legacy API Key → this automatically downloads a `kaggle.json` file
    - Save it to `~/.kaggle/kaggle.json`
  - Then run the following, which downloads the data, prepares it, and runs the feature engineering (mandatory):

```bash
uv run initialize
```

  - If, for any reason, Kaggle authentication fails, just paste

```json
{"username": "your_kaggle_username", "key": "your_api_key_here"}
```

    into your `~/.kaggle/kaggle.json` file.

- Start the stack (build all images)

```bash
docker compose up --build
```

Dashboards:
- Grafana Overview (anonymous access enabled locally)
- Grafana Drift Dashboard (anonymous access enabled locally)
Then wait for all services to be up (see the Grafana 'Overview' dashboard) and, in another terminal, run the demo script:

```bash
source .venv/bin/activate && uv run continuous
```

From here, the demo takes 10 minutes to complete. After that, stop the demo script with Ctrl+C in its terminal, then stop the stack with `docker compose down`.
Other links:
- Prometheus: http://localhost:9090
- MLflow: http://localhost:5000
- Prediction API (BentoML): http://localhost:3000 (used by the demo script)
- Time-series predictive maintenance feature engineering
- Reproducible training/evaluation with experiment tracking (MLflow)
- Model serving with a proper API (BentoML)
- Metrics instrumentation and dashboards (Prometheus + Grafana)
- Drift detection and automated retraining loop
- Containerized, local, production-like environment (Docker Compose)
- Sequence models (LSTM/GRU) for better temporal dynamics
- Hardening for production: CI/CD, more tests, container hardening, Kubernetes (minikube) deployment



