ML Lifecycle

Turbofan Predictive Maintenance

This repository implements a portfolio Machine Learning project for predictive maintenance on turbofan engine data. It demonstrates the full ML lifecycle end-to-end: training, deployment, monitoring, drift detection, and automated retraining — all runnable locally with Docker and Python. The focus is on lifecycle robustness and reproducibility, not squeezing the last decimals of accuracy.

Project Overview

  • Problem: predict Remaining Useful Life (RUL) / failure risk for turbofan engines.
  • Lifecycle stages: training, model serving, feedback collection, monitoring, drift detection, automated retraining.
  • Production-like aspects: microservices (BentoML), metrics (Prometheus), dashboards (Grafana), experiment tracking (MLflow), Docker Compose orchestration, local demo traffic generator.
  • Positioning: end-to-end ML engineering/MLOps project, not state-of-the-art RUL. The project is designed to be fast and to run on any computer, not to produce competitive RUL predictions.
  • Code quality assurance: Tox, Mypy, and GitHub Actions CI/CD.

System Architecture

```mermaid
flowchart TD
    %% User Layer
    Demo[Demo Users<br/>continuous_predict.py]

    %% API Services Layer
    PredAPI[Prediction API<br/>:3000]
    FeedAPI[Feedback API<br/>:3001]
    DriftAPI[Drift Detection<br/>:3003]
    RetrainAPI[Retraining Service<br/>:3004]

    %% Storage Layer
    FeedStore[(Feedback Store<br/>rul_feedback.jsonl)]
    ModelStore[(Model Store<br/>models/*.joblib)]

    %% External Services
    Monitor[Prometheus :9090<br/>Grafana :3002]
    MLflow[MLflow :5000]

    %% Main Flow
    Demo -->|HTTP requests| PredAPI
    PredAPI -->|predictions| FeedAPI
    FeedAPI --> FeedStore
    FeedStore --> DriftAPI
    DriftAPI -->|trigger| RetrainAPI
    RetrainAPI --> ModelStore
    ModelStore -.->|hot reload| PredAPI

    %% Monitoring
    DriftAPI -.-> Monitor
    Monitor -.-> DriftAPI

    %% Experiment Tracking
    RetrainAPI --> MLflow

    %% Styling
    classDef service fill:#1976d2,stroke:#0d47a1,color:#ffffff
    classDef storage fill:#f57c00,stroke:#e65100,color:#ffffff
    classDef external fill:#7b1fa2,stroke:#4a148c,color:#ffffff
    classDef user fill:#388e3c,stroke:#1b5e20,color:#ffffff

    class Demo user
    class PredAPI,FeedAPI,DriftAPI,RetrainAPI service
    class FeedStore,ModelStore storage
    class Monitor,MLflow external
```
  • Demo: a traffic generator that simulates users calling the prediction endpoint concurrently.
  • Serving: BentoML microservices for prediction, feedback, drift detection, and retraining.
  • Feedback: JSONL storage (a flat file, for simplicity) plus basic RUL statistics computed from it.
  • Monitoring: Prometheus metrics from services + Grafana dashboards.
  • Drift detection: PSI/KS-style feature drift metrics and RMSE deltas against the model baseline; can trigger retraining.
  • Local-first: designed to run on a single machine via Docker Compose.
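As a rough sketch of the append-only JSONL feedback store described above (the field names here are hypothetical; the real service may record more per prediction):

```python
import json
from pathlib import Path


def append_feedback(path: Path, unit_id: int, predicted_rul: float, actual_rul: float) -> None:
    """Append one feedback record as a single JSON line (simple, append-only storage)."""
    record = {"unit_id": unit_id, "predicted_rul": predicted_rul, "actual_rul": actual_rul}
    path.parent.mkdir(parents=True, exist_ok=True)
    with path.open("a") as f:
        f.write(json.dumps(record) + "\n")


def load_feedback(path: Path) -> list[dict]:
    """Read all feedback records back for stats or drift computation."""
    if not path.exists():
        return []
    return [json.loads(line) for line in path.read_text().splitlines() if line]
```

A flat JSONL file keeps writes atomic-enough for a local demo while staying trivially inspectable with standard tools.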

Tech Stack

  • Python (pydantic, pandas, numpy, scikit-learn)
  • Serving: BentoML
  • Monitoring: Prometheus + Grafana (pre-provisioned dashboards)
  • Experiment tracking: MLflow (for runs/metrics; not using Model Registry)
  • Orchestration: Docker & Docker Compose

Dashboards and explanations

Overview

Overview Monitoring

Shows the health of each service, the number of predictions made, and errors (generated from bad input in the demo script).
Each color is a different trained model.
Vertical purple lines mark retraining triggers.

Drift

Drift Monitoring

Grafana dashboard (10-minute window) for drift metrics and retraining triggers.

Metrics

  • RMSE baseline: RMSE over the last minute of predictions vs. actual RUL, compared to the baseline of the last trained model. Crossing the warning level (orange line) triggers retraining.
  • KS: Kolmogorov-Smirnov statistic for feature drift (threshold KS > 0.1). Crossing the warning level (orange line) triggers retraining.
  • PSI: percentage of features that have shifted significantly (PSI > 0.1). Crossing the warning level (orange line) triggers retraining.

Each orange rectangle signifies a drift signal, calling for a retraining.
Each purple line marks a newly trained model (v1.0, v2.0, v3.0, etc.).
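A minimal sketch of how PSI and KS drift checks like these can be computed with NumPy/SciPy (the thresholds match the 0.1 values above; the decile binning and epsilon smoothing are common choices, not necessarily the project's exact implementation):

```python
import numpy as np
from scipy.stats import ks_2samp


def psi(baseline: np.ndarray, current: np.ndarray, bins: int = 10) -> float:
    """Population Stability Index between two samples of one feature.
    Buckets are decile edges of the baseline; epsilon avoids log(0)."""
    edges = np.quantile(baseline, np.linspace(0, 1, bins + 1))
    edges[0], edges[-1] = -np.inf, np.inf  # catch values outside the baseline range
    p = np.histogram(baseline, bins=edges)[0] / len(baseline)
    q = np.histogram(current, bins=edges)[0] / len(current)
    eps = 1e-6
    return float(np.sum((p - q) * np.log((p + eps) / (q + eps))))


def feature_drifted(baseline: np.ndarray, current: np.ndarray,
                    psi_thresh: float = 0.1, ks_thresh: float = 0.1) -> bool:
    """Flag drift when either the PSI or the KS statistic crosses its threshold."""
    ks_stat = ks_2samp(baseline, current).statistic
    return psi(baseline, current) > psi_thresh or ks_stat > ks_thresh
```

Both statistics are distribution-free, which is why they work on raw sensor features without assuming normality.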

Without any retraining, it looks like this:

Drift Monitoring no retrain

Data & Modeling

Dataset: NASA C-MAPSS turbofan engine degradation time series (multiple units, cycles, sensor readings).
Source: https://www.kaggle.com/datasets/behrad3d/nasa-cmaps

turbofan

  • Features: per-engine time-based features on multiple cycles (3 settings, 21 sensors).
  • The dataset ships with a train split and a test split; ground-truth RUL is available for both.
  • Model: Random Forest for speed/simplicity and quick iterations. Model bundles and feature names are tracked and hot-reloaded by the prediction service.
  • Results: RUL prediction performance is sufficient to demonstrate lifecycle behaviors; the emphasis is on system behavior rather than SOTA metrics (this is a portfolio project).
  • Future: explore LSTM/GRU or other sequence models for better temporal modeling.
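A rough sketch of what training and bundling a Random Forest with its feature names might look like (the bundle layout, the `version` field, and the hyperparameters are illustrative assumptions):

```python
import joblib
import numpy as np
from sklearn.ensemble import RandomForestRegressor


def train_and_bundle(X: np.ndarray, y: np.ndarray,
                     feature_names: list[str], path: str) -> dict:
    """Fit a small Random Forest and save a self-describing model bundle."""
    model = RandomForestRegressor(n_estimators=50, random_state=42)
    model.fit(X, y)
    bundle = {"model": model, "feature_names": feature_names, "version": "v1.0"}
    joblib.dump(bundle, path)
    return bundle


def load_bundle(path: str) -> dict:
    """The prediction service can reload this bundle whenever the file changes."""
    return joblib.load(path)
```

Storing the feature names next to the model lets the serving layer validate and order incoming sensor columns before predicting.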

Current performance with a Random Forest, simple hyperparameters, all feature engineering, training on all train data and testing on all test data:
rul-results

Demo explanation - drift strategy

  • The Train and Test sets have the same number of engine units (but different time series for each). The Test set simply has different 'flight conditions' (different distributions of sensor data over time).
  • We train on a subset of the Train data and evaluate on the corresponding units of the Test set.
  • The demo starts with the first 10% of the Train set to train the first model (with the RMSE baseline computed on the corresponding 10% of the Test set).
  • The demo script then sends data (from the Test set) to the Prediction API and collects feedback (RUL predictions and ground truths).
  • The feedback is used to compute metrics and trigger drift detection in another service (see the drift dashboard above).
  • If drift is detected, the retraining service retrains on the Train-set units corresponding to the Test-set units seen in 'production'.
  • As soon as a new model is trained, the prediction service picks up the new model bundle via hot reload.
  • The demo continues until the end of the Test set, which is programmed to last 10 minutes.
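The hot-reload step can be as simple as polling the bundle file's modification time. A stdlib-only sketch of that pattern (the real prediction service may watch the model store differently):

```python
import os
from typing import Callable, Any


class HotReloader:
    """Reload a model bundle whenever its file's mtime changes.

    `loader` is any function that turns a path into a model object
    (e.g. joblib.load); the polling-on-read scheme is an assumption.
    """

    def __init__(self, path: str, loader: Callable[[str], Any]):
        self.path = path
        self.loader = loader
        self._mtime: float | None = None
        self.model: Any = None

    def get(self) -> Any:
        """Return the current model, reloading it first if the file changed."""
        mtime = os.path.getmtime(self.path)
        if mtime != self._mtime:
            self._mtime = mtime
            self.model = self.loader(self.path)
        return self.model
```

Checking the mtime on every request is cheap and avoids restarting the service when the retrainer writes a new bundle.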

Quickstart

Prereqs: macOS/Linux, Python 3.13+, Docker, a Kaggle Legacy API Key (.json), and a populated .env file.

  1. Clone and enter the repo
git clone https://github.com/<you>/Turbofan-ML-lifecycle.git
cd Turbofan-ML-lifecycle
  2. Create .env from the example (mandatory)
cp .env.example .env
  3. Create a venv and install deps via uv
python -m venv .venv && source .venv/bin/activate
python -m pip install uv && uv sync
  4. Configure a Kaggle "Legacy API Key", then download & prepare the data:
    • How to get the token (while logged in to your Kaggle account):
    • Then run:
      uv run initialize
      This downloads the data, prepares it, and performs feature engineering (mandatory).

If, for any reason, Kaggle authentication fails, just paste:

{"username": "your_kaggle_username","key": "your_api_key_here"}

in your ~/.kaggle/kaggle.json file.

  5. Start the stack (build all images)
docker compose up --build

Dashboards:

Then wait for all services to be up (see the Grafana 'Overview' dashboard) and, in another terminal, run the demo script:

source .venv/bin/activate && uv run continuous

From here, the demo takes 10 minutes to complete.

After that, you can stop the demo script with Ctrl+C in the terminal running it, and then stop the stack with docker compose down.

Other links:

What this project contains

  • Time-series predictive maintenance feature engineering
  • Reproducible training/evaluation with experiment tracking (MLflow)
  • Model serving with a proper API (BentoML)
  • Metrics instrumentation and dashboards (Prometheus + Grafana)
  • Drift detection and automated retraining loop
  • Containerized, local, production-like environment (Docker Compose)

Roadmap / Future Work

  • Sequence models (LSTM/GRU) for better temporal dynamics
  • Hardening for production: CI/CD, more tests, container hardening, k8s (mini kube) deployment

About

Production-grade machine learning system demonstrating the entire ML lifecycle, including training, deployment, monitoring, automated drift detection, and retraining. Uses real-world data from Kaggle: the NASA Turbofan Engine Degradation Dataset.
