Kaiha0/WQD7009-BDAA-A-Predictive-Model-for-Heart-Disease-Risk-Factors-using-a-Cloud-Based-Architecture

Developing a Predictive Model for Heart Disease Risk Factors using a Cloud-Based Architecture

This project implements a comprehensive, cloud-native data lifecycle on Google Cloud Platform (GCP) to identify and predict heart disease risk factors through a multi-layer hybrid architecture.

Table of Contents

  1. Project Overview
  2. System Architecture
  3. Dataset Description
  4. Implementation Phases
  5. Performance Evaluation
  6. Conclusion
  7. Dataset Description & Access
  8. Disclaimer

Project Overview

The project addresses the need for scalable and efficient predictive systems in healthcare. By leveraging GCP, it provides a framework for managing large datasets and ensuring model interpretability to improve preventive healthcare strategies.

Key Objectives:

  • Design a robust architecture for high-volume data processing.
  • Implement ML algorithms to identify critical heart disease risk factors.
  • Validate performance using clinically relevant metrics like ROC-AUC.
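As a reference point for the ROC-AUC objective, the metric can be read as the probability that a randomly chosen positive case receives a higher score than a randomly chosen negative one. The following is a minimal pure-Python sketch of that definition, not the project's evaluation code; the function name `roc_auc` and the toy inputs are illustrative assumptions. For real workloads a library routine such as scikit-learn's `roc_auc_score` is the usual choice.

```python
def roc_auc(labels, scores):
    """ROC-AUC as the chance that a random positive outranks a random negative."""
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    if not positives or not negatives:
        raise ValueError("ROC-AUC needs at least one positive and one negative label")

    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0   # positive ranked above negative
            elif p == n:
                wins += 0.5   # ties count as half a win
    return wins / (len(positives) * len(negatives))


# Toy labels and scores, made up for illustration only.
print(roc_auc([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8]))  # 0.75
```

The pairwise loop is quadratic in the number of samples, so it is only suitable for small checks like this one.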

System Architecture

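The architecture follows a Five-Layer Data Lifecycle designed to transform raw input into actionable medical insights. The five layers correspond to the five implementation phases (ingestion, processing, structured storage and analytics, advanced modeling, and visualization), and the data itself moves through three storage tiers: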

  • Bronze Layer: Raw data ingestion.
  • Silver Layer: Cleaned and structured data.
  • Gold Layer: Enriched data with model predictions.

Dataset Description

  • Source: heart_2022_no_nans.csv from Kaggle.
  • Format: Raw CSV (Initial) → Cleaned BigQuery Tables (Silver) → Enriched Prediction Tables (Gold).

Implementation Phases

Phase 1: Data Ingestion (Bronze Layer)

  • Tool: Google Cloud Storage (GCS).
  • Action: Established the cardiopredict-bronzedata-12345 bucket to store raw CSV data and PySpark scripts as read-only "Bronze" files.

Phase 2: Data Processing (Silver Layer)

  • Tool: Dataproc with PySpark.
  • Action: Deployed a Spark cluster to resolve data type issues and convert categorical variables into numerical values.
  • Efficiency: The preprocessing job completed in approximately 1.9 minutes.
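The categorical-to-numerical conversion in this phase might look roughly like the sketch below. It assumes the categorical columns hold "Yes"/"No" strings, which the README does not state, and the function names are hypothetical; the actual logic lives in src/dataproc_heart_analysis.py. The first helper shows the rule in plain Python, and the second applies the same rule to a Spark DataFrame.

```python
def yes_no_to_int(value):
    """Map a 'Yes'/'No' string to 1/0; return None for anything else."""
    if value is None:
        return None
    return {"yes": 1, "no": 0}.get(str(value).strip().lower())


def encode_binary_columns(df, columns):
    """Apply the same Yes/No -> 1/0 rule to the named Spark DataFrame columns."""
    # Imported lazily so the pure-Python helper above works without PySpark.
    from pyspark.sql import functions as F

    for name in columns:
        normalized = F.lower(F.trim(F.col(name)))
        df = df.withColumn(
            name,
            # Values other than yes/no become NULL because there is no otherwise().
            F.when(normalized == "yes", 1).when(normalized == "no", 0).cast("int"),
        )
    return df
```

Leaving unmatched values as NULL rather than defaulting them to 0 makes unexpected categories visible in later quality checks.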

Phase 3: Structured Storage & Analytics

  • Tool: BigQuery & BigQuery ML.
  • Action: Created the heart_analytics_dataset for structured storage.
  • Analytics: Used SQL commands to train a Logistic Regression model for feature selection, retaining features with weights > 0.3.
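The training and feature-selection step could be sketched as follows. The dataset name heart_analytics_dataset comes from this README; the model name, table name, and label column are placeholders that the caller must supply, and the example weights are made up. The SQL uses BigQuery ML's `CREATE MODEL` with `model_type = 'LOGISTIC_REG'` and the `ML.WEIGHTS` function. The README states the rule as "weights > 0.3" without saying whether negative weights were considered, so the filter applies it literally.

```python
def build_training_sql(dataset, model, table, label_col):
    """Return a BigQuery ML statement that trains a logistic regression model."""
    return (
        f"CREATE OR REPLACE MODEL `{dataset}.{model}`\n"
        f"OPTIONS (model_type = 'LOGISTIC_REG', "
        f"input_label_cols = ['{label_col}']) AS\n"
        f"SELECT * FROM `{dataset}.{table}`"
    )


def build_weights_sql(dataset, model):
    """Return a query that lists the learned weight of each input feature."""
    return (
        "SELECT processed_input, weight "
        f"FROM ML.WEIGHTS(MODEL `{dataset}.{model}`)"
    )


def select_features(weights, threshold=0.3):
    """Keep features whose weight exceeds the threshold, skipping the intercept."""
    return [
        name
        for name, weight in weights
        if name != "__INTERCEPT__" and weight is not None and weight > threshold
    ]


# Hypothetical (feature, weight) rows as ML.WEIGHTS might return them.
rows = [("feature_a", 0.52), ("feature_b", 0.08), ("__INTERCEPT__", -1.9)]
print(select_features(rows))  # ['feature_a']
```

The generated SQL strings can be pasted into the BigQuery console or submitted through a client library.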

Phase 4: Advanced Modeling (Gold Layer)

  • Tool: Vertex AI Workbench.
  • Technique: Applied SMOTE (Synthetic Minority Over-sampling Technique) to balance target labels and trained a final Logistic Regression model.
  • Storage: Prediction results were saved as the heart_analytics_gold table.
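To illustrate what SMOTE does, the sketch below generates synthetic minority samples by interpolating between a minority point and one of its nearest minority neighbours. It is a simplified, pure-Python illustration under assumed parameter names (`n_new`, `k`, `seed`), not the code used in this project; the README does not say which SMOTE implementation was used, and in practice a library such as imbalanced-learn is the common choice.

```python
import math
import random


def smote_sketch(minority, n_new, k=5, seed=0):
    """Create n_new synthetic points by interpolating between minority neighbours."""
    if len(minority) < 2:
        raise ValueError("need at least two minority samples to interpolate")
    rng = random.Random(seed)
    k = min(k, len(minority) - 1)

    synthetic = []
    for _ in range(n_new):
        i = rng.randrange(len(minority))
        base = minority[i]
        # k nearest minority neighbours of the base point (excluding itself).
        others = [p for j, p in enumerate(minority) if j != i]
        neighbours = sorted(others, key=lambda p: math.dist(base, p))[:k]
        neighbour = rng.choice(neighbours)
        gap = rng.random()  # position along the segment base -> neighbour
        synthetic.append([b + gap * (n - b) for b, n in zip(base, neighbour)])
    return synthetic


# Toy minority-class points, made up for illustration only.
points = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
new_points = smote_sketch(points, n_new=6, k=2)
print(len(new_points))  # 6
```

Oversampling of this kind is normally applied to the training split only, so that evaluation data keeps its original class balance.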

Phase 5: Data Visualization

  • Tool: Power BI Desktop.
  • Action: Linked Power BI directly to BigQuery via a cloud connector to create interactive dashboards showing risk levels across demographics.

Performance Evaluation

  • Processing Speed: Dataproc tasks achieved maximum parallelism across 4 CPU cores.
  • Memory Management: Disk usage remained at 0.0 B, signifying that all operations were handled entirely in memory.
  • SQL Performance: Most analytical queries were completed in approximately 2 seconds.

Conclusion

The practical implementation confirms that the hybrid cloud architecture successfully transforms raw healthcare data into high-quality insights in a reliable and scalable manner.


Dataset Description & Access

Due to the file size (approx. 80 MB), the raw dataset is not hosted directly in this repository.

How to add the data to your GCP Environment:

  1. Download heart_2022_no_nans.csv from Kaggle (see Dataset Description).
  2. Upload the file to your Google Cloud Storage (GCS) bucket named cardiopredict-bronzedata-12345.
  3. Ensure the file path in src/dataproc_heart_analysis.py matches your GCS URI.
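The upload in step 2 can also be done from Python. The sketch below assumes the google-cloud-storage package is installed and application credentials are configured; the helper names are hypothetical. It returns the gs:// URI so it can be compared with the path used in src/dataproc_heart_analysis.py.

```python
def gcs_uri(bucket, object_name):
    """Build the gs:// URI for an object in a bucket."""
    return f"gs://{bucket}/{object_name.lstrip('/')}"


def upload_dataset(local_path, bucket, object_name):
    """Upload a local file to GCS and return its gs:// URI."""
    # Imported lazily; needs the google-cloud-storage package and credentials.
    from google.cloud import storage

    object_name = object_name.lstrip("/")
    client = storage.Client()
    blob = client.bucket(bucket).blob(object_name)
    blob.upload_from_filename(local_path)
    return gcs_uri(bucket, object_name)


print(gcs_uri("cardiopredict-bronzedata-12345", "heart_2022_no_nans.csv"))
# gs://cardiopredict-bronzedata-12345/heart_2022_no_nans.csv
```

GCS bucket names are globally unique, so if cardiopredict-bronzedata-12345 is already taken you will need a different bucket name and must update the URI in the script accordingly.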

Disclaimer

This project was developed for educational purposes. While commercial usage is welcomed, the author is not liable for any losses or GCP service charges incurred due to the use of this repository. Users are responsible for monitoring their own cloud billing.
