This project implements a comprehensive, cloud-native data lifecycle on Google Cloud Platform (GCP) to identify and predict heart disease risk factors through a multi-layer hybrid architecture.
- Project Overview
- System Architecture
- Dataset Description
- Implementation Phases
- Performance Evaluation
- Conclusion
- Dataset Description & Access
- Disclaimer
## Project Overview

The project addresses the need for scalable and efficient predictive systems in healthcare. By leveraging GCP, it provides a framework for managing large datasets and ensuring model interpretability to improve preventive healthcare strategies.
Key Objectives:
- Design a robust architecture for high-volume data processing.
- Implement ML algorithms to identify critical heart disease risk factors.
- Validate performance using clinically relevant metrics like ROC-AUC.
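As a sketch of the validation objective, ROC-AUC can be computed with scikit-learn. The labels and scores below are illustrative numbers only, not results from this project:

```python
from sklearn.metrics import roc_auc_score

# Illustrative ground-truth labels (1 = heart disease) and
# predicted risk scores from a hypothetical classifier.
y_true = [0, 0, 1, 1]
y_score = [0.1, 0.4, 0.35, 0.8]

# ROC-AUC: probability that a random positive case is ranked
# above a random negative case.
auc = roc_auc_score(y_true, y_score)
print(auc)  # 0.75
```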
## System Architecture

The architecture follows a five-layer data lifecycle (GCS ingestion, Dataproc preprocessing, BigQuery analytics, Vertex AI modeling, and Power BI visualization) designed to transform raw input into actionable medical insights. Along the way, the data moves through a medallion structure:

- Bronze Layer: Raw data ingestion.
- Silver Layer: Cleaned and structured data.
- Gold Layer: Enriched data with model predictions.
## Dataset Description

- Source: `heart_2022_no_nans.csv` from Kaggle.
- Format: Raw CSV (Initial) → Cleaned BigQuery Tables (Silver) → Enriched Prediction Tables (Gold).
## Implementation Phases

- Tool: Google Cloud Storage (GCS).
- Action: Established the `cardiopredict-bronzedata-12345` bucket to store raw CSV data and PySpark scripts as read-only "Bronze" files.
- Tool: Dataproc with PySpark.
- Action: Deployed a Spark cluster to resolve data type issues and convert categorical variables into numerical values.
- Efficiency: The preprocessing job completed in approximately 1.9 minutes.
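The categorical-to-numerical conversion that the Spark job performs can be sketched outside the cluster with pandas. The column names and value mappings below are assumptions for illustration, not the actual schema used in `src/dataproc_heart_analysis.py`:

```python
import pandas as pd

# Toy rows mimicking the Kaggle file's Yes/No-style columns
# (column names here are illustrative, not the script's schema).
df = pd.DataFrame({
    "Smoking": ["Yes", "No", "Yes"],
    "PhysicalActivity": ["No", "Yes", "Yes"],
    "Sex": ["Male", "Female", "Male"],
})

# Map binary categoricals to 0/1, as the PySpark job does at scale.
binary_map = {"Yes": 1, "No": 0, "Male": 1, "Female": 0}
encoded = df.apply(lambda col: col.map(binary_map))
```

In the actual pipeline the same idea is expressed with PySpark transformations so it parallelizes across the Dataproc cluster.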
- Tool: BigQuery & BigQuery ML.
- Action: Created the `heart_analytics_dataset` for structured storage.
- Analytics: Used SQL commands to train a Logistic Regression model for feature selection, retaining features with weights > 0.3.
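The weight-based cutoff can be illustrated in plain Python. The coefficient values below are made up for the example, and reading the cutoff as an absolute-value threshold (so that strongly protective features are also kept) is an assumption:

```python
# Hypothetical logistic-regression weights per feature
# (invented values; the real ones come from the BigQuery ML model).
weights = {
    "AgeCategory": 0.87,
    "Smoking": 0.52,
    "PhysicalActivity": -0.41,  # protective, hence negative
    "BMI": 0.29,
    "SleepHours": 0.12,
}

# Keep features whose weight magnitude exceeds 0.3.
selected = [f for f, w in weights.items() if abs(w) > 0.3]
print(selected)  # ['AgeCategory', 'Smoking', 'PhysicalActivity']
```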
- Tool: Vertex AI Workbench.
- Technique: Applied SMOTE (Synthetic Minority Over-sampling Technique) to balance target labels and trained a final Logistic Regression model.
- Storage: Prediction results were saved as the `heart_analytics_gold` table.
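SMOTE itself comes from the imbalanced-learn library; its core idea, synthesizing new minority-class samples by interpolating toward nearest neighbours, can be sketched with NumPy alone. This is a minimal illustration of the technique, not the project's actual training code:

```python
import numpy as np

def smote_like_oversample(X_min, n_new, k=2, seed=0):
    """Minimal SMOTE-style oversampling: each synthetic point lies on
    the segment between a minority sample and one of its k nearest
    minority neighbours."""
    rng = np.random.default_rng(seed)
    X_min = np.asarray(X_min, dtype=float)
    out = []
    for _ in range(n_new):
        i = rng.integers(len(X_min))
        # Distances from sample i to all other minority samples.
        d = np.linalg.norm(X_min - X_min[i], axis=1)
        d[i] = np.inf  # exclude the point itself
        # Pick one of the k nearest neighbours at random.
        nn = rng.choice(np.argsort(d)[:k])
        # Interpolate a random fraction of the way toward it.
        out.append(X_min[i] + rng.random() * (X_min[nn] - X_min[i]))
    return np.vstack(out)
```

After balancing, the final Logistic Regression model is trained on the combined real and synthetic minority samples.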
- Tool: Power BI Desktop.
- Action: Linked Power BI directly to BigQuery via a cloud connector to create interactive dashboards showing risk levels across demographics.
## Performance Evaluation

- Processing Speed: Dataproc tasks achieved maximum parallelism across 4 CPU cores.
- Memory Management: Disk usage remained at 0.0 B, signifying that all operations were handled entirely in memory.
- SQL Performance: Most analytical queries were completed in approximately 2 seconds.
## Conclusion

The practical implementation confirms that the hybrid cloud architecture successfully transforms raw healthcare data into high-quality insights in a reliable and scalable manner.
## Dataset Description & Access

Due to the file size (approx. 80MB), the raw dataset is not hosted directly in this repository.
- Dataset Source: Kaggle - Personal Key Indicators of Heart Disease
- File Name: `heart_2022_no_nans.csv`
- Download `heart_2022_no_nans.csv` from the link above.
- Upload the file to your Google Cloud Storage (GCS) bucket named `cardiopredict-bronzedata-12345`.
- Ensure the file path in `src/dataproc_heart_analysis.py` matches your GCS URI.
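A possible way to do the upload from the command line, assuming the Google Cloud SDK is installed and authenticated (replace the bucket name with your own if it differs):

```shell
# Upload the raw dataset to the Bronze bucket, then verify it landed.
gsutil cp heart_2022_no_nans.csv gs://cardiopredict-bronzedata-12345/
gsutil ls gs://cardiopredict-bronzedata-12345/
```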
## Disclaimer

This project was developed for educational purposes. While commercial usage is welcomed, the author is not liable for any losses or GCP service charges incurred due to the use of this repository. Users are responsible for monitoring their own cloud billing.