A predictive analytics study demonstrating the application of Ordinary Least Squares (OLS) Regression to estimate academic performance based on temporal study patterns.
Authors · Overview · Features · Structure · Results · Quick Start · Usage Guidelines · License · About · Acknowledgments
Important
Special thanks to Mega Satish for her meaningful contributions, guidance, and support that helped shape this work.
Supervised Machine Learning - Task 1 is a foundational Data Science exploration conducted under the Graduate Rotational Internship Program (GRIP) at The Sparks Foundation. The project establishes a univariate linear model to quantify the correlation between study duration (Hours) and academic outcome (Scores).
By leveraging Scikit-Learn's predictive algorithms, the system minimizes the sum of squared residuals to derive a best-fit line, enabling precise scalar predictions for arbitrary numerical inputs (e.g., predicting scores for 9.25 hours of study).
The simulation is governed by strict statistical principles ensuring reproducibility and accuracy:
- Linear Approximation: establishing a linear relationship $y = mx + c$, where $y$ is the predicted score and $x$ is study hours.
- Residual Minimization: utilizing the OLS algorithm to optimize the coefficient (slope) and intercept.
- Predictive Inference: generating a specific scalar output for the internship query: What will be the predicted score if a student studies for 9.25 hrs/day?
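The three principles above can be sketched in a few lines of Scikit-Learn. The inline sample below is a small illustrative subset, not the full dataset, so the fitted coefficients are approximate.

```python
# Minimal sketch of the Task 1 pipeline: fit an OLS line y = m*x + c
# and answer the internship query for 9.25 study hours.
import numpy as np
from sklearn.linear_model import LinearRegression

# Illustrative (Hours, Scores) sample; the notebook loads the full CSV.
hours = np.array([2.5, 5.1, 3.2, 8.5, 3.5, 1.5, 9.2, 5.5, 8.3, 2.7]).reshape(-1, 1)
scores = np.array([21, 47, 27, 75, 30, 20, 88, 60, 81, 25])

model = LinearRegression()   # ordinary least squares
model.fit(hours, scores)     # optimizes slope (m) and intercept (c)

predicted = model.predict([[9.25]])[0]
print(f"Slope: {model.coef_[0]:.2f}, Intercept: {model.intercept_:.2f}")
print(f"Predicted score for 9.25 hrs: {predicted:.2f}")
```

Fitting minimizes the sum of squared residuals, so the slope directly answers "how many additional percentage points per extra study hour".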
Tip
Model Applicability: While this linear model provides a precise mathematical estimation for the given range, extrapolating predictions beyond the observed data range (e.g., studying > 15 hours/day) may yield unrealistic results due to the physical constraints of a 24-hour day.
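One way to enforce the applicability caveat above is to refuse inputs outside the observed range. This guard is an illustrative addition, not part of the original notebook, and the sample data below is a small stand-in.

```python
# Guarded prediction: raise instead of silently extrapolating beyond
# the observed study-hour range (illustrative sketch).
import numpy as np
from sklearn.linear_model import LinearRegression

X = np.array([[1.5], [3.2], [5.5], [8.5], [9.2]])  # illustrative sample
y = np.array([20, 27, 60, 75, 88])
model = LinearRegression().fit(X, y)

def guarded_predict(hours, lo=float(X.min()), hi=float(X.max())):
    """Refuse inputs outside the observed range to avoid extrapolation."""
    if not lo <= hours <= hi:
        raise ValueError(f"{hours} hrs lies outside the observed range [{lo}, {hi}]")
    return float(model.predict([[hours]])[0])
```

A request such as `guarded_predict(15)` then fails loudly rather than returning a score above 100%.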
| Component | Technical Description |
|---|---|
| Ingestion Pipeline | Automated data retrieval and parsing using Pandas from remote HTTP endpoints. |
| Exploratory Analysis | Visualizing distribution and correlation via Matplotlib and Seaborn scatter plots. |
| Model Architecture | Implementation of LinearRegression from Scikit-Learn for OLS optimization. |
| Evaluation Metrics | Quantitative assessment using Mean Absolute Error (MAE) to validate model precision. |
| Inference Engine | Direct scalar injection logic to predict outcomes for specific user-defined inputs. |
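The ingestion pipeline row above amounts to a single Pandas call; `read_csv` accepts a remote HTTP URL, a local path, or any file-like object interchangeably. The in-memory payload below stands in for the remote response in this sketch.

```python
# Sketch of the ingestion step: Pandas parses the CSV identically
# whether it comes from an HTTP endpoint or a local file.
import io
import pandas as pd

csv_payload = "Hours,Scores\n2.5,21\n5.1,47\n3.2,27\n"  # illustrative rows
df = pd.read_csv(io.StringIO(csv_payload))  # same call works with a URL or path

print(df.shape)              # rows x columns
print(df.columns.tolist())   # ['Hours', 'Scores']
```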
Note
The dataset consists of a bivariate distribution (Hours vs. Scores). The high correlation coefficient observed during EDA justifies the selection of a Linear Regression model over more complex polynomial or ensemble approaches, adhering to the principle of parsimony (Occam's Razor) in machine learning design.
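The parsimony argument can be checked numerically: a Pearson coefficient close to 1 indicates the relationship is already well captured by a straight line, so higher-order models would add complexity without explanatory gain. The values below are an illustrative sample, not the full dataset.

```python
# Quantifying the correlation that justifies a linear model.
import numpy as np

hours = np.array([2.5, 5.1, 3.2, 8.5, 3.5, 1.5, 9.2, 5.5, 8.3, 2.7])
scores = np.array([21, 47, 27, 75, 30, 20, 88, 60, 81, 25])

r = np.corrcoef(hours, scores)[0, 1]  # Pearson correlation coefficient
print(f"Pearson r: {r:.3f}")
```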
- Runtime: Python 3.x
- Data Manipulation: Pandas, NumPy
- Visualization: Matplotlib, Seaborn
- Machine Learning: Scikit-Learn (sklearn)
- Environment: Jupyter Notebook / Google Colab
TSF-SUPERVISED-MACHINE-LEARNING/
│
├── docs/ # Technical Documentation
│ └── SPECIFICATION.md # Architecture & Design Specification
│
├── Mega/ # Archival Attribution Assets
│ ├── Filly.jpg # Companion (Filly)
│ ├── Mega.png # Author Profile Image (Mega Satish)
│ └── ... # Additional Attribution Files
│
├── Source Code/ # Core Implementation
│ └── TSF_INTERNSHIP_TASK_1_SUPERVISED_LEARNING.ipynb # Jupyter Notebook (Analysis Kernel)
│
├── The Sparks Foundation/ # Internship Artifacts
│ └── Task_1_Dataset.csv # Empirical Data Source
│
├── .gitattributes # Git configuration
├── .gitignore # Repository Filters
├── CITATION.cff # Scholarly Citation Metadata
├── codemeta.json # Machine-Readable Project Metadata
├── LICENSE # MIT License Terms
├── README.md # Project Documentation
└── SECURITY.md                     # Security Policy
1. Exploratory Scatter: Hours vs Scores
Initial scatter plot revealing the strong positive correlation.
2. Feature Distribution: Scores Analysis
Statistical distribution of the target variable (Percentage Scored).
3. Regression Fit: Hours vs Scores
OLS Regression line fitted to the training data.
4. Model Training: Fitting the Line
Visualizing the linear approximation on Training Data.
5. Model Validation: Testing the Fit
Validation of the regression line against unseen Test Data.
Evaluation Metrics
Mean Absolute Error: 4.18 | R² Score: 0.945
Final Inference
Input: 9.25 Hours → Predicted Score: 93.69%
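Both reported metrics come from `sklearn.metrics`; the arrays below are hypothetical stand-ins for the held-out test split, chosen only to illustrate the calls, not the project's actual predictions.

```python
# How MAE and R² are computed on the test split (illustrative values).
from sklearn.metrics import mean_absolute_error, r2_score

y_true = [20, 27, 69, 30, 62]            # hypothetical actual scores
y_pred = [16.9, 33.7, 75.4, 26.8, 60.2]  # hypothetical model predictions

mae = mean_absolute_error(y_true, y_pred)  # mean of |actual - predicted|
r2 = r2_score(y_true, y_pred)              # 1 - SS_res / SS_tot
print(f"MAE: {mae:.2f}, R2: {r2:.3f}")
```

MAE is expressed in the same units as the target (percentage points), which makes it directly interpretable: on average, predictions miss the true score by about that many points.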
- Python 3.7+: Required for runtime execution. Download Python
- Jupyter Environment: For interactive code execution (JupyterLab or Notebook).
Warning
Data Path Integrity
The analysis kernel relies on precise relative file paths. Ensure Task_1_Dataset.csv remains within The Sparks Foundation/ directory. Modifying the directory structure without updating the ingestion logic will result in FileNotFoundError during runtime.
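A defensive loader can surface the path problem with a clearer message than a raw traceback. This helper is an illustrative addition, not code from the notebook; the default path matches the repository layout described above.

```python
# Fail fast with an actionable message if the dataset has been moved.
from pathlib import Path
import pandas as pd

def load_dataset(path="The Sparks Foundation/Task_1_Dataset.csv"):
    """Load the Task 1 CSV, raising a descriptive error if it is missing."""
    dataset = Path(path)
    if not dataset.exists():
        raise FileNotFoundError(
            f"Expected dataset at {dataset}; update the path if the layout changed."
        )
    return pd.read_csv(dataset)
```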
Establish the local environment by cloning the repository and installing the computational stack:
# Clone the repository
git clone https://github.com/Amey-Thakur/TSF-SUPERVISED-MACHINE-LEARNING.git
cd TSF-SUPERVISED-MACHINE-LEARNING
# Install predictive modeling dependencies
pip install pandas numpy matplotlib seaborn scikit-learn
Launch the analysis kernel to reproduce the findings:
jupyter notebook "Source Code/TSF_INTERNSHIP_TASK_1_SUPERVISED_LEARNING.ipynb"
Tip
Interactive Predictive Analytics | Student Score Estimation
Explore the high-fidelity Live Demo to visualize the Ordinary Least Squares (OLS) regression analysis in real-time. The interactive dashboard showcases Exploratory Data Analysis (EDA), Feature Distribution, and Model Validation results, quantifying the strong positive correlation between study hours and academic outcomes with an R² score of 0.945.
This repository is openly shared to support learning and knowledge exchange across the academic community.
For Students
Use this project as reference material for understanding supervised learning pipelines, univariate regression, and statistical predictive modeling. The source code is available for study to facilitate self-paced learning and exploration of OLS optimization and residual analysis.
For Educators
This project may serve as a practical lab example or supplementary teaching resource for Data Science and Applied Statistics courses. Attribution is appreciated when utilizing content.
For Researchers
The documentation and architectural approach may provide insights into academic project structuring, predictive inference, and industrial internship artifacts.
This academic submission, developed for the Graduate Rotational Internship Program (GRIP) at The Sparks Foundation, is made available under the MIT License. See the LICENSE file for complete terms.
Note
Summary: You are free to share and adapt this content for any purpose, even commercially, as long as you provide appropriate attribution to the original authors.
Copyright © 2021 Amey Thakur & Mega Satish
Created & Maintained by: Amey Thakur & Mega Satish
Role: Data Science & Business Analytics Interns
Program: Graduate Rotational Internship Program (GRIP)
Organization: The Sparks Foundation
This project features Supervised Machine Learning - Task 1, a predictive analytics study conducted as part of the GRIP Internship. It explores the application of linear regression to solve real-world estimation problems.
Connect: GitHub · LinkedIn · ORCID
Grateful acknowledgment to Mega Satish for her exceptional collaboration and scholarly partnership during the execution of this data science internship task. Her analytical precision, deep understanding of statistical modeling, and constant support were instrumental in refining the predictive algorithms used in this study. Working alongside her was a transformative experience; her thoughtful approach to problem-solving and steady encouragement turned complex regression challenges into meaningful learning moments. This work reflects the growth and insights gained from our side-by-side academic journey. Thank you, Mega, for everything you shared and taught along the way.
Special thanks to the mentors at The Sparks Foundation for providing this platform for rapid skill development and industrial exposure.