Supervised Machine Learning

License: MIT  ·  Developed by Amey Thakur and Mega Satish

A predictive analytics study that applies Ordinary Least Squares (OLS) Regression to estimate academic performance from daily study hours.

Google Colab  ·  Kaggle Notebook  ·  Video Demo  ·  Live Demo

TSF Supervised Machine Learning Demo


Authors  ·  Overview  ·  Features  ·  Structure  ·  Results  ·  Quick Start  ·  Usage Guidelines  ·  License  ·  About  ·  Acknowledgments


Important

🤝🏻 Special Acknowledgement

Special thanks to Mega Satish for her meaningful contributions, guidance, and support that helped shape this work.


Overview

Supervised Machine Learning - Task 1 is a foundational Data Science exploration conducted under the Graduate Rotational Internship Program (GRIP) at The Sparks Foundation. The project establishes a univariate linear model to quantify the correlation between study duration (Hours) and academic outcome (Scores).

Using Scikit-Learn's LinearRegression estimator, the model minimizes the sum of squared residuals to derive a best-fit line, which yields a point prediction for any given number of study hours (e.g., the predicted score for 9.25 hours of study).

Computational Objectives

The analysis pursues three objectives:

  • Linear Approximation: establishing a linear relationship $y = mx + c$ where $y$ is the predicted score and $x$ is study hours.
  • Residual Minimization: utilizing the OLS algorithm to optimize the coefficient (slope) and intercept.
  • Predictive Inference: generating a specific scalar output for the internship query: What will be the predicted score if a student studies for 9.25 hrs/day?
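The three objectives above can be sketched in a few lines of plain Python. This is a minimal illustration of the closed-form OLS solution for a single feature, not the notebook's code (which relies on Scikit-Learn). The helper name `fit_ols` and the sample values are hypothetical and are not taken from Task_1_Dataset.csv.

```python
def fit_ols(xs, ys):
    """Return (m, c) minimizing sum((y - (m*x + c))**2) for one feature."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Slope is the ratio of the x-y co-deviation to the x deviation.
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    m = sxy / sxx
    # The best-fit line always passes through the point of means.
    c = mean_y - m * mean_x
    return m, c


# Illustrative values only (hypothetical, not the project dataset).
hours = [1.0, 2.0, 3.0, 4.0, 5.0]
scores = [12.0, 22.0, 32.0, 42.0, 52.0]  # lies exactly on y = 10x + 2

m, c = fit_ols(hours, scores)
predicted = m * 9.25 + c  # predictive inference for 9.25 hrs/day
```

Because the toy points lie exactly on a line, the fit recovers that line; real data would leave non-zero residuals.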

Tip

Model Applicability: The linear model is only reliable within the range of study hours observed in the data. Extrapolating beyond that range (e.g., studying > 15 hours/day) may yield unrealistic results: a straight line is unbounded, so predicted scores can exceed 100%, and a 24-hour day limits how long anyone can actually study.
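One way to make this caveat explicit in code is to flag inputs outside the observed range and clamp the output to a valid percentage. The sketch below is an assumption about how such a guard could look; the helper `predict_score`, the coefficients, and the range are all hypothetical and do not come from the notebook.

```python
def predict_score(hours, slope, intercept, observed_range):
    """Predict a score and flag whether the input is outside the training range."""
    low, high = observed_range
    raw = slope * hours + intercept
    extrapolated = not (low <= hours <= high)
    # A percentage cannot leave [0, 100], so clamp the raw linear output.
    clamped = max(0.0, min(100.0, raw))
    return clamped, extrapolated


# Hypothetical coefficients and range, chosen only for illustration.
score, extrapolated = predict_score(
    15.0, slope=10.0, intercept=2.0, observed_range=(1.0, 9.5)
)
```

Clamping hides the symptom rather than fixing the model, so the `extrapolated` flag is the more important signal for a caller.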


Features

| Component | Technical Description |
|---|---|
| Ingestion Pipeline | Automated data retrieval and parsing using Pandas from remote HTTP endpoints. |
| Exploratory Analysis | Visualizing distribution and correlation via Matplotlib and Seaborn scatter plots. |
| Model Architecture | Implementation of LinearRegression from Scikit-Learn for OLS optimization. |
| Evaluation Metrics | Quantitative assessment using Mean Absolute Error (MAE) and the R² score to validate model accuracy. |
| Inference Engine | Direct prediction for a specific user-defined input (e.g., 9.25 hours). |
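A minimal end-to-end sketch of how these components might fit together with Scikit-Learn. It assumes an 80/20 train/test split and a fixed random seed, neither of which is stated in this README, and it uses synthetic data generated in-line instead of Task_1_Dataset.csv, so its metrics will not match the Results section.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error, r2_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the Hours/Scores columns (illustrative only).
hours = np.arange(1.0, 9.5, 0.5)                      # 1.0, 1.5, ..., 9.0
noise = np.where(np.arange(hours.size) % 2 == 0, 1.5, -1.5)
scores = 10.0 * hours + 2.0 + noise

# Scikit-Learn expects a 2-D feature matrix, hence the reshape.
X = hours.reshape(-1, 1)
X_train, X_test, y_train, y_test = train_test_split(
    X, scores, test_size=0.2, random_state=0  # split and seed are assumptions
)

model = LinearRegression().fit(X_train, y_train)      # OLS fit
y_pred = model.predict(X_test)

mae = mean_absolute_error(y_test, y_pred)
r2 = r2_score(y_test, y_pred)
predicted = model.predict(np.array([[9.25]]))[0]      # single-input inference
```

After fitting, `model.coef_[0]` holds the slope and `model.intercept_` the intercept of the best-fit line.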

Note

Empirical Context

The dataset is a bivariate sample (Hours vs. Scores). The high correlation coefficient observed during EDA justifies choosing a Linear Regression model over more complex polynomial or ensemble approaches, in keeping with the principle of parsimony (Occam's Razor) in machine learning design.
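For reference, the correlation coefficient mentioned here is typically Pearson's r. The sketch below computes it in plain Python; the helper name `pearson_r` and the sample values are hypothetical, and the README does not state which correlation measure the notebook reports.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


# Illustrative values only (hypothetical, not the project dataset).
hours = [1.5, 3.0, 4.5, 6.0, 7.5]
scores = [18.0, 30.0, 49.0, 60.0, 78.0]

r = pearson_r(hours, scores)
```

A value of r close to 1 indicates a strong positive linear relationship, which is the condition under which a single straight line is a reasonable model.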

Tech Stack

  • Runtime: Python 3.7+
  • Data Manipulation: Pandas, NumPy
  • Visualization: Matplotlib, Seaborn
  • Machine Learning: Scikit-Learn (sklearn)
  • Environment: Jupyter Notebook / Google Colab

Project Structure

TSF-SUPERVISED-MACHINE-LEARNING/
│
├── docs/                                            # Technical Documentation
│   └── SPECIFICATION.md                             # Architecture & Design Specification
│
├── Mega/                                            # Archival Attribution Assets
│   ├── Filly.jpg                                    # Companion (Filly)
│   ├── Mega.png                                     # Author Profile Image (Mega Satish)
│   └── ...                                          # Additional Attribution Files
│
├── Source Code/                                     # Core Implementation
│   └── TSF_INTERNSHIP_TASK_1_SUPERVISED_LEARNING.ipynb  # Jupyter Notebook (Analysis Kernel)
│
├── The Sparks Foundation/                           # Internship Artifacts
│   └── Task_1_Dataset.csv                           # Empirical Data Source
│
├── .gitattributes                                   # Git configuration
├── .gitignore                                       # Repository Filters
├── CITATION.cff                                     # Scholarly Citation Metadata
├── codemeta.json                                    # Machine-Readable Project Metadata
├── LICENSE                                          # MIT License Terms
├── README.md                                        # Project Documentation
└── SECURITY.md                                      # Security Policy

Results

1. Exploratory Data Analysis: Hours vs Percentage
Initial scatter plot revealing the strong positive correlation.

[Figure: Dataset Scatter]

2. Feature Distribution: Scores Analysis
Statistical distribution of the target variable (Percentage Scored).

[Figure: Score Distribution]

3. Regression Fit: Hours vs Scores
OLS Regression line fitted to the training data.

[Figure: Regression Plot]

4. Model Training: Fitting the Line
Visualizing the linear approximation on Training Data.

[Figure: Training Set Results]

5. Model Validation: Testing the Fit
Validation of the regression line against unseen Test Data.

[Figure: Test Set Results]

Evaluation Metrics
Mean Absolute Error: 4.18 | R² Score: 0.945
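As a reminder of what these two metrics measure, here is a small plain-Python sketch of their standard definitions. The helper names `mae` and `r2` and the sample values are hypothetical; the notebook presumably uses Scikit-Learn's metric functions, and the numbers below are unrelated to the reported 4.18 and 0.945.

```python
def mae(y_true, y_pred):
    """Mean Absolute Error: average size of the prediction errors."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def r2(y_true, y_pred):
    """R² score: 1 minus residual sum of squares over total sum of squares."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


# Illustrative values only (hypothetical, not the project's test set).
y_true = [20.0, 40.0, 60.0, 80.0]
y_pred = [22.0, 38.0, 63.0, 77.0]

error = mae(y_true, y_pred)
fit_quality = r2(y_true, y_pred)
```

MAE is expressed in the same units as the target (percentage points here), while R² is unitless and reaches 1.0 only for a perfect fit.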

Final Inference
Input: 9.25 Hours | Predicted Score: 93.69%


Quick Start

1. Prerequisites

  • Python 3.7+: Required for runtime execution.
  • Jupyter Environment: For interactive code execution (JupyterLab or Notebook).

Warning

Data Path Integrity

The analysis kernel relies on precise relative file paths. Ensure Task_1_Dataset.csv remains within The Sparks Foundation/ directory. Modifying the directory structure without updating the ingestion logic will result in FileNotFoundError during runtime.
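A defensive check along these lines can surface a misplaced dataset with a clearer message before any analysis runs. This is a sketch under assumptions: the helper `resolve_dataset` is hypothetical and not part of the notebook, and it takes the repository root as an argument rather than guessing the working directory.

```python
from pathlib import Path


def resolve_dataset(repo_root):
    """Return the dataset path, or raise a descriptive FileNotFoundError."""
    path = Path(repo_root) / "The Sparks Foundation" / "Task_1_Dataset.csv"
    if not path.is_file():
        raise FileNotFoundError(
            f"Expected dataset at {path}; restore the original directory "
            "layout or update the ingestion path."
        )
    return path


# Demonstrate the failure mode with a directory that does not exist.
try:
    resolve_dataset("path/that/does/not/exist")
    message = ""
except FileNotFoundError as err:
    message = str(err)
```

Passing the returned path to `pandas.read_csv` would then load the file only after its location has been verified.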

2. Installation

Establish the local environment by cloning the repository and installing the computational stack:

# Clone the repository
git clone https://github.com/Amey-Thakur/TSF-SUPERVISED-MACHINE-LEARNING.git
cd TSF-SUPERVISED-MACHINE-LEARNING

# Install predictive modeling dependencies
pip install pandas numpy matplotlib seaborn scikit-learn

3. Execution

Launch the analysis kernel to reproduce the findings:

jupyter notebook "Source Code/TSF_INTERNSHIP_TASK_1_SUPERVISED_LEARNING.ipynb"

Tip

Interactive Predictive Analytics | Student Score Estimation

Explore the high-fidelity Live Demo to visualize the Ordinary Least Squares (OLS) regression analysis in real-time. The interactive dashboard showcases Exploratory Data Analysis (EDA), Feature Distribution, and Model Validation results, quantifying the strong positive correlation between study hours and academic outcomes with an R² score of 0.945.

Launch Live Demo


Usage Guidelines

This repository is openly shared to support learning and knowledge exchange across the academic community.

For Students
Use this project as reference material for understanding supervised learning pipelines, univariate regression, and statistical predictive modeling. The source code is available for study to facilitate self-paced learning and exploration of OLS optimization and residual analysis.

For Educators
This project may serve as a practical lab example or supplementary teaching resource for Data Science and Applied Statistics courses. Attribution is appreciated when utilizing content.

For Researchers
The documentation and architectural approach may provide insights into academic project structuring, predictive inference, and industrial internship artifacts.


License

This academic submission, developed for the Graduate Rotational Internship Program (GRIP) at The Sparks Foundation, is made available under the MIT License. See the LICENSE file for complete terms.

Note

Summary: You are free to use, share, and adapt this content for any purpose, including commercially, as long as the original copyright notice and license text are retained.

Copyright © 2021 Amey Thakur & Mega Satish


About This Repository

Created & Maintained by: Amey Thakur & Mega Satish
Role: Data Science & Business Analytics Interns
Program: Graduate Rotational Internship Program (GRIP)
Organization: The Sparks Foundation

This project features Supervised Machine Learning - Task 1, a predictive analytics study conducted as part of the GRIP Internship. It explores the application of linear regression to solve real-world estimation problems.

Connect: GitHub  ·  LinkedIn  ·  ORCID

Acknowledgments

Grateful acknowledgment to Mega Satish for her exceptional collaboration and scholarly partnership during the execution of this data science internship task. Her analytical precision, deep understanding of statistical modeling, and constant support were instrumental in refining the predictive algorithms used in this study. Working alongside her was a transformative experience; her thoughtful approach to problem-solving and steady encouragement turned complex regression challenges into meaningful learning moments. This work reflects the growth and insights gained from our side-by-side academic journey. Thank you, Mega, for everything you shared and taught along the way.

Special thanks to the mentors at The Sparks Foundation for providing this platform for rapid skill development and industrial exposure.




📈 TSF-SUPERVISED-MACHINE-LEARNING


Presented as part of the Internship @ The Sparks Foundation


Computer Engineering (B.E.) - University of Mumbai

Semester-wise curriculum, laboratories, projects, and academic notes.