📩 SMS Spam Detection

Classifying SMS messages as Spam or Ham using NLP & Machine Learning
with Random Forest, SVM, and Naive Bayes on real-world message data.

🚀 Getting Started • 📊 Dataset • 🤖 Models • 📈 Results • 👤 Author

📌 Table of Contents

🔍 Overview
📊 Dataset
📁 Project Structure
⚙️ Tech Stack
🚀 Getting Started
🔬 How It Works
🤖 Models & Results
📈 Results
🧪 Manual Testing
⚠️ Known Issues & Notes
🔮 Future Improvements
👤 Author

🔍 Overview

SMS spam is a persistent nuisance for mobile users. This project builds an end-to-end NLP pipeline that classifies SMS messages as either Spam or Ham (legitimate). It applies custom text preprocessing, stopword removal, and stemming, derives word-frequency and message-length features, and feeds them into several machine learning models for comparison.

The best-performing model, a Random Forest classifier, reaches 95.78% accuracy on the test data.

✨ Key Highlights

🏆 Feature	📋 Detail
🧠 Models Used	Random Forest, SVM, Naive Bayes
📦 Dataset Size	5,572 SMS messages
🎯 Task	Binary Classification (Spam vs. Ham)
🔤 NLP	Stopword Removal, Stemming, Word Frequency
📐 Best Accuracy	95.78% (Random Forest)

📊 Dataset

📂 spam.csv
├── 5,572 total SMS messages
├── v1  →  Label  (ham / spam)
└── v2  →  Message Text

📥 Download from Kaggle — SMS Spam Collection Dataset

🧾 Column Description

Column	Description
`v1`	Target label — `ham` or `spam`
`v2`	Raw SMS message text
`Unnamed: 2–4`	Sparse extra columns — dropped during preprocessing
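
As a rough sketch of the loading step, the snippet below reads a tiny in-memory stand-in for spam.csv and drops the sparse `Unnamed` columns. The sample rows, the `load_messages` helper, and the `label` / `message` names are illustrative assumptions, not taken from the notebook.

```python
import io

import pandas as pd

# Tiny stand-in for spam.csv: same header layout, made-up rows.
SAMPLE_CSV = (
    "v1,v2,,,\n"
    "ham,See you at lunch,,,\n"
    "spam,WIN a FREE prize now,,,\n"
)


def load_messages(source):
    """Read the CSV, keep only the label and text columns, and rename them."""
    data = pd.read_csv(source)
    # pandas names blank header cells "Unnamed: 2", "Unnamed: 3", ...
    extra = [col for col in data.columns if col.startswith("Unnamed")]
    data = data.drop(columns=extra)
    # Hypothetical friendlier names; the README's own snippets use v1 / v2.
    return data.rename(columns={"v1": "label", "v2": "message"})


messages = load_messages(io.StringIO(SAMPLE_CSV))
print(messages)
```

With the real file, `source` would be the path to spam.csv, possibly with an explicit `encoding` argument.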

📉 Class Distribution

Ham     ████████████████████████████████████████████  86.6%
Spam    █████░░                                        13.4%

⚠️ Note: The dataset is imbalanced. Ham messages dominate, so a classifier that always predicts ham would already score 86.6% accuracy. High accuracy can therefore hide poor recall on spam.

📁 Project Structure

📦 sms-spam-detection/
│
├── 📄 spam.csv                    ← Dataset (download from Kaggle)
├── 📓 SpamDetection.ipynb         ← Main Jupyter Notebook
└── 📝 README.md                   ← Project documentation

⚙️ Tech Stack

Library	Version	Purpose
Python	3.8+	Core language
NumPy	1.24+	Numerical operations
Pandas	2.0+	Data loading & manipulation
Matplotlib	3.7+	Data visualization
Seaborn	0.12+	Count plots & charts
NLTK	3.8+	Stopword removal & stemming
scikit-learn	1.0+	ML models & evaluation

📦 Installation

pip install numpy pandas matplotlib seaborn scikit-learn nltk

# Download NLTK stopwords
import nltk
nltk.download('stopwords')

🚀 Getting Started

Option 1 — Local Environment

# 1. Clone the repository
git clone https://github.com/Hasnain006-nain/sms-spam-detection.git
cd sms-spam-detection

# 2. Install dependencies
pip install numpy pandas matplotlib seaborn scikit-learn nltk

# 3. Add dataset
# Place spam.csv in the project root directory

# 4. Launch Jupyter Notebook
jupyter notebook SpamDetection.ipynb

Option 2 — Google Colab ☁️

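import pandas as pd
from google.colab import drive

drive.mount('/content/drive')

# The Kaggle CSV is typically not valid UTF-8, so pass an explicit encoding
data = pd.read_csv("/content/drive/MyDrive/spam.csv", encoding="latin-1")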

🔬 How It Works

┌──────────────────────────────────────────────────────────────────┐
│                       PIPELINE OVERVIEW                          │
├──────────────────────────────────────────────────────────────────┤
│                                                                  │
│  1. 📥 Load Data         →  Read spam.csv with Pandas            │
│                                                                  │
│  2. 🔎 Explore Data      →  Class distribution, column info      │
│                                                                  │
│  3. 🧹 Preprocess Text   →  Lowercase, remove punctuation        │
│                                                                  │
│  4. 🔤 NLP Pipeline      →  Remove stopwords, apply stemming     │
│                                                                  │
│  5. 📏 Feature: Length   →  Bin message lengths into categories  │
│                                                                  │
│  6. 📊 Feature: Diff     →  SpamWordCount - HamWordCount         │
│                                                                  │
│  7. ✂️  Split Data        →  80% Train | 20% Test                │
│                                                                  │
│  8. 🌲 Train Models      →  RandomForest + SVM + NaiveBayes      │
│                                                                  │
│  9. 📈 Evaluate          →  Accuracy + Confusion Matrix          │
│                                                                  │
└──────────────────────────────────────────────────────────────────┘

🔤 NLP Feature Engineering

1. Text Cleaning

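# Lowercase, then strip punctuation and digits (pandas 2.0+ needs regex=True)
data["Text"] = data["v2"].str.lower()
# "\-" keeps the hyphen literal; unescaped, ":-?" is a character range
data["Text"] = data["Text"].str.replace(r'[.,\\&;!:\-?(|)#@$^%*0-9]+', '', regex=True)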
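
To illustrate what the character class does, here is a small standard-library sketch. `clean_text` is a hypothetical helper, not a function from the notebook. It shows that an unescaped `:-?` inside `[...]` is a range covering `: ; < = > ?`, so a literal hyphen is only removed once it is escaped.

```python
import re

# Unescaped ":-?" is a range from ":" to "?" (: ; < = > ?), not a hyphen.
RANGE_CLASS = re.compile(r"[.,\\&;!:-?(|)#@$^%*0-9]+")
# Escaping the hyphen makes it a literal "-" instead of a range.
LITERAL_CLASS = re.compile(r"[.,\\&;!:\-?(|)#@$^%*0-9]+")


def clean_text(message, pattern=LITERAL_CLASS):
    """Lowercase a message and delete every character matched by pattern."""
    return pattern.sub("", message.lower())


sample = "Pre-paid offer: WIN 100% now!"
print(clean_text(sample, RANGE_CLASS))    # the hyphen survives
print(clean_text(sample, LITERAL_CLASS))  # the hyphen is removed
```

Which behaviour is wanted depends on whether hyphenated words such as "pre-paid" should be merged; the sketch only makes the difference visible.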

2. Stopword Removal

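from nltk.corpus import stopwords
stop_words = set(stopwords.words('english'))
# `message` is one cleaned SMS string; keep non-stopwords longer than two characters
text = [word for word in message.split() if word not in stop_words and len(word) > 2]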
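
The pipeline also lists stemming, which the snippet above does not show. The sketch below combines both steps using NLTK's `PorterStemmer`. The `tokenize` helper and the short `STOP_WORDS` set are assumptions made so the example runs without downloading the NLTK corpus. Whether the notebook uses the Porter stemmer specifically is not stated here.

```python
from nltk.stem import PorterStemmer

# Stand-in for set(stopwords.words('english')); the real list is much longer.
STOP_WORDS = {"you", "have", "a", "the", "to", "is", "are", "now", "for"}

stemmer = PorterStemmer()


def tokenize(message):
    """Drop stopwords and words of two characters or fewer, then stem the rest."""
    kept = [
        word
        for word in message.lower().split()
        if word not in STOP_WORDS and len(word) > 2
    ]
    return [stemmer.stem(word) for word in kept]


tokens = tokenize("you are winning the free calling cards")
print(tokens)
```

Stemming maps inflected forms such as "winning" and "calling" onto a shared stem, so word counts built afterwards are less fragmented.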

3. Word Frequency Difference (Diff Feature)

# Key insight: "free" appears 219x in spam vs 59x in ham
data["Diff"] = SpamWordCount - HamWordCount
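
The line above is shorthand, and the exact definition of `SpamWordCount` and `HamWordCount` is not spelled out here. One plausible reading, sketched below, scores each message by how often its words occur in spam messages minus how often they occur in ham messages. The helper names and the training messages are made up for illustration.

```python
from collections import Counter


def word_counts(messages):
    """Count how often each word occurs across a list of messages."""
    counts = Counter()
    for message in messages:
        counts.update(message.lower().split())
    return counts


def diff_feature(message, spam_counts, ham_counts):
    """Sum of spam-corpus counts minus ham-corpus counts over the message's words."""
    words = message.lower().split()
    spam_score = sum(spam_counts[word] for word in words)
    ham_score = sum(ham_counts[word] for word in words)
    return spam_score - ham_score


# Made-up training messages, for illustration only.
spam_train = ["free prize call now", "free entry win cash"]
ham_train = ["call me later", "are you free for lunch", "see you later"]

spam_counts = word_counts(spam_train)
ham_counts = word_counts(ham_train)

print(diff_feature("win a free prize", spam_counts, ham_counts))
print(diff_feature("see you at lunch", spam_counts, ham_counts))
```

A positive score leans towards spam and a negative one towards ham. If the counts are built from the whole dataset rather than the training split, label information from the test messages leaks into the feature, so building them from training data only is the safer choice.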

4. Message Length Binning

pd.cut(data.Length, [-1, 10, 20, 30, 50, 75, 100, 999], labels=[10,20,30,50,75,100,200])
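
A small sketch of how this binning behaves, using the same edges and labels on made-up lengths. The unit of `Length` (characters or words) is not stated here, so the sample values are arbitrary.

```python
import pandas as pd

# Bin edges and labels copied from the snippet above.
EDGES = [-1, 10, 20, 30, 50, 75, 100, 999]
LABELS = [10, 20, 30, 50, 75, 100, 200]

# Made-up message lengths; 1200 lies beyond the last edge.
lengths = pd.Series([3, 10, 11, 64, 160, 1200], name="Length")
binned = pd.cut(lengths, EDGES, labels=LABELS)

print(pd.DataFrame({"Length": lengths, "Bin": binned}))
```

Intervals are closed on the right, so a length of exactly 10 falls in the first bin and 11 in the second. Anything above 999 falls outside every bin and becomes NaN, which would need handling before training.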

🤖 Models & Results

Models Compared

Model	Description
🌲 Random Forest	Ensemble of decision trees with GridSearchCV tuning
📐 SVM	Support Vector Classifier with RBF kernel
📊 Naive Bayes	Gaussian Naive Bayes probabilistic classifier
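
The three classifiers in the table can be compared with a loop like the one below. This is a minimal sketch on synthetic data: the feature distributions, the `stratify=y` argument, and the fixed `n_estimators=9` are assumptions chosen for illustration, not the notebook's settings, so the printed accuracies say nothing about the real results.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.svm import SVC

# Synthetic stand-in for the two engineered features (Length, Diff).
rng = np.random.default_rng(0)
n_ham, n_spam = 870, 130  # roughly the 87% / 13% class split
X = np.vstack([
    np.column_stack([rng.normal(40, 15, n_ham), rng.normal(-5, 4, n_ham)]),
    np.column_stack([rng.normal(90, 20, n_spam), rng.normal(8, 5, n_spam)]),
])
y = np.array([0] * n_ham + [1] * n_spam)  # 0 = ham, 1 = spam

# 80% train / 20% test, keeping the class ratio in both parts.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

models = {
    "Random Forest": RandomForestClassifier(n_estimators=9, random_state=42),
    "SVM (RBF)": SVC(kernel="rbf"),
    "Naive Bayes": GaussianNB(),
}

scores = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    scores[name] = accuracy_score(y_test, model.predict(X_test))
    print(f"{name}: {scores[name]:.3f}")
```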

GridSearchCV Parameters (Random Forest)

parameters = {
    'n_estimators': [4, 6, 9],
    'max_features': ['log2', 'sqrt', 'auto'],
    'criterion': ['entropy', 'gini'],
    'max_depth': [2, 3, 5, 10],
    'min_samples_split': [2, 3, 5],
    'min_samples_leaf': [1, 5, 8]
}
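
The full grid above spans 3 × 3 × 2 × 4 × 3 × 3 = 648 parameter combinations. The sketch below shows how such a grid is typically passed to `GridSearchCV`, using a reduced grid and synthetic data so it runs quickly. The data, `cv=5`, and `scoring="accuracy"` are assumptions, not the notebook's configuration.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Synthetic two-feature data standing in for (Length, Diff).
X, y = make_classification(
    n_samples=300, n_features=2, n_informative=2, n_redundant=0,
    weights=[0.87, 0.13], random_state=0,
)

# Reduced grid so the sketch runs quickly; 'auto' is left out on purpose.
param_grid = {
    "n_estimators": [4, 9],
    "max_features": ["log2", "sqrt"],
    "criterion": ["entropy", "gini"],
    "max_depth": [2, 5],
}

search = GridSearchCV(
    RandomForestClassifier(random_state=42),
    param_grid,
    scoring="accuracy",
    cv=5,
)
search.fit(X, y)

print(search.best_params_)
print(f"best cross-validated accuracy: {search.best_score_:.3f}")
```

After fitting, `search.best_estimator_` holds a forest refit on all the data passed to `fit` with the winning parameters.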

📈 Results

╔══════════════════════════════════════════════════════════╗
║              MODEL PERFORMANCE SUMMARY                   ║
╠══════════════════════════════════════════════════════════╣
║  Approach              │  Accuracy                       ║
╠══════════════════════════════════════════════════════════╣
║  Word Freq Only (Diff) │  94.34%  ██████████████████ ✅  ║
║  Random Forest         │  95.78%  ███████████████████ ✅ ║
║  Naive Bayes           │  94.34%  ██████████████████ ✅  ║
║  SVM                   │  87.80%  █████████████████  ⚠️  ║
╚══════════════════════════════════════════════════════════╝

🧮 Confusion Matrix (Best Model)

                  Predicted Ham    Predicted Spam
Actual Ham            4723              102
Actual Spam            173              574

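🏆 Random Forest is the best of the three models at 95.78% accuracy, using only two engineered features: Length and Diff.

⚠️ Note: The matrix counts sum to 5,572, the size of the full dataset, so the matrix appears to cover every message rather than only the 20% test split. Taken on their own, these counts give (4723 + 574) / 5572 ≈ 95.1% accuracy.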
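
Accuracy alone hides how the model does on the minority class. The short calculation below derives precision, recall, and F1 for spam directly from the four counts in the matrix above, treating spam as the positive class. The variable names are chosen here for illustration.

```python
# Counts copied from the confusion matrix above.
TN, FP = 4723, 102   # actual ham:  predicted ham / predicted spam
FN, TP = 173, 574    # actual spam: predicted ham / predicted spam

total = TN + FP + FN + TP
accuracy = (TP + TN) / total
precision = TP / (TP + FP)   # share of spam predictions that really are spam
recall = TP / (TP + FN)      # share of spam messages that were caught
f1 = 2 * precision * recall / (precision + recall)

print(f"messages:  {total}")
print(f"accuracy:  {accuracy:.4f}")
print(f"precision: {precision:.4f}")
print(f"recall:    {recall:.4f}")
print(f"F1:        {f1:.4f}")
```

On these counts, spam recall comes out near 77%, well below the overall accuracy, which is the effect the class-imbalance note describes.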

🧪 Manual Testing

The project includes a live prediction function for custom input:

manual_entry()
# Enter message: Congratulations! You've won a free phone. Call now on 9999999999
# Output: Spam ✅

In this example, the model flags the promotional message as spam as soon as it is entered.

⚠️ Known Issues & Notes

⚖️ Class Imbalance

Ham messages make up ~87% of the dataset. This may reduce the model's sensitivity to spam. Consider:

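# Option 1: Class weighting
from sklearn.ensemble import RandomForestClassifier
clf = RandomForestClassifier(class_weight='balanced')

# Option 2: SMOTE oversampling (requires: pip install imbalanced-learn)
from imblearn.over_sampling import SMOTE
sm = SMOTE(random_state=42)
# Resample the training split only, never the test split
X_train, y_train = sm.fit_resample(X_train, y_train)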
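
To make Option 1 concrete, the sketch below reproduces the weighting heuristic behind `class_weight='balanced'`, which scikit-learn documents as `n_samples / (n_classes * count)` per class. The `balanced_class_weights` helper is hypothetical, and the class counts are the row sums of the confusion matrix in the Results section.

```python
def balanced_class_weights(class_counts):
    """Weights as n_samples / (n_classes * count), the 'balanced' heuristic."""
    n_samples = sum(class_counts.values())
    n_classes = len(class_counts)
    return {
        label: n_samples / (n_classes * count)
        for label, count in class_counts.items()
    }


# Row sums of the confusion matrix in the Results section.
counts = {"ham": 4723 + 102, "spam": 173 + 574}
weights = balanced_class_weights(counts)

for label, weight in weights.items():
    print(f"{label}: {weight:.3f}")
```

Each spam message ends up weighted several times more heavily than a ham message, so misclassifying spam costs the model more during training.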

⚠️ `auto` Parameter Deprecation

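`max_features='auto'` was deprecated in scikit-learn 1.1 and removed in 1.3. For classifiers it behaved the same as `'sqrt'`, so it can simply be dropped from the grid. Replace with: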

'max_features': ['log2', 'sqrt']

⚠️ `DataFrame.append()` Deprecation

`DataFrame.append()`, used in the notebook as `temp.append()`, was deprecated in Pandas 1.4 and removed in Pandas 2.0. Replace with:

temp = pd.concat([temp, pd.DataFrame({"Text": [input_text]})], ignore_index=True)

🔮 Future Improvements

🔄 Apply TF-IDF Vectorization for richer text features
🧠 Try LSTM / BERT deep learning models
📊 Add ROC-AUC and Precision-Recall curves
🌐 Deploy as a web app using Flask or Streamlit
📱 Build a real-time SMS filtering API
💾 Export trained model with joblib for production

👤 Author

╔════════════════════════════════════╗
║                                    ║
║         Hasnain Haider             ║
║                                    ║
║   Machine Learning Enthusiast      ║
║   Data Science | NLP | Python      ║
║                                    ║
╚════════════════════════════════════╝

⭐ If you found this project helpful, please give it a star! ⭐
