Hasnain006-nain/SPAM-SMS-DETECTION


📩 SMS Spam Detection


Classifying SMS messages as Spam or Ham using NLP & Machine Learning
with Random Forest, SVM, and Naive Bayes on real-world message data.


🚀 Getting Started · 📊 Dataset · 🤖 Models · 📈 Results · 👤 Author




🔍 Overview

SMS spam is a persistent problem affecting millions of users daily. This project builds a complete NLP-powered spam detection pipeline that classifies SMS messages as either Spam or Ham (legitimate). It combines custom text preprocessing, stopword removal, stemming, word frequency analysis, and message length features — then feeds them into multiple machine learning models for comparison.

The best performing model — Random Forest Classifier — achieves 95.78% accuracy on test data.

✨ Key Highlights

| 🏆 Feature | 📋 Detail |
| --- | --- |
| 🧠 Models Used | Random Forest, SVM, Naive Bayes |
| 📦 Dataset Size | 5,572 SMS messages |
| 🎯 Task | Binary Classification (Spam vs. Ham) |
| 🔤 NLP | Stopword Removal, Stemming, Word Frequency |
| 📐 Best Accuracy | 95.78% (Random Forest) |


📊 Dataset

📂 spam.csv
├── 5,572 total SMS messages
├── v1  →  Label  (ham / spam)
└── v2  →  Message Text

📥 Download from Kaggle — SMS Spam Collection Dataset

🧾 Column Description

| Column | Description |
| --- | --- |
| v1 | Target label: ham or spam |
| v2 | Raw SMS message text |
| Unnamed: 2–4 | Sparse extra columns, dropped during preprocessing |

📉 Class Distribution

Ham     ████████████████████████████████████████████  86.6%
Spam    █████░░                                        13.4%

⚠️ Note: Dataset is imbalanced — ham messages dominate. This can affect recall on spam detection.
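
Because of this imbalance, accuracy alone can flatter a model. The following is a minimal sketch of the majority-class baseline, assuming the class counts implied by the percentages above (4,825 ham and 747 spam out of 5,572). The function name `majority_baseline` is illustrative and not part of the notebook.

```python
def majority_baseline(class_counts):
    """Return (label, accuracy) for a classifier that always predicts the most common class."""
    total = sum(class_counts.values())
    label = max(class_counts, key=class_counts.get)
    return label, class_counts[label] / total


# Counts assumed from the 86.6% / 13.4% split of 5,572 messages.
counts = {"ham": 4825, "spam": 747}
label, accuracy = majority_baseline(counts)
print(f"Always predicting '{label}' scores {accuracy:.1%} accuracy")
```

Any reported accuracy is best read against this baseline, since a model that never predicts spam already gets most messages right.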



📁 Project Structure

📦 sms-spam-detection/
│
├── 📄 spam.csv                    ← Dataset (download from Kaggle)
├── 📓 spam_detection.ipynb        ← Main Jupyter Notebook
└── 📝 README.md                   ← Project documentation


⚙️ Tech Stack

| Library | Version | Purpose |
| --- | --- | --- |
| Python | 3.8+ | Core language |
| NumPy | 1.24+ | Numerical operations |
| Pandas | 2.0+ | Data loading & manipulation |
| Matplotlib | 3.7+ | Data visualization |
| Seaborn | 0.12+ | Count plots & charts |
| NLTK | 3.8+ | Stopword removal & stemming |
| scikit-learn | 1.0+ | ML models & evaluation |

📦 Installation

pip install numpy pandas matplotlib seaborn scikit-learn nltk
# Download NLTK stopwords
python -c "import nltk; nltk.download('stopwords')"


🚀 Getting Started

Option 1 — Local Environment

# 1. Clone the repository
git clone https://github.com/Hasnain006-nain/SPAM-SMS-DETECTION.git
cd SPAM-SMS-DETECTION

# 2. Install dependencies
pip install numpy pandas matplotlib seaborn scikit-learn nltk

# 3. Add dataset
# Place spam.csv in the project root directory

# 4. Launch Jupyter Notebook
jupyter notebook spam_detection.ipynb

Option 2 — Google Colab ☁️

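import pandas as pd
from google.colab import drive
drive.mount('/content/drive')

# The Kaggle CSV is typically not UTF-8 encoded, so latin-1 is assumed here
data = pd.read_csv("/content/drive/MyDrive/spam.csv", encoding="latin-1")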


🔬 How It Works

┌──────────────────────────────────────────────────────────────────┐
│                       PIPELINE OVERVIEW                          │
├──────────────────────────────────────────────────────────────────┤
│                                                                  │
│  1. 📥 Load Data         →  Read spam.csv with Pandas            │
│                                                                  │
│  2. 🔎 Explore Data      →  Class distribution, column info      │
│                                                                  │
│  3. 🧹 Preprocess Text   →  Lowercase, remove punctuation        │
│                                                                  │
│  4. 🔤 NLP Pipeline      →  Remove stopwords, apply stemming     │
│                                                                  │
│  5. 📏 Feature: Length   →  Bin message lengths into categories  │
│                                                                  │
│  6. 📊 Feature: Diff     →  SpamWordCount - HamWordCount         │
│                                                                  │
│  7. ✂️  Split Data        →  80% Train | 20% Test                │
│                                                                  │
│  8. 🌲 Train Models      →  RandomForest + SVM + NaiveBayes      │
│                                                                  │
│  9. 📈 Evaluate          →  Accuracy + Confusion Matrix          │
│                                                                  │
└──────────────────────────────────────────────────────────────────┘

🔤 NLP Feature Engineering

1. Text Cleaning

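data["Text"] = data["v2"].str.lower()
# Strip the listed punctuation and digits. The hyphen is escaped so ':-?' is not read as a range,
# and regex=True is needed because pandas 2.0+ treats the pattern as a literal string by default.
data["Text"] = data["Text"].str.replace(r'[.,\\&;!:\-?(|)#@$^%*0-9]+', '', regex=True)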
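
When writing a character class of this kind, note that an unescaped `-` between two characters denotes a range, so `:-?` also matches `;`, `<`, `=` and `>`. The sketch below uses only the standard `re` module to show the cleaning step on a single string. The helper name `clean_text` and the whitespace collapse at the end are assumptions added for illustration.

```python
import re

# Escaping the hyphen keeps it literal; unescaped, ':-?' would be a range (: ; < = > ?).
PUNCT_AND_DIGITS = re.compile(r"[.,\\&;!:\-?(|)#@$^%*0-9]+")


def clean_text(message):
    """Lowercase, drop the listed punctuation and digits, and collapse repeated spaces."""
    lowered = message.lower()
    stripped = PUNCT_AND_DIGITS.sub("", lowered)
    return " ".join(stripped.split())  # whitespace collapse is an added assumption


print(clean_text("FREE entry!! Call 0800 now:-)"))
print(re.sub(r"[:-?]", "", "a<b=c>d"))   # range form also strips < = >
print(re.sub(r"[:\-?]", "", "a<b=c>d"))  # escaped form leaves them alone
```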

2. Stopword Removal

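from nltk.corpus import stopwords
stop_words = set(stopwords.words('english'))
# `message` is one cleaned SMS string; keep non-stopwords longer than two characters
text = [word for word in message.split() if word not in stop_words and len(word) > 2]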
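
To see what the filter keeps without downloading the NLTK corpus, here is a small self-contained sketch. The stopword set is a tiny stand-in chosen for illustration (the notebook uses NLTK's English list), and `remove_stopwords` is a hypothetical helper name.

```python
# Tiny stand-in stopword list (assumption); the notebook uses NLTK's English list instead.
STOP_WORDS = {"you", "have", "a", "the", "to", "is", "on", "now"}


def remove_stopwords(message, stop_words=STOP_WORDS, min_length=3):
    """Keep words that are not stopwords and have at least `min_length` characters."""
    return [
        word
        for word in message.split()
        if word not in stop_words and len(word) >= min_length
    ]


print(remove_stopwords("you have won a free prize call us on monday"))
```

The length check drops short tokens such as "us" even when they are not in the stopword list.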

3. Word Frequency Difference (Diff Feature)

# Key insight: "free" appears 219x in spam vs 59x in ham
data["Diff"] = SpamWordCount - HamWordCount
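
The README does not spell out how SpamWordCount and HamWordCount are computed. One plausible reading, sketched here purely as an assumption, is to count for each message how many of its words occur more often in spam than in ham across the corpus, and subtract the number that occur more often in ham. The toy messages and the helper names `word_counts` and `diff_feature` are illustrative.

```python
from collections import Counter


def word_counts(messages):
    """Count how often each word appears across a list of messages."""
    return Counter(word for message in messages for word in message.split())


def diff_feature(message, spam_counts, ham_counts):
    """Spam-leaning words minus ham-leaning words in one message (assumed definition)."""
    words = message.split()
    spam_word_count = sum(1 for w in words if spam_counts[w] > ham_counts[w])
    ham_word_count = sum(1 for w in words if ham_counts[w] > spam_counts[w])
    return spam_word_count - ham_word_count


# Toy corpora standing in for the cleaned spam and ham messages.
spam_counts = word_counts(["free prize call now", "win free cash"])
ham_counts = word_counts(["see you at lunch", "call me at home now"])

print(diff_feature("free cash now", spam_counts, ham_counts))
print(diff_feature("see you at home", spam_counts, ham_counts))
```

Positive values lean towards spam and negative values towards ham; words that are equally common in both classes contribute nothing.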

4. Message Length Binning

pd.cut(data.Length, [-1, 10, 20, 30, 50, 75, 100, 999], labels=[10,20,30,50,75,100,200])
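
A short sketch of how these bin edges behave on a few example lengths. The sample values are made up for illustration; the edges and labels are copied from the call above.

```python
import pandas as pd

# Bin edges and labels copied from the snippet above.
EDGES = [-1, 10, 20, 30, 50, 75, 100, 999]
LABELS = [10, 20, 30, 50, 75, 100, 200]

lengths = pd.Series([5, 10, 11, 42, 150, 1200])
binned = pd.cut(lengths, EDGES, labels=LABELS)

# Intervals are right-closed: 10 falls in (-1, 10], 11 in (10, 20].
# 1200 lies beyond the last edge, so it becomes NaN.
print(binned.tolist())
```

Values above 999 are left unbinned, so it may be worth checking the maximum length in the data before relying on the last edge.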


🤖 Models & Results

Models Compared

| Model | Description |
| --- | --- |
| 🌲 Random Forest | Ensemble of decision trees with GridSearchCV tuning |
| 📐 SVM | Support Vector Classifier with RBF kernel |
| 📊 Naive Bayes | Gaussian Naive Bayes probabilistic classifier |

GridSearchCV Parameters (Random Forest)

parameters = {
    'n_estimators': [4, 6, 9],
    'max_features': ['log2', 'sqrt', 'auto'],  # 'auto' is rejected by scikit-learn 1.3+ (see Known Issues)
    'criterion': ['entropy', 'gini'],
    'max_depth': [2, 3, 5, 10],
    'min_samples_split': [2, 3, 5],
    'min_samples_leaf': [1, 5, 8]
}
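
A parameter grid like this is typically passed to `GridSearchCV` together with an estimator. The sketch below is not the notebook's code: it uses synthetic two-feature data from `make_classification` as a stand-in for the Length and Diff columns, a reduced grid so that it runs quickly, and 3-fold cross-validation chosen for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, train_test_split

# Synthetic two-feature data standing in for the Length and Diff columns (assumption).
X, y = make_classification(
    n_samples=400, n_features=2, n_informative=2, n_redundant=0,
    weights=[0.87, 0.13], random_state=0,
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0
)

# Reduced grid so the sketch runs quickly; 'auto' is left out because
# recent scikit-learn releases reject it.
param_grid = {
    "n_estimators": [4, 6, 9],
    "max_features": ["log2", "sqrt"],
    "max_depth": [2, 5],
}
search = GridSearchCV(
    RandomForestClassifier(random_state=0), param_grid, scoring="accuracy", cv=3
)
search.fit(X_train, y_train)

print(search.best_params_)
print(f"Test accuracy: {search.score(X_test, y_test):.3f}")
```

The full grid listed above has 3 × 3 × 2 × 4 × 3 × 3 = 648 combinations, each of which is fitted once per cross-validation fold.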


📈 Results

╔══════════════════════════════════════════════════════════╗
║              MODEL PERFORMANCE SUMMARY                   ║
╠══════════════════════════════════════════════════════════╣
║  Approach              │  Accuracy                       ║
╠══════════════════════════════════════════════════════════╣
║  Word Freq Only (Diff) │  94.34%  ██████████████████ ✅  ║
║  Random Forest         │  95.78%  ███████████████████ ✅ ║
║  Naive Bayes           │  94.34%  ██████████████████ ✅  ║
║  SVM                   │  87.80%  █████████████████  ⚠️  ║
╚══════════════════════════════════════════════════════════╝

🧮 Confusion Matrix (Best Model)

                  Predicted Ham    Predicted Spam
Actual Ham            4723              102
Actual Spam            173              574

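🏆 Random Forest wins with 95.78% accuracy using only two engineered features: Length and Diff.

⚠️ Note: The counts in the confusion matrix sum to 5,572, the size of the full dataset rather than the 20% test split, and correspond to (4,723 + 574) / 5,572 ≈ 95.06% accuracy. The 95.78% figure is the reported test-set score.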
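
With imbalanced classes, spam precision and recall say more than accuracy does. A minimal sketch that derives them from the confusion matrix counts above, treating spam as the positive class (the variable names are illustrative):

```python
# Counts copied from the confusion matrix above (spam is the positive class).
true_ham, false_spam = 4723, 102   # actual ham: predicted ham / predicted spam
false_ham, true_spam = 173, 574    # actual spam: predicted ham / predicted spam

total = true_ham + false_spam + false_ham + true_spam
accuracy = (true_ham + true_spam) / total
precision = true_spam / (true_spam + false_spam)
recall = true_spam / (true_spam + false_ham)
f1 = 2 * precision * recall / (precision + recall)

print(f"total={total} accuracy={accuracy:.4f}")
print(f"spam precision={precision:.4f} recall={recall:.4f} f1={f1:.4f}")
```

On these counts, roughly 77% of spam messages are caught and about 85% of messages flagged as spam really are spam.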



🧪 Manual Testing

The project includes a live prediction function for custom input:

manual_entry()
# Enter message: Congratulations! You've won a free phone. Call now on 9999999999
# Output: Spam ✅

In this example the model flags the promotional message as spam. The confusion matrix above shows that some spam still slips through: 173 of the 747 spam messages were classified as ham.



⚠️ Known Issues & Notes

⚖️ Class Imbalance

Ham messages make up ~87% of the dataset. This may reduce the model's sensitivity to spam. Consider:

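# Option 1: Class weighting
from sklearn.ensemble import RandomForestClassifier
clf = RandomForestClassifier(class_weight='balanced')

# Option 2: SMOTE oversampling (needs the imbalanced-learn package; resample the training split only)
from imblearn.over_sampling import SMOTE
sm = SMOTE(random_state=42)
X_train, y_train = sm.fit_resample(X_train, y_train)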
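
To get a feel for what class weighting does, the sketch below applies the formula scikit-learn documents for `class_weight='balanced'`, n_samples / (n_classes * count), to the full-dataset class counts (4,825 ham, 747 spam). In practice the weights are computed from the training labels, so the exact values would differ slightly; `balanced_class_weights` is a hypothetical helper name.

```python
def balanced_class_weights(class_counts):
    """Weights as n_samples / (n_classes * count), as documented for 'balanced'."""
    n_samples = sum(class_counts.values())
    n_classes = len(class_counts)
    return {
        label: n_samples / (n_classes * count)
        for label, count in class_counts.items()
    }


# Full-dataset counts used for illustration; real weights come from the training split.
weights = balanced_class_weights({"ham": 4825, "spam": 747})
for label, weight in weights.items():
    print(f"{label}: {weight:.3f}")
```

Each spam example ends up weighted about six and a half times as heavily as a ham example, which pushes the trees to pay more attention to the minority class.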

⚠️ auto Parameter Deprecation

max_features='auto' was deprecated in scikit-learn 1.1 and removed in 1.3. For RandomForestClassifier it was equivalent to 'sqrt', so the grid entry can be reduced to:

'max_features': ['log2', 'sqrt']

⚠️ DataFrame.append() Deprecation

DataFrame.append(), used in the notebook as temp.append(), was deprecated in pandas 1.4 and removed in 2.0. Replace it with:

temp = pd.concat([temp, pd.DataFrame({"Text": [input_text]})], ignore_index=True)


🔮 Future Improvements

  • 🔄 Apply TF-IDF Vectorization for richer text features
  • 🧠 Try LSTM / BERT deep learning models
  • 📊 Add ROC-AUC and Precision-Recall curves
  • 🌐 Deploy as a web app using Flask or Streamlit
  • 📱 Build a real-time SMS filtering API
  • 💾 Export trained model with joblib for production


👤 Author


╔════════════════════════════════════╗
║                                    ║
║         Hasnain Haider             ║
║                                    ║
║   Machine Learning Enthusiast      ║
║   Data Science | NLP | Python      ║
║                                    ║
╚════════════════════════════════════╝

LinkedIn



© 2024 Hasnain Haider — Built for educational purposes in NLP & Text Classification

If you found this project helpful, please give it a star!
