Classifying SMS messages as Spam or Ham using NLP & Machine Learning
with Random Forest, SVM, and Naive Bayes on real-world message data.
🚀 Getting Started • 📊 Dataset • 🤖 Models • 📈 Results • 👤 Author
- 🔍 Overview
- 📊 Dataset
- 📁 Project Structure
- ⚙️ Tech Stack
- 🚀 Getting Started
- 🔬 How It Works
- 🤖 Models & Results
- 📈 Results
- 🧪 Manual Testing
- ⚠️ Known Issues & Notes
- 🔮 Future Improvements
- 👤 Author
SMS spam is a persistent problem affecting millions of users daily. This project builds a complete NLP-powered spam detection pipeline that classifies SMS messages as either Spam or Ham (legitimate). It combines custom text preprocessing, stopword removal, stemming, word frequency analysis, and message length features, then feeds them into multiple machine learning models for comparison. The best-performing model, the Random Forest Classifier, achieves 95.78% accuracy on test data.
| 🏆 Feature | 📋 Detail |
|---|---|
| 🧠 Models Used | Random Forest, SVM, Naive Bayes |
| 📦 Dataset Size | 5,572 SMS messages |
| 🎯 Task | Binary Classification (Spam vs. Ham) |
| 🔤 NLP | Stopword Removal, Stemming, Word Frequency |
| 📐 Best Accuracy | 95.78% (Random Forest) |
📂 spam.csv
├── 5,572 total SMS messages
├── v1 → Label (ham / spam)
└── v2 → Message Text
📥 Download from Kaggle — SMS Spam Collection Dataset
| Column | Description |
|---|---|
| v1 | Target label: ham or spam |
| v2 | Raw SMS message text |
| Unnamed: 2–4 | Sparse extra columns, dropped during preprocessing |
Ham ████████████████████████████████████████████ 86.6%
Spam █████░░ 13.4%
⚠️ Note: Dataset is imbalanced — ham messages dominate. This can affect recall on spam detection.
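As a minimal sketch of how the columns above might be loaded and the class balance checked: the in-memory sample below is a hypothetical stand-in for spam.csv with the same column layout, `load_sms` is an illustrative helper name, and `encoding="latin-1"` is an assumption about the Kaggle file rather than something this README states.

```python
import io

import pandas as pd

# Hypothetical four-row stand-in for spam.csv, using the same column layout.
SAMPLE_CSV = io.BytesIO(
    (
        "v1,v2,Unnamed: 2,Unnamed: 3,Unnamed: 4\n"
        "ham,See you at lunch,,,\n"
        "spam,WIN a free prize now,,,\n"
        "ham,Running late sorry,,,\n"
        "ham,Call me when you can,,,\n"
    ).encode("latin-1")
)


def load_sms(source):
    """Read the CSV and keep only the label (v1) and text (v2) columns."""
    # Assumption: the real file may need encoding="latin-1" to load cleanly.
    raw = pd.read_csv(source, encoding="latin-1")
    # Selecting v1 and v2 drops the sparse "Unnamed" columns.
    return raw[["v1", "v2"]]


data = load_sms(SAMPLE_CSV)

# Percentage of messages per class.
class_share = data["v1"].value_counts(normalize=True).mul(100).round(1)
print(class_share)
```

With the real dataset, passing the path to spam.csv instead of `SAMPLE_CSV` should reproduce the ham/spam split shown in the bar chart above.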
📦 sms-spam-detection/
│
├── 📄 spam.csv ← Dataset (download from Kaggle)
├── 📓 spam_detection.ipynb ← Main Jupyter Notebook
└── 📝 README.md ← Project documentation
pip install numpy pandas matplotlib seaborn scikit-learn nltk
# Download NLTK stopwords
import nltk
nltk.download('stopwords')
# 1. Clone the repository
git clone https://github.com/Hasnain006-nain/sms-spam-detection.git
cd sms-spam-detection
# 2. Install dependencies
pip install numpy pandas matplotlib seaborn scikit-learn nltk
# 3. Add dataset
# Place spam.csv in the project root directory
# 4. Launch Jupyter Notebook
jupyter notebook spam_detection.ipynb
# Alternatively, in Google Colab: mount Google Drive and load the dataset from there
from google.colab import drive
import pandas as pd
drive.mount('/content/drive')
data = pd.read_csv("/content/drive/MyDrive/spam.csv")
┌──────────────────────────────────────────────────────────────────┐
│ PIPELINE OVERVIEW │
├──────────────────────────────────────────────────────────────────┤
│ │
│ 1. 📥 Load Data → Read spam.csv with Pandas │
│ │
│ 2. 🔎 Explore Data → Class distribution, column info │
│ │
│ 3. 🧹 Preprocess Text → Lowercase, remove punctuation │
│ │
│ 4. 🔤 NLP Pipeline → Remove stopwords, apply stemming │
│ │
│ 5. 📏 Feature: Length → Bin message lengths into categories │
│ │
│ 6. 📊 Feature: Diff → SpamWordCount - HamWordCount │
│ │
│ 7. ✂️ Split Data → 80% Train | 20% Test │
│ │
│ 8. 🌲 Train Models → RandomForest + SVM + NaiveBayes │
│ │
│ 9. 📈 Evaluate → Accuracy + Confusion Matrix │
│ │
└──────────────────────────────────────────────────────────────────┘
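Steps 3 to 6 of the pipeline can be sketched roughly as follows. This is an illustration under stated assumptions, not the notebook's exact code: the toy messages are invented, a small hard-coded stopword set stands in for the NLTK list, stemming is omitted, Diff is read as the summed per-word difference between spam and ham counts, and Length is taken to be the character count of the raw message.

```python
import re
from collections import Counter

import pandas as pd

# Assumption: a tiny stopword set so the sketch runs without NLTK downloads.
STOP_WORDS = {"the", "a", "to", "you", "is", "at", "for", "and", "your", "now", "me", "on"}


def clean(text):
    """Lowercase, strip punctuation and digits, drop stopwords and short words."""
    text = re.sub(r"[^a-z\s]", " ", text.lower())
    return [w for w in text.split() if w not in STOP_WORDS and len(w) > 2]


# Hypothetical toy messages in the dataset's v1/v2 layout.
data = pd.DataFrame({
    "v1": ["ham", "spam", "ham", "spam"],
    "v2": [
        "See you at lunch",
        "WIN a free prize now!!",
        "Lunch is on me",
        "Free entry, claim your prize",
    ],
})
data["Tokens"] = data["v2"].apply(clean)

# Word frequencies per class.
spam_counts = Counter(w for toks in data.loc[data.v1 == "spam", "Tokens"] for w in toks)
ham_counts = Counter(w for toks in data.loc[data.v1 == "ham", "Tokens"] for w in toks)


def diff_score(tokens):
    """Sum of (spam count - ham count) over the words of one message."""
    return sum(spam_counts[w] - ham_counts[w] for w in tokens)


data["Diff"] = data["Tokens"].apply(diff_score)

# Bin character lengths with the same edges and labels the README uses.
data["Length"] = pd.cut(
    data["v2"].str.len(),
    [-1, 10, 20, 30, 50, 75, 100, 999],
    labels=[10, 20, 30, 50, 75, 100, 200],
)
print(data[["v1", "Diff", "Length"]])
```

Positive Diff values indicate a vocabulary that leans toward spam, negative values toward ham, which is why a single number can already separate the classes reasonably well.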
1. Text Cleaning
data["Text"] = data.v2.str.lower()
# regex=True is required in pandas 2.0+, where str.replace defaults to literal matching
data.Text = data.Text.str.replace(r'[.,\\&;!:-?(|)#@$^%*0-9]*', '', regex=True)
2. Stopword Removal
from nltk.corpus import stopwords
stop_words = set(stopwords.words('english'))
# Keep words that are not stopwords and are longer than two characters
text = [word for word in message.split() if word not in stop_words and len(word) > 2]
3. Word Frequency Difference (Diff Feature)
# Key insight: "free" appears 219x in spam vs 59x in ham
data["Diff"] = SpamWordCount - HamWordCount
4. Message Length Binning
pd.cut(data.Length, [-1, 10, 20, 30, 50, 75, 100, 999], labels=[10,20,30,50,75,100,200])
| Model | Description |
|---|---|
| 🌲 Random Forest | Ensemble of decision trees with GridSearchCV tuning |
| 📐 SVM | Support Vector Classifier with RBF kernel |
| 📊 Naive Bayes | Gaussian Naive Bayes probabilistic classifier |
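A minimal sketch of how the three models in the table could be trained and compared on an 80/20 split. The data here is synthetic (two columns standing in for the Length and Diff features), the grid is a reduced, hypothetical one chosen to keep the run short, and the stratified split and fixed random seeds are assumptions rather than documented settings of the notebook.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.svm import SVC

# Synthetic stand-in for the engineered features: column 0 = Length bin, column 1 = Diff.
rng = np.random.default_rng(0)
n_ham, n_spam = 260, 40  # imbalanced on purpose, roughly like the real data
X = np.vstack([
    np.column_stack([rng.choice([10, 20, 30, 50], n_ham), rng.normal(-50, 30, n_ham)]),
    np.column_stack([rng.choice([75, 100, 200], n_spam), rng.normal(80, 30, n_spam)]),
])
y = np.array([0] * n_ham + [1] * n_spam)  # 0 = ham, 1 = spam

# 80% train / 20% test; stratify keeps the class ratio equal in both parts.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

# Reduced illustrative grid for the Random Forest search.
small_grid = {"n_estimators": [4, 6, 9], "max_depth": [2, 3, 5]}
models = {
    "Random Forest": GridSearchCV(
        RandomForestClassifier(random_state=42), small_grid, cv=3, scoring="accuracy"
    ),
    "SVM": SVC(kernel="rbf"),
    "Naive Bayes": GaussianNB(),
}

scores = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    scores[name] = accuracy_score(y_test, model.predict(X_test))

for name, score in scores.items():
    print(f"{name}: {score:.4f}")
```

Scores on this synthetic data say nothing about the real dataset; the point is only the shape of the workflow: one split, one fit per model, one shared metric.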
parameters = {
'n_estimators': [4, 6, 9],
'max_features': ['log2', 'sqrt', 'auto'],  # 'auto' is deprecated in newer scikit-learn; see Known Issues & Notes
'criterion': ['entropy', 'gini'],
'max_depth': [2, 3, 5, 10],
'min_samples_split': [2, 3, 5],
'min_samples_leaf': [1, 5, 8]
}
╔══════════════════════════════════════════════════════════╗
║ MODEL PERFORMANCE SUMMARY ║
╠══════════════════════════════════════════════════════════╣
║ Approach │ Accuracy ║
╠══════════════════════════════════════════════════════════╣
║ Word Freq Only (Diff) │ 94.34% ██████████████████ ✅ ║
║ Random Forest │ 95.78% ███████████████████ ✅ ║
║ Naive Bayes │ 94.34% ██████████████████ ✅ ║
║ SVM │ 87.80% █████████████████ ⚠️ ║
╚══════════════════════════════════════════════════════════╝
Predicted Ham Predicted Spam
Actual Ham 4723 102
Actual Spam 173 574
🏆 Random Forest wins with 95.78% accuracy using only two engineered features: Length and Diff.
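Because the dataset is imbalanced, accuracy alone hides how well spam is caught. The sketch below derives precision, recall, and F1 for the spam class directly from the confusion matrix above. Note that the four cells sum to 5,572, the full dataset size, which suggests the matrix covers all messages rather than only the 20% test split, so its accuracy (about 95.1%) is not directly comparable to the 95.78% headline figure.

```python
# Cells copied from the confusion matrix above (spam is the positive class).
tn, fp = 4723, 102  # actual ham:  predicted ham, predicted spam
fn, tp = 173, 574   # actual spam: predicted ham, predicted spam

total = tn + fp + fn + tp
accuracy = (tp + tn) / total
precision = tp / (tp + fp)  # share of predicted spam that really is spam
recall = tp / (tp + fn)     # share of actual spam that was caught
f1 = 2 * precision * recall / (precision + recall)

print(f"messages:  {total}")
print(f"accuracy:  {accuracy:.4f}")
print(f"precision: {precision:.4f}")
print(f"recall:    {recall:.4f}")
print(f"f1:        {f1:.4f}")
```

Spam recall comes out near 77%, meaning roughly one in four spam messages is still classified as ham, which is consistent with the imbalance warning in the Dataset section.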
The project includes a live prediction function for custom input:
manual_entry()
# Enter message: Congratulations! You've won a free phone. Call now on 9999999999
# Output: Spam ✅
The model correctly identifies promotional and unsolicited messages as spam in real time.
Ham messages make up ~87% of the dataset. This may reduce the model's sensitivity to spam. Consider:
# Option 1: Class weighting
clf = RandomForestClassifier(class_weight='balanced')
# Option 2: SMOTE oversampling
from imblearn.over_sampling import SMOTE
sm = SMOTE(random_state=42)
X_train, y_train = sm.fit_resample(X_train, y_train)
max_features='auto' is deprecated in newer versions of scikit-learn. Replace with:
'max_features': ['log2', 'sqrt']
temp.append() is removed in Pandas 2.0+. Replace with:
temp = pd.concat([temp, pd.DataFrame({"Text": [input_text]})], ignore_index=True)
- 🔄 Apply TF-IDF Vectorization for richer text features
- 🧠 Try LSTM / BERT deep learning models
- 📊 Add ROC-AUC and Precision-Recall curves
- 🌐 Deploy as a web app using Flask or Streamlit
- 📱 Build a real-time SMS filtering API
- 💾 Export trained model with joblib for production
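Two of the items above, TF-IDF features and joblib export, might look roughly like this. It is a sketch of a possible direction, not part of the current project: the toy messages are invented, the file name is hypothetical, and MultinomialNB is swapped in for the Gaussian Naive Bayes used elsewhere because it works directly on sparse TF-IDF matrices.

```python
import os
import tempfile

import joblib
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Hypothetical toy corpus; the real project would use the v2 column of spam.csv.
texts = [
    "Win a free prize, claim now",
    "Free entry in a weekly competition",
    "Are we still meeting for lunch today",
    "I will call you after work",
]
labels = ["spam", "spam", "ham", "ham"]

# Vectorizer and classifier bundled so that both are saved and restored together.
pipeline = make_pipeline(TfidfVectorizer(stop_words="english"), MultinomialNB())
pipeline.fit(texts, labels)

# Persist the fitted pipeline to disk and load it back.
model_path = os.path.join(tempfile.mkdtemp(), "spam_model.joblib")
joblib.dump(pipeline, model_path)
restored = joblib.load(model_path)

print(restored.predict(["Claim your free prize"]))
```

Saving the whole pipeline rather than only the classifier avoids a common production mistake: loading a model without the vocabulary it was trained on.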
╔════════════════════════════════════╗
║ ║
║ Hasnain Haider ║
║ ║
║ Machine Learning Enthusiast ║
║ Data Science | NLP | Python ║
║ ║
╚════════════════════════════════════╝
© 2024 Hasnain Haider — Built for educational purposes in NLP & Text Classification
⭐ If you found this project helpful, please give it a star! ⭐