Rishi-Kukadiya/Feature-Selection-Prediction

🧬 Feature Selection using Genetic Algorithm & Rough Set Theory

A machine learning implementation that combines a Genetic Algorithm (GA) with Rough Set Theory to optimize feature selection for classification tasks. The project demonstrates how soft computing techniques can reduce dimensionality while maintaining model performance.


🎯 Overview

This project addresses the curse of dimensionality in machine learning through an intelligent feature selection framework. By leveraging Genetic Algorithms for population-based optimization and Rough Set Theory for measuring feature dependency, we achieve:

  • ✅ 55.56% feature reduction (54 → 24 features)
  • ✅ 93.51% classification accuracy maintained
  • ✅ Improved model interpretability and reduced computational cost
  • ✅ Automated feature selection, eliminating manual feature engineering

πŸ” Motivation & Problem Statement

The Challenge

Large datasets often contain redundant and irrelevant features that:

  • Increase computational complexity
  • Reduce model interpretability
  • Introduce noise and overfitting
  • Waste storage and memory resources

Our Solution

We employ a hybrid intelligent system that combines:

  1. Genetic Algorithm - Population-based evolutionary optimization
  2. Rough Set Theory - Mathematical framework for measuring feature dependency
  3. Discretization - Converting continuous values to categorical bins

🔧 Technical Approach

1. Data Preprocessing Pipeline

Raw Data → Memory Optimization → Discretization → Train-Test Split

Key Steps:

  • Data Type Optimization: converts numeric types (e.g., int32 → int8 for binary features)
  • Quantile-based Discretization: transforms 10 continuous features into 5 discrete bins
  • Memory-Efficient Processing: reduces the memory footprint from 48.8 MB to 31.96 MB
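
The two preprocessing ideas above can be sketched with pandas on synthetic stand-in columns (an illustration only; the column names mimic the dataset, and the project's real preprocessing lives in DataPipeline). `pd.qcut` is one way to do quantile-based binning:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Synthetic stand-ins for two dataset columns (illustration only)
df = pd.DataFrame({
    "Elevation": rng.integers(2000, 4000, size=1000),
    "Wilderness_Area1": rng.integers(0, 2, size=1000),
})

# Data type optimization: binary one-hot columns fit in int8
df["Wilderness_Area1"] = df["Wilderness_Area1"].astype("int8")

# Quantile-based discretization: 5 bins with roughly equal membership
df["Elevation"] = pd.qcut(df["Elevation"], q=5, labels=False).astype("int8")

print(df["Elevation"].value_counts().sort_index())
```

Quantile binning keeps each bin roughly equally populated, which matters here because Rough Set dependency is computed over equivalence classes of discrete values.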

2. Rough Set Theory Fundamentals

Rough Sets provide a mathematical framework for handling uncertainty:

Dependency Coefficient (γ):
γ = |POS(P,D)| / |U|

Where:
- POS(P,D) = Positive Region (consistent tuples)
- U = Universe (total tuples)
- P = Conditional Attributes (features)
- D = Decision Attribute (target)

Fitness Function:

Fitness = γ(subset) - 0.05 × (|subset| / |all features|)
        = Dependency - Size Penalty
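
A tiny worked example of the dependency coefficient (toy data, not from the project): with one conditional attribute P, the positive region contains exactly those tuples whose P-value maps to a single decision value.

```python
import pandas as pd

# Toy decision table: 6 tuples, conditional attribute P, decision D
df = pd.DataFrame({
    "P": [0, 0, 1, 1, 2, 2],
    "D": [1, 1, 1, 2, 2, 2],
})

# Positive region: tuples whose P-equivalence class has one decision value.
# P=0 -> {1,1} consistent; P=1 -> {1,2} inconsistent; P=2 -> {2,2} consistent.
pos = df.groupby("P")["D"].filter(lambda d: d.nunique() == 1)
gamma = len(pos) / len(df)   # |POS(P,D)| / |U| = 4/6

# Size penalty, assuming 1 feature selected out of 3 candidates
fitness = gamma - 0.05 * (1 / 3)
print(gamma, fitness)
```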

3. Genetic Algorithm Framework

| Component       | Details                                                    |
|-----------------|------------------------------------------------------------|
| Chromosome      | Binary vector (0 = feature excluded, 1 = feature included) |
| Population Size | 50 individuals                                             |
| Generations     | 50 iterations                                              |
| Mutation Rate   | 2% bit-flip probability                                    |
| Crossover       | Two-point crossover between parents                        |
| Selection       | Tournament selection (top 2 of 4 random individuals)       |
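
The three GA operators above can be sketched as plain functions (a minimal illustration; the project's versions live inside GApipeline and may differ in detail):

```python
import random

random.seed(0)

def tournament_select(population, fitnesses, k=4):
    """Draw k random individuals and return the two fittest (top 2 of 4)."""
    contenders = random.sample(range(len(population)), k)
    best = sorted(contenders, key=lambda i: fitnesses[i], reverse=True)[:2]
    return population[best[0]], population[best[1]]

def two_point_crossover(p1, p2):
    """Swap the segment between two random cut points."""
    a, b = sorted(random.sample(range(1, len(p1)), 2))
    return p1[:a] + p2[a:b] + p1[b:], p2[:a] + p1[a:b] + p2[b:]

def mutate(chrom, rate=0.02):
    """Flip each bit independently with probability `rate`."""
    return [1 - bit if random.random() < rate else bit for bit in chrom]

# Demo on a toy population of 10-bit chromosomes
pop = [[random.randint(0, 1) for _ in range(10)] for _ in range(6)]
fits = [sum(c) for c in pop]            # stand-in fitness for the demo
m, f = tournament_select(pop, fits)
c1, c2 = two_point_crossover(m, f)
print(mutate(c1))
```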

GA Workflow:

Initialize Random Population
    ↓
Calculate Fitness for Each Individual (Rough Set Dependency)
    ↓
Select Best Parents (Tournament Selection)
    ↓
Crossover & Mutation
    ↓
Create New Generation
    ↓
Repeat for N Generations
    ↓
Return Best Feature Subset

πŸ“ Project Structure

Feature-Selection-GA-RoughSet/
β”‚
β”œβ”€β”€ main.ipynb                      # Complete implementation notebook
β”œβ”€β”€ Data.csv                        # Forest Cover Type dataset (581,012 samples)
β”œβ”€β”€ README.md                       # This file
└── .gitignore                      # Git ignore configuration

Key Modules:
β”œβ”€β”€ DataPipeline                    # Data loading & preprocessing
β”œβ”€β”€ GApipeline                      # Genetic Algorithm implementation
└── Visualization & Evaluation      # Results analysis

📊 Dataset Description

Forest Cover Type Dataset

Source: UCI Machine Learning Repository
Samples: 581,012 instances
Features: 54 attributes
Classes: 7 forest cover types
Target: Cover_Type (1-7)

Feature Categories

| Category         | Count | Type                     | Description                                      |
|------------------|-------|--------------------------|--------------------------------------------------|
| Physiological    | 10    | Continuous → Discretized | Elevation, slope, aspect, distances to hydrology |
| Wilderness Areas | 4     | Binary                   | One-hot encoded wilderness area classification   |
| Soil Type        | 40    | Binary                   | One-hot encoded soil composition                 |
| Target           | 1     | Categorical              | Forest cover type (7 classes)                    |

Example Features:

  • Elevation: Height above sea level (2000-4000 meters)
  • Aspect: Compass direction (0-360°)
  • Slope: Steepness (0-90°)
  • Horizontal_Distance_To_Hydrology: Proximity to water features
  • Hillshade_*: Solar radiation indices
  • Wilderness_Area*: Geographic regions (mutually exclusive)
  • Soil_Type*: Geological composition (mutually exclusive)

✨ Key Features

1. Intelligent Data Pipeline

pipeline = DataPipeline(file_path="Data.csv")
pipeline.load_and_optimize()           # Optimize data types
pipeline.preprocess_for_rough_set()    # Discretize continuous features
X_ga, Y_ga = pipeline.prepare_subsets() # Create GA training set

2. GA Feature Selection

ga = GApipeline(df, target_col='Cover_Type',
                pop_size=50, generations=50, mutation_rate=0.02)
best_subset, history = ga.run()        # Execute GA optimization

3. Rough Set Dependency Calculation

  • Measures feature importance based on decision consistency
  • Non-parametric approach (no assumptions about data distribution)
  • Handles categorical and discretized continuous data

4. Comprehensive Visualization

  • Generation-wise fitness evolution
  • Feature selection comparison (selected vs. excluded)
  • Model performance metrics

📈 Results & Performance

Feature Selection Results

╔════════════════════════════════╗
║   FINAL PROJECT REPORT         ║
╠════════════════════════════════╣
║ Original Feature Count: 54     ║
║ Reduced Feature Count:  24     ║
║ Feature Reduction:     55.56%  ║
║ Final Test Accuracy:    0.9351 ║
╚════════════════════════════════╝

Selected Features (24 out of 54)

Physiological Features (10/10):

  • Elevation, Aspect, Slope
  • Horizontal_Distance_To_Hydrology, Vertical_Distance_To_Hydrology
  • Horizontal_Distance_To_Roadways
  • Hillshade_9am, Hillshade_Noon, Hillshade_3pm
  • Horizontal_Distance_To_Fire_Points

Wilderness Areas (3/4):

  • Wilderness_Area1, Wilderness_Area2, Wilderness_Area3

Soil Types (11/40):

  • Soil_Type2, Soil_Type4, Soil_Type10, Soil_Type15, Soil_Type20
  • Soil_Type23, Soil_Type24, Soil_Type29, Soil_Type32, Soil_Type33, Soil_Type39

Classification Performance

              precision    recall  f1-score   support
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
Class 1         0.94      0.93      0.94      42368
Class 2         0.95      0.95      0.95      56661  ← Majority Class
Class 3         0.92      0.92      0.92       7151
Class 4         0.80      0.79      0.80        549   ← Minority Class
Class 5         0.84      0.83      0.83       1899
Class 6         0.87      0.85      0.86       3473
Class 7         0.94      0.94      0.94       4102
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
Accuracy: 0.9351 (116,203 test samples)
Weighted F1: 0.94

GA Convergence Analysis

  • Convergence Rate: 30-40 generations (stable by generation 30)
  • Best Fitness: 0.9579 (achieved at generation 41)
  • Plateau Pattern: Indicates near-optimal solution found

🚀 Installation & Setup

Prerequisites

  • Python 3.8+
  • Jupyter Notebook

Required Libraries

pandas>=1.3.0          # Data manipulation
numpy>=1.20.0          # Numerical computing
scikit-learn>=0.24.0   # Machine learning tools
matplotlib>=3.3.0      # Plotting
seaborn>=0.11.0        # Statistical visualization

Installation Steps

# Clone the repository
git clone https://github.com/yourusername/Feature-Selection-GA-RoughSet.git
cd Feature-Selection-GA-RoughSet

# Install dependencies
pip install -r requirements.txt

# Launch Jupyter
jupyter notebook

📖 Usage Guide

Complete Workflow

Step 1: Data Preparation

from main import DataPipeline

# Initialize pipeline
pipeline = DataPipeline("Data.csv")

# Load and optimize data types
pipeline.load_and_optimize()

# Visualize class distribution
pipeline.visualize_distribution()

# Preprocess for Rough Set analysis
pipeline.preprocess_for_rough_set(n_bins=5)

# Prepare data for GA
X_ga, Y_ga = pipeline.prepare_subsets(sample_size=0.02)
ga_training_df = pd.concat([X_ga, Y_ga], axis=1)

Step 2: GA Feature Selection

from main import GApipeline

# Initialize GA
ga = GApipeline(ga_training_df, target_col='Cover_Type',
               pop_size=50, generations=50, mutation_rate=0.02)

# Run feature selection
best_subset, history = ga.run()

# Extract selected features
selected_indices = [i for i, bit in enumerate(best_subset) if bit == 1]
selected_features = [ga.features[i] for i in selected_indices]

Step 3: Model Training & Evaluation

from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score, classification_report

# Prepare data with selected features only
X_train_final = X_train[selected_features]
X_test_final = X_test[selected_features]

# Train classifier
final_model = DecisionTreeClassifier(random_state=42)
final_model.fit(X_train_final, y_train)

# Evaluate
y_pred = final_model.predict(X_test_final)
accuracy = accuracy_score(y_test, y_pred)
print(f"Accuracy: {accuracy:.4f}")
print(classification_report(y_test, y_pred))

🔬 Implementation Details

1. DataPipeline Class

class DataPipeline:
    """Handles data loading, preprocessing, and discretization"""

    def __init__(self, file_path):
        self.file_path = file_path
        self.df = None

    def load_and_optimize(self):
        """Load CSV and optimize data types"""
        # Reduces memory footprint

    def preprocess_for_rough_set(self, n_bins=10):
        """Discretize continuous features"""
        # Uses quantile-based binning

    def prepare_subsets(self, sample_size=0.2):
        """Create train-validation splits for GA"""
        # Returns stratified subsets

2. GApipeline Class

class GApipeline:
    """Genetic Algorithm for Feature Selection using Rough Sets"""

    def __init__(self, df, target_col, pop_size=50, generations=50, mutation_rate=0.02):
        n_features = df.drop(columns=[target_col]).shape[1]  # one bit per candidate feature
        self.population = np.random.randint(2, size=(pop_size, n_features))

    def fitness(self, chromosome):
        """Calculate Rough Set Dependency Coefficient"""
        # Measures decision consistency
        # Applies size penalty to reduce dimensionality

    def crossover(self, parent1, parent2):
        """Two-point crossover operator"""

    def mutate(self, chromosome):
        """Bit-flip mutation with fixed probability"""

    def run(self):
        """Execute GA optimization loop"""
        # Returns best chromosome and fitness history

3. Rough Set Dependency Calculation

def fitness(self, chromosome):
    # Extract selected features
    selected_cols = [cols[i] for i, bit in enumerate(chromosome) if bit == 1]

    if not selected_cols:
        return 0.0

    # Group by selected features and check target consistency
    grouped = df.groupby(selected_cols)[target]

    # Positive region: groups with single target value
    positive_region = grouped.filter(lambda x: x.nunique() == 1)

    # Dependency coefficient
    gamma = len(positive_region) / len(df)

    # Apply feature count penalty
    penalty = 0.05 * (sum(chromosome) / len(chromosome))

    return gamma - penalty
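
Since the snippet above relies on the class's internal state (`cols`, `df`, `target`), here is a hypothetical standalone version that can be run on a toy table to check both terms of the fitness, the dependency γ and the size penalty:

```python
import pandas as pd

def fitness(df, feature_cols, target, chromosome):
    """Standalone sketch of the Rough Set fitness above (illustration only)."""
    selected = [c for c, bit in zip(feature_cols, chromosome) if bit == 1]
    if not selected:
        return 0.0
    # Positive region: rows in groups with a single target value
    consistent = df.groupby(selected)[target].filter(lambda s: s.nunique() == 1)
    gamma = len(consistent) / len(df)
    penalty = 0.05 * (sum(chromosome) / len(chromosome))
    return gamma - penalty

df = pd.DataFrame({
    "A": [0, 0, 1, 1],          # fully determines the target
    "B": [0, 1, 0, 1],          # irrelevant
    "Cover_Type": [1, 1, 2, 2],
})

# A alone already reaches gamma = 1.0 and pays the smaller penalty, so it wins.
print(fitness(df, ["A", "B"], "Cover_Type", [1, 0]))  # 1.0 - 0.025 = 0.975
print(fitness(df, ["A", "B"], "Cover_Type", [1, 1]))  # 1.0 - 0.050 = 0.950
```

This illustrates why the penalty term matters: both subsets are fully consistent, but the smaller one scores higher, steering the GA toward compact feature sets.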

🔄 Algorithm Flow Diagram

┌─────────────────────────────┐
│  Load & Preprocess Data     │
│  - Load CSV                 │
│  - Optimize data types      │
│  - Discretize continuous    │
│  - Create train/test split  │
└──────────────┬──────────────┘
               │
               ▼
┌─────────────────────────────┐
│  Initialize GA Population   │
│  - 50 random chromosomes    │
│  - Each bit = feature       │
└──────────────┬──────────────┘
               │
               ▼
        ┌──────────────┐
        │ Generation   │◄─────────────┐
        │ Loop (N=50)  │              │
        └──────┬───────┘              │
               │                      │
               ▼                      │
    ┌─────────────────────┐           │
    │ Calculate Fitness   │           │
    │ Rough Set γ(P,D)    │           │
    └─────────┬───────────┘           │
              │                       │
              ▼                       │
    ┌─────────────────────┐           │
    │ Select Parents      │           │
    │ (Tournament)        │           │
    └─────────┬───────────┘           │
              │                       │
              ▼                       │
    ┌─────────────────────┐           │
    │ Crossover & Mutate  │           │
    │ Create Offspring    │           │
    └─────────┬───────────┘           │
              │                       │
              ▼                       │
    ┌─────────────────────┐           │
    │ Form New Population │───────────┘
    └─────────┬───────────┘
              │ after N generations
              ▼
┌──────────────────────────┐
│ Extract Best Features    │
│ & Train Final Model      │
└─────────────┬────────────┘
              │
              ▼
┌──────────────────────────┐
│ Evaluate Performance     │
│ - Accuracy: 93.51%       │
│ - Features: 24/54        │
└──────────────────────────┘

🧪 Experimental Notes

Hyperparameter Tuning

| Parameter           | Value | Rationale                           |
|---------------------|-------|-------------------------------------|
| Pop Size            | 50    | Balances diversity & computation    |
| Generations         | 50    | Sufficient for convergence          |
| Mutation Rate       | 0.02  | 2% prevents premature convergence   |
| Discretization Bins | 5     | Reduces memory, preserves patterns  |
| GA Sample Size      | 2%    | ~9,620 samples for faster iteration |

Why This Approach?

✅ GA Benefits:

  • Global optimization (avoids local optima)
  • No assumptions about data distribution
  • Flexible fitness function

✅ Rough Set Benefits:

  • Measures feature relevance mathematically
  • Handles categorical & discrete data natively
  • No parameter tuning needed

✅ Hybrid Approach:

  • Combines evolutionary search with rough-set dependency measurement
  • Balances accuracy with interpretability
  • Reduces computational overhead

📊 Visualizations Included

  1. Target Class Distribution: Shows imbalanced dataset nature
  2. GA Fitness Evolution: Line plot of best fitness per generation
  3. Feature Selection Results: Bar chart of selected vs. excluded features

🤝 Contributing

Contributions are welcome! Areas for enhancement:

  • Multi-objective optimization (Pareto-optimal feature sets)
  • Alternative discretization methods (ChiMerge, EWD)
  • Different classifiers (Random Forest, SVM)
  • Advanced GA operators (adaptive mutation, elitism)
  • Performance profiling and optimization

How to Contribute:

  1. Fork the repository
  2. Create a feature branch (git checkout -b feature/enhancement)
  3. Commit changes (git commit -am 'Add enhancement')
  4. Push to branch (git push origin feature/enhancement)
  5. Create Pull Request

📚 References

Core Concepts

  1. Rough Set Theory

    • Pawlak, Z. (1991). Rough Sets: Theoretical Aspects of Reasoning about Data
    • Foundations and applications: https://www.springer.com/
  2. Genetic Algorithms

  3. Feature Selection in ML

    • Guyon, I., & Elisseeff, A. (2003). An introduction to variable and feature selection
    • Journal of Machine Learning Research
    • https://www.jmlr.org/

Dataset Reference

  • Forest Cover Type Dataset
    • Blackard, J. A., & Dean, D. J. (1999). Comparative accuracies of artificial neural networks and discriminant analysis in predicting forest cover types
    • UCI ML Repository: https://archive.ics.uci.edu/
    • 581,012 records, 54 attributes

Tools & Libraries


πŸ“ License

This project is licensed under the MIT License - see LICENSE file for details.


📧 Contact & Support

For questions, issues, or collaborations:


🎓 Citation

If you use this project in your research, please cite:

@software{ga_roughset_featsel,
  title={Feature Selection using Genetic Algorithm and Rough Set Theory},
  author={Your Name},
  year={2024},
  url={https://github.com/yourusername/Feature-Selection-GA-RoughSet}
}

⭐ Acknowledgments

  • UCI Machine Learning Repository for the Forest Cover Type dataset
  • Scikit-learn community for excellent ML tools
  • Rough Set Theory community for mathematical foundations

Made with ❤️ for Machine Learning & Soft Computing

Give this project a ⭐ if you found it helpful!
