A machine learning implementation that combines a Genetic Algorithm (GA) with Rough Set Theory to optimize feature selection for classification tasks. This project demonstrates the power of soft computing techniques in reducing dimensionality while maintaining model performance.
- Overview
- Motivation & Problem Statement
- Technical Approach
- Project Structure
- Dataset Description
- Key Features
- Results & Performance
- Installation & Setup
- Usage Guide
- Implementation Details
- Contributing
- References
This project addresses the curse of dimensionality in machine learning through an intelligent feature selection framework. By leveraging Genetic Algorithms for population-based optimization and Rough Set Theory for measuring feature dependency, we achieve:
- ✅ 55.56% feature reduction (54 → 24 features)
- ✅ 93.51% classification accuracy maintained
- ✅ Improved model interpretability and reduced computational cost
- ✅ Automated feature selection, eliminating manual feature engineering
Large datasets often contain redundant and irrelevant features that:
- Increase computational complexity
- Reduce model interpretability
- Introduce noise and overfitting
- Waste storage and memory resources
We employ a hybrid intelligent system that combines:
- Genetic Algorithm - Population-based evolutionary optimization
- Rough Set Theory - Mathematical framework for measuring feature dependency
- Discretization - Converting continuous values to categorical bins
Raw Data → Memory Optimization → Discretization → Train-Test Split
- Data Type Optimization: Converts numeric types (e.g., int32 → int8 for binary features)
- Quantile-based Discretization: Transforms 10 continuous features into 5 discrete bins
- Memory-Efficient Processing: Reduces memory from 48.8 MB to 31.96 MB
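These preprocessing steps can be sketched with pandas (an illustrative sketch, not the notebook's exact code; the column names, sizes, and bin count here are assumptions):

```python
import numpy as np
import pandas as pd

# Toy frame standing in for the raw data (columns are illustrative)
df = pd.DataFrame({
    "Elevation": np.random.uniform(2000, 4000, size=1000),
    "Wilderness_Area1": np.random.randint(0, 2, size=1000).astype("int32"),
})

# Data type optimization: binary one-hot columns fit in int8
df["Wilderness_Area1"] = df["Wilderness_Area1"].astype("int8")

# Quantile-based discretization: 5 roughly equal-frequency bins
df["Elevation"] = pd.qcut(df["Elevation"], q=5, labels=False)

print(df.memory_usage(deep=True).sum(), df["Elevation"].nunique())
```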
Rough Sets provide a mathematical framework for handling uncertainty:
Dependency Coefficient (γ):
γ = |POS(P,D)| / |U|
Where:
- POS(P,D) = Positive Region (consistent tuples)
- U = Universe (total tuples)
- P = Conditional Attributes (features)
- D = Decision Attribute (target)
Fitness Function:
Fitness = γ(subset) - 0.05 × (|subset| / |all features|)
        = Dependency - Size Penalty
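A toy worked example of this fitness (a hypothetical two-feature decision table, not project data):

```python
import pandas as pd

# Tiny decision table: features A, B and decision attribute D
df = pd.DataFrame({
    "A": [0, 0, 1, 1, 1],
    "B": [0, 1, 0, 0, 1],
    "D": [0, 1, 1, 0, 1],
})
subset = ["A", "B"]

# Positive region: rows whose feature-value combination maps to one decision.
# Here (A=1, B=0) occurs twice with D=1 and D=0, so those 2 rows are excluded.
positive = df.groupby(subset)["D"].filter(lambda g: g.nunique() == 1)
gamma = len(positive) / len(df)             # 3 consistent rows / 5 = 0.6

# Size penalty: 0.05 × (selected features / total features)
fitness = gamma - 0.05 * (len(subset) / 2)  # 0.6 - 0.05 = 0.55
print(gamma, fitness)
```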
| Component | Details |
|---|---|
| Chromosome | Binary vector (0=feature excluded, 1=feature included) |
| Population Size | 50 individuals |
| Generations | 50 iterations |
| Mutation Rate | 2% bit-flip probability |
| Crossover | Two-point crossover between parents |
| Selection | Tournament selection (top 2 of 4 random individuals) |
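The operators in the table can be sketched in a few lines of numpy (assumed shapes and helper names; the notebook's implementation may differ in detail):

```python
import numpy as np

rng = np.random.default_rng(0)

def tournament_select(population, fitnesses, k=4):
    """Return the fittest of k randomly sampled individuals."""
    idx = rng.choice(len(population), size=k, replace=False)
    return population[max(idx, key=lambda i: fitnesses[i])]

def two_point_crossover(p1, p2):
    """Copy p1, then splice in p2's segment between two random cut points."""
    a, b = sorted(rng.choice(len(p1), size=2, replace=False))
    child = p1.copy()
    child[a:b] = p2[a:b]
    return child

def mutate(chromosome, rate=0.02):
    """Flip each bit independently with probability `rate`."""
    flips = rng.random(len(chromosome)) < rate
    return np.where(flips, 1 - chromosome, chromosome)

# 50 random binary chromosomes over 54 features
population = rng.integers(0, 2, size=(50, 54))
fitnesses = rng.random(50)  # stand-in for the Rough Set fitness

p1 = tournament_select(population, fitnesses)
p2 = tournament_select(population, fitnesses)
child = mutate(two_point_crossover(p1, p2))
print(child.shape)
```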
Initialize Random Population
        ↓
Calculate Fitness for Each Individual (Rough Set Dependency)
        ↓
Select Best Parents (Tournament Selection)
        ↓
Crossover & Mutation
        ↓
Create New Generation
        ↓
Repeat for N Generations
        ↓
Return Best Feature Subset
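Wired together, the loop looks roughly like this (a self-contained sketch using a trivial stand-in objective instead of the Rough Set dependency, so it runs on its own):

```python
import numpy as np

rng = np.random.default_rng(1)
N_FEATURES, POP_SIZE, GENERATIONS, MUTATION_RATE = 54, 50, 50, 0.02

def fitness(chromosome):
    # Stand-in objective (NOT the Rough Set coefficient): fraction of 1-bits
    return chromosome.mean()

population = rng.integers(0, 2, size=(POP_SIZE, N_FEATURES))
history = []
for _ in range(GENERATIONS):
    fits = np.array([fitness(c) for c in population])
    history.append(fits.max())
    offspring = []
    for _ in range(POP_SIZE):
        # Tournament selection: best of 4 random individuals, twice
        pick = lambda: max(rng.choice(POP_SIZE, 4, replace=False),
                           key=lambda k: fits[k])
        i, j = pick(), pick()
        # Two-point crossover
        a, b = sorted(rng.choice(N_FEATURES, 2, replace=False))
        child = population[i].copy()
        child[a:b] = population[j][a:b]
        # Bit-flip mutation
        flips = rng.random(N_FEATURES) < MUTATION_RATE
        offspring.append(np.where(flips, 1 - child, child))
    population = np.array(offspring)

best = population[np.argmax([fitness(c) for c in population])]
print(len(history), int(best.sum()))
```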
```
Feature-Selection-GA-RoughSet/
│
├── main.ipynb    # Complete implementation notebook
├── Data.csv      # Forest Cover Type dataset (581,012 samples)
├── README.md     # This file
└── .gitignore    # Git ignore configuration
```
Key Modules:
```
├── DataPipeline                 # Data loading & preprocessing
├── GApipeline                   # Genetic Algorithm implementation
└── Visualization & Evaluation   # Results analysis
```
- Source: UCI Machine Learning Repository
- Samples: 581,012 instances
- Features: 54 attributes
- Classes: 7 forest cover types
- Target: Cover_Type (1-7)
| Category | Count | Type | Description |
|---|---|---|---|
| Physiological | 10 | Continuous → Discretized | Elevation, slope, aspect, distances to hydrology |
| Wilderness Areas | 4 | Binary | One-hot encoded wilderness area classification |
| Soil Type | 40 | Binary | One-hot encoded soil composition |
| Target | 1 | Categorical | Forest cover type (7 classes) |
- Elevation: Height above sea level (2,000-4,000 meters)
- Aspect: Compass direction (0-360°)
- Slope: Steepness (0-90°)
- Horizontal_Distance_To_Hydrology: Proximity to water features
- Hillshade_*: Solar radiation indices
- Wilderness_Area*: Geographic regions (mutually exclusive)
- Soil_Type*: Geological composition (mutually exclusive)
```python
pipeline = DataPipeline(file_path="Data.csv")
pipeline.load_and_optimize()             # Optimize data types
pipeline.preprocess_for_rough_set()      # Discretize continuous features
X_ga, Y_ga = pipeline.prepare_subsets()  # Create GA training set

ga = GApipeline(df, target_col='Cover_Type',
                pop_size=50, generations=50, mutation_rate=0.02)
best_subset, history = ga.run()          # Execute GA optimization
```

- Measures feature importance based on decision consistency
- Non-parametric approach (no assumptions about data distribution)
- Handles categorical and discretized continuous data
- Generation-wise fitness evolution
- Feature selection comparison (selected vs. excluded)
- Model performance metrics
```
┌────────────────────────────────┐
│      FINAL PROJECT REPORT      │
├────────────────────────────────┤
│ Original Feature Count: 54     │
│ Reduced Feature Count:  24     │
│ Feature Reduction:      55.56% │
│ Final Test Accuracy:    0.9351 │
└────────────────────────────────┘
```
Physiological Features (10/10):
- Elevation, Aspect, Slope
- Horizontal_Distance_To_Hydrology, Vertical_Distance_To_Hydrology
- Horizontal_Distance_To_Roadways
- Hillshade_9am, Hillshade_Noon, Hillshade_3pm
- Horizontal_Distance_To_Fire_Points
Wilderness Areas (3/4):
- Wilderness_Area1, Wilderness_Area2, Wilderness_Area3
Soil Types (11/40):
- Soil_Type2, Soil_Type4, Soil_Type10, Soil_Type15, Soil_Type20
- Soil_Type23, Soil_Type24, Soil_Type29, Soil_Type32, Soil_Type33, Soil_Type39
```
            precision    recall  f1-score   support
───────────────────────────────────────────────────
Class 1        0.94       0.93      0.94     42368
Class 2        0.95       0.95      0.95     56661  ← Majority Class
Class 3        0.92       0.92      0.92      7151
Class 4        0.80       0.79      0.80       549  ← Minority Class
Class 5        0.84       0.83      0.83      1899
Class 6        0.87       0.85      0.86      3473
Class 7        0.94       0.94      0.94      4102
───────────────────────────────────────────────────
Accuracy: 0.9351 (116,203 test samples)
Weighted F1: 0.94
```
- Convergence Rate: 30-40 generations (stable by generation 30)
- Best Fitness: 0.9579 (achieved at generation 41)
- Plateau Pattern: Indicates near-optimal solution found
- Python 3.8+
- Jupyter Notebook
```
pandas>=1.3.0         # Data manipulation
numpy>=1.20.0         # Numerical computing
scikit-learn>=0.24.0  # Machine learning tools
matplotlib>=3.3.0     # Plotting
seaborn>=0.11.0       # Statistical visualization
```

```bash
# Clone the repository
git clone https://github.com/yourusername/Feature-Selection-GA-RoughSet.git
cd Feature-Selection-GA-RoughSet

# Install dependencies
pip install -r requirements.txt

# Launch Jupyter
jupyter notebook
```

```python
import pandas as pd

from main import DataPipeline

# Initialize pipeline
pipeline = DataPipeline("Data.csv")

# Load and optimize data types
pipeline.load_and_optimize()

# Visualize class distribution
pipeline.visualize_distribution()

# Preprocess for Rough Set analysis
pipeline.preprocess_for_rough_set(n_bins=5)

# Prepare data for GA
X_ga, Y_ga = pipeline.prepare_subsets(sample_size=0.02)
ga_training_df = pd.concat([X_ga, Y_ga], axis=1)
```

```python
from main import GApipeline

# Initialize GA
ga = GApipeline(ga_training_df, target_col='Cover_Type',
                pop_size=50, generations=50, mutation_rate=0.02)

# Run feature selection
best_subset, history = ga.run()

# Extract selected features
selected_indices = [i for i, bit in enumerate(best_subset) if bit == 1]
selected_features = [ga.features[i] for i in selected_indices]
```

```python
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score, classification_report

# Prepare data with selected features only
X_train_final = X_train[selected_features]
X_test_final = X_test[selected_features]

# Train classifier
final_model = DecisionTreeClassifier(random_state=42)
final_model.fit(X_train_final, y_train)

# Evaluate
y_pred = final_model.predict(X_test_final)
accuracy = accuracy_score(y_test, y_pred)
print(f"Accuracy: {accuracy:.4f}")
print(classification_report(y_test, y_pred))
```

```python
class DataPipeline:
    """Handles data loading, preprocessing, and discretization."""

    def __init__(self, file_path):
        self.file_path = file_path
        self.df = None

    def load_and_optimize(self):
        """Load CSV and optimize data types."""
        # Reduces memory footprint

    def preprocess_for_rough_set(self, n_bins=10):
        """Discretize continuous features."""
        # Uses quantile-based binning

    def prepare_subsets(self, sample_size=0.2):
        """Create train-validation splits for GA."""
        # Returns stratified subsets
```

```python
class GApipeline:
    """Genetic Algorithm for feature selection using Rough Sets."""

    def __init__(self, df, target_col, pop_size=50, generations=50):
        self.population = np.random.randint(2, size=(pop_size, n_features))

    def fitness(self, chromosome):
        """Calculate the Rough Set dependency coefficient."""
        # Measures decision consistency
        # Applies a size penalty to reduce dimensionality

    def crossover(self, parent1, parent2):
        """Two-point crossover operator."""

    def mutate(self, chromosome):
        """Bit-flip mutation with fixed probability."""

    def run(self):
        """Execute the GA optimization loop."""
        # Returns best chromosome and fitness history
```

```python
def fitness(self, chromosome):
    # Extract selected features
    selected_cols = [self.features[i] for i, bit in enumerate(chromosome) if bit == 1]
    if not selected_cols:
        return 0.0

    # Group by selected features and check target consistency
    grouped = self.df.groupby(selected_cols)[self.target_col]

    # Positive region: groups with a single target value
    positive_region = grouped.filter(lambda x: x.nunique() == 1)

    # Dependency coefficient
    gamma = len(positive_region) / len(self.df)

    # Apply feature count penalty
    penalty = 0.05 * (sum(chromosome) / len(chromosome))
    return gamma - penalty
```
```
┌──────────────────────────────┐
│   Load & Preprocess Data     │
│   - Load CSV                 │
│   - Optimize data types      │
│   - Discretize continuous    │
│   - Create train/test split  │
└──────────────┬───────────────┘
               │
               ▼
┌──────────────────────────────┐
│  Initialize GA Population    │
│  - 50 random chromosomes     │
│  - Each bit = feature        │
└──────────────┬───────────────┘
               │
               ▼
        ┌──────────────┐
        │  Generation  │
        │  Loop (N=50) │◄────────────┐
        └──────┬───────┘             │
               │                     │
               ▼                     │
     ┌─────────────────────┐         │
     │  Calculate Fitness  │         │
     │   Rough Set γ(P,D)  │         │
     └─────────┬───────────┘         │
               │                     │
               ▼                     │
     ┌─────────────────────┐         │
     │    Select Parents   │         │
     │     (Tournament)    │         │
     └─────────┬───────────┘         │
               │                     │
               ▼                     │
     ┌─────────────────────┐         │
     │  Crossover & Mutate │         │
     │   Create Offspring  │         │
     └─────────┬───────────┘         │
               │                     │
               ▼                     │
     ┌─────────────────────┐         │
     │ Form New Population │─────────┘
     └─────────┬───────────┘
               │
               ▼
┌──────────────────────────┐
│   Extract Best Features  │
│   & Train Final Model    │
└─────────────┬────────────┘
              │
              ▼
┌──────────────────────────┐
│   Evaluate Performance   │
│   - Accuracy: 93.51%     │
│   - Features: 24/54      │
└──────────────────────────┘
```
| Parameter | Value | Rationale |
|---|---|---|
| Pop Size | 50 | Balances diversity & computation |
| Generations | 50 | Sufficient for convergence |
| Mutation Rate | 0.02 | 2% prevents premature convergence |
| Discretization Bins | 5 | Reduces memory, preserves patterns |
| GA Sample Size | 2% | ~9,620 samples for faster iteration |
✅ GA Benefits:
- Global optimization (avoids local optima)
- No assumptions about data distribution
- Flexible fitness function
✅ Rough Set Benefits:
- Measures feature relevance mathematically
- Handles categorical & discrete data natively
- No parameter tuning needed
✅ Hybrid Approach:
- Combines evolutionary search with rough set dependency analysis
- Balances accuracy with interpretability
- Reduces computational overhead
- Target Class Distribution: Shows imbalanced dataset nature
- GA Fitness Evolution: Line plot of best fitness per generation
- Feature Selection Results: Bar chart of selected vs. excluded features
Contributions are welcome! Areas for enhancement:
- Multi-objective optimization (Pareto-optimal feature sets)
- Alternative discretization methods (ChiMerge, EWD)
- Different classifiers (Random Forest, SVM)
- Advanced GA operators (adaptive mutation, elitism)
- Performance profiling and optimization
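As one example from the enhancement list, elitism can be bolted onto the generation step with a few lines (a hypothetical sketch; `make_child` stands in for the existing selection/crossover/mutation pipeline):

```python
import numpy as np

rng = np.random.default_rng(42)

def next_generation(population, fitnesses, make_child, n_elite=2):
    """Elitism: carry the n_elite fittest individuals over unchanged,
    then fill the rest of the generation with new offspring."""
    order = np.argsort(fitnesses)[::-1]                 # best first
    elite = population[order[:n_elite]].copy()
    children = [make_child() for _ in range(len(population) - n_elite)]
    return np.vstack([elite, np.array(children)])

# Toy usage with random 8-bit chromosomes and random offspring
pop = rng.integers(0, 2, size=(10, 8))
fit = rng.random(10)
new_pop = next_generation(pop, fit, make_child=lambda: rng.integers(0, 2, size=8))
print(new_pop.shape)  # (10, 8)
```

Elitism guarantees the best fitness found so far never regresses between generations, at the cost of slightly reduced diversity.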
- Fork the repository
- Create a feature branch (`git checkout -b feature/enhancement`)
- Commit changes (`git commit -am 'Add enhancement'`)
- Push to the branch (`git push origin feature/enhancement`)
- Create a Pull Request
- Rough Set Theory: Pawlak, Z. (1991). *Rough Sets: Theoretical Aspects of Reasoning about Data*. Springer. https://www.springer.com/
- Genetic Algorithms: Holland, J. H. (1992). *Adaptation in Natural and Artificial Systems*. MIT Press. https://mitpress.mit.edu/
- Feature Selection in ML: Guyon, I., & Elisseeff, A. (2003). An introduction to variable and feature selection. *Journal of Machine Learning Research*. https://www.jmlr.org/
- Forest Cover Type Dataset: Blackard, J. A., & Dean, D. J. (1999). Comparative accuracies of artificial neural networks and discriminant analysis in predicting forest cover types. UCI ML Repository (581,012 records, 54 attributes). https://archive.ics.uci.edu/
- scikit-learn: Machine learning in Python - https://scikit-learn.org/
- pandas: Data manipulation - https://pandas.pydata.org/
- matplotlib/seaborn: Data visualization - https://matplotlib.org/
This project is licensed under the MIT License - see LICENSE file for details.
For questions, issues, or collaborations:
- GitHub Issues: Create an issue
- Email: your.email@example.com
If you use this project in your research, please cite:
```bibtex
@software{ga_roughset_featsel,
  title={Feature Selection using Genetic Algorithm and Rough Set Theory},
  author={Your Name},
  year={2024},
  url={https://github.com/yourusername/Feature-Selection-GA-RoughSet}
}
```

- UCI Machine Learning Repository for the Forest Cover Type dataset
- Scikit-learn community for excellent ML tools
- Rough Set Theory community for mathematical foundations
Made with ❤️ for Machine Learning & Soft Computing

Give this project a ⭐ if you found it helpful!