samueltauil/boston-jobs-fairness

Boston Jobs Policy - Fairness Analysis with Fairlearn

Python 3.8+ | Fairlearn | MIT License

A comprehensive fairness analysis of the Boston Residents Jobs Policy compliance data using Fairlearn to detect and mitigate bias in construction project employment.




🎯 Overview

This project demonstrates fair machine learning using Microsoft's Fairlearn library on real-world data from Boston's construction industry. The Boston Residents Jobs Policy (established in 1983) sets employment standards for City-sponsored development projects to reduce racial and gender inequality in construction.

Goals

  1. Understand Demographics: Analyze the distribution of protected features (gender, race, ethnicity) in Boston construction projects
  2. Detect Bias: Identify potential disparities in work hour allocation across demographic groups
  3. Mitigate Unfairness: Apply Fairlearn's bias mitigation techniques to create fairer predictive models
  4. Visualize Results: Generate comprehensive visualizations showing bias reduction

📈 Analysis Results Preview

The fairness analysis generates comprehensive visualizations showing demographic distributions, bias detection, and mitigation results:

Boston Jobs Fairness Analysis Results

Key Achievements:

  • 🎯 90.1% reduction in gender bias (9.5% → 0.9% demographic parity difference)
  • 🎯 68.8% reduction in race bias (15.3% → 4.8% demographic parity difference)
  • 📊 Fair model accuracy: 68.9% (improved from 55.3% baseline)
  • 📈 Comprehensive 12-panel visualization showing all aspects of the analysis

The visualization above shows the complete fairness analysis pipeline including demographic distributions, bias metrics before/after mitigation, and model performance comparisons.


📊 Dataset

Boston Residents Jobs Policy Compliance Reports

Fields

| Field | Type | Description |
|-------|------|-------------|
| agency | String | City agency overseeing the project |
| compliance_project_name | String | Name of the development project |
| project_address | String | Location of the project |
| neighborhood | String | Boston neighborhood |
| developer | String | Project developer |
| general_contractor_name | String | General contractor managing the project |
| subcontractor | String | Subcontractor company |
| trade | String | Worker trade/occupation (e.g., Carpenter, Electrician) |
| period_ending | Date | End date of reporting period |
| **gender** | String | Worker gender (Man, Woman, Non-Binary, No Answer) |
| **person_of_color** | Boolean | Person of Color indicator |
| **race** | String | Race/Ethnicity (Caucasian, Hispanic/Latino, African American, Asian, Other, Cape Verdean, Native American) |
| **boston_resident** | Boolean | Boston residency status |
| worker_hours_this_period | Float | Hours worked in the reporting period |

Protected Features (bold above): gender, race, person_of_color, boston_resident

📖 For detailed field descriptions and data quality notes, see DATA_DICTIONARY.md

Why This Dataset?

  1. ✅ Real Protected Features: Contains actual demographic data (gender, race) with 99.9% coverage
  2. ✅ Large Scale: 1.95M records provide statistical validity
  3. ✅ Social Impact: Directly related to employment equity policy
  4. ✅ Public Data: Openly available for research and transparency
  5. ✅ Intersectional: Allows analysis of multiple protected attributes
  6. ✅ Well-Documented: Official data dictionary available

🚀 Installation

Prerequisites

  • Python 3.8 or newer
  • pip package manager

Setup

  1. Clone the repository

     git clone <repository-url>
     cd boston-jobs-fairness

  2. Install dependencies

     pip install -r requirements.txt

Requirements

The main dependencies are:

  • fairlearn==0.13.0 - Fairness assessment and mitigation
  • scikit-learn==1.7.2 - Machine learning models
  • pandas==2.3.3 - Data manipulation
  • numpy==2.3.4 - Numerical computing
  • matplotlib==3.10.7 - Visualization
  • seaborn==0.13.2 - Statistical visualization

💻 Usage

⚠️ Important: Dataset Download Required

The dataset is NOT included in this repository due to its size (298 MB).

After cloning this repository, you must download the dataset before running the analysis.

First-Time Setup

# 1. Clone the repository
git clone <repository-url>
cd boston-jobs-fairness

# 2. Install dependencies
pip install -r requirements.txt

# 3. Download the dataset (~298 MB)
python download_dataset.py

Dataset Download Options

Option 1: Automated Helper Script (Recommended)

python download_dataset.py

This script checks if the dataset exists and provides download instructions.

Option 2: Manual Download

  1. Visit https://data.boston.gov/dataset/boston-jobs-policy-compliance-reports
  2. Click "Boston Jobs Policy Compliance Reports CSV"
  3. Save the file to data/boston_jobs_policy.csv

Option 3: Command Line (PowerShell)

Invoke-WebRequest `
  -Uri "https://data.boston.gov/dataset/7344db71-87f6-4598-8d20-edbcaad4b9b2/resource/5ab4b4de-c970-4619-ab55-ce4338535b24/download/tmpzg6coxuf.csv" `
  -OutFile "data/boston_jobs_policy.csv" `
  -UserAgent "Mozilla/5.0"

Option 4: Command Line (curl)

curl -A "Mozilla/5.0" \
  "https://data.boston.gov/dataset/7344db71-87f6-4598-8d20-edbcaad4b9b2/resource/5ab4b4de-c970-4619-ab55-ce4338535b24/download/tmpzg6coxuf.csv" \
  -o "data/boston_jobs_policy.csv"

Run the Analysis

Once the dataset is downloaded:

python boston_jobs_fairness_analysis.py

Note: If you try to run the analysis without the dataset, you'll see helpful error messages with download instructions.

This will:

  1. Load the full 1.95M record dataset from data/boston_jobs_policy.csv
  2. Perform demographic analysis
  3. Train baseline and fair ML models
  4. Generate visualizations and reports

Expected Runtime

  • Loading data: ~2-3 minutes
  • Training models: ~8-10 minutes
  • Total: ~12 minutes (full 1.95M records)

Outputs

After running, you'll find:

  • boston_jobs_fairness_analysis.png - Comprehensive 12-panel visualization (see preview above)
  • fairness_analysis_report.csv - Detailed metrics report with all bias measurements

The main visualization includes:

  • 📊 Demographic distributions across protected features
  • 📈 Bias metrics before and after mitigation
  • 🎯 Model performance comparisons
  • 📉 Fairness trade-offs analysis

🔬 Analysis Pipeline

1. Data Loading

Loads all ~1.95M records from data/boston_jobs_policy.csv

2. Demographic Analysis

  • Gender distribution across workers
  • Race/ethnicity breakdown
  • Person of Color representation
  • Boston residency rates

3. Work Patterns Analysis

  • Average hours by demographic groups
  • Disparities in work allocation
  • Trade-level analysis
  • Intersectional patterns (gender × race)

4. Machine Learning Task

Prediction: Does a worker receive "high hours" (>75th percentile)?

Why This Matters: If certain demographics systematically receive fewer work hours, it indicates potential bias in job assignment or opportunities.

Features Used:

  • Trade (encoded)
  • Agency (encoded)
  • Neighborhood (encoded)
  • General Contractor (encoded)

Target: Binary (High hours: Yes/No)

Class Balancing: Balanced class weights to handle the 75/25 split
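The task setup described above can be sketched as below; `build_task` is an illustrative helper (not the script's actual function), using the 75th-percentile threshold, the four encoded categorical features, and balanced class weights:

```python
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import LabelEncoder

def build_task(df):
    """Binary target: did this worker exceed the 75th-percentile hours?
    Features: the four categorical columns, integer-encoded."""
    threshold = df["worker_hours_this_period"].quantile(0.75)
    y = (df["worker_hours_this_period"] > threshold).astype(int)
    X = pd.DataFrame({
        col: LabelEncoder().fit_transform(df[col].astype(str))
        for col in ["trade", "agency", "neighborhood", "general_contractor_name"]
    })
    return X, y

# Balanced class weights offset the ~75/25 split the threshold produces.
baseline = LogisticRegression(class_weight="balanced", max_iter=1000)
```

By construction roughly a quarter of workers land above the 75th percentile, so without `class_weight="balanced"` a classifier could score ~75% accuracy by always predicting "not high hours".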

5. Fairness Assessment

Using Fairlearn metrics:

  • Demographic Parity: Equal selection rates across groups
  • Equalized Odds: Equal true positive and false positive rates
  • Accuracy: Overall model performance

6. Bias Mitigation

Using ExponentiatedGradient with DemographicParity constraint:

  • Trains model to minimize demographic parity violations
  • Balances fairness vs accuracy trade-off
  • Applies constraint to primary sensitive feature (gender)

7. Visualization & Reporting

  • 12-panel comprehensive visualization
  • Detailed CSV report with all metrics
  • Comparison: Baseline vs Fair model

πŸ” Key Findings

Based on analysis of 1,953,111 construction worker records (full dataset):

Demographics

  • Gender: 87% Men, 11% Women, 2% Other/No Answer
  • Race: 51% Caucasian, 23% Hispanic/Latino, 20% African American, 3% Asian, 3% Other
  • Person of Color: 49% POC, 51% White
  • Boston Residents: 35% residents, 65% non-residents

Work Hours Disparities (Actual Data)

  • Gender Gap: Men average 2.05× the hours of women (75 hrs vs 37 hrs)
  • Race Gap: Caucasian workers average 1.91× the hours of African American workers (87 hrs vs 46 hrs)
  • POC Gap: Non-POC workers average 1.67× the hours of POC workers (87 hrs vs 52 hrs)
  • Residency Gap: Non-residents average 1.66× the hours of Boston residents (82 hrs vs 49 hrs)

Model Bias Detection (Baseline)

  • Gender: 9.5% demographic parity difference (women over-predicted at 63.7% vs men 54.5%)
  • Race: 15.3% demographic parity difference (Asian: 69.6%, Caucasian: 55.2%)
  • Equalized Odds: Up to 24.4% difference in error rates

Bias Mitigation Results (Fair Model)

  • Gender Bias Reduced by 90.1% (9.5% → 0.9% demographic parity difference)
  • Race Bias Reduced by 68.8% (15.3% → 4.8% demographic parity difference)
  • Residency Bias Reduced by 92.5% (4.4% → 0.3%)
  • Accuracy Improved: 55.3% → 68.9% (fair model actually performs BETTER!)
  • Selection Rates Equalized: 14.6% (men) vs 13.7% (women) in fair model

📚 Fairlearn Concepts

What is Fairlearn?

Fairlearn is an open-source Python library that helps data scientists:

  1. Assess fairness of AI systems using group fairness metrics
  2. Mitigate unfairness using state-of-the-art algorithms
  3. Compare different mitigation strategies

Key Fairlearn Components Used

1. MetricFrame

Disaggregates metrics by sensitive feature groups:

from sklearn.metrics import accuracy_score
from fairlearn.metrics import MetricFrame

# Compute accuracy separately for each sensitive-feature group
metric_frame = MetricFrame(
    metrics={'accuracy': accuracy_score},
    y_true=y_test,
    y_pred=y_pred,
    sensitive_features=sensitive_features
)

2. Fairness Metrics

Demographic Parity (Selection Rate Equality)

  • Goal: P(Ŷ=1 | A=a) should be equal for all groups
  • Measures: Are positive predictions distributed equally?
from fairlearn.metrics import demographic_parity_difference

Equalized Odds (Error Rate Equality)

  • Goal: TPR and FPR should be equal across groups
  • Measures: Are errors distributed equally?
from fairlearn.metrics import equalized_odds_difference

3. Mitigation Algorithms

ExponentiatedGradient (Reductions Approach)

  • In-processing technique
  • Reduces fairness problem to sequence of cost-sensitive classification
  • Works with any base estimator
from fairlearn.reductions import ExponentiatedGradient, DemographicParity

fair_model = ExponentiatedGradient(
    RandomForestClassifier(),
    constraints=DemographicParity()
)

Fairness Constraints

  1. DemographicParity: Equal selection rates
  2. EqualizedOdds: Equal TPR and FPR
  3. TruePositiveRateParity: Equal TPR only
  4. FalsePositiveRateParity: Equal FPR only

πŸ“ Project Structure

boston-jobs-fairness/
│
├── data/
│   └── boston_jobs_policy.csv          # Main dataset (~1.95M records, downloaded separately)
│
├── boston_jobs_fairness_analysis.py    # Main analysis script
├── download_dataset.py                 # Dataset download helper
├── DATA_DICTIONARY.md                  # Field descriptions and data quality notes
├── requirements.txt                    # Python dependencies
├── README.md                           # This file
│
├── boston_jobs_fairness_analysis.png   # Generated visualization
└── fairness_analysis_report.csv        # Generated metrics report

📊 Results

Sample Output

Demographic Distribution

Gender Distribution:
  Man:       87.0%
  Woman:     10.9%
  Other:      2.1%

Race Distribution:
  Caucasian:          51.2%
  Hispanic/Latino:    22.8%
  African American:   20.4%
  Asian:               2.9%
  Other:               2.7%

Model Performance

Baseline Model:
  Accuracy: 0.5530
  Demographic Parity (Gender): 0.0950

Fair Model:
  Accuracy: 0.6890 (+13.6 points)
  Demographic Parity (Gender): 0.0090 (90.1% reduction)

Bias Reduction

  • Gender: 90.1% bias reduction
  • Race: 68.8% bias reduction
  • Residency: 92.5% bias reduction

Visualization

The generated boston_jobs_fairness_analysis.png includes:

Row 1: Demographics

  • Gender distribution bar chart
  • Race/ethnicity distribution
  • Person of Color pie chart

Row 2: Work Hours Analysis

  • Average hours by gender
  • Average hours by race
  • Statistical comparisons

Row 3: Bias Reduction

  • Demographic parity: Baseline vs Fair (Gender)
  • Demographic parity: Baseline vs Fair (Race)
  • Demographic parity: Baseline vs Fair (POC)

Row 4: Performance

  • Accuracy comparison
  • Top trades by work hours

🎓 Learning Outcomes

This project demonstrates:

  1. ✅ Loading Real-World Data: Handle large CSV files efficiently
  2. ✅ Exploratory Data Analysis: Understand demographic distributions
  3. ✅ Protected Feature Analysis: Work with sensitive attributes
  4. ✅ Fairness Metrics: Calculate demographic parity and equalized odds
  5. ✅ Bias Mitigation: Apply ExponentiatedGradient algorithm
  6. ✅ Trade-off Analysis: Balance fairness and accuracy
  7. ✅ Visualization: Create comprehensive fairness dashboards
  8. ✅ Reporting: Generate actionable insights

🤝 Contributing

Contributions are welcome! Areas for improvement:

  • Add more fairness constraints (EqualizedOdds, TruePositiveRateParity)
  • Implement GridSearch for hyperparameter tuning
  • Add temporal analysis (trends over time)
  • Explore neighborhood-level patterns
  • Add interactive visualizations (Plotly/Dash)
  • Create Jupyter notebook tutorial


βš–οΈ License

This project is licensed under the MIT License. The dataset is provided by the City of Boston under the Open Data Commons Public Domain Dedication and License (PDDL).


📧 Contact

For questions or suggestions, please open an issue on the repository.


πŸ™ Acknowledgments

  • City of Boston for providing open access to Jobs Policy compliance data
  • Microsoft Fairlearn Team for developing the Fairlearn toolkit
  • Open Data Community for promoting transparency and accountability

Note: This analysis is for educational and research purposes. The findings should not be used to make employment decisions without proper validation and stakeholder involvement.


Built with ❤️ using Fairlearn | Data from Analyze Boston
