samueltauil/boston-jobs-fairness

Boston Jobs Policy - Fairness Analysis with Fairlearn

Python 3.8+ | Fairlearn | MIT License

A comprehensive fairness analysis of the Boston Residents Jobs Policy compliance data using Fairlearn to detect and mitigate bias in construction project employment.




🎯 Overview

This project demonstrates fair machine learning using Microsoft's Fairlearn library on real-world data from Boston's construction industry. The Boston Residents Jobs Policy (established in 1983) sets employment standards for City-sponsored development projects to reduce racial and gender inequality in construction.

Goals

  1. Understand Demographics: Analyze the distribution of protected features (gender, race, ethnicity) in Boston construction projects
  2. Detect Bias: Identify potential disparities in work hour allocation across demographic groups
  3. Mitigate Unfairness: Apply Fairlearn's bias mitigation techniques to create fairer predictive models
  4. Visualize Results: Generate comprehensive visualizations showing bias reduction

📈 Analysis Results Preview

The fairness analysis generates comprehensive visualizations showing demographic distributions, bias detection, and mitigation results:

Boston Jobs Fairness Analysis Results

Key Achievements:

  • 🎯 90.1% reduction in gender bias (9.5% → 0.9% demographic parity difference)
  • 🎯 68.8% reduction in race bias (15.3% → 4.8% demographic parity difference)
  • 📊 Fair model accuracy: 68.9% (improved from 55.3% baseline)
  • 📈 Comprehensive 12-panel visualization showing all aspects of the analysis

The visualization above shows the complete fairness analysis pipeline including demographic distributions, bias metrics before/after mitigation, and model performance comparisons.


📊 Dataset

Boston Residents Jobs Policy Compliance Reports

Fields

| Field | Type | Description |
|-------|------|-------------|
| agency | String | City agency overseeing the project |
| compliance_project_name | String | Name of the development project |
| project_address | String | Location of the project |
| neighborhood | String | Boston neighborhood |
| developer | String | Project developer |
| general_contractor_name | String | General contractor managing the project |
| subcontractor | String | Subcontractor company |
| trade | String | Worker trade/occupation (e.g., Carpenter, Electrician) |
| period_ending | Date | End date of reporting period |
| **gender** | String | Worker gender (Man, Woman, Non-Binary, No Answer) |
| **person_of_color** | Boolean | Person of Color indicator |
| **race** | String | Race/Ethnicity (Caucasian, Hispanic/Latino, African American, Asian, Other, Cape Verdean, Native American) |
| **boston_resident** | Boolean | Boston residency status |
| worker_hours_this_period | Float | Hours worked in the reporting period |

Protected Features (bold above): gender, race, person_of_color, boston_resident

📖 For detailed field descriptions and data quality notes, see DATA_DICTIONARY.md

Why This Dataset?

  1. ✅ Real Protected Features: Contains actual demographic data (gender, race) with 99.9% coverage
  2. ✅ Large Scale: 1.95M records provide statistical validity
  3. ✅ Social Impact: Directly related to employment equity policy
  4. ✅ Public Data: Openly available for research and transparency
  5. ✅ Intersectional: Allows analysis of multiple protected attributes
  6. ✅ Well-Documented: Official data dictionary available

🚀 Installation

Prerequisites

  • Python 3.8 or newer
  • pip package manager

Setup

  1. Clone the repository

     git clone <repository-url>
     cd boston-jobs-fairness

  2. Install dependencies

     pip install -r requirements.txt

Requirements

The main dependencies are:

  • fairlearn==0.13.0 - Fairness assessment and mitigation
  • scikit-learn==1.7.2 - Machine learning models
  • pandas==2.3.3 - Data manipulation
  • numpy==2.3.4 - Numerical computing
  • matplotlib==3.10.7 - Visualization
  • seaborn==0.13.2 - Statistical visualization

💻 Usage

⚠️ Important: Dataset Download Required

The dataset is NOT included in this repository due to its size (298 MB).

After cloning this repository, you must download the dataset before running the analysis.

First-Time Setup

# 1. Clone the repository
git clone <repository-url>
cd boston-jobs-fairness

# 2. Install dependencies
pip install -r requirements.txt

# 3. Download the dataset (~298 MB)
python download_dataset.py

Dataset Download Options

Option 1: Automated Helper Script (Recommended)

python download_dataset.py

This script checks if the dataset exists and provides download instructions.

Option 2: Manual Download

  1. Visit https://data.boston.gov/dataset/boston-jobs-policy-compliance-reports
  2. Click "Boston Jobs Policy Compliance Reports CSV"
  3. Save the file to data/boston_jobs_policy.csv

Option 3: Command Line (PowerShell)

Invoke-WebRequest `
  -Uri "https://data.boston.gov/dataset/7344db71-87f6-4598-8d20-edbcaad4b9b2/resource/5ab4b4de-c970-4619-ab55-ce4338535b24/download/tmpzg6coxuf.csv" `
  -OutFile "data/boston_jobs_policy.csv" `
  -UserAgent "Mozilla/5.0"

Option 4: Command Line (curl)

curl -A "Mozilla/5.0" \
  "https://data.boston.gov/dataset/7344db71-87f6-4598-8d20-edbcaad4b9b2/resource/5ab4b4de-c970-4619-ab55-ce4338535b24/download/tmpzg6coxuf.csv" \
  -o "data/boston_jobs_policy.csv"

Run the Analysis

Once the dataset is downloaded:

python boston_jobs_fairness_analysis.py

Note: If you try to run the analysis without the dataset, you'll see helpful error messages with download instructions.

This will:

  1. Load the full 1.95M record dataset from data/boston_jobs_policy.csv
  2. Perform demographic analysis
  3. Train baseline and fair ML models
  4. Generate visualizations and reports

Expected Runtime

  • Loading data: ~2-3 minutes
  • Training models: ~8-10 minutes
  • Total: ~12 minutes (full 1.95M records)

Outputs

After running, you'll find:

  • boston_jobs_fairness_analysis.png - Comprehensive 12-panel visualization (see preview above)
  • fairness_analysis_report.csv - Detailed metrics report with all bias measurements

The main visualization includes:

  • 📊 Demographic distributions across protected features
  • 📈 Bias metrics before and after mitigation
  • 🎯 Model performance comparisons
  • 📉 Fairness trade-offs analysis

🔬 Analysis Pipeline

1. Data Loading

Loads all ~1.95M records from data/boston_jobs_policy.csv

2. Demographic Analysis

  • Gender distribution across workers
  • Race/ethnicity breakdown
  • Person of Color representation
  • Boston residency rates

3. Work Patterns Analysis

  • Average hours by demographic groups
  • Disparities in work allocation
  • Trade-level analysis
  • Intersectional patterns (gender × race)

4. Machine Learning Task

Prediction: Does a worker receive "high hours" (>75th percentile)?

Why This Matters: If certain demographics systematically receive fewer work hours, it indicates potential bias in job assignment or opportunities.

Features Used:

  • Trade (encoded)
  • Agency (encoded)
  • Neighborhood (encoded)
  • General Contractor (encoded)

Target: Binary (High hours: Yes/No)

Class Balancing: Balanced class weights to handle the 75/25 split
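The task setup described above can be sketched as below; `build_task` is an illustrative helper (not the script's actual function), using the 75th-percentile threshold, the four encoded categorical features, and balanced class weights:

```python
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import LabelEncoder

def build_task(df):
    """Binary target: did this worker exceed the 75th-percentile hours?
    Features: the four categorical columns, integer-encoded."""
    threshold = df["worker_hours_this_period"].quantile(0.75)
    y = (df["worker_hours_this_period"] > threshold).astype(int)
    X = pd.DataFrame({
        col: LabelEncoder().fit_transform(df[col].astype(str))
        for col in ["trade", "agency", "neighborhood", "general_contractor_name"]
    })
    return X, y

# Balanced class weights offset the ~75/25 split the threshold produces.
baseline = LogisticRegression(class_weight="balanced", max_iter=1000)
```

By construction roughly a quarter of workers land above the 75th percentile, so without `class_weight="balanced"` a classifier could score ~75% accuracy by always predicting "not high hours".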

5. Fairness Assessment

Using Fairlearn metrics:

  • Demographic Parity: Equal selection rates across groups
  • Equalized Odds: Equal true positive and false positive rates
  • Accuracy: Overall model performance

6. Bias Mitigation

Using ExponentiatedGradient with DemographicParity constraint:

  • Trains model to minimize demographic parity violations
  • Balances fairness vs accuracy trade-off
  • Applies constraint to primary sensitive feature (gender)

7. Visualization & Reporting

  • 12-panel comprehensive visualization
  • Detailed CSV report with all metrics
  • Comparison: Baseline vs Fair model

πŸ” Key Findings

Based on analysis of 1,953,111 construction worker records (full dataset):

Demographics

  • Gender: 87% Men, 11% Women, 2% Other/No Answer
  • Race: 51% Caucasian, 23% Hispanic/Latino, 20% African American, 3% Asian, 3% Other
  • Person of Color: 49% POC, 51% White
  • Boston Residents: 35% residents, 65% non-residents

Work Hours Disparities (Actual Data)

  • Gender Gap: Men average 2.05× the hours of women (75 hrs vs 37 hrs)
  • Race Gap: Caucasian workers average 1.91× the hours of African American workers (87 hrs vs 46 hrs)
  • POC Gap: Non-POC workers average 1.67× the hours of POC workers (87 hrs vs 52 hrs)
  • Residency Gap: Non-residents average 1.66× the hours of Boston residents (82 hrs vs 49 hrs)

Model Bias Detection (Baseline)

  • Gender: 9.5% demographic parity difference (women over-predicted at 63.7% vs men 54.5%)
  • Race: 15.3% demographic parity difference (Asian: 69.6%, Caucasian: 55.2%)
  • Equalized Odds: Up to 24.4% difference in error rates

Bias Mitigation Results (Fair Model)

  • Gender Bias Reduced by 90.1% (9.5% → 0.9% demographic parity difference)
  • Race Bias Reduced by 68.8% (15.3% → 4.8% demographic parity difference)
  • Residency Bias Reduced by 92.5% (4.4% → 0.3%)
  • Accuracy Improved: 55.3% → 68.9% (fair model actually performs BETTER!)
  • Selection Rates Equalized: 14.6% (men) vs 13.7% (women) in fair model

📚 Fairlearn Concepts

What is Fairlearn?

Fairlearn is an open-source Python library that helps data scientists:

  1. Assess fairness of AI systems using group fairness metrics
  2. Mitigate unfairness using state-of-the-art algorithms
  3. Compare different mitigation strategies

Key Fairlearn Components Used

1. MetricFrame

Disaggregates metrics by sensitive feature groups:

from sklearn.metrics import accuracy_score
from fairlearn.metrics import MetricFrame

# Compute accuracy separately for each sensitive-feature group
metric_frame = MetricFrame(
    metrics={'accuracy': accuracy_score},
    y_true=y_test,
    y_pred=y_pred,
    sensitive_features=sensitive_features
)

2. Fairness Metrics

Demographic Parity (Selection Rate Equality)

  • Goal: P(Ŷ=1 | A=a) should be equal for all groups
  • Measures: Are positive predictions distributed equally?
from fairlearn.metrics import demographic_parity_difference

Equalized Odds (Error Rate Equality)

  • Goal: TPR and FPR should be equal across groups
  • Measures: Are errors distributed equally?
from fairlearn.metrics import equalized_odds_difference

3. Mitigation Algorithms

ExponentiatedGradient (Reductions Approach)

  • In-processing technique
  • Reduces fairness problem to sequence of cost-sensitive classification
  • Works with any base estimator
from fairlearn.reductions import ExponentiatedGradient, DemographicParity

fair_model = ExponentiatedGradient(
    RandomForestClassifier(),
    constraints=DemographicParity()
)

Fairness Constraints

  1. DemographicParity: Equal selection rates
  2. EqualizedOdds: Equal TPR and FPR
  3. TruePositiveRateParity: Equal TPR only
  4. FalsePositiveRateParity: Equal FPR only

πŸ“ Project Structure

boston-jobs-fairness/
│
├── data/
│   └── boston_jobs_policy.csv          # Main dataset (~1.95M records, downloaded separately)
│
├── boston_jobs_fairness_analysis.py    # Main analysis script
├── download_dataset.py                 # Dataset download helper
├── DATA_DICTIONARY.md                  # Field descriptions and data quality notes
├── requirements.txt                    # Python dependencies
├── README.md                           # This file
│
├── boston_jobs_fairness_analysis.png   # Generated visualization
└── fairness_analysis_report.csv        # Generated metrics report

📊 Results

Sample Output

Demographic Distribution

Gender Distribution:
  Man:       87.0%
  Woman:     10.9%
  Other:      2.1%

Race Distribution:
  Caucasian:          51.2%
  Hispanic/Latino:    22.8%
  African American:   20.4%
  Asian:               2.9%
  Other:               2.7%

Model Performance

Baseline Model:
  Accuracy: 0.5530
  Demographic Parity (Gender): 0.0950

Fair Model:
  Accuracy: 0.6890 (+13.6 points)
  Demographic Parity (Gender): 0.0090 (90.1% reduction)

Bias Reduction

  • Gender: 90.1% bias reduction
  • Race: 68.8% bias reduction
  • Residency: 92.5% bias reduction

Visualization

The generated boston_jobs_fairness_analysis.png includes:

Row 1: Demographics

  • Gender distribution bar chart
  • Race/ethnicity distribution
  • Person of Color pie chart

Row 2: Work Hours Analysis

  • Average hours by gender
  • Average hours by race
  • Statistical comparisons

Row 3: Bias Reduction

  • Demographic parity: Baseline vs Fair (Gender)
  • Demographic parity: Baseline vs Fair (Race)
  • Demographic parity: Baseline vs Fair (POC)

Row 4: Performance

  • Accuracy comparison
  • Top trades by work hours

🎓 Learning Outcomes

This project demonstrates:

  1. ✅ Loading Real-World Data: Handle large CSV files efficiently
  2. ✅ Exploratory Data Analysis: Understand demographic distributions
  3. ✅ Protected Feature Analysis: Work with sensitive attributes
  4. ✅ Fairness Metrics: Calculate demographic parity and equalized odds
  5. ✅ Bias Mitigation: Apply ExponentiatedGradient algorithm
  6. ✅ Trade-off Analysis: Balance fairness and accuracy
  7. ✅ Visualization: Create comprehensive fairness dashboards
  8. ✅ Reporting: Generate actionable insights

🤝 Contributing

Contributions are welcome! Areas for improvement:

  • Add more fairness constraints (EqualizedOdds, TruePositiveRateParity)
  • Implement GridSearch for hyperparameter tuning
  • Add temporal analysis (trends over time)
  • Explore neighborhood-level patterns
  • Add interactive visualizations (Plotly/Dash)
  • Create Jupyter notebook tutorial


βš–οΈ License

This project is licensed under the MIT License. The dataset is provided by the City of Boston under the Open Data Commons Public Domain Dedication and License (PDDL).


📧 Contact

For questions or suggestions, please open an issue on the repository.


πŸ™ Acknowledgments

  • City of Boston for providing open access to Jobs Policy compliance data
  • Microsoft Fairlearn Team for developing the Fairlearn toolkit
  • Open Data Community for promoting transparency and accountability

Note: This analysis is for educational and research purposes. The findings should not be used to make employment decisions without proper validation and stakeholder involvement.


Built with ❤️ using Fairlearn | Data from Analyze Boston
