A comprehensive machine learning and deep learning pipeline for detecting insider threats using the CERT r4.1 dataset. This project combines unsupervised anomaly detection, supervised machine learning, and advanced deep learning architectures to identify anomalous user behavior in enterprise environments.
Visual representation of the project's machine learning pipeline for insider threat detection
- Overview
- Project Structure
- Dataset
- Notebooks
- Features
- Methodology
- Requirements
- Usage
- Results
- References
Insider threats, security incidents perpetrated by individuals with legitimate access, represent a critical challenge for enterprise security. This project develops an end-to-end detection system that:
- Processes multi-modal behavioral data from enterprise systems (logon, device, file, email, HTTP, psychometric)
- Engineers behavioral features capturing temporal patterns, activity intensity, and user psychology
- Applies unsupervised learning (Isolation Forest, Local Outlier Factor) to generate anomaly labels
- Builds supervised models including traditional ML baselines and deep learning architectures
- Provides interpretability through SHAP values and feature importance analysis
- Handles severe class imbalance using cost-sensitive training and focal loss
```
.
├── CERT_Data_Exploration_and_Preprocessing.ipynb        # Foundation notebook
├── CERT_Data_Exploration_and_Anomaly_Detection.ipynb    # Unsupervised analysis
├── CERT_Behavioral_Pattern_Analysis.ipynb               # Supervised learning
├── CERT_Anomaly_Detection_Deep_Learning_Labeled.ipynb   # Deep learning models
├── data/                                                # Directory for CERT r4.1 dataset
│   ├── logon.csv
│   ├── device.csv
│   ├── file.csv
│   ├── email.csv
│   ├── http.csv
│   └── psychometric.csv
└── README.md                                            # This file
```
This project uses the CERT Insider Threat Test Dataset v4.1 (r4.1), a synthetic but realistic corpus designed to emulate enterprise user behavior and insider threat scenarios.
The dataset includes six primary data files:
- logon.csv: User authentication events (timestamps, devices, success/failure)
- device.csv: Physical device connections (USB drives, network connections)
- file.csv: File operations (read, write, copy, delete)
- email.csv: Email communications (sender, recipients, attachments, size)
- http.csv: Web browsing activity (URLs, content types, volumes)
- psychometric.csv: Simulated psychological profiles (Big Five personality traits)
- Temporal coverage: Synthetic enterprise activity over extended periods
- User population: Hundreds of users with varying behavioral patterns
- Class imbalance: Very few anomalous users among many normal users
- Multi-modal: Diverse data types requiring different processing strategies
Purpose: Foundation of the pipeline - data loading, cleaning, and initial feature engineering
Key Functions:
- Loads and validates all six CERT r4.1 data files
- Handles missing values and data type conversions
- Implements datetime parsing and temporal feature extraction
- Creates user-level aggregations
- Generates baseline behavioral features:
- Login patterns (frequency, timing, after-hours activity)
- Device usage statistics
- File access patterns
- Psychometric profile integration
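The datetime parsing and user-level aggregation steps can be sketched with pandas. The `user` and `date` columns follow the CERT r4.1 logon schema; the 08:00-17:59 business-hours window and the inline sample data are illustrative assumptions, not values from the notebook:

```python
import io
import pandas as pd

# Tiny inline sample standing in for logon.csv (user, date as in CERT r4.1)
raw = io.StringIO(
    "user,date\n"
    "U1,2010-01-04 08:30:00\n"
    "U1,2010-01-04 22:15:00\n"
    "U2,2010-01-05 09:00:00\n"
    "U2,2010-01-09 10:00:00\n"  # a Saturday
)
logon = pd.read_csv(raw, parse_dates=["date"])

# Temporal flags: business hours assumed to be hours 8-17, Monday-Friday
logon["after_hours"] = ~logon["date"].dt.hour.between(8, 17)
logon["weekend"] = logon["date"].dt.dayofweek >= 5

# User-level aggregation: activity counts and behavioral ratios
features = logon.groupby("user").agg(
    total_logons=("date", "size"),
    after_hours_ratio=("after_hours", "mean"),
    weekend_ratio=("weekend", "mean"),
)
print(features)
```

The same pattern extends to the device, file, email, and HTTP tables, with each source contributing its own count and ratio columns to the merged per-user feature set.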
Outputs:
- `preprocessed_cert_features.csv` - Main feature dataset
- `logon_processed_sample.csv` - Processed login data
- `device_processed.csv` - Processed device usage
- `file_processed_sample.csv` - Processed file operations
When to use: Start here for any new analysis or to regenerate the base feature set
Purpose: Unsupervised anomaly detection and comprehensive exploratory data analysis
Key Functions:
- Statistical Analysis: Distribution analysis, correlation studies, outlier detection
- Visualization: Histograms, heatmaps, PCA projections, time series plots
- Isolation Forest: Tree-based anomaly detection focusing on feature space isolation
- Local Outlier Factor (LOF): Density-based anomaly detection identifying local outliers
- Cross-validation: Stability analysis across different random seeds
- Consensus Labeling: Combines multiple unsupervised detectors for robust anomaly labels
- Parameter Sweeps: Systematic exploration of hyperparameters (contamination rates, neighbors)
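The consensus idea can be sketched on synthetic data as below. The AND rule, `contamination=0.05`, and `n_neighbors=20` are illustrative choices, not the notebook's exact configuration:

```python
import numpy as np
from sklearn.ensemble import IsolationForest
from sklearn.neighbors import LocalOutlierFactor

rng = np.random.default_rng(42)
# Synthetic stand-in for user-level features: 200 normal users, 5 outliers
X = np.vstack([rng.normal(0, 1, (200, 4)), rng.normal(6, 1, (5, 4))])

iso = IsolationForest(contamination=0.05, random_state=42).fit(X)
iso_labels = (iso.predict(X) == -1).astype(int)  # 1 = flagged as anomaly

lof = LocalOutlierFactor(n_neighbors=20, contamination=0.05)
lof_labels = (lof.fit_predict(X) == -1).astype(int)

# Consensus: flag a user only when both detectors agree
consensus = iso_labels & lof_labels
print(consensus.sum(), "consensus anomalies")
```

Requiring agreement trades some recall for precision, which is the behavior described under Key Insights below.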
Outputs:
- Anomaly labels (Isolation Forest, LOF, and consensus)
- `cert_dataset_with_anomaly_labels.csv` - Labeled dataset for supervised learning
- Statistical summaries and performance metrics
- Visualization artifacts
Key Insights:
- Temporal features (after-hours activity, weekend patterns) show strong discrimination
- Consensus labeling balances precision and recall
- Different unsupervised methods capture complementary anomaly patterns
When to use: Run after preprocessing to generate anomaly labels for supervised learning
Purpose: Advanced behavioral feature engineering and traditional machine learning baselines
Key Functions:
- Advanced Feature Engineering:
- Temporal ratios (after-hours/weekend proportions)
- Activity diversity metrics (unique devices, files, email domains)
- Psychometric-behavioral interactions
- Network/access pattern features
- Clustering Analysis: K-Means and hierarchical clustering for user segmentation
- Traditional ML Models:
- Logistic Regression (interpretable baseline)
- Support Vector Machines (SVM with RBF kernel)
- Random Forest (ensemble baseline)
- Gradient Boosting (XGBoost)
- Feature Importance: Multiple methods (permutation, tree-based, SHAP)
- Model Comparison: Cross-validation and metric evaluation across algorithms
- Ensemble Methods: Voting classifiers combining multiple models
Outputs:
- Enhanced behavioral feature set
- User clustering assignments and profiles
- Traditional ML model baselines and performance metrics
- Feature importance rankings
- Model artifacts for comparison
Key Insights:
- Random Forest and XGBoost provide strong baselines
- After-hours ratios, device diversity, and file activity are top predictors
- Psychometric features contribute modestly but improve interpretability
- Ensemble methods generally outperform individual classifiers
When to use: Run after anomaly detection to establish baseline performance and identify important features
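The baseline-comparison loop can be sketched as follows, using a synthetic imbalanced dataset in place of the labeled CERT features; the model settings and the 5% positive rate are illustrative:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Synthetic imbalanced stand-in for the labeled CERT feature set (~5% positives)
X, y = make_classification(
    n_samples=500, n_features=10, weights=[0.95, 0.05], random_state=42
)

models = {
    "logreg": LogisticRegression(max_iter=1000, class_weight="balanced"),
    "rf": RandomForestClassifier(n_estimators=100, class_weight="balanced",
                                 random_state=42),
}

# ROC-AUC is more informative than accuracy under heavy class imbalance
results = {}
for name, model in models.items():
    scores = cross_val_score(model, X, y, cv=5, scoring="roc_auc")
    results[name] = scores.mean()
    print(f"{name}: ROC-AUC = {scores.mean():.3f} (+/- {scores.std():.3f})")
```

Swapping in SVM or XGBoost estimators follows the same pattern, which is what makes the cross-algorithm comparison in the notebook straightforward.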
Purpose: Advanced deep learning with multi-input neural network architecture
Key Functions:
- Multi-Input Architecture:
- Behavioral Branch: Activity intensity, diversity metrics
- Temporal Branch: Time-of-day, after-hours, weekend patterns
- Psychometric Branch: Big Five personality traits
- Network/Access Branch: Device, file, email, HTTP patterns
- Late Fusion: Concatenates specialized representations before final classification
- Cost-Sensitive Training:
- Focal loss for class imbalance
- Class weights emphasizing rare positives
- Threshold tuning for operational metrics
- Risk Scoring: Secondary regression head for continuous risk assessment
- Model Interpretability:
- SHAP (SHapley Additive exPlanations) for feature attribution
- Branch-level importance analysis
- Per-user risk decomposition
- Advanced Evaluation:
- ROC and Precision-Recall curves
- Confusion matrices at multiple thresholds
- Cost-sensitive metrics (weighted false negatives)
- Stability Analysis:
- Multiple random seeds
- Bootstrap evaluation
- Sensitivity to hyperparameters
- Integration of Large Modalities:
- Sampling strategies for email and HTTP (millions of records)
- Memory-efficient processing
- Incremental feature computation
Architecture Details:
```
                     Input Features
                           ↓
┌───────────┬───────────┬──────────────┬────────────────┐
│ Behavioral│ Temporal  │ Psychometric │ Network/Access │
│   Branch  │  Branch   │    Branch    │     Branch     │
│   Dense   │   Dense   │    Dense     │     Dense      │
│  Dropout  │  Dropout  │   Dropout    │    Dropout     │
│   Dense   │   Dense   │    Dense     │     Dense      │
└─────┬─────┴─────┬─────┴──────┬───────┴───────┬────────┘
      └───────────┴────────────┴───────────────┘
                           ↓
                      Concatenate
                           ↓
                      Dense (128)
                           ↓
                        Dropout
                           ↓
             ┌─────────────┴────────────┐
             ↓                          ↓
    Classification Head         Risk Scoring Head
     (Binary Anomaly)          (Continuous Score)
```
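A compact sketch of this architecture in the Keras functional API; the branch widths, input sizes, and dropout rates here are illustrative placeholders rather than the notebook's tuned values:

```python
from tensorflow.keras import Input, Model
from tensorflow.keras.layers import Concatenate, Dense, Dropout

def branch(inp, units=32, rate=0.3):
    """One specialized branch: Dense -> Dropout -> Dense, as in the diagram."""
    x = Dense(units, activation="relu")(inp)
    x = Dropout(rate)(x)
    return Dense(units, activation="relu")(x)

# Illustrative input widths; real feature counts come from the notebooks
behav_in = Input(shape=(8,), name="behavioral")
temp_in = Input(shape=(6,), name="temporal")
psych_in = Input(shape=(5,), name="psychometric")
net_in = Input(shape=(10,), name="network_access")

# Late fusion: concatenate the branch representations
merged = Concatenate()([branch(b) for b in (behav_in, temp_in, psych_in, net_in)])
x = Dropout(0.3)(Dense(128, activation="relu")(merged))

# Two heads: binary anomaly classification and continuous risk scoring
clf_out = Dense(1, activation="sigmoid", name="anomaly")(x)
risk_out = Dense(1, activation="sigmoid", name="risk")(x)

model = Model(inputs=[behav_in, temp_in, psych_in, net_in],
              outputs=[clf_out, risk_out])
model.summary()
```

Compiling with a per-output loss dictionary (focal loss for `anomaly`, mean squared error for `risk`) then trains both heads jointly on the shared representation.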
Training Strategy:
- Focal loss: `FL(p_t) = -α (1 - p_t)^γ log(p_t)` with γ = 2, α = 0.75
- Early stopping with patience
- Learning rate scheduling
- Batch normalization in each branch
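The focal loss can be written out directly. This NumPy version is a reference implementation of the formula above, not the notebook's training-time (TensorFlow) loss:

```python
import numpy as np

def focal_loss(y_true, p, gamma=2.0, alpha=0.75, eps=1e-7):
    """Binary focal loss FL(p_t) = -alpha * (1 - p_t)^gamma * log(p_t).

    p_t is the predicted probability of the true class; the (1 - p_t)^gamma
    factor down-weights easy examples so rare positives dominate the gradient.
    """
    p = np.clip(p, eps, 1 - eps)
    p_t = np.where(y_true == 1, p, 1 - p)
    return -alpha * (1 - p_t) ** gamma * np.log(p_t)

# A confident correct prediction contributes almost nothing...
easy = focal_loss(np.array([1]), np.array([0.95]))[0]
# ...while a confident mistake is penalized heavily
hard = focal_loss(np.array([1]), np.array([0.05]))[0]
print(easy, hard)
```

With γ = 0 and α = 1 this reduces to ordinary cross-entropy, which is why γ is described as controlling the focus on hard examples.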
Outputs:
- Trained multi-input deep learning models
- Binary anomaly predictions
- Continuous risk scores (0-1 scale)
- SHAP values and feature attributions
- Comprehensive evaluation metrics
- Model artifacts for deployment
- Threshold recommendations for operational use
Key Insights:
- Multi-input architecture captures feature interactions better than flat models
- Temporal and behavioral branches contribute most to final predictions
- Focal loss significantly improves recall on rare positives
- Risk scores enable prioritization beyond binary classification
- SHAP values confirm domain-relevant features (after-hours activity, device diversity)
- Model is robust across multiple random seeds and sampling strategies
When to use: Run last in the pipeline, after establishing baselines and consensus labels
- After-hours activity ratio: Proportion of actions outside business hours
- Weekend activity ratio: Weekend vs. weekday activity balance
- Time-of-day distributions: Hourly activity patterns
- Temporal entropy: Regularity/irregularity of timing patterns
- Total logons: Authentication frequency
- Device connections: Unique and total device interactions
- File operations: Read/write/copy/delete counts
- Email volume: Sent/received message counts
- HTTP requests: Web browsing activity volume
- Unique devices: Number of distinct devices accessed
- Unique files: Breadth of file system interaction
- Email domain diversity: Variety of communication partners
- URL diversity: Breadth of web browsing
- Big Five personality traits: Openness, Conscientiousness, Extraversion, Agreeableness, Neuroticism
- Psychometric-behavioral interactions: Combined psychological and activity features
- Device-file coupling: Patterns of device and file access co-occurrence
- Login success/failure rates: Authentication behavior patterns
- Email attachment patterns: File sharing behavior
- HTTP content type distributions: Web content consumption patterns
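As an example of one of the less self-explanatory features, temporal entropy can be computed as the Shannon entropy of a user's hour-of-day activity histogram. This is a standard formulation; the bin count and log base are conventions, not values taken from the notebooks:

```python
import numpy as np

def temporal_entropy(hours, bins=24):
    """Shannon entropy of a user's hour-of-day activity histogram.

    Low entropy = rigid schedule; high entropy = irregular timing,
    a useful regularity signal for anomaly detection.
    """
    counts = np.bincount(hours, minlength=bins)
    p = counts / counts.sum()
    p = p[p > 0]  # drop empty bins; 0 * log(0) is defined as 0
    return -(p * np.log2(p)).sum()

regular = temporal_entropy(np.array([9] * 50))          # always logs on at 09:00
irregular = temporal_entropy(np.arange(24).repeat(2))   # spread over all hours
print(regular, irregular)
```

A user who always logs on at the same hour scores 0 bits, while perfectly uniform activity scores log2(24) ≈ 4.58 bits.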
- Load multi-modal CERT data sources
- Handle missing values and outliers
- Parse timestamps and extract temporal features
- Aggregate activity at user level
- Scale and normalize features
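The scaling step can be sketched with scikit-learn's `StandardScaler` (the toy feature matrix is illustrative):

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Stand-in user-level feature matrix (rows = users, columns = features)
X = np.array([[10, 0.1], [200, 0.9], [50, 0.3]], dtype=float)

# Zero-mean, unit-variance scaling so distance-based detectors such as LOF
# are not dominated by large-magnitude count features
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)
print(X_scaled.mean(axis=0), X_scaled.std(axis=0))
```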
- Isolation Forest: Identify users isolated in feature space
- Local Outlier Factor: Detect density-based outliers
- Consensus Labeling: Combine detectors for robust labels
- Parameter Sweeps: Optimize contamination rates and hyperparameters
- Logistic Regression for interpretability
- Support Vector Machines for non-linear boundaries
- Random Forest for ensemble strength
- Gradient Boosting (XGBoost) for performance
- Feature importance and model comparison
- Multi-input architecture with specialized branches
- Focal loss for class imbalance
- Risk scoring alongside classification
- SHAP-based interpretability
- Threshold tuning for operational metrics
- ROC-AUC and PR-AUC curves
- Confusion matrices at multiple thresholds
- Cost-sensitive metrics (weighted FN penalty)
- Stability analysis (seeds, bootstrapping)
- Feature attribution and interpretability
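Threshold tuning against the precision-recall curve can be sketched as below. The F1-maximizing rule is one common operating-point choice; the notebooks' cost-sensitive metrics may weight false negatives more heavily:

```python
import numpy as np
from sklearn.metrics import precision_recall_curve

rng = np.random.default_rng(0)
# Stand-in model scores: anomalies (label 1) tend to score higher
y_true = np.array([0] * 90 + [1] * 10)
y_score = np.concatenate([rng.uniform(0.0, 0.6, 90), rng.uniform(0.4, 1.0, 10)])

precision, recall, thresholds = precision_recall_curve(y_true, y_score)

# Pick the operating threshold that maximizes F1 along the PR curve
f1 = 2 * precision * recall / np.maximum(precision + recall, 1e-12)
best = np.argmax(f1[:-1])  # the final PR point has no associated threshold
print(f"threshold={thresholds[best]:.3f}, F1={f1[best]:.3f}")
```

Replacing F1 with a weighted score (for example, penalizing each false negative several times more than a false positive) shifts the chosen threshold downward, trading precision for recall.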
```
# Data Processing
pandas >= 1.3.0
numpy >= 1.21.0

# Machine Learning
scikit-learn >= 1.0.0
xgboost >= 1.5.0

# Deep Learning
tensorflow >= 2.8.0  # or pytorch >= 1.10.0
keras >= 2.8.0

# Visualization
matplotlib >= 3.4.0
seaborn >= 0.11.0

# Interpretability
shap >= 0.40.0

# Utilities
jupyter >= 1.0.0
tqdm >= 4.62.0
```

Install everything with pip:

```shell
pip install pandas numpy scikit-learn xgboost tensorflow keras matplotlib seaborn shap jupyter tqdm
```

- Prepare the CERT r4.1 dataset:
  - Download from the CERT Insider Threat Test Dataset
  - Extract to a `data/` directory
- Run the pipeline in order:
```shell
# Step 1: Preprocessing
jupyter notebook CERT_Data_Exploration_and_Preprocessing.ipynb

# Step 2: Unsupervised anomaly detection
jupyter notebook CERT_Data_Exploration_and_Anomaly_Detection.ipynb

# Step 3: Supervised baselines
jupyter notebook CERT_Behavioral_Pattern_Analysis.ipynb

# Step 4: Deep learning
jupyter notebook CERT_Anomaly_Detection_Deep_Learning_Labeled.ipynb
```

- Review outputs:
- Check generated CSV files for intermediate results
- Examine plots and metrics in notebook outputs
- Review model performance summaries
Adjusting Contamination Rate:

```python
# In anomaly detection notebook
iso_forest = IsolationForest(contamination=0.05, random_state=42)
lof = LocalOutlierFactor(contamination=0.05, novelty=True)
```

Modifying Deep Learning Architecture:

```python
# In deep learning notebook
# Adjust branch depths, widths, and dropout rates
behavioral_branch = Dense(64, activation='relu')(behavioral_input)
behavioral_branch = Dropout(0.3)(behavioral_branch)
```

Cost-Sensitive Training:

```python
# Adjust focal loss parameters
focal_loss = FocalLoss(gamma=2.0, alpha=0.75)

# Or use class weights
class_weights = {0: 1.0, 1: 10.0}  # Emphasize anomalies
```

- Temporal Features Matter: After-hours and weekend activity ratios are consistently top predictors
- Consensus Labels Help: Combining unsupervised detectors improves downstream supervised learning
- Deep Learning Adds Value: Multi-input architecture outperforms flat baselines, especially for recall
- Class Imbalance is Critical: Focal loss and cost-sensitive training significantly improve performance
- Interpretability is Achievable: SHAP values provide actionable explanations for security analysts
- After-hours logon ratio
- Total device connections
- Weekend activity ratio
- File operation diversity
- Email volume (sent)
- Unique device count
- HTTP request volume
- Device-file coupling metric
- Temporal entropy
- Conscientiousness score (psychometric)
- CERT Insider Threat Test Dataset v4.1 (r4.1)
- Carnegie Mellon University Software Engineering Institute
- https://resources.sei.cmu.edu/library/asset-view.cfm?assetid=508099
This project uses synthetic data designed to simulate realistic enterprise scenarios without exposing real user information. When adapting this approach to real-world data:
- Obtain proper authorization and ethical approval
- Anonymize all personally identifiable information
- Implement access controls for sensitive behavioral data
- Establish human-in-the-loop review processes
- Monitor for bias and fairness across user populations
- Respect privacy and limit data retention
- Follow organizational policies and legal requirements
This is an academic research project. For questions or collaboration:
- Review the thesis documents for detailed methodology
- Examine the notebook flow analysis for pipeline understanding
- Adapt the approach to your specific use case
This project is for educational and research purposes. Please cite appropriately if using in academic work.
- Carnegie Mellon University for the CERT Insider Threat Test Dataset
- Open-source community for scikit-learn, TensorFlow, SHAP, and other libraries
Project Status: Complete academic research pipeline
Last Updated: October 2025
