A comprehensive machine learning and deep learning pipeline for detecting insider threats using the CERT r4.1 dataset. This project combines unsupervised anomaly detection, supervised machine learning, and advanced deep learning architectures to identify anomalous user behavior in enterprise environments.
Visual representation of the project's machine learning pipeline for insider threat detection
- Overview
- Project Structure
- Dataset
- Notebooks
- Features
- Methodology
- Requirements
- Usage
- Results
- References
Insider threats, security incidents perpetrated by individuals with legitimate access, represent a critical challenge for enterprise security. This project develops an end-to-end detection system that:
- Processes multi-modal behavioral data from enterprise systems (logon, device, file, email, HTTP, psychometric)
- Engineers behavioral features capturing temporal patterns, activity intensity, and user psychology
- Applies unsupervised learning (Isolation Forest, Local Outlier Factor) to generate anomaly labels
- Builds supervised models including traditional ML baselines and deep learning architectures
- Provides interpretability through SHAP values and feature importance analysis
- Handles severe class imbalance using cost-sensitive training and focal loss
```
.
├── CERT_Data_Exploration_and_Preprocessing.ipynb        # Foundation notebook
├── CERT_Data_Exploration_and_Anomaly_Detection.ipynb    # Unsupervised analysis
├── CERT_Behavioral_Pattern_Analysis.ipynb               # Supervised learning
├── CERT_Anomaly_Detection_Deep_Learning_Labeled.ipynb   # Deep learning models
├── data/                                                # Directory for CERT r4.1 dataset
│   ├── logon.csv
│   ├── device.csv
│   ├── file.csv
│   ├── email.csv
│   ├── http.csv
│   └── psychometric.csv
└── README.md                                            # This file
```
This project uses the CERT Insider Threat Test Dataset v4.1 (r4.1), a synthetic but realistic corpus designed to emulate enterprise user behavior and insider threat scenarios.
The dataset includes six primary data files:
- logon.csv: User authentication events (timestamps, devices, success/failure)
- device.csv: Physical device connections (USB drives, network connections)
- file.csv: File operations (read, write, copy, delete)
- email.csv: Email communications (sender, recipients, attachments, size)
- http.csv: Web browsing activity (URLs, content types, volumes)
- psychometric.csv: Simulated psychological profiles (Big Five personality traits)
- Temporal coverage: Synthetic enterprise activity over extended periods
- User population: Hundreds of users with varying behavioral patterns
- Class imbalance: Very few anomalous users among many normal users
- Multi-modal: Diverse data types requiring different processing strategies
Purpose: Foundation of the pipeline - data loading, cleaning, and initial feature engineering
Key Functions:
- Loads and validates all six CERT r4.1 data files
- Handles missing values and data type conversions
- Implements datetime parsing and temporal feature extraction
- Creates user-level aggregations
- Generates baseline behavioral features:
- Login patterns (frequency, timing, after-hours activity)
- Device usage statistics
- File access patterns
- Psychometric profile integration
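The datetime parsing and user-level aggregation steps can be sketched with pandas. The `user` and `date` columns follow the CERT r4.1 logon schema; the 08:00-17:59 business-hours window and the inline sample data are illustrative assumptions, not values from the notebook:

```python
import io
import pandas as pd

# Tiny inline sample standing in for logon.csv (user, date as in CERT r4.1)
raw = io.StringIO(
    "user,date\n"
    "U1,2010-01-04 08:30:00\n"
    "U1,2010-01-04 22:15:00\n"
    "U2,2010-01-05 09:00:00\n"
    "U2,2010-01-09 10:00:00\n"  # a Saturday
)
logon = pd.read_csv(raw, parse_dates=["date"])

# Temporal flags: business hours assumed to be hours 8-17, Monday-Friday
logon["after_hours"] = ~logon["date"].dt.hour.between(8, 17)
logon["weekend"] = logon["date"].dt.dayofweek >= 5

# User-level aggregation: activity counts and behavioral ratios
features = logon.groupby("user").agg(
    total_logons=("date", "size"),
    after_hours_ratio=("after_hours", "mean"),
    weekend_ratio=("weekend", "mean"),
)
print(features)
```

The same pattern extends to the device, file, email, and HTTP tables, with each source contributing its own count and ratio columns to the merged per-user feature set.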
Outputs:
- `preprocessed_cert_features.csv` - Main feature dataset
- `logon_processed_sample.csv` - Processed login data
- `device_processed.csv` - Processed device usage
- `file_processed_sample.csv` - Processed file operations
When to use: Start here for any new analysis or to regenerate the base feature set
Purpose: Unsupervised anomaly detection and comprehensive exploratory data analysis
Key Functions:
- Statistical Analysis: Distribution analysis, correlation studies, outlier detection
- Visualization: Histograms, heatmaps, PCA projections, time series plots
- Isolation Forest: Tree-based anomaly detection focusing on feature space isolation
- Local Outlier Factor (LOF): Density-based anomaly detection identifying local outliers
- Cross-validation: Stability analysis across different random seeds
- Consensus Labeling: Combines multiple unsupervised detectors for robust anomaly labels
- Parameter Sweeps: Systematic exploration of hyperparameters (contamination rates, neighbors)
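The consensus idea can be sketched on synthetic data as below. The AND rule, `contamination=0.05`, and `n_neighbors=20` are illustrative choices, not the notebook's exact configuration:

```python
import numpy as np
from sklearn.ensemble import IsolationForest
from sklearn.neighbors import LocalOutlierFactor

rng = np.random.default_rng(42)
# Synthetic stand-in for user-level features: 200 normal users, 5 outliers
X = np.vstack([rng.normal(0, 1, (200, 4)), rng.normal(6, 1, (5, 4))])

iso = IsolationForest(contamination=0.05, random_state=42).fit(X)
iso_labels = (iso.predict(X) == -1).astype(int)  # 1 = flagged as anomaly

lof = LocalOutlierFactor(n_neighbors=20, contamination=0.05)
lof_labels = (lof.fit_predict(X) == -1).astype(int)

# Consensus: flag a user only when both detectors agree
consensus = iso_labels & lof_labels
print(consensus.sum(), "consensus anomalies")
```

Requiring agreement trades some recall for precision, which is the behavior described under Key Insights below.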
Outputs:
- Anomaly labels (Isolation Forest, LOF, and consensus)
- `cert_dataset_with_anomaly_labels.csv` - Labeled dataset for supervised learning
- Statistical summaries and performance metrics
- Visualization artifacts
Key Insights:
- Temporal features (after-hours activity, weekend patterns) show strong discrimination
- Consensus labeling balances precision and recall
- Different unsupervised methods capture complementary anomaly patterns
When to use: Run after preprocessing to generate anomaly labels for supervised learning
Purpose: Advanced behavioral feature engineering and traditional machine learning baselines
Key Functions:
- Advanced Feature Engineering:
- Temporal ratios (after-hours/weekend proportions)
- Activity diversity metrics (unique devices, files, email domains)
- Psychometric-behavioral interactions
- Network/access pattern features
- Clustering Analysis: K-Means and hierarchical clustering for user segmentation
- Traditional ML Models:
- Logistic Regression (interpretable baseline)
- Support Vector Machines (SVM with RBF kernel)
- Random Forest (ensemble baseline)
- Gradient Boosting (XGBoost)
- Feature Importance: Multiple methods (permutation, tree-based, SHAP)
- Model Comparison: Cross-validation and metric evaluation across algorithms
- Ensemble Methods: Voting classifiers combining multiple models
Outputs:
- Enhanced behavioral feature set
- User clustering assignments and profiles
- Traditional ML model baselines and performance metrics
- Feature importance rankings
- Model artifacts for comparison
Key Insights:
- Random Forest and XGBoost provide strong baselines
- After-hours ratios, device diversity, and file activity are top predictors
- Psychometric features contribute modestly but improve interpretability
- Ensemble methods generally outperform individual classifiers
When to use: Run after anomaly detection to establish baseline performance and identify important features
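The baseline-comparison loop can be sketched as follows, using a synthetic imbalanced dataset in place of the labeled CERT features; the model settings and the 5% positive rate are illustrative:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Synthetic imbalanced stand-in for the labeled CERT feature set (~5% positives)
X, y = make_classification(
    n_samples=500, n_features=10, weights=[0.95, 0.05], random_state=42
)

models = {
    "logreg": LogisticRegression(max_iter=1000, class_weight="balanced"),
    "rf": RandomForestClassifier(n_estimators=100, class_weight="balanced",
                                 random_state=42),
}

# ROC-AUC is more informative than accuracy under heavy class imbalance
results = {}
for name, model in models.items():
    scores = cross_val_score(model, X, y, cv=5, scoring="roc_auc")
    results[name] = scores.mean()
    print(f"{name}: ROC-AUC = {scores.mean():.3f} (+/- {scores.std():.3f})")
```

Swapping in SVM or XGBoost estimators follows the same pattern, which is what makes the cross-algorithm comparison in the notebook straightforward.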
Purpose: Advanced deep learning with multi-input neural network architecture
Key Functions:
- Multi-Input Architecture:
- Behavioral Branch: Activity intensity, diversity metrics
- Temporal Branch: Time-of-day, after-hours, weekend patterns
- Psychometric Branch: Big Five personality traits
- Network/Access Branch: Device, file, email, HTTP patterns
- Late Fusion: Concatenates specialized representations before final classification
- Cost-Sensitive Training:
- Focal loss for class imbalance
- Class weights emphasizing rare positives
- Threshold tuning for operational metrics
- Risk Scoring: Secondary regression head for continuous risk assessment
- Model Interpretability:
- SHAP (SHapley Additive exPlanations) for feature attribution
- Branch-level importance analysis
- Per-user risk decomposition
- Advanced Evaluation:
- ROC and Precision-Recall curves
- Confusion matrices at multiple thresholds
- Cost-sensitive metrics (weighted false negatives)
- Stability Analysis:
- Multiple random seeds
- Bootstrap evaluation
- Sensitivity to hyperparameters
- Integration of Large Modalities:
- Sampling strategies for email and HTTP (millions of records)
- Memory-efficient processing
- Incremental feature computation
Architecture Details:
```
                     Input Features
                           ↓
┌───────────┬───────────┬──────────────┬────────────────┐
│ Behavioral│ Temporal  │ Psychometric │ Network/Access │
│   Branch  │  Branch   │    Branch    │     Branch     │
│   Dense   │   Dense   │    Dense     │     Dense      │
│  Dropout  │  Dropout  │   Dropout    │    Dropout     │
│   Dense   │   Dense   │    Dense     │     Dense      │
└─────┬─────┴─────┬─────┴──────┬───────┴───────┬────────┘
      └───────────┴────────────┴───────────────┘
                           ↓
                      Concatenate
                           ↓
                      Dense (128)
                           ↓
                        Dropout
                           ↓
             ┌─────────────┴────────────┐
             ↓                          ↓
    Classification Head         Risk Scoring Head
     (Binary Anomaly)          (Continuous Score)
```
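A compact sketch of this architecture in the Keras functional API; the branch widths, input sizes, and dropout rates here are illustrative placeholders rather than the notebook's tuned values:

```python
from tensorflow.keras import Input, Model
from tensorflow.keras.layers import Concatenate, Dense, Dropout

def branch(inp, units=32, rate=0.3):
    """One specialized branch: Dense -> Dropout -> Dense, as in the diagram."""
    x = Dense(units, activation="relu")(inp)
    x = Dropout(rate)(x)
    return Dense(units, activation="relu")(x)

# Illustrative input widths; real feature counts come from the notebooks
behav_in = Input(shape=(8,), name="behavioral")
temp_in = Input(shape=(6,), name="temporal")
psych_in = Input(shape=(5,), name="psychometric")
net_in = Input(shape=(10,), name="network_access")

# Late fusion: concatenate the branch representations
merged = Concatenate()([branch(b) for b in (behav_in, temp_in, psych_in, net_in)])
x = Dropout(0.3)(Dense(128, activation="relu")(merged))

# Two heads: binary anomaly classification and continuous risk scoring
clf_out = Dense(1, activation="sigmoid", name="anomaly")(x)
risk_out = Dense(1, activation="sigmoid", name="risk")(x)

model = Model(inputs=[behav_in, temp_in, psych_in, net_in],
              outputs=[clf_out, risk_out])
model.summary()
```

Compiling with a per-output loss dictionary (focal loss for `anomaly`, mean squared error for `risk`) then trains both heads jointly on the shared representation.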
Training Strategy:
- Focal loss: `FL(p_t) = -α (1 - p_t)^γ log(p_t)` with γ = 2, α = 0.75
- Early stopping with patience
- Learning rate scheduling
- Batch normalization in each branch
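The focal loss can be written out directly. This NumPy version is a reference implementation of the formula above, not the notebook's training-time (TensorFlow) loss:

```python
import numpy as np

def focal_loss(y_true, p, gamma=2.0, alpha=0.75, eps=1e-7):
    """Binary focal loss FL(p_t) = -alpha * (1 - p_t)^gamma * log(p_t).

    p_t is the predicted probability of the true class; the (1 - p_t)^gamma
    factor down-weights easy examples so rare positives dominate the gradient.
    """
    p = np.clip(p, eps, 1 - eps)
    p_t = np.where(y_true == 1, p, 1 - p)
    return -alpha * (1 - p_t) ** gamma * np.log(p_t)

# A confident correct prediction contributes almost nothing...
easy = focal_loss(np.array([1]), np.array([0.95]))[0]
# ...while a confident mistake is penalized heavily
hard = focal_loss(np.array([1]), np.array([0.05]))[0]
print(easy, hard)
```

With γ = 0 and α = 1 this reduces to ordinary cross-entropy, which is why γ is described as controlling the focus on hard examples.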
Outputs:
- Trained multi-input deep learning models
- Binary anomaly predictions
- Continuous risk scores (0-1 scale)
- SHAP values and feature attributions
- Comprehensive evaluation metrics
- Model artifacts for deployment
- Threshold recommendations for operational use
Key Insights:
- Multi-input architecture captures feature interactions better than flat models
- Temporal and behavioral branches contribute most to final predictions
- Focal loss significantly improves recall on rare positives
- Risk scores enable prioritization beyond binary classification
- SHAP values confirm domain-relevant features (after-hours activity, device diversity)
- Model is robust across multiple random seeds and sampling strategies
When to use: Run last in the pipeline, after establishing baselines and consensus labels
- After-hours activity ratio: Proportion of actions outside business hours
- Weekend activity ratio: Weekend vs. weekday activity balance
- Time-of-day distributions: Hourly activity patterns
- Temporal entropy: Regularity/irregularity of timing patterns
- Total logons: Authentication frequency
- Device connections: Unique and total device interactions
- File operations: Read/write/copy/delete counts
- Email volume: Sent/received message counts
- HTTP requests: Web browsing activity volume
- Unique devices: Number of distinct devices accessed
- Unique files: Breadth of file system interaction
- Email domain diversity: Variety of communication partners
- URL diversity: Breadth of web browsing
- Big Five personality traits: Openness, Conscientiousness, Extraversion, Agreeableness, Neuroticism
- Psychometric-behavioral interactions: Combined psychological and activity features
- Device-file coupling: Patterns of device and file access co-occurrence
- Login success/failure rates: Authentication behavior patterns
- Email attachment patterns: File sharing behavior
- HTTP content type distributions: Web content consumption patterns
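As an example of one of the less self-explanatory features, temporal entropy can be computed as the Shannon entropy of a user's hour-of-day activity histogram. This is a standard formulation; the bin count and log base are conventions, not values taken from the notebooks:

```python
import numpy as np

def temporal_entropy(hours, bins=24):
    """Shannon entropy of a user's hour-of-day activity histogram.

    Low entropy = rigid schedule; high entropy = irregular timing,
    a useful regularity signal for anomaly detection.
    """
    counts = np.bincount(hours, minlength=bins)
    p = counts / counts.sum()
    p = p[p > 0]  # drop empty bins; 0 * log(0) is defined as 0
    return -(p * np.log2(p)).sum()

regular = temporal_entropy(np.array([9] * 50))          # always logs on at 09:00
irregular = temporal_entropy(np.arange(24).repeat(2))   # spread over all hours
print(regular, irregular)
```

A user who always logs on at the same hour scores 0 bits, while perfectly uniform activity scores log2(24) ≈ 4.58 bits.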
- Load multi-modal CERT data sources
- Handle missing values and outliers
- Parse timestamps and extract temporal features
- Aggregate activity at user level
- Scale and normalize features
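The scaling step can be sketched with scikit-learn's `StandardScaler` (the toy feature matrix is illustrative):

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Stand-in user-level feature matrix (rows = users, columns = features)
X = np.array([[10, 0.1], [200, 0.9], [50, 0.3]], dtype=float)

# Zero-mean, unit-variance scaling so distance-based detectors such as LOF
# are not dominated by large-magnitude count features
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)
print(X_scaled.mean(axis=0), X_scaled.std(axis=0))
```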
- Isolation Forest: Identify users isolated in feature space
- Local Outlier Factor: Detect density-based outliers
- Consensus Labeling: Combine detectors for robust labels
- Parameter Sweeps: Optimize contamination rates and hyperparameters
- Logistic Regression for interpretability
- Support Vector Machines for non-linear boundaries
- Random Forest for ensemble strength
- Gradient Boosting (XGBoost) for performance
- Feature importance and model comparison
- Multi-input architecture with specialized branches
- Focal loss for class imbalance
- Risk scoring alongside classification
- SHAP-based interpretability
- Threshold tuning for operational metrics
- ROC-AUC and PR-AUC curves
- Confusion matrices at multiple thresholds
- Cost-sensitive metrics (weighted FN penalty)
- Stability analysis (seeds, bootstrapping)
- Feature attribution and interpretability
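Threshold tuning against the precision-recall curve can be sketched as below. The F1-maximizing rule is one common operating-point choice; the notebooks' cost-sensitive metrics may weight false negatives more heavily:

```python
import numpy as np
from sklearn.metrics import precision_recall_curve

rng = np.random.default_rng(0)
# Stand-in model scores: anomalies (label 1) tend to score higher
y_true = np.array([0] * 90 + [1] * 10)
y_score = np.concatenate([rng.uniform(0.0, 0.6, 90), rng.uniform(0.4, 1.0, 10)])

precision, recall, thresholds = precision_recall_curve(y_true, y_score)

# Pick the operating threshold that maximizes F1 along the PR curve
f1 = 2 * precision * recall / np.maximum(precision + recall, 1e-12)
best = np.argmax(f1[:-1])  # the final PR point has no associated threshold
print(f"threshold={thresholds[best]:.3f}, F1={f1[best]:.3f}")
```

Replacing F1 with a weighted score (for example, penalizing each false negative several times more than a false positive) shifts the chosen threshold downward, trading precision for recall.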
```
# Data Processing
pandas >= 1.3.0
numpy >= 1.21.0

# Machine Learning
scikit-learn >= 1.0.0
xgboost >= 1.5.0

# Deep Learning
tensorflow >= 2.8.0  # or pytorch >= 1.10.0
keras >= 2.8.0

# Visualization
matplotlib >= 3.4.0
seaborn >= 0.11.0

# Interpretability
shap >= 0.40.0

# Utilities
jupyter >= 1.0.0
tqdm >= 4.62.0
```

Install everything with pip:

```shell
pip install pandas numpy scikit-learn xgboost tensorflow keras matplotlib seaborn shap jupyter tqdm
```

- Prepare the CERT r4.1 dataset:
  - Download from the CERT Insider Threat Test Dataset
  - Extract to a `data/` directory
- Run the pipeline in order:
```shell
# Step 1: Preprocessing
jupyter notebook CERT_Data_Exploration_and_Preprocessing.ipynb

# Step 2: Unsupervised anomaly detection
jupyter notebook CERT_Data_Exploration_and_Anomaly_Detection.ipynb

# Step 3: Supervised baselines
jupyter notebook CERT_Behavioral_Pattern_Analysis.ipynb

# Step 4: Deep learning
jupyter notebook CERT_Anomaly_Detection_Deep_Learning_Labeled.ipynb
```

- Review outputs:
- Check generated CSV files for intermediate results
- Examine plots and metrics in notebook outputs
- Review model performance summaries
Adjusting Contamination Rate:

```python
# In anomaly detection notebook
iso_forest = IsolationForest(contamination=0.05, random_state=42)
lof = LocalOutlierFactor(contamination=0.05, novelty=True)
```

Modifying Deep Learning Architecture:

```python
# In deep learning notebook
# Adjust branch depths, widths, and dropout rates
behavioral_branch = Dense(64, activation='relu')(behavioral_input)
behavioral_branch = Dropout(0.3)(behavioral_branch)
```

Cost-Sensitive Training:

```python
# Adjust focal loss parameters
focal_loss = FocalLoss(gamma=2.0, alpha=0.75)

# Or use class weights
class_weights = {0: 1.0, 1: 10.0}  # Emphasize anomalies
```

- Temporal Features Matter: After-hours and weekend activity ratios are consistently top predictors
- Consensus Labels Help: Combining unsupervised detectors improves downstream supervised learning
- Deep Learning Adds Value: Multi-input architecture outperforms flat baselines, especially for recall
- Class Imbalance is Critical: Focal loss and cost-sensitive training significantly improve performance
- Interpretability is Achievable: SHAP values provide actionable explanations for security analysts
- After-hours logon ratio
- Total device connections
- Weekend activity ratio
- File operation diversity
- Email volume (sent)
- Unique device count
- HTTP request volume
- Device-file coupling metric
- Temporal entropy
- Conscientiousness score (psychometric)
- CERT Insider Threat Test Dataset v4.1 (r4.1)
- Carnegie Mellon University Software Engineering Institute
- https://resources.sei.cmu.edu/library/asset-view.cfm?assetid=508099
This project uses synthetic data designed to simulate realistic enterprise scenarios without exposing real user information. When adapting this approach to real-world data:
- Obtain proper authorization and ethical approval
- Anonymize all personally identifiable information
- Implement access controls for sensitive behavioral data
- Establish human-in-the-loop review processes
- Monitor for bias and fairness across user populations
- Respect privacy and limit data retention
- Follow organizational policies and legal requirements
This is an academic research project. For questions or collaboration:
- Review the thesis documents for detailed methodology
- Examine the notebook flow analysis for pipeline understanding
- Adapt the approach to your specific use case
This project is for educational and research purposes. Please cite appropriately if using in academic work.
- Carnegie Mellon University for the CERT Insider Threat Test Dataset
- Open-source community for scikit-learn, TensorFlow, SHAP, and other libraries
Project Status: Complete academic research pipeline
Last Updated: October 2025
