Automated differential expression analysis pipeline for bulk RNA-seq data using DESeq2 and pathway enrichment
A reproducible R-based pipeline for bulk RNA-seq differential expression analysis. This project demonstrates a complete workflow from count data to biological insights, featuring DESeq2 normalization, statistical testing, multi-database pathway enrichment, and comprehensive visualization of gene expression patterns.
Key capabilities:
- Robust differential expression analysis with DESeq2
- Pathway enrichment across 9 biological databases
- Comprehensive visualization suite (MA plots, heatmaps, correlation analysis)
- Quality control and expression distribution analysis
- Fully automated gene annotation and enrichment reporting
Problem: Analyzing bulk RNA-sequencing data to identify differentially expressed genes between experimental conditions requires normalization, statistical testing, and biological interpretation through pathway analysis.
Approach: This pipeline uses DESeq2 for robust differential expression analysis with negative binomial modeling, followed by pathway enrichment using multiple databases (KEGG, Reactome, GO, OMIM, DisGeNET, HPO) via enrichR. The workflow includes quality control, multiple significance thresholds, and correlation analysis.
- R (4.x) - Statistical computing environment
- Bioconductor - Suite of bioinformatics packages
- DESeq2 - Differential expression analysis
- airway - Example RNA-seq dataset
- enrichR - Pathway and ontology enrichment
- ggplot2 - Advanced visualization
- pheatmap - Heatmap generation
- org.Hs.eg.db - Human gene annotation
    bulk-rnaseq-differential-expression-r/
    ├── analysis/               # Analysis scripts and project files
    │   ├── analysis.Rproj      # RStudio project file
    │   ├── main.R              # Main analysis pipeline
    │   ├── .RData              # R workspace (gitignored)
    │   └── .Rhistory           # R history (gitignored)
    ├── bulkRNA_project.docx    # Additional documentation
    ├── .gitignore              # Git ignore rules for R
    ├── project_identity.md     # Project metadata
    └── README.md               # This file
- R version 4.0 or higher (R 4.3 if you use the recommended Bioconductor 3.18)
- (Optional) RStudio for interactive analysis
Run the following in R to install all dependencies:
    # Install BiocManager if not present
    if (!requireNamespace("BiocManager", quietly = TRUE))
        install.packages("BiocManager")

    # Install Bioconductor packages
    # (GEOquery is distributed through Bioconductor, not CRAN)
    BiocManager::install(c(
        "DESeq2",
        "airway",
        "org.Hs.eg.db",
        "GEOquery"
    ))

    # Install CRAN packages
    install.packages(c(
        "Seurat",
        "dplyr",
        "tidyverse",
        "ggplot2",
        "patchwork",
        "enrichR",
        "pheatmap"
    ))

Note: The installation script is included in main.R but commented out. Uncomment lines 40-41 to auto-install on first run.
- Open analysis/analysis.Rproj in RStudio
- Open main.R in the editor
- Run the entire script: Ctrl+Alt+R (Windows/Linux) or Cmd+Option+R (Mac)
    cd bulk-rnaseq-differential-expression-r/analysis
    R

Then in R:

    source("main.R")

Or run the script non-interactively from the shell:

    cd bulk-rnaseq-differential-expression-r/analysis
    Rscript main.R

Important: Always run from within the analysis/ directory so that the script starts with the correct working directory.
This pipeline uses the airway dataset from Bioconductor, which contains RNA-seq data from airway smooth muscle cells treated with dexamethasone.
Dataset details:
- 8 samples (4 treated, 4 untreated)
- ~63,000 genes
- Dexamethasone treatment vs control
For custom data, you would need:
- Count matrix (genes × samples)
- Sample metadata with experimental conditions
- Gene annotations (or use built-in annotation databases)
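As a rough illustration of the custom-data requirements above, the sketch below checks that a count matrix and its sample metadata line up before they are handed to DESeq2. The toy values and the helper name `check_inputs` are hypothetical and are not part of main.R; the commented-out last line shows how the validated inputs could be passed on if DESeq2 is installed.

```r
# Hypothetical toy inputs; replace with your own count matrix and metadata.
counts <- matrix(
  c( 10L,   0L,  25L,   3L,
    200L, 180L, 210L, 190L,
      0L,   1L,   0L,   2L),
  nrow = 3, byrow = TRUE,
  dimnames = list(c("geneA", "geneB", "geneC"),
                  c("s1", "s2", "s3", "s4"))
)
coldata <- data.frame(
  condition = factor(c("untreated", "treated", "untreated", "treated"),
                     levels = c("untreated", "treated")),
  row.names = c("s1", "s2", "s3", "s4")
)

# Hypothetical helper: stops with an error if the inputs are inconsistent.
check_inputs <- function(counts, coldata) {
  stopifnot(
    is.matrix(counts),
    is.numeric(counts),
    all(counts >= 0),                 # counts cannot be negative
    all(counts == round(counts)),     # raw counts, not normalized values
    !is.null(colnames(counts)),
    identical(colnames(counts), rownames(coldata))  # same samples, same order
  )
  invisible(TRUE)
}

check_inputs(counts, coldata)

# With DESeq2 installed, the inputs could then be combined like this:
# dds <- DESeq2::DESeqDataSetFromMatrix(countData = counts,
#                                       colData = coldata,
#                                       design = ~ condition)
```

The order check matters because DESeq2 matches columns of the count matrix to rows of the metadata by position.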
The pipeline generates the following outputs, which are displayed in the R session and can be exported:
- Differential expression tables at α = 0.05 and α = 0.001
- Top 10 differentially expressed genes with annotations
- Summary statistics (log2 fold changes, adjusted p-values)
- MA plots - Log fold change vs mean expression (3 versions)
- Boxplots - Expression distribution across samples
- Barplots - Mean expression levels
- Heatmaps - Top 10 DEGs with hierarchical clustering (2 versions)
- Scatter plots - Correlation between selected genes
- Pathway enrichment results from 9 databases:
  - KEGG_2021_Human
  - Reactome_2022
  - GO_Biological_Process_2023
  - GO_Molecular_Function_2023
  - GO_Cellular_Component_2023
  - OMIM_Disease
  - OMIM_Expanded
  - DisGeNET
  - Human_Phenotype_Ontology
- Gene symbol mapping for top differentially expressed genes
- Correlation statistics for gene pairs
- The analysis uses the airway dataset, which is version-controlled through Bioconductor
- DESeq2 performs internal normalization (no manual scaling required)
- For custom analyses, consider setting a random seed: set.seed(123)
- R session info can be captured with sessionInfo()
- Bioconductor version: 3.18 recommended (this release is built for R 4.3)
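One possible way to act on the seed and session-info notes above is to save the session details to a file next to the results. This is only a sketch: the file name `session_info.txt` and the use of a temporary directory are assumptions, not something main.R is known to do.

```r
# Fix the random seed so that any stochastic steps are repeatable.
set.seed(123)

# Capture the printed session info (R version, attached packages) as text.
info_lines <- capture.output(sessionInfo())

# Assumed output location; point this at your results folder instead.
info_file <- file.path(tempdir(), "session_info.txt")
writeLines(info_lines, info_file)
```

Keeping this file with each set of results makes it easier to tell later which package versions produced them.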
- Pre-filtering: Remove genes with counts < 3 across all samples
- Normalization: DESeq2 size factor normalization
- Testing: Negative binomial generalized linear model
- Multiple testing correction: Benjamini-Hochberg FDR
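The normalization and correction steps above can be illustrated in base R on a toy matrix. This sketch does not reproduce DESeq2's model fitting. It only shows the median-of-ratios idea behind size factors and the Benjamini-Hochberg adjustment via `p.adjust`. The pre-filter shown is one reading of "counts < 3 across all samples" (drop a gene only when every sample is below 3), which is an assumption about main.R. The helper name and all numbers are hypothetical.

```r
# Toy counts: samples s2..s4 are sequenced 2x, 3x and 4x as deeply as s1.
counts <- matrix(
  c( 10,  20,  30,  40,
    100, 200, 300, 400,
      5,  10,  15,  20,
      0,   1,   0,   1),
  nrow = 4, byrow = TRUE,
  dimnames = list(paste0("gene", 1:4), paste0("s", 1:4))
)

# Assumed pre-filter: keep genes with a count of at least 3 in some sample.
keep <- rowSums(counts >= 3) > 0
filtered <- counts[keep, , drop = FALSE]

# Median-of-ratios size factors: compare each sample to a per-gene
# geometric mean and take the median ratio per sample.
median_ratio_size_factors <- function(mat) {
  log_mat <- log(mat)
  log_geo_mean <- rowMeans(log_mat)
  usable <- is.finite(log_geo_mean)   # skip genes with a zero count
  apply(log_mat[usable, , drop = FALSE], 2,
        function(sample_col) exp(median(sample_col - log_geo_mean[usable])))
}

sf <- median_ratio_size_factors(filtered)
normalized <- sweep(filtered, 2, sf, "/")   # divide each column by its factor

# Benjamini-Hochberg adjustment of hypothetical raw p-values.
pvalues <- c(0.001, 0.01, 0.02, 0.03, 0.5)
padj <- p.adjust(pvalues, method = "BH")
n_significant <- sum(padj < 0.05)
```

After normalization every row of the toy matrix is constant across samples, because depth was the only difference built into the data.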
- Uses enrichR to query 9 biological databases
- Fisher's exact test for over-representation
- Results include adjusted p-values, odds ratios, and gene lists
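To make the over-representation test above concrete, here is a small base R sketch of a one-sided Fisher's exact test on a single gene set. All four numbers are made up for illustration. The background universe that Enrichr uses is determined by the service and is not reproduced here.

```r
# Hypothetical numbers for illustration only.
universe_size <- 20000  # background genes
pathway_size  <- 200    # genes annotated to the pathway
query_size    <- 100    # genes in the differentially expressed list
overlap       <- 10     # query genes that belong to the pathway

# 2 x 2 contingency table: query membership vs pathway membership.
contingency <- matrix(
  c(overlap,                query_size - overlap,
    pathway_size - overlap, universe_size - pathway_size - query_size + overlap),
  nrow = 2, byrow = TRUE,
  dimnames = list(c("in_query", "not_in_query"),
                  c("in_pathway", "not_in_pathway"))
)

# One-sided test: is the overlap larger than expected by chance?
ft <- fisher.test(contingency, alternative = "greater")
p_enrichment <- ft$p.value
odds_ratio <- unname(ft$estimate)

# The same p-value written as a hypergeometric upper tail.
p_hyper <- phyper(overlap - 1, pathway_size, universe_size - pathway_size,
                  query_size, lower.tail = FALSE)
```

With these numbers about one overlapping gene would be expected by chance (100 × 200 / 20000), so an overlap of 10 yields a small p-value and an odds ratio well above 1.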
- rlog (regularized log) transformation to stabilize variance before visualization
- Hierarchical clustering with complete linkage
- Pearson correlation for gene-gene relationships
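The clustering and correlation choices above can be tried on simulated data with base R alone. This sketch uses random values rather than rlog-transformed counts, and the gene names are placeholders. It only shows which functions and arguments correspond to "complete linkage" and "Pearson".

```r
set.seed(123)

# Simulated expression matrix: 5 genes x 8 samples (placeholder values).
expr <- matrix(rnorm(40), nrow = 5,
               dimnames = list(paste0("gene", 1:5), paste0("s", 1:8)))

# Make gene2 track gene1 closely so the pair is clearly correlated.
expr["gene2", ] <- expr["gene1", ] + rnorm(8, sd = 0.1)

# Hierarchical clustering of genes: Euclidean distance, complete linkage.
gene_tree <- hclust(dist(expr), method = "complete")

# Pearson correlation between two genes, with a significance test.
r <- cor(expr["gene1", ], expr["gene2", ], method = "pearson")
ct <- cor.test(expr["gene1", ], expr["gene2", ], method = "pearson")

# plot(gene_tree) would draw the dendrogram in an interactive session.
```

Euclidean distance with complete linkage also matches pheatmap's documented defaults, so a heatmap drawn without extra clustering arguments should group rows the same way.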
| Issue | Solution |
|---|---|
| Package installation fails | Make sure the Bioconductor release matches your R version (3.18 requires R 4.3): BiocManager::install(version = "3.18") |
| "Cannot open connection" error | Ensure working directory is analysis/ |
| Memory issues | If the error mentions future.globals.maxSize (a limit from the future package, not R's overall memory), raise it: options(future.globals.maxSize = 8000 * 1024^2) |
| Plots not displaying | Check graphics device: dev.cur() |
| enrichR connection fails | Check internet connection; enrichR requires online access |
| DESeq2 convergence warnings | Normal for some genes; check specific gene results |
- Runtime: ~2-5 minutes on standard hardware (airway dataset)
- Memory: ~2-4 GB RAM required
- Internet: Required for enrichR pathway queries
- Apply to custom RNA-seq datasets
- Add volcano plots for visualization
- Implement time-series or multi-factor designs
- Export results to CSV/Excel for sharing
- Add PCA visualization for sample clustering
- Integrate with gene set enrichment analysis (GSEA)
This project is provided as-is for educational and research purposes.
Issues and suggestions are welcome. Please open an issue for any bugs or feature requests.