2025-12-17
Functional enrichment analysis is a staple in bioinformatics for interpreting lists of genes identified from omics experiments. enrichit provides fast, C++-based implementations of two of the most widely used methods: over-representation analysis (ORA) and gene set enrichment analysis (GSEA).
The package is designed to be efficient and easy to integrate into existing workflows, with a focus on performance and standardized output formats.
You can install enrichit from GitHub:
devtools::install_github("YuLab-SMU/enrichit")
ORA determines whether a set of genes of interest (e.g., differentially expressed genes) is enriched in a known gene set (e.g., a biological pathway) more than would be expected by chance.
enrichit implements ORA using the hypergeometric distribution (one-sided Fisher’s exact test). The p-value is calculated as the probability of observing at least k genes from the specific gene set in the selected list of n genes, given a background population (universe) of N genes containing M genes from that set.
$$ p = 1 - \sum_{i=0}^{k-1} \frac{\binom{M}{i} \binom{N-M}{n-i}}{\binom{N}{n}} $$
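As a hedged numerical check of the formula above, the sketch below evaluates the same upper-tail probability with base R's `phyper()` rather than with enrichit itself. The helper names `ora_pvalue` and `ora_pvalue_direct` are illustrative and not part of the package; the inputs (N = 1000, M = 50, n = 20) describe a 50-gene set, a 20-gene selection, and a 1000-gene universe.

```r
# Upper-tail hypergeometric p-value, P(X >= k), using base R only.
# 'ora_pvalue' is a hypothetical helper name, not an enrichit function.
ora_pvalue <- function(k, M, N, n) {
  # phyper(q, m, n, k): q = k - 1, m = genes in set, n = genes outside set, k = draws
  phyper(k - 1, M, N - M, n, lower.tail = FALSE)
}

# Literal transcription of the formula: 1 - sum_{i=0}^{k-1} [...]
ora_pvalue_direct <- function(k, M, N, n) {
  i <- seq_len(k) - 1
  1 - sum(choose(M, i) * choose(N - M, n - i) / choose(N, n))
}

# Moderate overlap (k = 3): both routes agree
p_tail   <- ora_pvalue(3, M = 50, N = 1000, n = 20)
p_direct <- ora_pvalue_direct(3, M = 50, N = 1000, n = 20)

# Extreme overlap (all 20 selected genes fall in the 50-gene set):
# subtracting from 1 loses such a tiny tail in floating point, so the
# lower.tail = FALSE form is the numerically safer route.
p_extreme <- ora_pvalue(20, M = 50, N = 1000, n = 20)

print(c(p_tail = p_tail, p_direct = p_direct, p_extreme = p_extreme))
```

The literal `1 - sum(...)` form is fine for moderate p-values but cannot represent values far below machine precision, which is one reason to compute the upper tail directly.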
library(enrichit)
# Simulate a universe of 1000 genes
universe <- paste0("Gene", 1:1000)
# Define gene sets
gene_sets <- list(
  PathwayA = paste0("Gene", 1:50),    # Genes 1-50
  PathwayB = paste0("Gene", 800:850)  # Genes 800-850
)
# Select 'significant' genes (e.g., top 20 genes)
# PathwayA should be enriched
sig_genes <- paste0("Gene", 1:20)
# Run ORA
ora_result <- ora(
  gene = sig_genes,
  gene_sets = gene_sets,
  universe = universe
)
# View results
as.data.frame(ora_result)
        ID SetSize Count DESize UniverseSize       pvalue
1 PathwayA      50    20     20         1000 1.388265e-28
2 PathwayB      51     0     20         1000 1.000000e+00
  geneID
1 Gene8/Gene19/Gene4/Gene3/Gene17/Gene14/Gene11/Gene10/Gene7/Gene1/Gene12/Gene2/Gene15/Gene5/Gene6/Gene9/Gene18/Gene20/Gene16/Gene13
2
  GeneRatio BgRatio RichFactor FoldEnrichment
1     20/20 50/1000        0.4             20
2      0/20 51/1000        0.0              0
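The ratio columns in the output above are not defined in the text. The short sketch below shows definitions that reproduce the printed PathwayA values; they are inferred from the numbers shown here rather than taken from the enrichit source, so treat them as an assumption.

```r
# Values copied from the PathwayA row of the ORA output
count         <- 20    # Count: selected genes found in the set
set_size      <- 50    # SetSize
de_size       <- 20    # DESize: number of selected genes
universe_size <- 1000  # UniverseSize

gene_ratio      <- count / de_size           # GeneRatio, printed as 20/20
bg_ratio        <- set_size / universe_size  # BgRatio, printed as 50/1000
rich_factor     <- count / set_size          # assumed: Count / SetSize
fold_enrichment <- gene_ratio / bg_ratio     # assumed: GeneRatio / BgRatio

print(c(RichFactor = rich_factor, FoldEnrichment = fold_enrichment))
```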
GSEA evaluates whether a defined set of genes shows statistically significant, concordant differences between two biological states. Unlike ORA, GSEA uses the entire ranked list of genes, avoiding the need for arbitrary thresholds to select “significant” genes.
enrichit offers a fast C++ implementation of GSEA. It calculates an Enrichment Score (ES) that reflects the degree to which a gene set is over-represented at the top or bottom of a ranked list of genes.
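To make the Enrichment Score concrete, here is a minimal base R sketch of the classic weighted running-sum statistic commonly used for GSEA. It is an illustration only: `running_sum_es` is a hypothetical helper, the toy data are simulated, and enrichit's C++ implementation may differ in weighting and tie handling.

```r
# Running-sum enrichment score for one gene set (illustrative, not enrichit code)
running_sum_es <- function(stats, gene_set, exponent = 1) {
  stats <- sort(stats, decreasing = TRUE)
  hit <- names(stats) %in% gene_set
  # Need at least one gene inside and one gene outside the set
  if (!any(hit) || all(hit)) return(NA_real_)

  # Hits step up in proportion to |statistic|^exponent; misses step down equally
  up   <- ifelse(hit, abs(stats)^exponent, 0)
  down <- ifelse(hit, 0, 1)
  running <- cumsum(up / sum(up) - down / sum(down))

  # ES is the running-sum value with the largest absolute deviation from zero
  running[[which.max(abs(running))]]
}

# Toy ranked list of 200 genes
set.seed(1)
stats <- sort(rnorm(200), decreasing = TRUE)
names(stats) <- paste0("Gene", seq_along(stats))

es_top    <- running_sum_es(stats, names(stats)[1:20])     # set sits at the top
es_bottom <- running_sum_es(stats, names(stats)[181:200])  # set sits at the bottom
es_random <- running_sum_es(stats, sample(names(stats), 20))

print(c(top = es_top, bottom = es_bottom, random = es_random))
```

A set concentrated entirely at the top of the list reaches an ES of about 1, and one concentrated entirely at the bottom reaches about -1; a random set stays much closer to zero.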
The package supports different methods for p-value calculation:
method = "multilevel": This is the default and recommended method. It uses an adaptive multi-level splitting Monte Carlo approach to estimate low p-values efficiently with high accuracy, similar to the fgsea package.
method = "permute": Standard permutation of gene labels.
method = "sample": Random sampling of gene sets (faster but less rigorous for some null hypotheses).
# Generate synthetic ranked gene list
set.seed(42)
geneList <- sort(rnorm(1000), decreasing = TRUE)
names(geneList) <- paste0("Gene", 1:1000)
# Define gene sets
# PathwayTop is enriched at the top (positive ES)
# PathwayBottom is enriched at the bottom (negative ES)
gene_sets <- list(
  PathwayTop = names(geneList)[1:50],
  PathwayBottom = names(geneList)[951:1000],
  PathwayRandom = sample(names(geneList), 50)
)
# Run GSEA using the multilevel method
gsea_result <- gsea(
  geneList = geneList,
  gene_sets = gene_sets,
  method = "multilevel",
  nPerm = 1000, # Base permutations
  minGSSize = 10,
  maxGSSize = 500
)
# View results
head(gsea_result)
             ID enrichmentScore        NES       pvalue setSize   log2err
1    PathwayTop       1.0000000  3.9449035 2.491505e-12      50 0.8986712
2 PathwayBottom      -1.0000000 -3.7744199 1.797993e-12      50 0.9101197
3 PathwayRandom      -0.2638292 -0.9958021 4.639175e-01      50       NaN
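The idea behind sampling-based p-values and the NES column can be sketched in base R. This is a generic illustration under simplifying assumptions (a two-sided comparison on |ES| and only 200 random gene sets); it is not a description of how enrichit implements its `multilevel`, `permute`, or `sample` methods, and `es_of` is a hypothetical helper.

```r
# Running-sum ES for a logical membership vector over an already ranked list
# (illustrative helper, not an enrichit function)
es_of <- function(stats, in_set) {
  w <- abs(stats) * in_set
  running <- cumsum(w / sum(w) - (!in_set) / sum(!in_set))
  running[[which.max(abs(running))]]
}

set.seed(42)
stats    <- sort(rnorm(1000), decreasing = TRUE)
set_size <- 50
in_top   <- seq_along(stats) <= set_size  # the 50 top-ranked genes

es_obs <- es_of(stats, in_top)

# Null distribution: ES of random gene sets of the same size
n_perm  <- 200
es_null <- replicate(n_perm, {
  idx <- sample(length(stats), set_size)
  es_of(stats, seq_along(stats) %in% idx)
})

# Empirical p-value with the usual +1 correction (never exactly zero)
p_emp <- (sum(abs(es_null) >= abs(es_obs)) + 1) / (n_perm + 1)

# Normalized ES: observed ES divided by the mean null ES of the same sign
nes <- es_obs / mean(es_null[es_null > 0])

print(c(ES = es_obs, p_emp = p_emp, NES = nes))
```

With 200 random sets the smallest attainable p-value is 1/201, so plain sampling cannot resolve p-values on the order of 1e-12 like those printed above without a very large number of samples. That limitation is the motivation for adaptive schemes such as the multilevel method.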
enrichit also accepts GSON objects, which store gene sets together with their metadata. The GSON class is defined in the gson package and provides a structured container for gene identifiers, gene set names, and other associated information.
# Assuming you have a GSON object 'g'
# result <- gsea_gson(geneList = geneList, gson = g)
sessionInfo()
R version 4.5.2 (2025-10-31 ucrt)
Platform: x86_64-w64-mingw32/x64
Running under: Windows 11 x64 (build 26200)
Matrix products: default
LAPACK version 3.12.1
locale:
[1] LC_COLLATE=C
[2] LC_CTYPE=Chinese (Simplified)_China.utf8
[3] LC_MONETARY=Chinese (Simplified)_China.utf8
[4] LC_NUMERIC=C
[5] LC_TIME=Chinese (Simplified)_China.utf8
time zone: Asia/Shanghai
tzcode source: internal
attached base packages:
[1] stats graphics grDevices utils datasets methods base
other attached packages:
[1] enrichit_0.0.8
loaded via a namespace (and not attached):
[1] digest_0.6.39 fastmap_1.2.0 xfun_0.54 yulab.utils_0.2.3
[5] rappdirs_0.3.3 knitr_1.50 htmltools_0.5.8.1 rmarkdown_2.30
[9] cli_3.6.5 compiler_4.5.2 tools_4.5.2 evaluate_1.0.5
[13] Rcpp_1.1.0 yaml_2.3.11 rlang_1.1.6 jsonlite_2.0.0
[17] fs_1.6.6