This R package performs association tests between observed data and their systematic patterns of variation. Systematic variation can be modeled by latent variables, which can arise from biological processes, experimental conditions, environmental factors, and other sources. We often estimate these patterns using principal component analysis (PCA), factor analysis (FA), logistic factor analysis (LFA), K-means clustering, partitioning around medoids (PAM), and related methods. The jackstraw methods learn the over-fitting characteristics inherent in unsupervised learning, where the same observed data are used both to estimate the systematic patterns and to test for association with them (see circular analysis).
Using a variety of unsupervised learning techniques, the jackstraw provides a resampling strategy and testing scheme to estimate statistical significance of association between the observed data and their systematic patterns of variation. For example, the cell cycle in microarray data may be estimated by principal components (PCs). Then, we can use the jackstraw for PCA to identify genes that are significantly associated with these PCs. On the other hand, cell identities in single cell RNA-seq (scRNA-seq) data are often determined by K-means clustering or other unsupervised clustering algorithms. Then, the jackstraw for clustering can identify single cells that are significant members of a given cluster.
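The resampling idea can be illustrated in a few lines of base R. This is only a simplified sketch for PCA, not the package's implementation: the simulated data, the helper f_stats, and the pooled empirical p-value formula are assumptions made for illustration. In each iteration a small number of rows (s) are permuted to break their association with the latent structure, PCA is recomputed on the perturbed data, and the F-statistics of the permuted rows are collected as an empirical null.

```r
set.seed(1)
m <- 200; n <- 20; r <- 1; s <- 10; B <- 100

# Simulated data (assumption): the first 20 rows share one latent signal.
latent <- rnorm(n)
dat <- matrix(rnorm(m * n), nrow = m, ncol = n)
dat[1:20, ] <- dat[1:20, ] + outer(rep(2, 20), latent)

# Hypothetical helper: F-statistic of each row regressed on the top r PCs.
f_stats <- function(x, r) {
  xc <- x - rowMeans(x)                        # row-center
  v <- svd(xc)$v[, seq_len(r), drop = FALSE]   # top r right singular vectors
  rss0 <- rowSums(xc^2)                        # intercept-only model
  rss1 <- rss0 - rowSums((xc %*% v)^2)         # intercept + PCs
  ((rss0 - rss1) / r) / (rss1 / (ncol(x) - r - 1))
}

obs <- f_stats(dat, r)

# Null statistics: permute s rows, re-estimate the PCs, keep their F-statistics.
null <- as.vector(replicate(B, {
  idx <- sample(m, s)
  jack <- dat
  jack[idx, ] <- t(apply(dat[idx, , drop = FALSE], 1, sample))
  f_stats(jack, r)[idx]
}))

# Empirical p-values against the pooled null distribution.
p <- vapply(obs, function(f) (1 + sum(null >= f)) / (1 + length(null)), numeric(1))
```

Because the PCs are re-estimated with the permuted rows included, the null statistics carry the same over-fitting as the observed ones, which is what a naive F-test on the original PCs would miss.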
Using jackstraw_pca, we can find variables that are statistically significantly associated with the top r principal components (PCs). Alternatively, we can test association with respect to a subset of those r PCs, called the PCs of interest (r1). The package also supports truncated PCA, using the augmented implicitly restarted Lanczos bidiagonalization algorithm (IRLBA; jackstraw_irlba) or randomized singular value decomposition (RSVD; jackstraw_rpca).
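A minimal usage sketch of jackstraw_pca on simulated data follows. The data and the parameter values (r = 1, s = 10, B = 100) are arbitrary assumptions chosen for illustration; the rows of dat are the variables being tested and the columns are the observations.

```r
library(jackstraw)

set.seed(1)
m <- 200; n <- 20

# Simulated data (assumption): the first 20 variables share one latent signal.
latent <- rnorm(n)
dat <- matrix(rnorm(m * n), nrow = m, ncol = n)
dat[1:20, ] <- dat[1:20, ] + outer(rep(2, 20), latent)

# Test association between each row of dat and the top PC.
out <- jackstraw_pca(dat, r = 1, s = 10, B = 100, verbose = FALSE)

# One p-value per variable (row).
head(out$p.value)
```

Larger values of B give a finer-grained empirical null at the cost of more computation.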
Logistic factor analysis (LFA) and ALStructure estimate population structure from genetic data (single-nucleotide polymorphisms; SNPs). jackstraw_lfa and jackstraw_alstructure provide the corresponding association tests between SNPs and population structure, as estimated by these methods. More generally, one can directly specify an estimation method for latent variables in jackstraw_subspace.
Instead of continuous latent variables estimated by PCA, LFA, or other methods, one may be interested in estimating discrete clusters from high-dimensional data. For K-means clustering, jackstraw_kmeans evaluates whether data points are significant members of a given cluster by testing association between the observed data and the cluster centers. This can help select data points that are reliable members of clusters and further improve the cluster membership.
Related algorithms, such as Partitioning Around Medoids (PAM), also known as k-medoids, and Mini Batch K-means, are supported by jackstraw_pam and jackstraw_MiniBatchKmeans, respectively. More generally, jackstraw_cluster can be used for other clustering algorithms.
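As a hedged sketch of the clustering workflow, the example below runs K-means on simulated two-cluster data and then passes the fitted object to jackstraw_kmeans. The data, the cluster layout, and the values of s and B are assumptions for illustration only; here the rows of dat are the data points being clustered.

```r
library(jackstraw)

set.seed(1)
m <- 40; n <- 10

# Simulated data (assumption): two clusters of 20 rows with opposite centers.
center1 <- c(rep(1, 5), rep(-1, 5))
dat <- matrix(rnorm(m * n, sd = 0.5), nrow = m, ncol = n)
dat[1:20, ] <- sweep(dat[1:20, ], 2, center1, "+")
dat[21:40, ] <- sweep(dat[21:40, ], 2, center1, "-")

# Cluster the rows, then test each row's membership in its assigned cluster.
kmeans.dat <- kmeans(dat, centers = 2, nstart = 10)
out <- jackstraw_kmeans(dat, kmeans.dat, s = 2, B = 100, verbose = FALSE)

# One p-value per data point (row).
head(out$p.F)
```

Rows with large p-values are candidates for unreliable cluster membership.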
There are a few additional functions to support statistical inference for unsupervised learning, such as finding the number of PCs or clusters. Based on p-values, we can estimate posterior inclusion probabilities (PIPs) using pip.
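A small sketch of turning p-values into PIPs with pip is shown below. The p-values here are simulated (an assumption for illustration): 900 null p-values drawn uniformly and 100 non-null p-values concentrated near zero. In practice they would come from one of the jackstraw functions.

```r
library(jackstraw)

set.seed(1)

# Simulated p-values (assumption): 900 nulls followed by 100 non-nulls.
pvalue <- c(runif(900), rbeta(100, 0.5, 20))

# Posterior inclusion probability for each p-value.
prob <- pip(pvalue, verbose = FALSE)

# Compare the average PIP among nulls and non-nulls.
c(null = mean(prob[1:900]), non_null = mean(prob[901:1000]))
```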
Chung, N.C. (2020) Statistical significance of cluster membership for unsupervised evaluation of cell identities. Bioinformatics, 36(10): 3107–3114 https://doi.org/10.1093/bioinformatics/btaa087
Chung, N.C. and Storey, J.D. (2015) Statistical significance of variables driving systematic variation in high-dimensional data. Bioinformatics, 31(4): 545–554 https://doi.org/10.1093/bioinformatics/btu674
Association Test with Principal Components with a Gentle Introduction to Latent Variable Models
Statistical Test of Cluster Memberships with a Toy Data Set (mtcars)
Unsupervised Evaluation of Cell Identities in Single Cell Genomics using the 10X Genomics Data
Bioconductor dependencies (e.g., lfa, gcatest, qvalue) may fail to install automatically, which results in a warning. To solve this problem, please install the Bioconductor dependencies manually first:
if (!requireNamespace("BiocManager", quietly = TRUE))
  install.packages("BiocManager")
BiocManager::install(c('qvalue', 'lfa', 'gcatest'))
This package is in active development. Install jackstraw from GitHub:
install.packages("devtools")
library("devtools")
install_github("ncchung/jackstraw")
To use jackstraw_alstructure, install the optional alstructure package from GitHub:
library(devtools)
install_github("StoreyLab/alstructure")
The stable version, jackstraw v1.3.14, is on CRAN. To install the stable version from CRAN:
install.packages("jackstraw")
Here are some implementations of the jackstraw in different contexts and application domains.
jackstraw (Python) by Iain Carmichael
Jackstraw significance testing for JIVE in Python