| Title: | Linear Models for Sequence Count Data |
| Version: | 1.0.0 |
| Description: | Provides scalable generalized linear and mixed effects models tailored for sequence count data analysis (e.g., analysis of 16S or RNA-seq data). Uses Dirichlet-multinomial sampling to quantify uncertainty in relative abundance or relative expression conditioned on observed count data. Implements scale models as a generalization of normalizations which account for uncertainty in scale (e.g., total abundances) as described in Nixon et al. (2025) <doi:10.1186/s13059-025-03609-3> and McGovern et al. (2025) <doi:10.1101/2025.08.05.668734>. |
| License: | MIT + file LICENSE |
| Encoding: | UTF-8 |
| RoxygenNote: | 7.3.3 |
| Suggests: | rBeta2009, testthat (≥ 3.0.0), lmtest, sandwich, knitr, rmarkdown |
| Config/testthat/edition: | 3 |
| Imports: | purrr, lme4, lmerTest, parallel, MASS, nlme, abind, matrixStats, methods, stats |
| Depends: | R (≥ 3.5) |
| LazyData: | true |
| VignetteBuilder: | knitr |
| NeedsCompilation: | no |
| Packaged: | 2026-01-27 18:31:44 UTC; jds6696 |
| Author: | Justin Silverman [aut, cre], Greg Gloor [aut], Kyle McGovern [aut, ctb] |
| Maintainer: | Justin Silverman <JustinSilverman@psu.edu> |
| Repository: | CRAN |
| Date/Publication: | 2026-01-31 19:10:16 UTC |
ALDEx3 Linear Models
Description
ALDEx3 Linear Models
Usage
aldex(
Y,
X,
data = NULL,
method = "lm",
nsample = 2000,
scale = NULL,
streamsize = 8000,
n.cores = detectCores() - 1,
return.pars = c("X", "estimate", "std.error", "p.val", "p.val.adj", "logComp",
"logScale"),
p.adjust.method = "BH",
test = "t.HC3",
onesided = FALSE,
...
)
Arguments
Y |
an (D x N) matrix of sequence count, N is number of samples, D is number of taxa or genes |
X |
either a formula (in which case DATA must be non-null) or a model matrix of dimension P x N (P is number of linear model covariates). If a formula is passed it should not include the target Y, e.g., should simply be "~condition-1" (note the lack of the left hand side). If using lme4, this should be the formula including random effects. |
data |
a data frame for use with formula, must have N rows |
method |
(default lm) The regression method; "lm": linear regression,
"lme4": linear mixed effects regression with lme4; "nlme": linear mixed
effects models with nlme, REQUIRES "random" argument representing random
effects be passed into |
nsample |
number of monte carlo replicates |
scale |
the scale model, can be a function or an N x nsample matrix (samples should be given on log2-scale; e.g., samples should be of log of system scale). The API for writing your own scale models is documented below in examples. |
streamsize |
(default 8000) memory footprint (approximate) at which to use streaming. This should be thought of as the number of Mb for each streaming chunk. If DNnsample*8/1000000 is less than streamsize then no streaming will be performed. Note, to conserve memory, samples from the Dirichlet and scale models will not be returned if streaming is used. Streaming can be turned off by setting streamsize=Inf. |
n.cores |
(default detectCores()-1) If method is 'lme4', use this many cores for running mixed effects models in parallel. |
return.pars |
what results should be returned, see return section below. |
p.adjust.method |
(default BH) The method for multiple hypothesis test
correction. See |
test |
(default t.HC3), "t", t test is performed for each covariate (fast); "t.HC0" Heteroskedasticsity-Robust Standard Errors used (HC0; White's; slower); "t.HC3" (default) Heteroskedasticsity-Robust Standard Errors used (HC3; unlike HC0, this includes a leverage adjustment and is better for small sample sizes or when there are data with high leverage; slowest). To learn more about these, loko at Long and Ervin (2000) Using Heteroscedasticity Consistent Standard Errors in the Linear Regression Model, The American Statistician. |
onesided |
(default: FALSE) if sided return p-values for two-sided test. Otherwise if "lower" or "upper" return one-sided test corresponding to test that estimate is negative or positive respectively. |
... |
parameters to be passed to the scale model (if a function is provided), may also be random or correlation arguments for nlme. |
Value
a list with elements controled by parameter resturn.pars. Options
include: - X: P x N covariate matrix - estimate: (P x D x nsample) array
of linear model estimates - std.error: (P x D x nsample) array of standard
error for the estimates - p.val: (P x D) matrix, unadjusted p-value for
two-sided t-test - p.val.adj: (P x D) matrix, p-value for two-sided t-test
adjusted according to p.adj.method - logScale: (N x S) matrix, samples
of the log scale from the scale model - logComp: (D x N x S) array,
samples of the log composition from the multinomial-Dirichlet - streaming:
boolean, detnote if streaming was used. - random.effects (Pr x N x S): if
using mixed effects models, return all Pr random effects. Note, logScale
and logComp are not returned if streaming is active.
Author(s)
Justin Silverman, Kyle McGovern
Examples
Y <- matrix(1:110, 10, 11)
condition <- c(rep(0, 5), rep(1, 6))
data <- data.frame(condition=condition)
## demonstrate formula interface and passing optional argument (gamma) to
## the scale model (clr)
res <- aldex(Y, ~condition, data, nsample=2000, scale=clr.sm, gamma=0.5)
## demonstrating how to write a custom scale model, I will write a model
## that generalizes total sum scaling (where we assume no change between
## conditions)
## Functions can include parameters X (model matrix), Y, and logComp.
## If included in the function definition, those parameters will be passed
## dynamically when aldex is running. Other optional parameters (gamma)
## can be passed as additional arguments to the aldex function
tss <- function(X, logComp, gamma=0.5) {
P <- nrow(X)
nsample <- dim(logComp)[3]
LambdaScale <- matrix(rnorm(P*nsample,0,gamma), P, nsample)
logScale <- t(X)%*% LambdaScale
return(logScale)
}
res <- aldex(Y, ~condition, data, nsample=2000, scale=tss, gamma=0.75)
Simulation function soley for testing and exploring ALDEx3, Truth is in CLR Coordinates.
Description
Not designed to create realistic data. Does not add any noise to linear regression! True W is in CLR Corrdinates
Usage
aldex.lm.sim.clr(D = 10, N = 11, P = 2, depth = 10000)
Arguments
D |
number of taxa/genes |
N |
number of samples |
P |
number of covariates |
depth |
sum of counts for each multinomial draw |
Value
a list with elements Y, X, W, and Lambda
Author(s)
Justin Silverman
Simulation for testing mixed effects models.
Description
Includes random intercpt, option to include random slope, second random intercept, and time-correlation.
Usage
aldex.mem.sim(
D,
days,
subjects,
depth = 10000,
location = 1,
random_slope = FALSE,
corr = 0,
rho_ar1 = 0,
sd_resid = 0.1
)
Arguments
D |
number of taxa/genes |
days |
num days (i.e., repeated measurements) for each subject |
subjects |
num of subjects to simulate |
depth |
sum of counts for each multinomial draw |
location |
second random intercept, if 0 ignore, else simulate num of locations |
random_slope |
If true, simulate a random slope for each subject. |
corr |
The correlation between slope/random intercept |
rho_ar1 |
The ar1 time-correlation, if 0, don't simulate |
sd_resid |
The residual error time-correlation. |
Author(s)
Kyle McGovern
Calculate p-values adjusting for changes in sign as described by Nixon et al. (2024) in Beyond Normalization: Incorporating Scale Uncertainty in Microbiome and Gene Expression Analysis (internal only)
Description
Calculate p-values adjusting for changes in sign as described by Nixon et al. (2024) in Beyond Normalization: Incorporating Scale Uncertainty in Microbiome and Gene Expression Analysis (internal only)
Usage
aldex.pvals(p.lower, p.upper, p.adjust.method, onesided = FALSE)
Arguments
p.lower |
A P x D x S matrix for P covariates, D taxa/genes, and S monte carlo samples representing the lower tail p.values |
p.upper |
A P x D x S matrix for P covariates, D taxa/genes, and S monte carlo samples representing the upper tail p.values |
p.adjust.method |
An adjutment method for p.adjust |
onesided |
(default: FALSE) if sided return p-values for two-sided test. Otherwise if "lower" or "upper" return one-sided test corresponding to test that estimate is negative or positive respectively. |
Value
A list with P x D matrices with the non-adjusted and adjusted p-values.
Author(s)
Justin Silverman, Kyle McGovern
References
Nixon G, Gloor GB, Silverman JD (2025). "Incorporating scale uncertainty in microbiome and gene expression analysis as an extension of normalization". Genome Biology. doi:10.1186/s13059-025-03609-3
Default CLR-based scale model (with optional scale uncertainty)
Description
Implements the default scale model described in Nixon et al. (Beyond
Normalizations / scale-uncertainty framework). This model generalizes the
centered log-ratio (CLR) normalization by treating the (log) scale as a
latent random variable and allowing additive uncertainty around the
CLR-implied scale differences via a Gaussian term with standard deviation
gamma.
Usage
clr.sm(X, logComp, gamma = 0.5)
Arguments
X |
A numeric design matrix used to model scale variation across
samples. This is the covariate/design matrix passed internally by
|
logComp |
A numeric array of Monte Carlo log-compositions with
dimensions |
gamma |
Non-negative scalar. Standard deviation of the Gaussian
perturbation that relaxes the CLR assumption about scale. |
Details
In the limit gamma = 0, this reduces to the CLR assumption (no scale
uncertainty beyond the CLR-implied scale). Larger gamma values
represent increasing uncertainty about the CLR-implied scale differences.
Value
A numeric matrix of dimension N x nsample giving Monte Carlo samples
of the log-scale for each sample (rows) and each Monte Carlo draw (columns).
Author(s)
Justin Silverman
References
Nixon G, Gloor GB, Silverman JD (2025). "Incorporating scale uncertainty in microbiome and gene expression analysis as an extension of normalization". Genome Biology. doi:10.1186/s13059-025-03609-3
Coefficient-based scale model with user-specified prior on fixed effects
Description
Draws Monte Carlo samples of the log2 scale by sampling fixed-effect
coefficients from a multivariate normal distribution and mapping them
through the design matrix X. This scale model is useful when you
want to encode prior information about how covariates (e.g., treatment,
batch, time) affect scale, rather than specifying scale moments directly
per sample.
Usage
coefficient.sm(X, logComp, c.mu = NULL, c.cor = NULL)
Arguments
X |
A numeric design matrix passed internally by |
logComp |
A numeric array of Monte Carlo log-compositions with
dimensions |
c.mu |
Numeric vector of length |
c.cor |
Numeric |
Details
Specifically, for each Monte Carlo draw b^{(m)} \sim N(c.mu, c.cor),
the per-sample log2 scale is computed as b^{(m)T} X, producing an
N x nsample matrix of log2-scale draws.
For example, with an intercept and a treatment indicator where treatment is
expected to increase log2 scale by ~1 on average, one might use
c.mu = c(0, 1) and c.cor = diag(c(0.25, 0.25)) (i.e., SD 0.5
for each coefficient, independent).
Value
A numeric matrix of dimension N x nsample giving Monte Carlo
draws of the log2 scale for each sample (rows) across nsample draws
(columns).
Author(s)
Kyle McGovern
Cohen's D
Description
Function to compute cohensd on the results provided by the aldex function
Usage
cohensd(m, var)
Arguments
m |
the output of a call to |
var |
if aldex was called with X being a pre-computed model matrix,
this var should be an integer corresponding to a binary covariate
indicateing the desired effect size to calculate (an effect size between
two groups indicated by the binary covariate). For example, if the third
covariate in the model is an indicator denoting health (0) and disease (1)
then set |
Details
WARNING: this function is experimental and requires users read the documetation fully.
Value
A (D x nsample)-matrix of Cohen's D statistics for the variable of interest
Author(s)
Justin Silverman
Combine output from aldex streams (internal only)
Description
Takes a list l, where every element is itself a list of 3D arrays. Each array should have matching first and second dimentions. This function combines those arrays along the third dimension.
Usage
combine.streams(l)
Arguments
l |
a list, where every element is itself a list of 3D arrays. |
Value
a list of 3D arrays
Author(s)
Justin Silverman
Freaking Fask Linear Models
Description
Tailored for ALDEx3 where covariates are shared between massive numbers of linear regressions where only Y is changing.
Usage
fflm(Y, X, test = "t.HC3")
Arguments
Y |
a numeric array (N x D X S) where D is the number of taxa/genes, N is the number of samples, and S is the number of posterior samples |
X |
a numeric matrix (N x P) where P is number of covariates |
test |
(default t.HC3), "t", t test is performed for each covariate (fast); "t.HC0" Heteroskedasticsity-Robust Standard Errors used (HC0; White's; slower); "t.HC3" (default) Heteroskedasticsity-Robust Standard Errors used (HC3; unlike HC0, this includes a leverage adjustment and is better for small sample sizes or when there are data with high leverage; slowest). To learn more about these, loko at Long and Ervin (2000) Using Heteroscedasticity Consistent Standard Errors in the Linear Regression Model, The American Statistician. |
Value
A list of (P x D x S)-arrays with the OLS point estimates, the standard errors, and the two-sided p-values for each coefficient (P), of each model fit to each taxa (D) and each posterior sample (S)
Author(s)
Justin Silverman
Gut Crohn's microbiome dataset (list)
Description
This dataset contains a matrix and a data frame: genus-level microbiome profiles and corresponding sample metadata from a Crohn's disease case–control cohort. The dataset is used in examples and vignettes throughout the package.
Usage
gut_crohns_data
Format
A named list of one matrix and a dataframe:
countsA matrix with read counts with 195 rows and 95 columns)
metadataA data frame with 95 rows and 7 columns containing subject-level covariates.
Details
The counts matrix has one row per sample and one column per genus.
The metadata data frame has one row per sample with metadata, critically Health.status either CD or Control, and Average cell count per gram frozen feces.
Source
Vandeputte D, Falony G, Vieira-Silva S, Wang J, Sailer M, Theis S, Raes J (2017). "Quantitative microbiome profiling links gut community variation to microbial load". Nature, 551, 507–511. doi:10.1038/nature24460
Closure operation
Description
Closure operation
Usage
miniclo(X)
Arguments
X |
should be a D x N matrix, closes along the D dimension |
Value
D x N matrix with columns summing to 1
Author(s)
Justin Silverman
Oral microbiome perturbation dataset (list)
Description
This dataset contains a matrix and a data frame: genus-level microbiome profiles and corresponding sample metadata from an oral microbiome perturbation study. 28 participants' oral microbiomes were measured before, 15 minutes after, and 2 hours after perturbation with either a water control, antiseptic mouthwash, alchohol-free mouthwash, or soda. The dataset is used in examples and vignettes throughout the package.
Usage
oral_mouthwash_data
Format
A named list of one matrix and a dataframe:
countsA matrix with read counts with 116 rows and 81 columns)
metadataA data frame with 81 rows and 38 columns containing sample-level covariates.
Details
The counts matrix has one row per sample and one column per genus.
The metadata data frame has one row per sample with metadata, critically the participant ID, flow cytometry average (and replicate) cells, time_c (time points), and treat (the perturbations )
Source
Marotz C, Morton JT, Navarro P, Coker J, Belda-Ferre P, Knight R (2021). "Quantifying live microbial load in human saliva samples over time reveals stable composition and dynamic load". mSystems, 6(3), e01182-21. doi:10.1128/mSystems.01182-20
Function for sampling Dirichlet random variables (base-2 normalized via log-sum-exp for stability)
Description
Function for sampling Dirichlet random variables (base-2 normalized via log-sum-exp for stability)
Usage
rLogDirichlet(n, alpha)
Arguments
n |
number of samples |
alpha |
D-vector of Dirichlet parameters |
Value
a D x n matrix of samples
Author(s)
Justin Silverman
Test object contains elements
Description
Test object contains elements
Usage
req(obj, names)
Arguments
obj |
object (list type) |
names |
names of elements required to be in the list |
Author(s)
Justin Silverman
Sample-specific scale model with user-specified mean and variance/covariance
Description
Draws Monte Carlo samples of the log2 scale for each sample using user-supplied moments. This scale model is useful when external measurements (e.g., qPCR, flow cytometry, spike-ins) provide information about absolute scale, or when you want to encode prior information about scale on a per-sample basis.
Usage
sample.sm(X, logComp, s.mu = NULL, s.var = NULL, s.cor = NULL)
Arguments
X |
A numeric design matrix passed internally by |
logComp |
A numeric array of Monte Carlo log-compositions with dimensions
|
s.mu |
Numeric vector of length |
s.var |
Numeric vector of length |
s.cor |
Numeric |
Details
Exactly one of s.var or s.cor must be provided:
-
s.var: independent per-sample log2-scale variance (diagonal covariance) -
s.cor: fullN x Nlog2-scale covariance matrix
The returned matrix has N rows (samples) and nsample columns
(Monte Carlo draws), consistent with the ALDEx3 scale-model interface.
Value
A numeric matrix of dimension N x nsample giving Monte Carlo
draws of the log2 scale for each sample (rows) across nsample draws
(columns).
Author(s)
Kyle McGovern
Implementation of SR-MEM: scale-reliant mixed effects models.
Description
Implementation of SR-MEM: scale-reliant mixed effects models.
Usage
sr.mem(logW, formula, data, n.cores, method, mem.args)
Arguments
logW |
a numeric array (N x D X S) where D is the number of taxa, N is the number of samples, and S is the number of posterior samples |
formula |
an lme4 formula with fixed and random effects for lmer |
data |
A data.frame with the random and fixed effects items in formula |
n.cores |
The number of cores for parallelization. If n.cores=1, no parallelization is used |
method |
The mem method to use: either nlme or lme4 |
mem.args |
Additional arguments including random or correlation |
Value
A list of (P x D x S)-arrays with the fixed effect point estimates, the standard errors, degrees of freedom and the lower and upper p-values for each coefficient (P), of each model fit to each taxa (D) and each posterior sample (S) and a (Pr x D x S)-array with (Pr) random effects.
Author(s)
Kyle McGovern
Summary Method for ALDEx3 Objects
Description
Summarize an ALDEx3 result object
Usage
## S3 method for class 'aldex'
summary(object, ignore.intercept = TRUE, ...)
Arguments
object |
An object of class |
ignore.intercept |
(default=TRUE), ignore intercept when creating summary table |
... |
Additional arguments (currently ignored). |
Details
Provides a summary of the adjusted p-values, estimates, and standard errors from an ALDEx3 result object.
This method extracts adjusted p-values from object$p.val.adj, along with
posterior estimates and standard errors averaged across Monte Carlo samples.
The result is returned as a long-format data.frame suitable for downstream
analysis or visualization.
Value
A data.frame with columns parameter, entity,
p.val.adjusted, estimate, and std.error.
Author(s)
Justin Silverman
TSS-centered scale model (with optional scale uncertainty)
Description
Implements a total-sum-scaling (TSS)-centered variant of the default
scale-uncertainty model described in Nixon et al. (Beyond Normalizations /
scale-uncertainty framework). Unlike clr.sm, which is centered
on the CLR-implied scale, this model is centered on the TSS assumption that
there is no systematic change in scale across.
Usage
tss.sm(X, logComp, gamma = 0.5)
Arguments
X |
A numeric design matrix passed internally by |
logComp |
A numeric array of Monte Carlo log-compositions with
dimensions |
gamma |
Non-negative scalar. Standard deviation of the Gaussian
perturbation applied to the scale-model coefficients (in log2 space).
|
Details
Scale uncertainty is introduced via an additive Gaussian perturbation on the
(log2) fixed effects. For each Monte Carlo draw, a coefficient vector is
sampled as b^{(m)} \sim N(0, \gamma^2 I), and the per-sample log2 scale
is computed as b^{(m)T} X. Larger values of gamma correspond to
weaker confidence in the TSS-centered assumption (more allowed scale
variation); gamma = 0 yields no scale variation beyond the model
center.
Note: logComp is included to match the ALDEx3 scale-model interface
and to determine nsample, but it is not otherwise used by this model.
Value
A numeric matrix of dimension N x nsample giving Monte Carlo
draws of the log2 scale for each sample (rows) across nsample draws
(columns).
Author(s)
Justin Silverman
References
Nixon G, Gloor GB, Silverman JD (2025). "Incorporating scale uncertainty in microbiome and gene expression analysis as an extension of normalization". Genome Biology. doi:10.1186/s13059-025-03609-3