# Introduction to flashier

In addition to flashier and ggplot2, we will make use of the cowplot package below to arrange plots into grids:

library(flashier)
library(ggplot2)
library(cowplot)

## The EBMF Model

Empirical Bayes matrix factorization (Wang and Stephens, 2021) decomposes a data matrix $$Y \in \mathbb{R}^{n \times p}$$ as $Y = LF' + E,$ with “loadings” $$L \in \mathbb{R}^{n \times K}$$, “factors” $$F \in \mathbb{R}^{p \times K}$$, and residual errors $e_{ij} \sim N(0, 1 / \tau_{ij}).$ The model puts priors on each factor and each set of loadings: $f_{\cdot k} \sim g^{(f)}_k,\ \ell_{\cdot k} \sim g^{(\ell)}_k,$ with $$g^{(\ell)}_k$$ and $$g^{(f)}_k$$ assumed to belong to some families of distributions $$\mathcal{G}^{(\ell)}$$ and $$\mathcal{G}^{(f)}$$ and estimated using the data. The default choice of prior family for both factors and loadings is the family of point-normal distributions: $g \sim \pi_0 \delta_0 + (1 - \pi_0) N(0, \sigma^2),$ where both $$\pi_0$$ and $$\sigma^2$$ are free parameters. This family is especially useful when factors and loadings can be expected to be at least modestly sparse; in other settings, different prior families might be preferred.

To avoid over-parametrization, it is necessary to make some assumptions about the precision parameters $$\tau_{ij}$$. The default assumption is that all $$\tau_{ij}$$s are equal: $e_{ij} \sim N(0, 1 / \tau).$

Note that when the prior families $$\mathcal{G}^{(\ell)}$$ and $$\mathcal{G}^{(f)}$$ are closed under scaling (as is typically the case), then the model as formulated above is not identifiable, since we can scale each set of loadings $$\ell_{\cdot k}$$ by any constant $$d_k \ne 0$$ if we also scale factor $$f_{\cdot k}$$ by $$1 / d_k$$.
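This scale indeterminacy is easy to check numerically. The following base-R snippet (independent of flashier) rescales one loadings/factors column pair and verifies that the implied mean matrix is unchanged:

```r
set.seed(1)
L  <- matrix(rnorm(6), nrow = 3)  # 3 x 2 loadings matrix
FF <- matrix(rnorm(8), nrow = 4)  # 4 x 2 factors matrix
# Scale loadings column 1 by d = 2 and factors column 1 by 1 / d:
L2  <- L;  L2[, 1]  <- 2 * L2[, 1]
FF2 <- FF; FF2[, 1] <- FF2[, 1] / 2
# The implied mean matrix L F' is unchanged:
all.equal(L %*% t(FF), L2 %*% t(FF2))
#> [1] TRUE
```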

We can make the matrix factorization unique by writing $Y = LDF' + E,$ with the scales of loadings $$\ell_{\cdot 1}, \ldots, \ell_{\cdot K}$$ and factors $$f_{\cdot 1}, \ldots, f_{\cdot K}$$ constrained in some fashion (for example, by requiring $$\| \ell_{\cdot k} \|_2 = 1$$ and $$\| f_{\cdot k} \|_2 = 1$$ for all $$k$$). We refer to such a factorization as an LDF factorization.
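An LDF factorization can be constructed from any $$L$$ and $$F$$ by pulling the column norms out into a diagonal scaling. The following base-R illustration (not flashier's own implementation, which is available via function ldf()) uses the 2-norm constraint mentioned above:

```r
set.seed(1)
L  <- matrix(rnorm(10), nrow = 5)  # 5 x 2 loadings
FF <- matrix(rnorm(8),  nrow = 4)  # 4 x 2 factors
# The product of column 2-norms becomes the diagonal scaling D:
d <- sqrt(colSums(L^2)) * sqrt(colSums(FF^2))
L_norm  <- sweep(L,  2, sqrt(colSums(L^2)),  "/")  # unit 2-norm columns
FF_norm <- sweep(FF, 2, sqrt(colSums(FF^2)), "/")  # unit 2-norm columns
# L D F' reproduces the original product L F':
all.equal(L %*% t(FF), L_norm %*% diag(d) %*% t(FF_norm))
#> [1] TRUE
```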

## The gtex dataset and the flash() function

Throughout this vignette, we will use the main flash() function to produce a number of EBMF fits to dataset gtex, which comprises a subset of the data used as the primary example in Wang and Stephens (2021). The dataset is derived from data made available by the Genotype Tissue Expression (GTEx) project (Lonsdale et al. 2013), which provides $$z$$-scores for assessing the significance of effects of genetic variants (“single nucleotide polymorphisms” or SNPs) on gene expression across 44 human tissues. To reduce the data to a more manageable size, Urbut et al. (2019) chose the “top” SNP for each gene — that is, the SNP associated with the largest (absolute) $$z$$-score over all 44 tissues. This process yields a 16,069 by 44 matrix of $$z$$-scores, with rows corresponding to SNP-gene pairs or “eQTLs” and columns corresponding to tissues. The gtex dataset included in flashier is further subsampled down to 1000 rows.
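The dimensions of the subsampled dataset can be checked directly:

```r
library(flashier)
data(gtex)
dim(gtex)
#> [1] 1000   44
```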

One potential aim in applying matrix factorization methods to the gtex dataset is to analyze patterns of effect sharing. For example, Wang and Stephens (2021) point out an EBMF factor that has large values across brain tissues (cortex, cerebellum, basal ganglia, etc.), but values much closer to zero for all non-brain tissues. We can expect eQTLs that are loaded on this factor to affect brain and non-brain tissues much differently. Other factors have large values for a single tissue only (e.g., testis). We can expect eQTLs loaded on such factors to be more or less tissue-specific.

To fit the EBMF model, we use function flash(), which adds up to greedy_Kmax factor/loadings pairs using the “greedy” algorithm described in Wang and Stephens (2021). In brief, the algorithm adds new factor/loadings pairs one at a time, optimizing each pair as it gets added without returning to re-optimize previously added pairs. In other words, it finds the best rank-one fit; then the best rank-two fit with the first factor/loadings pair fixed; then the best rank-three fit with the first two factor/loadings pairs fixed; and so on. The algorithm stops adding factor/loadings pairs when they fail to yield an increase in the variational lower bound on the log likelihood (or ELBO), and in this sense EBMF can “automatically” select the number of factor/loadings pairs $$K$$. Thus it is often sufficient simply to set greedy_Kmax as high as is practically manageable (the default is greedy_Kmax = 50). Here, however, we will restrict the number of factor/loadings pairs in order to ensure that the vignette can be run quickly:

data(gtex, gtex_colors)
fit_default <- flash(gtex, greedy_Kmax = 5)
#> Adding factor 1 to flash object...
#> Adding factor 2 to flash object...
#> Adding factor 3 to flash object...
#> Adding factor 4 to flash object...
#> Adding factor 5 to flash object...
#> Wrapping up...
#> Done.
#> Nullchecking 5 factors...
#> Done.

By default, flash() performs a “nullcheck” after the greedy algorithm terminates: that is, it loops over each greedily added factor/loadings pair and checks that removing it does not produce an increase in the ELBO (i.e., a better fit). If it does, then the factor/loadings pair is removed from the final fit.
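The nullcheck can also be skipped. Assuming a nullcheck argument in flash()'s interface (an assumption on our part; check the flashier documentation for your installed version), this would look like:

```r
# Hypothetical usage: disable the post-hoc nullcheck.
# (`nullcheck` is our assumption about the flash() interface.)
fit_no_nullcheck <- flash(gtex, greedy_Kmax = 5, nullcheck = FALSE, verbose = 0)
```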

The output of function flash() is an object of class flash. Useful information about the fit can be accessed via its list elements; for example, the ELBO can be retrieved as follows:
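Using fit_default from above (the printed value matches the point-normal ELBO reported later in this vignette):

```r
fit_default$elbo
#> [1] -83131.69
```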

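The var_type argument to flash() controls the assumption made about the precision parameters $$\tau_{ij}$$. As a sketch (the names fit_var_bycol and fit_var_kronecker are reused in the comparison that follows; the var_type values, 2 for column-specific variances and c(1, 2) for a Kronecker variance structure, reflect our reading of the flashier documentation):

```r
# Column-specific residual variances (one precision parameter per tissue);
# var_type = 2 is our understanding of the by-column setting:
fit_var_bycol <- flash(gtex, greedy_Kmax = 5, var_type = 2, verbose = 0)

# Kronecker variance structure (tau_ij factors into a row term times a
# column term); var_type = c(1, 2) is our understanding of this setting:
fit_var_kronecker <- flash(gtex, greedy_Kmax = 5, var_type = c(1, 2), verbose = 0)
```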
For a more direct comparison of the fits produced under the two variance assumptions, we can use function ldf() to put the factors on a common scale and plot them against one another:

comparison_df <- data.frame(
var_by_col = as.vector(ldf(fit_var_bycol, type = "m")$F),
var_kronecker = as.vector(ldf(fit_var_kronecker, type = "m")$F),
k = factor(rep(1:5, each = ncol(gtex)))
)
ggplot(comparison_df, aes(x = var_by_col, y = var_kronecker, col = k)) +
geom_point(alpha = 0.5) +
geom_abline(slope = 1, linetype = "dashed") +
scale_color_brewer(palette = "Set1") +
labs(x = "By-column variance assumption", y = "Kronecker variance assumption") +
theme_minimal()

## Measurement error

In many scenarios, the data $$Y$$ is observed with some known error. In such cases, it might be preferable to fit the model $Y = L F' + S + E,$ where $$S_{ij} \sim N(0, s^2_{ij})$$ with the $$s^2_{ij}$$s fixed. In simple cases, this model can almost be reduced to the model described above. For example, since the gtex dataset consists of $$z$$-scores, we might choose to set the $$s_{ij}$$s identically equal to one. With by-column precision parameters $$\tau_j$$, this yields the model $Y_{ij} \sim N \left(\sum_k \ell_{ik} f_{jk}, 1 / \tau_{j} + 1 \right).$ This is equivalent to the previously described model $Y_{ij} \sim N \left(\sum_k \ell_{ik} f_{jk}, 1 / \tilde{\tau}_{j} \right)$ provided that we add the constraints $$\tilde{\tau}_j \le 1$$ for all $$j$$. This model should yield a lower ELBO than the model with arbitrary column-specific variances (again, we write “should” because it is not guaranteed that flash() will converge to a global optimum), but it is arguably the more correct model.

fit_with_meas_err <- flash(gtex, S = 1, greedy_Kmax = 5, var_type = 2, verbose = 0)
c(bycol_elbo = fit_var_bycol$elbo, meas_err_elbo = fit_with_meas_err$elbo)
#>    bycol_elbo meas_err_elbo
#>     -80490.27     -80491.84

We note here that it is also possible to pass a matrix of standard errors $$s_{ij}$$ as argument to S. If these standard errors account for all residual variance (that is, if the matrix of “true” means can be assumed to be exactly low-rank rather than approximately so), then one can set var_type = NULL. If not, then updates to precision parameters $$\tau_{ij}$$ must be accomplished via one call to function optimize() for each precision parameter, so variance estimation can be very slow. In such cases, we generally recommend leaving S unspecified and allowing all residual variance to be estimated.

## Prior families

We next turn to the parameter ebnm_fn. Note that the argument to ebnm_fn is a function; indeed, in flashier, prior families $$\mathcal{G}^{(\ell)}_k$$ and $$\mathcal{G}^{(f)}_k$$ are not specified directly, but rather implicitly, via the functions used to solve the empirical Bayes normal means (EBNM) subproblems, which have the form $\begin{gather} x_i \sim \mathcal{N} \left( \theta_i, s_i^2 \right) \\ \theta_i \sim g \in \mathcal{G}. \end{gather}$ (Here, the $$x_i$$s and $$s_i$$s are known observations and standard errors and the $$\theta_i$$s are unknown “means.”) Effectively, ebnm_fn not only (implicitly) specifies $$\mathcal{G}$$, but also (directly) determines the method for estimating $$g \in \mathcal{G}$$. A number of useful EBNM solvers are provided by package ebnm (Willwerscheid and Stephens, 2021).

The default argument is ebnm_fn = ebnm_point_normal, which takes the prior family $$\mathcal{G}$$ (for both factors and loadings) to be the family of point-normal distributions: $g \sim \pi_0 \delta_0 + (1 - \pi_0) N(0, \sigma^2).$ In the scenario under consideration, this choice of prior family makes sense, as it is reasonable to expect some degree of sparsity in both factors and loadings: some effects might only be shared across a few related tissues (e.g., brain tissues but not other tissues), and patterns of sharing may not be ubiquitous across SNP-gene pairs (e.g., many effects may be constant across brain and non-brain tissues).

By varying ebnm_fn, we also vary our assumptions about the “true” underlying distributions of factors and loadings. For example, if we can reasonably expect each set of loadings $$\ell_{\cdot k}$$ to be heavy-tailed, then we might prefer a more flexible family of distributions, such as the family of scale mixtures of normals (ebnm_fn = ebnm_normal_scale_mixture) or the family of symmetric distributions that are unimodal at zero (ebnm_fn = ebnm_unimodal_symmetric). Note that these families are nested, so the unimodal family should yield a higher ELBO than the scale mixtures of normals, which should in turn yield a higher ELBO than the family of point-normals:

fit_normalmix <- flash(
gtex,
greedy_Kmax = 5,
ebnm_fn = ebnm_normal_scale_mixture,
verbose = 0
)
fit_unimodal <- flash(
gtex,
greedy_Kmax = 5,
ebnm_fn = ebnm_unimodal_symmetric,
verbose = 0
)
c(pn_elbo = fit_default$elbo,
normalmix_elbo = fit_normalmix$elbo,
unimodal_elbo = fit_unimodal$elbo)
#>        pn_elbo normalmix_elbo  unimodal_elbo
#>      -83131.69      -82981.84      -82978.90

It is also possible to introduce sign constraints via the choice of prior family; for example, to constrain factors to be nonnegative, we need only choose a prior family with nonnegative support such as the family of point-exponential distributions (ebnm_fn = ebnm_point_exponential) or the family of all nonnegative distributions that are unimodal at zero (ebnm_fn = ebnm_unimodal_nonnegative). This constraint can model the (reasonable) assumption that when effects are shared across tissues, they are shared in the same direction (that is, expression either increases or decreases across a number of related tissues, but does not increase in some and simultaneously decrease in others). We can constrain factors while allowing loadings to remain unconstrained by passing a list as argument to ebnm_fn:

fit_seminonnegative <- flash(
gtex,
greedy_Kmax = 5,
var_type = 2,
ebnm_fn = c(ebnm_point_normal, ebnm_point_exponential),
verbose = 0
)

The result is a “semi-nonnegative” factorization, which in the scenario under consideration yields sparse, interpretable factors (especially in comparison to the factors produced using all default arguments to flash(); see above). Here, factor 2 represents sharing among brain, pituitary, and testis tissues; factor 3, among whole blood, spleen tissue, lung tissue, and lymphocytes; and factor 5, among heart and muscle tissues. Factor 4 represents effects uniquely expressed in testis tissue:

plot(
fit_seminonnegative,
pm_which = "factors",
pm_colors = gtex_colors,
include_scree = FALSE
)

## Backfitting

The greedy algorithm optimizes each new factor/loadings pair once and never revisits it. Setting backfit = TRUE refines the greedy fit by iteratively re-optimizing the previously added factor/loadings pairs, which typically increases the ELBO at some additional computational cost:

fit_greedy <- flash(
gtex,
greedy_Kmax = 5,
var_type = 2,
ebnm_fn = c(ebnm_point_normal, ebnm_point_exponential),
backfit = FALSE,
verbose = 0
)
fit_backfit <- flash(
gtex,
greedy_Kmax = 5,
var_type = 2,
ebnm_fn = c(ebnm_point_normal, ebnm_point_exponential),
backfit = TRUE,
verbose = 0
)

c(greedy_elbo = fit_greedy$elbo, bf_elbo = fit_backfit$elbo)
#> greedy_elbo     bf_elbo
#>   -81194.23   -80893.79

Qualitatively, results can vary: for large, complex fits, the improvement can be considerable; for smaller fits, there might be few obvious differences. Interestingly, backfitting the semi-nonnegative fit from above produces a less visually appealing fit. We suspect that this occurs because 5 factor/loadings pairs is not enough to account for the structure in the data; if one increases greedy_Kmax, one obtains a more satisfying backfit.

p1 <- plot(
fit_greedy,
pm_which = "factors",
pm_colors = gtex_colors,
order_by_pve = FALSE,
include_scree = FALSE
) + ggtitle("Greedy fit")
p2 <- plot(
fit_backfit,
pm_which = "factors",
pm_colors = gtex_colors,
order_by_pve = FALSE,
include_scree = FALSE
) + ggtitle("Backfit")

plot_grid(p1, p2, nrow = 2)

We again obtain a more direct comparison of fits using function ldf(). Here it is apparent that backfitting makes factor 2 less sparse by “borrowing” from factor 1. Other factors remain very similar:

comparison_df <- data.frame(
fit_greedy = as.vector(ldf(fit_greedy, type = "m")$F),
fit_backfit = as.vector(ldf(fit_backfit, type = "m")$F),
k = factor(rep(1:5, each = ncol(gtex)))
)
ggplot(comparison_df, aes(x = fit_greedy, y = fit_backfit, col = k)) +
geom_point(alpha = 0.5) +
geom_abline(slope = 1, linetype = "dashed") +
scale_color_brewer(palette = "Set1") +
labs(x = "Greedy fit", y = "Backfit") +
theme_minimal() 

## Sampling from the posterior

One of the list elements in the object returned by flash() is a function that samples from the posterior distributions on loadings and factors. Take the backfit above as an example. To better understand which tissues are bound up with whole blood effects, we might like confidence intervals for the third factor. We construct 95% confidence intervals by taking 200 samples from the posterior and then finding the 2.5% and 97.5% quantiles:

# Set seed for reproducibility.
set.seed(1)
# Use returned sampler to sample from posterior.
samp <- fit_backfit$sampler(nsamp = 200)
# Only keep factor 3.
factor3_samp <- lapply(samp, function(x) x[[2]][, 3])
# Normalize each sample so that its largest absolute value is one.
factor3_samp <- sapply(factor3_samp, function(x) x / max(abs(x)))
# Get 95% confidence intervals.
factor3_ci <- apply(factor3_samp, 1, quantile, c(0.025, 0.975))

Since the plot method returns a ggplot2 object, it can be customized using ggplot syntax, making the addition of error bars a simple task:

plot(fit_backfit, kset = 3, pm_colors = gtex_colors, include_scree = FALSE) +
geom_errorbar(aes(ymin = factor3_ci[1, ], ymax = factor3_ci[2, ]))

Confidence intervals for spleen tissue, lung tissue, and lymphocytes do not contain zero, so we can be reasonably confident that this pattern of sharing is “real” and not just an artefact of the data. We note, however, that empirical Bayes methods are known to underestimate uncertainty (since they do not account for uncertainty in the estimates of the priors $$g \in \mathcal{G}$$), so we offer this last conclusion with a large grain of salt.

## Session information

The following R version and packages were used to generate this vignette:

sessionInfo()
#> R version 4.2.1 (2022-06-23)
#> Platform: aarch64-apple-darwin20 (64-bit)
#> Running under: macOS Monterey 12.3
#>
#> Matrix products: default
#> BLAS:   /Library/Frameworks/R.framework/Versions/4.2-arm64/Resources/lib/libRblas.0.dylib
#> LAPACK: /Library/Frameworks/R.framework/Versions/4.2-arm64/Resources/lib/libRlapack.dylib
#>
#> locale:
#> [1] C/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
#>
#> attached base packages:
#> [1] stats     graphics  grDevices utils     datasets  methods   base
#>
#> other attached packages:
#> [1] dplyr_1.1.3    cowplot_1.1.1  ggplot2_3.4.3  flashier_1.0.7 magrittr_2.0.3
#> [6] ebnm_1.1-2
#>
#> loaded via a namespace (and not attached):
#>  [1] softImpute_1.4-1   tidyselect_1.2.0   xfun_0.40          bslib_0.5.1
#>  [5] purrr_1.0.2        ashr_2.2-63        splines_4.2.1      lattice_0.21-9
#>  [9] colorspace_2.1-0   vctrs_0.6.3        generics_0.1.3     htmltools_0.5.6.1
#> [13] yaml_2.3.7         utf8_1.2.3         rlang_1.1.1        mixsqp_0.3-48
#> [17] jquerylib_0.1.4    pillar_1.9.0       withr_2.5.1        glue_1.6.2
#> [21] RColorBrewer_1.1-3 trust_0.1-8        lifecycle_1.0.3    stringr_1.5.0
#> [25] munsell_0.5.0      gtable_0.3.4       evaluate_0.22      labeling_0.4.3
#> [29] knitr_1.44         fastmap_1.1.1      invgamma_1.1       irlba_2.3.5.1
#> [33] parallel_4.2.1     fansi_1.0.5        Rcpp_1.0.11        scales_1.2.1
#> [37] cachem_1.0.8       horseshoe_0.2.0    jsonlite_1.8.7     truncnorm_1.0-9
#> [41] farver_2.1.1       deconvolveR_1.2-1  digest_0.6.33      stringi_1.7.12
#> [45] grid_4.2.1         cli_3.6.1          tools_4.2.1        sass_0.4.7
#> [49] tibble_3.2.1       tidyr_1.3.0        pkgconfig_2.0.3    Matrix_1.6-1.1
#> [53] SQUAREM_2021.1     rmarkdown_2.25     rstudioapi_0.15.0  R6_2.5.1
#> [57] compiler_4.2.1