Introduction

The R package RNAseqNet can be used to infer networks from RNA-seq expression data. The count data are given as a \(n \times p\) matrix in which \(n\) is the number of individuals and \(p\) the number of genes. This matrix is denoted by \(\mathbf{X}\) in the sequel.

Eventually, the RNA-seq dataset is complemented with an \(n' \times d\) matrix, \(\mathbf{Y}\) which can be used to impute missing individuals in \(\mathbf{X}\) as described in [Imbert et al., 2018]1.

Dataset description

Two datasets are available in the package: lung and thyroid with \(n = 221\) rows and respectively 100 and 50 columns. The raw data were downloaded from https://gtexportal.org/. The TMM normalisation of RNA-seq expression was performed with the R package edgeR.

Data are loaded with:

data(lung)
boxplot(log2(lung + 1), las = 3, cex.names = 0.5)

data(thyroid)
boxplot(log2(thyroid + 1), las = 3, cex.names = 0.5)

Network inference from RNA-seq data

Network inference from RNA-seq data is performed with the Poisson GLM model described in [Allen and Liu, 2012]2. The inference can be performed with the function GLMnetwork as follows:

lambdas <- 4 * 10^(seq(0, -2, length = 10))
ref_lung <- GLMnetwork(lung, lambdas = lambdas)

The entry path of ref_lung contains length(lambdas) = 10 matrices with estimated coefficients. Each matrix is a square matrix with ncol(lung) = 100 rows and columns.

The choice of the most appropriate value for \(\lambda\) can be performed with the StARS criterion of [Liu et al., 2010]3, which is implemented in the function stabilitySelection. The argument B is used to specify the number of re-sampling used to compute the stability criterion:

set.seed(11051608)
stability_lung <- stabilitySelection(lung, lambdas = lambdas, B = 50)
plot(stability_lung)

The entry best of stability_lung is the index of the chosen \(\lambda\) in lambdas. Here, the value \(\lambda=\) lambdas[stability_lung$best] = 0.3097055 is chosen.

The corresponding set of estimated coefficients, is in ref_lung$path[[stability_lung$best]] and can be transformed into a network with the function GLMnetToGraph:

lung_refnet <- GLMnetToGraph(ref_lung$path[[stability_lung$best]])
print(lung_refnet)
## IGRAPH 488850b UN-- 100 454 -- 
## + attr: name (v/c)
## + edges from 488850b (vertex names):
##  [1] MT-CO1--MT-ND4       MT-CO1--SFTPA2       MT-CO1--hsa-mir-6723
##  [4] MT-CO1--A2M          MT-CO1--MT-CO2       MT-CO1--MT-CO3      
##  [7] MT-CO1--MT-RNR2      MT-CO1--MT-ATP6      MT-CO1--MT-ND1      
## [10] MT-CO1--MTND2P28     MT-CO1--MT-ND6       MT-CO1--SAT1        
## [13] MT-CO1--HSPB1        MT-CO1--MTND4P12     MT-CO1--HLA-DRA     
## [16] MT-CO1--MTND1P23     SFTPB --SFTPA2       SFTPB --FN1         
## [19] SFTPB --B2M          SFTPB --NEAT1        SFTPB --SLC34A2     
## [22] SFTPB --PGC          SFTPB --CTSD         SFTPB --TSC22D3     
## + ... omitted several edges
set.seed(1243)
plot(lung_refnet, vertex.size = 5, vertex.color = "orange", 
     vertex.frame.color = "orange", vertex.label.cex = 0.5, 
     vertex.label.color = "black")

Network inference with an auxiliary dataset

In this section, we artificially remove some of the observations in lung to create missing individuals (as compared to those in thyroid):

set.seed(1717)
nobs <- nrow(lung)
miss_ind <- sample(1:nobs, round(0.2 * nobs), replace = FALSE)
lung[miss_ind, ] <- NA
lung <- na.omit(lung)
boxplot(log2(lung + 1), las = 3, cex.names = 0.5)

The method described in [Imbert et al., 2018] is thus used to infer a network for lung expression data, imputing missing individuals from the information provided between gene expressions by the thyroid dataset.

The first step of the method is to choose a relevant value for the donor list parameter, \(\sigma\). This is done computing \(V_{\textrm{intra}}\), the intra-variability in donor pool, for various values of \(\sigma\). An elbow rule is thus used to choose an appropriate value:

sigmalist <- 1:5
sigma_stats <- chooseSigma(lung, thyroid, sigmalist)
p <- ggplot(sigma_stats, aes(x = sigma, y = varintra)) + geom_point() +
  geom_line() + theme_bw() + 
  ggtitle(expression("Evolution of intra-pool homogeneity versus" ~ sigma)) +
  xlab(expression(sigma)) + ylab(expression(V[intra])) +
  theme(title = element_text(size = 10))
print(p)

Here, \(\sigma = 2\) is chosen. Finally, hd-MI is processed with the chosen \(\sigma\), a list of regularization parameters \(\lambda\) that are selected with with the StARS criterion (from B = 10 subsamples) in m = 100 replicates of the inference, all performed on a different imputed dataset. The function imputedGLMnetwork is the one implementing the full method. The distribution of edge frequency among the m = 100 inferred network is obtained with the function plot applied to the result of this function.

set.seed(16051244)
lung_hdmi <- imputedGLMnetwork(lung, thyroid, sigma = 2, lambdas = lambdas,
                               m = 100, B = 10)
plot(lung_hdmi)

Finally, the final graph is extracted using the function GLMnetToGraph on the result of the function ``imputedGLMnetwork``` and providing a threshold for edge frequency prediction.

lung_net <- GLMnetToGraph(lung_hdmi, threshold = 0.9)
lung_net
## IGRAPH a9aad3a UN-- 100 132 -- 
## + attr: name (v/c)
## + edges from a9aad3a (vertex names):
##  [1] MT-CO1      --MT-ND4        MT-CO1      --MT-CO2       
##  [3] MT-CO1      --MT-CO3        MT-CO1      --MT-RNR2      
##  [5] MT-CO1      --MT-ND1        SFTPB       --SLC34A2      
##  [7] SFTPB       --PGC           SFTPB       --NAPSA        
##  [9] SFTPB       --ABCA3         SFTPC       --SFTPA1       
## [11] MT-ND4      --MT-CYB        MT-ND4      --MT-ND4L      
## [13] SFTPA2      --SFTPA1        SFTPA2      --NAPSA        
## [15] hsa-mir-6723--MTATP6P1      hsa-mir-6723--RP5-857K21.11
## + ... omitted several edges
set.seed(1605)
plot(lung_net, vertex.size = 5, vertex.color = "orange", 
     vertex.frame.color = "orange", vertex.label.cex = 0.5, 
     vertex.label.color = "black")

Session information

Here is the output of sessionInfo() on the system on which this document was compiled:

## R version 3.6.3 (2020-02-29)
## Platform: x86_64-pc-linux-gnu (64-bit)
## Running under: Ubuntu 18.04.4 LTS
## 
## Matrix products: default
## BLAS:   /usr/lib/x86_64-linux-gnu/blas/libblas.so.3.7.1
## LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.7.1
## 
## locale:
##  [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
##  [3] LC_TIME=fr_FR.UTF-8        LC_COLLATE=en_US.UTF-8    
##  [5] LC_MONETARY=fr_FR.UTF-8    LC_MESSAGES=en_US.UTF-8   
##  [7] LC_PAPER=fr_FR.UTF-8       LC_NAME=C                 
##  [9] LC_ADDRESS=C               LC_TELEPHONE=C            
## [11] LC_MEASUREMENT=fr_FR.UTF-8 LC_IDENTIFICATION=C       
## 
## attached base packages:
## [1] stats     graphics  grDevices utils     datasets  methods   base     
## 
## other attached packages:
## [1] RNAseqNet_0.1.3 ggplot2_3.2.0  
## 
## loaded via a namespace (and not attached):
##  [1] tidyselect_0.2.5  xfun_0.8          purrr_0.3.2      
##  [4] splines_3.6.3     lattice_0.20-40   hot.deck_1.1-1   
##  [7] colorspace_1.4-1  generics_0.0.2    htmltools_0.3.6  
## [10] yaml_2.2.0        pan_1.6           survival_3.1-8   
## [13] rlang_0.4.0       jomo_2.6-8        pillar_1.4.2     
## [16] nloptr_1.2.1      glue_1.3.1        withr_2.1.2      
## [19] foreach_1.4.4     stringr_1.4.0     munsell_0.5.0    
## [22] gtable_0.3.0      codetools_0.2-16  evaluate_0.14    
## [25] labeling_0.3      knitr_1.23        parallel_3.6.3   
## [28] broom_0.5.2       Rcpp_1.0.1        scales_1.0.0     
## [31] backports_1.1.4   lme4_1.1-21       digest_0.6.20    
## [34] stringi_1.4.3     dplyr_0.8.3       grid_3.6.3       
## [37] tools_3.6.3       magrittr_1.5      lazyeval_0.2.2   
## [40] glmnet_2.0-18     tibble_2.1.3      mice_3.6.0       
## [43] crayon_1.3.4      tidyr_0.8.3       pkgconfig_2.0.2  
## [46] MASS_7.3-51.5     Matrix_1.2-18     assertthat_0.2.1 
## [49] minqa_1.2.4       rmarkdown_1.14    PoiClaClu_1.0.2.1
## [52] iterators_1.0.10  rpart_4.1-15      mitml_0.3-7      
## [55] R6_2.4.0          boot_1.3-24       igraph_1.2.4.1   
## [58] nnet_7.3-12       nlme_3.1-144      compiler_3.6.3

  1. Imbert, A., Valsesia, A., Le Gall, C., Armenise, C., Lefebvre, G., Gourraud, P.A., Viguerie, N. and Villa-Vialaneix, N. (2018) Multiple hot-deck imputation for network inference from RNA sequencing data. Bioinformatics. DOI: http://dx.doi.org/10.1093/bioinformatics/btx819

  2. Allen, G. and Liu, Z. (2012) A log-linear model for inferring genetic networks from high-throughput sequencing data. In Proceedings of IEEE International Conference on Bioinformatics and Biomedecine (BIBM)

  3. Liu, H., Roeber, K. and Wasserman, L. (2010) Stability approach to regularization selection (StARS) for high dimensional graphical models. In Proceedings of Neural Information Processing Systems (NIPS 2010), 23, 1432-1440, Vancouver, Canada.