MixedTypeData

Zhiwen Tan

Introduction

BCClong is an R package for performing Bayesian Consensus Clustering (BCC) model for clustering continuous, discrete and categorical longitudinal data, which are commonly seen in many clinical studies. This document gives a tour of BCClong package.

see help(package = "BCClong") for more information and references provided by citation("BCClong")

To download BCClong, use the following commands:

require("devtools")
devtools::install_github("ZhiwenT/BCClong", build_vignettes = TRUE)
library("BCClong")

To list all functions available in this package:

ls("package:BCClong")

Components

Currently, there are 5 function in this package which are BCC.multi, BayesT, model.selection.criteria, traceplot, trajplot.

BCC.multi function performs clustering on mixed-type (continuous, discrete and categorical) longitudinal markers using Bayesian consensus clustering method with MCMC sampling and provide a summary statistics for the computed model. This function will take in a data set and multiple parameters and output a BCC model with summary statistics.

BayesT function assess the model goodness of fit by calculate the discrepancy measure T(, ) with following steps (a) Generate T.obs based on the MCMC samples (b) Generate T.rep based on the posterior distribution of the parameters (c) Compare T.obs and T.rep, and calculate the P values.

model.selection.criteria function calculates DIC and WAIC for the fitted model traceplot function visualize the MCMC chain for model parameters trajplot function plot the longitudinal trajectory of features by local and global clustering

Pre-process (Setting up)

In this example, the PBCseq data in the mixAK package was used as it is a public data set. The variables used here include lbili, platelet, and spiders. Of these three variables, lbili and platelet are continuous variables, while spiders are categorical variables.

library(BCClong)
library(mixAK)
data(PBC910)

Fit BCC Model Using BCC.multi Function

Here, We used a binomial distribution for spiders marker, a gaussian distribution for the lbili marker and poisson distribution for platelet, respectively. The number of clusters was set to 2. All hyper parameters were set to default.

We ran the model with 12,000 iterations, discard the first 2,000 sample, and kept every 10th sample. This resulted in 1,000 samples for each model parameter. The MCMC sampling process took about 30 minutes on an AMD Ryzen\(^{TM}\) 5 5600X desktop computer.

Since this program takes a long time to run, here we will use the pre-compile result in this example.

set.seed(89)
fit.BCC2 <- BCC.multi(
    mydat = list(PBC910$lbili,PBC910$platelet,PBC910$spiders),
    dist = c("gaussian","poisson","binomial"),
    id = list(PBC910$id),
    time = list(PBC910$month),
    formula =list(y ~ time + (1|id),y ~ time + (1|id), y ~ time + (1|id)),
    num.cluster = 2,
    burn.in = 100,            
    thin = 10,                  
    per = 10,                     
    max.iter = 200) 

To run the pre-compiled result, you can use data(PBCseqfit) to attach the fitted model.

# pre-compiled result
data(PBCseqfit)
fit.BCC2 <- PBCseqfit

Printing Summary Statistics for key model parameters

To print the BCC model

print(fit.BCC2)

To print the summary statistics for all parameters

summary(fit.BCC2)

To print the proportion for each cluster (mean, sd, 2.5% and 97.5% percentile) geweke statistics (geweke.stat) between -2 and 2 suggests the parameters converge

fit.BCC2$summary.stat$PPI

The code below prints out all major parameters

summary(fit.BCC2)
#> Total number of individual:
#> [1] 260
#> 
#> Number of features:
#> [1] 3
#> 
#> Cluster proportions statistics for global clusters:
#>        V1               V2          
#>  Min.   :0.8496   Min.   :0.005276  
#>  1st Qu.:0.8762   1st Qu.:0.043312  
#>  Median :0.9371   Median :0.062932  
#>  Mean   :0.9216   Mean   :0.078430  
#>  3rd Qu.:0.9567   3rd Qu.:0.123799  
#>  Max.   :0.9947   Max.   :0.150384  
#> 
#> Globle clusters table:
#> 
#>   1 
#> 260 
#> 
#> Adherence parameters statistics by feature:
#>        V1               V2               V3        
#>  Min.   :0.5007   Min.   :0.5001   Min.   :0.7772  
#>  1st Qu.:0.5021   1st Qu.:0.5089   1st Qu.:0.7902  
#>  Median :0.5030   Median :0.5115   Median :0.7959  
#>  Mean   :0.5042   Mean   :0.5175   Mean   :0.8078  
#>  3rd Qu.:0.5041   3rd Qu.:0.5237   3rd Qu.:0.8275  
#>  Max.   :0.5118   Max.   :0.5583   Max.   :0.8576  
#> 
#> Local clusters statistics by feature:
#> Cluster statistics for feature 1 :
#> , , 1
#> 
#>                    [,1]        [,2]
#> mean         1.46526216 -0.14926529
#> sd           0.03777137  0.01364611
#> 2.5%         1.40888462 -0.16946273
#> 97.5%        1.51895826 -0.13096229
#> geweke.stat -2.10391018 -1.75995576
#> 
#> , , 2
#> 
#>                    [,1]          [,2]
#> mean        0.014619346  0.0034315243
#> sd          0.002271399  0.0016688712
#> 2.5%        0.010935934  0.0008741408
#> 97.5%       0.017225695  0.0056475790
#> geweke.stat 0.822043653 -1.7521732959
#> 
#> Cluster statistics for feature 2 :
#> , , 1
#> 
#>                    [,1]         [,2]
#> mean         5.77898717  5.264695558
#> sd           0.01067935  0.007645266
#> 2.5%         5.75976591  5.252509447
#> 97.5%        5.79323839  5.276789296
#> geweke.stat -8.63291491 -6.594456992
#> 
#> , , 2
#> 
#>                      [,1]          [,2]
#> mean        -0.0016958877 -0.0077753262
#> sd           0.0002249505  0.0001821189
#> 2.5%        -0.0020175319 -0.0080937406
#> 97.5%       -0.0013641913 -0.0075888663
#> geweke.stat  0.2202520953  0.9780291395
#> 
#> Cluster statistics for feature 3 :
#> , , 1
#> 
#>                   [,1]       [,2]
#> mean        -2.4027089  1.7287061
#> sd           0.2711898  0.5153692
#> 2.5%        -2.9158800  1.2182696
#> 97.5%       -2.0870933  2.6400038
#> geweke.stat -6.4887491 -2.3314718
#> 
#> , , 2
#> 
#>                    [,1]         [,2]
#> mean        0.030690608  0.024356269
#> sd          0.017911688  0.018026932
#> 2.5%        0.006128959 -0.003560592
#> 97.5%       0.056890173  0.050779581
#> geweke.stat 1.272508080 -0.934336018
#> 
#> 
#> Variance-covariance matrix statistics for random effects by feature:
#> Variance-covariance matrix statistics for feature 1 :
#> , , 1
#> 
#>                      [,1]
#> mean         3.825797e-05
#> sd           4.405348e-06
#> 2.5%         3.241441e-05
#> 97.5%        4.558949e-05
#> geweke.stat -1.102672e+00
#> 
#> , , 2
#> 
#>                      [,1]
#> mean         2.373843e-05
#> sd           3.102229e-06
#> 2.5%         1.956378e-05
#> 97.5%        2.803884e-05
#> geweke.stat -9.137537e-01
#> 
#> Variance-covariance matrix statistics for feature 2 :
#> , , 1
#> 
#>                      [,1]
#> mean         0.0016138846
#> sd           0.0005110166
#> 2.5%         0.0010801860
#> 97.5%        0.0024609516
#> geweke.stat -3.6092736173
#> 
#> , , 2
#> 
#>                     [,1]
#> mean         0.011132440
#> sd           0.002629614
#> 2.5%         0.007708163
#> 97.5%        0.014974559
#> geweke.stat -7.067780574
#> 
#> Variance-covariance matrix statistics for feature 3 :
#> , , 1
#> 
#>                     [,1]
#> mean        4.569681e-05
#> sd          6.677545e-06
#> 2.5%        3.448587e-05
#> 97.5%       5.383768e-05
#> geweke.stat 5.774842e-01
#> 
#> , , 2
#> 
#>                      [,1]
#> mean         7.637431e-05
#> sd           2.080340e-05
#> 2.5%         5.194423e-05
#> 97.5%        1.147128e-04
#> geweke.stat -1.853548e+00
#> 
#> 
#> Residual variance of continuous features statistics by feature:
#> Residual variance statistics for feature 1 :
#>                     [,1]         [,2]
#> mean         0.344304859  0.344304859
#> sd           0.018691680  0.018691680
#> 2.5%         0.314145269  0.314145269
#> 97.5%        0.365692696  0.365692696
#> geweke.stat -0.003239043 -0.003239043
#> Residual variance statistics for feature 2 :
#> NULL
#> Residual variance statistics for feature 3 :
#> NULL
#> 
#> Local clusters tables by feature:
#> Clusters table for feature 1 :
#> 
#>   1   2 
#>  86 174 
#> Clusters table for feature 2 :
#> 
#>   1   2 
#> 126 134 
#> Clusters table for feature 3 :
#> 
#>   1   2 
#> 198  62

Visualize Clusters

Generic plot can be used on BCC object, all relevant plots will be generate one by one using return key

plot(fit.BCC2)

We can use the traceplot function to plot the MCMC process and the trajplot function to plot the trajectory for each feature.

gp1 <- trajplot(fit=fit.BCC2,feature.ind=1,which.cluster = "local.cluster",
            title= bquote(paste("Local Clustering (",hat(alpha)[1] ==.(round(fit.BCC2$alpha[1],2)),")")),
            xlab="months",ylab="lbili",color=c("#00BA38", "#619CFF"))
gp2 <- trajplot(fit=fit.BCC2,feature.ind=2,which.cluster = "local.cluster",
            title= bquote(paste("Local Clustering (",hat(alpha)[2] ==.(round(fit.BCC2$alpha[2],2)),")")),
            xlab="months",ylab="platelet",color=c("#00BA38", "#619CFF"))
gp3 <- trajplot(fit=fit.BCC2,feature.ind=3,which.cluster = "local.cluster",
            title= bquote(paste("Local Clustering (",hat(alpha)[3] ==.(round(fit.BCC2$alpha[3],2)),")")),
            xlab="months",ylab="spiders",color=c("#00BA38", "#619CFF"))
gp4 <- trajplot(fit=fit.BCC2,feature.ind=1,which.cluster = "global.cluster",
            title="Global Clustering",
            xlab="months",ylab="lbili",color=c("#00BA38", "#619CFF"))
gp5 <- trajplot(fit=fit.BCC2,feature.ind=2,which.cluster = "global.cluster",
            title="Global Clustering",
            xlab="months",ylab="platelet",color=c("#00BA38", "#619CFF"))
gp6 <- trajplot(fit=fit.BCC2,feature.ind=3,which.cluster = "global.cluster",
            title="Global Clustering",
            xlab="months",ylab="spiders",color=c("#00BA38", "#619CFF"))

library(cowplot)
#dev.new(width=180, height=120)
plot_grid(gp1, gp2,gp3,gp4,gp5,gp6, 
          labels=c("(A)", "(B)", "(C)", "(D)", "(E)", "(F)"), ncol = 3,   align = "v" )

Package References

Tan, Z., Shen, C., Lu, Z. (2022) BCClong: an R package for performing Bayesian Consensus Clustering model for clustering continuous, discrete and categorical longitudinal data.

sessionInfo()
#> R version 4.3.2 (2023-10-31 ucrt)
#> Platform: x86_64-w64-mingw32/x64 (64-bit)
#> Running under: Windows 11 x64 (build 22631)
#> 
#> Matrix products: default
#> 
#> 
#> locale:
#> [1] LC_COLLATE=C                          
#> [2] LC_CTYPE=English_United States.utf8   
#> [3] LC_MONETARY=English_United States.utf8
#> [4] LC_NUMERIC=C                          
#> [5] LC_TIME=English_United States.utf8    
#> 
#> time zone: America/Toronto
#> tzcode source: internal
#> 
#> attached base packages:
#> [1] stats     graphics  grDevices utils     datasets  methods   base     
#> 
#> other attached packages:
#>  [1] mixAK_5.7        lme4_1.1-35.1    Matrix_1.6-5     colorspace_2.1-0
#>  [5] cowplot_1.1.3    ggplot2_3.5.0    joineRML_0.4.6   survival_3.5-7  
#>  [9] nlme_3.1-163     BCClong_1.0.3   
#> 
#> loaded via a namespace (and not attached):
#>  [1] gtable_0.3.4         xfun_0.41            bslib_0.6.1         
#>  [4] lattice_0.21-9       vctrs_0.6.4          tools_4.3.2         
#>  [7] generics_0.1.3       stats4_4.3.2         parallel_4.3.2      
#> [10] tibble_3.2.1         fansi_1.0.6          highr_0.10          
#> [13] cluster_2.1.4        pkgconfig_2.0.3      lifecycle_1.0.4     
#> [16] farver_2.1.1         compiler_4.3.2       MatrixModels_0.5-3  
#> [19] mcmc_0.9-8           munsell_0.5.0        mnormt_2.1.1        
#> [22] combinat_0.0-8       fastGHQuad_1.0.1     codetools_0.2-19    
#> [25] SparseM_1.81         quantreg_5.97        htmltools_0.5.7     
#> [28] sass_0.4.8           evd_2.3-6.1          yaml_2.3.8          
#> [31] gmp_0.7-4            pillar_1.9.0         nloptr_2.0.3        
#> [34] jquerylib_0.1.4      MASS_7.3-60          randtoolbox_2.0.4   
#> [37] truncdist_1.0-2      cachem_1.0.8         iterators_1.0.14    
#> [40] foreach_1.5.2        boot_1.3-28.1        abind_1.4-5         
#> [43] mclust_6.0.1         tidyselect_1.2.0     digest_0.6.34       
#> [46] mvtnorm_1.2-4        LaplacesDemon_16.1.6 dplyr_1.1.4         
#> [49] labeling_0.4.3       splines_4.3.2        fastmap_1.1.1       
#> [52] grid_4.3.2           cli_3.6.1            magrittr_2.0.3      
#> [55] cobs_1.3-7           label.switching_1.8  utf8_1.2.4          
#> [58] withr_3.0.0          Rmpfr_0.9-5          scales_1.3.0        
#> [61] rmarkdown_2.25       rngWELL_0.10-9       nnet_7.3-19         
#> [64] gridExtra_2.3        coda_0.19-4.1        evaluate_0.23       
#> [67] lpSolve_5.6.20       knitr_1.45           doParallel_1.0.17   
#> [70] mgcv_1.9-0           rlang_1.1.1          MCMCpack_1.7-0      
#> [73] Rcpp_1.0.12          glue_1.6.2           rstudioapi_0.15.0   
#> [76] minqa_1.2.6          jsonlite_1.8.8       R6_2.5.1