Quick Start with the baorista R package

Enrico Crema

2024-08-19

Introduction

Baorista is an R package that provides robust alternatives to aoristic analyses based on parametric and non-parametric Bayesian inference. The package consists of a series of utility and wrapper functions as well as custom statistical distributions for the NIMBLE probabilistic programming language. This documents provides a short guide for the key steps required to setup datasets and execute the core functionalities of baorista.

Installing and loading the baorista package

Stable version of Baorista is available directly on CRAN while development version can be installed via devtools with the following command:

library(devtools)
install_github('ercrema/baorista')
library(baorista)
## Loading required package: nimble
## nimble version 1.2.0 is loaded.
## For more information on NIMBLE and a User Manual,
## please visit https://R-nimble.org.
## 
## Note for advanced users who have written their own MCMC samplers:
##   As of version 0.13.0, NIMBLE's protocol for handling posterior
##   predictive nodes has changed in a way that could affect user-defined
##   samplers in some situations. Please see Section 15.5.1 of the User Manual.
## 
## Attaching package: 'nimble'
## The following object is masked from 'package:stats':
## 
##     simulate

Please note that in order to execute all commands you need a working C++ compiler. For more further details please read the nimble package documentation

Using baorista

Data Preparation

baorista can read two types of data: a) data.frame class objects containing two fields recording the start and end date of the time-spans of existence recorded in BP, and b) matrices containing the probability of existence for each event (row) at each time-block (column). Sample datasets of the two formats are provided:

data(sampledf) #two column data.frame format
data(samplemat) #events x time-block matrix format

The function createProbMat() creates an object of class probMat which standardise the data format and includes additional information such as chronological range and resolution:

x.df <- createProbMat(x=sampledf,timeRange=c(6500,4001),resolution=50) #50 year timeblock ranging from 6500 to 4001 (i.e. 6500-6451,6450-6401,...)
x.mat <- createProbMat(pmat=samplemat,timeRange=c(5000,3001),resolution=100)

The plot() function displays probMat class object using aoristic sums:

par(mfrow=c(1,2))
plot(x.df,main='x.df')
plot(x.mat,main='x.mat')

Bayesian Inference

baorista contains functions for running two types of Bayesian analyses: parametric and non-parametric. Parametric analyses requires the selection of a suitable growth model (each one is currently a separate function) and provides estimates of key model parameters. Non-parametric models employs an intrinsic conditional auto-regressive (ICAR) Gaussian model which enables the calculation of the probability mass function (i.e. it estimates the probability of any event occurring at each timeblock, effectively recovering the shape of the time-frequency distribution).
In all cases baorista provides a wrapper function which internally initialises and runs an MCMC via the NIMBLE package. These function allow for user-defined MCMC settings (e.g. number of iterations, number chains, etc.), samplers, and priors and automatically assesses convergence statistics.

Estimating Exponential Growth Rate

The most straightforward model is a truncated discrete exponential distribution. Users can fit the data to such distribution and estimate growth rates using the function expfit().

exponential.fit <- expfit(x.mat) #fitting using default MCMC/Prior settings
## Compiling nimble model...
## ===== Monitors =====
## thin = 1: r
## ===== Samplers =====
## RW sampler (1)
##   - r
## thin = 1: r
## running chain 1...
## |-------------|-------------|-------------|-------------|
## |-------------------------------------------------------|
## running chain 2...
## |-------------|-------------|-------------|-------------|
## |-------------------------------------------------------|
## running chain 3...
## |-------------|-------------|-------------|-------------|
## |-------------------------------------------------------|
## running chain 4...
## |-------------|-------------|-------------|-------------|
## |-------------------------------------------------------|
# expfit(x.mat,rPrior='dnorm(mean=0,sd=1)') #Using a Gaussian prior with mean 0 and sd 1 for the growth rate r
# expfit(x.mat,niter=50000)  #Fitting the model with 50k iterations
# expfit(x.mat,nchains=4,parallel=T)  #Running 4 chains in parallel

Summary statistics on the posterior parameters and the MCMC diagnostics can be retrieved using the summary() function:

summary(exponential.fit)
##   Parameter Posterior Median                                95% HPDI     Rhat
## r         r      0.002074402 0.00176572362700897~0.00238598571702006 1.000405
##        ESS
## r 20142.04

Alternatively values can be extracted directly:

#Compute 90% HPDI with coda package
library(coda)
mcmc(exponential.fit$posterior.r) |> HPDinterval(prob=0.90)
##        lower       upper
## r 0.00181297 0.002335741
## attr(,"Probability")
## [1] 0.9
#Plot histogram of posterior
hist(exponential.fit$posterior.r[,1],xlab='r',main='Posterior of Growth Rate')

A dedicated plot function can also be used to visualise the fitted exponential model:

plot(exponential.fit)

Non-Parametric Modelling via Random Walk ICAR Model

baorista offers also a non-parametric model based Random Walk ICAR. The function icarfit() estimates the probability mass for each time-block accounting for the information from adjacent blocks (and hence temporal autocorrelation). Given the larger number of parameters, the MCMC requires longer chains and the execution over multiple cores is recommended. Convergence issues are also more likely to arise, in which case adjust running the model with longer chains, changing the sampler, or adjusting the priors is recommended.

# The following reaches convergence but takes a couple of hours:
# non.param.fit <- icarfit(x.df,niter=2000000,nburnin=1000000,thin=100,nchains=4,parallel=TRUE)
# Shorter number of iterations (does not reach convergence)
non.param.fit <- icarfit(x.df,niter=100000,nburnin=50000,thin=50,nchains=4,parallel=TRUE)
## Running in parallel - progress bar will no be visualised
## Warning in icarfit(x.df, niter = 1e+05, nburnin = 50000, thin = 50, nchains =
## 4, : Highest Rhat value above 1.01 (1.012). Consider rerunning the model with a
## higher number of iterations

A dedicated function can display the HPDI of the probability mass of each time-block:

plot(non.param.fit)