This vignette provides a tutorial for fitting spatial regression models to raster data using geostan. The term “raster” is used here to refer to any regularly spaced set of observations such that the data can be represented spatially by a rectangular grid or lattice. Remotely sensed imagery is a common form of raster data.
For an irregular spatial lattice and moderately big N, the best one can do (for now) is to save the sar_list
or car_list
in a file (using saveRDS(sar_list, "sar-parts.rds")
or similar) so that it only needs to be calculated once.
Start by loading the geostan and sf packages:
library(geostan)
library(sf)
set.seed(1127)
We will create a small raster data layer for the purpose of illustration.
# creating a grid
40
row <- 30
col <-c(N <- row * col)
## [1] 1200
st_sfc(st_polygon(list(rbind(c(0,0), c(col,0), c(col,row), c(0,0)))))
sfc = st_make_grid(sfc, cellsize = 1, square = TRUE)
grid <- st_as_sf(grid)
grid <-
# create connectivity matrix
shape2mat(grid, style = "W", method = "rook", quiet = TRUE)
W <-
# draw data from a spatial autoregressive model
set.seed(100)
$z <- sim_sar(rho = 0.8, w = W)
grid$y <- sim_sar(mu = -0.5 * grid$z, rho = .9, sigma = .3, w = W) grid
plot(grid[ , 'y' ])
The following R code will fit a spatial autoregressive model to these data:
stan_sar(y ~ z, data = grid, C = W) fit <-
The stan_sar
function will take the spatial weights matrix W
and pass it through a function called prep_sar_data
which will calculate the eigenvalues of the spatial weights matrix using Matrix::Schur
. This step can be prohibitive for large data sets (e.g., \(N = 100,000\)).
The following code would normally be used to fit a conditional autoregressive (CAR) model:
shape2mat(grid, style = "B", queen = FALSE)
C <- prep_car_data(C, "WCAR")
car_list <- stan_car(y ~ z, data = grid, car_parts = car_list) fit <-
Here, the prep_car_data
function calculates the eigenvalues of the spatial weights matrix using Matrix::Schur
, which is not feasible for large N.
The prep_sar_data2
and prep_car_data2
functions are designed for large raster layers. As input, they require the dimensions of the grid (number of rows and number of columns). The eigenvalues are produced very quickly using Equation 5 from Griffith (2000). The methods have some restrictions:
The following code will fit a SAR model to our grid data (without any use of shape2mat
) and is suitable for larger raster layers:
# create connectivity matrix and its eigenvalues
prep_sar_data2(row = row, col = col, quiet = TRUE)
sars <-
# if you want the matrix
sars$W
W <-
# fit model
stan_sar(y ~ z,
fit <-data = grid,
centerx = TRUE,
sar_parts = sars,
iter = 500,
chains = 2, # for demo speed
# cores = 4, # multi-core processing
slim = TRUE
)
##
## *Setting prior parameters for intercept
## Distribution: normal
## location scale
## 1 -0.019 5
##
## *Setting prior parameters for beta
## Distribution: normal
## location scale
## 1 0 5
##
## *Setting prior for SAR scale parameter (sar_scale)
## Distribution: student_t
## df location scale
## 1 10 0 3
##
## *Setting prior for SAR spatial autocorrelation parameter (sar_rho)
## Distribution: uniform
## lower upper
## 1 -1 1
##
## SAMPLING FOR MODEL 'foundation' NOW (CHAIN 1).
## Chain 1:
## Chain 1: Gradient evaluation took 0.001293 seconds
## Chain 1: 1000 transitions using 10 leapfrog steps per transition would take 12.93 seconds.
## Chain 1: Adjust your expectations accordingly!
## Chain 1:
## Chain 1:
## Chain 1: Iteration: 1 / 500 [ 0%] (Warmup)
## Chain 1: Iteration: 251 / 500 [ 50%] (Sampling)
## Chain 1: Iteration: 500 / 500 [100%] (Sampling)
## Chain 1:
## Chain 1: Elapsed Time: 1.792 seconds (Warm-up)
## Chain 1: 1.498 seconds (Sampling)
## Chain 1: 3.29 seconds (Total)
## Chain 1:
##
## SAMPLING FOR MODEL 'foundation' NOW (CHAIN 2).
## Chain 2:
## Chain 2: Gradient evaluation took 0.001003 seconds
## Chain 2: 1000 transitions using 10 leapfrog steps per transition would take 10.03 seconds.
## Chain 2: Adjust your expectations accordingly!
## Chain 2:
## Chain 2:
## Chain 2: Iteration: 1 / 500 [ 0%] (Warmup)
## Chain 2: Iteration: 251 / 500 [ 50%] (Sampling)
## Chain 2: Iteration: 500 / 500 [100%] (Sampling)
## Chain 2:
## Chain 2: Elapsed Time: 2.79 seconds (Warm-up)
## Chain 2: 2.025 seconds (Sampling)
## Chain 2: 4.815 seconds (Total)
## Chain 2:
print(fit)
## Spatial Model Results
## Formula: y ~ z
## Likelihood: auto_gaussian
## Link: identity
## Spatial method: SAR (SEM)
## Observations: 1200
##
## Inference for Stan model: foundation.
## 2 chains, each with iter=500; warmup=250; thin=1;
## post-warmup draws per chain=250, total post-warmup draws=500.
##
## mean se_mean sd 2.5% 20% 50% 80% 97.5% n_eff Rhat
## intercept -0.027 0.003 0.069 -0.164 -0.078 -0.027 0.027 0.108 411 0.999
## z -0.502 0.000 0.009 -0.519 -0.510 -0.503 -0.494 -0.485 564 0.997
## sar_rho 0.877 0.001 0.016 0.845 0.863 0.877 0.890 0.908 405 1.006
## sar_scale 0.308 0.000 0.007 0.294 0.302 0.308 0.313 0.321 477 1.008
##
## Samples were drawn using NUTS(diag_e) at Wed Dec 4 15:53:50 2024.
## For each parameter, n_eff is a crude measure of effective sample size,
## and Rhat is the potential scale reduction factor on split chains (at
## convergence, Rhat=1).
The user first creates the data list using prep_sar_data2
and then passes it to stan_sar
using the sar_parts
argument. Also, slim = TRUE
is invoked to prevent the model from collecting N-length parameter vectors and quantities of interest (such as fitted values and log-likelihoods).
For large data sets and complex models, slim = TRUE
can bring about computational improvements at the cost of losing access to some convenience functions (such as sp_diag
, me_diag
, spatial
, resid
, and fitted
). Many quantities of interest, such as fitted values and spatial trend terms, can still be calculated manually using the data and parameter estimates (intercept, coefficients, and spatial autocorrelation parameters).
The favorable MCMC diagnostics for the above model based on just 200 post-warmup iterations per chain (sufficiently large effective sample sizes n_eff
, and Rhat
values very near to 1) provides some indication as to the computational efficiency of the models. The point is that the basic spatial autoregressive models can work well for larger data because you only need a modest number of MCMC samples.
Also, note that Stan usually samples more efficiently when variables have been mean-centered. Using the centerx = TRUE
argument in stan_sar
(or any other model-fitting function in geostan) can be very helpful in this respect. Also note that the SAR models in geostan are (generally) no less computationally-efficient than the CAR models, and may even be slightly more efficient.
To simulate spatially-autocorrelated data with larger numbers of observations use the quick = TRUE
argument in sim_sar
:
100
row = 100
col = prep_sar_data2(row = row, col = col)
sar_list <- sar_list$W
W <-
sim_sar(rho = .8, w = W, quick = TRUE)
z <- sim_sar(mu = -.5 * z, rho = .7, w = W, quick = TRUE)
y <- cbind(y, z) dat <-
The approximate method uses matrix powers, and for high values of SA (especially values of rho greater than 0.9) the approximation is sensitive to the number of powers taken. In sim_sar
this is controlled by the argument K
. For rho = 0.9
, you may need to raise the default value of K for the approximation to hold. Notice that \(0.9^{45} = 0.009\), which is near zero and should provide sound results, whereas \(0.9^25 = 0.07\) which is not quite there, and \(0.9^{15} = 0.2\) will be completely inadequate. In geostan 0.8.0, the default is \(K = 20\).
Donegan, Connor. 2022. “Building Spatial Conditional Autoregressive Models in the Stan Programming Language.” OSF Preprints. https://doi.org/10.31219/osf.io/3ey65.
Griffith, Daniel A. 2000. “Eigenfunction Properties and Approximations of Selected Incidence Matrices Employed in Spatial Analyses.” Linear Algebra and Its Applications 321 (1-3): 95–112.