You need to have installed R and a suitable editor such as RStudio or VSCode with R plugin. See the package’s README for instructions on installing the {simodels} package.
library(simodels)
library(dplyr)
library(ggplot2)
library(sf)
This tutorial builds on a reproducible guide to SIMs in R (Dennett 2018). We start by importing open access data representing movement between zones in Australia (thanks to Adam Dennett for making the files accessible):
# To get the data (pre-loaded in the package)
"https://github.com/Robinlovelace/simodels/releases/download/0.0.1/zones_aus.geojson"
u1 = sf::read_sf(u1)
zones_aus = "https://www.dropbox.com/s/wi3zxlq5pff1yda/AusMig2011.csv?raw=1"
u2 = read.csv(u2) od_aus =
Let’s take a quick look at and ‘minimize’ these input datasets before modelling them with SIMs:
dim(zones_aus)
names(zones_aus)
c("GCCSA_CODE", "GCCSA_NAME", "AREA_SQKM")
key_zone_names = zones_aus[key_zone_names]
zones =head(zones, 2)
dim(od_aus)
names(od_aus)
c("Orig_code", "Dest_code", "Flow")
key_od_names = od_aus[key_od_names]
od =head(od, 2)
The results printed above show that:
zones_aus
represents the 15 regions of Australia, with columns containing zone identifiers (IDs), primarily GCCSA_CODE
, and the area of each zone (AREA_SQKM
). The minimal zones
object contains only the region code, name, and area.
od_aus
contains 225 rows of data representing the movement of people between the zones in aus
. Note that 255 is 15 squared, meaning that this OD dataset contains the complete combination of migration flows, starting with zone 1GSYD
to 1GSYD
. These codes are present in the GCCSA_CODE
column in the aus
object. The first row of the OD dataset represents ‘intra-zonal’ migrations in Greater Sydney, presumably counting the number of people who move house from somewhere in the region to another home in the region. There are 13 columns in this dataset, the most important of which are Orig_code
, Dest_code
, and Flow
, which we have captured in the minimal od
dataset.
Note: a useful convention with ‘long form’ OD datasets is for the first two columns to contain zone IDs that correspond to values in the first column of the zone dataset. The R package {od}, on which {simodels} builds, assumes that inputs to its functions are in this form.
It is a good idea to verify that the origin and destination codes in the od
dataset match the zone codes in zones
:
summary(od[[1]] %in% zones[[1]])
summary(od[[2]] %in% zones[[1]])
It is clear from the above that we have ‘clean’ input datasets, let’s begin with the modelling!
Key to the workings of the {simodels} package is the conversion of geographic objects representing origins and destinations into an OD dataset. In this case, we already have an OD dataset, so this step is less relevant. However, we will take this step in any case because many SIMs start without such a comprehensive OD dataset as we have in this case.
Prepare the OD dataset as follows:
si_to_od(origins = zones, destinations = zones)
od_sim =names(od_sim)
Note that the output has duplicate columns: si_to_od()
joins data from the origin and destination objects into the resulting OD object.
A simplistic SIM - in this case an inverse power distance decay function (negative exponential is another commonly used decay function) - can be created just based on the distance between points:
function(d, beta) (d / 1000)^beta
si_power = si_calculate(
od_calculated =
od_sim,fun = si_power,
d = distance_euclidean,
beta = -0.8
)plot(od_calculated["interaction"], logz = TRUE)
This approach, ignoring all variables at the level of trip origins and destinations, results in flow estimates with no units. Before learning how to run constrained SIMs, let’s scale the result by the total flow and see how far we are from reality, just focussing on the interzonal OD pairs:
od %>%
od_interzonal = filter(Orig_code != Dest_code)
od_calculated %>%
od_calculated_interzonal = filter(O != D)
sum(od_interzonal$Flow) /
scale_factor = sum(od_calculated_interzonal$interaction)
od_calculated_interzonal %>%
od_calculated_interzonal = mutate(interaction_scaled = interaction * scale_factor)
inner_join(
od_joined =
od_calculated_interzonal,%>% rename(O = Orig_code, D = Dest_code)
od
)%>%
od_joined ggplot() +
geom_point(aes(Flow, interaction_scaled))
cor(od_joined$Flow, od_joined$interaction_scaled)^2
The results show that a simple unconstrained model, without any parameter fitting, can explain less than 20% of the variability in flows. We can do better!
%>%
od_joined mutate(decay = distance_euclidean^-0.8) %>%
mutate(decay = decay * (sum(Flow) / sum(decay))) %>%
ggplot() +
geom_point(aes(distance_euclidean, Flow)) +
geom_line(aes(distance_euclidean, decay), colour = "red")
The first logical way to improve model fit is to run a production constrained model. To do that, we’ll first calculate the total number of people leaving each zone and then use the constraint_production
argument:
od_joined %>%
od_originating = group_by(O) %>%
mutate(originating_per_zone = sum(Flow)) %>%
ungroup()
si_calculate(
od_constrained_p =
od_originating,fun = si_power,
d = distance_euclidean,
beta = -0.8,
constraint_production = originating_per_zone
)%>%
od_constrained_p ggplot() +
geom_point(aes(Flow, interaction))
cor(od_constrained_p$Flow, od_constrained_p$interaction)^2
Progress! We have more than doubled the predictive ability of our model by using a ‘production constrained’ SIM, as defined mathematically in the simodels-first-principles
vignette.
An advantage of the flow data used in this example is that we already know the interaction. (This raises the question of why a SIM is needed, answer: to test our models and demonstrate the techniques.)
We can do this using the nls()
function as follows:
library(minpack.lm)
Flow ~ a * (distance_euclidean)^b
f = nlsLM(
m =formula = f,
data = od_originating,
)
m# Nonlinear regression model
# model: Flow ~ a * (distance_euclidean)^b
# data: od_originating
# a b
# 2.182e+07 -5.801e-01
%>%
od_joined mutate(decay = distance_euclidean^-5.801e-01) %>%
mutate(decay = decay * 2.182e+07) %>%
ggplot() +
geom_point(aes(distance_euclidean, Flow)) +
geom_line(aes(distance_euclidean, decay), colour = "red")
si_predict(od_originating, model = m)
od_pred =cor(od_pred$Flow, od_pred$interaction)^2
si_predict(od_originating, model = m,
od_pred_const =constraint_production = originating_per_zone)
cor(od_pred_const$Flow, od_pred_const$interaction)^2
library(tmap)
ttm()
tm_shape(od_pred_const) +
tm_lines("interaction_scaled", palette = "viridis")
Dennett, Adam. 2018. “Modelling Population Flows Using Spatial Interaction Models.” Australian Population Studies 2 (2): 33–58. https://doi.org/10.37970/aps.v2i2.38.