golden is a flexible patient-level microsimulation modelling framework focussed on trajectories of patient risk factors and hazards of events, authored by Pete Dodd and Robert Chisholm.
golden is not currently available via CRAN, so it must be installed either via devtools:
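For example (the repository owner below is a placeholder, not taken from this document, and must be replaced with the real GitHub account):

```r
## hypothetical location: substitute the actual GitHub account for <owner>
devtools::install_github("<owner>/golden")
```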
or by manually downloading a .zip from the golden GitHub
repository to be installed:
Following either of these actions, you should now be able to load golden like any other package.
golden is a framework which provides the glue, allowing R functions, either defined by the user or selected from other packages, to be combined into a comprehensive patient trajectory model.
The four conceptual components of the framework are
hazards, transitions,
trajectories, and histories. The
population state is conceptualized as a data.frame or
data.table, with one row per patient, and columns
representing and storing information about the patients.
Hazards are functions that determine the likelihood of an event occurring for a patient in a given time step, given their current state (column values). If the random event occurs, an associated set of transitions will be applied that change the patient’s state. An example hazard might be the mortality rate, with death as an associated transition.
Trajectories are rules that determine the evolution of patients’ states. An example might be ageing, where ages advance with time while the patient is alive.
Histories define which aggregate summaries of the population to record as time series. The final population state will automatically be returned, which can be useful for computing cumulative or final quantities.
These concepts are represented by S3 objects in R, with constructors defined below. In order to create these objects, users will typically need to define functions and specify their arguments.
Whilst the exact usage of each case differs slightly, they all share
a common core whereby users must specify both a function (often
shortened to fn in the constructor argument names) and the
function’s arguments (often shortened to args).
The arguments passed to a function will typically be the names of columns (within the population data.table) to be passed to the function. The name “~STEP” is reserved to represent the current simulation time step.
Within this document, Note will preface some remark that explains how an example illustrates some important functionality that could be used in other ways.
Currently golden assumes that all functions are designed to process vectors, rather than individual scalars. If scalar handling is required, we recommend that you wrap your scalar function inside a parent function which calls the scalar function for each element of the vector arguments.
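As a sketch of such a wrapper using only base R (the function names here are illustrative, not part of golden):

```r
## a scalar-only function: handles one value at a time
scalar_risk <- function(age) {
  if (age < 40) 0.01 else 0.05
}

## vectorised wrapper suitable for the framework: calls the
## scalar function once per element of the input vector
risk_fn <- function(age) {
  vapply(age, scalar_risk, numeric(1))
}

risk_fn(c(30, 50, 70)) # -> 0.01 0.05 0.05
```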
Hazards represent an instantaneous chance of a patient transitioning between states. A hazard is passed the requested vectors of population data and returns a hazard value, which is converted to a probability via Equation @ref(eq:hazard). A random number is then generated for each patient and tested to decide whether the hazard has occurred for that patient. Patients selected to be impacted by the hazard then execute linked transition functions to update one or more of their states.
\[\begin{equation} p = 1 - \exp(-h\, dt) \label{eq:hazard} \end{equation}\]
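A base-R sketch of this conversion for a vector of per-step hazards and one step of unit length (this illustrates the mechanics, not golden’s internal code):

```r
## per-step hazards for three patients
h <- c(0.1, 0.5, 2)
dt <- 1                       # one step of unit length
p <- 1 - exp(-h * dt)         # -> approx 0.095 0.393 0.865
event <- runif(length(p)) < p # TRUE where the event occurs this step
```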
Each transition is created via new_transition() which
requires the transition function, its arguments, and the state
(population column) that it updates.
Each hazard is created via new_hazard() which requires
the hazard function, its arguments, and a list of dependent
transitions. Hazards also provide optional configuration to decrease the
frequency or range of steps that a hazard is active for.
Trajectories represent regularly changing states, such as a patient’s age or disease progression. They operate very similarly to transitions; however, trajectories allow multiple states to be updated simultaneously. Multivariate trajectories can return a list of vectors instead of a single vector, one for each state to be updated.
Each trajectory is created via new_trajectory() which
requires the trajectory function, its arguments, and the property or
properties to be updated.
History is the golden approach to logging aggregate population information (e.g. timeseries) during simulation execution. Individual columns of data to be collected are specified with a corresponding reduction function, with the optional facility to filter down to a sub-population to be reduced (e.g. the average age of living male patients).
Each column is created via new_column() which requires a
name for the column, the reduction function and arguments to use and an
optional filter function and respective arguments.
One or more columns are then passed to new_history(),
alongside the frequency of data collection (which defaults to 1, every
step).
With all the components of the simulation defined, they can now be assembled into the main parameter list to pass to the simulation.
The parameters are created by passing lists of hazards and trajectories, whilst also specifying the number of steps to execute and optional items such as the random seed or history.
To execute a simulation run_simulation() is called,
passing the initial population table and parameters structure. In the
next few sections we will step through a series of related, artificial
examples to illustrate particular features.
First, let’s load some relevant libraries and set the random seed:
library(golden) # this package
library(data.table) # enhanced data handling
#>
#> Attaching package: 'data.table'
#> The following object is masked from 'package:base':
#>
#> %notin%
library(ggplot2) # plotting
set.seed(1234) # PRNG seed

One of the arguments for run_simulation is the initial
population. golden does not provide utilities to create
populations. Users are expected to create their own by resampling data,
using packages such as X and Y, or writing their own methods. For these
simple examples, we will create an artificial population as a
data.table with one row per patient and columns containing
their data. Here, we are considering ages, systolic blood pressure
(SBP), and total cholesterol (TC).
## initial population
N <- 1e4L # population size
pop0 <- data.table(
id = 1:N, # patient ids
death = rep(-1, N), # time of death
age = rep(40, N), # age
sbp = rlnorm(N, log(140), .05), # systolic BP
tc = rlnorm(N, log(4.5), .01) # total cholesterol
)

For our first example, we will see how to apply a constant hazard of death. We need to write a function that returns this hazard:

## constant mortality hazard (per step)
deathrate0 <- function() {
0.1
}
This is actually a special case: usually functions need to return vectors, and should be programmed to return vectors of the same length as their input arguments. This special constant function has no arguments and its scalar return value will be recycled to create a vector of identical element values, with length equal to the population size.
Hazards control how likely events are to happen at each time step. We
define what events look like by specifying (a list of) transition
functions that will be called to enact changes on the data if a hazard
triggers an event. In our initial population we have chosen to include a
variable death for time of death, which is -1 when people
are alive. We can therefore use a generic transition to enact mortality
by recording the time of death:
## generic transition: returns state as input value
transition_fn <- function(state, i) {
# if state is -1 (alive), update state to the current step
ifelse(state == -1, rep(i, length(state)), state)
}

We will need to create a hazard that understands more context about how to apply the hazard function and associated transitions to the data:
## new_hazard creates a hazard
morthaz0 <- new_hazard(
deathrate0, # function returning hazard
c(), # arguments for hazard function
new_transition( # creates a transition
transition_fn, # 'kills' by death -1 -> time of death
c("death", "~STEP"), # args for transition fn
"death" # col transition acts on
)
)

Note that the generic transition will record time of death: the input
arguments are death (to check not already dead) and
~STEP (the special current step variable) and, where
applied, the function will set death to the current step
value.
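To see the generic transition in action on its own, we can apply it directly (repeating its definition so the snippet is self-contained) to three patients at step 5, of whom the second already died at step 3:

```r
## generic transition: returns state as input value
transition_fn <- function(state, i) {
  ifelse(state == -1, rep(i, length(state)), state)
}

## the living (-1) are 'killed' at step 5; the already dead are unchanged
transition_fn(c(-1, 3, -1), 5) # -> 5 3 5
```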
If we want to record aggregate time series data about our population during the simulation, we will need to construct a history object. For this we will need to specify columns, what summary they are computing, and (optionally) what population subset they are considering by providing a filter function that selects rows. As with above, we need to provide vectors of column names that specify which inputs functions are taking as arguments. You can also choose the frequency with which time series statistics are computed and recorded.
## filter to restrict to alive
filter_alive <- function(x) {
x == -1 # tests alive applied to 'death'
}
## new_history creates a history
noalive <- new_history(
columns = list(
new_column( # creates a col in history
"no. alive", # column name
length, # summary function
"age", # input args for summary fn
filter_alive, # filter to row-restrict
"death" # input variables for filter
)
),
frequency = 1 # record history every 1 steps
)

Now we are in a position to create a parameters object, which will take lists of hazards, trajectories, and a history, as well as other information such as how many steps to run the model for. Note that we haven’t said how long a step is: the hazards need to return values in units of per step. If steps are meant to represent years, mortality hazards should return in units of per year; if steps are meant to represent weeks, hazards should use units of per week.

## parameters object
parms0 <- new_parameters(
hazards = morthaz0,
trajectories = list(), # no trajectories in this first example
steps = 10,
debug = FALSE,
history = noalive
)
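To make the units point concrete: the same underlying mortality rate expressed per year or per week gives the same survival over a common horizon (assuming 52 weeks per year):

```r
h_year <- 0.1          # mortality hazard in units of per year
h_week <- h_year / 52  # the same rate in units of per week

## survival over one year is identical under either step length
exp(-h_year * 1)   # annual steps: one step
exp(-h_week * 52)  # weekly steps: 52 steps
```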
Printing parameters, hazards, etc. will generate a summary of their ingredients.
Given an initial population and a parameters object, we can run the model:
## run the simulation
result0 <- run_simulation(pop0, parms0)
#> Simulation Complete in 0.01 seconds!

This outputs a list of 3 things:
## output list
print(result0) # explicit print unnecessary
#> $pop
#> id death age sbp tc
#> <int> <num> <num> <num> <num>
#> 1: 1 0 40 131.8005 4.418978
#> 2: 2 0 40 141.9555 4.528311
#> 3: 3 -1 40 147.8007 4.523375
#> 4: 4 2 40 124.5065 4.506346
#> 5: 5 -1 40 143.0363 4.566057
#> ---
#> 9996: 9996 4 40 140.1382 4.519607
#> 9997: 9997 -1 40 125.8770 4.487413
#> 9998: 9998 1 40 139.6489 4.529636
#> 9999: 9999 -1 40 138.3427 4.470808
#> 10000: 10000 -1 40 145.5417 4.573960
#>
#> $history
#> ~STEP no. alive
#> <int> <num>
#> 1: 1 9059
#> 2: 2 8162
#> 3: 3 7375
#> 4: 4 6702
#> 5: 5 6068
#> 6: 6 5495
#> 7: 7 4943
#> 8: 8 4463
#> 9: 9 4035
#> 10: 10 3635
#>
#> $timing
#> <golden_timing>
#>
#> Hazards:
#> Hazard Total Time (s) Avg Time (s) % Runtime
#> <char> <num> <num> <num>
#> 1: deathrate0 0 0 0
#>
#> Transitions:
#> Transition Total Time (s) Avg Time (s) % Runtime
#> <char> <num> <num> <num>
#> 1: transition_fn 0.002 2e-04 16.66667
#>
#> Columns:
#> Column Total Time (s) Avg Time (s) % Runtime
#> <char> <num> <num> <num>
#> 1: no. alive 0.001 1e-04 8.333334

Final population data can be used to understand final status and to record cumulative data at an individual level. Conversely, history data captures specified summaries, but does so over time.
Here, we can plot the fraction of the population surviving at each time step and compare it to the deterministic expectation:
## plot
ggplot(result0$history, aes(`~STEP`, `no. alive`)) +
geom_line() +
geom_point() +
geom_function(fun = function(x) N * exp(-x / 10), col = 2)

You may have noticed that the example above did not include a trajectory. That is, none of the data associated with individuals evolved over time, except as a result of events triggered by hazards. In this example we will introduce some simple trajectories: deterministic or stochastic rules that specify how continuous variables associated with patients evolve. In realistic use cases, these would represent patient-level risk factors that influence the likelihood of health events. Perhaps the simplest example one could imagine is age.
A variety of quantities may need to increment with time, for example to record event counts, but age should only advance in those who are alive. Let’s define functions for both these possibilities:
## Define a function to increment by 1
generic_update <- function(x) {
x + 1
}
## If dead, age is not changed
age_update <- function(age, death_time) {
ifelse(death_time == -1, age + 1, age)
}

As with hazards, we will create a trajectory using our update function, but will need to provide information about what the function uses as inputs, and what column it acts upon:
## new_trajectory creates a trajectory
age_traj <- new_trajectory(
age_update, # update function
c("age", "death"), # args for update function
"age" # column(s) to be updated
)

Let’s also consider a marginally more realistic mortality hazard: we’ll make our mortality risk increase with SBP and TC:
## mortality hazard depending on individual characteristics
deathrate1 <- function(death, sbp, tc) {
ifelse(death < 0, 0.1 + sbp / 1000 + tc / 100, 0)
}
morthaz1 <- new_hazard(
deathrate1, # hazard function
c("death", "sbp", "tc"),# args for hazard fn
## (list of) transitions to apply
new_transition(
transition_fn, # generic transition
c("death", "~STEP"), # input col args
"death" # col outputted
)
)

This time the hazard construction is less trivial, with a vector of column names to be used as input arguments by the mortality rate.
Note also that conceptually, this shows how to include parametric uncertainty in analyses. Often health economic analyses will perform Probabilistic Sensitivity Analysis (PSA), which means sampling across uncertain parameters to reflect their uncertainty in outputs. Here SBP and TC are patient characteristics that have been sampled to vary between patients, but they could equally represent uncertain global parameters, e.g. an uncertain treatment effect or natural history parameter. In this framework, ‘parameters’ (and any uncertainty in them) should be included in the ‘population’.
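As a sketch of this idea with data.table (the column name log_hr is illustrative, not from golden): a single PSA draw of an uncertain global parameter can be replicated across all patients, so it travels with the population:

```r
library(data.table)

pop <- data.table(id = 1:4, death = -1)
## one PSA draw of a hypothetical uncertain global parameter
## (a log hazard ratio), recycled to every patient
pop[, log_hr := rnorm(1, mean = log(0.8), sd = 0.1)]
pop$log_hr # the same value on every row
```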
In the same way as before, we can create a parameters object and run the simulation:
## full parameters
parms1 <- new_parameters(
hazards = morthaz1,
trajectories = age_traj,
steps = 10,
debug = FALSE,
history = noalive
)
## run the simulation
result1 <- run_simulation(pop0, parms1)
#> Simulation Complete in 0.02 seconds!

This time, one can see that people are no longer 40 years old at the end of the simulation (unless they died in the first step):
result1$pop
#> id death age sbp tc
#> <int> <num> <num> <num> <num>
#> 1: 1 4 44 131.8005 4.418978
#> 2: 2 0 40 141.9555 4.528311
#> 3: 3 8 48 147.8007 4.523375
#> 4: 4 -1 50 124.5065 4.506346
#> 5: 5 0 40 143.0363 4.566057
#> ---
#> 9996: 9996 1 41 140.1382 4.519607
#> 9997: 9997 2 42 125.8770 4.487413
#> 9998: 9998 4 44 139.6489 4.529636
#> 9999: 9999 3 43 138.3427 4.470808
#> 10000: 10000 3 43 145.5417 4.573960

In this section we will illustrate how to implement a stochastic multivariate trajectory. However, to keep things simple for inspecting longer runs, let’s turn off death:
## keep everyone alive here to have a look at RW dynamics
deathrate2 <- function() {
0
}
morthaz2 <- new_hazard(deathrate2, args = c(), transitions = list())

It may be the case that certain patient characteristics evolve in a non-independent way. Modern statistical approaches and longitudinal data sets allow development of models that can simultaneously represent the correlated evolution of multiple variables, i.e. are multivariate. Here, we consider a toy example of a bivariate trajectory model for SBP and TC. We represent their evolution as a correlated geometric random walk (i.e. a correlated random walk in log-transformed space):
## sbp & tc as correlated geometric random walk
bptc_update <- function(sbp, tc, death_time) {
lsbp <- log(sbp)
ltc <- log(tc)
alive <- death_time < 0
if (sum(alive) > 1) {
U <- matrix(rnorm(2 * sum(alive)), nrow = sum(alive), ncol = 2)
U[, 2] <- U[, 2] + U[, 1] / 2 # correlation
U <- U / 100 # rescale
lsbp[alive] <- lsbp[alive] + U[, 1]
ltc[alive] <- ltc[alive] + U[, 2]
}
list(sbp = exp(lsbp), tc = exp(ltc))
}
bptc_update(rep(140, 5), rep(4.5, 5), rep(-1, 5))
#> $sbp
#> [1] 142.4213 141.2728 139.7552 141.2690 144.1367
#>
#> $tc
#> [1] 4.473751 4.576176 4.433001 4.521404 4.504199

The two columns are returned as a list of vectors. Note that even if a group of variables are not evolving in a correlated way, it may be more logical, convenient, or economical of code to group them together rather than writing separate update functions for each of them.
Note also that this example illustrates use of a random function. The functions provided for trajectory updates or transitions are free to include stochasticity. This is also true of hazard functions, although this would be less common.
Having defined our update, we can create a trajectory as before, but now with vectors to specify which columns are affected. Now that we have multiple trajectories, we will need to pass a list of them to the parameters constructor.
bptc_traj <- new_trajectory(
bptc_update, # trajectory fn
c("sbp", "tc", "death"), # trajectory args
c("sbp", "tc") # outputs (list of vectors from fn)
)
## make list of trajectories for use below
trajlist2 <- list(
## age
age_traj,
## SBP & TC handled in bivariate fashion
bptc_traj
)

In this case, it may convey more to construct a history that picks
out a single individual patient trajectory. There are different ways
this could be done, since sum and mean and the
identity function function(x) x all return the same when
applied to a single individual (vector of length 1).
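For a single value these summaries do indeed coincide:

```r
x <- 4.2 # a sub-population of one patient
c(sum(x), mean(x), (function(v) v)(x)) # all three give 4.2
```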
## filter to 1
filter_1 <- function(x) {
x == 1
}
## history to also include random walk variables for id==1
noalive2 <- new_history(
columns = list(
new_column("no. alive", length, c("age"), filter_alive, c("death")),
## 3 ways of returning an individual's value:
new_column("sbp1", sum, c("sbp"), filter_1, c("id")),
new_column("tc1", mean, c("tc"), filter_1, c("id")),
new_column("tc1 v2", function(x) x, c("tc"), filter_1, c("id"))
),
frequency = 1
)

For this illustration, let’s run the simulation for 1000 steps, but only for the first 50 individuals:
## full parameters
parms2 <- new_parameters(
hazards = morthaz2,
trajectories = trajlist2,
steps = 1000, #longer
debug = FALSE,
history = noalive2
)
## run the simulation
result2 <- run_simulation(pop0[1:50], parms2) #look at a limited population
#> Simulation Complete in 0.07 seconds!

We can now transform and plot the SBP and TC trajectories for the first individual, which do indeed look like a correlated geometric random walk:
plot_data <- melt(result2$history[, .(t = `~STEP`, sbp1, tc1)], id = "t")
ggplot(plot_data, aes(t, value, col = variable)) +
geom_line() +
facet_wrap(~variable, scales = "free_y") +
theme(legend.position = "none")

Sometimes, it may be necessary to represent ‘deterministic’ events,
which happen with certainty in response to some change. These are dealt
with in golden by allowing hazard values to be
Inf, which guarantees events will occur.
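Plugging h = Inf into Equation @ref(eq:hazard) shows why this works: the event probability becomes exactly 1:

```r
h <- Inf
p <- 1 - exp(-h * 1) # exp(-Inf) is 0, so p is exactly 1
p # -> 1
```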
As a simple example, let’s suppose we want to put people on ‘treatment’ if their SBP exceeds 150 or their TC exceeds 4.5. First, let’s introduce some new variables to our population to describe related states:
## extra variables
pop0[, on_treatment := 0L] # time on treatment: 0 means not on treatment
pop0[, treatment_counter := 0L] # counter for treatment initiation

We will say we want treatment to start with certainty if the SBP/TC
condition above is met, and to finish after 5 steps. This means defining
hazard functions that become Inf when the relevant
conditions are met:
## definitely go on treatment if sbp >= 150 or tc >= 4.5
treatment_start_haz <- function(death, on_treatment, sbp, tc) {
ifelse(death > 0 | on_treatment > 0 | (sbp < 150 & tc < 4.5), 0, Inf)
}
## definitely end treatment after 5 steps if alive
treatment_end_haz <- function(death, on_treatment) {
ifelse(death > 0 | on_treatment < 5, 0, Inf)
}

Of course we will also need a trajectory rule to track time-on-treatment, and add this to our list of trajectories:
## treatment trajectory
treatment_time_update <- function(death, on_treatment) {
ifelse(death < 0 & on_treatment > 0, on_treatment + 1, 0)
}
## trajectory list
trajlist3 <- list(
## age
age_traj,
## SBP & TC handled in bivariate fashion
bptc_traj,
## new trajectory for treatment
new_trajectory(
treatment_time_update,
c("death", "on_treatment"),
"on_treatment"
)
)And we will need to define the transitions associated with starting and finishing treatment:
## start off
treatment_start_transition <- function(on_treatment) {
rep(1, length(on_treatment))
}
## finish off
treatment_end_transition <- function(on_treatment) {
rep(0, length(on_treatment))
}

These will be included in a list of transitions associated with the hazard for starting treatment:
hazlist3 <- list(
## death
morthaz0,
## starting treatment
new_hazard(
treatment_start_haz,
c("death", "on_treatment", "sbp", "tc"),
## (which triggers two transitions)
list(
new_transition(
treatment_start_transition,
"on_treatment",
"on_treatment"
),
new_transition(
generic_update,
"treatment_counter",
"treatment_counter"
)
)
),
## finishing treatment
new_hazard(
treatment_end_haz,
c("death", "on_treatment"),
## (triggering one transition)
new_transition(
treatment_end_transition,
"on_treatment",
"on_treatment"
)
)
)

Note: as with trajectories, we can code transitions to act
on multiple columns by defining functions that return lists of vectors, and
specifying which columns these are via a vector of column names in
new_transition. This could allow for things like correlated
random transitions, but more often may be a more convenient way to
specify things. E.g. in the above one could have defined a single
treatment_start_transition that both turned on treatment
and incremented the treatment counter, as opposed to defining these
separately and grouping them with the treatment_start_haz
using a list.
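Such a combined transition (a hypothetical alternative to the two used above) might look like this, returning a list of vectors, one per column:

```r
## hypothetical combined transition: starts treatment and increments
## the counter in one function; returns a list of vectors (one per column)
start_tx_combined <- function(on_treatment, treatment_counter) {
  list(
    on_treatment = rep(1, length(on_treatment)),
    treatment_counter = treatment_counter + 1
  )
}

start_tx_combined(c(0, 0), c(2, 0)) # on_treatment: 1 1; counter: 3 1
```

It would then be registered with a vector of both column names as the final argument to new_transition.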
For this example, let’s add a time series capturing the number on treatment at each time to the history:
history3 <- new_history(
columns = list(
new_column("no. alive", length, "age", filter_alive, "death"),
## alive and on treatment:
new_column(
"no. alive on tx",
function(x) sum(x > 0),
"on_treatment",
filter_alive, "death"
),
## an individual's value:
new_column("sbp1", sum, "sbp", filter_1, "id"),
new_column("tc1", mean, "tc", filter_1, "id")
),
frequency = 1
)

Finally, let’s assemble the parameters and run the model:
## full parameters
parms3 <- new_parameters(
hazards = hazlist3,
trajectories = trajlist3,
steps = 20, # longer
debug = FALSE,
history = history3
)
## run the simulation
result3 <- run_simulation(pop0, parms3)
#> Simulation Complete in 0.08 seconds!
result3$pop
#> id death age sbp tc on_treatment treatment_counter
#> <int> <num> <num> <num> <num> <num> <num>
#> 1: 1 -1 60 134.2391 4.478562 0 0
#> 2: 2 11 51 137.9237 4.881861 0 3
#> 3: 3 9 49 148.8540 4.585957 0 2
#> 4: 4 7 47 126.8567 4.582231 0 2
#> 5: 5 0 40 143.0363 4.566057 0 20
#> ---
#> 9996: 9996 -1 60 136.6982 4.515968 0 4
#> 9997: 9997 0 40 125.8770 4.487413 0 0
#> 9998: 9998 0 40 139.6489 4.529636 0 20
#> 9999: 9999 2 42 140.8224 4.525010 0 1
#> 10000: 10000 -1 60 145.7149 4.419947 0 4

You can see the number of times patients have started treatment in the final population data. This is an example of recording a cumulative variable, and this approach could be used to capture cumulative (discounted) costs etc. in health economic applications.
Let’s inspect the numbers alive and on treatment:
plot_data <- melt(result3$history[, .(
t = `~STEP`,
`no. alive`, `no. alive on tx`
)], id = "t")
ggplot(plot_data, aes(t, value, col = variable)) +
geom_line() +
geom_point() +
theme(legend.position = "top")

This is a very artificial example, but one can see the cyclical pattern with people finishing treatment roughly in sync after 5 steps before restarting; faster evolution of SBP/TC would probably soften this synchronicity.
For a slightly more realistic application of these principles, we will consider the globorisk cardiovascular disease (CVD) model, which commendably has an R package including the data needed to reconstruct the baseline hazard for combined CVD event risk. We include data for the US from this package, along with body mass index (BMI) estimates based on NCD-RisC and population demographic data from the UN World Population Prospects 2024 (WPP2024) estimates, courtesy of the R package wpp2024.
First, let’s construct a population. We provide in the package population demographic structure estimates for the US in 2023: we write a short function to enable sampling a cohort to match these patterns, and apply it to create our initial base cohort, including: age, sex, death, and a CVD event counter:
## population structure data
head(pop_snapshot)
#> popM popF
#> [1,] 2014.861 1925.018
#> [2,] 1957.782 1869.524
#> [3,] 1943.898 1857.020
#> [4,] 1923.427 1835.382
#> [5,] 1936.948 1846.899
#> [6,] 1955.784 1859.816
## function to sample a population using above data
make_cohort <- function(N) {
popcounts <- rmultinom(1, size = N, prob = c(pop_snapshot))
sexes <- rep(1, sum(popcounts[1:nrow(pop_snapshot)]))
sexes <- c(sexes, rep(0, N - length(sexes)))
initPop <- data.table(
male = as.integer(sexes),
age = as.integer(0),
death = as.integer(rep(-1, N)),
cvd_count = as.integer(rep(0, N))
)
## do ages
ageref <- rep(0:(nrow(pop_snapshot) - 1), 2)
k <- 1
for (i in 1:length(popcounts)) {
if (popcounts[i] > 0) {
initPop[k:(popcounts[i] + k - 1), age := ageref[i]]
k <- k + popcounts[i]
}
}
initPop
}
## sample the initial population
initPop <- make_cohort(N)

We need to do a bit of extra work to facilitate merging with other data. We add age categories to merge in BMI fit data, and then sample a distribution of BMIs. Then we add the globorisk age categories and merge in central values of SBP and TC:
## age categories used in the BMI fits data
initPop[, acat := fcase(
age >= 25 & age < 30, "25-29",
age >= 30 & age < 35, "30-34",
age >= 35 & age < 40, "35-39",
age >= 40 & age < 45, "40-44",
age >= 45 & age < 50, "45-49",
age >= 50 & age < 55, "50-54",
age >= 55 & age < 60, "55-59",
age >= 60 & age < 65, "60-64",
age >= 65 & age < 70, "65-69",
age >= 70 & age < 75, "70-74",
age >= 75 & age < 80, "75-79",
age >= 80 & age < 85, "80-84",
age >= 85, "85plus",
default = "20-24"
)]
## add male convention to BMI fits data
bmi_fits$male <- ifelse(bmi_fits$sex == "Men", 1, 0)
## merge in BMI fits data
initPop <- merge(
initPop,
bmi_fits[, .(male, acat, k, theta)],
by = c("male", "acat"),
all.x = TRUE, all.y = FALSE
)
## sample some BMIs
initPop[, bmi := rgamma(nrow(initPop), shape = k, scale = theta)]
## add globorisk age categories
initPop[, agec := as.integer(ifelse(age < 85, trunc(age / 5) - 7, 10))]
initPop[agec < 1, agec := 1]
initPop[agec > 9, agec := 9]
## merge in central SBP and TC for US from globorisk
initPop <- merge(
initPop,
globorisk_rf[, .(agec, male = sex, sbp = 10 * mean_sbp, tc = mean_tc)],
by = c("agec", "male")
)

We can check this is looking plausible:
## initial pop
ggplot(initPop, aes(x = age, fill = factor(male), group = male)) +
geom_histogram() +
theme(legend.position = "top")
#> `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

We create a mortality hazard function that looks up death rates in a given year, age, and sex from WPP2024 data:
## data prep
n_ages <- 101
n_years <- nrow(lifetable_data) / n_ages
qx_array <- array(0, dim = c(2, n_ages, n_years))
qx_array[1, , ] <- lifetable_data$mxM
qx_array[2, , ] <- lifetable_data$mxF
## mortality hazard function
mort_fn <- function(sex, age, year) {
## Convert to 1-indexed and clamp in bounds
d <- dim(qx_array)
n_rows <- d[2]
n_cols <- d[3]
row_index <- pmin(pmax(age + 1, 1), n_rows)
col_index <- pmin(pmax(year + 1, 1), n_cols)
qx_array[cbind(sex + 1, row_index, col_index)] # zero index female
}
## test
mort_fn(rep(1, 10), rep(50, 10), rep(70, 10))
#> [1] 0.00157482 0.00157482 0.00157482 0.00157482 0.00157482 0.00157482
Then we write a hazard for CVD events that depends on updated age, sex, BMI, SBP, and TC, based on the globorisk R package.
globohaz <- function(death, sex, age, bmi, sbp = 140, tc = 4.5) {
n <- length(sex)
if (length(sex) != length(age) || length(sex) != length(bmi)) {
stop("Arguments must be equal length!\n")
}
## checks/additions
age <- pmin(pmax(age, 41), 75)
dm <- rep(0, n)
smk <- rep(0, n)
## globorisk transforms
agec <- as.integer(ifelse(age < 85, trunc(age / 5) - 7, 10))
sbp <- sbp / 10
dm <- as.integer(dm)
smk <- as.integer(smk)
bmi <- bmi / 5
## center
agc_sex <- paste(agec, sex, sep = "_")
sbp_c <- sbp - globorisk_rf[agc_sex][, mean_sbp]
tc_c <- tc - globorisk_rf[agc_sex][, mean_tc]
dm_c <- dm - globorisk_rf[agc_sex][, mean_dm]
smk_c <- smk - globorisk_rf[agc_sex][, mean_smk]
bmi_c <- bmi - globorisk_rf[agc_sex][, mean_bmi]
## compute hazard ratios
HR <- sbp_c * globorisk_coefs[["main_sbpc"]] +
bmi_c * globorisk_coefs[["main_bmi5c"]] +
smk_c * globorisk_coefs[["main_smokc"]] +
sex * smk_c * globorisk_coefs[["main_sexsmokc"]] +
age * sbp_c * globorisk_coefs[["tvc_sbpc"]] +
age * smk_c * globorisk_coefs[["tvc_smokc"]] +
age * bmi_c * globorisk_coefs[["tvc_bmi5c"]]
HR <- exp(HR) # un-log
## baseline hazard
h <- globorisk_cvdr[agc_sex][, cvd_0]
## return
ans <- h * HR
ans[death > 0] <- 0 # no effect on dead
ans[!is.finite(ans)] <- 0 # safety
ans
}
## test: note returns vector
globohaz(
c(-1, -1, 1), #alive
c(0, 1, 1), #sex
c(00, 50, 90), #age
rep(25, 3), #BMI
rep(145, 3), #SBP
rep(5, 3) #TC
)
#> [1] 0.003613247 0.004592897 0.000000000

We combine these to create a list of hazards. CVD events will just
increment the cvd_count variable, but in more realistic
applications would trigger costs and complications.
## hazards
hazlist <- list(
## CVD event
new_hazard(
globohaz,
c("death", "male", "age", "bmi", "sbp", "tc"), # input args for hazard
list(new_transition(
generic_update, # transition function for event
"cvd_count", # input args for transition
"cvd_count" # vars affected by transition
))
),
## deaths
new_hazard(
mort_fn,
c("male", "age", "~STEP"),
list(new_transition(transition_fn, c("death", "~STEP"), "death"))
)
)

We already have trajectories for age, and (SBP and TC) from above,
but would like to consider BMI as something that changes with age. We
will make the simplistic assumption that we can use the cross-sectional
pattern seen in our baseline population as a way to evolve BMI, and base
this on a regression. Note there is no problem with using
predict methods from regressions to define hazards or
trajectories:
## linear model of BMI vs age in our data
mm <- lm(bmi ~ 1 + age + I(age^2),
data = initPop
) # regression
## pretend that cross-sectional BMIs are a reasonable way to inform trajectories
bmi_update <- function(age) {
predict(mm, newdata = data.table(age = age))
}
bmi_update(25) # test
#> 1
#> 28.91341

We group these into a list of trajectories, creating one for BMI based on the update function as we go:
## trajectories
trajlist <- list(
## age
age_traj,
## BMI
new_trajectory(bmi_update, "age", "bmi"),
## SBP & TC handled in bivariate fashion
bptc_traj #trajectory fn
)

Finally, we will monitor the living population size, average age (as a check), and the average number of CVD events:

## history
history <- new_history(
columns = list(
new_column("no. alive", length, "age", filter_alive, "death"),
new_column("av. age alive", mean, "age", filter_alive, "death"),
new_column("av. cvd events", mean, "cvd_count", filter_alive, "death")
),
frequency = 1
)
Having defined our hazards, trajectories, and history, we can now create our parameter object and run the model:
parms <- new_parameters(
hazards = hazlist,
trajectories = trajlist,
steps = n_years,
debug = FALSE,
history = history
)
## run the simulation
ret <- run_simulation(initPop, parms)
#> Simulation Complete in 3.36 seconds!
#> <golden_timing>
#>
#> Hazards:
#> Hazard Total Time (s) Avg Time (s) % Runtime
#> <char> <num> <num> <num>
#> 1: globohaz 2.652 0.0262574255 78.8816154
#> 2: mort_fn 0.026 0.0002574257 0.7733492
#>
#> Transitions:
#> Transition Total Time (s) Avg Time (s) % Runtime
#> <char> <num> <num> <num>
#> 1: generic_update 0.000 0.000000e+00 0.0000000
#> 2: transition_fn 0.006 5.940594e-05 0.1784652
#>
#> Columns:
#> Column Total Time (s) Avg Time (s) % Runtime
#> <char> <num> <num> <num>
#> 1: no. alive 0.000 0.000000e+00 0.0000000
#> 2: av. age alive 0.007 6.930693e-05 0.2082094
#> 3: av. cvd events 0.000 0.000000e+00 0.0000000

Note that, for the first time in these examples, the timing
information is potentially useful: it shows that globohaz
is where much of the time is going for this simulation, and if run time
were an issue suggests where efforts to improve efficiency should be
directed.
Let’s have a look at a few of the outputs.
Have we run the model long enough that most people have died?
What age did people die at?
## now dead
ggplot(
ret$pop[death > 0],
aes(x = age, fill = factor(male), group = male)
) +
geom_histogram() +
theme(legend.position = "top")
#> `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

How many CVD events have those alive typically experienced?
The third member of the output list provides data.table
timing information, which will print like this:
ret$timing
#> <golden_timing>
#>
#> Hazards:
#> Hazard Total Time (s) Avg Time (s) % Runtime
#> <char> <num> <num> <num>
#> 1: globohaz 2.652 0.0262574255 78.8816154
#> 2: mort_fn 0.026 0.0002574257 0.7733492
#>
#> Transitions:
#> Transition Total Time (s) Avg Time (s) % Runtime
#> <char> <num> <num> <num>
#> 1: generic_update 0.000 0.000000e+00 0.0000000
#> 2: transition_fn 0.006 5.940594e-05 0.1784652
#>
#> Columns:
#> Column Total Time (s) Avg Time (s) % Runtime
#> <char> <num> <num> <num>
#> 1: no. alive 0.000 0.000000e+00 0.0000000
#> 2: av. age alive 0.007 6.930693e-05 0.2082094
#> 3: av. cvd events 0.000 0.000000e+00 0.0000000

The intent here is to provide some information on where the
simulation is spending most of its time to guide any optimizations. This
information can also be printed automatically at the end of a run by
setting the flag print_timing = TRUE (which is on by
default) when creating new_parameters. Note though that
this information is only printed in interactive mode, not, e.g.,
when generating rmarkdown reports.
Debug
The debug flag, set when creating
new_parameters() and defaulting to TRUE, enables
potentially expensive runtime validation. For example, if maths within
your model returns NaN, it will likely propagate and cause
dependent values to also equal NaN. This behaviour can be
difficult to trace back to its source. When debug = TRUE,
NaN and similar erroneous return values will be detected
when returned, and a corresponding exception raised. Once a model is
deemed stable, debug = FALSE can be set to remove the
performance overhead of this runtime validation. No benchmarking has
been carried out to estimate the impact of the debug
flag.
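A base-R sketch of the kind of check such validation performs (this is not golden’s actual code; note that NaN and NA are flagged, while Inf remains legitimate for deterministic hazards):

```r
## sketch of a debug-mode check on values returned by a model function
check_values <- function(x, name) {
  if (any(is.na(x))) { # is.na() is TRUE for both NA and NaN
    stop("invalid value returned by ", name)
  }
  x
}

h <- c(0.1, 0 / 0, 0.2)           # 0/0 silently produces NaN
try(check_values(h, "deathrate")) # raises an error naming the function
```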