This vignette shows how to tune the hyperparameters of the machine
learning algorithms implemented in ReSurv
using the
approach in Snoek, Larochelle, and Adams
(2012) implemented in the R package
ParBayesianOptimization
(Wilson
(2022)). For illustrative purposes, we will simulate daily claim
notifications from one of the scenarios introduced in Hiabu, Hofman, and Pittarello (2023) (scenario
Alpha).
input_data_0 <- data_generator(
random_seed = 1964,
scenario = 0,
time_unit = 1 / 360,
years = 4,
period_exposure = 200
)
The counts data are then pre-processed using the
IndividualDataPP
function.
In the manuscript we fit the proportional likelihood of a cox model using extreme gradient boosting (XGB) and feed-forward neural networks (NN). The advantage of using the bayesian hyperparameters selection algorithm is in terms the computation time, and of a wide range of parameter choices (see again Snoek, Larochelle, and Adams (2012)). We show the ranges we inspected in the manuscript in the following table.
Model | Hyperparameter | Range |
---|---|---|
NN | num_layers |
\(\left[2,10\right]\) |
num_nodes |
\(\left[2,10\right]\) | |
optim |
\(\left[1,2\right]\) | |
activation |
\(\left[1,2\right]\) | |
lr |
\(\left[.005,0.5\right]\) | |
xi |
\(\left[0,0.5\right]\) | |
eps |
\(\left[0,0.5\right]\) | |
XGB | eta |
\(\left[0,1\right]\) |
max_depth |
\(\left[0,25\right]\) | |
min_child_weight |
\(\left[0,50\right]\) | |
lambda |
\(\left[0,50\right]\) | |
alpha |
\(\left[0,50\right]\) |
In the following part of this vignette, we will discuss the steps required to optimize the hyperparameters with NN using the approach of Snoek, Larochelle, and Adams (2012). We will provide the code we used for XGB as it follows a similar flow.
In ReSurv
, we have our own implementation of a standard
K-Fold cross-validation (Hastie et al.
(2009)), namely the ReSurvCV
method of an
IndividualDataPP
object. We show an illustrative example
for XGB below. Even if this routine can be used alone for choosing the
optimal parameters, we suggest to use it within the bayesian approach of
Snoek, Larochelle, and Adams (2012). An
example is provided in the following chunk for NN.
resurv_cv_xgboost <- ReSurvCV(
IndividualDataPP = individual_data,
model = "XGB",
hparameters_grid = list(
booster = "gbtree",
eta = c(.001, .01, .2, .3),
max_depth = c(3, 6, 8),
subsample = c(1),
alpha = c(0, .2, 1),
lambda = c(0, .2, 1),
min_child_weight = c(.5, 1)
),
print_every_n = 1L,
nrounds = 500,
verbose = FALSE,
verbose.cv = TRUE,
early_stopping_rounds = 100,
folds = 5,
parallel = T,
ncores = 2,
random_seed = 1
)
The ReSurv
neural network implementation uses
reticulate
to interface R Studio to Python and it is based
on a similar approach to Katzman et al.
(2018), corrected to account for left-truncation and ties in the
data. Similarly to the original implementation we relied on the Python
library pytorch
(Paszke et al.
(2019)). The syntax of our NN is then the syntax of
pytorch
. See the reference
guide for further information on the NN parametrization.
In order to use the ParBayesianOptimization
package, we
first need to specify the NN parameter ranges we are interested into a
vector. We discussed in the last section the ranges we choose for each
algorithm.
bounds <- list(
num_layers = c(2L, 10L),
num_nodes = c(2L, 10L),
optim = c(1L, 2L),
activation = c(1L, 2L),
lr = c(.005, 0.5),
xi = c(0, 0.5),
eps = c(0, 0.5)
)
Secondly, we need to specify an objective function to be optimized
with the Bayesian approach. The ParBayesianOptimization
package can be loaded as
The score metric we inspect is the negative (partial) likelihood. The likelihood is returned with negative sign as Wilson (2022) is maximizing the objective function.
obj_func <- function(num_layers,
num_nodes,
optim,
activation,
lr,
xi,
eps) {
optim = switch(optim, "Adam", "SGD")
activation = switch(activation, "LeakyReLU", "SELU")
batch_size = as.integer(5000)
number_layers = as.integer(num_layers)
num_nodes = as.integer(num_nodes)
deepsurv_cv <- ReSurvCV(
IndividualDataPP = individual_data,
model = "NN",
hparameters_grid = list(
num_layers = num_layers,
num_nodes = num_nodes,
optim = optim,
activation = activation,
lr = lr,
xi = xi,
eps = eps,
tie = "Efron",
batch_size = batch_size,
early_stopping = 'TRUE',
patience = 20
),
epochs = as.integer(300),
num_workers = 0,
verbose = FALSE,
verbose.cv = TRUE,
folds = 3,
parallel = FALSE,
random_seed = as.integer(Sys.time())
)
lst <- list(
Score = -deepsurv_cv$out.cv.best.oos$test.lkh,
train.lkh = deepsurv_cv$out.cv.best.oos$train.lkh
)
return(lst)
}
As a last step, we use the bayesOpt
function to perform
the optimization.
bayes_out <- bayesOpt(
FUN = obj_func,
bounds = bounds,
initPoints = 50,
iters.n = 1000,
iters.k = 50,
otherHalting = list(timeLimit = 18000)
)
To select the optimal hyperparameters we inspect
bayes_out$scoreSummary
output. Below we print the first
five rows of one of our runs. Observe scoreSummary
is a
data.table
that contains some parameters specific of the
original implementation (see Wilson (2022)
for more details)
Epoch | Iteration | gpUtility | acqOptimum | inBounds | errorMessage |
---|---|---|---|---|---|
0 | 1 | FALSE | TRUE | ||
0 | 2 | FALSE | TRUE | ||
0 | 3 | FALSE | TRUE | ||
0 | 4 | FALSE | TRUE | ||
0 | 5 | FALSE | TRUE |
and the parameters we specified during the optimization:
num_layers | num_nodes | optim | activation | lr | xi | eps | batch_size | Elapsed | Score | train.lkh |
---|---|---|---|---|---|---|---|---|---|---|
9 | 8 | 1 | 2 | 0.08 | 0.35 | 0.03 | 1226 | 6094.91 | -6.24 | 6.28 |
9 | 2 | 2 | 1 | 0.47 | 0.50 | 0.10 | 3915 | 7307.31 | -7.27 | 7.30 |
9 | 9 | 2 | 1 | 0.40 | 0.49 | 0.18 | 196 | 6719.70 | -5.98 | 5.97 |
8 | 8 | 1 | 2 | 0.03 | 0.23 | 0.01 | 4508 | 8893.46 | -7.39 | 7.41 |
9 | 7 | 2 | 1 | 0.13 | 0.13 | 0.12 | 900 | 3189.15 | -6.21 | 6.23 |
We select the final combination that minimizes the negative (partial)
likelihood, in the Score
column.
In a similar fashion, we can optimize the gradient boosting parameters. We first set the hyperparameters grid.
bounds <- list(
eta = c(0, 1),
max_depth = c(1L, 25L),
min_child_weight = c(0, 50),
subsample = c(0.51, 1),
lambda = c(0, 15),
alpha = c(0, 15)
)
Secondly, we define an objective function.
obj_func <- function(eta,
max_depth,
min_child_weight,
subsample,
lambda,
alpha) {
xgbcv <- ReSurvCV(
IndividualDataPP = individual_data,
model = "XGB",
hparameters_grid = list(
booster = "gbtree",
eta = eta,
max_depth = max_depth,
subsample = subsample,
alpha = lambda,
lambda = alpha,
min_child_weight = min_child_weight
),
print_every_n = 1L,
nrounds = 500,
verbose = FALSE,
verbose.cv = TRUE,
early_stopping_rounds = 30,
folds = 3,
parallel = FALSE,
random_seed = as.integer(Sys.time())
)
lst <- list(
Score = -xgbcv$out.cv.best.oos$test.lkh,
train.lkh = xgbcv$out.cv.best.oos$train.lkh
)
return(lst)
}
Lastly, we perform the optimization in a parallel setting using the
DoParallel
package, as recommended in ‘BayesOpt’.
library(DoParallel)
cl <- makeCluster(parallel::detectCores())
registerDoParallel(cl)
clusterEvalQ(cl, {
library("ReSurv")
})
bayes_out <- bayesOpt(
FUN = obj_func
,
bounds = bounds
,
initPoints = length(bounds) + 20
,
iters.n = 1000
,
iters.k = 50
,
otherHalting = list(timeLimit = 18000)
,
parallel = TRUE
)