HCPclust is an R package for Hierarchical Conformal Prediction (HCP), designed for clustered data where each subject has repeated measurements with within-subject dependence, and outcomes can be missing under a covariate-dependent Missing At Random (MAR) mechanism.
You can install the development version of HCPclust directly from GitHub using:
install.packages("remotes")
remotes::install_github("judywangstat/HCP")
library(HCPclust)The proposed method consists of four steps.
Step (1) Model fitting
fit_cond_density_quantile().fit_missingness_propensity().Both models are estimated on a subject-level training split to capture the distribution of \(Y \mid X\) and correct for covariate-dependent missingness.
Step (2) Subsampled calibration
This preserves the clustered structure while reducing the impact of within-subject dependence on conformal calibration.
Step (3) Weighted conformal scoring
Step (4) Aggregation
The final conformal prediction region contains all values \(y\) such that \(p_{\text{final}}(y) > \alpha.\)
generate_clustered_mar()
hetero_gamma,rho,target_missing.fit_cond_density_quantile()
method = "rq": linear quantile regression
(quantreg)method = "qrf": quantile random forest
(quantregForest)fit_missingness_propensity()
logistic: logistic regression (glm)grf: generalized random forests
(grf::probability_forest)boosting: gradient boosting (xgboost)[eps, 1 - eps] to
stabilize inverse-propensity weighting.hcp_conformal_region()
y_grid,
together with the interval endpoints [lo, hi].hcp_predict_targets()
hcp_conformal_region() that
automatically adapts to different test-data structures, including:
alpha_eff = alpha / M_p when a
patient has M_p > 1 target points.plot_hcp_intervals()
mode = "band": for one subject with repeated
measurements (e.g., over time), plots an interval band as a function of
a covariate.mode = "pid": for multiple subjects with one target
point each, displays one prediction interval per subject, optionally
sorted by a covariate.This section shows (i) how to call hcp_predict_targets()
under the four common test-data scenarios, and (ii) how to generate and
display two plots from plot_hcp_intervals().
hcp_predict_targets())set.seed(1)
dat_train <- generate_clustered_mar(
n = 200, m = 4, d = 1,
x_dist = "uniform", x_params = list(min = 0, max = 10),
target_missing = 0.30,
seed = 1
)
y_grid <- seq(-6, 6, length.out = 201)test_11 <- data.frame(pid = 1, X1 = 2.5)
out_11 <- hcp_predict_targets(
dat = dat_train, test = test_11,
x_cols = "X1", y_grid = y_grid,
alpha = 0.1,
seed = 1
)
out_11$predtest_1M <- data.frame(pid = 1, X1 = c(1, 3, 7, 9))
out_1M <- hcp_predict_targets(
dat = dat_train, test = test_1M,
x_cols = "X1", y_grid = y_grid,
alpha = 0.1,
seed = 1
)
out_1M$predtest_P1 <- data.frame(pid = 1:4, X1 = c(2, 4, 6, 8))
out_P1 <- hcp_predict_targets(
dat = dat_train, test = test_P1,
x_cols = "X1", y_grid = y_grid,
alpha = 0.1,
seed = 1
)
out_P1$predtest_PM <- data.frame(
pid = c(1,1, 2,2,2, 3,3),
X1 = c(1,6, 2,5,9, 3,8)
)
out_PM <- hcp_predict_targets(
dat = dat_train, test = test_PM,
x_cols = "X1", y_grid = y_grid,
alpha = 0.1,
seed = 1
)
out_PM$preddat_train <- generate_clustered_mar(
n = 200, m = 20, d = 1,
x_dist = "uniform", x_params = list(min = 0, max = 10),
hetero_gamma = 2.5,
target_missing = 0.30,
seed = 1
)
y_grid <- seq(-6, 10, length.out = 201)
# test data with latent truth
dat_test <- generate_clustered_mar(
n = 100, m = 20, d = 1,
x_dist = "uniform", x_params = list(min = 0, max = 10),
hetero_gamma = 2.5,
seed = 999
)pid <- dat_test$id[1]
idx <- which(dat_test$id == pid)
idx <- idx[order(dat_test$X1[idx])][1:10]
test_1M <- data.frame(pid = pid, X1 = dat_test$X1[idx], y_true = dat_test$Y_full[idx])
out_1M <- hcp_predict_targets(
dat = dat_train, test = test_1M,
x_cols = "X1", y_grid = y_grid,
alpha = 0.1,
S = 3, B = 3,
seed = 1
)
plot_hcp_intervals(
out_1M$pred, mode = "band", x_col = "X1",
y_true_col = "y_true", show_true = TRUE,
main = "Case A: one patient, multiple time points (band vs time)"
)
pids <- unique(dat_test$id)[1:20]
test_P1 <- subset(dat_test, id %in% pids & j == 1, select = c(id, X1, Y_full))
names(test_P1) <- c("pid", "X1", "y_true")
out_P1 <- hcp_predict_targets(
dat = dat_train, test = test_P1,
x_cols = "X1", y_grid = y_grid,
alpha = 0.1,
S = 3, B = 3,
seed = 1
)
plot_hcp_intervals(
out_P1$pred, mode = "pid", pid_col = "pid", x_sort_col = "X1",
y_true_col = "y_true", show_true = TRUE,
main = "Case B: multiple patients, one time point (by patient)"
)