HCPclust

HCPclust is an R package for Hierarchical Conformal Prediction (HCP), designed for clustered data where each subject has repeated measurements with within-subject dependence, and outcomes can be missing under a covariate-dependent Missing At Random (MAR) mechanism.

Installation

You can install the development version of HCPclust directly from GitHub using:

install.packages("remotes")
remotes::install_github("judywangstat/HCP")
library(HCPclust)

Method overview

The proposed method consists of four steps.

Step (1) Model fitting

Both models are estimated on a subject-level training split to capture the distribution of \(Y \mid X\) and correct for covariate-dependent missingness.

Step (2) Subsampled calibration

This preserves the clustered structure while reducing the impact of within-subject dependence on conformal calibration.

Step (3) Weighted conformal scoring

Step (4) Aggregation

The final conformal prediction region contains all values \(y\) such that \(p_{\text{final}}(y) > \alpha.\)

What’s included

1) Data generator

2) Conditional density estimator

3) Missingness propensity model

4) HCP conformal prediction

5) Plotting

Quick examples

This section shows (i) how to call hcp_predict_targets() under the four common test-data scenarios, and (ii) how to generate and display two plots from plot_hcp_intervals().

A) Four scenarios for testing data (via hcp_predict_targets())

Generate training data (fixed across all cases)

set.seed(1)
dat_train <- generate_clustered_mar(
  n = 200, m = 4, d = 1,
  x_dist = "uniform", x_params = list(min = 0, max = 10),
  target_missing = 0.30,
  seed = 1
)
y_grid <- seq(-6, 6, length.out = 201)

Case 1: one patient, one target

test_11 <- data.frame(pid = 1, X1 = 2.5)
out_11 <- hcp_predict_targets(
  dat = dat_train, test = test_11,
  x_cols = "X1", y_grid = y_grid,
  alpha = 0.1,
  seed = 1
)
out_11$pred

Case 2: one patient, multiple targets

test_1M <- data.frame(pid = 1, X1 = c(1, 3, 7, 9))
out_1M <- hcp_predict_targets(
  dat = dat_train, test = test_1M,
  x_cols = "X1", y_grid = y_grid,
  alpha = 0.1,
  seed = 1
)
out_1M$pred

Case 3: multiple patients, one target each

test_P1 <- data.frame(pid = 1:4, X1 = c(2, 4, 6, 8))
out_P1 <- hcp_predict_targets(
  dat = dat_train, test = test_P1,
  x_cols = "X1", y_grid = y_grid,
  alpha = 0.1,
  seed = 1
)
out_P1$pred

Case 4: multiple patients, multiple targets per patient

test_PM <- data.frame(
  pid = c(1,1, 2,2,2, 3,3),
  X1  = c(1,6,  2,5,9,  3,8)
)
out_PM <- hcp_predict_targets(
  dat = dat_train, test = test_PM,
  x_cols = "X1", y_grid = y_grid,
  alpha = 0.1,
  seed = 1
)
out_PM$pred

B) Two plots

Generate training data and test data

dat_train <- generate_clustered_mar(
  n = 200, m = 20, d = 1,
  x_dist = "uniform", x_params = list(min = 0, max = 10),
  hetero_gamma = 2.5,
  target_missing = 0.30,
  seed = 1
)
y_grid <- seq(-6, 10, length.out = 201)

# test data with latent truth
dat_test <- generate_clustered_mar(
  n = 100, m = 20, d = 1,
  x_dist = "uniform", x_params = list(min = 0, max = 10),
  hetero_gamma = 2.5,
  seed = 999
)

Plot A: one patient, multiple targets (band vs time)

pid <- dat_test$id[1]
idx <- which(dat_test$id == pid)
idx <- idx[order(dat_test$X1[idx])][1:10]
test_1M <- data.frame(pid = pid, X1 = dat_test$X1[idx], y_true = dat_test$Y_full[idx])

out_1M <- hcp_predict_targets(
  dat = dat_train, test = test_1M,
  x_cols = "X1", y_grid = y_grid,
  alpha = 0.1,
  S = 3, B = 3,
  seed = 1
)
plot_hcp_intervals(
  out_1M$pred, mode = "band", x_col = "X1",
  y_true_col = "y_true", show_true = TRUE,
  main = "Case A: one patient, multiple time points (band vs time)"
)

Plot B: multiple patients, one target each (intervals by patient)

pids <- unique(dat_test$id)[1:20]
test_P1 <- subset(dat_test, id %in% pids & j == 1, select = c(id, X1, Y_full))
names(test_P1) <- c("pid", "X1", "y_true")

out_P1 <- hcp_predict_targets(
  dat = dat_train, test = test_P1,
  x_cols = "X1", y_grid = y_grid,
  alpha = 0.1,
  S = 3, B = 3,
  seed = 1
)
plot_hcp_intervals(
  out_P1$pred, mode = "pid", pid_col = "pid", x_sort_col = "X1",
  y_true_col = "y_true", show_true = TRUE,
  main = "Case B: multiple patients, one time point (by patient)"
)