Introduction to ssp.quantreg: Subsampling for Quantile Regression

library(subsampling)

This vignette introduces the usage of ssp.quantreg. The statistical theory and algorithms behind this implementation can be found in the relevant reference papers.

Quantile regression aims to estimate conditional quantiles by minimizing the following loss function:

\[ \min_{\beta} L(\beta) = \frac{1}{N} \sum_{i=1}^{N} \rho_\tau \left( y_i - \beta^\top x_i \right) = \frac{1}{N} \sum_{i=1}^{N} \left( y_i - \beta^\top x_i \right) \left\{ \tau - I \left( y_i < \beta^\top x_i \right) \right\}, \] where \(\tau\) is the quantile of interest, \(y\) is the response variable, \(x\) is covariates vector and \(N\) is the number of observations in full dataset.

The idea of subsampling methods is as follows: instead of fitting the model on the size \(N\) full dataset, a subsampling probability is assigned to each observation and a smaller, informative subsample is drawn. The model is then fitted on the subsample to obtain an estimator with reduced computational cost.

Terminology

Example

We introduce ssp.quantreg with simulated data. \(X\) contains \(d=6\) covariates drawn from multinormal distribution and \(Y\) is the response variable. The full data size is \(N = 1 \times 10^4\). The interested quantile \(\tau=0.75\).

set.seed(1)
N <- 1e4
tau <- 0.75
beta.true <- rep(1, 7)
d <- length(beta.true) - 1
corr  <- 0.5
sigmax  <- matrix(0, d, d)
for (i in 1:d) for (j in 1:d) sigmax[i, j] <- corr^(abs(i-j))
X <- MASS::mvrnorm(N, rep(0, d), sigmax)
err <- rnorm(N, 0, 1) - qnorm(tau)
Y <- beta.true[1] + X %*% beta.true[-1] + err * rowMeans(abs(X))
data <- as.data.frame(cbind(Y, X))
colnames(data) <- c("Y", paste("V", 1:ncol(X), sep=""))
formula <- Y ~ .
head(data)
#>            Y         V1          V2          V3         V4         V5
#> 1  3.0813580 -0.1825325 -0.01613791 -0.01852406  1.0672454  0.9353870
#> 2  0.1114953 -0.3829652 -1.20674035 -0.33354934  0.3818526  0.6610612
#> 3  4.0233475 -0.1384141  0.35758454 -0.08962728  0.8591475  0.7554356
#> 4 -7.0116774 -0.7668158 -1.07028901 -2.57374497 -1.4283868 -0.4782146
#> 5 -1.2551700 -0.9557206 -0.82219260  0.47905721  0.1096016 -0.3116279
#> 6  2.6764218  0.8646208 -0.32527175  0.23441106  0.5800169  1.8153229
#>            V6
#> 1  0.44382164
#> 2  0.12626628
#> 3  1.63208199
#> 4  1.10717085
#> 5 -0.08180055
#> 6 -0.03612645

Key Arguments

The function usage is

ssp.quantreg(
  formula,
  data,
  subset = NULL,
  tau = 0.5,
  n.plt,
  n.ssp,
  B = 5,
  boot = TRUE,
  criterion = "optL",
  sampling.method = "withReplacement",
  likelihood = c("weighted"),
  control = list(...),
  contrasts = NULL,
  ...
)

The core functionality of ssp.quantreg revolves around three key questions:

criterion

criterion stands for the criterion we choose to compute the sampling probability for each observation. The choices of criterion include optL(default) and uniform. In optL, the optimal subsampling probability is by minimizing a transformation of the asymptotic variance of subsample estimator. uniform is a baseline method.

sampling.method

The options for the sampling.method argument include withReplacement (default) and poisson. withReplacement stands for drawing \(n.ssp\) subsamples from full dataset with replacement, using the specified subsampling probabilities. poisson stands for drawing subsamples one by one by comparing the subsampling probability with a realization of uniform random variable \(U(0,1)\). The expected number of drawn samples are \(n.ssp\).

likelihood

The available choice for likelihood in ssp.quantreg is weighted. It takes the inverse of sampling probabblity as the weights in likelihood function to correct the bias introduced by unequal subsampling probabilities.

boot and B

An option for drawing \(B\) subsamples (each with expected size n.ssp) and deriving subsample estimator and asymptotic covariance matrix based on them. After getting \(\hat{\beta}_{b}\) on the \(b\)-th subsample, \(b=1,\dots B\), it calculates

\[ \hat{\beta}_I = \frac{1}{B} \sum_{b=1}^{B} \hat{\beta}_{b} \] as the final subsample estimator and \[ \hat{V}(\hat{\beta}_I) = \frac{1}{r_{ef} B (B - 1)} \sum_{b=1}^{B} \left( \hat{\beta}_{b} - \hat{\beta}_I \right)^{\otimes 2}, \] where \(r_{ef}\) is a correction term for effective subsample size since the observations in each subsample can be replicated. For more details, see Wang and Ma (2021).

Outputs

After drawing subsample(s), ssp.quantreg utilizes quantreg::rq to fit the model on the subsample(s). Arguments accepted by quantreg::rq can be passed through ... in ssp.quantreg.

Below are two examples demonstrating the use of ssp.quantreg with different configurations.

B <- 5
n.plt <- 200
n.ssp <- 200
ssp.results1 <- ssp.quantreg(formula, 
                             data, 
                             tau = tau, 
                             n.plt = n.plt,
                             n.ssp = n.ssp,
                             B = B, 
                             boot = TRUE, 
                             criterion = 'optL',
                             sampling.method = 'withReplacement', 
                             likelihood = 'weighted'
                             )

ssp.results2 <- ssp.quantreg(formula, 
                             data, 
                             tau = tau, 
                             n.plt = n.plt,
                             n.ssp = n.ssp,
                             B = B, 
                             boot = FALSE, 
                             criterion = 'optL',
                             sampling.method = 'withReplacement', 
                             likelihood = 'weighted'
                             )

Returned object

The returned object contains estimation results and index of drawn subsample in the full dataset.

names(ssp.results1)
#> [1] "model.call"            "coef.plt"              "coef"                 
#> [4] "cov"                   "index.plt"             "index.ssp"            
#> [7] "N"                     "subsample.size.expect" "terms"
summary(ssp.results1)
#> Model Summary
#> 
#> 
#> Call:
#> 
#> ssp.quantreg(formula = formula, data = data, tau = tau, n.plt = n.plt, 
#>     n.ssp = n.ssp, B = B, boot = TRUE, criterion = "optL", sampling.method = "withReplacement", 
#>     likelihood = "weighted")
#> 
#> Subsample Size:
#> [1] 1000
#> 
#> Coefficients:
#> 
#>           Estimate Std. Error z value Pr(>|z|)
#> Intercept   0.9753     0.0324 30.0654  <0.0001
#> V1          0.9701     0.0220 44.1763  <0.0001
#> V2          1.0295     0.0394 26.1369  <0.0001
#> V3          0.9980     0.0209 47.8506  <0.0001
#> V4          0.9834     0.0609 16.1529  <0.0001
#> V5          1.0508     0.0301 34.8848  <0.0001
#> V6          0.9441     0.0327 28.8878  <0.0001
summary(ssp.results2)
#> Model Summary
#> 
#> 
#> Call:
#> 
#> ssp.quantreg(formula = formula, data = data, tau = tau, n.plt = n.plt, 
#>     n.ssp = n.ssp, B = B, boot = FALSE, criterion = "optL", sampling.method = "withReplacement", 
#>     likelihood = "weighted")
#> 
#> Subsample Size:
#> [1] 1000
#> 
#> Coefficients:
#> 
#>           Estimate
#> Intercept   0.9510
#> V1          1.0234
#> V2          1.0562
#> V3          1.0221
#> V4          0.9676
#> V5          0.9698
#> V6          1.0445

Some key returned variables:

See the help documentation of ssp.quantreg for details.

References

Wang, Haiying, and Yanyuan Ma. 2021. “Optimal Subsampling for Quantile Regression in Big Data.” Biometrika 108 (1): 99–112.