ssp.glm
:
Subsampling for Generalized Linear ModelsThis vignette introduces the usage of the ssp.glm
using
logistic regression as an example of generalized linear models (GLM).
The statistical theory and algorithms in this implementation can be
found in the relevant reference papers.
The log-likelihood function for a GLM is
\[ \max_{\beta} L(\beta) = \frac{1}{N} \sum_{i=1}^N \left\{y_i u(\beta^{\top} x_i) - \psi \left[ u(\beta^{\top} x_i) \right] \right\}, \] where \(u\) and \(\psi\) are known functions depend on the distribution from the exponential family. For the binomial distribution, the log-likelihood function becomes
\[ \max_{\beta} L(\beta) = \frac{1}{N} \sum_{i=1}^N \left[y_i \beta^{\top} x_i - \log\left(1 + e^{\beta^\top x_i}\right) \right]. \]
The idea of subsampling methods is as follows: instead of fitting the model on the size \(N\) full dataset, a subsampling probability is assigned to each observation and a smaller, informative subsample is drawn. The model is then fitted on the subsample to obtain an estimator with reduced computational cost.
You can install the development version of subsampling from GitHub with:
Full dataset: The whole dataset used as input.
Full data estimator: The estimator obtained by fitting the model on the full dataset.
Subsample: A subset of observations drawn from the full dataset.
Subsample estimator: The estimator obtained by fitting the model on the subsample.
Subsampling probability (\(\pi\)): The probability assigned to each observation for inclusion in the subsample.
We introduce the usage of ssp.glm
with simulated data.
\(X\) contains \(d=6\) covariates drawn from multinormal
distribution, and \(Y\) is the binary
response variable. The full dataset size is \(N = 1 \times 10^4\).
set.seed(1)
N <- 1e4
beta0 <- rep(-0.5, 7)
d <- length(beta0) - 1
corr <- 0.5
sigmax <- matrix(corr, d, d) + diag(1-corr, d)
X <- MASS::mvrnorm(N, rep(0, d), sigmax)
colnames(X) <- paste("V", 1:ncol(X), sep = "")
P <- 1 - 1 / (1 + exp(beta0[1] + X %*% beta0[-1]))
Y <- rbinom(N, 1, P)
data <- as.data.frame(cbind(Y, X))
formula <- Y ~ .
head(data)
#> Y V1 V2 V3 V4 V5 V6
#> 1 1 -1.0918680 -0.4462684 -0.02250989 -0.19626329 -0.67460551 -0.4392570
#> 2 0 -0.1591053 -0.4748068 0.46515238 0.88370061 -0.05910325 0.1857218
#> 3 1 -1.6260754 -0.3394421 -0.68490712 -0.55721107 0.01024563 -0.6319413
#> 4 0 0.1251949 1.5113247 1.38931519 1.24287417 2.48829727 0.5534888
#> 5 0 0.1931921 -0.1478401 -0.14788926 0.46973556 0.05205022 1.0907459
#> 6 0 -0.2560258 -1.6065024 0.32710042 -0.04590727 -0.94748664 -1.2310368
The function usage is
ssp.glm(
formula,
data,
subset = NULL,
n.plt,
n.ssp,
family = "quasibinomial",
criterion = "optL",
sampling.method = "poisson",
likelihood = "weighted",
control = list(...),
contrasts = NULL,
...
)
The core functionality of ssp.glm
revolves around three
key questions:
How are subsampling probabilities computed? (Controlled by the
criterion
argument)
How is the subsample drawn? (Controlled by the
sampling.method
argument)
How is the likelihood adjusted to correct for bias? (Controlled
by the likelihood
argument)
criterion
The choices of criterion
include optA
,
optL
(default), LCC
and uniform
.
The optimal subsampling criterion optA
and
optL
are derived by minimizing the asymptotic covariance of
subsample estimator, proposed by Wang, Zhu, and
Ma (2018). LCC
and uniform
are baseline
methods.
sampling.method
The options for the sampling.method
argument include
withReplacement
and poisson
(default).
withReplacement
stands for drawing n.ssp
subsamples from full dataset with replacement, using the specified
subsampling probabilities. poisson
stands for drawing
subsamples one by one by comparing the subsampling probability with a
realization of uniform random variable \(U(0,1)\). The expected number of drawn
samples are n.ssp
. More details see Wang (2019).
likelihood
The available choices for likelihood
include
weighted
(default) and logOddsCorrection
. Both
of these likelihood functions can derive an unbiased estimator.
Theoretical results indicate that logOddsCorrection
is more
efficient than weighted
in the context of logistic
regression. See Wang and Kim (2022).
After drawing subsample, ssp.glm
utilizes
survey::svyglm
to fit the model on the subsample, which
eventually uses glm
. Arguments accepted by
svyglm
can be passed through ...
in
ssp.glm
.
Below are two examples demonstrating the use of ssp.glm
with different configurations.
n.plt <- 200
n.ssp <- 600
ssp.results <- ssp.glm(formula = formula,
data = data,
n.plt = n.plt,
n.ssp = n.ssp,
family = "quasibinomial",
criterion = "optL",
sampling.method = "withReplacement",
likelihood = "weighted"
)
summary(ssp.results)
#> Model Summary
#>
#> Call:
#>
#> ssp.glm(formula = formula, data = data, n.plt = n.plt, n.ssp = n.ssp,
#> family = "quasibinomial", criterion = "optL", sampling.method = "withReplacement",
#> likelihood = "weighted")
#>
#> Subsample Size:
#>
#> 1 Total Sample Size 10000
#> 2 Expected Subsample Size 600
#> 3 Actual Subsample Size 600
#> 4 Unique Subsample Size 561
#> 5 Expected Subample Rate 6%
#> 6 Actual Subample Rate 6%
#> 7 Unique Subample Rate 5.61%
#>
#> Coefficients:
#>
#> Estimate Std. Error z value Pr(>|z|)
#> Intercept -0.5876 0.0867 -6.7749 <0.0001
#> V1 -0.4725 0.1053 -4.4865 <0.0001
#> V2 -0.5252 0.1109 -4.7357 <0.0001
#> V3 -0.4789 0.1037 -4.6193 <0.0001
#> V4 -0.6400 0.1090 -5.8705 <0.0001
#> V5 -0.4937 0.1155 -4.2737 <0.0001
#> V6 -0.6226 0.1125 -5.5368 <0.0001
ssp.results <- ssp.glm(formula = formula,
data = data,
n.plt = n.plt,
n.ssp = n.ssp,
family = "quasibinomial",
criterion = "optA",
sampling.method = "poisson",
likelihood = "logOddsCorrection"
)
summary(ssp.results)
#> Model Summary
#>
#> Call:
#>
#> ssp.glm(formula = formula, data = data, n.plt = n.plt, n.ssp = n.ssp,
#> family = "quasibinomial", criterion = "optA", sampling.method = "poisson",
#> likelihood = "logOddsCorrection")
#>
#> Subsample Size:
#>
#> 1 Total Sample Size 10000
#> 2 Expected Subsample Size 600
#> 3 Actual Subsample Size 634
#> 4 Unique Subsample Size 634
#> 5 Expected Subample Rate 6%
#> 6 Actual Subample Rate 6.34%
#> 7 Unique Subample Rate 6.34%
#>
#> Coefficients:
#>
#> Estimate Std. Error z value Pr(>|z|)
#> Intercept -0.5052 0.0775 -6.5171 <0.0001
#> V1 -0.4788 0.0992 -4.8288 <0.0001
#> V2 -0.4893 0.0985 -4.9662 <0.0001
#> V3 -0.4037 0.0997 -4.0485 <0.0001
#> V4 -0.6308 0.0988 -6.3844 <0.0001
#> V5 -0.5781 0.1029 -5.6208 <0.0001
#> V6 -0.4679 0.1001 -4.6747 <0.0001
As recommended by survey::svyglm
, when working with
binomial models, it is advisable to use use
family=quasibinomial()
to avoid a warning issued by
glm
. Refer to svyglm()
help documentation Details. The ‘quasi’ version of the family
objects provide the same point estimates.
The object returned by ssp.glm
contains estimation
results and indices of the drawn subsample in the full dataset.
names(ssp.results)
#> [1] "model.call" "coef.plt" "coef.ssp"
#> [4] "coef" "cov.ssp" "cov"
#> [7] "index.plt" "index" "N"
#> [10] "subsample.size.expect" "terms"
Some key returned variables:
index.plt
and index
are the row indices
of drawn pilot subsamples and optimal subsamples in the full
data.
coef.ssp
is the subsample estimator for \(\beta\) and coef
is the linear
combination of coef.plt
(pilot estimator) and
coef.ssp
.
cov.ssp
and cov
are estimated
covariance matrices of coef.ssp
and
coef
.
The coefficients and standard errors printed by
summary()
are coef
and the square root of
diag(cov)
. See the help documentation of
ssp.glm
for details.
We also provide examples for poisson regression and gamma regression
in the help documentation of ssp.glm
. Note that
likelihood = logOddsCorrection
is currently implemented
only for logistic regression (family = binomial
or
quasibonomial
).