quantregGrowth: Non-crossing additive quantile regression for smooth and semiparametric models with focus on growth charts

Vito M.R. Muggeo

2024-05-20

The package quantregGrowth aims to estimate the smooth, but unspecified, effect of numerical covariate(s) on one or more quantiles of the numerical response variable. The quantile regression equation can also include linear effect(s) of numerical or categorical covariates. Each flexible and unspecified relationship is expressed via a B-spline basis whose coefficients are somehow shrunken to control the wiggliness of the curve(s). The amount of smoothness is estimated as part of the model fitting as described in Muggeo et al. (2021).

The package relies on the well known quantreg package by Roger Koenker by exploiting full functionality and efficiency of its fitter functions.

Growth charts (for the eager :-))

When focus is on multiple response quantiles (1%, 5%,…, 95%, say) depending on a single variable, the resulting fitted quantile curves are also referred as growth charts. Growth charts are smooth quantile trajectories at different probability values, ‘tau’, with the noncrossing property, namely the quantile curve at any specified tau value has to be always higher then the quantile curve at any lower probability value. Growth charts are used to build reference values of an outcome of interest with respect to another variable, usually age or time. Typical examples include height (or weight) or vocabulary size development versus the children’s age as investigated in wordbank website.

quantregGrowth uses B-splines with a penalty on the coefficients and relevant smoothing parameter `selected’ within the estimation algorithm, and imposes proper constraints to prevent quantile crossing.

We build growth charts of height versus age using the data set SiChildren shipped with the package. The main function is gcrq() which requires the usual model formula including the ps() function on the right hand side. gcrq() returns quantile curves at (default) probability values c(.1, .25, .5, .75, .9), but in this example we choose different values via the argument tau, and display the result by means of the plot method function:

library(quantregGrowth)
data(SiChildren)

o <-gcrq(height ~ ps(age), tau=seq(.05,.95,l=7), data=SiChildren)

plot(o, res=TRUE, col=-1) #res=TRUE displays data too; col<0 for the default palette
plot of chunk 1

plot of chunk 1

In addition to the graphical output, we could look at the estimated quantile values via the function charts() which is a simple wrapper of predict.gcrq() to facilitate production of growth charts based on the fitted model. We display the quantiles just at some age values

charts(o, k=c(10,10.5,11,16,17)) #the quantile at the specified k values
>            0.05    0.2   0.35    0.5   0.65    0.8   0.95
> age=10   135.24 138.35 141.05 143.00 144.23 146.29 149.83
> age=10.5 133.02 136.86 140.29 142.25 144.00 146.85 151.83
> age=11   133.41 137.75 141.17 143.14 145.24 148.52 154.43
> age=16   154.36 160.47 164.16 168.39 170.16 172.18 176.94
> age=17   159.20 163.33 165.38 171.00 171.64 172.47 175.38

Results look reasonable, but there are some annoying features which need to be stressed: at around age=16 the highest quantile curve (the 95th) decreases, while some quantile curves (the 5th to the 65th) drop at early ages around 11 years and then increase again afterwards.

Broadly speaking, non monotonic growth charts could be biologically sound (for instance when modelling the body mass index), but in this example they are not. Likely, the sparseness of data on the left and right tails of age distribution, causes the drops in the fitted curves. Hence we re-fit the model by imposing positive monotonicity contraints via the argument monotone=1 in ps(),

oM<-gcrq(height ~ ps(age, monotone=1), data=SiChildren, tau=seq(.05,.95,l=7))

In the following, we report two plots: the first one focuses on the early ages by stressing differences between the fitted curves, and the second figure could be possibly closer to the usual growth charts, with legend (set at the age value overlap=15), underlying grid (with 15 vertical and 10 horizontal lines) and `customized’ axis labels.

par(mfrow=c(1,2))
#the 1st plot..
plot(o, xlim=c(10.2,12.5), col=1, lty=3, lwd=2)
plot(oM, add=TRUE, col=2, lty=1, lwd=2)
#the 2nd..
plot(oM, legend=TRUE, overlap=15, grid=list(x=15,y=10), col=2, lty=1, 
     ylab="Height (cm)", xlab="Age (years)")
plot of chunk 1b

plot of chunk 1b

Several options are available when plotting a gcrq object, see ?plot.gcrq.

By default ps() uses 11 basis functions which is appropriate for most of situations met in practice. However, if the underlying signal is strongly nonlinear, a richer basis is recommended to better capture changes in the curve. The following example presents data with a Doppler-like trend

n <- 20000
x <- 1:n/n
mu <- sqrt(x*(1-x))*sin((2*pi*(1+2^((9-4*6)/5)))/(x+2^((9-4*6)/5)))
set.seed(3008) #just for riproducibility
y <- mu+0.2*rnorm(n)

When the reader fits the model using the default values, the results appear unsatisfactory with the fitted curves missing the general trend. The practitioner can check that increasing ndx improves the smoothing on the left side, but two issues rise: i) the computational load increases as the number of spline coefficients (which depend on ndx); ii) at larger covariate values the curve is clearly undersmoothed. gcrq offers two solutions at the aforementioned issues. Firstly, to improve computational efficiency we could exploit sparse algebra methods via argument sparse as implemented by the package SparseM; and secondly, to get a fitted curve with different amount of smoothing across the covariate range, we can exploit adaptive smoothing, by means of the argument ad in ps(). Therefore the model can be fitted via

as <-gcrq(y ~ 0+ps(x, ndx=100, ad=.8), sparse=TRUE)
par(mfrow=c(2,1))
plot(as)
plot(as, col=4, n.points=500, add=TRUE)
#the same plot but zoomed in on the left side
plot(as, xlim=c(0,.1))
plot(as, col=4, n.points=500, add=TRUE)
lines(x, mu, col=1, lwd=2)
plot of chunk 1d

plot of chunk 1d

Note we need to increase the number of values (by n.points in plot.gcrq()) to portray correctly the lines on the left side.

Additive model

quantregGrowth can deal with multiple smooth terms straightforwardly. We use data simulated via mgcv::gamSim function (example 1). We are interested in modelling nonparametrically the effect of four covariates on the median, say. At this aim, we call gcrq() with several ps() terms and a single tau value,

set.seed(1515)
d<-mgcv::gamSim(n=200, eg=1, verbose=FALSE) #verbose=FALSE just suppresses the message..

o <- gcrq(y ~ ps(x0) + ps(x1) + ps(x2) + ps(x3), data=d, tau=.5)

The plots including the fitted relationships for all 4 terms are ready obtained by

# Plot the fit
plot(o, res=TRUE, col=2, conf.level=.95, shade=TRUE, cex.p=.6, split=TRUE) #cex.p<1 to reduce the points..
plot of chunk fig-margin

plot of chunk fig-margin

where the y-label on each plot displays the edf of each smooth term. Note, when there are multiple covariates, `res=TRUE` portrays the partial residuals (and not the observed values).

Additive model with linear terms

gcrq() can also include standard linear terms: for instance, the above plots suggest that a simple linear term would suffice to capture the relationships for x1 and x3. Therefore in the next model formula we include these variables outside the ps() function. We also display the model output via summary.gcrq().

o1<-gcrq(y ~ ps(x0) + x1 + ps(x2) + x3, data=d, tau=.5)

#the summary method
summary(o1)
> Call:
> gcrq(formula = y ~ ps(x0) + x1 + ps(x2) + x3, tau = 0.5, data = d)
> 
> --------  Percentile: 0.50   check function:  157.69 -----
> 
> parametric terms:
>                Est  StErr  |z|  p-value    
> (Intercept) 4.0077 0.6471 6.19 5.88e-10 ***
> x1          6.7935 0.8178 8.31  < 2e-16 ***
> x3          0.6163 0.8100 0.76    0.447    
> ---
> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
> 
> smooth terms:
>          edf
> ps(x0) 2.000
> ps(x2) 4.999
> =========================
> 
> No. of obs = 200    No. of params = 25 (25 for each quantile) 
> Overall check = 157.69  SIC = -0.105 on edf = 10 (ncross constr: FALSE)

For the parametric terms, the summary method returns point estimates, standard errors based on the sandwich formula, and \(p\)-values coming from the Wald statistic. Degrees of freedom for the smooth terms are printed next, along with some useful information of the fit.

Of course we could estimate the same ‘semiparametric’ model at several probability values

o2<-gcrq(y ~ x1 + x3 + ps(x0) + ps(x2), data=d, tau=c(.12,.25,.5,.7,.8,.9))

with the non-crossing constraints. Summary and plots can be obtained straightforwardly.

Varying coefficient models

A particular case of ‘semiparametric’ model is given by the so-called varying coefficient (VC) model, wherein the linear effect of a (even categorical) covariate, \(x_1\) say, depends flexibly on a continuous variable \(x_2\). The relevant term in the linear predictor can be written \(\beta(x_2)x_1\), where \(\beta(\cdot)\) is the smooth expressed via B-splines. We simulate data via the gamSim() function again.

d<-mgcv::gamSim(n=200, eg=3, scale=1) #simulated data with VC effect
> Continuous `by' variable example

and the fit is obtained by means of the argument by in ps(),

o <- gcrq(y~ x1 + ps(x2, by=x1), data=d, tau=.5)

It should be noted we also specify the linear term x1, since ps() makes the basis identifiable. The fitted curve can be displayed as usual via plot.gcrq().

When the linear covariate x1 is categorical, e.g. the gender, fitting VC terms means to fit the smooth effect of x2 into each category of x1. We illustrate that with a simulated dataset

n=50
x0<-x1<-x<-1:n/n
y0<-10+sin(2*pi*x) #sinusoidal relationship + intercept
y1<-seq(7,11,l=n)  #linear relationship + intercept
sigma<- .2
y<-c(y0,y1)+rnorm(2*n)*sigma
x<-c(x,x)
z<-rep(0:1, each=n) #numeric variable
g <-factor(z)  #factor
o<-gcrq(y ~ g + ps(x, by=g), tau=.5)

Again, the same model could be obtained using a full B-spline in each group, i.e. gcrq(y ~ 0 + ps(x, by=g, dropc=FALSE, center=FALSE), tau=.5).

We compare the true and fitted quantile curves in the next plots

par(mfrow=c(1,2))
plot(x0, y[1:50]);lines(x0,y0)
plot(o, term=1, add=TRUE, col=2, lwd=2)

plot(x1, y[51:100]);lines(x1,y1)
plot(o, term=2, add=TRUE, col=3, lwd=2, shift=coef(o)["g1",1])
plot of chunk unnamed-chunk-6

plot of chunk unnamed-chunk-6

To overlap the lines in the second group, we add the estimated coefficient of the corresponding category (`g1`) to the fitted quantile curve (via the argument `shift` of `plot.gcrq()`. When the variable in `by` is categorical, `gcrq()` automatically knows that category-specific curves have to be fitted. However it could be probably instructive to fit the same model using the numerical covariate `z`, namely
o1<-gcrq(y ~ z + ps(x) + ps(x, by=z), tau=.5)

Note the plot of the VC term ps(x, by=z) represents the difference between the smooth relationships y1-y0, bar a constant depending on the intercepts.

Simple linear models

gcrq() can fit purely linear models, i.e. with no ps() term in the formula. Here a simple example.

n=100
x<-1:n/n
y<-2*x+rnorm(n)*rev(x)

o3 <-gcrq(y ~ x, tau=seq(.05,.95,l=10)) #fit 10 (noncrossing) regression quantiles

By means of plot.gcrq() with proper arguments, we display in the first plot the linear growth charts, and in the 2nd figure, we portray the tau-varying coefficient by setting axis.tau=TRUE,

par(mfrow=c(1,2))
plot(x,y)
plot(o3, add=TRUE)
plot(o3, term="x", axis.tau=TRUE, conf.level = .95, col=2)
plot of chunk unnamed-chunk-9

plot of chunk unnamed-chunk-9

The 2nd plot could be useful to assess if and how the specified coefficient estimate changes across the quantile curves.

Customizing the penalty

A possibly useful feature of quantregGrowth is supplying a user-defined (multiplicative) penalty via the argument pen.matrix in ps(). The penalty matrix \(A\), say, should be a matrix such that \(\lambda||A\beta||_1\) is the penalization in the objective to be minimized. \(\beta\) is the vector of spline coefficients and \(\lambda\) is the smoothing parameter fixed or to be estimated. The example below illustrates how a proper penalty matrix can be set to force a zero slope in the fitted curve at large values of the covariate. At this aim we consider weighted first-order differences by assigning a very large weight to the rightmost differences to fulfill the zero slope constraint. Overall, the penalty is \(\sum_j^{J-1} |\beta_j-\beta_{j-1}|w_j=||diag(w) D^{(1)}\beta||_1= ||A\beta||_1\), where \(D^{(1)}\) is the first-order differences matrix and \(w\) is the weight vector as defined below. To build the customized penalty matrix with appropriate dimension, it is probably helpful to set explicitly the number of the basis coefficients via the arguments ndx (number of covariate intervals which defaults to \(min(n/4,12)\)) and deg (spline degree which defaults to 3): Each basis has ndx+deg-1 identifiable coefficients.

data(growthData)

D1 <- diff(diag(23),dif=1)[,-1] #the 1st order diff matrix
w <- c(rep(1,15),rep(1000,7)) #the weight vector s.t. length(w)=nrow(D1)
A <- diag(w) %*% D1 #the penalty matrix

o4 <-gcrq(y~ps(x, ndx=20, pen.matrix=A), data=growthData, tau=.5)

The length of the zero-slope curve piece depends on number of large values in \(w\), in the above example rep(1000,7). Increasing (decreasing) the number of large values would lead to increase (decrease) the zero slope interval.

Figure below reports results of the aforementioned constrained fit, along with the fits obtained using unweighted first order-differences without and with additional monotonicity constraints

o5 <-gcrq(y~ps(x, d=1), data=growthData, tau=.5)
o6 <-gcrq(y~ps(x, d=1, monotone = 1), data=growthData, tau=.5)

par(mfrow=c(1,3))
plot(o4, res=TRUE, col=2, lwd=3, ylim=c(0,12))
plot(o5, res=TRUE, col=3, lwd=3, ylim=c(0,12))
plot(o6, res=TRUE, col=4, lwd=3, ylim=c(0,12))
plot of chunk unnamed-chunk-11

plot of chunk unnamed-chunk-11

Variable selection

We conclude with a simple example illustrating how quantregGrowth can be used to perform variable selection, namely when we are interested in spotting, typically few, important covariates among a large number of candidates. To illustrate, the simulated response below depends on the variables in the columns 3, 5, and 11 of the matrix X and on the linear covariate z which enters the model unpenalized

n <- 50
z <-1:n/n
p <-20
set.seed(1515)
X<-matrix(runif(n*p),n,p)
true.coef <-rep(0,p)
true.coef[c(3,5,11)] <- c(.7,1.5,-1)
y<-5 + 1.5*z+  drop(X%*%true.coef) + rnorm(n)*.25

In order to discard noisy covariates we use the lasso penalty on corresponding coefficients. More specifically, the penalized objective to be minimized is \(\sum_i\rho_\tau (y_i-\beta_0-\beta_1z_i-\sum_j\gamma_jx_{ij})+\lambda \sum_j|\gamma_j|\), where \(\rho_\tau(\cdot)\) is the usual check function. Note the lasso penalty refers to the coefficients of the \(x_j\)’s only; \(\beta_0\) and \(\beta_1\) are unpenalized. The model is fitted straightforwardly by including the matrix X in ps(), and the unpenalized covariate in the main formula.

o <-gcrq(y~ z + ps(X), tau=.5)
plot(o, term=1)
plot of chunk unnamed-chunk-13

plot of chunk unnamed-chunk-13

The `plot` method returns the estimated coefficients relevant to the columns of `X`.

By means of the argument ad in ps() is possible to set an adaptive penalty via weights that are updated iteratively. ad=1 leads to the adaptive lasso wherein the weights are given by the reciprocal of the coefficients at the previous iteration:

o1 <-gcrq(y~ z + ps(X, ad=1), tau=.5) 
plot(o, term=1, ylim=c(-1.3,1.5)) #naive lasso, red circles
plot(o1, term=1, add=TRUE, col=3, pch=2) #adaptive lasso, green triangles 
points(true.coef, pch=3, col=4) #true coefficients, blue crosses
plot of chunk unnamed-chunk-14

plot of chunk unnamed-chunk-14

It is also possible to fit multiple quantile curves and to display the results accordingly

o <-gcrq(y ~ z + ps(X))       #standard lasso
o1 <-gcrq(y ~ z + ps(X, ad=1)) #adaptive lasso
par(mfrow=c(1,2))
plot(o, term=1, legend=TRUE)  #left panel: naive lasso 
plot(o1, term=1, legend=TRUE) #right panel: adaptive lasso
plot of chunk unnamed-chunk-15

plot of chunk unnamed-chunk-15

Note each symbol refers to a specific quantile curve, and results from the adaptive lasso (right plot) are closer to the true ones with noisy coefficients (all but those corresponding to columns 4 and 15) being substantially zero.

References