Generalized Linear and Additive Models (‘GLAM’)

Andrew Cooper¹

This package provides methods for fitting generalized linear models (GLMs) and generalized additive models (GAMs). Linear and additive regression are useful modeling approaches for real-valued response data. However, in many applications the response of interest is not continuous or even numerical in nature. For instance, results from polls typically have a binary response of either “yes” or “no”, and many engineering fields deal with discrete counts rather than a continuous response. In these situations, the assumption of Gaussian-distributed errors is not appropriate. Generalized regression methods model the expectation of the response under an alternative, non-Gaussian data-generating distribution. They retain the relative simplicity of a linear/additive modeling approach by connecting the linear (or additive) predictor to the mean of the response through an appropriate link function. This vignette briefly overviews the fitting procedures for GLMs and GAMs implemented in this package, and provides simple illustrative examples of the glam function in action.

Fitting GLMs

Iteratively Re-weighted Least Squares (IRLS; Chen and Shao (1993)) is a common approach for fitting GLMs. IRLS is an iterative procedure that searches, via Fisher scoring, for the coefficients of a linear predictor that maximize the likelihood; upon convergence it yields the maximum likelihood estimator (MLE) of the coefficients. What makes the IRLS algorithm powerful is its relative simplicity and modularity. As described in Chapter 6 of Hastie and Tibshirani (1990), employing IRLS requires only a few model-specific quantities. One is (obviously) the choice of link function, whose evaluation and derivative with respect to the mean \(\mu\) are used in the algorithm's search. We must also provide an analytical form for the variance of the response \(Y\), which depends on the choice of distributional family. The rest of the algorithm is generic: so long as we have these quantities, we can implement this fitting approach for any GLM we please, and it will (eventually) converge to the MLE.
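To make this modularity concrete, below is a minimal IRLS sketch in base R for the Poisson/log-link case. It is illustrative only, not the package's internal implementation; the function name irls_poisson and the starting values are our own choices. Note how only three lines encode the model-specific quantities: the working response (via the link derivative), the weights (via the variance function), and the inverse link.

# Minimal IRLS sketch for a Poisson GLM with log link (illustrative only).
# Model-specific pieces: g(mu) = log(mu), g'(mu) = 1/mu, V(mu) = mu.
irls_poisson <- function(X, y, tol = 1e-8, max_iter = 100) {
  X <- cbind(1, X)                  # prepend an intercept column
  mu <- pmax(y, 0.5)                # crude positive starting values
  eta <- log(mu)                    # eta = g(mu)
  beta <- rep(0, ncol(X))
  for (i in seq_len(max_iter)) {
    z <- eta + (y - mu) / mu        # working response: eta + (y - mu) * g'(mu)
    w <- mu                         # weights: 1 / (g'(mu)^2 * V(mu))
    beta_new <- drop(solve(crossprod(X, w * X), crossprod(X, w * z)))
    converged <- max(abs(beta_new - beta)) < tol
    beta <- beta_new
    eta <- drop(X %*% beta)         # update the linear predictor
    mu <- exp(eta)                  # invert the link
    if (converged) break
  }
  list(coef = beta, mu = mu)
}

Swapping in a different family amounts to replacing the link, its derivative, and the variance function in the lines marked above; everything else stays the same.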

Fitting GAMs

Although the structure of an additive model differs from that of a linear one, its fitting procedure is remarkably close to IRLS. In addition to the smoothing functions we wish to employ for the fit (Hastie and Tibshirani (1990)), we need only specify the same quantities required for a GLM fit: the choice of link function, its derivative, and the variance of the response \(Y\). The local scoring procedure detailed in Chapter 6 of Hastie and Tibshirani (1990) uses these quantities to iteratively fit additive models using a weighted backfitting approach. Backfitting itself is a simple algorithm for fitting additive models, detailed in Chapter 4 of Hastie and Tibshirani (1990): it cycles over the covariates, repeatedly smoothing the partial residuals for each covariate against that covariate until the smooths stabilize.
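To make the backfitting step concrete, here is a minimal unweighted sketch for the Gaussian case, using smooth.spline as the scatterplot smoother; local scoring wraps a weighted variant of this loop inside an IRLS-style outer iteration. This is a sketch under our own naming (backfit), not the package's implementation.

# Minimal (unweighted) backfitting sketch for an additive model (illustrative only).
backfit <- function(X, y, tol = 1e-8, max_iter = 100) {
  X <- as.matrix(X)
  p <- ncol(X)
  alpha <- mean(y)                   # intercept: the response mean
  f <- matrix(0, nrow(X), p)         # f[, j] holds the current smooth f_j(x_j)
  for (i in seq_len(max_iter)) {
    f_old <- f
    for (j in seq_len(p)) {
      # smooth the partial residuals against the j-th covariate
      r <- y - alpha - rowSums(f[, -j, drop = FALSE])
      fhat <- predict(smooth.spline(X[, j], r), X[, j])$y
      f[, j] <- fhat - mean(fhat)    # center each smooth for identifiability
    }
    if (max(abs(f - f_old)) < tol) break
  }
  list(alpha = alpha, f = f, fitted = alpha + rowSums(f))
}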

GLAM fitting examples

The ‘GLAM’ library takes advantage of the IRLS and local scoring procedures described above to fit commonly used regression models to non-Gaussian response data. We wrap our methods into one easy-to-use function called glam:

glam(X, Y, model, family, intercept, tol, max_iter)

There are four required arguments: a numeric matrix of covariate data \(X\), a vector of responses \(Y\), a choice of whether to fit a GLM or a GAM (model), and the distributional family with which to model the response (family). The optional arguments are whether to include an intercept term (intercept, default TRUE), the tolerance used to determine convergence (tol, default \(10^{-8}\)), and the maximum number of iterations allowed before convergence is deemed to have failed (max_iter, default \(100\)).
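For reference, a call that spells out every default explicitly looks like the following (using the quadratic covariate matrix and one of the family choices demonstrated in the examples below):

# all optional arguments written out at their documented defaults
fit <- glam(cbind(X, X^2), Y, model = "linear", family = "poisson",
            intercept = TRUE, tol = 1e-8, max_iter = 100)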

We provide some simple examples of how to use our function for four different kinds of response data.

library(glam)

We first generate some fake one-dimensional covariate data that the following examples will use to generate different responses.

# generate random inputs
set.seed(10)
n <- 50
X <- sort(runif(n))

Beta regression

We start with beta regression, which is the primary contribution of this package given the relative scarcity of beta regression fitting tools currently available for R. Beta regression is appropriate for response data bounded between 0 and 1; this is common when dealing with proportion or percentage data, such as modeling the proportion of people who answered a poll question a certain way.

To fit a Beta GLM/GAM, we specify in our glam call that family = "beta". We then compare the performance of a GLM (model = "linear") with a GAM (model = "additive") on generated response data that ranges from \(0\) to \(1\) in the given domain of \(X\).

Y <- (X - 0.5)^2 + 0.5 + rnorm(n, sd = 1e-1)
beta_glm <- glam(cbind(X, X^2), Y, model = "linear", family = "beta")
beta_gam <- glam(cbind(X, X^2), Y, model = "additive", family = "beta")
plot(X, Y, xlab = "X", ylab = "Y", main = "Beta Regression Example")
lines(X, beta_glm$mu, col = "red", lwd = 2)
lines(X, beta_gam$mu, col = "blue", lwd = 2)
legend("bottomright", legend = c("GLM", "GAM"), col = c("red", "blue"), lwd = 2)

For both models we plot the fitted mean \(\mu\) returned by glam. The estimated coefficients from the GLM fit are also returned:

beta_glm$coef
#> [1]  0.6228513 -2.3475940  2.3634290

Logistic regression

Logistic regression is possibly the most common instance of generalized regression. It is useful when the response data is binary in nature; for instance, many surveys have questions where the recorded answer is only a “yes” or a “no” rather than a numerical value. In the current iteration of the GLAM package, only a GLM fit is available for this family, but we hope to make a GAM version available soon.

To fit a GLM on some example binary data, we specify family = "binomial". Note the response \(Y\) must be coded numerically with two unique values. In this example we code the response as either \(0\) for “failure” or \(1\) for “success”.

Y <- rep(0, n)
Y[which(X <= 1/3)] <- rbinom(sum(X <= 1/3), 1, 0.9)
Y[which(X >= 2/3)] <- rbinom(sum(X >= 2/3), 1, 0.9)
log_glm <- glam(cbind(X, X^2), Y, model = "linear", family = "binomial")
plot(X, Y, xlab = "X", ylab = "Y", main = "Binomial Regression Example")
lines(X, log_glm$mu, col = "red", lwd = 2)
legend("bottomright", legend = c("GLM"), col = c("red"), lwd = 2)

Poisson regression

Poisson regression is another common example of generalized modeling. It is often used when the response of interest is a count of some kind; for instance, we may wish to model the expected number of times a given event occurs under varying conditions.

Poisson GLMs and GAMs can be fit by setting family = "poisson". Note that in this example the additive fit emits non-convergence warnings from the inner backfitting loop; a fitted mean is still returned and plotted below:

Y <- rpois(n, 3*rnorm(n, mean = 2*X, sd = 1e-4))
pois_glm <- glam(cbind(X, X^2), Y, model = "linear", family = "poisson")
pois_gam <- glam(cbind(X, X^2), Y, model = "additive", family = "poisson")
#> Backfitting algorithm failed to converge in 100 iterations.
#> Backfitting algorithm failed to converge in 100 iterations.
#> Backfitting algorithm failed to converge in 100 iterations.
#> Backfitting algorithm failed to converge in 100 iterations.
#> Backfitting algorithm failed to converge in 100 iterations.
#> Backfitting algorithm failed to converge in 100 iterations.
#> Backfitting algorithm failed to converge in 100 iterations.
#> Backfitting algorithm failed to converge in 100 iterations.
plot(X, Y, xlab = "X", ylab = "Y", main = "Poisson Regression Example")
lines(X, pois_glm$mu, col = "red", lwd = 2)
lines(X, pois_gam$mu, col = "blue", lwd = 2)
legend("bottomright", legend = c("GLM", "GAM"), col = c("red", "blue"), lwd = 2)

Gamma regression

Finally, gamma regression is a common approach for modeling continuous, positive responses. Many situations produce response data that is inherently positive in nature, such as the heights or weights of individuals. While a normal approximation is often used in practice and typically works well, we may wish to use a model that enforces the appropriate positivity constraint on the domain of \(Y\).

To fit a gamma GLM or GAM, we simply specify family = "gamma".

Y <- rgamma(n, shape = 10, scale = 5*X)
gamma_glm <- glam(cbind(X, X^2), Y, model = "linear", family = "gamma")
gamma_gam <- glam(cbind(X, X^2), Y, model = "additive", family = "gamma")
plot(X, Y, xlab = "X", ylab = "Y", main = "Gamma Regression Example")
lines(X, gamma_glm$mu, col = "red", lwd = 2)
lines(X, gamma_gam$mu, col = "blue", lwd = 2)
legend("bottomright", legend = c("GLM", "GAM"), col = c("red", "blue"), lwd = 2)

References

Chen, Jiahua, and Jun Shao. 1993. “Iterative Weighted Least Squares Estimators.” The Annals of Statistics, 1071–92.
Hastie, T. J., and R. J. Tibshirani. 1990. Generalized Additive Models. Chapman & Hall/CRC Monographs on Statistics & Applied Probability. Taylor & Francis. https://books.google.com/books?id=qa29r1Ze1coC.

  1. Virginia Tech Department of Statistics