An influential observation dramatically changes the fitted model based on whether it is included or excluded from the analysis.
library(api2lm)
Influential observations are often extreme with respect to their response values or their associated predictors.
home_sales
dataTo facilitate our discussion we will use the home_sales
data set in the api2lm package, which contains
information about homes sold in King County, WA between May 2014 and May
2015. The data set is a data frame with 216 rows and 8 columns:
price
: sale price (log10 units).bedrooms
: number of bedrooms.bathrooms
: number of bathrooms.sqft_living
: size of living space.sqft_lot
: size of property.floors
: number of floors.waterfront
: a factor variable with levels
"yes"
and "no"
that indicate whether a home as
a waterfront view.condition
: a factor variable indicating the condition
of a house with levels ranging from "poor"
to
"very good"
.We load the data below.
data(home_sales, package = "api2lm")
The validity of the assumptions we want to check depend on the model fit to the data.
We will regress the price
variable on all of the
remaining variables in home_sales
.
<- lm(price ~ ., data = home_sales)
lmod coef(lmod)
## (Intercept) bedrooms bathrooms sqft_living
## 5.377543e+00 2.604201e-03 3.149792e-02 9.987617e-05
## sqft_lot floors waterfrontyes conditionfair
## -8.773790e-07 8.630242e-02 3.152404e-01 -3.128967e-01
## conditionaverage conditiongood conditionvery_good
## -1.278938e-01 -9.122793e-02 -2.185523e-02
We describe two common approaches for identifying outliers.
The simplest approach for identifying outliers is to use an index plot of the studentized residuals to determine the observations with extreme residuals.
The other approach is to compare the set of studentized residuals to the appropriate quantile of a \(t_{n-p-1}\) distribution.
To evaluate whether a single observation is an outlier, we compare its studentized residual to \(t^{\alpha/2}_{n-p-1}\), i.e., the \(1-\alpha/2\) quantile of a \(t\) distribution with \(n-p-1\) degrees of freedom.
To identify the outliers from the \(n\) observations of our fitted model, we should adjust the previous idea using the Bonferroni correction to address the multiple comparisons problem.
Outlier example
The outlier_plot
function is a convenient way to create
an index plot of the studentized residuals from a fitted model.
outlier_plot
has four main arguments that we need to
know about:
model
: the fitted model.id_n
: the number of points to identify with
labels.add_reference
: a logical value indicating whether the
Bonferroni-corrected \(t\) quantiles
(\(t^{1-\alpha/2n}_{n-p-1}\) and \(t^{\alpha/2n}_{n-p-1}\)) are added as
reference lines to the plot. The default is TRUE
.alpha
: the \(\alpha\)
value used for the \(t\) quantiles. The
default is 0.05
.We use the outlier_plot
function to identify the 2 most
extreme outliers for our fitted model. Observations 214 and 33 have the
most extreme studentized residuals and observation 214 is more extreme
than the Bonferroni-adjusted \(t\)
quantile.
outlier_plot(lmod, id_n = 2)
The outlier_test
function can be used to identify the
observations with studentized residuals more extreme than the
appropriate Bonferroni-corrected \(t\)
quantiles. The main arguments to the outlier_test
function
are:
model
: the fitted model.n
: the number of outliers to return (by default, all of
them).alpha
: the desired familywise error rate for the family
of \(n\) tests. The default is
alpha = 0.05
.Running outlier_test
on a fitted model will return the
identified outliers and their Bonferroni-adjusted p-values
(adj_pvalue
). If the adjusted p-value is less than
alpha
, then the observation is declared an outlier.
We use outlier_test
to see that observation 214 is an
outlier according to this test.
outlier_test(lmod)
## stat pvalue adj_pvalue
## 214 -3.948387 5.414913e-05 0.01169621
The leverage values are the diagonal elements of the hat matrix \(\mathbf{H}=\mathbf{X}(\mathbf{X}^T \mathbf{X})^{-1} \mathbf{X}^T\). The \(i\)th leverage value is given by \(h_i=\mathbf{H}_{i,i}\), the \(i\)th diagonal position of the hat matrix.
We can extract the leverage values of a fitted model using the
hatvalues
function.
An observation is declared to be a leverage point if its leverage value is unusually large.
Kutner et al. (2005) suggest two different thresholds for identifying a leverage point.
Leverage example
We can create an index plot of the leverage values of a fitted model
using the leverage_plot
function.
The leverage_plot
function takes a few main
arguments:
model
: the fitted model.id_n
: the number of points to identify with
labels.ttype
: the threshold type. The default is
"half"
, which uses a reference threshold of
0.5
. Alternatively, we can choose "2mean"
,
which uses a reference threshold of \(2p/n\).We use the leverage_plot
function to identify leverage
points for the model fit to the home_sales
data.
leverage_plot(lmod)
Observations 33 and 179 are leverage points with leverage values greater than 0.5. Observation 214 has a leverage value nearly at 0.5, so we should consider it a leverage point.
An influential observation dramatically changes the fitted model based on whether it is included or excluded from the analysis.
We will now discuss several other measures of influence.
The DFBETA measure of influence is an \(n\times p\) matrix of statistics that measures the change in the estimated regression coefficients when we fit the model using all \(n\) observations versus when we fit the model leaving one observation out at a time.
The \(i\)th row of DFBETA is
\[ \text{DFBETA}_i = \hat{\boldsymbol{\beta}} - \hat{\boldsymbol{\beta}}_{(i)}. \]
Identifying influential observations using the DFBETA matrix can be difficult because the sampling variance associated with each coefficient can be very different.
The DFBETAS matrix is a transformation of the DFBETA matrix that has been scaled so that the individual statistics have similar sampling variability.
DFBETAS is also an \(n\times p\) matrix of statistics with the \(j\)th element of the \(i\)th row being defined as
\[\text{DFBETAS}_{i,j} = \frac{\hat{\beta}_{j-1} - \hat{\beta}_{j-1(i)}}{\hat{\mathrm{se}}(\hat{\beta}_{j-1})},\quad i=1,2,\ldots,n, \quad j=0,1,\ldots,p-1.\]
A DFBETAS statistic less than -1 or greater than 1 is often used as an indicator that the associated observation is influential in determining the fitted model.
One way of identifying influential observations is to create index plots the DFBETA or DFBETAs values for each regression coefficient and determine the estimates that substantially change when observation \(i\) is excluded from analysis.
The DFBETA and DFBETAS matrices can be obtained by using the
dfbeta
and dfbetas
functions,
respectively.
DFBETAS example
We can use the dfbetas_plot
function to get index plots
of the DFBETAS statistics for each regression coefficient.
Some of the main arguments to the dfbetas_plot
function
are:
model
: the fitted model.id_n
: the number of observations for which to print an
associated label.regressors
: a formula indicating the regressors for
which we want to plot the DFBETAS statistics. By default, index plots
are created for all regressors.Reference lines are automatically added at \(\pm 1\) to indicate observations whose DFBETAS values are particularly extreme.
We plot the bedrooms
, bathrooms
,
sqft_living
, and sqft_lot
variables for the
fitted model of our home_sales
data and label the two most
extreme observations.
dfbetas_plot(lmod, id_n = 2,
regressors = ~ bedrooms + bathrooms + sqft_living + sqft_lot)
We can see from these plots that:
bedrooms
changes by more
than 3 standard errors based on whether observation 179 is used in
fitting the model.sqft_living
and
sqft_lot
change by more than 1 standard error based on
whether observations 33 or 214 are used in fitting the model.Welsch and Kuh (1977) proposed measuring an observation’s influence through the DFFITS statistic, which is the difference between its fitted value and its LOO predicted response value.
The DFFITS statistic for observations \(i\) defined as
\[\text{DFFITS}_i = \hat{Y}_i - \hat{Y}_{i(i)}.\]
Belsley et al. (2005) suggest that observation \(i\) is influential if \(|\text{DFFITS}_i|>2\sqrt{p/n}\).
An index plot of the DFFITS statistics for all observations will indicate the observations most influential in changing their associated predicted value.
The DFFITS statistics for each observation can be obtained using the
dffits
function.
DFFITS example
We can create an index plot of our DFFITS statistics using the
dffits_plot
function.
Some of the main arguments to the dffits_plot
function
are:
model
: the fitted model.id_n
: the number of observations for which to print an
associated label.Horizontal reference lines are automatically added at \(\pm 2\sqrt{p/n}\) to identify influential observations.
We create an index plot of the DFFITS statistics below. We identify the observations with the 3 most extreme DFFITS statistics.
dffits_plot(lmod, id_n = 3)
We see that observations 33, 179, and 214 all have particularly
extreme DFFITS statistics. Several other observations have statistics
more extreme than \(\pm 1\) (run
dffits_test(lmod)
for a complete list).
Cook (1977) proposed the Cook’s distance to summarize the potential influence of an observation with a single statistic.
The Cook’s distance for the \(i\)th observation is
\[D_i=\frac{\sum_{k=1}^n (Y_k - \hat{Y}_{k(i)})^2}{p \widehat{\sigma}^2} = \frac{1}{p} r_i^2 \frac{h_i}{1-h_i}.\]
\[ r_i = \frac{\hat{\epsilon}_i}{\widehat{\sigma}\sqrt{1-h_i}}. \]
An index plot of the Cook’s distances can be used to identify observations with unusually large Cook’s distances.
cooks.distance
function.Cook’s distance example
An index plot of the Cook’s distances can be created using the
cooks_plot
function.
Some of the main arguments to the cooks_plot
function
are:
model
: the fitted model.id_n
: the number of observations for which to print an
associated label.A horizontal reference line is automatically added at \(F^{0.5}_{p,n-p}\) to identify influential observations.
We create an index plot of the Cook’s distances below. We identify the observations with the 3 most extreme Cook’s distances.
cooks_plot(lmod, id_n = 3)
We see that observations 33, 179, and 214 all have particularly extreme Cook’s distances, indicating those observations are influential in some sense. This is consistent with our previous results.
An influence plot is a scatter plot of the studentized or standardized residuals versus the leverage values. The size of each point in the plot is proportional to either its Cook’s distance or DFFITS statistic.
In general, observations that are extreme with respect to their residuals, leverage values, or Cook’s distance/DFFITS statistics are more likely be be potentially influential.
Influence plot example
An influence plot can be created using the
influence_plot
function from the api2lm
package.
Some of the main arguments to the influence_plot
function are:
model
: the fitted model.rtype
: the type of residual to plot on the y-axis. The
default is the studentized residuals.criterion
: the criterion that controls the size of the
points. The default is the Cook’s distance.id_n
: the number of observations for which to print an
associated label. The most extreme observations with respect to the
criterion are labeled.We create in influence plot for the model fit to the
home_sales
data below.
influence_plot(lmod)
Belsley, D. A., Kuh, E., & Welsch, R. E. (2005). Regression diagnostics: Identifying influential data and sources of collinearity. John Wiley & Sons.
Cook, R. D. (1977). Detection of Influential Observation in Linear Regression. Technometrics, 19(1), 15–18. https://doi.org/10.2307/1268249
Kutner, Michael H, Christopher J Nachtsheim, John Neter, and William Li. 2005. Applied Linear Statistical Models, 5th Edition. McGraw-Hill/Irwin, New York.
Welsch, R. E., & Kuh, E. (1977). Linear regression diagnostics (No. w0173). National Bureau of Economic Research.