mfp2
package
Multivariable regression models are widely used across various fields of science where empirical data is analyzed. In model building, many researchers often assume a linear function for continuous variables, sometimes after applying “standard” transformations such as logarithms, or dividing the variable into several categories. Assuming linearity without considering non-linear relationships may hinder the detection of effects or cause the effects to be mismodeled. Categorizing continuous variables, which results in modelling implausible step functions, is a common practice but widely criticized (Royston et al. 2006; Sauerbrei et al. 2020).
When building a descriptive model with the aim of capturing the data structure parsimoniously, two components should be considered: first, variable selection to identify the subset of “important” covariates that have a significant impact on the outcome and second, whether non-linear relationships of continuous covariates fit the data (substantially) better.
The MFP approach has been proposed as a pragmatic method that combines variable and function selection simultaneously in multivariable regression modelling. This approach identifies non-linear functions for continuous variables if sufficiently supported by the data, and eliminates covariates with no or weak effects by backward elimination (BE). Despite its relative simplicity, the selected models often capture the essential information from the data. The MFP models are relatively straightforward to interpret and report, which is important for transportability and practical usability.
In detail, the MFP procedure combines: - variable selection through backward elimination with - selection of fractional polynomial (FP) functions for continuous variables.
The analyst must decide on nominal significance levels (α1,α2) for both components. This choice has a strong influence on the complexity of the final model. Often, the same significance levels (α1=α2) are used for both components, but they can also differ. The decision regarding these significance levels strongly depends on the specific aim of the analysis.
The rest of the paper is organized as follows. Section 1.2 provides an overview of fractional polynomial functions for a single continuous covariate in the model, including the function selection procedure (FSP). Section 1.3 describes the MFP approach, focusing on models involving two or more covariates.
Section 2 is an introduction to the mfp2
package. It
covers the installation process and provides instructions for utilizing
the package in various linear regression models. The package
functionality is predominantly demonstrated using Gaussian linear models
(Subsection 2.3), while other models, such as logistic (Subsection
2.4),and Cox (Subsection 2.5), are briefly explained.
Section 3 introduces an extension of MFP using the approximate cumulative distribution (ACD) transformation of a continuous covariate. This extension allows for modelling a sigmoid relationship between covariates and an outcome variable (subsection 3.1).
For more comprehensive information about MFP and its extensions, please refer to Royston and Sauerbrei 2008 or visit MFP website (click here) and explore the references given there.
Suppose that we have an outcome variable, a single continuous covariate x, and a regression model relating them. A starting point is the straight-line model β1x (for simplicity, we suppress the constant term, β0). Often, a straight line is an adequate description of the relationship, but other models should be investigated for possible improvements in fit. A simple extension of the straight line is a power transformation model, β1xp. The latter model has often been used by practitioners in an ad hoc way, utilizing different choices of p. Royston and Altman (1994) formalized the model by calling it a first-degree fractional polynomial or FP1 function. The power p is chosen from a pragmatically restricted set of eight elements: S={−2,−1,−0.5,0,0.5,1,2,3}, where x0 denotes the natural logarithm log(x).
As with polynomial regression, extension from one-term FP1 functions to more complex and flexible two-term FP2 functions is straightforward. The quadratic function β1x1+β2x2 is written as β1xp1+β2xp2 in FP terminology. The powers p1=1 and p2=2 are members of set S. Royston and Altman extended the class of FP2 functions with different powers to cases with equal powers (p1=p2=p) by defining them as β1xp+β2xplog(x). These are known as repeated-powers functions. Detailed definitions of FP functions are given in Section 4.3.1 of Royston and Sauerbrei (2008). For formal definitions, we use their notation.
FP1 functions are always monotonic and those with power p<0 have an asymptote as x→∞. FP2 functions may be monotonic or unimodal (i.e., have one maximum or one minimum for positive values of x), and they have an asymptote as x→∞ when both p1 and p2 are negative. For more details, see Royston and Sauerbrei (2008), Section 4.4.
Extension to FPm is straightforward but m>2 is hardly needed when modelling health research data with MFP. For m=2, there are 44 models available within the set of FP powers (S), consisting of 8 FP1 models and 36 FP2 models. Although the allowed class of FP functions may seem limited, it encompasses a wide range of diverse shapes. This is illustrated in the Figure below, with the left panel displaying eight FP1 powers and the right panel depicting a subset of FP2 powers.
Based on extensive experience with real data and several simulation studies, FP1 and FP2 are generally considered adequate in the context of multivariable model building, particularly when the selection of variable and functional forms are required (Royston and Sauerbrei 2008; Sauerbrei et al. 2007a).
This short description of the MFP methodology content is based on previously published encyclopedia articles (Sauerbrei and Royston, 2011; Sauerbrei and Royston, 2016).
Choosing the best fitting FP1, FP2, or linear function for a given dataset is conducted by grid search, computing the deviance (minus twice the maximized log-likelihood) and making a decision for the best function based on a suitable algorithm (e.g. via pre-defined significance levels).
The default function in most implementations of MFP is a linear function. Therefore, unless the data support a more complex FP function, a straight line model is chosen. Concerning the default, there are occasional exceptions. For example, to model time-varying regression coefficients in the Cox model, Sauerbrei et al. (2007b) chose a default time (t) transformation of log(t) rather than t.
It can be assumed that deviance difference between an FPm and an FP(m-1) model is distributed approximately as central χ2 on 2 degrees of freedom (d.f.) (Royston and Sauerbrei 2008, Chapter 4.9; Ambler and Royston (2001)). To select a specific function, a closed test procedure was proposed (Royston and Sauerbrei 2008, Section 4.10). The complexity of the finally chosen function requires pre-definition of the nominal significance level (α) and the degree (m) of the most complex FP model allowed. Typical choices are α=0.05 and FP2 (m=2). We illustrate the strategy for m=2, which runs as follows:
Step 1 is a test of the overall association of the outcome with x. Step 2 examines the evidence for nonlinearity. Step 3 chooses between a simpler or a more complex non-linear model.
When developing a multivariable model with a relatively large number of candidate covariates (say between 10 and 30, we are not envisaging the case of high-dimensional data), an important distinction is between descriptive, predictive and explanatory modelling (Shmueli, 2010). MFP was mainly developed for descriptive modelling, aiming to capture the data structure parsimoniously. Nevertheless, a suitable descriptive model often has a fit similar to a model whose aim is good prediction (Sauerbrei et al. 2007a). In some fields, the term explanatory modelling is used exclusively for testing causal theory.
In many areas of science, the main interest often lies in the
identification of influential variables and determination of appropriate
functional forms for continuous variables. Often, linearity is presumed
without checking this important assumption. The MFP procedure was
proposed as a pragmatic strategy to investigate whether nonlinear
functions can improve the model fit (Royston and Sauerbrei, 2008;
Sauerbrei and Royston, 1999). MFP combines backward elimination for the
selection of variables with a systematic search for possible
nonlinearity by the function selection procedure (FSP). The extension is
feasible with any type of regression model to which BE is applicable.
The philosophy behind MFP modeling is to create interpretable and
relatively simple models (Sauerbrei et al. 2007a). Consequently, an
analyst using MFP should be less concerned about failing to include
variables with a weak effect or failing to identify minor curvature in a
functional form of a continuous covariate. Modifications that may improve
MFP models are combination with post-estimation shrinkage (Dunkler et
al., 2016, R-package shrink) and a more systematic check for overlooked
local features (Binder and Sauerbrei, 2010, currently not implemented in
the mfp2
package). Successful use of MFP requires only
general knowledge about building regression models.
Two nominal significance level values are the main tuning parameters:
α1 for selecting variables
with BE (in the first step of the FSP) and α2 for comparing the fit of
functions within the FSP. Often, α1=α2 is a good choice. If
available, subject-matter knowledge should replace or at least guide
data-dependent model choice. Only minor modifications are required to
incorporate various types of subject-matter knowledge into MFP modeling.
For a detailed example, see Sauerbrei and Royston (1999).
Recommendations for MFP modeling in practice are given in Sauerbrei et
al. (2007a) and in Royston and Sauerbrei (2008, Section 12.2). In the
mfp2
package you may replace α1 and α2 by AIC-P or BIC-P.
Mainly focusing on the FP component, we briefly mention key issues of MFP modeling and refer to the literature for further reading. A brief overview is also given on the MFP website (https://mfp.imbi.uni-freiburg.de/fp). Regarding variable selection, we have summarized relevant issues and provided arguments for backward elimination as our preferred strategy (Royston and Sauerbrei 2008, Chapter 2). Even when a search for model improvement using a nonlinear function is not considered, that is, all functions are assumed linear, it is infeasible to derive a suitable and stable model for description in small datasets. Below we provide some information about sample size needed, but implicitly we assume that the sample size is “sufficient”.
The class of FP1 and FP2 functions includes a log and other transformations which require that the continuous variable must be positive. A preliminary origin-shift transformation can be applied as shown in section 2.3.2.
Sometimes, there are variables that exhibit a spike at zero. This
means that a certain proportion of individuals have zero values, while
among those with positive values, the variable has a continuous
distribution. In epidemiology, typical examples are cigarette
consumption or alcohol intake. There are different approaches to
modeling such variables. Royston and Sauerbrei (2008, chapter 4.15) and
Royston et al. (2010) proposed a method that involves adding a small
constant (to have positive values) and a binary indicator (for zero or
positive values) to the regression model. Additionally, they utilized
fractional polynomials to determine the functional form of the variable.
In a two-stage procedure, they assessed whether the binary variable
and/or the continuous function is required for a good fit to the data.
Becher et al. (2012) proposed a slight modification of this approach.
These two approaches are currently not implemented in the
mfp2
package.
All statistical models are potentially adversely affected by influential observations or “outliers”. However, compared with models that comprise only linear functions, the situation may be more critical for FP functions because logarithmic or negative power transformations may produce extreme functional estimates at small values of x. Conversely, the same may happen with large positive powers at large values of x. Such transformations may create influential observations that may affect parts of the FSP. To mitigate the impact of influential observations, it is important to assess the robustness of FP functions. Some suggestions for investigating influential points (IPs) and handling such issues in MFP modeling can be found in Royston and Sauerbrei (2008, Chapters 5 and 10) and their paper on improving the robustness of FP models (Royston and Sauerbrei, 2007). Using synthetic data, a more detailed investigation of IPs is given in Sauerbrei et al. (2023). The authors conclude that for smaller sample sizes, IPs and low power are important reasons that the MFP approach may not be able to identify underlying functional relationships for continuous variables and selected models might differ substantially from the true model. However, for larger sample sizes (about 50 or more observations per variable) a carefully conducted MFP analysis (which includes checks for IPs and other aspects of model criticism) is often a suitable way to select a multivariable regression model which includes continuous variables.
Based on experience we argue that FP1 or FP2 functions are sufficient to derive a good fitting model. Occasionally, higher order FP functions (with m=3 or m=4) are useful in data analysis. Negative exponential or other types of pre-transformations may help to improve the fit of a complex function or may allow to incorporate subject specific knowledge. An example is discussed in section 3.1.3 in Sauerbrei and Royston (1999).
Unlike splines which have a local interpretation of the fitted
function, FPs provide a curve with a global interpretation. To
investigate possible “overlooked” local features of an FP function,
Binder and Sauerbrei (2010) conducted a systematic analysis of residuals
from a model obtained by MFP. If local features are detected by their
MFP + L procedure, statistically significant local polynomials are then
parsimoniously added to the FP function. This enables the identification
and incorporation of local features that may have been missed by the
global FP function. This approach is not currently implemented in the
mfp2
package, but it is a potential area for future
extensions.
mfp2
packagemfp2
is a package that implements the MFP approach to
model selection. In addition, it has the ability to model a sigmoid
relationship between x
and an outcome variable
y
using the ACD transformation proposed by Royston (2014)
and implemented in the Stata procedure MFPA (Royston and Sauerbrei
2016). The package offers three options for variable and function
selection: nominal significance levels (p-value), adapted Akaike
information criterion (AIC-P), and adapted Bayesian information
criterion (BIC-P). It also provides functions for prediction and
plotting. Currently, the package implements linear, logistic, Poisson,
and Cox regression models. However, the package is designed in such a
way that it can easily incorporate other regression models.
The main function, mfp2()
, implements both MFP and MFPA.
The latter adds sigmoid functions to the FP family and is only required
if sigmoid functions are relevant. It offers two interfaces for input
data. The first interface (matrix, default) allows direct input of the
covariate matrix x
and the outcome y
. The
second interface (formula) uses a formula object in conjunction with a
data frame, similar to the glm()
function with slight
modifications. Both interfaces are equivalent in terms of
functionality.
The package also includes several example datasets. We will
illustrate working with mfp2
using the prostate dataset.
Details about the dataset are found in the book by Royston and Sauerbrei
(2008) and on the MFP website.
The current authors of mfp2
are Edwin Kipruto, Michael
Kammer, Patrick Royston, and Willi Sauerbrei. We acknowledge the
comments and suggestions provided by Georg Heinze and Gregory Steiner,
which helped us improve our package. The R package is maintained by
Edwin Kipruto, while the Stata MFP and MFPA programs are maintained by
Patrick Royston. Parts of the text and some recommendations
(e.g. section 2.3.3) have been taken from the Stata programs MFP and
MFPA.
The estimation algorithm employed in mfp2
sequentially
processes the covariates using a back-fitting approach. It calculates
the p-values of each covariate using the likelihood ratio test, assuming
linearity, and then arranges the covariates based on these p-values. By
default, the covariates are arranged in order of decreasing statistical
significance. This ordering aims to prioritize modeling relatively
important variables before less important ones, which can help mitigate
potential challenges in model fitting arising from collinearity or, more
generally, the presence of “concurvity” among the covariates (StataCorp.
2023). Although alternative options for covariate ordering are available
in the mfp2
package, such as increasing statistical
significance (xorder = "descending"
) and respecting the
user’s original variable order (xorder = "original"
), we
strongly recommend the default option
(xorder = "ascending"
). Occasionally, there may be subject
matter knowledge that provides arguments to change this order.
If a covariate contains non-positive values, the program by default shifts the location of the covariate to ensure positivity. In addition, it scales the shifted covariate before the first cycle of the algorithm. For more information on shifting and scaling, please see section 2.3.2.
During the initial cycle, the best-fitting FP function for the first variable (ordered by the p-values from the full model including all variables) is determined, with adjustment for all the other variables, assuming that continuous variables have a linear effect. The selected power terms are kept, and the process is repeated for the other variables. The first iteration concludes when all the variables have been processed in this way. Some variables may be eliminated, and for the remaining selected continuous variables linear, FP1 or FP2 functions are chosen. This is the starting point for the next cycle. It works similar to the initial cycle, with the selected power terms from the initial cycle being used for all variables except the one currently being processed.
Updating of FP functions and candidate variables continues through several cycles until the functions and variables included in the overall model do not change from one cycle to the next (convergence). Typically, MFP requires two to four cycles for convergence. Non-convergence may occur when there is oscillation between two or more models, but this situation is extremely rare. If this situation arises, try changing the nominal significance levels.
To install and load the mfp2
package, enter the
following command in the R console:
# install the package
install.packages("mfp2")
# load the package
library(mfp2)
The default family
in mfp2
package is
family = "gaussian"
, which fits a Gaussian linear model. We
will use the prostate cancer data (Stamey et al., 1989) included in our
package to demonstrate how to fit this model. The dataset contains seven
covariates (six continuous (age
, pgg45
,
cavol
, weight
, bph
, and
cp
) and one binary (svi
)) and a continuous
outcome variable log prostate-specific antigen (lpsa) of 97 male
patients with prostate cancer. Our aim is to determine variables with
influence on the outcome and whether non-linear functional relationships
exist between the covariates and the outcome variable.
Load the prostate
dataset from the mfp2
package and display the first six rows
# Load prostate data
data("prostate")
head(prostate, 6)
#> # A tibble: 6 × 9
#> obsno age svi pgg45 cavol weight bph cp lpsa
#> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 1 50 0 0 0.560 16.0 0.25 0.25 -0.431
#> 2 2 58 0 0 0.370 27.7 0.25 0.25 -0.163
#> 3 3 74 0 20 0.600 14.7 0.25 0.25 -0.163
#> 4 4 58 0 0 0.300 26.6 0.25 0.25 -0.163
#> 5 5 62 0 0 2.12 31.0 0.25 0.25 0.372
#> 6 6 50 0 0 0.350 25.2 0.25 0.25 0.765
# create covariate matrix x and numeric vector y
x <- as.matrix(prostate[, 2:8])
y <- as.numeric(prostate$lpsa)
The command data("prostate")
loads a data frame from the
mfp2
package. We create a matrix x
and a
numeric vector y
from the data frame to be used by the
matrix interface of the mfp2()
function.
The default interface of mfp2()
requires a matrix of
covariates x
and a numeric vector of response
y
for continuous outcomes. If you have one covariate, make
sure you convert it into a matrix with a single column.
We demonstrate how to fit the Gaussian linear model using the default
parameters. The formula interface uses the fp()
function to
state that continuous variables should be modelled using fractional
polynomials (with default parameters). The two approaches lead to the
same results as demonstrated by calling summary()
with the
fitted models as input.
# default interface
fit_default <- mfp2(x, y, verbose = FALSE)
summary(fit_default)
#>
#> Call:
#> glm(formula = y ~ ., family = family, data = data, weights = weights,
#> offset = offset, x = TRUE, y = TRUE)
#>
#> Coefficients:
#> Estimate Std. Error t value Pr(>|t|)
#> (Intercept) 2.33129 0.08509 27.399 < 2e-16 ***
#> cavol.1 0.54021 0.07449 7.252 1.2e-10 ***
#> svi.1 0.67944 0.20807 3.265 0.001531 **
#> weight.1 1.41590 0.38967 3.634 0.000458 ***
#> ---
#> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#>
#> (Dispersion parameter for gaussian family taken to be 0.5054218)
#>
#> Null deviance: 127.918 on 96 degrees of freedom
#> Residual deviance: 47.004 on 93 degrees of freedom
#> AIC: 215
#>
#> Number of Fisher Scoring iterations: 2
# formula interface
fit_formula <- mfp2(lpsa ~ fp(age) + svi + fp(pgg45) + fp(cavol) + fp(weight) +
fp(bph) + fp(cp), data = prostate, verbose = FALSE)
summary(fit_formula)
#>
#> Call:
#> glm(formula = y ~ ., family = family, data = data, weights = weights,
#> offset = offset, x = TRUE, y = TRUE)
#>
#> Coefficients:
#> Estimate Std. Error t value Pr(>|t|)
#> (Intercept) 2.33129 0.08509 27.399 < 2e-16 ***
#> cavol.1 0.54021 0.07449 7.252 1.2e-10 ***
#> svi.1 0.67944 0.20807 3.265 0.001531 **
#> weight.1 1.41590 0.38967 3.634 0.000458 ***
#> ---
#> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#>
#> (Dispersion parameter for gaussian family taken to be 0.5054218)
#>
#> Null deviance: 127.918 on 96 degrees of freedom
#> Residual deviance: 47.004 on 93 degrees of freedom
#> AIC: 215
#>
#> Number of Fisher Scoring iterations: 2
The main distinction between the formulas used in mfp2()
and glm()
functions in R is the inclusion
of the fp()
function within the formula. The presence of
the fp()
function in the formula indicates that the
variables included within it should undergo FP transformation, provided
that the degree of freedom (df) is not equal to 1. A df of 1 indicates a
linear relationship, which does not require transformation. Note that
df
is an argument of the fp()
function.
The variable svi
is a binary variable and is therefore
not passed to the fp()
function. This is because binary or
factor variables should not undergo FP transformation. If a binary
variable is passed to the fp()
function, the program will
automatically set the df
to 1, treating the variable as
linear. However, passing a factor variable to the fp()
function will result in an error. For more details on how
mfp2
handles categorical variables, refer to section
2.3.10.
Fractional polynomials are defined only for positive variables due to
the use of logarithms and other power transformations such as the square
root. Thus, mfp2()
estimates shifting factors for each
variables to ensure positivity. The function
find_shift_factor()
, used internally by mfp2
,
automatically estimates shifting factors for each continuous variables.
The formula used to estimate the shifting factor for a continuous
variable, say x1, is given by:
shiftx1=γ−min
where \min(x1) is the smallest
observed value of x1, while \gamma is the minimum increment between
successive ordered sample values of x1, excluding 0 (Royston and Sauerbrei,
2008). The new variable x1' = x1 +
shift_{x1} will then be used by mfp2()
in
estimating the FP powers.
For example, to estimate shifting factors for covariate matrix
x
from prostate data in R, you can run the following
code:
# minimum values for each covariate
apply(x, 2, min)
#> age svi pgg45 cavol weight bph cp
#> 41.00 0.00 0.00 0.26 10.75 0.25 0.25
# shifting values for each covariate
apply(x, 2, find_shift_factor)
#> age svi pgg45 cavol weight bph cp
#> 0 0 1 0 0 0 0
We see that among the continuous variables, only the variable
pgg45
is shifted by a factor of 1, which is attributed to
its minimum value being 0. Even though the variable svi
also has a minimum value of 0, it is not shifted because it is a binary
variable. The user can manually set the shifting factors for each
variable in the mfp2()
function if desired to e.g. improve
interpretability.
If the values of the variables are too large or too small, it is important to scale the variables to reduce the chances of numerical underflow or overflow which can lead to inaccuracies and difficulties in estimating the model (Royston and Sauerbrei, 2008). Scaling can be done automatically or by directly specifying the scaling values for each variables so that the magnitude of the covariate values are not too extreme. By default scaling factors are estimated by the program as follows.
After adjusting the location of x (if necessary) so that its minimum value is positive, creating x', automatic scaling will divide each value of x' by 10^p where the exponent p is given by
p = sign(k) \times floor(|k|), \quad \text{where} \quad k = \log_{10} (\max(x')- \min(x')).
The mfp2()
function uses this formula to scale the
columns of the covariate matrix, implemented through the
find_scale_factor()
function. From the output below, we see
that the variables age
, cavol
, and
cp
have scaling factors of 10 each, while the variables
pgg45
and weight
have scaling factors of 100
each. Each variable will be divided by its corresponding scaling factor.
A scaling factor of 1 (e.g. bph
and svi
)
implies no scaling.
# shift x if nonpositive values exist
shift <- apply(x, 2, find_shift_factor)
xnew <- sweep(x, 2, shift, "+")
# scaling factors
apply(xnew, 2, find_scale_factor)
#> age svi pgg45 cavol weight bph cp
#> 10 1 100 10 100 1 10
To manually enter shifting and scaling factors, the
mfp2()
function provides the shift
and
scale
arguments. In the default usage of
mfp2()
, a vector of shifting or scaling factors, with a
length equal to the number of covariates, can be provided. In the
formula interface, shifting or scaling factors can be directly specified
within the fp()
function. Below is an example to illustrate
this (not run):
# Default interface
mfp2(x, y, shift = c(0, 0, 1, 0, 0, 0, 0), scale = c(10, 1, 100, 10, 100, 1,
10))
# Formula interface
mfp2(lpsa ~ fp(age, shift = 0, scale = 10) + svi + fp(pgg45, shift = 1, scale = 100) +
fp(cavol, shift = 0, scale = 10) + fp(weight, shift = 0, scale = 100) +
fp(bph, shift = 0, scale = 1) + fp(cp, shift = 0, scale = 10), data = prostate)
In the default interface, each variable in the x
matrix
is assigned a shifting and scaling factor based on their respective
positions. For instance, the first variable in the column of
x
, which is age
, is assigned a shifting factor
of 0 and a scaling factor of 10. The second variable, svi
,
is assigned a shifting factor of 0 and a scaling factor of 1, and so
on.
The degrees of freedom (df) for each covariate (excluding the
intercept) are twice the degrees of freedom of the FP. For instance, if
the maximum allowed complexity of variable x1 is a second-degree FP (FP2), then the
degrees of freedom assigned to this variable should be 4. To find the
best fitting function, FSP selects power terms p1 and p2 and the corresponding regression
coefficients \beta_1 and \beta_2. The default df
is 4
(m = 2) for each covariate.
After assigning the default degrees of freedom to each variable, the program overrides the default value if a variable has only a very small number of unique values as such variables may not be handled as other continuous variable. As proposed in the Stata program the rules for overriding the default degrees of freedom are as follows:
df
= 1 (linear).df
= min(2, default).df
= default.These rules ensure that the appropriate df
are assigned
to variables. For instance, it is not sensible to fit an FP2 function to
a variable with only 3 distinct values (StataCorp. 2023).
The following code illustrates how to set different df
for each variable. In the default interface, the df
of the
binary variable svi
is explicitly set to 1 (linear), while
in the formula interface, there is no need to specify the
df
, as the program automatically assigns
df = 1
for binary variables.
If the user attempts to enter df = 4
for
svi
in the default interface, the program will reset the
df
to 1 and issue a warning.
# Default Interface
mfp2(x, y, df = c(4, 1, 4, 4, 4, 4, 4))
# Formula Interface
mfp2(lpsa ~ fp(age, df = 4) + svi + fp(pgg45, df = 4) + fp(cavol, df = 4) +
fp(weight, df = 4) + fp(bph, df = 4) + fp(cp, df = 4), data = prostate)
In addition, if the user does not explicitly assign df
to variables, the program will automatically assign df = 4
to each variable, corresponding to m =
2 as the default for MFP modelling. It is important to note
that even if a continuous variable is not passed to the
fp()
function in the formula interface, the df
of that variable will still be set to the default value and will later
be adjusted based on the unique values. The following examples are all
equivalent:
# Default Interface
mfp2(x, y)
# Formula Interface, all continuous variables passed to the fp() function
mfp2(lpsa ~ fp(age) + svi + fp(pgg45) + fp(cavol) + fp(weight) + fp(bph) + fp(cp),
data = prostate)
# Formula Interface, `cp` not passed to the fp() function
mfp2(lpsa ~ fp(age) + svi + fp(pgg45) + fp(cavol) + fp(weight) + fp(bph) + cp,
data = prostate)
# Formula Interface, all covariates not passed to the fp() function
mfp2(lpsa ~ age + svi + pgg45 + cavol + weight + bph + cp, data = prostate)
The two key components of MFP are variable selection with backward
elimination (BE) and function selection for continuous variables through
the function selection procedure (FSP). The mfp2()
function
has an argument called criterion
, which allows the user to
specify the criterion applied for variable and function selection.
The default criterion is pvalue
, which enables the user
to set two nominal significance levels: \alpha_1 for variable selection with BE
and \alpha_2 for comparing the fit
of the functions (linear, best FP1, best FP2 and so on) within the FSP.
The choice of these significance levels strongly influences the
complexity and stability of the final model. Please refer to section section 2.3.4.1 for instructions
on setting the nominal significance levels.
The mfp2
package offers alternative approaches to
variable and function selection by utilizing information criteria.
Within the MFP framework, each covariate is evaluated univariately while
adjusting for the other covariates through an overarching back-fitting
algorithm. This algorithm iteratively assesses each covariate and
selects the model (null, linear, FP1, FP2) with the minimum AIC-P
(criterion set to aic) or BIC-P (criterion set to bic) (see section
2.3.4.2). Here, AIC-P and BIC-P refer to the adapted versions of AIC and
BIC, respectively, to distinguish them from the original AIC and BIC.
However, in the program, we use the abbreviations AIC and BIC to denote
AIC-P and BIC-P.
The “null” model refers to a model without the covariate of interest.
A “linear” model assumes a linear relationship with the outcome
variable, while an FP1 model assumes a non-linear relationship using the
FP1 function. The criterion
argument allows users to
specify whether they want to use AIC-P or BIC-P criteria for both
variable and function selection. If AIC-P or BIC-P is selected as the
criterion, the nominal significance levels set through the
select
and alpha
arguments will be ignored.
For details on using AIC-P and BIC-P in variable and function selection,
please refer to section 2.3.4.2.
The mfp2()
function has the arguments
select
and alpha
for setting the nominal
significance level for variable selection by BE and for testing between
FP models of different degrees, respectively. When using these
arguments, you should ensure that the criterion
is set to
"pvalue"
to correctly use the specified significance
levels.
For variable selection, a significance level can be set using the
select
argument. A value of 1 (select = 1
) for
all variables forces them all into the model. Setting the nominal
significance level to be 1 for a given variable forces it into the
model, leaving others to be selected or not. A variable is dropped if
its removal leads to a nonsignificant increase in deviance. Default is
select = 0.05
for each variable.
On the other hand, the alpha
argument is used to
determine the complexity of the selected FP function. A value of 1
(alpha = 1
) will choose the most complex FP function
permitted for a given variable. For example, if FP2 is the most complex
function allowed for a continuous variable, setting
alpha = 1
will select the best FP2 function. Default is
alpha = 0.05
for each continuous variable.
The rules for setting select
and alpha
are
the same as those for setting df
(see
section 2.3.3). The following R codes shows how to set equal nominal
significance levels for variable and function selection (\alpha_1 = \alpha_2 = 0.05) for each
variable and produces identical results. Setting different nominal
significance levels is straightforward. Simply replace the value 0.05
with the desired significance level of your choice.
# Default Interface
mfp2(x, y, select = rep(0.05, ncol(x)), alpha = rep(0.05, ncol(x)))
# Formula Interface
mfp2(lpsa ~ fp(age, select = 0.05) + svi + fp(pgg45, select = 0.05) + fp(cavol,
select = 0.05) + fp(weight, select = 0.05) + fp(bph, select = 0.05) + fp(cp,
select = 0.05), select = 0.05, alpha = 0.05, data = prostate)
In the formula interface, binary variables such as svi
that are not passed in the fp()
function utilize the global
select
argument. In our example, the global parameter is
set to select = 0.05
. If several binary variables exist in
the model, the global parameters will be used for all of them. However,
if specific parameters need to be set for individual binary variables,
the user can use the fp()
function. In summary, if a
variable is not passed through the fp
function, it will
utilize the global parameters.
Suppose we want to force the variables "age"
and
"svi"
in the model. To achieve this, we have two
options:
select = 1
for age
and
svi
in the mfp2()
function.keep
argument, which
resets the nominal significance levels for age
and
svi
to 1 if they are different from 1. This ensures that
these variables are retained in the model.# default Interface Set select to 1 for variables 'age' and 'svi'
mfp2(x, y, select = c(1, 1, 0.05, 0.05, 0.05, 0.05, 0.05), alpha = rep(0.05,
ncol(x)))
# use keep argument
mfp2(x, y, select = rep(0.05, ncol(x)), alpha = rep(0.05, ncol(x)), keep = c("age",
"svi"))
# formula Interface use fp() function and set select to 1 for 'age' and
# 'svi'
mfp2(lpsa ~ fp(age, select = 1) + fp(svi, df = 1, select = 1) + fp(pgg45, select = 0.05) +
fp(cavol, select = 0.05) + fp(weight, select = 0.05) + fp(bph, select = 0.05) +
fp(cp, select = 0.05), select = 0.05, alpha = 0.05, data = prostate)
# use keep argument
mfp2(lpsa ~ fp(age, select = 0.05) + svi + fp(pgg45, select = 0.05) + fp(cavol,
select = 0.05) + fp(weight, select = 0.05) + fp(bph, select = 0.05) + fp(cp,
select = 0.05), select = 0.05, alpha = 0.05, keep = c("age", "svi"), data = prostate)
Instead of using nominal significance levels for variable and function selection, an alternative approach is to directly use adapted versions of AIC or BIC, defined below.
\begin{align*} AIC-P & = -2 \log(L) + 2k \\ BIC-P & = -2 \log(L) + \log(n) \times k, \end{align*}
where \log(L) is the maximum log-likelihood of the fitted model, which measures how well the model fits the data. The parameter k corresponds to the number of estimated parameters in the model (regression estimates and FP powers) and n is the sample size or the number of observations in the dataset. AIC-P and BIC-P consider both model fit and complexity, with lower values indicating better-fitting models.
For instance, let’s assume we want to select the best model for a variable of interest (z) with fixed adjustment variables x1 (with power p3 as the best FP1 function) and x2 (linear function was pre-specified). Then, we can compare the AIC and BIC of different models, such as FP2, FP1, linear, and null models for z. The adjustment models all have the same number of parameters (2 for x1, 1 for x2, and perhaps an intercept if it exists). For z, we have 4, 2, and 1 parameter(s) for FP2, FP1, and linear models, respectively, for a model without an intercept. In a model without an intercept, the total number of parameters, including adjustment variables, for variable z are 7 (4 + 3) for FP2, 5 (2 + 3) for FP1, and 4 (1 + 3) for a linear model. These total number of parameters for each model type is used in calculating the AIC-P and BIC-P. The model with the smallest AIC-P or BIC-P is then selected for z.
In mfp2
package, this can be achieved by setting the
criterion
argument to either "aic"
or
"bic"
in the mfp2()
function. Additionally, if
there is a need to force certain variables, such as age
and
svi
, into the model, the keep
argument can be
used.
The following R code demonstrates how to implement this approach using both the default and formula interfaces, as well as how to force specific variables into the model:
# Default Interface
mfp2(x, y, criterion = "aic", keep = c("age", "svi"))
# Formula Interface
mfp2(lpsa ~ fp(age) + fp(svi, df = 1) + fp(pgg45) + fp(cavol) + fp(weight) +
fp(bph) + fp(cp), criterion = "aic", keep = c("age", "svi"), data = prostate)
The FSP in mfp2()
function compares various models for
the variable of interest. For instance, if the most complex allowed FP
function is FP2, the FSP will compare the best FP2 model with the null
model, the best FP2 model with the linear model, and the best FP2 model
with the best FP1 model, when the criterion = "pvalue"
. The
deviance for each model (null, linear, FP1, and FP2) and their
corresponding differences and p-values will be calculated.
When comparing Gaussian models, the mfp2()
function
provides two options for calculating p-values: the F-test and the
Chi-square test. The Chi-square test is the default option. For other
non-Gaussian models such Cox and logistic regression models, the
Chi-square test is used. For more detailed information, please refer to
page 23 of the MFP Stata
manual paper (here)
To use the F-test, we can set the ftest = TRUE
, as
demonstrated in the example below. Conversely, to use the Chi-square
test, set ftest = FALSE
.
# Default Interface
mfp2(x, y, criterion = "pvalue", ftest = TRUE)
# Formula Interface
mfp2(lpsa ~ fp(age) + svi + fp(pgg45) + fp(cavol) + fp(weight) + fp(bph) + fp(cp),
criterion = "pvalue", ftest = TRUE, data = prostate)
Please note that the p-values reported by the mfp
program in Stata
for Gaussian models are based on the
F-test. If you intend to compare the results between the two software
packages, it is crucial to ensure that ftest = TRUE
is set
in R since the default is
ftest = FALSE
.
It is important to be aware that the older version of the
mfp
package in R uses the Chi-square test
for all model types, which may yield different results compared to the
Stata program, for Gaussian models. In summary, the mfp2
package reproduces the results of Stata (mfp and mfpa programs), as well
as the old mfp
program in R if you set
ftest = FALSE
.
By default, the mfp2()
function uses the predefined set
S = \{-2, -1, -0.5, 0, 0.5, 1, 2,
3\} for each continuous covariate, as proposed in the original
paper by Royston and Altman (1994). However, there may be situations
where users want to provide their own power terms based on subject
matter knowledge (not all powers allowed). In such cases, the
powers
argument which takes a list of powers can be
employed to specify the desired power terms. More extreme powers may be
considered for a possible improvement of the model fit. This may be
sensible if the ‘extreme’ power terms (-2, -2) or (3,3) are selected,
Considering (-3,-3) or (4,4) may be a suitable option. By default, the
covariates are arranged in order of decreasing statistical significance
(see section 2.1 for other options).
The following example illustrates how to assign different power terms
to continuous covariates. Two power terms (1 and 0.5) are evaluated for
the age
covariate. This set includes three models of degree
2 (with power combinations of (1, 0.5), (1, 1), and (0.5, 0.5)),and a
model of degree 1 (with power of 0.5). A linear model (power=1) is also
allowed.
The cavol
variable is assigned three powers (1, 0, 1).
The remaining continuous variables, namely pgg45
,
weight
, bph
, and cp
, are not
assigned any power terms. Instead, the default powers defined in set
S are used and a search through all
possible fractional polynomials up to the degree set by df
is performed.
When using the fp()
function in the formula interface,
the power terms provided must be a vector
not a list since
they are specific to a particular variable.
# create a list of power terms for covariates 'age' and 'cavol'
powx <- list(age = c(1, 0.5), cavol = c(1, 0, 1))
# Default Interface
mfp2(x, y, criterion = "pvalue", powers = powx)
# Formula Interface
mfp2(lpsa ~ fp(age, powers = c(1, 0.5)) + svi + fp(pgg45) + fp(cavol, powers = c(1,
0, 1)) + fp(weight) + fp(bph) + fp(cp), data = prostate)
Using the prostate example, we briefly explain how the algorithm works. Similar to backward elimination, it starts with the full model (model with all variables, linear function for all continuous variables) and investigates whether variables can be eliminated and whether non-linear functions improve the fit. For each of the continuous variables, the FSP is used to check whether a non-linear function fits the data significantly better than a linear function. After the first cycle, some variables may be eliminated from the model, and for some continuous variables, a more suitable non-linear function may be identified as a better fit for the data.
The algorithm then proceeds to the second cycle, but the new starting model now has fewer variables due to the elimination process in the previous cycle. Additionally, non-linear functions may have been identified for some of the continuous variables. In the second cycle, all variables will be reconsidered, even if they were not significant at the end of the first cycle, and FSP is used again to determine the ‘best’ fitting FP function (the selected functions may be different due to potential variations in the adjustment variables). The results obtained from the second cycle then serve as the starting model for the third cycle. In most cases, the variables and functions selected remain unchanged in cycles 3 or 4 and the algorithm stops with the final MFP model.
The order of ‘searching’ for model improvement by better fitting non-linear functions is important. Mismodelling the functional form of a variable with a strong effect is more critical than mismodelling the functional form of a variable with a weak effect. Therefore, the order is determined by the p-values from the full model. Variables with small p-values are considered first.
To explain the output of the MFP algorithm, we will use the default
interface of the mfp2()
function to build an MFP model with
the default parameters. Specifically, all continuous variables are
assigned a degree of freedom (df) of 4, implying that FP2 is the most
complex permitted function. Moreover, we employ the ‘pvalue’ as the
criterion for variable and function selection, and we set the nominal
significance levels for both components to 0.05 for all variables.
Finally, we use the F-test instead of Chi-square to calculate the
p-values.
The R output displays the df
used for
each variable, as denoted by the “initial degrees of freedom”. The
program correctly identifies svi
as a binary variable and
assigns it a df of 1. Additionally, the variables are ordered based on
their p-values in descending order of significance, as indicated by the
“visiting order”. The variable cavol
has the smallest
p-value and will be evaluated first, while age
has the
largest p-value and will be evaluated last.
fit <- mfp2(x, y, criterion = "pvalue", select = 0.05, alpha = 0.05, ftest = TRUE)
#>
#> i Initial degrees of freedom:
#> age svi pgg45 cavol weight bph cp
#> df 4 1 4 4 4 4 4
#>
#> i Visiting order: cavol, svi, pgg45, weight, bph, cp, age
#>
#> ---------------------
#> i Running MFP Cycle 1
#> ---------------------
#>
#> Variable: cavol (keep = FALSE)
#> Powers DF Deviance Versus Deviance diff. P-value
#> FP2 -0.5, 1 12 196.3 . . .
#> null NA 8 240.1 FP2 43.8 0.0000
#> linear 1 9 214.3 FP2 18.0 0.0011
#> FP1 0 10 199.7 FP2 3.4 0.2226
#> Selected: FP1
#>
#> Variable: svi (keep = FALSE)
#> Powers DF Deviance Versus Deviance diff. P-value
#> null NA 8 208.6 . . .
#> linear 1 9 199.7 null 9.0 0.0042
#> Selected: linear
#>
#> Variable: pgg45 (keep = FALSE)
#> Powers DF Deviance Versus Deviance diff. P-value
#> FP2 -2, -2 12 196.7 . . .
#> null NA 8 202.0 FP2 5.3 0.3091
#> Selected: null
#>
#> Variable: weight (keep = FALSE)
#> Powers DF Deviance Versus Deviance diff. P-value
#> FP2 -2, -2 11 199.3 . . .
#> null NA 7 209.7 FP2 10.4 0.0521
#> Selected: null
#>
#> Variable: bph (keep = FALSE)
#> Powers DF Deviance Versus Deviance diff. P-value
#> FP2 -1, 3 10 207.7 . . .
#> null NA 6 217.7 FP2 10.0 0.0567
#> Selected: null
#>
#> Variable: cp (keep = FALSE)
#> Powers DF Deviance Versus Deviance diff. P-value
#> FP2 2, 3 9 213.2 . . .
#> null NA 5 217.9 FP2 4.6 0.3655
#> Selected: null
#>
#> Variable: age (keep = FALSE)
#> Powers DF Deviance Versus Deviance diff. P-value
#> FP2 -1, -1 8 216.6 . . .
#> null NA 4 217.9 FP2 1.2 0.8839
#> Selected: null
#>
#> ---------------------
#> i Running MFP Cycle 2
#> ---------------------
#>
#> Variable: cavol (keep = FALSE)
#> Powers DF Deviance Versus Deviance diff. P-value
#> FP2 -0.5, 1 7 215.0 . . .
#> null NA 3 264.6 FP2 49.6 0.0000
#> linear 1 4 238.1 FP2 23.0 0.0001
#> FP1 0 5 217.9 FP2 2.8 0.2646
#> Selected: FP1
#>
#> Variable: svi (keep = FALSE)
#> Powers DF Deviance Versus Deviance diff. P-value
#> null NA 3 226.9 . . .
#> linear 1 4 217.9 null 9.0 0.0032
#> Selected: linear
#>
#> Variable: pgg45 (keep = FALSE)
#> Powers DF Deviance Versus Deviance diff. P-value
#> FP2 0.5, 3 8 214.3 . . .
#> null NA 4 217.9 FP2 3.6 0.4973
#> Selected: null
#>
#> Variable: weight (keep = FALSE)
#> Powers DF Deviance Versus Deviance diff. P-value
#> FP2 -2, -2 8 202.1 . . .
#> null NA 4 217.9 FP2 15.7 0.0052
#> linear 1 5 205.0 FP2 2.9 0.4438
#> Selected: linear
#>
#> Variable: bph (keep = FALSE)
#> Powers DF Deviance Versus Deviance diff. P-value
#> FP2 0.5, 0.5 9 199.1 . . .
#> null NA 5 205.0 FP2 5.9 0.2460
#> Selected: null
#>
#> Variable: cp (keep = FALSE)
#> Powers DF Deviance Versus Deviance diff. P-value
#> FP2 2, 3 9 200.3 . . .
#> null NA 5 205.0 FP2 4.7 0.3617
#> Selected: null
#>
#> Variable: age (keep = FALSE)
#> Powers DF Deviance Versus Deviance diff. P-value
#> FP2 -0.5, 0 9 203.2 . . .
#> null NA 5 205.0 FP2 1.8 0.7926
#> Selected: null
....
After the variables are ordered, the MFP algorithm starts by
searching for a suitable function for cavol
. It compares
the best-fitting FP2 (-0.5, 1) function for cavol
against a
null model that excludes cavol
. This comparison involves
adjusting for all other six variables (svi
to
age
) using linear functions. The test is highly significant
(p = 0.0000), indicating that the best FP2 function fits significantly
better than a null model. Next, the model with the best FP2 function is
compared to a model assuming linearity, and the test remains significant
(p = 0.0011). This suggests that cavol
can be better
described by a non-linear function at this stage. Finally, the best FP2
function is compared to the best FP1 (0) function, and the test is not
significant (p = 0.2226). Thus, at this stage in the model-selection
procedure, the final function for cavol
is FP1 with power
0, denoting a log function.
The next variable to be evaluated is svi
, which is a
binary variable. In this case, only the test of null versus linear is
appropriate. Both the null and linear models of svi
are
adjusted for log(cavol)
(the log function just selected)
and the other five variables (pgg45
, cp
,
weight
, bph
, and age
) that are
still present in the model. The test of null versus linear is
significant (p = 0.0042), indicating that svi
remains in
the model.
Next, we evaluate the continuous variable pgg45
in the
same model as for svi
. The p-value for the first test (FP2
versus null) is not significant (p = 0.3091), and the variable is
eliminated from the model and will not be included in any of the models
in the rest of the first cycle.
Next, the algorithm evaluates the variable weight
in a
model adjusting for log(cavol)
, svi
, and the
three variables (cp
, bph
, and
age
) that have not yet been evaluated. The first test is
non-significant (p = 0.0521), and the variable is eliminated for the
rest of the first cycle. In the subsequent steps, bph
,
cp
, and age
are also eliminated.
At the end of the first cycle, only two variables were
selected: log(cavol)
and svi
.
The variables and functions selected in the first cycle becomes the
new starting model for the second cycle. The second cycle starts with
the evaluation of the effect of cavol
, but this time in a
simpler model adjusting for svi
only. Deviances are larger
compared to cycle 1 because the five variables (pgg45
,
age
, weight
, bph
and
cp
) no longer belong to the ‘adjustment’ model. However,
the FSP still selects log(cavol)
, now with a different
estimate for the regression parameter.
Then svi
is evaluated in a model that includes only
log(cavol)
, and the test of linear vs null confirms that
the variable should be included in the model. Pgg45
is
evaluated in a model with svi
and log(cavol)
as adjustment variables, but the test of inclusion (FP2 vs null) is not
significant and the variable is again eliminated.
The variable weight
is considered next in a model with
svi
and log(cavol)
. In contrast to the first
cycle, the test for inclusion (FP2 vs null) is significant (p = 0.0052),
indicating that weight
needs to be re-included into the
model. The second test of the FSP (FP2 vs linearity) is non-significant
and a linear function is chosen for weight
. The inclusion
of weight
in the second cycle is a result of eliminating
bph
, cp
and age
in the first
cycle, and of the correlation between these variables.
In the subsequent steps, bph
, cp
, and
age
are investigated in models adjusting for
log(cavol)
, svi
, and weight
(linear). Compared to the first cycle, the p-values change, but all of
them are much larger than 0.05. Therefore, none of these variables are
included in the model.
The second cycle ends with the model consisting of
log(cavol)
, svi
(binary), and
weight
(linear).
This model serves as the starting point for the third cycle (not shown), where all seven variables are investigated again. However, there are no further change in the selected variables or functions, so MFP terminates with the three variables selected in the second cycle.
Once you fit an MFP model using the mfp2()
function, you
can apply various methods to the resulting model object to extract
information or perform specific tasks. Suppose we fit an MFP model for
the Gaussian family with default parameters as follows:
fit <- mfp2(x, y, verbose = FALSE)
class(fit)
#> [1] "mfp2" "glm" "lm"
The fit
is an object of class mfp2
, which
inherits properties from glm
and lm
. If the
family = "cox"
, the fit will inherit from
coxph
in the survival package. This means that
mfp2
can utilize methods and functions defined for
glm
, lm
or coxph
.
To obtain a summary of the final mfp2
object, you can
simply enter the object name (e.g., fit
in our example) or
use the print function. The print()
function provides a
comprehensive summary of the final MFP model, including the parameters
used in the MFP model such as shifting and scaling factors, degrees of
freedom (denoted by df_initial
), nominal significance
levels (select
and alpha
), and other relevant
information.
Furthermore, it shows the final degrees of freedom (denoted by
df_final
), where a df
of 0 indicates that a
variable was eliminated, which is confirmed by the selected
column. The power
column shows the final FP powers for
variables, with eliminated variables assigned NA
. The
acd
column, which indicates whether ACD transformation was
applied to a covariate, will only be displayed when the user chooses to
apply ACD transformation to at least one covariate.
print(fit)
#> Shifting, Scaling and Centering of covariates
#> shift scale center
#> cavol 0 10 TRUE
#> svi 0 1 TRUE
#> pgg45 1 100 TRUE
#> weight 0 100 TRUE
#> bph 0 1 TRUE
#> cp 0 10 TRUE
#> age 0 10 TRUE
#>
#> Final Multivariable Fractional Polynomial for y
#> df_initial select alpha selected df_final power1
#> cavol 4 0.05 0.05 TRUE 2 0
#> svi 1 0.05 0.05 TRUE 1 1
#> pgg45 4 0.05 0.05 FALSE 0 NA
#> weight 4 0.05 0.05 TRUE 1 1
#> bph 4 0.05 0.05 FALSE 0 NA
#> cp 4 0.05 0.05 FALSE 0 NA
#> age 4 0.05 0.05 FALSE 0 NA
#>
#> MFP algorithm convergence: TRUE
#>
#> Call: glm(formula = y ~ ., family = family, data = data, weights = weights,
#> offset = offset, x = TRUE, y = TRUE)
#>
#> Coefficients:
#> (Intercept) cavol.1 svi.1 weight.1
#> 2.3313 0.5402 0.6794 1.4159
#>
#> Degrees of Freedom: 96 Total (i.e. Null); 93 Residual
#> Null Deviance: 127.9
#> Residual Deviance: 47 AIC: 215
To display the regression coefficients from the final MFP model, you
can use the coef()
or summary()
function. It’s
important to note that the variables in the model have names with dot
extensions. This means that a variable with two FP powers (FP2) will be
denoted as “var.1” and “var.2”. Please note that the displayed
regression coefficients are not currently scaled back to the original
scale.
# extract only regression coefficients
coef(fit)
#> (Intercept) cavol.1 svi.1 weight.1
#> 2.3312906 0.5402090 0.6794446 1.4158958
# display regression coefficients with other statistics
summary(fit)
#>
#> Call:
#> glm(formula = y ~ ., family = family, data = data, weights = weights,
#> offset = offset, x = TRUE, y = TRUE)
#>
#> Coefficients:
#> Estimate Std. Error t value Pr(>|t|)
#> (Intercept) 2.33129 0.08509 27.399 < 2e-16 ***
#> cavol.1 0.54021 0.07449 7.252 1.2e-10 ***
#> svi.1 0.67944 0.20807 3.265 0.001531 **
#> weight.1 1.41590 0.38967 3.634 0.000458 ***
#> ---
#> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#>
#> (Dispersion parameter for gaussian family taken to be 0.5054218)
#>
#> Null deviance: 127.918 on 96 degrees of freedom
#> Residual deviance: 47.004 on 93 degrees of freedom
#> AIC: 215
#>
#> Number of Fisher Scoring iterations: 2
Users can generate predictions for new or existing data based on the
fitted MFP model using the predict()
function. To
illustrate this, let’s consider an example using the prostate dataset.
Suppose we want to make predictions for the first five observations in
the matrix x (considering them as
new observations). We can achieve this with the following code:
# extract the first five observations from 'x'
new_observations <- x[1:5, ]
dim(new_observations)
#> [1] 5 7
# make predictions for these new observations.
predict(fit, newdata = new_observations)
#> [,1]
#> [1,] 0.9297156
#> [2,] 0.8714946
#> [3,] 0.9499953
#> [4,] 0.7440425
#> [5,] 1.8612448
To prepare the newdata
for prediction, the
predict()
function applies any necessary shifting and
scaling based on the factors obtained from the development data. It is
important to note that if the shifting factors are not sufficiently
large as estimated from the development data, variables in
newdata
may end up with negative values, which can cause
prediction errors if non-linear functional forms are used. A warning is
given in this case by the function.
The regression estimates (\hat{\beta}) for FP terms provide
incomplete information as they only give limited insight into the fitted
function for the variable of interest. A more informative approach is to
visualize functions for variables with non-linear effects (Royston and
Sauerbrei, 2008). The mfp2
package offers the
fracplot()
function specifically designed for this
purpose.
By providing the mfp2
object and other relevant
arguments, fracplot()
generates plots of the functions for
variables along with 95% confidence intervals. As for all data derived
models, model based confidence limits may be too narrow. The bootstrap
may be used to derive more realistic estimates of standard errors and
pointwise confidence intervals (Sauerbrei and Royston 2007). You can
specify the variable to be plotted using the terms
argument. If terms
is set to NULL
(the
default), the function returns a list of plots for all variables
included in the model.
The fracplot()
function, by default produces a
component-plus-residual plot. This is because the
partial_only
argument is set to FALSE
by
default.
For Gaussian models with constant weights and a single covariate, this amounts to a plot of the outcome variable and each covariate with the fitted function.
For other normal-error models, weighted residuals are calculated and added to the fit. For models with two or more covariates, the function is the partial linear predictor for the variable of interest and includes the intercept.
For generalized linear and Cox models, the fracplot()
function plots the fitted values on the scale of the linear predictor.
Deviance residuals are added to the (partial) linear predictor for
generalized linear models, while martingale residuals are added for Cox
models to give component-plus-residual. These values are small circles
on the plot. The component-plus-residual plots may show the amount of
residual variation at each covariate and may indicate lack of fit or
outliers in the outcome (Royston and Sauerbrei, 2008; StataCorp.
2023).
If you set partial_only = TRUE
, the
fracplot()
function will plot only the partial linear
predictor without including the residuals. For more detailed information
on component-plus-residuals, you can refer to the Stata manual
(here).
For instance, to plot the function for cavol
in the
model, you can use the following code:
plots <- fracplot(fit)
# plots is a list
class(plots)
#> [1] "list"
# The plot of cavol is the first element of the list
plots[[1]] + ggplot2::ylab("Partial predictor + residuals")
The fracplot()
function also supports the incorporation
of a reference point or target value in the plot. Including a reference
point allows for visualizing whether other data points exceed or fall
below the reference point. To achieve this, you can use the
type
argument and set it to “contrasts”
(type = "contrasts"
). Additionally, you can supply the
reference point using the ref
argument. By default, the
ref
argument is set to NULL
, and average
values are used as the reference points for continuous variables, while
the minimum values are used as reference points for binary variables.
For instance, if we want the reference value for the continuous variable
cavol
to be 30, we can use the following code:
plots <- fracplot(fit, type = "contrasts", ref = list(cavol = 30))
plots[[1]] + ggplot2::ylab("Partial predictor")
In certain cases, when the data points for a specific variable contain extreme values, the resulting function plotted using the original data can appear very sharp. This sharpness is primarily due to the limited number of data points that are close to the extreme values of the variable.
The fracplot()
function provides two options for
generating the plot. The first option is to use the original data that
was used to fit the model (terms_seq = "data"
). The second
option is to generate a sequence of equidistant new data points using
the range of the original data for the variable of interest
(terms_seq = "equidistant"
).
To illustrate these options, we will use the art
dataset
included in the mfp2
package. This dataset is described in
detail by Royston and Sauerbrei (2008, Chapter 10), and the outcome
variable is continuous. In our example, we will fit an MFP model with a
single covariate (x5) (see Figure
below).
# load art data
data("art")
# fit mfp model using art data
xx <- as.matrix(art[, "x5", drop = FALSE])
yy <- as.numeric(art$y)
fit2 <- mfp2(xx, yy, verbose = FALSE)
# generate plot using original data
plot1 <- fracplot(fit2)
plot1[[1]] <- plot1[[1]] + ggplot2::ylab("Partial predictor + residuals")
# generate plot using sequence of data
plot2 <- fracplot(fit2, terms_seq = "equidistant")
plot2[[1]] <- plot2[[1]] + ggplot2::ylab("Partial predictor + residuals")
# combine plots
patchwork::wrap_plots(plot2[[1]], plot1[[1]], ncol = 2, widths = 8, heights = 4)
The mfp2
package requires the creation of dummy
(indicator) variables for categorical independent variables before using
the mfp2()
function. The number of dummy variables depends
on the number of classes in the categorical variable. As a general rule
of thumb, a categorical variable with k
classes is
represented by k-1
dummy variables, each taking on the
values of 0 and 1. It may be easiest to define k-1 new dummy variable
before starting mfp2
, including all k-1 dummy variables as
independent variables. Based on subject matter knowledge you may also
decide to combine some of the categories, resulting in a smaller number
of dummy variables and decide to include these dummy variables in the
model, not allowing their elimination. In principle, this would be an
important decision from the initial data analysis process (Huebner et al
2018).
Categorical variables can be measured using an ordinal scale or a nominal scale. An ordinal variable indicates that the levels of the variable are ordered in some way, such as the levels of income (e.g., low income, medium income, and high income). On the other hand, a nominal variable has no intrinsic ordering among its categories. When variable selection is involved in the model building process, it is important to consider different approaches for coding ordinal and nominal variables in order to facilitate interpretation. This is because, during variable selection, certain dummy variables may be eliminated, which is analogous to combining some levels of variables. For more details on handling categorical variables, refer to Royston and Sauerbrei (2008, Chapter 3).
The model.matrix()
function in R
automatically generates dummy variables using a dummy coding for
categorical variables. This will ignore any ordinal structure of the
variables. The mfp2
package offers a function called
create_dummy_variables()
that provides an alternative to
the model.matrix()
specifically designed for creating dummy
variables for both ordinal and nominal variables.
Before using the create_dummy_variables()
function, we
recommend converting the variable of interest to a factor using the
factor()
function in base R. If necessary, you can specify
the levels of the variable. Otherwise, the first level will be used as
the reference. We will illustrate how to handle categorical variables
using the art
data. The outcome variable is continuous, and
there are 10 covariates.
The R code below demonstrates how to convert the ordinal variable
x4
, which has 3 levels (1 as the reference, 2, and 3), to
dummy variables using ordinal coding. Similarly, it demonstrates how to
convert the nominal variable x9
, which also has 3 levels (1
as the reference, 2, and 3), to dummy variables using categorical
coding. The final data will contain the dummy variables, which can be
supplied to the mfp2()
function.
# load art data
data("art")
# order the levels of the variable such that Levels: 1 < 2 < 3
art$x4 <- factor(art$x4, ordered = TRUE, levels = c(1, 2, 3))
# convert x9 to factor and assign level 1 as reference group
art$x9 <- factor(art$x9, ordered = FALSE, levels = c(1, 2, 3))
# create dummy variables for x4 and x9 based on ordinal an categorical
# coding and drop the original variable
art <- create_dummy_variables(art, var_ordinal = c("x4"), var_nominal = c("x9"),
drop_variables = TRUE)
# display the first 20 observations
head(art, 10)
#> y x1 x2 x3 x5 x6 x7 x8 x10 x4_1 x4_2 x92 x93
#> 1 12.81324 60 1 24 1.829647 247 302 0 33.13 0 0 0 0
#> 2 11.54249 35 1 20 10.335463 64 97 0 12.75 1 0 0 0
#> 3 10.15125 70 0 21 7.155877 0 76 1 16.92 1 0 0 0
#> 4 11.44530 64 0 17 7.745342 21 2 1 15.12 1 0 0 0
#> 5 13.01152 71 0 20 1.851301 226 92 0 31.75 1 0 1 0
#> 6 12.50528 50 1 17 2.471984 35 108 1 9.08 1 0 0 0
#> 7 11.48337 49 1 13 7.277893 15 66 1 23.75 1 1 0 0
#> 8 11.90820 44 1 20 6.306514 89 115 0 34.99 1 0 0 0
#> 9 12.55809 55 1 26 6.544373 171 133 1 22.64 1 0 1 0
#> 10 12.53969 56 1 14 2.057520 16 201 0 30.63 1 0 0 0
After creating the dummy variables (x4_1
and
x4_2
for x4
, and x92
and
x93
for x9
), we fit the MFP model using the
mfp2()
function with default parameters. If a categorical
variable is not an important predictor of the outcome, all the dummy
variables will be eliminated, as in the case of variable
x9.
On the other hand, variable x4
is an
important predictor, and only one dummy (x4_2
) out of the
two was eliminated. Elimination of x4_2
means that the two
categories 2 and 3 are collapsed as a result of the selection
process.
# create matrix x and outcome y
x <- as.matrix(art[, -1])
y <- as.numeric(art$y)
# fit mfp using default interface with default parameters
fit <- mfp2(x, y, verbose = FALSE)
fit
#> Shifting, Scaling and Centering of covariates
#> shift scale center
#> x5 0 100 TRUE
#> x1 0 10 TRUE
#> x7 1 1000 TRUE
#> x10 0 10 TRUE
#> x4_1 0 1 TRUE
#> x93 0 1 TRUE
#> x8 0 1 TRUE
#> x92 0 1 TRUE
#> x4_2 0 1 TRUE
#> x6 1 1000 TRUE
#> x2 0 1 TRUE
#> x3 0 10 TRUE
#>
#> Final Multivariable Fractional Polynomial for y
#> df_initial select alpha selected df_final power1 power2
#> x5 4 0.05 0.05 TRUE 4 0 3
#> x1 4 0.05 0.05 TRUE 1 1 NA
#> x7 4 0.05 0.05 FALSE 0 NA NA
#> x10 4 0.05 0.05 TRUE 1 1 NA
#> x4_1 1 0.05 0.05 TRUE 1 1 NA
#> x93 1 0.05 0.05 FALSE 0 NA NA
#> x8 1 0.05 0.05 TRUE 1 1 NA
#> x92 1 0.05 0.05 FALSE 0 NA NA
#> x4_2 1 0.05 0.05 FALSE 0 NA NA
#> x6 4 0.05 0.05 TRUE 2 0 NA
#> x2 1 0.05 0.05 FALSE 0 NA NA
#> x3 4 0.05 0.05 TRUE 1 1 NA
#>
#> MFP algorithm convergence: TRUE
#>
#> Call: glm(formula = y ~ ., family = family, data = data, weights = weights,
#> offset = offset, x = TRUE, y = TRUE)
#>
#> Coefficients:
#> (Intercept) x5.1 x5.2 x1.1 x10.1 x4_1.1 x8.1
#> 12.3794 -0.6793 0.1973 -0.2169 0.1787 -0.3616 0.3324
#> x6.1 x3.1
#> 0.2318 -0.2350
#>
#> Degrees of Freedom: 249 Total (i.e. Null); 241 Residual
#> Null Deviance: 245
#> Residual Deviance: 118.6 AIC: 543
Logistic regression is a statistical method used to model binary
outcomes. The mfp2()
function implements the logistic
regression model using the family = "binomial"
argument.
For illustration, we will use the Pima Indians dataset included in the
mfp2 package, which was downloaded from the MFP
website(here). The dataset originates from a study conducted on a
cohort of 768 female Pima Indians, a group known to be particularly
susceptible to diabetes. Out of the 768 individuals, 268 developed
diabetes. The dataset consists of a binary outcome variable, denoting
the presence (1) or absence (0) of diabetes, and eight covariates. For
more information on the dataset, please refer to the book by Royston and
Sauerbrei (2008) Chapter 9.7.
In the mfp2()
function, the response variable
y
should be provided as a binary vector rather than a
matrix. The other arguments whether using the “default” or “formula”
interface for logistic regression, are nearly identical to those
previously explained for the Gaussian family.
# load pima data
data("pima")
head(pima)
#> # A tibble: 6 × 10
#> id pregnant glucose diastolic triceps insulin bmi diabetes age y
#> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 1 6 148 72 35 126 33.6 0.627 50 1
#> 2 2 1 85 66 29 76 26.6 0.351 31 0
#> 3 3 8 183 64 20 272 23.3 0.672 32 1
#> 4 4 1 89 66 23 94 28.1 0.167 21 0
#> 5 5 0 137 40 35 168 43.1 2.29 33 1
#> 6 6 5 116 74 19 72 25.6 0.201 30 0
# matrix x
x <- as.matrix(pima[, 2:9])
# outcome y
y <- as.vector(pima$y)
# fit mfp
fit <- mfp2(x, y, family = "binomial", verbose = FALSE)
fit
#> Shifting, Scaling and Centering of covariates
#> shift scale center
#> glucose 0 100 TRUE
#> bmi 0 10 TRUE
#> pregnant 1 10 TRUE
#> diabetes 0 1 TRUE
#> age 0 10 TRUE
#> diastolic 0 10 TRUE
#> triceps 0 10 TRUE
#> insulin 0 100 TRUE
#>
#> Final Multivariable Fractional Polynomial for y
#> df_initial select alpha selected df_final power1 power2
#> glucose 4 0.05 0.05 TRUE 1 1 NA
#> bmi 4 0.05 0.05 TRUE 2 -2 NA
#> pregnant 4 0.05 0.05 FALSE 0 NA NA
#> diabetes 4 0.05 0.05 TRUE 1 1 NA
#> age 4 0.05 0.05 TRUE 4 0 3
#> diastolic 4 0.05 0.05 FALSE 0 NA NA
#> triceps 4 0.05 0.05 FALSE 0 NA NA
#> insulin 4 0.05 0.05 FALSE 0 NA NA
#>
#> MFP algorithm convergence: TRUE
#>
#> Call: glm(formula = y ~ ., family = family, data = data, weights = weights,
#> offset = offset, x = TRUE, y = TRUE)
#>
#> Coefficients:
#> (Intercept) glucose.1 bmi.1 diabetes.1 age.1 age.2
#> -0.98769 3.61970 -15.16265 0.80648 4.38286 -0.01672
#>
#> Degrees of Freedom: 767 Total (i.e. Null); 762 Residual
#> Null Deviance: 993.5
#> Residual Deviance: 683.5 AIC: 695.5
From the output, it is evident that four continuous variables
(glucose
(power 1), bmi(-2), diabetes
(1) and
age
(0, 3)) are selected. We can use
fracplot()
function to generate plots for these selected
variables, as illustrated below, which produced the Figure below.
We illustrate two of the analyses performed by Sauerbrei and Royston
(1999). We use the gbsg
data, which contains prognostic
factors data from the German Breast Cancer Study Group of patients with
node-positive breast cancer. The response variable is recurrence-free
survival time (rectime
), and the censoring variable is
censrec.
There are 686 patients with 299 events. We use Cox
regression to predict the log hazard of recurrence from prognostic
factors, of which five are continuous (age
,
size
, nodes
, pgr
,
er
), and one is binary (meno
). The variable
grade
is ordinal, and we will use ordinal coding to create
two dummy variables (grade_1
and grade_2
). The
treatment variable hormonal therapy (hormon
) is forced into
the model. As an alternative we stratify the analysis using the
hormon
variable.
We use the mfp2()
function with default parameters to
build a model from the initial set of eight covariates. We set the
nominal significance level for variable and FP selection to 0.05 for all
variables except hormon
, which is forced into the model
using the keep
argument by setting the significance level
of hormon
to 1.
The mfp2()
function for survival outcome requires the
outcome variable to be a survival object created using the
survival::Surv()
function. Other arguments are similar to
those explained previously for the Gaussian family.
# load gbsg data
data("gbsg")
# data preparation create dummy variable for grade using ordinal coding
gbsg <- create_dummy_variables(gbsg, var_ordinal = "grade", drop_variables = TRUE)
# covariate matrix x
x <- as.matrix(gbsg[, -c(1, 6, 10, 11)])
head(x, 10)
#> age meno size nodes pgr er hormon grade_1 grade_2
#> [1,] 38 0 18 5 141 105 0 1 1
#> [2,] 52 0 20 1 78 14 0 0 0
#> [3,] 47 0 30 1 422 89 0 1 0
#> [4,] 40 0 24 3 25 11 0 0 0
#> [5,] 64 1 19 1 19 9 1 1 0
#> [6,] 49 1 56 3 356 64 1 0 0
#> [7,] 53 1 52 9 6 29 0 1 0
#> [8,] 61 1 22 2 6 173 1 1 0
#> [9,] 43 0 30 1 22 0 0 1 0
#> [10,] 74 1 20 1 462 240 1 1 0
# use Surv() function to create outcome y
y <- survival::Surv(gbsg$rectime, gbsg$censrec)
# fit mfp and keep hormon in the model
fit1 <- mfp2(x, y, family = "cox", keep = "hormon", control = coxph.control(iter.max = 50),
verbose = FALSE)
fit1
#> Shifting, Scaling and Centering of covariates
#> shift scale center
#> nodes 0 10 TRUE
#> pgr 1 1000 TRUE
#> grade_1 0 1 TRUE
#> hormon 0 1 TRUE
#> size 0 100 TRUE
#> meno 0 1 TRUE
#> grade_2 0 1 TRUE
#> age 0 10 TRUE
#> er 1 1000 TRUE
#>
#> Final Multivariable Fractional Polynomial for y
#> df_initial select alpha selected df_final power1 power2
#> nodes 4 0.05 0.05 TRUE 4 -2.0 -1.0
#> pgr 4 0.05 0.05 TRUE 2 0.5 NA
#> grade_1 1 0.05 0.05 TRUE 1 1.0 NA
#> hormon 1 1.00 0.05 TRUE 1 1.0 NA
#> size 4 0.05 0.05 FALSE 0 NA NA
#> meno 1 0.05 0.05 FALSE 0 NA NA
#> grade_2 1 0.05 0.05 FALSE 0 NA NA
#> age 4 0.05 0.05 TRUE 4 -2.0 -0.5
#> er 4 0.05 0.05 FALSE 0 NA NA
#>
#> MFP algorithm convergence: TRUE
#> Call:
#> survival::coxph(formula = ff, data = d, weights = weights, control = control,
#> x = TRUE, y = TRUE, method = method, nocenter = nocenter)
#>
#> coef exp(coef) se(coef) z p
#> nodes.1 3.879e-02 1.040e+00 7.697e-03 5.040 4.67e-07
#> nodes.2 -5.491e-01 5.775e-01 8.643e-02 -6.353 2.11e-10
#> pgr.1 -1.807e+00 1.642e-01 3.506e-01 -5.153 2.56e-07
#> grade_1.1 5.007e-01 1.650e+00 2.496e-01 2.006 0.04488
#> hormon.1 -4.024e-01 6.687e-01 1.281e-01 -3.142 0.00168
#> age.1 4.473e+01 2.677e+19 8.257e+00 5.418 6.03e-08
#> age.2 -1.792e+01 1.645e-08 3.910e+00 -4.584 4.55e-06
#>
#> Likelihood ratio test=155.6 on 7 df, p=< 2.2e-16
#> n= 686, number of events= 299
According to Sauerbrei and Royston (1999), medical knowledge dictates
that the estimated risk function for nodes
(number of
positive nodes), which was selected with FP powers (-2, -1), should be
monotonic - but the selected function is not. Therefore, they proposed
to estimate a preliminary exponential transformation,
enodes = exp(-0.12 * nodes)
, for nodes and fitting a degree
1 FP for enodes
, thus obtaining a monotonic risk function.
The value of -0.12 was estimated univariately using nonlinear Cox
regression. To ensure a negative exponent, Sauerbrei and Royston (1999)
restricted the powers for enodes
to be positive. Their
Model III may be fit by using the following R command, which yields
identical results to the mfp command in
Stata(here):
# remove nodes and include enodes
x <- as.matrix(gbsg[, -c(1, 5, 10, 11)])
# fit mfp and keep hormon in the model
fit2 <- mfp2(x, y, family = "cox", keep = "hormon", powers = list(enodes = c(0.5,
1, 2, 3)), control = coxph.control(iter.max = 50), verbose = FALSE)
fit2
#> Shifting, Scaling and Centering of covariates
#> shift scale center
#> enodes 0 1 TRUE
#> pgr 1 1000 TRUE
#> hormon 0 1 TRUE
#> grade_1 0 1 TRUE
#> size 0 100 TRUE
#> meno 0 1 TRUE
#> grade_2 0 1 TRUE
#> age 0 10 TRUE
#> er 1 1000 TRUE
#>
#> Final Multivariable Fractional Polynomial for y
#> df_initial select alpha selected df_final power1 power2
#> enodes 4 0.05 0.05 TRUE 1 1.0 NA
#> pgr 4 0.05 0.05 TRUE 2 0.5 NA
#> hormon 1 1.00 0.05 TRUE 1 1.0 NA
#> grade_1 1 0.05 0.05 TRUE 1 1.0 NA
#> size 4 0.05 0.05 FALSE 0 NA NA
#> meno 1 0.05 0.05 FALSE 0 NA NA
#> grade_2 1 0.05 0.05 FALSE 0 NA NA
#> age 4 0.05 0.05 TRUE 4 -2.0 -0.5
#> er 4 0.05 0.05 FALSE 0 NA NA
#>
#> MFP algorithm convergence: TRUE
#> Call:
#> survival::coxph(formula = ff, data = d, weights = weights, control = control,
#> x = TRUE, y = TRUE, method = method, nocenter = nocenter)
#>
#> coef exp(coef) se(coef) z p
#> enodes.1 -1.981e+00 1.379e-01 2.269e-01 -8.732 < 2e-16
#> pgr.1 -1.840e+00 1.588e-01 3.508e-01 -5.245 1.57e-07
#> hormon.1 -3.945e-01 6.740e-01 1.281e-01 -3.080 0.00207
#> grade_1.1 5.174e-01 1.678e+00 2.494e-01 2.075 0.03799
#> age.1 4.355e+01 8.226e+18 8.253e+00 5.277 1.31e-07
#> age.2 -1.748e+01 2.558e-08 3.912e+00 -4.469 7.87e-06
#>
#> Likelihood ratio test=153.1 on 6 df, p=< 2.2e-16
#> n= 686, number of events= 299
See Figure 1 in Sauerbrei and Royston (1999) for the functions for nodes with powers (-2, - 1) and for enodes.
The stratified Cox model is a variation of the Cox proportional hazards (PH) model that allows for controlling the effect of a covariate that does not satisfy the PH assumption (Therneau et al. 2000). The model includes covariates that are assumed to satisfy the PH assumption, while the covariate being stratified on is not included in the model. Instead of including the stratified variable as a covariate in the model, it is used to define distinct subgroups.
The mfp2()
function, similar to the coxph()
function in the survival package, includes a strata
argument that enables stratification. The fitted MFP model assumes that
the FP functions for a particular variable do not vary over the strata.
However, it is essential for the user to evaluate this assumption.
To provide an illustrative example, we will stratify the analysis
using the variable hormon
. The following R
code demonstrates how to fit a stratified Cox model using both the
default and formula interfaces, along with their default parameters.
# using default interface
fit2 <- mfp2(x[, -7], y, family = "cox", strata = x[, 7], verbose = FALSE)
fit2
#> Shifting, Scaling and Centering of covariates
#> shift scale center
#> enodes 0 1 TRUE
#> pgr 1 1000 TRUE
#> grade_1 0 1 TRUE
#> size 0 100 TRUE
#> meno 0 1 TRUE
#> grade_2 0 1 TRUE
#> age 0 10 TRUE
#> er 1 1000 TRUE
#>
#> Final Multivariable Fractional Polynomial for y
#> df_initial select alpha selected df_final power1 power2
#> enodes 4 0.05 0.05 TRUE 1 1.0 NA
#> pgr 4 0.05 0.05 TRUE 2 0.5 NA
#> grade_1 1 0.05 0.05 TRUE 1 1.0 NA
#> size 4 0.05 0.05 FALSE 0 NA NA
#> meno 1 0.05 0.05 FALSE 0 NA NA
#> grade_2 1 0.05 0.05 FALSE 0 NA NA
#> age 4 0.05 0.05 TRUE 4 -2.0 -1
#> er 4 0.05 0.05 FALSE 0 NA NA
#>
#> MFP algorithm convergence: TRUE
#> Call:
#> survival::coxph(formula = ff, data = d, weights = weights, control = control,
#> x = TRUE, y = TRUE, method = method, nocenter = nocenter)
#>
#> coef exp(coef) se(coef) z p
#> enodes.1 -1.978e+00 1.384e-01 2.272e-01 -8.704 < 2e-16
#> pgr.1 -1.808e+00 1.639e-01 3.507e-01 -5.156 2.52e-07
#> grade_1.1 5.134e-01 1.671e+00 2.495e-01 2.058 0.0396
#> age.1 6.057e+01 2.029e+26 1.188e+01 5.098 3.43e-07
#> age.2 -2.653e+01 3.007e-12 5.904e+00 -4.494 6.99e-06
#>
#> Likelihood ratio test=142 on 5 df, p=< 2.2e-16
#> n= 686, number of events= 299
# using formula interface
fit3 <- mfp2(Surv(rectime, censrec) ~ age + meno + size + nodes + pgr + er +
grade_1 + grade_2 + strata(hormon), family = "cox", data = gbsg, verbose = FALSE)
Royston (2014) proposed the approximate cumulative distribution (ACD)
transformation of a continuous covariate x
as a route
toward modeling a sigmoid relationship between x
and an
outcome variable y
. The ACD transformation is a smooth
function that maps a continuous covariate, x
, to an
approximation, ACD (x)
, of its distribution function. By
construction, the distribution of ACD (x)
in the sample is
roughly uniform on (0, 1). FP modeling is then performed with the
transformed values ACD (x)
instead of x
as a
covariate. He further showed that such an approach could successfully
represent a sigmoid function of x
, something a standard FP
function cannot do (Royston and Sauerbrei 2008). He went on to
demonstrate that useful flexibility in functional form could be achieved
by considering both x
and a = ACD (x)
simultaneously as independent covariates and applying the MFP algorithm
to x
and a
. To limit instability and
overfitting, he suggested restricting the models considered for
x
and a
to FP1 functions. Furthermore, Royston
(2014) highlighted that models based on ACD (x)
may have
other advantages in terms of interpretability of regression
coefficients.
Royston and Sauerbrei (2016) extended the MFP modeling approach to incorporate the ACD transformation, resulting in the MFPA (MFP with ACD) algorithm. To implement this extension, the original FSP is replaced by a modified version known as FSPA (FSP with ACD transformation). The MFPA algorithm has all the features of the MFP algorithm plus the ability to model a sigmoid function. For additional information on MFPA, including FSPA, please refer to Royston and Sauerbrei (2016)
We will demonstrate how to fit the MFPA using the mfp2()
function. In this example, we simulated the data to create an example
that showcases this relationship. The simulation process is outlined
below.
# Generate artificial data with sigmoid relationship
set.seed(54)
n <- 500
x <- matrix(rnorm(n), ncol = 1, dimnames = list(NULL, "x"))
# Apply sigmoid transformation to x
sigmoid <- function(x) {
1/(1 + exp(-1.7 * x))
}
# Generate y with sigmoid relationship to x
y <- as.numeric(20 * sigmoid(x) + rnorm(n, mean = 0, sd = 0.5))
The mfp2()
function has an acdx
argument
that allows the user to specify the names of continuous variables to
undergo the ACD transformation. The use of acdx
invokes the
FSPA to determine the best-fitting model (see documentation for more
details). The acdx
parameter can be set in the formula
interface using the fp()
function, as demonstrated in the R
code below. To avoid confusion with the original variable
x
, the variable representing the ACD transformation of
x
is named A(x)
in the output.
# default interface
fit1 <- mfp2(x, y, acdx = "x", verbose = FALSE)
# formula interface
datax <- data.frame(y, x)
fit2 <- mfp2(y ~ fp(x, acdx = TRUE), data = datax, verbose = FALSE)
# display selected power terms
fit2
#> Shifting, Scaling and Centering of covariates
#> shift scale center
#> x 3.005751 1 TRUE
#>
#> Final Multivariable Fractional Polynomial for y
#> df_initial select alpha acd selected df_final power1 power2
#> x 4 0.05 0.05 TRUE TRUE 4 1 1
#>
#> MFP algorithm convergence: TRUE
#>
#> Call: glm(formula = y ~ ., family = family, data = data, weights = weights,
#> offset = offset, x = TRUE, y = TRUE)
#>
#> Coefficients:
#> (Intercept) x.1 A_x.1
#> 9.8889 -0.5686 22.4108
#>
#> Degrees of Freedom: 499 Total (i.e. Null); 497 Residual
#> Null Deviance: 17060
#> Residual Deviance: 129.7 AIC: 752.4
# fit usual mfp: bad idea but useful for illustration
fit3 <- mfp2(y ~ fp(x, acdx = FALSE), data = datax, verbose = FALSE)
fit3
#> Shifting, Scaling and Centering of covariates
#> shift scale center
#> x 3.005751 1 TRUE
#>
#> Final Multivariable Fractional Polynomial for y
#> df_initial select alpha selected df_final power1 power2
#> x 4 0.05 0.05 TRUE 4 3 3
#>
#> MFP algorithm convergence: TRUE
#>
#> Call: glm(formula = y ~ ., family = family, data = data, weights = weights,
#> offset = offset, x = TRUE, y = TRUE)
#>
#> Coefficients:
#> (Intercept) x.1 x.2
#> 9.8889 0.9659 -0.4925
#>
#> Degrees of Freedom: 499 Total (i.e. Null); 497 Residual
#> Null Deviance: 17060
#> Residual Deviance: 231.8 AIC: 1043
The selected function using the ACD transformation is FP1(1,1), represented by \beta_0 + \beta_1x^1 +\beta_2a^1. On the other hand, the usual MFP approach selected FP2(3,3), equivalent to \beta_0 + \beta_1x^3 +\beta_2x^3 \log(x). The Figure below shows these functions. It is evident from the figure that the ACD approach provides a better fit to the data compared to the usual MFP model.
Ambler, G. and Royston, P. (2001). Fractional polynomial model selection procedures: investigation of type I error rate. J. Stat. Comput. Simul., 69, 89–108.
Becher, H., Lorenz, E., Royston, P., and Sauerbrei, W. (2012). Analysing covariates with spike at zero: a modified FP procedure and conceptual issues. Biom. J., 54 (5), 686–700.
Binder, H. and Sauerbrei, W. (2010). Adding local components to global functions for continuous covariates in multivariable regression modeling. Stat. Med., 29, 808–817.
Binder, H., Sauerbrei, W., and Royston, P. (2013). Comparison between splines and fractional polynomials for multivariable model building with continuous covariates: a simulation study with continuous response. Stat. Med., 32, 2262–2277.
Buchholz, A. and Sauerbrei, W. (2011). Comparison of procedures to assess non-linear and time-varying effects in multivariable models for survival data. Biom. J., 53, 308–331. DOI: 10.1002/bimj.201000159.
Draper, D. (1995). Assessment and propagation of model selection uncertainty (with discussion). J. R. Stat. Soc. Ser. B, 57, 45–97.
Dunkler, D., Sauerbrei, W., and Heinze, G. (2016). Global, parameterwise and joint post-estimation shrinkage. J. Stat. Softw.
Faes, C., Aerts, M., Geys, H., and Molenberghs, G. (2007). Model averaging using fractional polynomials to estimate a safe level of exposure. Risk Anal., 27, 111–123.
Huebner, M., le Cessie, S., Schmidt, C. O., & Vach, W. (2018). A contemporary conceptual framework for initial data analysis. Observational Studies, 4(1), 171-192.
Royston, P. (2014). A smooth covariate rank transformation for use in regression models with a sigmoid dose-response function. Stata J., 14, 329–341.
Royston, P., Altman, D. G. (1994). Regression using fractional polynomials of continuous covariates parsimonious parametric modelling (with discussion). Applied Statistics 43 (3): 429–467.
Royston, P. and Sauerbrei, W. (2004). A new approach to modelling interactions between treatment and continuous covariates in clinical trials by using fractional polynomial. Stat. Med., 23, 2509–2525.
Royston, P. and Sauerbrei, W. (2007). Improving the robustness of fractional polynomial models by preliminary covariate transformation. Comput. Stat. Data Anal., 51, 4240–4253.
Royston, P. and Sauerbrei, W. (2008). Multivariable Model-Building—A Pragmatic Approach to Regression Analysis Based on Fractional Polynomials for Modelling Continuous Variables, John Wiley & Sons, Chichester, UK.
Royston, P. and Sauerbrei, W. (2013). Interaction of treatment with a continuous variable: simulation study of significance level for several methods of analysis. Stat. Med., 32 (22), 3788–3803.
Royston, P. and Sauerbrei, W. (2014). Interaction of treatment with a continuous variable: simulation study of power for several methods of analysis. Stat. Med., 33, 4695–4708.
Royston, P., Altman, D.G., Sauerbrei, W. (2006): Dichotomizing continuous predictors in multiple regression: a bad idea. Statistics in Medicine, 25: 127-141
Royston, P. and Sauerbrei, W. (2008). Multivariable model-building: a pragmatic approach to re- gression analysis based on fractional polynomials for modelling continuous variables. John Wiley & Sons.
Royston, P. and Sauerbrei, W. (2016). mfpa: Extension of mfp using the ACD covariate transformation for enhanced parametric multivariable modeling. The Stata Journal, 16(1), 72-87.
Royston, P., Sauerbrei, W., and Becher, H. (2010). Modelling continuous exposures with a ‘spike’ at zero: a new procedure based on fractional polynomials. Stat. Med., 29, 1219–1227.
Sauerbrei, W. (1999). The use of resampling methods to simplify regression models in medical statistics. Appl. Stat., 48, 313–329.
Sauerbrei, W., Abrahamowicz, M., Altman, D.G., et al. (2014). STrengthening Analytical Thinking for Observational Strategies: the STRATOS initiative. Stat. Med., 33 (30), 5413–32.
Sauerbrei, W., Kipruto, E., Balmford, J. (2022). Effects of Influential Points and Sample Size on the Selection and Replicability of Multivariable Fractional Polynomial Models. Diagnostic and Prognostic Research, 7(1):7
Sauerbrei, W., Meier-Hirmer, C., Benner, A., Royston, P. (2006). Multivariable regression model building by using fractional polynomials: description of SAS, STATA and R programs. Computational Statistics and Data Analysis, 50: 3464-3485.
Sauerbrei W, Perperoglou A, Schmid M, Abrahamowicz M, Becher H, Binder H, Dunkler D, Harrell Jr. FE, Royston P, Heinze G for TG2 of the STRATOS initiative (2020). State of the art in selection of variables and functional forms in multivariable analysis - outstanding issues. Diagnostic and Prognostic Research, 4:3, 1-18. DOI: 10.1186/s41512-020-00074-3
Sauerbrei, W. and Royston, P. (1999). Building multivariable prognostic and diagnostic models: trans- formation of the predictors using fractional polynomials. J. R. Stat. Soc. Ser. A, 162, 71–94.
Sauerbrei, W. and Royston, P. (2011). Multivariable fractional polynomial models, in: Lovric, M. (Ed.): International Encyclopedia of Statistical Science, Springer Berlin, p. 899-902.
Sauerbrei, W. and Royston, P. (2016). Multivariable Fractional Polynomial Models. Wiley StatsRef: Statistics Reference Online. 1–8. DOI: 10.1002/9781118445112.stat07861.
Sauerbrei, W., Royston, P., Binder, H. (2007b). Selection of important variables and determination of functional form for continuous predictors in multivariable model-building. Stat. Med., 26, 5512–5528.
Sauerbrei, W., Royston, P., Look, M. (2007a). A new proposal for multivariable modelling of time- varying effects in survival data based on fractional polynomial time-transformation. Biom. J., 49, 453–473.
Sauerbrei, W., Royston, P. (2007): Modelling to extract more information from clinical trials data : on some roles for the bootstrap. Statistics in Medicine, 26: 4989-5001.
Shmueli, G. (2010). To explain or to predict? Stat. Sci., 3, 289–310.
Stamey TA, Kabalin JN, McNeal JE, Johnstone I.M, Freiha F, Redwine EA and Yang N (1989). Prostate specific antigen in the diagnosis and treatment of adenocarcinoma of the prostate. II. Radical prostatectomy treated patients. The Journal of Urology;141(5),1076-1083.
StataCorp. (2023). Stata: Release 18. Statistical Software. College Station, TX: StataCorp LLC.
Therneau, T.M., Grambsch, P.M., Therneau, T.M. and Grambsch, P.M., (2000). The cox model (pp. 39-77). Springer New York.