| Type: | Package |
| Title: | Symmetrized MCD |
| Version: | 0.6 |
| Date: | 2026-03-31 |
| Imports: | Rcpp (≥ 0.11.0) |
| LinkingTo: | Rcpp, RcppArmadillo |
| Description: | Provides implementations of origin-based and symmetrized minimum covariance determinant (MCD) estimators, together with supporting utility functions. |
| License: | GPL-2 | GPL-3 [expanded from: GPL (≥ 2)] |
| NeedsCompilation: | yes |
| Packaged: | 2026-03-31 10:52:44 UTC; nordklau |
| Author: | Klaus Nordhausen |
| Maintainer: | Klaus Nordhausen <klaus.nordhausenr@gmail.com> |
| Repository: | CRAN |
| Date/Publication: | 2026-04-03 15:30:07 UTC |
Symmetrized MCD
Description
Provides implementations of origin-based and symmetrized minimum covariance determinant (MCD) estimators, together with supporting utility functions.
Details
| Package: | symMCD |
| Type: | Package |
| Title: | Symmetrized MCD |
| Version: | 0.6 |
| Date: | 2026-03-31 |
| Authors@R: | c(person(given = "Klaus", family = "Nordhausen", role = c("aut", "cre"), email = "klaus.nordhausenr@gmail.com", comment = c(ORCID = "0000-0002-3758-8501")), person(given = "Christophe", family = "Croux", role = "aut", comment = c(ORCID = "0000-0003-2912-3437"))) |
| Imports: | Rcpp (>= 0.11.0) |
| LinkingTo: | Rcpp, RcppArmadillo |
| Description: | Provides implementations of origin-based and symmetrized minimum covariance determinant (MCD) estimators, together with supporting utility functions. |
| License: | GPL (>= 2) |
| Author: | Klaus Nordhausen [aut, cre] (ORCID: <https://orcid.org/0000-0002-3758-8501>), Christophe Croux [aut] (ORCID: <https://orcid.org/0000-0003-2912-3437>) |
| Maintainer: | Klaus Nordhausen <klaus.nordhausenr@gmail.com> |
| Archs: | x64 |
The minimum covariance determinant (MCD) estimator is one of the most popular robust scatter matrix estimators; however, it does not possess the independence property. The symmetrized MCD is a variant of the MCD that does satisfy this property. It is, however, computationally more demanding, as it is based on all pairwise differences of the observations. On the other hand, the symmetrized MCD does not require a location estimate, since it is inherently centered at the origin.
To reduce the computational burden, an approximate variant of the symmetrized
MCD is also implemented. This variant does not use all pairwise differences, but
instead relies on a subset of size n m, where n denotes the sample
size and m is a user-specified integer.
In addition, several helper functions are provided to support the computation and use of the symmetrized MCD.
Index of help topics:
MCDzero MCD with respect to the Origin
Maha Fast squared Mahalanobis distances
mdiff Computes Successive Differences for a Data Set
symMCD Symmetrized Minimum Covariance Determinant
Estimator
symMCD-package Symmetrized MCD
Author(s)
Klaus Nordhausen [aut, cre] (ORCID: <https://orcid.org/0000-0002-3758-8501>), Christophe Croux [aut] (ORCID: <https://orcid.org/0000-0003-2912-3437>)
Maintainer: Klaus Nordhausen <klaus.nordhausenr@gmail.com>
References
Croux, C. and Nordhausen, K. (2026). Symmetrized MCD. Unpublished manuscript.
MCD with respect to the Origin
Description
Computes the minimum covariance determinant (MCD) estimator with the location fixed at the origin, using either a C++ or a plain R implementation.
Usage
MCDzero(X, alpha = 0.5, ns = 500, nc = 10, delta = 0.01)
MCDzeroR(y, alpha = 0.5, ns = 500, nc = 10, delta = 0.01)
Arguments
y |
A numeric data matrix with observations in rows and variables in columns. |
X |
A numeric data matrix with observations in rows and variables in columns. |
alpha |
A numeric value in |
ns |
An integer specifying the number of random initial subsets used by the algorithm. |
nc |
An integer specifying the number of concentration steps performed for each initial subset. |
delta |
A numeric tuning constant used in the reweighting step. |
Details
The minimum covariance determinant (MCD) estimator is computed using an iterative algorithm based on random initial subsets, followed by a fixed number of concentration steps. In contrast to the classical MCD, the location is fixed at the origin and is not estimated from the data.
The parameter alpha controls the size of the subset retained at each
step and thus determines the robustness of the estimator. For each of the
ns random initial subsets, the algorithm applies nc
concentration steps to improve the determinant of the scatter estimate. The
best solution over all starts is retained.
An optional reweighting step is applied using the tuning constant
delta, which aims to improve efficiency by downweighting observations
with large squared Mahalanobis distances.
Value
For the C++ implementation, a list with the following components:
SigmaRaw |
The raw MCD scatter matrix with the location fixed at the origin. |
SigmaRe |
The reweighted MCD scatter matrix with the location fixed at the origin. |
IndexOptimal |
An integer vector giving the indices of the observations forming the optimal subset. |
For the R implementation, a list with the following components:
sigma |
The raw MCD scatter matrix with the location fixed at the origin. |
sigmaR |
The reweighted MCD scatter matrix with the location fixed at the origin. |
Examples
X <- matrix(rnorm(300), ncol=3)
colnames(X) <- LETTERS[1:3]
# settings seeds for comparison
set.seed(1)
mcd0cpp <- MCDzero(X, 0.5)
set.seed(1)
mcd0r <- MCDzeroR(X, 0.5)
mcd0cpp$SigmaRaw
mcd0cpp$SigmaRaw - mcd0r$sigma
mcd0cpp$SigmaRe
mcd0cpp$SigmaRe - mcd0r$sigmaR
Fast squared Mahalanobis distances
Description
Computes squared Mahalanobis distances using different choices of location and scatter matrices.
Usage
Maha(X, center, scatter)
MahaClassical(X)
MahaOrigin(X)
Arguments
X |
numeric data matrix. |
center |
numeric vector giving the location vector. |
scatter |
numeric matrix giving the scatter matrix. |
Details
The function Maha allows the user to supply a custom location and scatter
matrix in order to compute (pseudo) Mahalanobis distances. This makes it
possible to use alternative location and scatter estimates instead of the
sample mean vector and covariance matrix. In contrast, the function
MahaClassical always uses the sample mean and the classical covariance
matrix. The function MahaOrigin fixes the location at the origin and
computes the corresponding covariance matrix with respect to the origin.
The three functions provided here prioritize computational efficiency, but offer fewer
options than mahalanobis. For example, mahalanobis allows the
user to provide a pre-inverted scatter matrix.
Value
A numeric vector with the squared distances
See Also
Examples
X <- matrix(rnorm(300), ncol = 3)
COV <- cov(X)
MEAN <- colMeans(X)
Maha(X, MEAN, COV)
mahalanobis(X, MEAN, COV)
MahaClassical(X)
MEAN0 <- rep(0, 3)
COV0 <- crossprod(X)/99
MahaOrigin(X)
mahalanobis(X, MEAN0, COV0)
Computes Successive Differences for a Data Set
Description
Computes a subset of pairwise differences between observations by forming successive differences up to a fixed lag.
Usage
mdiff(X, m = 20)
Arguments
X |
A numeric data matrix with observations in rows and variables in columns. |
m |
A positive integer specifying the number of successive differences to be computed for each observation. |
Details
Let X be a data matrix with n rows, denoted by
x_1, \dots, x_n. The function computes a matrix of successive
differences consisting of the vectors
x_i - x_{i+j}, for j = 1, \dots, m.
To ensure that each observation contributes exactly m differences,
the first m observations are appended to the end of the data matrix
before computing the differences. As a result, the output contains
n m difference vectors.
The resulting set of differences forms a subset of all possible pairwise
differences and provides a computationally cheaper alternative to using
all n(n-1)/2 differences. Such constructions are useful, for example,
in approximations of symmetrized scatter estimators.
Value
A numeric matrix with n m rows and the same number of columns as
X, containing the successive difference vectors.
References
Dumbgen, L. and Nordhausen, K. (2024). Approximating symmetrized estimators of scatter via balanced incomplete U-statistics. Annals of the Institute of Statistical Mathematics, 76, 185–207. <doi:10.1007/s10463-023-00879-1>.
Examples
X <- matrix(1:12, ncol=2)
mdiff(X, m=3)
Symmetrized Minimum Covariance Determinant Estimator
Description
Computes the symmetrized minimum covariance determinant (MCD) estimator based on pairwise differences of the observations.
Usage
symMCD(X, gamma = 0.25, m = NULL, ns = 500, nc = 10, delta = 0.01)
Arguments
X |
A numeric data matrix with observations in rows and variables in columns. |
gamma |
A numeric value in |
m |
If |
ns |
An integer specifying the number of random initial subsets used by the algorithm. |
nc |
An integer specifying the number of concentration steps performed for each initial subset. |
delta |
A numeric tuning constant used in the reweighting step. |
Details
The symmetrized MCD computes a minimum covariance determinant (MCD)
estimator based on pairwise differences x_i - x_j with i > j.
Since these differences are centered by construction, the location is
fixed at the origin and no location parameter is estimated or returned.
In practice, this corresponds to applying an MCD estimator with respect
to the origin to the matrix of pairwise differences; see
MCDzero.
In contrast to the classical MCD, the symmetrized MCD possesses the independence property: at the population level, the resulting scatter matrix is diagonal for a random vector with independent components.
The argument gamma controls the fraction of pairwise differences
used for the scatter computation. Reasonable values are in the interval
(0.25, 1). For a given value of gamma, the exact number of
differences to be retained is determined analogously to the implementation
in the robustbase package, using the function h.alpha.n.
Computing all pairwise differences is computationally expensive. As an
approximation, the function therefore allows the use of only m
successive differences per observation. In this case, the order of the
observations should be random; hence, applying a random permutation to
the observations prior to calling symMCD is advisable. While this
approximation substantially reduces computational cost, it may be less
appropriate when robustness is the primary concern rather than the
independence property.
When only a subset of pairwise differences is used, the number of
differences employed in the computation is given by
\lfloor n m \gamma \rfloor + 1, where n denotes the
sample size and m the number of successive differences per
observation.
Value
A list with the following components:
SigmaRaw |
The raw symmetrized MCD scatter matrix. |
SigmaRe |
The reweighted symmetrized MCD scatter matrix. |
References
Croux, C. and Nordhausen, K. (2026). Symmetrized MCD. Unpublished manuscript.
Examples
# data is small to keep the example fast
X <- matrix(rnorm(150), ncol=3)
colnames(X) <- LETTERS[1:3]
# transformed data to check affine equivariance:
A <- matrix(runif(9),3,3)
Y <- X %*% t(A)
# seed needs to be set for comparison
set.seed(1)
mcdall <- symMCD(X, 0.25)
set.seed(1)
mcdallY <- symMCD(Y, 0.25)
mcdall$SigmaRaw
mcdall$SigmaRe
A %*% mcdall$SigmaRaw %*% t(A) - mcdallY$SigmaRaw
A %*% mcdall$SigmaRe %*% t(A) - mcdallY$SigmaRe
# could do same for incomplete variant
set.seed(123)
mcdm20 <- symMCD(X, 0.25, m = 20)
set.seed(123)
mcdm20Y <- symMCD(Y, 0.25, m = 20)
mcdm20$SigmaRaw
mcdm20$SigmaRe
A %*% mcdm20$SigmaRaw %*% t(A) - mcdm20Y$SigmaRaw
A %*% mcdm20$SigmaRe %*% t(A) - mcdm20Y$SigmaRe