Outliers compromise the reliability of classical estimators even in moderate samples. A single recording error can inflate the standard deviation beyond any bound, yet the standard deviation remains the default scale estimator in most statistical software.
To see why, consider a small sample with one gross error:
library(robscale)
x_clean <- c(2.1, 2.3, 2.0, 2.4, 2.2, 2.1, 2.3, 1.9)
x_contaminated <- c(x_clean, 200) # one recording error
sd(x_clean)
#> [1] 0.1685018
sd(x_contaminated) # collapses
#> [1] 65.94602
gmd(x_clean)
#> [1] 0.1804105
gmd(x_contaminated) # inflated, though less than sd
#> [1] 39.1023
qn(x_clean)
#> [1] 0.1486827
qn(x_contaminated) # unaffected
#> [1] 0.1938622

The breakdown point is the fraction of contaminated
observations an estimator tolerates before producing an arbitrarily
large result. The standard deviation has a 0% breakdown point; a single
outlier drives it to infinity. The GMD achieves 29.3%; qn
and sn achieve the maximum 50%.
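The mechanics are easy to reproduce outside R. A minimal Python sketch (illustrative only, not the robscale API) contrasting the unbounded response of the standard deviation with a 50%-breakdown estimator, the scaled MAD:

```python
import statistics

def mad_scaled(xs, c=1.4826):
    """Scaled median absolute deviation: 50% breakdown point."""
    med = statistics.median(xs)
    return c * statistics.median(abs(x - med) for x in xs)

clean = [2.1, 2.3, 2.0, 2.4, 2.2, 2.1, 2.3, 1.9]
contaminated = clean + [200.0]  # the same single gross error

sd_ratio = statistics.stdev(contaminated) / statistics.stdev(clean)
mad_ratio = mad_scaled(contaminated) / mad_scaled(clean)

print(sd_ratio)   # hundreds: the outlier dominates
print(mad_ratio)  # order 1: the outlier is absorbed
```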
The asymptotic relative efficiency (ARE) measures statistical efficiency relative to the sample standard deviation under the Gaussian model (Bickel and Lehmann 1976). High efficiency means fewer observations are needed to match the standard deviation’s precision. The tradeoff is inherent: higher breakdown point generally costs efficiency.
robscale exposes all major points on this
robustness–efficiency frontier.
library(robscale)
set.seed(42)
x <- c(rnorm(20), 50) # 20 clean observations + one outlier
scale_robust(x) # auto-selects best strategy
#> [1] 5.391021
scale_robust(x, ci = TRUE) # with confidence interval
#> gmd estimate: 5.3910
#> 95% CI (analytical): [3.7441, 7.0380]
data.frame(
estimator = c("sd", "sd_c4", "gmd", "qn", "mad_scaled"),
value = round(c(sd(x), sd_c4(x), gmd(x), qn(x), mad_scaled(x)), 4)
)
#> estimator value
#> 1 sd 10.9441
#> 2 sd_c4 11.0817
#> 3 gmd 5.3910
#> 4 qn 1.4471
#> 5 mad_scaled 1.3756

robscale provides 11 exported functions spanning the
full robustness–efficiency spectrum:
| Function | ARE | Breakdown | Description |
|---|---|---|---|
| `sd_c4(x)` | 100% | 0% | Bias-corrected standard deviation |
| `gmd(x)` | 98% | 29.3% | Gini mean difference |
| `adm(x)` | 88.3% | \(1/n\) | Average deviation from median |
| `qn(x)` | 82.3% | 50% | Rousseeuw–Croux \(Q_n\) |
| `sn(x)` | 58.2% | 50% | Rousseeuw–Croux \(S_n\) |
| `iqr_scaled(x)` | 37% | 25% | Scaled interquartile range |
| `mad_scaled(x)` | 36.7% | 50% | Scaled median absolute deviation |
| `robScale(x)` | ~55% | 50% | M-estimator for scale |
| `robLoc(x)` | 98.4% | 50% | M-estimator for location |
| `scale_robust(x)` | ensemble | — | Variance-weighted dispatcher |
| `get_consistency_constant(n)` | — | — | Finite-sample bias correction factors |
### The `scale_robust()` Dispatcher

`scale_robust()` adapts its strategy to sample size automatically.
For small samples, all 7 scale estimators run on bootstrap replicates. Inverse-variance weights—where the variance is the bootstrap variance of each estimator across replicates—combine them into a single estimate:
\[ \hat\sigma = \frac{\sum_{j=1}^{7} \hat\sigma_j(x) \,/\, \operatorname{Var}_{\mathrm{boot}}(\hat\sigma_j)}{\sum_{j=1}^{7} 1 \,/\, \operatorname{Var}_{\mathrm{boot}}(\hat\sigma_j)} \]
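The same weighting rule can be sketched in a few lines of Python (illustrative only: the two-estimator set, bootstrap size, and seed below are placeholders, not robscale's internals):

```python
import random
import statistics

def mad_scaled(v):
    """Scaled median absolute deviation (stand-in component estimator)."""
    med = statistics.median(v)
    return 1.4826 * statistics.median(abs(xi - med) for xi in v)

def bootstrap_inverse_variance(x, estimators, n_boot=200, seed=1):
    """Combine scale estimators with inverse bootstrap-variance weights."""
    rng = random.Random(seed)
    reps = {name: [] for name in estimators}
    for _ in range(n_boot):
        resample = [rng.choice(x) for _ in x]
        for name, est in estimators.items():
            reps[name].append(est(resample))
    # w_j is proportional to 1 / Var_boot(sigma_hat_j)
    inv_var = {name: 1.0 / statistics.variance(r) for name, r in reps.items()}
    total = sum(inv_var.values())
    return sum(est(x) * inv_var[name] / total
               for name, est in estimators.items())

ests = {"sd": statistics.stdev, "mad": mad_scaled}
x = [0.3, -1.2, 0.8, 1.5, -0.4, 0.1, -0.9, 2.0, 0.6, -0.2, 1.1, -1.5]
print(bootstrap_inverse_variance(x, ests))
```

Because the result is a convex combination of the full-sample component estimates, it always lies between the smallest and largest of them.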
set.seed(1)
x_small <- rnorm(12)
scale_robust(x_small) # ensemble (n = 12 < 20)
#> [1] 0.8261654
scale_robust(x_small, ci = TRUE) # BCa bootstrap CI
#> Ensemble estimate: 0.8262
#> 95% bootstrap CI (bca): [0.5546, 1.2405]
#>
#> Component estimators:
#> estimator estimate weight analytical_lower analytical_upper boot_lower
#> sd_c4 0.8293 0.31376 0.5875 1.408 0.5833
#> gmd 0.8414 0.28733 0.5014 1.181 0.6054
#> mad_scaled 0.7737 0.05480 0.2634 1.284 0.2252
#> iqr_scaled 0.7428 0.07801 0.2543 1.231 0.3158
#> sn 0.9515 0.06101 0.4517 1.451 0.2956
#> qn 0.8658 0.10749 0.4833 1.248 0.3516
#> robScale 0.7451 0.09760 0.3432 1.147 0.3063
#> boot_upper
#> 1.129
#> 1.113
#> 1.453
#> 1.443
#> 1.447
#> 1.300
#> 1.296

For larger samples, scale_robust() uses
gmd() directly: 98% ARE, 29.3% breakdown, \(O(n \log n)\).
### `sd_c4(x)`: Bias-corrected standard deviation

Applies the exact \(c_4(n)\) finite-sample correction to remove the downward bias of the sample standard deviation under normality.
### `gmd(x)`: Gini mean difference

Average absolute pairwise difference (Gini 1912), computed via the order-statistics form in \(O(n \log n)\).
### `adm(x)`: Average deviation from the median

The mean deviation from the median (Nair 1936, 1947), scaled for consistency at the normal distribution.
### `qn(x)`: Rousseeuw–Croux \(Q_n\)

The Rousseeuw and Croux (1993) \(Q_n\) estimator: 82.3% ARE, 50% breakdown.

### `sn(x)`: Rousseeuw–Croux \(S_n\)

The Rousseeuw and Croux (1993) \(S_n\) estimator: 58.2% ARE, 50% breakdown.

### `iqr_scaled(x)` and `mad_scaled(x)`

Scaled interquartile range (37% ARE, 25% breakdown) and scaled median absolute deviation (36.7% ARE, 50% breakdown).

### `robLoc(x)`: M-estimator for location

M-estimator for location via Newton–Raphson iteration on the logistic psi function (Rousseeuw and Verboven 2002, Eq. 21). Starting value: \(T^{(0)} = \text{median}(x)\); auxiliary scale: \(S = \text{MAD}(x)\).
Fallback: When scale is unknown and \(n < 4\), or when scale is known and
\(n < 3\), returns
median(x) without iteration.
robLoc(c(1, 2, 3, 5, 7, 8))
#> [1] 4.317035
robLoc(c(1, 2, 3), scale = 1.5) # known scale enables n = 3
#> [1] 2

Newton–Raphson algorithm for robLoc():
flowchart TD
A[Set minobs: 3 if scale known, 4 if unknown] --> B[med = median x]
B --> C{n < minobs?}
C -- Yes --> D([Return med])
C -- No --> E{Scale known?}
E -- Yes --> F[s = scale]
E -- No --> G[s = MAD x]
F --> H{s = 0?}
G --> H
H -- Yes --> I([Return med])
H -- No --> J[t = med]
J --> K[psi_i = tanh( (x_i - t) / 2s )]
K --> L[sum_psi = Sum psi_i, sum_dpsi = Sum (1 - psi_i^2)]
L --> M["v = 2s * sum_psi / sum_dpsi; t += v"]
M --> N{"|v| <= tol * max(|t|, 1)?"}
N -- No --> K
N -- Yes --> O([Return t])
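The flowchart translates almost line-for-line; here is a Python sketch of the same Newton–Raphson iteration (illustrative reference code, not the package's C++ kernel):

```python
import math
import statistics

def rob_loc(x, scale=None, tol=1e-9, max_iter=50):
    """Logistic-psi M-estimator of location via Newton-Raphson."""
    med = statistics.median(x)
    minobs = 3 if scale is not None else 4
    if len(x) < minobs:
        return med                       # fallback: too few observations
    s = scale if scale is not None else 1.4826 * statistics.median(
        abs(xi - med) for xi in x)
    if s == 0:
        return med                       # degenerate auxiliary scale
    t = med
    for _ in range(max_iter):
        # psi((x - t) / s) = tanh((x - t) / (2s))
        psi = [math.tanh((xi - t) / (2 * s)) for xi in x]
        sum_psi = sum(psi)
        sum_dpsi = sum(1 - p * p for p in psi)
        v = 2 * s * sum_psi / sum_dpsi   # Newton step
        t += v
        if abs(v) <= tol * max(abs(t), 1.0):
            break
    return t

print(rob_loc([1, 2, 3, 5, 7, 8]))
```

On the example from the text, this sketch reproduces the documented value 4.317035 to several decimal places, and with a known scale it handles n = 3.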
### `robScale(x)`: M-estimator for scale

M-estimator for scale via Newton–Raphson iteration on the M-scale estimating equation (Rousseeuw and Verboven 2002). Starting value: \(S^{(0)} = \text{MAD}(x)\).
robScale(c(1, 2, 3, 5, 7, 8))
#> [1] 3.305786
robScale(c(5, 5, 5, 5, 6), fallback = "na") # revss compatibility
#> [1] NA
robScale(c(1, 2, 3, 5, 7, 8), ci = TRUE)
#> robScale estimate: 3.3058
#> 95% CI (analytical): [0.7838, 5.8278]

Newton–Raphson iteration for robScale():
flowchart TD
A{Location known?}
A -- Yes --> B1[w = x - loc]
B1 --> B2[s = 1.4826 * median( abs(w) )]
B2 --> B3[t = 0, minobs = 3]
A -- No --> C1[med = median x]
C1 --> C2[s = MAD x, t = med]
C2 --> C3[minobs = 4]
B3 --> D{n < minobs?}
C3 --> D
D -- Yes --> E{s <= implbound?}
E -- Yes --> F{fallback?}
F -- adm --> F1([Return ADM x])
F -- na --> F2([Return NA])
E -- No --> G([Return s])
D -- No --> H{s <= implbound?}
H -- Yes --> F
H -- No --> J["u_i = (x_i - t) / (2cs), compute tanh(u_i)"]
J --> K["numer = Σtanh²/n − 0.5<br/>denom = (2/n)·Σ(u·tanh·sech²)"]
K --> LG{"|denom| <= 1e-14 * s?"}
LG -- Yes --> LF["s *= sqrt(2 * mean_tanh²)"]
LF --> J
LG -- No --> MM["Δs = s · numer / denom"]
MM --> MG{"s + Δs <= 0?"}
MG -- Yes --> MH["s /= 2"]
MH --> J
MG -- No --> MS["s ← s + Δs"]
MS --> N{"|Δs|/s ≤ tol?"}
N -- No --> J
N -- Yes --> O([Return s])
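A Python sketch of the same scale iteration, including both degeneracy guards (illustrative; the fallback handling for tiny n is simplified relative to the package):

```python
import math
import statistics

C = 0.37394112142347236  # rho tuning constant for 50% breakdown

def rob_scale(x, tol=1e-10, max_iter=100):
    """Logistic-rho M-estimator of scale via Newton-Raphson.

    Solves mean(tanh^2(u_i)) = 1/2 with u_i = (x_i - med) / (2*C*s).
    """
    n = len(x)
    med = statistics.median(x)
    s = 1.4826 * statistics.median(abs(xi - med) for xi in x)  # MAD start
    if n < 4 or s <= 0:
        return s                    # simplified fallback for the sketch
    for _ in range(max_iter):
        u = [(xi - med) / (2 * C * s) for xi in x]
        th = [math.tanh(ui) for ui in u]
        mean_th2 = sum(t * t for t in th) / n
        numer = mean_th2 - 0.5
        # derivative term: (2/n) * sum(u * tanh(u) * sech(u)^2)
        denom = 2.0 / n * sum(ui * ti * (1 - ti * ti)
                              for ui, ti in zip(u, th))
        if abs(denom) <= 1e-14 * s:
            s *= math.sqrt(2 * mean_th2)  # multiplicative fallback
            continue
        ds = s * numer / denom            # Newton step
        if s + ds <= 0:
            s /= 2                        # guard: keep s positive
            continue
        s += ds
        if abs(ds) / s <= tol:
            break
    return s

print(rob_scale([1, 2, 3, 5, 7, 8]))
```

At the fixed point the estimating equation holds exactly: the mean of tanh²(u) equals 1/2, and the sketch reproduces the documented value 3.305786.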
All scale estimators support ci = TRUE:
# Analytical CI (chi-squared for sd_c4, ARE-based for others)
sd_c4(c(1, 2, 3, 5, 7, 8), ci = TRUE)
#> sd_c4 estimate: 2.9476
#> 95% CI (analytical): [1.8399, 7.2294]
gmd(c(1, 2, 3, 5, 7, 8), ci = TRUE)
#> gmd estimate: 3.0723
#> 95% CI (analytical): [1.3163, 4.8282]
# BCa bootstrap CI for scale_robust() ensemble
set.seed(42)
x_small <- rnorm(10)
scale_robust(x_small, ci = TRUE, n_boot = 500)
#> Ensemble estimate: 0.8321
#> 95% bootstrap CI (bca): [0.5010, 1.2110]
#>
#> Component estimators:
#> estimator estimate weight analytical_lower analytical_upper boot_lower
#> sd_c4 0.8589 0.32524 0.5908 1.5681 0.62712
#> gmd 0.8671 0.27949 0.4832 1.2509 0.63590
#> mad_scaled 0.7177 0.06932 0.1992 1.2362 0.04068
#> iqr_scaled 0.9438 0.07122 0.2638 1.6237 0.33936
#> sn 0.6129 0.06227 0.2602 0.9656 0.05213
#> qn 0.8021 0.11373 0.4139 1.1904 0.68047
#> robScale 0.8137 0.07872 0.3328 1.2945 0.28543
#> boot_upper
#> 1.217
#> 1.153
#> 1.507
#> 1.566
#> 1.760
#> 1.404
#> 1.468

The CI return type is a named list with estimate,
lower, upper, and level. For
scale_robust() in ensemble mode, the object class is
robscale_ensemble_ci and includes per-estimator
weights.
The logistic psi function satisfies \(\psi_{\log}(x) = \tanh(x/2)\).
robscale evaluates tanh in bulk using the
fastest available platform backend. The configure script
selects one of two vectorized tanh libraries at compile
time; at runtime, CPUID selects the widest available instruction
set:
- Apple Accelerate (`vvtanh`): array-wide SIMD on macOS.
- libmvec AVX-512 (`_ZGVeN8v_tanh`): 8 doubles per iteration on AVX-512F hardware.
- libmvec AVX2 (`_ZGVdN4v_tanh`) or SLEEF (`Sleef_tanhd4_u10avx2`): 4 doubles per iteration.
- OpenMP SIMD (`#pragma omp simd`): compiler-guided portable fallback.
- Scalar `std::tanh`: universal fallback.

For the bulk tanh dispatcher, vectorization requires \(n \geq 8\). However, robLoc’s fused AVX2 NR kernel activates at \(n \geq 4\) (one full 4-wide vector).
The Gini mean difference uses a separate AVX2 FMA kernel
(_mm256_fmadd_pd) for its weighted sum \(\sum (2i - (n-1))\, x_{(i)}\), shared
across the standalone gmd(), the ensemble bootstrap, and
the BCa jackknife. This kernel activates for \(n \geq 8\) on AVX2 hardware.
robscale uses a tiered selection strategy rather than a full sort(): branch-free sorting networks for tiny \(n\) (fully unrolled at `-O2`), Floyd–Rivest or pdqselect selection for larger inputs, and std::nth_element (introselect, \(O(n)\) worst-case) as the general fallback.

| \(n\) | Comparators |
|---|---|
| 3 | 3 |
| 4 | 5 |
| 8 | 19 |
| 12 | 39 |
| 16 | 60 |

The crossover between Floyd–Rivest and pdqselect is derived at runtime from the per-core L2 cache size; each estimator uses its own threshold based on working-set pressure:
| Estimator | Strategy | Threshold |
|---|---|---|
| `iqr_scaled` | Always pdqselect | — |
| `mad_scaled` | Floyd–Rivest vs. pdqselect | \(\max(2048,\, L_2 / 40)\) |
| `robScale` | Floyd–Rivest vs. pdqselect | \(\max(2048,\, L_2 / 16)\) |
| \(S_n\) | Floyd–Rivest vs. pdqselect | \(\max(2048,\, L_2 / 16)\) |
| \(Q_n\) (final window) | Inline check | \(\max(2048,\, L_2 / 32)\) |
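For intuition about the first tier, here is the 5-comparator network for \(n = 4\) from the comparator table, transliterated to Python (the package's version is branch-free C++ over doubles; the explicit min/max swap stands in for a conditional-move compare–exchange):

```python
def cas(a, i, j):
    """Compare-and-swap: branchless via min/max in C++; explicit here."""
    lo, hi = min(a[i], a[j]), max(a[i], a[j])
    a[i], a[j] = lo, hi

def sort4(a):
    """5-comparator sorting network for n = 4."""
    cas(a, 0, 1); cas(a, 2, 3)   # sort the two pairs
    cas(a, 0, 2); cas(a, 1, 3)   # merge: global min and max settle
    cas(a, 1, 2)                 # fix the middle pair
    return a

print(sort4([2.4, 1.9, 2.3, 2.0]))
```

The fixed comparator sequence is data-independent, which is what lets the compiler fully unroll it.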
A 128-double micro-buffer (1 KB) covers the smallest samples (\(n \leq 128\) for MAD, \(n \leq 64\) for robScale and
robLoc). For \(n \leq
2{,}048\), stack-allocated arrays avoid heap allocation:
robScale uses one 2,048-double array (16 KB);
robLoc uses one 4,096-double array (32 KB, split into
equal-sized buffer and deviation halves).
Fused single-buffer algorithm. After median selection the working buffer holds a permutation of the input. Because MAD is order-invariant, absolute deviations can be computed in-place, eliminating the second scratch array and halving memory from \(2n\) to \(n\) doubles per estimator.
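The fused-buffer idea can be sketched in Python (illustrative; `statistics.median` stands in for the in-place selection the C++ code performs on the same buffer):

```python
import statistics

def mad_inplace(buf, c=1.4826):
    """MAD with a single working buffer: find the median, then overwrite
    the buffer with absolute deviations and take their median. MAD is
    invariant to element order, so clobbering the permutation left
    behind by median selection is harmless."""
    med = statistics.median(buf)   # stands in for in-place selection
    for i, v in enumerate(buf):    # reuse the buffer: n doubles, not 2n
        buf[i] = abs(v - med)
    return c * statistics.median(buf)

x = [1.0, 2.0, 3.0, 5.0, 7.0, 8.0]
print(mad_inplace(x))  # note: x now holds deviations, not the data
```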
robLoc: Newton–Raphson. Location
estimation uses Newton–Raphson (quadratic convergence, ~3 iterations)
rather than the scoring fixed-point method (~6–8 iterations):
\[ T^{(k+1)} = T^{(k)} + \frac{2\,S\sum \psi\!\left(\frac{x_i - T^{(k)}}{2S}\right)} {\sum \left[1 - \psi^2\!\left(\frac{x_i - T^{(k)}}{2S}\right)\right]} \]
Since \(\tanh\) values are already computed for the numerator, the denominator costs only squaring and subtraction. Typical iteration counts:
| \(n\) | Scoring | Newton–Raphson |
|---|---|---|
| 4 | 7 | 2–4 |
| 8 | 7 | 2–4 |
| 20 | 6 | 2–4 |
| 100 | 5 | 2–4 |
Fused AVX2 kernel for robLoc().
rob_loc_nr_step_avx2 collapses three passes into one. Two
AVX2 accumulators advance in lockstep:
\[ \texttt{acc\_psi} \mathrel{+}= p_i, \qquad \texttt{acc\_dpsi} \mathrel{+}= (1 - p_i^2) \]
where \(p_i = \tanh(u_i)\) is evaluated once. The derivative accumulator uses a fused negative multiply-add (fnmadd) to accumulate \(1 - p_i^2\) directly, avoiding the catastrophic cancellation that would arise from subtracting two large numbers (\(n - \sum p_i^2\)) when all \(|u_i| \gg 1\).
robScale: Newton–Raphson. Scale
estimation also uses Newton–Raphson, applied to the M-scale estimating
equation \(n^{-1}\sum \rho(u_i) =
1/2\). The fused single-pass kernel
(nr_scale_step_avx2 / nr_scale_step_scalar)
computes \(\sum \tanh^2(u_i)\) and
\(\sum u_i
\tanh(u_i)\,\text{sech}^2(u_i)\) in one read over the data,
yielding the NR update \(\Delta s = s \cdot
(\bar\rho - \tfrac{1}{2}) / \bar{D}\) where \(\bar{D}\) is the scaled derivative sum. The
same quadratic convergence as robLoc applies; typical
iteration counts are 3–4. A degenerate-denominator guard applies two
protections: when the denominator is degenerate (\(|\bar{D}| \leq 10^{-14} \times s\)), a
multiplicative fallback \(s \leftarrow s
\times \sqrt{2 \times \overline{\tanh^2}}\) replaces the Newton
step, avoiding the catastrophic cancellation of subtracting two large
numbers. A separate guard halves \(s\)
when the proposed update would make \(s\) non-positive (\(s + \Delta s \leq 0\)).
\(Q_n\) and \(S_n\) partition inner loops across TBB threads for \(n\) above a runtime L2-derived threshold. The ensemble bootstrap parallelizes across TBB threads when \(n \times n_{\text{boot}} \geq 10{,}000\); with default settings (\(n < 20\), \(n_{\text{boot}} = 200\)) the bootstrap runs serially.
The ensemble bootstrap stores results in estimator-major layout (\(7 \times n_{\text{boot}}\)): each estimator’s replicates occupy a contiguous memory column, so the mean/variance reduction pass reads at stride-1 rather than stride-7. Within each replicate, the bootstrap resample is sorted once and reused by all seven estimators; pre-allocated workspace buffers eliminate per-replicate heap allocation for \(S_n\) and \(Q_n\). Bootstrap resampling uses a deterministic XorShift32 PRNG (Marsaglia 2003) seeded per replicate for reproducibility.
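The generator itself is tiny; a Python sketch of Marsaglia's xorshift32 with the standard (13, 17, 5) shift triplet (how robscale derives per-replicate seeds is internal, so the seeding below is a placeholder):

```python
def xorshift32(seed):
    """Marsaglia (2003) xorshift: 32-bit state, shifts (13, 17, 5)."""
    state = seed & 0xFFFFFFFF
    if state == 0:
        raise ValueError("xorshift32 state must be nonzero")
    while True:
        state = (state ^ (state << 13)) & 0xFFFFFFFF
        state ^= state >> 17
        state = (state ^ (state << 5)) & 0xFFFFFFFF
        yield state

def resample_indices(n, seed):
    """Deterministic bootstrap indices in [0, n) for one replicate."""
    rng = xorshift32(seed)
    return [next(rng) % n for _ in range(n)]

# Same seed, same replicate -- reproducible by construction.
print(resample_indices(8, seed=42))
```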
Further micro-optimizations:

- sd_c4 uses Welford’s one-pass algorithm (Welford 1962) for numerically stable variance.
- A constexpr constant replaces an inner-loop division with multiplication.
- Reciprocals (inv_s, half_inv_s, inv_n) are hoisted before the iteration loop.

The logistic psi function (Rousseeuw and Verboven 2002, Eq. 23):
\[ \psi_{\log}(x) = \frac{e^x - 1}{e^x + 1} = \tanh(x/2) \]
Bounded in \((-1, 1)\), \(C^\infty\), strictly monotone. Boundedness provides robustness; smoothness avoids the corner artifacts of Huber’s psi at small \(n\).
Location and scale are estimated separately with a fixed auxiliary
estimate, breaking the positive-feedback loop of Huber’s Proposal 2.
robLoc fixes scale at \(\text{MAD}(x)\); robScale
fixes location at \(\text{median}(x)\).
The scale rho function (Rousseeuw and Verboven 2002, Eq. 26):
\[ \rho_{\log}(x) = \psi_{\log}^2(x / c) \]
where \(c = 0.37394112142347236\) is the constant that yields a 50% breakdown point.
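The constant can be checked numerically: the expectation of \(\rho_{\log}\) under the standard normal should equal 1/2. A pure-Python trapezoidal quadrature sketch:

```python
import math

C = 0.37394112142347236  # the scale rho constant

def rho_log(x):
    """rho_log(x) = psi_log(x / c)^2 = tanh(x / (2c))^2."""
    return math.tanh(x / (2 * C)) ** 2

def gauss_expectation(f, lo=-10.0, hi=10.0, steps=50_000):
    """Trapezoidal approximation of E[f(Z)] for Z ~ N(0, 1)."""
    h = (hi - lo) / steps
    norm = 1.0 / math.sqrt(2 * math.pi)
    total = 0.0
    for k in range(steps + 1):
        x = lo + k * h
        w = 0.5 if k in (0, steps) else 1.0
        total += w * f(x) * norm * math.exp(-0.5 * x * x)
    return total * h

print(gauss_expectation(rho_log))  # close to 0.5
```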
\[Q_n = c_n \cdot 2.2191 \cdot \{|x_i - x_j|;\; i < j\}_{(k)}\]
where \(k = \binom{h}{2}\), \(h = \lfloor n/2 \rfloor + 1\), and \(c_n\) is the finite-sample correction factor.
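A naive \(O(n^2)\) Python reference for this definition (without the finite-sample factor \(c_n\), so it will not exactly match qn() at small \(n\); Rousseeuw and Croux give an \(O(n \log n)\) algorithm):

```python
def qn_naive(x, scale=2.2191):
    """Q_n without c_n: the k-th order statistic of the n*(n-1)/2
    pairwise absolute differences, k = binomial(h, 2), h = n//2 + 1."""
    n = len(x)
    diffs = sorted(abs(x[i] - x[j])
                   for i in range(n) for j in range(i + 1, n))
    h = n // 2 + 1
    k = h * (h - 1) // 2         # binomial(h, 2)
    return scale * diffs[k - 1]  # k-th smallest (1-based)

print(qn_naive([1.0, 2.0, 3.0, 5.0, 7.0, 8.0]))
```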
\[S_n = c_n' \cdot 1.1926 \cdot \operatorname{lomed}_i \{\operatorname{himed}_j |x_i - x_j|\}\]
where \(\operatorname{lomed}\) and \(\operatorname{himed}\) denote the low and high medians respectively.
\[ c_4(n) = \sqrt{\frac{2}{n{-}1}} \cdot \frac{\Gamma(n/2)}{\Gamma((n{-}1)/2)} \]
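The correction is cheap to compute; a Python sketch using log-gamma so the gamma ratio stays finite at large \(n\):

```python
import math

def c4(n):
    """Bias-correction factor c4(n), via lgamma to avoid overflow."""
    return math.sqrt(2.0 / (n - 1)) * math.exp(
        math.lgamma(n / 2.0) - math.lgamma((n - 1) / 2.0))

print(c4(2))    # sqrt(2/pi), about 0.7979
print(c4(100))  # approaches 1 from below
```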
\[ \text{GMD}(x) = C_{\text{GMD}} \cdot \frac{2}{n(n{-}1)}\sum_{i=1}^{n} (2i - n - 1)\, x_{(i)} \]
Algebraically equivalent to \(\frac{1}{\binom{n}{2}}\sum_{i<j}|x_i - x_j|\), but computed in \(O(n \log n)\) via sorted-array indexing.
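Both forms in Python, cross-checked against each other (illustrative reference code, not the package's AVX2 kernel):

```python
import math
from itertools import combinations

C_GMD = 0.886226925452758  # sqrt(pi)/2, consistency at the normal

def gmd_fast(x):
    """Gini mean difference via the sorted-weights form: O(n log n)."""
    n = len(x)
    xs = sorted(x)
    pair_mean = 2.0 / (n * (n - 1)) * sum(
        (2 * (i + 1) - n - 1) * v for i, v in enumerate(xs))  # 1-based i
    return C_GMD * pair_mean

def gmd_naive(x):
    """O(n^2) pairwise definition, for cross-checking."""
    n = len(x)
    pair_mean = (sum(abs(a - b) for a, b in combinations(x, 2))
                 / math.comb(n, 2))
    return C_GMD * pair_mean

x = [1.0, 2.0, 3.0, 5.0, 7.0, 8.0]
print(gmd_fast(x), gmd_naive(x))
```

On this input both forms agree and reproduce the gmd() value 3.0723 shown earlier (up to rounding).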
Weights are the normalized inverse bootstrap variances \(w_j = (1/V_j) / \sum_k (1/V_k)\) where \(V_j = \operatorname{Var}_{\text{boot}}(\hat\sigma_j)\):
\[ \hat\sigma = \frac{\sum_{j=1}^{J} \hat\sigma_j(x) \,/\, \operatorname{Var}_{\mathrm{boot}}(\hat\sigma_j)}{\sum_{j=1}^{J} 1 \,/\, \operatorname{Var}_{\mathrm{boot}}(\hat\sigma_j)} \]
| Symbol | Value | Definition |
|---|---|---|
| \(c\) | 0.37394112142347236 | Solution to \(\int \rho_{\log}(u)\,d\Phi(u) = 0.5\); scale rho constant |
| \(C_{\text{ADM}}\) | 1.2533141373155001 | \(\sqrt{\pi/2}\); ADM consistency constant |
| \(C_{\text{MAD}}\) | 1.482602218505602 | \(1/\Phi^{-1}(3/4)\); MAD consistency constant |
| \(C_{\text{GMD}}\) | 0.886226925452758 | \(\sqrt{\pi}/2\); GMD consistency constant |
| \(C_{\text{IQR}}\) | 0.741301109252801 | \(1/(\Phi^{-1}(0.75) - \Phi^{-1}(0.25))\); IQR consistency constant |
adm(), robLoc(), and
robScale() accept the same arguments and return the same
values as revss. Code using revss can switch
to robscale by changing only the library()
call:
x <- c(1.2, 2.4, 3.1, 5.5, 7.0)
revss::robScale(x)
#> [1] 3.094334
robscale::robScale(x) # differs; see note
#> [1] 2.34597

Note: revss v3.1.0 updated its bias correction factors;
robscale follows the Rousseeuw and
Verboven (2002) estimating equations and constants but solves
them via Newton–Raphson rather than the original multiplicative
iteration, so results may differ at convergence tolerance.
qn() and sn() match
robustbase::Qn() and robustbase::Sn() (with
lowercase names for consistency).
At tolerance \(\sqrt{\epsilon_{\text{mach}}} \approx 1.49 \times
10^{-8}\), robscale matches revss
across 5,400 randomly generated inputs (\(n =
3, \ldots, 20\); 100 replicates per \(n\); 100% pass rate). See
tests/testthat/test-cross-check.R for full test
coverage.