Introduction to robscale

1. The Robustness Problem

Outliers compromise the reliability of classical estimators even in moderate samples. A single recording error can destroy the standard deviation, yet the standard deviation remains the default scale estimator in most statistical software.

To see why, consider a small sample with one gross error:

library(robscale)

x_clean <- c(2.1, 2.3, 2.0, 2.4, 2.2, 2.1, 2.3, 1.9)
x_contaminated <- c(x_clean, 200)   # one recording error

sd(x_clean)
#> [1] 0.1685018
sd(x_contaminated)   # collapses
#> [1] 65.94602

gmd(x_clean)
#> [1] 0.1804105
gmd(x_contaminated)  # unaffected
#> [1] 39.1023

qn(x_clean)
#> [1] 0.1486827
qn(x_contaminated)   # unaffected
#> [1] 0.1938622

The breakdown point is the fraction of contaminated observations an estimator tolerates before producing an arbitrarily large result. The standard deviation has a 0% breakdown point; a single outlier drives it to infinity. The GMD achieves 29.3%; qn and sn achieve the maximum 50%.

The asymptotic relative efficiency (ARE) measures statistical efficiency relative to the sample standard deviation under the Gaussian model (Bickel and Lehmann 1976). High efficiency means fewer observations are needed to match the standard deviation’s precision. The tradeoff is inherent: higher breakdown point generally costs efficiency.

robscale exposes all major points on this robustness–efficiency frontier.

2. Getting Started

Installation

install.packages("robscale")

# Development version:
# install.packages("remotes")
# remotes::install_github("davdittrich/robscale")

Quick demo

library(robscale)

set.seed(42)
x <- c(rnorm(20), 50)   # 20 clean observations + one outlier

scale_robust(x)              # auto-selects best strategy
#> [1] 5.391021
scale_robust(x, ci = TRUE)   # with confidence interval
#> gmd estimate: 5.3910
#> 95% CI (analytical): [3.7441, 7.0380]

data.frame(
  estimator = c("sd", "sd_c4", "gmd", "qn", "mad_scaled"),
  value = round(c(sd(x), sd_c4(x), gmd(x), qn(x), mad_scaled(x)), 4)
)
#>    estimator   value
#> 1         sd 10.9441
#> 2      sd_c4 11.0817
#> 3        gmd  5.3910
#> 4         qn  1.4471
#> 5 mad_scaled  1.3756

3. The Estimator Spectrum

robscale provides 11 exported functions spanning the full robustness–efficiency spectrum:

Function	ARE	Breakdown	Description
`sd_c4(x)`	100%	0%	Bias-corrected standard deviation
`gmd(x)`	98%	29.3%	Gini mean difference
`adm(x)`	88.3%	\(1/n\)	Average deviation from median
`qn(x)`	82.3%	50%	Rousseeuw–Croux \(Q_n\)
`sn(x)`	58.2%	50%	Rousseeuw–Croux \(S_n\)
`iqr_scaled(x)`	37%	25%	Scaled interquartile range
`mad_scaled(x)`	36.7%	50%	Scaled median absolute deviation
`robScale(x)`	~55%	50%	M-estimator for scale
`robLoc(x)`	98.4%	50%	M-estimator for location
`scale_robust(x)`	ensemble	—	Variance-weighted dispatcher
`get_consistency_constant(n)`	—	—	Finite-sample bias correction factors

Decision guide

Maximum efficiency, no outliers expected: sd_c4
High efficiency with moderate outlier protection: gmd
Maximum breakdown with good efficiency: qn
Fast computation on large data: mad_scaled or iqr_scaled
Uncertain which estimator to use: scale_robust() (adaptive)
M-estimate of location: robLoc
M-estimate of scale: robScale

4. The `scale_robust()` Dispatcher

scale_robust() adapts its strategy to sample size automatically.

Ensemble mode (\(n < 20\))

For small samples, all 7 scale estimators run on bootstrap replicates. Inverse-variance weights—where the variance is the bootstrap variance of each estimator across replicates—combine them into a single estimate:

\[ \hat\sigma = \frac{\sum_{j=1}^{7} \hat\sigma_j(x) \,/\, \operatorname{Var}_{\mathrm{boot}}(\hat\sigma_j)}{\sum_{j=1}^{7} 1 \,/\, \operatorname{Var}_{\mathrm{boot}}(\hat\sigma_j)} \]

set.seed(1)
x_small <- rnorm(12)
scale_robust(x_small)              # ensemble (n = 12 < 20)
#> [1] 0.8261654
scale_robust(x_small, ci = TRUE)   # BCa bootstrap CI
#> Ensemble estimate: 0.8262
#> 95% bootstrap CI (bca): [0.5546, 1.2405]
#> 
#> Component estimators:
#>   estimator estimate  weight analytical_lower analytical_upper boot_lower
#>       sd_c4   0.8293 0.31376           0.5875            1.408     0.5833
#>         gmd   0.8414 0.28733           0.5014            1.181     0.6054
#>  mad_scaled   0.7737 0.05480           0.2634            1.284     0.2252
#>  iqr_scaled   0.7428 0.07801           0.2543            1.231     0.3158
#>          sn   0.9515 0.06101           0.4517            1.451     0.2956
#>          qn   0.8658 0.10749           0.4833            1.248     0.3516
#>    robScale   0.7451 0.09760           0.3432            1.147     0.3063
#>  boot_upper
#>       1.129
#>       1.113
#>       1.453
#>       1.443
#>       1.447
#>       1.300
#>       1.289

Automatic switch to GMD (\(n \geq 20\))

For larger samples, scale_robust() uses gmd() directly: 98% ARE, 29.3% breakdown, \(O(n \log n)\).

x_large <- rnorm(500)
scale_robust(x_large)   # equivalent to gmd(x_large)
#> [1] 1.017226

Explicit method selection

scale_robust(x_small, method = "qn")       # force Qn
#> [1] 0.8658276
scale_robust(x_small, method = "robScale") # force M-scale
#> [1] 0.7451493
scale_robust(x_small, auto_switch = FALSE) # keep ensemble at any n
#> [1] 0.8261654

5. Individual Estimators

`sd_c4(x)`: Bias-corrected standard deviation

Applies the exact \(c_4(n)\) finite-sample correction to remove the downward bias of the sample standard deviation under normality:

x <- c(1.2, 0.8, 1.5, 0.9, 1.1)
sd_c4(x)              # unbiased under normality
#> [1] 0.2913462
sd(x)                 # biased downward for small n
#> [1] 0.2738613
sd_c4(x, ci = TRUE)
#> sd_c4 estimate: 0.2913
#> 95% CI (analytical): [0.1746, 0.8372]

`gmd(x)`: Gini mean difference

Average absolute pairwise difference (Gini 1912), computed via the order-statistics form in \(O(n \log n)\):

gmd(c(1, 2, 3, 5, 7, 8))
#> [1] 3.072253
gmd(c(1, 2, 3, 5, 7, 8), ci = TRUE)
#> gmd estimate: 3.0723
#> 95% CI (analytical): [1.3163, 4.8282]

`adm(x)`: Average deviation from the median

The mean deviation from the median (Nair 1936, 1947), scaled for consistency at the normal distribution.

adm(c(1, 2, 3, 5, 7, 8))
#> [1] 2.9244

`qn(x)`: Rousseeuw–Croux \(Q_n\) (Rousseeuw and Croux 1993)

qn(c(1, 2, 3, 5, 7, 8))
#> [1] 2.717298
qn(c(1, 2, 3, 5, 7, 8), ci = TRUE)
#> qn estimate: 2.7173
#> 95% CI (analytical): [1.0195, 4.4151]

`sn(x)`: Rousseeuw–Croux \(S_n\)

sn(c(1, 2, 3, 5, 7, 8))
#> [1] 3.556794
sn(c(1, 2, 3, 5, 7, 8), ci = TRUE)
#> sn estimate: 3.5568
#> 95% CI (analytical): [0.9144, 6.1992]

`iqr_scaled(x)` and `mad_scaled(x)`

iqr_scaled(c(1, 2, 3, 5, 7, 8))
#> [1] 3.15053
mad_scaled(c(1, 2, 3, 5, 7, 8))
#> [1] 3.706506

`robLoc(x)`: M-estimator for location

M-estimator for location via Newton–Raphson iteration on the logistic psi function (Rousseeuw and Verboven 2002, Eq. 21). Starting value: \(T^{(0)} = \text{median}(x)\); auxiliary scale: \(S = \text{MAD}(x)\).

Fallback: When scale is unknown and \(n < 4\), or when scale is known and \(n < 3\), returns median(x) without iteration.

robLoc(c(1, 2, 3, 5, 7, 8))
#> [1] 4.317035
robLoc(c(1, 2, 3), scale = 1.5)   # known scale enables n = 3
#> [1] 2

Newton–Raphson algorithm for robLoc():

flowchart TD
    A[Set minobs: 3 if scale known, 4 if unknown] --> B[med = median x]
    B --> C{n < minobs?}
    C -- Yes --> D([Return med])
    C -- No --> E{Scale known?}
    E -- Yes --> F[s = scale]
    E -- No --> G[s = MAD x]
    F --> H{s = 0?}
    G --> H
    H -- Yes --> I([Return med])
    H -- No --> J[t = med]
    J --> K[psi_i = tanh( (x_i - t) / 2s )]
    K --> L[sum_psi = Sum psi_i, sum_dpsi = Sum (1 - psi_i^2)]
    L --> M[t += 2s * sum_psi / sum_dpsi]
    M --> N{"|v| <= tol * max(|t|, 1)?"}
    N -- No --> K
    N -- Yes --> O([Return t])

`robScale(x)`: M-estimator for scale

M-estimator for scale via Newton–Raphson iteration on the M-scale estimating equation (Rousseeuw and Verboven 2002). Starting value: \(S^{(0)} = \text{MAD}(x)\).

robScale(c(1, 2, 3, 5, 7, 8))
#> [1] 3.305786
robScale(c(5, 5, 5, 5, 6), fallback = "na")   # revss compatibility
#> [1] NA
robScale(c(1, 2, 3, 5, 7, 8), ci = TRUE)
#> robScale estimate: 3.3058
#> 95% CI (analytical): [0.7838, 5.8278]

Newton–Raphson iteration for robScale():

flowchart TD
    A{Location known?}
    A -- Yes --> B1[w = x - loc]
    B1 --> B2[s = 1.4826 * median( abs(w) )]
    B2 --> B3[t = 0, minobs = 3]
    A -- No --> C1[med = median x]
    C1 --> C2[s = MAD x, t = med]
    C2 --> C3[minobs = 4]
    B3 --> D{n < minobs?}
    C3 --> D
    D -- Yes --> E{s <= implbound?}
    E -- Yes --> F{fallback?}
    F -- adm --> F1([Return ADM x])
    F -- na --> F2([Return NA])
    E -- No --> G([Return s])
    D -- No --> H{s <= implbound?}
    H -- Yes --> F
    H -- No --> J["u_i = (x_i - t) / (2cs), compute tanh(u_i)"]
    J --> K["numer = Σtanh²/n − 0.5<br/>denom = (2/n)·Σ(u·tanh·sech²)"]
    L --> LG{"|denom| <= 1e-14 * s?"}
    LG -- Yes --> LF["s *= sqrt(2 * mean_tanh²)"]
    LF --> J
    LG -- No --> MM["Δs = s · numer / denom"]
    MM --> MG{"s + Δs <= 0?"}
    MG -- Yes --> MH["s /= 2"]
    MH --> J
    MG -- No --> MS["s ← s + Δs"]
    MS --> N{"|Δs|/s ≤ tol?"}
    N -- No --> J
    N -- Yes --> O([Return s])

6. Confidence Intervals

All scale estimators support ci = TRUE:

# Analytical CI (chi-squared for sd_c4, ARE-based for others)
sd_c4(c(1, 2, 3, 5, 7, 8), ci = TRUE)
#> sd_c4 estimate: 2.9476
#> 95% CI (analytical): [1.8399, 7.2294]
gmd(c(1, 2, 3, 5, 7, 8), ci = TRUE)
#> gmd estimate: 3.0723
#> 95% CI (analytical): [1.3163, 4.8282]

# BCa bootstrap CI for scale_robust() ensemble
set.seed(42)
x_small <- rnorm(10)
scale_robust(x_small, ci = TRUE, n_boot = 500)
#> Ensemble estimate: 0.8321
#> 95% bootstrap CI (bca): [0.5010, 1.2110]
#> 
#> Component estimators:
#>   estimator estimate  weight analytical_lower analytical_upper boot_lower
#>       sd_c4   0.8589 0.32524           0.5908           1.5681    0.62712
#>         gmd   0.8671 0.27949           0.4832           1.2509    0.63590
#>  mad_scaled   0.7177 0.06932           0.1992           1.2362    0.04068
#>  iqr_scaled   0.9438 0.07122           0.2638           1.6237    0.33936
#>          sn   0.6129 0.06227           0.2602           0.9656    0.05213
#>          qn   0.8021 0.11373           0.4139           1.1904    0.68047
#>    robScale   0.8137 0.07872           0.3328           1.2945    0.28543
#>  boot_upper
#>       1.217
#>       1.153
#>       1.507
#>       1.566
#>       1.760
#>       1.404
#>       1.468

The CI return type is a named list with estimate, lower, upper, and level. For scale_robust() in ensemble mode, the object class is robscale_ensemble_ci and includes per-estimator weights.

7. Performance Architecture

SIMD vectorization

The logistic psi function satisfies \(\psi_{\log}(x) = \tanh(x/2)\). robscale evaluates tanh in bulk using the fastest available platform backend. The configure script selects one of two vectorized tanh libraries at compile time; at runtime, CPUID selects the appropriate instruction set:

Apple Accelerate (vvtanh): array-wide SIMD on macOS.
glibc libmvec (Linux x86_64, glibc \(\geq\) 2.35) or SLEEF (fallback when libmvec is absent). Only one is linked per binary. The AVX2 4-wide kernel (_ZGVdN4v_tanh or Sleef_tanhd4_u10avx2) processes 4 doubles per iteration.
OpenMP SIMD (#pragma omp simd): compiler-guided portable fallback.
Scalar std::tanh: universal fallback.

For the bulk tanh dispatcher, vectorization requires \(n \geq 8\). However, robLoc’s fused AVX2 NR kernel activates at \(n \geq 4\) (one full 4-wide vector).

The Gini mean difference uses a separate AVX2 FMA kernel (_mm256_fmadd_pd) for its weighted sum \(\sum (2i - (n-1))\, x_{(i)}\), shared across the standalone gmd(), the ensemble bootstrap, and the BCa jackknife. This kernel activates for \(n \geq 8\) on AVX2 hardware.

\(O(n)\) median selection

robscale uses a tiered strategy rather than sort():

\(n \in \{8, 16, 32\}\) on AVX2 hardware: SIMD selection networks (bitonic merge, 23–58% faster than scalar selection networks).
\(n \leq 36\): scalar selection networks for median computation.
\(n \leq 56\): branchless sorting networks for full-sort paths (GMD, \(Q_n\) exact, \(S_n\)), compiled to conditional-move instructions at -O2:

\(n\)	Comparators
3	3
4	5
8	19
16	60
32	185
56	438

\(37 \leq n\) (selection) / \(57 \leq n\) (sort): Floyd–Rivest algorithm (Floyd and Rivest 1975) or pdqselect, chosen per-estimator based on a runtime L2-cache-derived crossover threshold.

The crossover between Floyd–Rivest and pdqselect is derived at runtime from the per-core L2 cache size; each estimator uses its own threshold based on working-set pressure:

Estimator	Strategy	Threshold
`iqr_scaled`	Always pdqselect	—
`mad_scaled`	Floyd–Rivest vs. pdqselect	\(\max(2048,\, L_2 / 40)\)
`robScale`	Floyd–Rivest vs. pdqselect	\(\max(2048,\, L_2 / 16)\)
\(S_n\)	Floyd–Rivest vs. pdqselect	\(\max(2048,\, L_2 / 16)\)
\(Q_n\) (final window)	Inline check	\(\max(2048,\, L_2 / 32)\)

Stack-allocated memory arenas

A 128-double micro-buffer (1 KB) covers the smallest samples (\(n \leq 128\) for MAD, \(n \leq 64\) for robScale and robLoc). For \(n \leq 2{,}048\), stack-allocated arrays avoid heap allocation: robScale uses one 2,048-double array (16 KB); robLoc uses one 4,096-double array (32 KB, split into equal-sized buffer and deviation halves).

Fused single-buffer algorithm. After median selection the working buffer holds a permutation of the input. Because MAD is order-invariant, absolute deviations can be computed in-place, eliminating the second scratch array and halving memory from \(2n\) to \(n\) doubles per estimator.

Iteration convergence

robLoc: Newton–Raphson. Location estimation uses Newton–Raphson (quadratic convergence, ~3 iterations) rather than the scoring fixed-point method (~6–8 iterations):

\[ T^{(k+1)} = T^{(k)} + \frac{2\,S\sum \psi\!\left(\frac{x_i - T^{(k)}}{2S}\right)} {\sum \left[1 - \psi^2\!\left(\frac{x_i - T^{(k)}}{2S}\right)\right]} \]

Since \(\tanh\) values are already computed for the numerator, the denominator costs only squaring and subtraction. Typical iteration counts:

\(n\)	Scoring	Newton–Raphson
4	7	2–4
8	7	2–4
20	6	2–4
100	5	2–4

Fused AVX2 kernel for robLoc(). rob_loc_nr_step_avx2 collapses three passes into one. Two AVX2 accumulators advance in lockstep:

\[ \texttt{acc\_psi} \mathrel{+}= p_i, \qquad \texttt{acc\_dpsi} \mathrel{+}= (1 - p_i^2) \]

where \(p_i = \tanh(u_i)\) is evaluated once. The derivative accumulator uses a fused negative multiply-add (fnmadd) to accumulate \(1 - p_i^2\) directly, avoiding the catastrophic cancellation that would arise from subtracting two large numbers (\(n - \sum p_i^2\)) when all \(|u_i| \gg 1\).

robScale: Newton–Raphson. Scale estimation also uses Newton–Raphson, applied to the M-scale estimating equation \(n^{-1}\sum \rho(u_i) = 1/2\). The fused single-pass kernel (nr_scale_step_avx2 / nr_scale_step_scalar) computes \(\sum \tanh^2(u_i)\) and \(\sum u_i \tanh(u_i)\,\text{sech}^2(u_i)\) in one read over the data, yielding the NR update \(\Delta s = s \cdot (\bar\rho - \tfrac{1}{2}) / \bar{D}\) where \(\bar{D}\) is the scaled derivative sum. The same quadratic convergence as robLoc applies; typical iteration counts are 3–4. A degenerate-denominator guard applies two protections: when the denominator is degenerate (\(|\bar{D}| \leq 10^{-14} \times s\)), a multiplicative fallback \(s \leftarrow s \times \sqrt{2 \times \overline{\tanh^2}}\) replaces the Newton step, avoiding the catastrophic cancellation of subtracting two large numbers. A separate guard halves \(s\) when the proposed update would make \(s\) non-positive (\(s + \Delta s \leq 0\)).

TBB parallelism and ensemble design

\(Q_n\) and \(S_n\) partition inner loops across TBB threads for \(n\) above a runtime L2-derived threshold. The ensemble bootstrap parallelizes across TBB threads when \(n \times n_{\text{boot}} \geq 10{,}000\); with default settings (\(n < 20\), \(n_{\text{boot}} = 200\)) the bootstrap runs serially.

The ensemble bootstrap stores results in estimator-major layout (\(7 \times n_{\text{boot}}\)): each estimator’s replicates occupy a contiguous memory column, so the mean/variance reduction pass reads at stride-1 rather than stride-7. Within each replicate, the bootstrap resample is sorted once and reused by all seven estimators; pre-allocated workspace buffers eliminate per-replicate heap allocation for \(S_n\) and \(Q_n\). Bootstrap resampling uses a deterministic XorShift32 PRNG (Marsaglia 2003) seeded per replicate for reproducibility.

Numerical stability and compile-time optimizations

sd_c4 uses Welford’s one-pass algorithm (Welford 1962) for numerically stable variance.
The reciprocal constant \(1/c\) is constexpr, replacing an inner-loop division with multiplication.
Loop-invariant quantities (inv_s, half_inv_s, inv_n) are hoisted before the iteration loop.
Sorting-network entry points are instantiated in a single translation unit, reducing cold build time from ~300 s to ~52 s.

8. Mathematical Foundations

Logistic psi function

(Rousseeuw and Verboven 2002, Eq. 23):

\[ \psi_{\log}(x) = \frac{e^x - 1}{e^x + 1} = \tanh(x/2) \]

Bounded in \((-1, 1)\), \(C^\infty\), strictly monotone. Boundedness provides robustness; smoothness avoids the corner artifacts of Huber’s psi at small \(n\).

Decoupled estimation

Location and scale are estimated separately with a fixed auxiliary estimate, breaking the positive-feedback loop of Huber’s Proposal 2. robLoc fixes scale at \(\text{MAD}(x)\); robScale fixes location at \(\text{median}(x)\).

Rho function for scale

(Rousseeuw and Verboven 2002, Eq. 26):

\[ \rho_{\log}(x) = \psi_{\log}^2(x / c) \]

where \(c = 0.37394112142347236\) is the constant that yields a 50% breakdown point.

\(Q_n\) and \(S_n\) statistics

\[Q_n = c_n \cdot 2.2191 \cdot \{|x_i - x_j|;\; i < j\}_{(k)}\]

where \(k = \binom{h}{2}\), \(h = \lfloor n/2 \rfloor + 1\), and \(c_n\) is the finite-sample correction factor.

\[S_n = c_n' \cdot 1.1926 \cdot \operatorname{lomed}_i \{\operatorname{himed}_j |x_i - x_j|\}\]

where \(\operatorname{lomed}\) and \(\operatorname{himed}\) denote the low and high medians respectively.

\(c_4(n)\) correction factor

\[ c_4(n) = \sqrt{\frac{2}{n{-}1}} \cdot \frac{\Gamma(n/2)}{\Gamma((n{-}1)/2)} \]

Gini mean difference

\[ \text{GMD}(x) = C_{\text{GMD}} \cdot \frac{2}{n(n{-}1)}\sum_{i=1}^{n} (2i - n - 1)\, x_{(i)} \]

Algebraically equivalent to \(\frac{1}{\binom{n}{2}}\sum_{i<j}|x_i - x_j|\), but computed in \(O(n \log n)\) via sorted-array indexing.

Ensemble weighting formula

Weights are the normalized inverse bootstrap variances \(w_j = (1/V_j) / \sum_k (1/V_k)\) where \(V_j = \operatorname{Var}_{\text{boot}}(\hat\sigma_j)\):

\[ \hat\sigma = \frac{\sum_{j=1}^{J} \hat\sigma_j(x) \,/\, \operatorname{Var}_{\mathrm{boot}}(\hat\sigma_j)}{\sum_{j=1}^{J} 1 \,/\, \operatorname{Var}_{\mathrm{boot}}(\hat\sigma_j)} \]

Key constants (full double precision)

Symbol	Value	Definition
\(c\)	`0.37394112142347236`	Solution to \(\int \rho_{\log}(u)\,d\Phi(u) = 0.5\); scale rho constant
\(C_{\text{ADM}}\)	`1.2533141373155001`	\(\sqrt{\pi/2}\); ADM consistency constant
\(C_{\text{MAD}}\)	`1.482602218505602`	\(1/\Phi^{-1}(3/4)\); MAD consistency constant
\(C_{\text{GMD}}\)	`0.886226925452758`	\(\sqrt{\pi}/2\); GMD consistency constant
\(C_{\text{IQR}}\)	`0.741301109252801`	\(1/(\Phi^{-1}(0.75) - \Phi^{-1}(0.25))\); IQR consistency constant

9. Compatibility

revss API

adm(), robLoc(), and robScale() accept the same arguments and return the same values as revss. Code using revss can switch to robscale by changing only the library() call:

x <- c(1.2, 2.4, 3.1, 5.5, 7.0)
revss::robScale(x)
#> [1] 3.094334
robscale::robScale(x)   # same value
#> [1] 2.34597

Note: revss v3.1.0 updated its bias correction factors; robscale follows the Rousseeuw and Verboven (2002) estimating equations and constants but solves them via Newton–Raphson rather than the original multiplicative iteration, so results may differ at convergence tolerance.

robustbase API

qn() and sn() match robustbase::Qn() and robustbase::Sn() (with lowercase names for consistency).

Numerical equivalence

At tolerance \(\sqrt{\epsilon_{\text{mach}}} \approx 1.49 \times 10^{-8}\), robscale matches revss across 5,400 randomly generated inputs (\(n = 3, \ldots, 20\); 100 replicates per \(n\); 100% pass rate). See tests/testthat/test-cross-check.R for full test coverage.