| Type: | Package |
| Title: | Robust Outlier Detection for Diverse Distributions |
| Version: | 0.1.3 |
| Maintainer: | Amanda Mejia <mandy.mejia@gmail.com> |
| Description: | Provides robust outlier detection techniques for identifying anomalies in multivariate data, with a focus on methods that remain effective under non-Gaussian distributions. For more details see Saluja, Parlak, and Mejia (2026+) <doi:10.48550/arXiv.2505.11806>. |
| License: | GPL-3 |
| URL: | https://github.com/mandymejia/rrobot |
| BugReports: | https://github.com/mandymejia/rrobot/issues |
| Depends: | R (≥ 3.6.0) |
| Imports: | MASS, stats, cellWise, expm, robustbase, gamlss, imputeTS, isotree, ggplot2, tidyr, reshape2, rlang |
| Suggests: | knitr, rmarkdown, testthat (≥ 3.0.0) |
| VignetteBuilder: | knitr |
| Encoding: | UTF-8 |
| RoxygenNote: | 7.3.3 |
| Language: | en-US |
| NeedsCompilation: | no |
| Packaged: | 2026-03-04 21:08:12 UTC; ddpham |
| Author: | Amanda Mejia [aut, cre],
Damon Pham |
| Repository: | CRAN |
| Date/Publication: | 2026-03-09 16:30:02 UTC |
Dots parameter documentation
Description
Dots parameter documentation
Arguments
... |
Additional arguments passed to method-specific functions. |
B parameter documentation
Description
B parameter documentation
Arguments
B |
Integer; number of bootstrap samples per imputed dataset (default = 1000). |
M parameter documentation
Description
M parameter documentation
Arguments
M |
Integer; number of multiply imputed datasets (default = 50). |
Multiple Imputation with Per-Cycle Updates (OLS + MICE-style)
Description
Multiple Imputation with Per-Cycle Updates (OLS + MICE-style)
Usage
MImpute(x, w, outlier_matrix, M = 50, k = 5, ridge_eps = 1e-08, tol = NA_real_)
Arguments
x |
(T × p) high-kurtosis ICA matrix to impute. |
w |
(T × q) predictors (e.g., low-kurtosis components). |
outlier_matrix |
logical (T × p) mask of entries to impute. |
M |
number of multiply-imputed datasets (default 50). |
k |
number of chained-equation cycles per dataset (default 5; 5–10 is common). |
ridge_eps |
tiny ridge added to X'X for stability (default 1e-8). |
tol |
optional early-stop tolerance on per-cycle max change (NA to disable). |
Value
list(imp_datasets, outlier_matrix)
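The chained-equation idea can be sketched in base R. This is a minimal illustration under stated assumptions, not the package's implementation: the helper name ridge_impute_once is hypothetical, each column of x is regressed on w and the other columns of x by ridge-stabilised least squares, and flagged entries are replaced by the prediction plus Gaussian noise scaled by the residual standard deviation.

```r
# Hypothetical helper: one imputed dataset via k chained-equation cycles.
ridge_impute_once <- function(x, w, outlier_matrix, k = 5, ridge_eps = 1e-8) {
  x_imp <- x
  # Initialise flagged entries with the mean of the non-flagged entries.
  for (j in seq_len(ncol(x))) {
    miss <- outlier_matrix[, j]
    x_imp[miss, j] <- mean(x[!miss, j])
  }
  for (cycle in seq_len(k)) {
    for (j in seq_len(ncol(x))) {
      miss <- outlier_matrix[, j]
      if (!any(miss)) next
      obs <- !miss
      # Design matrix: intercept, predictors w, and the other columns of x.
      X <- cbind(1, w, x_imp[, -j, drop = FALSE])
      XtX <- crossprod(X[obs, , drop = FALSE])
      # Tiny ridge added to X'X for numerical stability.
      beta <- solve(XtX + ridge_eps * diag(ncol(X)),
                    crossprod(X[obs, , drop = FALSE], x[obs, j]))
      res_sd <- sd(x[obs, j] - X[obs, , drop = FALSE] %*% beta)
      # Replace flagged entries by prediction plus noise.
      x_imp[miss, j] <- X[miss, , drop = FALSE] %*% beta +
        rnorm(sum(miss), sd = res_sd)
    }
  }
  x_imp
}

set.seed(1)
n <- 100
w <- matrix(rnorm(n * 2), n, 2)
x <- cbind(w %*% c(1, -1) + rnorm(n, sd = 0.1), rnorm(n))
mask <- matrix(FALSE, n, 2)
mask[c(5, 50), 1] <- TRUE
x[mask] <- 25  # plant gross outliers at the flagged positions
imp_datasets <- lapply(1:3, function(m) ridge_impute_once(x, w, mask))
```

Only the flagged entries differ between the imputed datasets; all other entries are carried over from x unchanged.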
Multiple Imputation for High-Kurtosis ICA Components
Description
Performs multiple imputation using perturbed robust regression models.
Usage
MImpute_old(x, w, outlier_matrix, M = 50, k = 100)
Arguments
x |
A numeric matrix (n_time × p) of high-kurtosis ICA components to be imputed. |
w |
A numeric matrix (n_time × L) of low-kurtosis ICA components used as predictors (required for MI). |
outlier_matrix |
A logical matrix (same dim as x) indicating univariate outliers to be imputed. |
M |
Integer; number of multiply imputed datasets (default = 50). |
k |
Integer; number of perturbation cycles per imputation (default = 100). |
Value
A list with:
- imp_datasets
List of M imputed versions of x.
- outlier_matrix
Logical matrix of imputed outlier positions.
Comprehensive Outlier Detection Using Robust Distance Thresholding
Description
Performs univariate outlier detection and imputation, computes robust distances, and applies the selected thresholding method.
Usage
RD(
x,
w = NULL,
method = c("SI_boot", "MI", "MI_boot", "SI", "F", "SHASH"),
mode = "auto",
cov_mcd = NULL,
ind_incld = NULL,
dist = TRUE,
impute_method = "mean",
cutoff = 4,
trans = "SHASH",
M = 50,
k = 100,
alpha = 0.01,
quantile = 0.01,
verbose = FALSE,
boot_quant = 0.95,
B = 1000
)
Arguments
x |
A numeric matrix or data frame of dimensions T × p (observations × variables). |
w |
A numeric matrix (n_time × L) of low-kurtosis ICA components used as predictors (required for MI). |
method |
Character string; one of "SI_boot" (default), "MI", "MI_boot", "SI", "F", or "SHASH". |
mode |
Character string; either "auto" (default) to compute MCD internally or "manual" to use user-supplied values. |
cov_mcd |
Optional covariance matrix (p × p); required in "manual" mode. |
ind_incld |
Optional vector of row indices used to compute the robust mean; required in "manual" mode. |
dist |
Logical; if TRUE, compute squared robust Mahalanobis distances for all observations. |
impute_method |
Character string; imputation method for univariate outliers. |
cutoff |
A numeric value indicating how many MADs away from the median to flag as outliers. The default value is set to be 4. |
trans |
Character string; transformation method, one of "SHASH" or "robZ". |
M |
Integer; number of multiply imputed datasets (default = 50). |
k |
Integer; number of perturbation cycles per imputation (default = 100). |
alpha |
Significance level used to compute RD threshold (default = 0.01 for 99th percentile). |
quantile |
Numeric in (0,1) specifying the upper quantile for thresholding; the expected False Positive Rate for the chosen threshold. |
verbose |
Logical; if TRUE, print progress messages. |
boot_quant |
Numeric; confidence level for bootstrap confidence intervals (default = 0.95, for 95% CI). |
B |
Integer; number of bootstrap samples per imputed dataset (default = 1000). |
Value
Depends on method:
- Single method
Returns the result from the specific threshold method.
- RD_obj
The robust distance object from compute_RD().
- outliers
Logical vector indicating which observations have RD greater than the threshold.
- call
The matched function call.
RD_obj parameter documentation
Description
RD_obj parameter documentation
Arguments
RD_obj |
Pre-computed RD_result object from |
RD_org_obj parameter documentation
Description
RD_org_obj parameter documentation
Arguments
RD_org_obj |
Output list from |
SHASH-based Outlier Detection (Extended)
Description
Detects univariate outliers using an iterative SHASH fitting process with
optional pre-flagging strategies. A SHASH (Sinh-Arcsinh) distribution is
fitted to the data iteratively, each time excluding candidate outliers from
the fit, until the set of flagged observations converges or maxit
is reached.
Usage
SHASH_out(
x,
thr0 = 2.58,
thr1 = 2.58,
thr = 4,
tail = c("both", "upper", "lower"),
use_iso = TRUE,
thr_iso = 0.6,
maxit = 100,
weight_init = NULL
)
Arguments
x |
Numeric vector. May contain |
thr0 |
Positive numeric scalar. Threshold for initial outlier
pre-flagging when |
thr1 |
Positive numeric scalar. Threshold used to classify observations as inliers during iterative convergence (default: 2.58). |
thr |
Positive numeric scalar. Final threshold applied to the converged SHASH-normalised scores to declare outliers in the returned output (default: 4). |
tail |
Character string specifying which tail(s) to check for outliers.
Must be one of
|
use_iso |
Logical. If |
thr_iso |
Numeric scalar in [0, 1]. Isolation forest anomaly score
threshold above which observations are treated as candidate outliers
during pre-screening (default: 0.6). Only used when |
maxit |
Positive integer. Maximum number of fitting iterations before the algorithm stops regardless of convergence (default: 100). |
weight_init |
Optional logical vector of length |
Value
A list of class "SHASH_out" with the following elements:
- out_idx
Integer vector. Indices of observations in x that were flagged as outliers at the final threshold thr.
- x_norm
Numeric vector. SHASH-normalised scores for every observation (same length as x; NA where x was NA).
- SHASH_coef
Named list with elements mu, sigma, nu, and tau: the fitted SHASH parameter estimates from the final iteration (sigma and tau are on the log scale, as returned by gamlssML).
- isotree_scores
Numeric vector of isolation forest anomaly scores (same length as x). NA when use_iso = FALSE or weight_init was supplied.
- initial_weights
Logical vector. Inlier weights used for the very first fitting iteration (same length as x).
- indx_iters
Integer matrix of dimensions length(x) × last_iter. Each column records which observations were flagged as outliers (value 1) during that iteration.
- norm_iters
Numeric matrix of dimensions length(x) × last_iter. Each column records the SHASH-normalised scores from that iteration.
- last_iter
Integer. The number of iterations completed before convergence or hitting maxit.
- converged
Logical. TRUE if the inlier weight vector stabilised before reaching maxit.
- params
List. A record of all input parameters, stored for reproducibility.
Examples
# --- Example 1: Synthetic data with known injected outliers ---------------
# Using rnorm lets us inject outliers at known positions so we can verify
# the function finds exactly what we planted.
set.seed(42)
x <- rnorm(200, mean = 10, sd = 2)
# Shift a handful of observations far into the upper tail
outlier_positions <- c(17, 77, seq(190, 200))
x[outlier_positions] <- x[outlier_positions] + 10
result_sim <- SHASH_out(
x,
thr0 = 2.58,
thr1 = 2.58,
thr = 4,
tail = "both",
use_iso = FALSE # skip isolation forest to keep the example fast
)
result_sim$out_idx # should recover positions near outlier_positions
result_sim$converged # did the iterative fit stabilise?
# --- Example 2: Real benchmark data (Hawkins-Bradu-Kass) ------------------
# hbk is a classic outlier detection benchmark shipped with robustbase,
# which this package already imports, so it is always available.
data("hbk", package = "robustbase")
result_hbk <- SHASH_out(
hbk$X1,
thr0 = 2.58,
thr1 = 2.58,
thr = 4,
tail = "both",
use_iso = FALSE
)
result_hbk$out_idx # flagged observations in the X1 column
result_hbk$SHASH_coef # fitted SHASH parameters; sigma and tau are log-scale
# Which positions were flagged as outliers?
result_hbk$out_idx
# Did the algorithm converge before hitting maxit?
result_hbk$converged
# How many iterations did it take?
result_hbk$last_iter
SHASH Data Transformation
Description
These two functions form a matched pair for transforming data between the
SHASH (Sinh-Arcsinh) distribution and the standard normal distribution.
SHASH_to_normal() maps SHASH-distributed observations onto an
approximately normal scale; normal_to_SHASH() is the inverse.
Usage
SHASH_to_normal(x, mu, sigma, nu, tau)
normal_to_SHASH(x, mu, sigma, nu, tau)
Arguments
x |
Numeric vector of values to transform. |
mu |
Numeric scalar. Location parameter controlling the centre of the SHASH distribution. |
sigma |
Numeric scalar. Spread parameter on the log scale. The function
applies |
nu |
Numeric scalar. Skewness parameter. A value of 0 gives a symmetric distribution. |
tau |
Numeric scalar. Tail-weight parameter on the log scale. Pass
|
Value
A numeric vector of transformed values, the same length as x.
Functions
- SHASH_to_normal(): Transforms SHASH-distributed data to approximately normal data.
- normal_to_SHASH(): Transforms standard normal data back to the SHASH-distributed scale.
Examples
set.seed(42)
x <- rnorm(200)
x[c(17, 77)] <- x[c(17, 77)] + 5
mu <- 0; sigma <- 0; nu <- 0; tau <- 0
z <- SHASH_to_normal(x, mu = mu, sigma = sigma, nu = nu, tau = tau)
x_recovered <- normal_to_SHASH(z, mu = mu, sigma = sigma, nu = nu, tau = tau)
all.equal(x, x_recovered)
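For readers who want to see the shape of such a transform, the sketch below uses a sinh-arcsinh parameterisation in the style of Jones and Pewsey. It is an assumption that this matches the package's internal formula; the function names shash_to_normal_sketch and normal_to_shash_sketch are hypothetical, and sigma and tau are exponentiated because the arguments above are documented as being on the log scale.

```r
# Assumed sinh-arcsinh parameterisation; sigma and tau arrive on the log scale.
shash_to_normal_sketch <- function(x, mu, sigma, nu, tau) {
  sinh(exp(tau) * asinh((x - mu) / exp(sigma)) - nu)
}

# Algebraic inverse of the function above.
normal_to_shash_sketch <- function(z, mu, sigma, nu, tau) {
  mu + exp(sigma) * sinh((asinh(z) + nu) / exp(tau))
}

x <- c(-3, -1, 0, 0.5, 2, 10)
z <- shash_to_normal_sketch(x, mu = 1, sigma = log(2), nu = 0.3, tau = log(0.8))
x_back <- normal_to_shash_sketch(z, mu = 1, sigma = log(2), nu = 0.3, tau = log(0.8))
all.equal(x, x_back)  # round trip recovers the input
```

With mu = 0, sigma = 0, nu = 0, and tau = 0, this parameterisation reduces to the identity, which is why the example above recovers x exactly.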
Alpha parameter documentation
Description
Alpha parameter documentation
Arguments
alpha |
Significance level used to compute RD threshold (default = 0.01 for 99th percentile). |
Binwidth parameter documentation
Description
Binwidth parameter documentation
Arguments
binwidth |
Histogram bin width (default = 0.1). |
Boot_quant parameter documentation
Description
Boot_quant parameter documentation
Arguments
boot_quant |
Numeric; confidence level for bootstrap confidence intervals (default = 0.95, for 95% CI). |
Compute Squared Robust Distance and Covariance from a Subset
Description
Calculates the robust mean, covariance matrix, and optionally robust distances using either:
- "auto" mode: automatically selects the best robust subset using covMcd
- "manual" mode: uses provided robust covariance matrix and subset indices
Usage
compute_RD(
x,
mode = c("auto", "manual"),
cov_mcd = NULL,
ind_incld = NULL,
dist = TRUE
)
Arguments
x |
A numeric matrix or data frame of dimensions T × p (observations × variables). |
mode |
Character string; either "auto" (default) to compute MCD internally or "manual" to use user-supplied values. |
cov_mcd |
Optional covariance matrix (p × p); required in "manual" mode. |
ind_incld |
Optional vector of row indices used to compute the robust mean; required in "manual" mode. |
dist |
Logical; if TRUE, compute squared robust Mahalanobis distances for all observations. |
Value
A list with elements:
- ind_incld
Vector of row indices used to compute the robust mean and covariance.
- ind_excld
Vector of excluded row indices.
- h
Number of included observations.
- xbar_star
Robust mean vector (length p).
- S_star
Robust covariance matrix (p × p).
- invcov_sqrt
Matrix square root of the inverse covariance matrix (p × p).
- RD
Squared robust distances for all observations (length T), or NULL if dist = FALSE.
- call
The matched function call.
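The quantities in the returned list can be illustrated with base R alone. The sketch below assumes a clean subset is already known (loosely mirroring "manual" mode) and estimates the mean and covariance from that subset; the helper name rd_from_subset is hypothetical, and no MCD search is performed.

```r
# Hypothetical helper: squared distances from a user-supplied clean subset.
rd_from_subset <- function(x, ind_incld) {
  x <- as.matrix(x)
  xbar_star <- colMeans(x[ind_incld, , drop = FALSE])
  S_star <- stats::cov(x[ind_incld, , drop = FALSE])
  # Symmetric inverse square root of S_star via its eigendecomposition.
  eig <- eigen(S_star, symmetric = TRUE)
  invcov_sqrt <- eig$vectors %*%
    diag(1 / sqrt(eig$values), nrow = length(eig$values)) %*%
    t(eig$vectors)
  # Squared distance = squared norm of each centred, whitened row.
  centred <- sweep(x, 2, xbar_star)
  RD <- rowSums((centred %*% invcov_sqrt)^2)
  list(ind_incld = ind_incld, h = length(ind_incld),
       xbar_star = xbar_star, S_star = S_star,
       invcov_sqrt = invcov_sqrt, RD = RD)
}

set.seed(1)
x <- matrix(rnorm(300), 100, 3)
x[1:3, ] <- x[1:3, ] + 8                  # three shifted rows
fit <- rd_from_subset(x, ind_incld = 4:100)
top3 <- head(order(fit$RD, decreasing = TRUE), 3)  # rows with largest distances
```

The RD values computed this way agree with stats::mahalanobis() called with the same centre and covariance.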
Cov_mcd parameter documentation
Description
Cov_mcd parameter documentation
Arguments
cov_mcd |
Optional covariance matrix (p × p); required in "manual" mode. |
Cutoff parameter documentation
Description
Cutoff parameter documentation
Arguments
cutoff |
A numeric value indicating how many MADs away from the median to flag as outliers. The default value is set to be 4. |
Dist parameter documentation
Description
Dist parameter documentation
Arguments
dist |
Logical; if TRUE, compute squared robust Mahalanobis distances for all observations. |
Robust Empirical Rule Outlier Detection
Description
Detects outliers using the median ± thr × MAD rule, where MAD is
normalised by 1.4826 to be consistent with the standard deviation under
normality.
Usage
emprule_rob(x, thr = 4, tail = c("both", "upper", "lower"))
Arguments
x |
Numeric vector. |
thr |
Positive numeric scalar. Threshold multiplier for the MAD rule (default: 4). |
tail |
Character string: one of |
Value
A logical vector the same length as x. TRUE indicates
an outlier, FALSE indicates an inlier.
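The rule itself is short enough to write out. The following is a minimal base R sketch of the median ± thr × MAD rule described above; the function name mad_rule is hypothetical and the code is not taken from the package.

```r
# Hypothetical re-implementation of the median +/- thr * MAD rule.
mad_rule <- function(x, thr = 4, tail = c("both", "upper", "lower")) {
  tail <- match.arg(tail)
  # stats::mad() multiplies by 1.4826 by default.
  z <- (x - stats::median(x, na.rm = TRUE)) / stats::mad(x, na.rm = TRUE)
  switch(tail,
         both  = abs(z) > thr,
         upper = z > thr,
         lower = z < -thr)
}

x <- c(9.8, 10.1, 10.0, 9.9, 10.2, 10.0, 25.0)
which(mad_rule(x))                  # 7
which(mad_rule(x, tail = "lower"))  # integer(0)
```

Here the median is 10 and the scaled MAD is 0.14826, so only the value 25 lies more than 4 scaled MADs from the median.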
Imp_data parameter documentation
Description
Imp_data parameter documentation
Arguments
imp_data |
A numeric matrix (T × p) of single-imputed data. |
Imp_datasets parameter documentation
Description
Imp_datasets parameter documentation
Arguments
imp_datasets |
A list of M numeric matrices (T × p); multiply imputed datasets. |
Impute_method parameter documentation
Description
Impute_method parameter documentation
Arguments
impute_method |
Character string; imputation method for univariate outliers. |
Temporally impute univariate outliers from external detection
Description
Takes a high-kurtosis data matrix and a precomputed outlier mask,
replaces the outliers with NA, and applies temporal interpolation using imputeTS::na_interpolation.
Usage
impute_univOut(x, outlier_mask, method = c("mean", "interp"))
Arguments
x |
A numeric matrix or data frame of dimensions T × p (observations × variables). |
outlier_mask |
A logical matrix (same dimensions as x) with TRUE at outlier positions. |
method |
One of |
Value
A list with elements:
- x_df
Original matrix with outliers replaced as NA (as tibble).
- NA_data
Matrix version of x with NAs at outlier positions.
- imp_data
Imputed matrix after temporal interpolation.
- NA_locs
Row-column indices of outliers (now NA).
- call
The matched function call.
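The mask-then-fill step can be sketched without extra packages. In this illustration stats::approx() stands in for imputeTS::na_interpolation, the helper name impute_masked is hypothetical, and the behaviour of the "mean" option (column mean of the non-flagged entries) is an assumption.

```r
# Hypothetical helper: set flagged entries to NA, then fill column by column.
impute_masked <- function(x, outlier_mask, method = c("mean", "interp")) {
  method <- match.arg(method)
  NA_data <- as.matrix(x)
  NA_data[outlier_mask] <- NA
  imp_data <- NA_data
  t_idx <- seq_len(nrow(NA_data))
  for (j in seq_len(ncol(NA_data))) {
    miss <- is.na(NA_data[, j])
    if (!any(miss)) next
    if (method == "mean") {
      imp_data[miss, j] <- mean(NA_data[, j], na.rm = TRUE)
    } else {
      # Linear interpolation over the time index; rule = 2 carries the
      # nearest observed value into leading or trailing gaps.
      imp_data[miss, j] <- stats::approx(t_idx[!miss], NA_data[!miss, j],
                                         xout = t_idx[miss], rule = 2)$y
    }
  }
  list(NA_data = NA_data, imp_data = imp_data,
       NA_locs = which(outlier_mask, arr.ind = TRUE))
}

x <- cbind(a = c(1, 2, 50, 4, 8), b = c(10, 10, 10, 10, 10))
mask <- matrix(FALSE, 5, 2)
mask[3, 1] <- TRUE
by_interp <- impute_masked(x, mask, "interp")$imp_data[3, "a"]  # 3
by_mean   <- impute_masked(x, mask, "mean")$imp_data[3, "a"]    # 3.75
```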
Ind_incld parameter documentation
Description
Ind_incld parameter documentation
Arguments
ind_incld |
Optional vector of row indices used to compute the robust mean; required in "manual" mode. |
K parameter documentation
Description
K parameter documentation
Arguments
k |
Integer; number of perturbation cycles per imputation (default = 100). |
Threshold_method parameter documentation
Description
Threshold_method parameter documentation
Arguments
method |
Character string; one of "all","SI","SI_boot","MI","MI_boot","F", "SHASH". |
Method_univOut parameter documentation
Description
Method_univOut parameter documentation
Arguments
method |
Character string. One of |
Mode parameter documentation
Description
Mode parameter documentation
Arguments
mode |
Character string; either "auto" (default) to compute MCD internally or "manual" to use user-supplied values. |
Plot Method for RD Analysis Results
Description
Creates diagnostic plots for robust distance analysis results.
Usage
## S3 method for class 'RD'
plot(x, type = c("histogram", "imputations", "univOut"), method = NULL, ...)
Arguments
x |
An object of class "RD" from RD() or threshold_RD(). |
type |
Character string specifying plot type: "histogram" (default), "imputations", or "univOut". |
method |
Character string specifying threshold method. Auto-detected if NULL. |
... |
Additional arguments passed to plotting functions. |
Value
A ggplot object.
Plot F-Distribution Method Results
Description
Creates histogram of robust distances with F-distribution overlay and threshold.
Usage
plot_F_histogram(
F_result,
RD_obj,
alpha = 0.01,
binwidth = 0.1,
show_f_density = TRUE,
...
)
Arguments
F_result |
F_result object from thresh_F(). |
RD_obj |
Pre-computed RD_result object from |
alpha |
Significance level used to compute RD threshold (default = 0.01 for 99th percentile). |
binwidth |
Histogram bin width (default = 0.1). |
show_f_density |
Logical. Show F-distribution curve overlay (default = TRUE). |
... |
Additional arguments passed to method-specific functions. |
Value
A ggplot object.
Plot Robust Distance Histogram with Threshold
Description
Creates a histogram of robust distances with threshold line for outlier detection.
Usage
plot_RD_histogram(thresh_result, RD_obj, alpha = 0.01, binwidth = 0.1, ...)
Arguments
thresh_result |
A threshold result object from any threshold method containing threshold information. |
RD_obj |
Pre-computed RD_result object from |
alpha |
Significance level used to compute RD threshold (default = 0.01 for 99th percentile). |
binwidth |
Histogram bin width (default = 0.1). |
... |
Additional arguments passed to method-specific functions. |
Value
A ggplot object with histogram colored by inclusion status and threshold line.
Plot Multiple Threshold Methods on Robust Distance Histogram
Description
Creates a histogram of robust distances with multiple colored threshold lines showing different outlier detection methods simultaneously.
Usage
plot_RD_histogram_multi(
RD_result,
RD_obj,
methods = c("SI", "SI_boot", "MI", "MI_boot", "F"),
alpha = 0.01,
binwidth = 0.1,
...
)
Arguments
RD_result |
An RD result object from threshold_RD() with method="all" containing multiple threshold results in a list. |
RD_obj |
Pre-computed RD_result object from |
methods |
Character vector of threshold methods to display (default: c("SI", "SI_boot", "MI", "MI_boot", "F")). |
alpha |
Significance level used to compute RD threshold (default = 0.01 for 99th percentile). |
binwidth |
Histogram bin width (default = 0.1). |
... |
Additional arguments passed to method-specific functions. |
Value
A ggplot object with histogram colored by inclusion status and multiple colored threshold lines for comparison of different methods.
Plot Multiple Imputation Results from RD Analysis
Description
Creates time series plots showing original data, temporal imputation, and multiple imputation results with outlier locations highlighted.
Usage
plot_imputations(x)
Arguments
x |
An object of class "RD" from RD() or threshold_RD(). |
Value
Prints ggplot objects for each variable showing imputation results.
Plot Univariate Outliers from RD Analysis
Description
Creates a heatmap visualization of univariate outliers detected in high-kurtosis components.
Usage
plot_univOut(x, cutoff = NULL, method = NULL)
Arguments
x |
An object of class "RD" from RD() or threshold_RD(). |
cutoff |
A numeric value indicating how many MADs away from the median to flag as outliers. The default value is set to be 4. |
method |
Character string. One of |
Value
A ggplot object showing a heatmap of outlier locations.
Quantile parameter documentation
Description
Quantile parameter documentation
Arguments
quantile |
Numeric in (0,1) specifying the upper quantile for thresholding; the expected False Positive Rate for the chosen threshold. |
Summary method for Hardin & Rocke F results
Description
Summary method for Hardin & Rocke F results
Usage
## S3 method for class 'F_result'
summary(object, ...)
Arguments
object |
An object of class "F_result" or "HR_result" |
... |
Additional arguments passed to method-specific functions. |
Value
NULL, invisibly
Summary method for MI_boot results
Description
Summary method for MI_boot results
Usage
## S3 method for class 'MI_boot_result'
summary(object, ...)
Arguments
object |
An object of class "MI_boot_result" |
... |
Additional arguments passed to method-specific functions. |
Value
NULL, invisibly
Summary method for MI results
Description
Summary method for MI results
Usage
## S3 method for class 'MI_result'
summary(object, ...)
Arguments
object |
An object of class "MI_result" |
... |
Additional arguments passed to method-specific functions. |
Value
NULL, invisibly
Summary method for SI_boot results
Description
Summary method for SI_boot results
Usage
## S3 method for class 'SI_boot_result'
summary(object, ...)
Arguments
object |
An object of class "SI_boot_result" |
... |
Additional arguments passed to method-specific functions. |
Value
NULL, invisibly
Summary method for SI results
Description
Summary method for SI results
Usage
## S3 method for class 'SI_result'
summary(object, ...)
Arguments
object |
An object of class "SI_result" |
... |
Additional arguments passed to method-specific functions. |
Value
NULL, invisibly
Thr parameter documentation
Description
Thr parameter documentation
Arguments
thr |
Threshold multiplier for outlier detection (default = 4). |
Fit F-distribution Parameters for MCD-based Robust Distances
Description
Computes the scaling constant and degrees of freedom for the F-distribution approximation of squared robust Mahalanobis distances based on the Minimum Covariance Determinant (MCD) estimator, following the method of Hardin & Rocke (2005).
Usage
thresh_F(p, n, h, quantile, RD_obj, SHASH = FALSE, verbose = FALSE)
Arguments
p |
Integer. The number of variables (dimension of the data). |
n |
Integer. The total sample size. |
h |
Integer. The number of observations retained in the MCD subset. |
quantile |
Numeric in (0,1) specifying the upper quantile for thresholding; the expected False Positive Rate for the chosen threshold. |
RD_obj |
Pre-computed RD_result object from |
SHASH |
Logical; if TRUE, run the SHASH variant (default = FALSE). |
verbose |
Logical; if TRUE, print progress messages. |
Details
This function is useful for deriving robust outlier detection thresholds in high-dimensional multivariate data contaminated by outliers.
Value
A list with the following elements:
- c
Consistency correction factor for robust distances.
- m
Estimated degrees of freedom parameter used in the F-distribution.
- df
A numeric vector of degrees of freedom: c(df1, df2).
- scale
Scale factor for the threshold.
- threshold
Threshold for squared robust distances.
- flagged_outliers
Integer vector of row indices from original data matrix that exceed the threshold.
- call
The matched function call.
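To show how the returned pieces could fit together, the sketch below turns a given consistency factor and degrees-of-freedom parameter into a threshold. It assumes the Hardin and Rocke relation c (m - p + 1) / (p m) × RD ~ F(p, m - p + 1) for squared distances; whether the package's scale element is defined exactly this way is an assumption, the helper name hr_threshold is hypothetical, and estimating c and m is not attempted here.

```r
# c_factor and m are taken as given; estimating them is outside this sketch.
hr_threshold <- function(p, m, c_factor, quantile = 0.01) {
  df <- c(p, m - p + 1)
  # Assumed relation: c (m - p + 1) / (p m) * RD ~ F(df[1], df[2]).
  scale <- (p * m) / (c_factor * (m - p + 1))
  list(df = df, scale = scale,
       threshold = scale * stats::qf(1 - quantile, df[1], df[2]))
}

out <- hr_threshold(p = 3, m = 50, c_factor = 0.8)  # illustrative values only
out$df         # 3 48
out$threshold  # cut-off on the squared-distance scale
```

A smaller quantile gives a larger threshold, so fewer observations are flagged.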
Outlier Detection via Multiple Imputation Voting (MI)
Description
Applies robust distance (RD) computation to multiply imputed datasets, derives thresholds, and flags outliers via majority voting. Also computes the lower bound of the 95% confidence interval of the (1 - alpha) quantiles across imputations.
Usage
thresh_MI(
RD_org_obj,
imp_datasets,
alpha = 0.01,
boot_quant = 0.95,
verbose = FALSE
)
Arguments
RD_org_obj |
Output list from |
imp_datasets |
A list of M numeric matrices (T × p); multiply imputed datasets. |
alpha |
Significance level used to compute RD threshold (default = 0.01 for 99th percentile). |
boot_quant |
Numeric; confidence level for bootstrap confidence intervals (default = 0.95, for 95% CI). |
verbose |
Logical; if TRUE, print progress messages. |
Value
A list with:
- thresholds
Numeric vector of length M; (1 - alpha) quantiles of RD per imputed dataset.
- threshold
Lower bound of the confidence interval of thresholds.
- call
The matched function call.
- flagged_outliers
Integer vector of row indices from original data matrix that exceed the threshold.
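The two outputs described above (per-imputation quantiles and a lower confidence bound) can be illustrated with simulated distances. Everything in this sketch is an assumption made for illustration: chi-squared draws stand in for the squared robust distances, the confidence interval is a simple percentile interval across the M thresholds, and a row is flagged by vote when it exceeds more than half of the thresholds.

```r
set.seed(1)
M <- 20; n <- 200; alpha <- 0.01; boot_quant <- 0.95
# Stand-ins for the squared robust distances of each imputed dataset.
rd_imputed <- replicate(M, rchisq(n, df = 3), simplify = FALSE)
rd_original <- c(rchisq(n - 2, df = 3), 60, 75)  # two planted outliers

# One (1 - alpha) quantile per imputed dataset.
thresholds <- vapply(rd_imputed,
                     function(rd) unname(quantile(rd, 1 - alpha)),
                     numeric(1))
# Assumed CI construction: percentile interval across the M thresholds.
threshold <- unname(quantile(thresholds, (1 - boot_quant) / 2))

# Majority vote: flag rows exceeding more than half of the M thresholds.
votes <- vapply(thresholds, function(t) rd_original > t, logical(n))
flagged_vote <- which(rowMeans(votes) > 0.5)
# Alternative: flag rows exceeding the lower confidence bound.
flagged_lb <- which(rd_original > threshold)
```

Because the lower bound is no larger than the typical per-imputation threshold, every row flagged by the vote is also flagged by the lower-bound rule.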
Bootstrap-Based Outlier Detection via Multiple Imputation (MI_boot)
Description
Extends single imputation bootstrapping by using multiple imputation. For each of the M imputed datasets:
- Generate B bootstrap samples (with replacement) from the included (non-outlier) indices
- Compute robust distances (RD) for each sample
- Extract the (1 - alpha) quantile of RD for each sample
Usage
thresh_MI_boot(
RD_org_obj,
imp_datasets,
B = 1000,
alpha = 0.01,
boot_quant = 0.95,
verbose = FALSE
)
Arguments
RD_org_obj |
Output list from |
imp_datasets |
A list of M numeric matrices (T × p); multiply imputed datasets. |
B |
Integer; number of bootstrap samples per imputed dataset (default = 1000). |
alpha |
Significance level used to compute RD threshold (default = 0.01 for 99th percentile). |
boot_quant |
Numeric; confidence level for bootstrap confidence intervals (default = 0.95, for 95% CI). |
verbose |
Logical; if TRUE, print progress messages. |
Details
This yields M × B threshold candidates. The lower bound of their confidence interval at level boot_quant (95% by default) is used as the final threshold, which is then applied to the RD of the original data.
Value
A list with:
- thresholds
Vector of M × B thresholds, one per bootstrap sample.
- threshold
Lower bound of CI across thresholds.
- flagged_outliers
Integer vector of row indices from original data matrix that exceed the threshold.
- call
The matched function call.
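The M × B resampling scheme can be sketched in base R. This is an illustration under stated assumptions rather than the package's code: the clean indices are taken as known, each bootstrap sample uses its own classical mean and covariance for the distances, and the confidence interval is a percentile interval.

```r
set.seed(1)
M <- 3; B <- 50; alpha <- 0.01; boot_quant <- 0.95  # small sizes for speed
n <- 150; p <- 2
imp_datasets <- replicate(M, matrix(rnorm(n * p), n, p), simplify = FALSE)
ind_incld <- 1:120                                  # assumed clean rows

# B bootstrap thresholds for one imputed dataset.
boot_one <- function(dat) {
  replicate(B, {
    idx <- sample(ind_incld, replace = TRUE)        # resample clean indices
    xb <- dat[idx, , drop = FALSE]
    rd <- stats::mahalanobis(xb, colMeans(xb), stats::cov(xb))
    unname(quantile(rd, 1 - alpha))                 # (1 - alpha) quantile of RD
  })
}

thresholds <- unlist(lapply(imp_datasets, boot_one))  # M * B candidates
threshold <- unname(quantile(thresholds, (1 - boot_quant) / 2))
```

Taking the lower end of the interval makes the final cut-off conservative in the sense that it flags more observations than the median candidate would.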
Compute SI Threshold for Outlier Detection
Description
Computes a robust distance (RD) threshold based on single imputation (SI), using the robust covariance from the original data (via RD_org_obj) and recomputed mean from the imputed data.
Usage
thresh_SI(RD_org_obj, imp_data, alpha = 0.01, verbose = FALSE)
Arguments
RD_org_obj |
Output list from |
imp_data |
A numeric matrix (T × p) of single-imputed data. |
alpha |
Significance level used to compute RD threshold (default = 0.01 for 99th percentile). |
verbose |
Logical; if TRUE, print progress messages. |
Value
A list with:
- SI_obj
A list from compute_RD containing robust distances.
- threshold
Numeric threshold based on the (1 - alpha) quantile of RD.
- flagged_outliers
Integer vector of row indices from original data matrix that exceed the threshold.
- call
The matched function call.
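The description above (covariance kept from the original data, mean recomputed from the imputed data) can be sketched as follows. The clean subset, the stand-in imputation, and the choice to recompute the mean over the included rows are all assumptions made for illustration.

```r
set.seed(1)
n <- 200; p <- 3; alpha <- 0.01
x <- matrix(rnorm(n * p), n, p)
x[1:2, ] <- x[1:2, ] + 10               # two multivariate outliers
ind_incld <- 3:n                        # assumed clean subset
S_star <- stats::cov(x[ind_incld, ])    # covariance kept from the original data

imp_data <- x
imp_data[1:2, ] <- 0                    # stand-in for a single imputation

# Mean recomputed from the imputed data; covariance reused.
xbar_imp <- colMeans(imp_data[ind_incld, ])
rd_imp <- stats::mahalanobis(imp_data, xbar_imp, S_star)
threshold <- unname(quantile(rd_imp, 1 - alpha))

# Apply the threshold to the distances of the original data.
rd_org <- stats::mahalanobis(x, colMeans(x[ind_incld, ]), S_star)
flagged_outliers <- which(rd_org > threshold)
```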
Compute SI Boot Thresholds for Outlier Detection
Description
Computes a robust distance (RD)–based threshold using single-imputed data followed by bootstrap resampling over clean (included) indices. Returns the confidence interval bounds of the bootstrapped (1 - alpha) quantiles (99th percentiles when alpha = 0.01).
Usage
thresh_SI_boot(
RD_org_obj,
imp_data,
B = 1000,
alpha = 0.01,
boot_quant = 0.95,
verbose = FALSE
)
Arguments
RD_org_obj |
Output list from |
imp_data |
A numeric matrix (T × p) of single-imputed data. |
B |
Integer; number of bootstrap samples per imputed dataset (default = 1000). |
alpha |
Significance level used to compute RD threshold (default = 0.01 for 99th percentile). |
boot_quant |
Numeric; confidence level for bootstrap confidence intervals (default = 0.95, for 95% CI). |
verbose |
Logical; if TRUE, print progress messages. |
Value
A list with:
- thresholds
Vector of (1 - alpha) quantiles of RD (99th percentiles when alpha = 0.01), one per bootstrap sample.
- threshold
Threshold based on lower bound of the confidence interval.
- flagged_outliers
Integer vector of row indices from original data matrix that exceed the threshold.
- UB_CI
Upper bound of the confidence interval for the (1 - alpha) quantiles (99th percentiles when alpha = 0.01).
- call
The matched function call.
Thresh_result parameter documentation
Description
Thresh_result parameter documentation
Arguments
thresh_result |
A threshold result object from any threshold method containing threshold information. |
Comprehensive Outlier Detection Using Robust Distance Thresholding
Description
Performs univariate outlier detection and imputation, computes robust distances, and applies one or all of the thresholding methods.
Usage
threshold_RD(
x,
w = NULL,
method = c("SI_boot", "MI", "MI_boot", "SI", "F", "SHASH", "all"),
RD_obj = NULL,
impute_method = "mean",
cutoff = 4,
trans = "SHASH",
M = 50,
k = 100,
alpha = 0.01,
quantile = 0.01,
verbose = FALSE,
boot_quant = 0.95,
B = 1000
)
Arguments
x |
A numeric matrix or data frame of dimensions T × p (observations × variables). |
w |
A numeric matrix (n_time × L) of low-kurtosis ICA components used as predictors (required for MI). |
method |
Character string; one of "all","SI","SI_boot","MI","MI_boot","F", "SHASH". |
RD_obj |
Pre-computed RD_result object from |
impute_method |
Character string; imputation method for univariate outliers. |
cutoff |
A numeric value indicating how many MADs away from the median to flag as outliers. The default value is set to be 4. |
trans |
Character string; transformation method, one of "SHASH" or "robZ". |
M |
Integer; number of multiply imputed datasets (default = 50). |
k |
Integer; number of perturbation cycles per imputation (default = 100). |
alpha |
Significance level used to compute RD threshold (default = 0.01 for 99th percentile). |
quantile |
Numeric in (0,1) specifying the upper quantile for thresholding; the expected False Positive Rate for the chosen threshold. |
verbose |
Logical; if TRUE, print progress messages. |
boot_quant |
Numeric; confidence level for bootstrap confidence intervals (default = 0.95, for 95% CI). |
B |
Integer; number of bootstrap samples per imputed dataset (default = 1000). |
Value
A list with:
- thresholds
Result from the specific threshold method, or list of all methods if "all".
- RD_obj
The robust distance object from compute_RD().
- call
The matched function call.
Trans parameter documentation
Description
Trans parameter documentation
Arguments
trans |
Character string; transformation method, one of "SHASH" or "robZ". |
Temporal univariate outlier detection using SHASH or robust MAD.
Description
Detects univariate outliers across time for each variable (column) using one of two methods:
- "SHASH": Applies SHASH transformation and detects outliers using isolation forest.
- "robZ": Applies robust z-score outlier detection using median and MAD.
Usage
univOut(x, cutoff = 4, method = c("SHASH", "robZ"))
Arguments
x |
A numeric matrix or data frame of dimensions T × p (observations × variables). |
cutoff |
A numeric value indicating how many MADs away from the median to flag as outliers. The default value is set to be 4. |
method |
Character string. One of |
Value
A list with elements:
- outliers
Logical matrix of the same dimension as x, indicating detected outlier locations (TRUE = outlier).
- method
A character string indicating the transformation method used.
- call
The matched function call.
Verbose parameter documentation
Description
Verbose parameter documentation
Arguments
verbose |
Logical; if TRUE, print progress messages. |
W parameter documentation
Description
W parameter documentation
Arguments
w |
A numeric matrix (n_time × L) of low-kurtosis ICA components used as predictors (required for MI). |
X parameter documentation
Description
X parameter documentation
Arguments
x |
A numeric matrix or data frame of dimensions T × p (observations × variables). |