Sample Provenance Quality Resolver in Proteomics —
native R port of the Python
spqrp package.
Recent advances in mass spectrometry (MS) technology and lab methods have
opened the door to large-scale proteomics, but they have also raised growing
concern about sample mix-ups. spqrp helps you evaluate whether sample
data is safe for further analysis by clustering samples and flagging
probable mix-ups, uncertain assignments, and outliers.
# install.packages("remotes")
remotes::install_github("fhradilak/spqrp_r")

No Python install is needed: this is a native R port.
A long-format data frame with these columns:
| Column | Description |
|---|---|
| Sample_ID | Unique sample identifier |
| Patient_ID | Patient identifier |
| Protein | Protein name/identifier |
| Intensity | Numeric intensity value |
Optionally, supply a protein ranking with Protein and
Importance columns. If you don’t supply one, the package
uses a precomputed ranking from a plasma cohort
(spqrp_example_data("ranking_cohort_a")).
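
As a rough illustration of the expected shape, the sketch below builds a tiny long-format data frame in base R and checks it for the four required columns. The helper `has_required_columns()` and all values are hypothetical and exist only for this example; the package's own validator is `check_input_data_format()`, whose exact behaviour is not shown here.

```r
# Toy long-format input: one row per (sample, protein) measurement.
# All values below are made up for illustration.
toy_df <- data.frame(
  Sample_ID  = rep(c("S1", "S2", "S3"), each = 2),
  Patient_ID = rep(c("P1", "P1", "P2"), each = 2),
  Protein    = rep(c("ALB", "APOA1"), times = 3),
  Intensity  = c(10.2, 8.1, 10.0, 8.3, 12.5, 6.9),
  stringsAsFactors = FALSE
)

# Hypothetical helper (not part of spqrp): TRUE when every required
# column is present and Intensity is numeric.
has_required_columns <- function(df) {
  required <- c("Sample_ID", "Patient_ID", "Protein", "Intensity")
  all(required %in% names(df)) && is.numeric(df$Intensity)
}

has_required_columns(toy_df)
```

Each sample contributes one row per measured protein, so a cohort with many proteins becomes a tall table rather than a wide matrix.
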
library(spqrp)
df <- spqrp_example_data("input_cohort_df")
ranking <- spqrp_example_data("protein_ranking")
# Clustering: build kNN graph, split big components, visualise
res <- run_clustering(
df = df, ranking = ranking,
n_neighbors = 1L,
max_component_size = 2L,
metric = "manhattan",
method = "UMAP" # or "PCA" / "MDS"
)
res$cluster_assignments # sample -> cluster ID
res$uncertain_samples # likely missing connections
res$error_candidate_samples # likely sample mix-ups
res$plot # ggplot object

All spqrp functions are silent by default: no progress
messages, no per-call summaries. If you want progress and diagnostic
prints (which sample IDs got flagged, what cutoff was picked, how many
proteins overlapped between ranking and data, etc.), pass
quiet = FALSE to any function that emits status output:
remove_outlier_samples(df, quiet = FALSE) # prints flagged Sample_IDs
run_clustering(df, ranking, n_neighbors = 1,
max_component_size = 3, quiet = FALSE) # prints save-path hint,
# cluster listing,
# transitive metrics
perform_distance_evaluation_on_ranked_proteins(
df, top_importance_df = ranking, quiet = FALSE
) # prints real-protein count

Warnings about genuine data issues (e.g. samples dropped because
they lack measurements for any of the top-ranked proteins) fire
regardless of quiet, because they signal a
real problem you need to see.
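
The split between optional status output and unconditional warnings can be sketched in base R as follows. This is a minimal illustration of the convention, not the package's code: `drop_incomplete_samples()` and the toy data are hypothetical, and a missing intensity stands in for whatever condition the real functions check.

```r
# Hypothetical function (not part of spqrp) showing the convention:
# status output goes through message() and is gated by `quiet`,
# while data problems always go through warning().
drop_incomplete_samples <- function(df, quiet = TRUE) {
  incomplete <- unique(df$Sample_ID[is.na(df$Intensity)])
  if (!quiet) {
    message("Checked ", length(unique(df$Sample_ID)), " samples")
  }
  if (length(incomplete) > 0) {
    warning("Dropping samples with missing intensities: ",
            paste(incomplete, collapse = ", "))
  }
  df[!df$Sample_ID %in% incomplete, , drop = FALSE]
}

toy <- data.frame(
  Sample_ID = c("S1", "S1", "S2", "S2"),
  Protein   = c("A", "B", "A", "B"),
  Intensity = c(1.0, 2.0, NA, 3.0),
  stringsAsFactors = FALSE
)

# Silent call: no message is printed, but the warning is still raised.
kept <- withCallingHandlers(
  drop_incomplete_samples(toy),
  warning = function(w) {
    cat("warning seen:", conditionMessage(w), "\n")
    invokeRestart("muffleWarning")
  }
)
```

Routing status text through `message()` also lets callers silence it with `suppressMessages()` without hiding warnings.
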
If you don’t have a precomputed ranking, train one. The Python
package uses imblearn.ensemble.BalancedRandomForestClassifier; this
R port exposes three substitute backends so you can pick the tradeoff
that fits:
results <- train_with_normalise(
df,
classifier_backend = "randomForest" # default — closest to imblearn's BalancedRF
# classifier_backend = "ranger" # faster, class.weights on impurity
# classifier_backend = "themis_smote" # SMOTE rebalance + ranger
)
new_ranking <- retrieve_ranking(results)

See articles/numerical-divergence.md
for when to pick each.
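
For intuition about what a balanced random forest does differently from a plain one, the sketch below draws a single class-balanced bootstrap sample in base R: every class contributes as many rows as the smallest class. It shows only the resampling idea, on the assumption that the backends address class imbalance in a comparable spirit; `balanced_bootstrap()` and the labels are hypothetical and not taken from the package.

```r
# Hypothetical helper (not part of spqrp): row indices for one bootstrap
# sample in which every class is drawn n_min times with replacement,
# where n_min is the size of the smallest class.
balanced_bootstrap <- function(labels) {
  idx_by_class <- split(seq_along(labels), labels)
  n_min <- min(lengths(idx_by_class))
  unlist(lapply(idx_by_class, function(idx) {
    idx[sample.int(length(idx), n_min, replace = TRUE)]
  }), use.names = FALSE)
}

set.seed(1)
# Made-up, heavily imbalanced labels.
labels <- c(rep("same_patient", 5), rep("different_patient", 95))
boot <- balanced_bootstrap(labels)
table(labels[boot])
```

A forest built this way would repeat the draw once per tree, so each tree sees both classes equally often even when one class is rare.
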
result <- perform_distance_evaluation_on_ranked_proteins(
df = df,
top_importance_df = ranking,
metric = "manhattan",
p = 0.989,
n = 20L
)
result$cutoff
result$eval_metrics[c("TP", "FP", "FN", "TN", "Precision", "Sensitivity", "F1")]

optimize_parameters() sweeps n and the
percentile cutoff to find optimal values for your dataset.
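
To make the idea of a percentile cutoff on pairwise distances concrete, here is a small base-R sketch on made-up data. It assumes that pairs at or below a chosen percentile of all pairwise Manhattan distances are predicted to come from the same patient; how the package actually interprets `p` and `n` is not specified here, so the 20th-percentile rule and all variable names are illustrative only.

```r
# Toy intensity matrix: rows are samples, columns are proteins (made-up values).
x <- rbind(
  S1 = c(10.0, 5.0, 7.0),
  S2 = c(10.1, 5.1, 7.1),   # same patient as S1
  S3 = c(12.0, 3.0, 9.0),
  S4 = c(12.1, 3.1, 8.9),   # same patient as S3
  S5 = c( 8.0, 6.5, 5.0),
  S6 = c( 8.1, 6.4, 5.1)    # same patient as S5
)
patient <- c(S1 = "P1", S2 = "P1", S3 = "P2", S4 = "P2", S5 = "P3", S6 = "P3")

# Pairwise Manhattan distances, one entry per unordered sample pair.
d <- as.matrix(dist(x, method = "manhattan"))
pairs <- t(combn(rownames(x), 2))
dist_ab <- d[pairs]
same_patient <- patient[pairs[, 1]] == patient[pairs[, 2]]

# Assumption for this sketch: pairs at or below the 20th percentile of all
# pairwise distances are predicted to come from the same patient.
cutoff <- quantile(dist_ab, probs = 0.20, type = 7, names = FALSE)
predicted_same <- dist_ab <= cutoff

tp <- sum(predicted_same & same_patient)
fp <- sum(predicted_same & !same_patient)
fn <- sum(!predicted_same & same_patient)
tn <- sum(!predicted_same & !same_patient)

precision   <- tp / (tp + fp)
sensitivity <- tp / (tp + fn)
f1 <- 2 * precision * sensitivity / (precision + sensitivity)
c(TP = tp, FP = fp, FN = fn, TN = tn, F1 = f1)
```

On real data the two distance populations overlap, which is why sweeping the percentile and the number of proteins can change the metrics noticeably.
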
Optional helpers that mirror the Python pipeline:
df_pp <- df |>
log_transform() |>
filter_by_occurrence(cutoff = 0.7)
norm <- normalize_medianintensity(df_pp, plot = FALSE)
df_pp <- norm$data
# If your data has a `plate` column:
df_pp <- plate_correct_residuals_by_protein(df_pp)

| Function | Purpose |
|---|---|
| run_clustering() | End-to-end clustering pipeline |
| cluster_samples_iteratively() | Build kNN graph + 2D embedding |
| plot_distances_neighbours_with_coloring_hue() | Heavy clustering visualization |
| perform_distance_evaluation_on_ranked_proteins() | Threshold-based pairwise classification |
| optimize_parameters() | Grid-search optimal n and percentile |
| calculate_pairwise_distances() | Distance matrix on top-n proteins |
| train_with_normalise() | Full ranking pipeline (filter → normalize → RF) |
| retrieve_ranking() | Extract ranked proteins from a trained model |
| train_pairwise_balanced_rand_forest() | Pairwise RF (3 backends) |
| get_threshold() | ROC/F1/Youden/MinFP threshold selection |
| get_distances(), get_nearest_neighbours() | Distance + kNN helpers |
| get_sample_relations_by_cutoff(), get_evaluation_metrics() | Cutoff → metrics |
| percentile_cutoff() | numpy.percentile-equivalent |
| filter_by_occurrence(), log_transform(), revert_log_transform(), normalize_medianintensity(), plate_correct_residuals_by_protein() | Preprocessing |
| by_isolation_forest(), by_isolation_forest_plot(), remove_outlier_samples() | Outlier detection (Isolation Forest). contamination = 0.1 for sklearn-like behaviour. |
| spqrp_example_data() | Access bundled example CSVs |
| check_input_data_format() | Validate required columns |
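
The table describes percentile_cutoff() as numpy.percentile-equivalent. For reference, base R can reproduce numpy's default (linear interpolation) percentile with `quantile(..., type = 7)`. The wrapper `np_percentile()` below is a hypothetical name used only in this sketch and says nothing about the argument conventions of percentile_cutoff() itself.

```r
# Hypothetical wrapper (not part of spqrp): numpy takes q on a 0-100 scale,
# while quantile() takes probabilities on a 0-1 scale.
np_percentile <- function(x, q) {
  # type = 7 is R's default and uses the same linear interpolation
  # as numpy.percentile's default method.
  quantile(x, probs = q / 100, type = 7, names = FALSE)
}

x <- c(1, 2, 3, 4, 10)
np_percentile(x, 90)  # 7.6, the same value numpy.percentile(x, 90) gives
```
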
The R API mirrors the Python one: function names are identical
snake_case. Outputs are R named lists, which you access by name much
like Python dicts (for example, res$cluster_assignments).
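
A short sketch of how a named list compares with a Python dict in practice. The `res` object here is built by hand with made-up contents; only the field names are borrowed from the clustering example, and the element types are assumptions.

```r
# Made-up result object; field names follow the clustering example,
# contents and types are assumed for illustration.
res <- list(
  cluster_assignments = c(S1 = 1L, S2 = 1L, S3 = 2L),
  uncertain_samples   = character(0)
)

res$cluster_assignments        # like res["cluster_assignments"] on a Python dict
res[["cluster_assignments"]]   # same element, useful when the key is in a variable
names(res)                     # like dict.keys()
is.null(res$not_a_field)       # a missing name yields NULL instead of an error
```

Two differences are worth keeping in mind: `$` on a list partial-matches names, and a missing element returns NULL rather than raising the equivalent of a KeyError, so `[[` with an exact name is the stricter choice.
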
Because the underlying numerical libraries differ (uwot
vs umap-learn, ranger vs
imblearn, solitude (wrapping
ranger) vs sklearn’s IsolationForest), exact numbers can
differ between the R and Python implementations even with matched seeds. See articles/numerical-divergence.md
for which outputs are bit-exact, which match up to rotation/reflection,
and which are only equivalent in expectation.
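
One way to read "match up to rotation/reflection" is that two embeddings can have different coordinates while preserving every pairwise distance. The base-R sketch below demonstrates that property on made-up 2D coordinates; it is an illustration of the concept and is not taken from the referenced article.

```r
# Made-up 2D embedding coordinates for four samples.
emb_a <- rbind(c(0, 0), c(1, 0), c(0, 2), c(3, 1))

# Rotate by 40 degrees, then mirror the first axis.
theta <- 40 * pi / 180
rot  <- matrix(c(cos(theta), sin(theta), -sin(theta), cos(theta)), nrow = 2)
flip <- diag(c(-1, 1))
emb_b <- emb_a %*% rot %*% flip

# Coordinates differ, but every pairwise distance is preserved.
coords_equal    <- isTRUE(all.equal(emb_a, emb_b))
distances_equal <- isTRUE(all.equal(as.vector(dist(emb_a)),
                                    as.vector(dist(emb_b))))
c(coords_equal = coords_equal, distances_equal = distances_equal)
```

Comparing distance matrices (or neighbour lists) rather than raw coordinates is therefore a more meaningful check when validating an embedding across implementations.
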
GPL-3