Benchmarking bigKNN Against Other Exact Implementations

This article summarizes the benchmark recorded in the bench/ directory for bigKNN 0.3.0. The goal is not to claim a universal winner; it is to show:

The benchmark driver is bench/exact_knn_benchmark.R, and the full recorded outputs are:

Comparator Set

The dense exact comparator set in the recorded run was:

bigKNN itself was measured in three dense modes:

For larger scale runs, the benchmark used file-backed big.matrix references and measured:

Recorded Environment

The recorded benchmark was generated on:

Problem Sizes

The benchmark covers three dense comparator cases and two larger file-backed scale cases.

Case                 n_ref    n_query   p   k   Reference size   Full dense pairwise matrix
dense_small            5,000      250   16  10         0.61 MB                      0.01 GB
dense_medium          20,000      500   16  10         2.44 MB                      0.07 GB
dense_large           50,000    1,000   16  10         6.10 MB                      0.37 GB
filebacked_xlarge    100,000    1,000   32  10        24.41 MB                      0.75 GB
filebacked_2xlarge   200,000    1,000   32  10        48.83 MB                      1.49 GB

The full_pairwise_gb column is intentionally included because it highlights what bigKNN does not materialize: a full dense query-by-reference distance matrix.
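The figures in both size columns follow directly from 8-byte doubles, so they are easy to verify by hand. A minimal sketch (using MiB/GiB divisors, which match the recorded values):

```r
# Size of a dense double matrix: rows * cols * 8 bytes.
bytes_per_double <- 8

ref_mb  <- function(n_ref, p)       n_ref * p * bytes_per_double / 1024^2
pair_gb <- function(n_ref, n_query) n_ref * n_query * bytes_per_double / 1024^3

ref_mb(5000, 16)      # ~0.61 MB, matching dense_small
pair_gb(50000, 1000)  # ~0.37 GB, matching dense_large
```

The second function is the full_pairwise_gb quantity: what a comparator would need to hold if it materialized every query-to-reference distance at once.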

Correctness

All dense exact comparators matched bigKNN on neighbour indices for the recorded cases. All distance-capable dense comparators matched on distances as well.

That matters because the benchmark is intended to compare exact search implementations, not approximate recall. The validation table in bench/exact_knn_benchmark_validation.csv is therefore as important as the timing table.
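The validation idea is simple: two exact implementations must agree on neighbour indices, and distance-capable ones must agree on distances up to floating-point tolerance. The sketch below uses FNN, one of the recorded comparators; the bigKNN side is left as a commented placeholder, since this article does not document the package's API.

```r
# Two exact k-NN implementations on the same data must agree.
set.seed(1)
ref   <- matrix(rnorm(5000 * 16), ncol = 16)
query <- matrix(rnorm(250 * 16),  ncol = 16)
k <- 10

# One recorded comparator: brute-force exact search via FNN.
fnn <- FNN::get.knnx(ref, query, k = k, algorithm = "brute")

# Placeholder for the bigKNN result on the same inputs:
# bigknn <- ...  # exact search over `ref` with the same k
# stopifnot(identical(bigknn$index, fnn$nn.index))
# stopifnot(all(abs(bigknn$dist - fnn$nn.dist) < 1e-8))
```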

Dense Benchmark Snapshot

The table below compares the recorded bigKNN_prepared median with the fastest external exact comparator in each dense case.

Case           bigKNN_prepared median   Fastest external exact comparator
dense_small                   0.110 s   0.042 s (FNN_brute and dbscan_kdtree tie)
dense_medium                  0.822 s   0.328 s (RANN_kd, Rfast_index_only, and nabor_kd tie)
dense_large                   4.642 s   1.289 s (nabor_kd)

The broader dense median ranking in the recorded run was:

This is a sensible result. The dense comparator packages are optimized for ordinary in-memory matrix search, while bigKNN is designed around bigmemory::big.matrix, reusable prepared references, streaming, and larger workflows that do not assume everything should round-trip through ordinary R matrices.

File-Backed Scale Snapshot

The scale portion of the benchmark focuses on the feature set that is more specific to bigKNN: exact search on file-backed big.matrix references and streamed output into destination big.matrix objects.

Case                                        bigKNN_prepare   bigKNN_prepared_search   bigKNN_stream
filebacked_xlarge  (100,000 x 1,000 x 32)          0.039 s                  9.714 s         9.290 s
filebacked_2xlarge (200,000 x 1,000 x 32)          0.078 s                 18.594 s        18.533 s

Two points stand out:

That second point is important in practice. It means the file-backed exact path behaves predictably as the reference grows, which is exactly the kind of workflow bigKNN is meant to support.
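The file-backed setup itself uses standard bigmemory machinery. A minimal sketch of creating a reference at the filebacked_xlarge shape and re-attaching it later (the bigKNN calls themselves are omitted, since the article does not show the package's API):

```r
library(bigmemory)

# Create a file-backed reference matrix of the filebacked_xlarge shape:
# 100,000 rows x 32 features, stored as doubles on disk.
tmp <- tempdir()
ref <- filebacked.big.matrix(
  nrow = 100000, ncol = 32, type = "double",
  backingfile = "ref.bin", descriptorfile = "ref.desc",
  backingpath = tmp
)
ref[, ] <- rnorm(100000 * 32)

# A later session (or a worker process) re-attaches via the descriptor
# file without copying the data into an ordinary R matrix.
ref2 <- attach.big.matrix(file.path(tmp, "ref.desc"))
```

This is the shape of workflow the prepare/search split is built around: the cheap prepare step can be paid once per reference, and searches then run against the file-backed object directly.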

How To Read These Results

These benchmark numbers should be interpreted in context:

In other words, the benchmark is useful both for performance comparison and for positioning the package.

Reproducing The Benchmark

From the package root, rerun the benchmark with:

env LC_ALL=C LANG=C Rscript bench/exact_knn_benchmark.R

The script prefers benchmarking the source tree through pkgload::load_all() when available, and otherwise falls back to the installed bigKNN package.
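That preference is the standard conditional-dependency pattern in R, paraphrased here rather than copied from the driver:

```r
# Prefer benchmarking the source tree when pkgload is available;
# otherwise fall back to the installed bigKNN package.
if (requireNamespace("pkgload", quietly = TRUE)) {
  pkgload::load_all(".", quiet = TRUE)
} else {
  library(bigKNN)
}
```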

Caveats

Benchmark results are machine-specific. The recorded values in this article are best treated as a reproducible snapshot rather than a universal ranking.

The benchmark is also deliberately split into two stories:

That split keeps the benchmark honest about what is truly comparable and what is specific to the package’s design goals.