This article summarizes the benchmark recorded in the
bench/ directory for bigKNN 0.3.0. The goal is
not to claim a universal winner. The goal is to show:
- How bigKNN compares with several exact dense k-nearest-neighbour implementations on the same Euclidean workloads
- How bigKNN behaves when the reference data is stored as a file-backed `bigmemory::big.matrix`

The benchmark driver is `bench/exact_knn_benchmark.R`, and the full recorded outputs are:

- `bench/exact_knn_benchmark_report.md`
- `bench/exact_knn_benchmark_results.csv`
- `bench/exact_knn_benchmark_validation.csv`

The dense exact comparator set in the recorded run was:

- `Rfast::dista(..., index = TRUE)`
- `FNN::get.knnx(..., algorithm = "brute")`
- `dbscan::kNN(..., approx = 0)`
- `RANN::nn2(..., eps = 0)`
- `nabor::knn(..., eps = 0)`
- `BiocNeighbors::queryKNN(..., BNPARAM = KmknnParam(distance = "Euclidean"))`

bigKNN itself was measured in three dense modes:

- `knn_bigmatrix()`
- `knn_search_prepared()`
- `knn_search_stream_prepared()`

For larger scale runs, the benchmark used file-backed `big.matrix` references and measured:

- `knn_prepare_bigmatrix()`
- `knn_search_prepared()`
- `knn_search_stream_prepared()`

The recorded benchmark was generated on:

- bigKNN 0.3.0
- bigmemory 4.6.4
- Rfast 2.1.5.2
- FNN 1.1.4.1
- dbscan 1.2.4
- RANN 2.6.2
- nabor 0.5.0
- BiocNeighbors 2.2.0

The benchmark covers three dense comparator cases and two larger file-backed scale cases.
| Case | n_ref | n_query | p | k | Reference size | Full dense pairwise matrix |
|---|---|---|---|---|---|---|
| dense_small | 5,000 | 250 | 16 | 10 | 0.61 MB | 0.01 GB |
| dense_medium | 20,000 | 500 | 16 | 10 | 2.44 MB | 0.07 GB |
| dense_large | 50,000 | 1,000 | 16 | 10 | 6.10 MB | 0.37 GB |
| filebacked_xlarge | 100,000 | 1,000 | 32 | 10 | 24.41 MB | 0.75 GB |
| filebacked_2xlarge | 200,000 | 1,000 | 32 | 10 | 48.83 MB | 1.49 GB |
The full_pairwise_gb column is intentionally included
because it highlights what bigKNN does not materialize: a
full dense query-by-reference distance matrix.
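That column follows from simple arithmetic: a dense double-precision distance matrix needs `n_ref * n_query * 8` bytes. A quick sketch (the helper name `pairwise_gb` is just for illustration) reproduces the recorded values:

```r
# GB required for a dense double-precision query-by-reference distance
# matrix -- the allocation bigKNN deliberately avoids materializing.
pairwise_gb <- function(n_ref, n_query) n_ref * n_query * 8 / 1024^3

round(pairwise_gb(50000, 1000), 2)    # dense_large:        0.37 GB
round(pairwise_gb(200000, 1000), 2)   # filebacked_2xlarge: 1.49 GB
```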
All dense exact comparators matched bigKNN on neighbour
indices for the recorded cases. All distance-capable dense comparators
matched on distances as well.
That matters because the benchmark is intended to compare exact
search implementations, not approximate recall. The validation table in
bench/exact_knn_benchmark_validation.csv is therefore as
important as the timing table.
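The agreement check itself is conceptually simple. A base-R sketch of the idea, on hypothetical toy data rather than the benchmark's own validation code: two independent exact implementations must return identical neighbour indices.

```r
# Two independent exact kNN implementations on the same toy data.
set.seed(1)
ref   <- matrix(rnorm(200 * 4), ncol = 4)
query <- matrix(rnorm(10 * 4),  ncol = 4)
k <- 5

# Implementation 1: full pairwise squared-distance matrix via the
# expansion |q - r|^2 = |q|^2 + |r|^2 - 2 q.r, then partial sort.
d2 <- outer(rowSums(query^2), rowSums(ref^2), "+") - 2 * tcrossprod(query, ref)
idx_a <- t(apply(d2, 1, function(row) order(row)[seq_len(k)]))

# Implementation 2: per-query loop with explicit Euclidean distances.
idx_b <- t(sapply(seq_len(nrow(query)), function(i) {
  d <- sqrt(colSums((t(ref) - query[i, ])^2))
  order(d)[seq_len(k)]
}))

# Exact searches must agree on neighbour indices.
stopifnot(all(idx_a == idx_b))
```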
The table below compares the recorded bigKNN_prepared
median with the fastest external exact comparator in each dense
case.
| Case | bigKNN_prepared median | Fastest external exact comparator |
|---|---|---|
| dense_small | 0.110 s | 0.042 s (FNN_brute and dbscan_kdtree tie) |
| dense_medium | 0.822 s | 0.328 s (RANN_kd, Rfast_index_only, and nabor_kd tie) |
| dense_large | 4.642 s | 1.289 s (nabor_kd) |
In the broader dense median ranking for the recorded run, the specialist comparators led throughout: on dense_large the fastest external package finished in about 1.29 s, while bigKNN_prepared remained around 4.64 s.
This is a sensible result. The dense comparator packages are
optimized for ordinary in-memory matrix search, while
bigKNN is designed around
bigmemory::big.matrix, reusable prepared references,
streaming, and larger workflows that do not assume everything should
round-trip through ordinary R matrices.
The scale portion of the benchmark focuses on the feature set that is
more specific to bigKNN: exact search on file-backed
big.matrix references and streamed output into destination
big.matrix objects.
| Case | bigKNN_prepare | bigKNN_prepared_search | bigKNN_stream |
|---|---|---|---|
| filebacked_xlarge (100,000 x 1,000 x 32) | 0.039 s | 9.714 s | 9.290 s |
| filebacked_2xlarge (200,000 x 1,000 x 32) | 0.078 s | 18.594 s | 18.533 s |
Two points stand out:

- Preparing the file-backed reference is essentially free (0.039 s and 0.078 s) compared with the search itself.
- Doubling the reference from 100,000 to 200,000 rows roughly doubles the search time (about 9.7 s to about 18.6 s), for both the prepared and streamed paths.
That second point is important in practice. It means the file-backed
exact path behaves predictably as the reference grows, which is exactly
the kind of workflow bigKNN is meant to support.
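That predictability is visible directly in the recorded medians: the per-reference-row search cost is nearly constant across the two file-backed cases.

```r
# Recorded prepared-search medians (seconds) from the scale table above.
t_search <- c(filebacked_xlarge = 9.714, filebacked_2xlarge = 18.594)
n_ref    <- c(filebacked_xlarge = 1e5,   filebacked_2xlarge = 2e5)

# Seconds per reference row: roughly constant, i.e. near-linear scaling.
t_search / n_ref
```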
These benchmark numbers should be interpreted in context:

- If you only need `k` neighbours back in ordinary R matrices, the specialist dense comparator packages can be faster.
- If you need `bigmemory::big.matrix` references, reusable prepared references, streamed output, resumable jobs, or larger file-backed data, bigKNN provides functionality that goes beyond the dense in-memory comparator set.

In other words, the benchmark is useful both for performance comparison and for positioning the package.
From the package root, rerun the benchmark with:
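A plausible invocation, assuming `Rscript` is on the `PATH` (any flags the recorded run used are not preserved in this article):

```shell
# Run the recorded benchmark driver from the package root.
Rscript bench/exact_knn_benchmark.R
```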
The script prefers benchmarking the source tree through
pkgload::load_all() when available, and otherwise falls
back to the installed bigKNN package.
Benchmark results are machine-specific. The recorded values in this article are best treated as a reproducible snapshot rather than a universal ranking.
The benchmark is also deliberately split into two stories:

- a like-for-like dense exact comparison against the specialist in-memory packages
- a scale story built around the file-backed `big.matrix`-oriented workflow of bigKNN

That split keeps the benchmark honest about what is truly comparable and what is specific to the package’s design goals.