Exact Graph Construction from big.matrix Data

bigKNN can build exact neighbour graphs directly from bigmemory::big.matrix references. These graph builders are useful when the exact neighbourhood structure is the real output you care about, not only the raw search result.

Typical uses include:

This vignette introduces the graph families available in bigKNN, the symmetrization rules they use, and the output formats that make them easy to inspect or hand off downstream.

Build a Small Reference Set

We will work with six points arranged as two well-separated local groups. That geometry keeps the graph outputs easy to read while still showing meaningful differences between directed, mutual, shared-nearest-neighbour, and radius graphs.

reference_points <- data.frame(
  id = paste0("p", 1:6),
  x1 = c(0.0, 0.3, 1.2, 4.0, 4.3, 5.2),
  x2 = c(0.0, 0.0, 0.0, 4.0, 4.1, 4.0)
)

reference <- as.big.matrix(as.matrix(reference_points[c("x1", "x2")]))

reference_points
#>   id  x1  x2
#> 1 p1 0.0 0.0
#> 2 p2 0.3 0.0
#> 3 p3 1.2 0.0
#> 4 p4 4.0 4.0
#> 5 p5 4.3 4.1
#> 6 p6 5.2 4.0

The returned graph objects always refer to rows of the reference matrix. In other words, vertices 1:6 correspond to p1:p6.

Directed k-NN graphs with knn_graph_bigmatrix()

A directed k-nearest-neighbour graph stores one outgoing edge per retained neighbour for every row. With include_distance = TRUE, each edge value is the exact distance between the source row and the neighbour row.

directed_knn <- knn_graph_bigmatrix(
  reference,
  k = 1,
  format = "edge_list",
  symmetrize = "none"
)

clean_graph(directed_knn)
#>   from to  distance
#> 1    1  2 0.3000000
#> 2    2  1 0.3000000
#> 3    3  2 0.9000000
#> 4    4  5 0.3162278
#> 5    5  4 0.3162278
#> 6    6  5 0.9055385

Here:

This is the most literal graph form: every query row keeps its own directed outgoing edges. If one row points to another but not vice versa, both facts are visible.

Mutual k-NN graphs with mutual_knn_graph_bigmatrix()

Mutual k-NN graphs keep only reciprocal neighbour relationships. They are often sparser and more conservative than directed k-NN graphs, because one-sided edges are dropped.

mutual_knn <- mutual_knn_graph_bigmatrix(
  reference,
  k = 1,
  format = "edge_list"
)

clean_graph(mutual_knn)
#>   from to  distance
#> 1    1  2 0.3000000
#> 2    4  5 0.3162278

In this example, only the pairs (1, 2) and (4, 5) survive. Vertices 3 and 6 each point toward their nearest local anchor, but that preference is not returned in the other direction, so those edges disappear in the mutual graph.

mutual_knn_graph_bigmatrix() is equivalent to calling knn_graph_bigmatrix(..., symmetrize = "mutual"):

identical(
  clean_graph(mutual_knn),
  clean_graph(knn_graph_bigmatrix(reference, k = 1, format = "edge_list", symmetrize = "mutual"))
)
#> [1] TRUE

Shared-nearest-neighbour graphs with snn_graph_bigmatrix()

Shared-nearest-neighbour graphs connect pairs of rows that have overlapping exact neighbour sets. Instead of storing distances, they store a similarity-like weight derived from neighbourhood overlap.

bigKNN currently supports two weight definitions:

snn_count <- snn_graph_bigmatrix(
  reference,
  k = 2,
  weight = "count",
  format = "edge_list"
)

snn_jaccard <- snn_graph_bigmatrix(
  reference,
  k = 2,
  weight = "jaccard",
  format = "edge_list"
)

merge(
  clean_graph(snn_count),
  clean_graph(snn_jaccard),
  by = c("from", "to"),
  suffixes = c("_count", "_jaccard")
)
#>   from to weight_count weight_jaccard
#> 1    1  2            1      0.3333333
#> 2    1  3            1      0.3333333
#> 3    2  3            1      0.3333333
#> 4    4  5            1      0.3333333
#> 5    4  6            1      0.3333333
#> 6    5  6            1      0.3333333

The edge set is the same for both weight choices in this example, but the weights differ:

That normalization can be helpful when you want weights that are easier to compare across regions of the graph with different neighbour-set structure.

Radius graphs with radius_graph_bigmatrix()

Radius graphs use a fixed distance threshold instead of a fixed k. That means the number of neighbours can vary from row to row.

radius_directed <- radius_graph_bigmatrix(
  reference,
  radius = 1.1,
  format = "edge_list",
  symmetrize = "none"
)

radius_union <- radius_graph_bigmatrix(
  reference,
  radius = 1.1,
  format = "edge_list",
  symmetrize = "union"
)

clean_graph(radius_directed)
#>   from to  distance
#> 1    1  2 0.3000000
#> 2    2  1 0.3000000
#> 3    2  3 0.9000000
#> 4    3  2 0.9000000
#> 5    4  5 0.3162278
#> 6    5  4 0.3162278
#> 7    5  6 0.9055385
#> 8    6  5 0.9055385
clean_graph(radius_union)
#>   from to  distance
#> 1    1  2 0.3000000
#> 2    2  3 0.9000000
#> 3    4  5 0.3162278
#> 4    5  6 0.9055385

Compared with fixed-k graphs:

In this self-search setting with a symmetric metric, symmetrize = "union" and symmetrize = "mutual" coincide because any radius edge that appears in one direction also appears in the other:

identical(
  clean_graph(radius_graph_bigmatrix(reference, radius = 1.1, format = "edge_list", symmetrize = "union")),
  clean_graph(radius_graph_bigmatrix(reference, radius = 1.1, format = "edge_list", symmetrize = "mutual"))
)
#> [1] TRUE

Symmetrization options and edge semantics

For k-NN and radius graph builders, symmetrize controls how directed edges are collapsed:

The difference is easiest to see on k = 1:

graph_none <- knn_graph_bigmatrix(reference, k = 1, format = "edge_list", symmetrize = "none")
graph_union <- knn_graph_bigmatrix(reference, k = 1, format = "edge_list", symmetrize = "union")
graph_mutual <- knn_graph_bigmatrix(reference, k = 1, format = "edge_list", symmetrize = "mutual")

data.frame(
  symmetrize = c("none", "union", "mutual"),
  n_edges = c(nrow(clean_graph(graph_none)), nrow(clean_graph(graph_union)), nrow(clean_graph(graph_mutual))),
  row.names = NULL
)
#>   symmetrize n_edges
#> 1       none       6
#> 2      union       4
#> 3     mutual       2

When distances are stored, bigKNN keeps the minimum directed distance for the collapsed undirected pair. When unit weights are stored instead (include_distance = FALSE), the retained value is the maximum weight, which remains 1 for unweighted graphs.

Converting outputs with as_edge_list(), as_triplet(), and as_sparse_matrix()

Different downstream tasks prefer different graph formats:

triplet_graph <- as_triplet(graph_mutual)
sparse_graph <- as_sparse_matrix(
  knn_graph_bigmatrix(
    reference,
    k = 1,
    format = "edge_list",
    symmetrize = "union",
    include_distance = FALSE
  )
)

triplet_graph
#> $i
#> [1] 1 4
#> 
#> $j
#> [1] 2 5
#> 
#> $x
#> [1] 0.3000000 0.3162278
#> 
#> $Dim
#> [1] 6 6
#> 
#> attr(,"class")
#> [1] "bigknn_graph_triplet" "bigknn_graph"        
#> attr(,"value_name")
#> [1] "distance"
#> attr(,"symmetric")
#> [1] TRUE
class(sparse_graph)
#> [1] "dgCMatrix"
#> attr(,"package")
#> [1] "Matrix"
Matrix::summary(sparse_graph)
#> 6 x 6 sparse Matrix of class "dgCMatrix", with 8 entries
#>   i j x
#> 1 2 1 1
#> 2 1 2 1
#> 3 3 2 1
#> 4 2 3 1
#> 5 5 4 1
#> 6 4 5 1
#> 7 6 5 1
#> 8 5 6 1

One subtle but important point: symmetric edge lists and triplets store each undirected pair once, while the sparse matrix representation stores both matrix entries (i, j) and (j, i). That is why the sparse summary above shows twice as many non-zero entries as the symmetrized edge list.

You can round-trip back to an edge list when needed:

clean_graph(as_edge_list(sparse_graph))
#>   from to weight
#> 2    1  2      1
#> 1    2  1      1
#> 4    2  3      1
#> 3    3  2      1
#> 6    4  5      1
#> 5    5  4      1
#> 8    5  6      1
#> 7    6  5      1

Using graph outputs downstream

Once a graph is in sparse-matrix form, ordinary matrix operations become useful immediately. For an unweighted symmetric adjacency, row sums give vertex degrees:

degree <- Matrix::rowSums(sparse_graph)

data.frame(
  vertex = reference_points$id,
  degree = as.numeric(degree),
  row.names = NULL
)
#>   vertex degree
#> 1     p1      1
#> 2     p2      2
#> 3     p3      1
#> 4     p4      1
#> 5     p5      2
#> 6     p6      1

The same graph could also be handed off to a separate graph package after conversion:

That keeps bigKNN focused on exact search and exact graph construction, without forcing a particular downstream graph ecosystem on the user.