`adjclust`

packageThis document has two parts:

the first part aims at clarifying relations between dissimilarity and similarity methods for hierarchical agglomerative clustering (HAC) and at explaining implementation choices in

`adjclust`

;the second part describes the different types of dendrograms that are implemented in

`plot.chac`

.

In this document, we assume to be given \(n\) objects, \(\{1, \ldots, n\}\) that have to be clustered using adjacency-constrained HAC (CHAC), that is, in such a way that only adjacent objects/clusters can be merged.

We refer to [5] for a comprehensive treatment of the applicability and interpretability of Ward’s hierarchical agglomerative clustering with or without contiguity constraints.

`adjclust`

The basic implementation of `adjclust`

takes, as an input,
a kernel \(k\) which is supposed to be
symmetric and positive (in the kernel sense). If your data are under
this format, then the constrained clustering can be performed with

`fit <- adjClust(k, type = "similarity")`

or with

`fit <- adjClust(k, type = "similarity", h = h)`

if, in addition, the kernel \(k\) is
supposed to have only null entries outside of a diagonal of size
`h`

.

The implementation is the one described in [1].

In this section, the available data set is a matrix \(s\) that can either have only positive
entries (in this case it is called a similarity) or both positive and
non-positive entries. If, in addition, the matrix \(s\) is *normalized*, *i.e.*,
\(s(i,i) + s(j,j) - 2s(i,j) \geq 0\)
for all \(i,j=1,\ldots,n\) then the
algorithm implemented in `adjclust`

can be applied directly,
similarly as for a standard kernel (section 1). This section explains
why this is the case.

The interpretation is similar to the kernel case, under the
assumption that small similarity values or similarity values that are
strongly negative are less expected to be clustered together than large
similarity values. The application of the method is justified by the
fact that, for a given matrix \(s\)
described as above, we can find a \(\lambda
> 0\) such that the matrix \(k_\lambda\) defined by \[
\forall\,1,\ldots,n,\qquad k_\lambda(i,j) = s(i,j) + \lambda
\mathbb{1}_{\{i=j\}}
\] is a kernel (*i.e.*, the matrix \(k = s + \lambda I\) is positive definite;
indeed, it is the case for any \(\lambda\) larger than the opposite of the
smallest negative eigenvalue of \(s\).
[3] shows that the HAC obtained from the distance induced by the kernel
\(k_\lambda\) in its feature space and
the HAC obtained from the *ad hoc* dissimilarity defined by \[
\forall\, i,j=1,\ldots,n,\qquad d(i,j) = \sqrt{s(i,i) + s(j,j) -
2s(i,j)}
\] are identical, except that all the merging levels are shifted
by \(\lambda\).

In conclusion, to address this case, the command lines that have to be used are the ones described in section 1.

Suppose now that the data set is described by a matrix \(s\) as in the previous section except that this similarity matrix is not normalized, meaning that, there is at least one pair \((i,j)\), such that \[ 2s(i,j) > s(i,i) + s(j,j). \]

The package then performs the following pre-transformation: a matrix
\(s^{*}\) is defined as \[
\forall\,i,j=1,\ldots,n,\qquad s^{*}(i,j) = s(i,j) + \lambda
\mathbb{1}_{\{i=j\}}
\] for a \(\lambda\) large
enough to ensure that \(s^{*}\) becomes
normalized. In the package, \(\lambda\)
is chosen as \[
\lambda := \epsilon + \max_{i,j} \left(2s(i,j) - s(i,i) -
s(j,j)\right)_+
\] for a small \(\epsilon >
0\). This case is justified by the property described in Section
2.1 (Non-positive but normalized similarities). The underlying idea is
that, shifting the diagonal entries of a similarity matrix does not
change HAC result and thus they can be shifted until they induce a
proper *ad-hoc* dissimilarity matrix. The transformation affects
only the heights to ensure that they are all positive and the two
command lines described in the first section of this note are still
valid.

The original implementation of (unconstrained) HAC in
`stats::hclust`

takes as input a dissimilarity matrix.
However, the implementation of `adjclust`

is based on a
kernel/similarity approach. We describe in this section how the
dissimilarity case is handled.

Suppose given a dissimilarity \(d\) which satisfies:

\(d\) has non negative entries: \(d(i,j) \geq 0\) for all \(i=1,\ldots,n\);

\(d\) is symmetric: \(d(i,j) = d(j,i)\) for all \(i,j=1,\ldots,n\);

\(d\) has a null diagonal: \(d(i,i) = 0\) for all \(i=1,\ldots,n\).

Any sequence of positive numbers \((a_i)_{i=1,\ldots,n}\) would provide a
similarity \(s\) for which \(d\) is the *ad-hoc* dissimilarity by
setting: \[
\left\{ \begin{array}{l}
s(i,i) = a_i\\
s(i,j) = \frac{1}{2} (a_i + a_j - d^2(i,j))
\end{array} \right. .
\] By definition, such an \(s\)
is normalized and any choice for \((a_i)_{i=1,\ldots,n}\) yields the same
clustering (since they all correspond to the same *ad-hoc*
dissimilarity). The arbitrary choice \(a_i =
1\) for all \(i=1,\ldots,n\) has
thus been made.

The basic and the sparse implementations are both available with, respectively,

`fit <- adjClust(d, type = "dissimilarity")`

and

`fit <- adjClust(d, type = "dissimilarity", h = h)`

In this section, we suppose given an Euclidean distance \(d\) between objects (even though the
results described in this section are not specific to this case, they
are described more easily using this framework). Ward’s criterion, that
is implemented in `adjclust`

aims at minimizing the Error Sum
of Squares (ESS) which is equal to: \[
\mbox{ESS}(\mathcal{C}) = \sum_{C \in \mathcal{C}} \sum_{i \in C}
d^2(i, g_C)
\] where \(\mathcal{C}\) is the
clustering and \(g_C = \frac{1}{\mu_C} \sum_{i
\in C} i\) is the center of gravity of the cluster \(C\) with \(\mu_C\) elements [6]. In the sequel, we
will denote:

*within-cluster dispersion*which, for a given cluster \(C\), is equal to \[ I(C) = \sum_{i \in C} d^2(i, g_C). \] We can prove that \(I(C) = \frac{1}{2\mu_C} \sum_{i,j \in C} d^2(i,j)\) (see [4] for instance);*average within-cluster dispersion*which is equal to \(\frac{I(C)}{\mu_C}\) and corresponds to the cluster variance.

Usually, the results of standard HAC are displayed under the form of
a dendrogram for which the heights of the different merges correspond to
the linkage criterion \[
\delta(A,B) = I(A \cup B) - I(A) - I(B)
\] of that merge. This criterion corresponds to the increase in
total dispersion (ESS) that occurs by merging the two clusters \(A\) and \(B\). However, for constrained HAC, there is
no guaranty that this criterion is non decreasing (see [2] for instance)
and thus, the dendrogram build using this method can contain reversals
in its branches. This is the default option in `plot.chac`

(that corresponds to `mode = "standard"`

). To provide
dendrograms that are easier to interpret, alternative options have been
implemented in the package: the first one is a simple correction of the
standard method, and the three others are suggested by [3].

In the sequel, we denote by \((m_t)_{t=1,\ldots,n-1}\) the series of linkage criterion values obtained during the clustering.

`mode = "corrected"`

This option simply corrects the heights by adding the minimal value making them non decreasing. More precisely, if at a given step \(t \in \{2,\ldots,n-1\}\) of the clustering, we have that \(m_t < m_{t-1}\) then, we define the corrected weights as: \[ \tilde{m}_{t'} = \left\{ \begin{array}{ll} m_{t'} & \textrm{if } t' < t\\ m_{t'} + (m_{t-1} - m_t) & \textrm{otherwise} \end{array} \right.. \] This correction is iteratively performed for all decreasing merges, ensuring a visually increasing dendrogram.

`mode = "total-disp"`

This option represents the dendrogram using the total dispersion (that is the objective function) at every level of the clustering. It can easily be proved that the total dispersion is equal to ESS\(_t = \sum_{t' \leq t} m_{t'}\) and that this quantity is always non decreasing. This is the quantity recommended by [2] to display the dendrogram.

`mode = "within-disp"`

This option represents a cluster specific criterion by using the within cluster dispersion of the two clusters being merged at every given step of the algorithm. It can be proved that this quantity is also non decreasing, but it is depends strongly on the cluster size, leading to flattened dendrogram in most cases.

`mode = "average-disp"`

This last option addresses the problem of the dependency to cluster
sizes posed by the previous method (`"within-disp"`

) by using
the average within-cluster dispersion of the two clusters being merged
at every given step of the algorithm. This criterion is also a cluster
specific one but does not guaranty the absence of reversals in
heights.

As documented in [4], the call to
`hclust(..., method = "ward.D")`

implicitly assumes that
`...`

is a *squared* distance matrix. As explained
above, we did not make such an assumption so
`hclust(d^2, method = "ward.D")`

and
`adjClust(d, method = "dissimilarity")`

give identical
results when the ordering of the (unconstrained) clustering is
compatible with the natural ordering of objects used as a constraint. In
addition, since `hclust(..., method = "ward.D2")`

takes for
linkage \(\sqrt{m_t}\),
`hclust(d, method = "ward.D2")`

and
`adjClust(d, method = "dissimilarity")`

give identical
results for the merges and the slot `height`

of the first is
the square root of the slot `height`

of the second, when the
ordering of the (unconstrained) clustering is compatible with the
natural ordering of objects used as a constraint.

Finally, `rioja`

uses ESS\(_t\) to display the heights of the
dendrogram (because, as documented above, this quantity is non
decreasing, in the Euclidean case, even for constrained clusterings).
Hence, `rioja(d, method = "coniss")`

and
`adjClust(d, method = "dissimilarity")`

give identical
results for the merges and the slot `height`

of the first is
the cumulative sum of the slot `height`

of the second.

[1] Ambroise C., Dehman A., Neuvial P., Rigaill G., and Vialaneix N.
(2019). Adjacency-constrained hierarchical clustering of a band
similarity matrix with application to genomics. *Algorithms for
Molecular Biology*, **14**, 22.

[2] Grimm, E.C. (1987) CONISS: a fortran 77 program for
stratigraphically constrained cluster analysis by the method of
incremental sum of squares. *Computers & Geosciences*,
**13**(1), 13-35.

[3] Miyamoto S., Abe R., Endo Y., Takeshita J. (2015) Ward method of
hierarchical clustering for non-Euclidean similarity measures. In:
*Proceedings of the VIIth International Conference of Soft Computing
and Pattern Recognition* (SoCPaR 2015).

[4] Murtagh, F. and Legendre, P. (2014) Ward’s hierarchical
agglomerative clustering method: which algorithms implement Ward’s
criterion? *Journal of Classification*, **31**,
274-295.

[5] Randriamihamison N., Vialaneix N., & Neuvial P. (2020).
Applicability and interpretability of Ward’s hierarchical agglomerative
clustering with or without contiguity constraints. *Journal of
Classification* **38**, 1-27.

[6] Ward, J.H. (1963) Hierarchical grouping to optimize an objective
function. *Journal of the American Statistical Association*,
**58**(301), 236-244.