Hierarchical clustering can be categorized as either agglomerative or divisive, depending on the process of hierarchical decomposition. Agglomerative clustering, the bottom-up approach, considers each individual sample as one cluster and then merges the two most similar clusters. This process is repeated until all groups are combined into one at the highest level of the hierarchy, or until a termination condition is met. The output is a tree-like representation of the objects, which is also known as a dendrogram.
In this topic, we will focus on the theoretical foundations and various aspects of agglomerative clustering.
The general procedure
Agglomerative clustering builds the hierarchy from the individual samples by merging clusters based on a measure of similarity. The output is a set of nested clusters organized as a hierarchical tree (the dendrogram), with the leaf nodes representing individual samples and the root of the tree representing the entire dataset.
The procedure of agglomerative clustering is rather straightforward (a minimal code sketch follows the list below):
- Assign each sample to its own cluster.
- Find the most similar pair of clusters (based on the distance between clusters) and merge them into a single cluster.
- Repeat step 2 until only a single cluster remains.
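To make the three steps concrete, here is a minimal, naive sketch of the procedure in Python. Single linkage with Euclidean distance is assumed, and the function and variable names are illustrative rather than taken from any library:

```python
import numpy as np

def agglomerate(X):
    """Naive agglomerative clustering: returns the sequence of merges."""
    # Step 1: assign each sample to its own cluster.
    clusters = {i: [i] for i in range(len(X))}
    merges = []
    while len(clusters) > 1:
        # Step 2: find the most similar pair of clusters
        # (single linkage: the smallest pairwise Euclidean distance).
        best_pair, best_dist = None, np.inf
        ids = list(clusters)
        for i, a in enumerate(ids):
            for b in ids[i + 1:]:
                d = min(np.linalg.norm(X[p] - X[q])
                        for p in clusters[a] for q in clusters[b])
                if d < best_dist:
                    best_pair, best_dist = (a, b), d
        # Merge the pair and record the distance at which the merge happened.
        a, b = best_pair
        clusters[a].extend(clusters.pop(b))
        merges.append((a, b, best_dist))
    # Step 3: the loop repeats until only a single cluster remains.
    return merges

X = np.array([[0.0, 0.0], [0.1, 0.2], [5.0, 5.0], [5.1, 4.9]])
print(agglomerate(X))
```

Real implementations avoid recomputing all pairwise distances on every iteration, but the merge logic is the same.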
The agglomerative clustering process could be illustrated as follows:
There are a few things to note. The number of clusters is directly related to the distance between the clusters, which we will review further below. The x-axis of the dendrogram, pictured on the right, represents the individual samples. The y-axis represents the distance between clusters and shows the point at which two clusters merge. The depth of each branch (the distance between the branches and the root) reflects the proximity of clusters: the closer two clusters are, the more similar the data points within them are.
Longer branches indicate a greater distance between clusters: clusters that are merged further up the y-axis are less similar to each other. Conversely, a shorter branch suggests a higher similarity between the merged clusters.
In the illustration below, there are two thresholds: a dashed line that intersects the branches at 2 points, and a solid line that intersects the branches at 10 points, marked by blue squares. At the first threshold, there are two clusters present, and the distance (Euclidean, in this case) between these two clusters lies somewhere between 90 and 130. Similarly, there are 10 clusters at the second threshold, and the explanation mirrors the first case. As the number of clusters decreases, the clusters become less similar, or further apart from each other, and vice versa.
Usually, the entire dendrogram, ranging from all samples at the bottom up to a single cluster at the top of the y-axis, is built unless a stopping condition is specified, such as building only up to a certain cluster distance threshold. Later, the dendrogram is sliced by distance thresholds, as pictured below.
The dendrogram is the main output of the algorithm. You can set multiple thresholds on the vertical axis and see which distance produces a specific number of clusters. However, there isn't any concrete theory on which distance (or the number of clusters) should be chosen as the cut-off point. This process can be somewhat easier if you have knowledge of which distance indicates a cluster separation. For example, in geospatial applications, one might say "objects that are more than 500 meters apart should constitute separate groups". Additionally, the elbow method, the silhouette plot, and the gap statistic can be used to infer the optimal number of clusters. However, there's no silver bullet and all choices are heuristics.
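As a sketch of how such a cut can be made in practice, the snippet below builds the full dendrogram with SciPy and then slices it at a distance threshold. The data and the threshold value are arbitrary and used only for illustration:

```python
import numpy as np
from scipy.cluster.hierarchy import dendrogram, fcluster, linkage

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, (20, 2)),    # one blob around (0, 0)
               rng.normal(8, 1, (20, 2))])   # another blob around (8, 8)

Z = linkage(X, method='ward')                # linkage matrix: one row per merge
labels = fcluster(Z, t=10.0, criterion='distance')  # slice the tree at distance 10
print(np.unique(labels))                     # cluster ids produced by this cut
# dendrogram(Z)  # with matplotlib installed, this draws the tree itself
```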
Before running agglomerative clustering, the following minimum set of parameters should be set up (a short example follows the list):
- The distance metric, such as Manhattan or Euclidean. This shouldn't be confused with the cluster distance.
- The linkage criterion, which determines the distance between the clusters.
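For instance, with scikit-learn both parameters map directly onto constructor arguments. This sketch assumes a recent scikit-learn release, where the distance parameter is called `metric`; older versions call it `affinity`:

```python
import numpy as np
from sklearn.cluster import AgglomerativeClustering

X = np.array([[0, 0], [0, 1], [5, 5], [5, 6], [10, 10]], dtype=float)

model = AgglomerativeClustering(
    n_clusters=2,
    metric='manhattan',   # the point-to-point distance metric
    linkage='average',    # the linkage criterion (distance between clusters)
)
print(model.fit_predict(X))   # a cluster label for each sample
```

Note that Ward's linkage only works with the Euclidean metric, which is why average linkage is used in this example.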
Distance in hierarchical clustering
Linkage in agglomerative clustering determines the distance between clusters, which affects the decision on which clusters should be merged.
There are several types of linkage methods:
- Single linkage: The distance between two clusters is the shortest distance between two points in each cluster.
- Complete linkage: The distance between two clusters is the maximum distance between any two points in the clusters.
- Average linkage: The distance between two clusters is the average distance between each point in one cluster to every point in the other cluster.
- Ward's linkage: The distance between two clusters is the amount by which the sum of squares will increase upon merging.
Single linkage computes all pairwise distances between the elements of one cluster and the elements of the other, and takes the smallest of these distances as the linkage criterion. It tends to produce long clusters and can also lead to a chaining phenomenon, where clusters are stretched out and consist of several distant parts, which hinders the interpretation of the results. Single linkage is also severely influenced by the presence of outliers.
Complete linkage is the opposite of single linkage, as it considers the maximum distance between two clusters instead of the minimum. It calculates all pairwise distances between the elements of one cluster and the elements of the other, and defines the cluster distance as the largest of these pairwise distances. This approach results in smaller, more spherical clusters and is less susceptible to noise than single linkage.
Average linkage takes the average of all pairwise distances between the elements of one cluster and the elements of the other. It's less susceptible to noise than single or complete linkage, but it produces a less defined hierarchical structure compared to other linkage methods.
Ward's linkage, also known as the minimum variance method, minimizes the total within-cluster variance. At each step, the pair of clusters whose merge results in the minimum increase in the error sum of squares (ESS) is merged. ESS is the sum of squared distances between each data point and the centroid of its assigned cluster; in other words, ESS measures the variance within each cluster.
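In symbols, writing C_k for the k-th cluster and μ_k for its centroid (a standard formulation, assuming Euclidean distance), ESS and Ward's merge cost can be written as:

```latex
\mathrm{ESS} = \sum_{k=1}^{K} \sum_{x \in C_k} \lVert x - \mu_k \rVert^2,
\qquad
\Delta(A, B) = \mathrm{ESS}(A \cup B) - \mathrm{ESS}(A) - \mathrm{ESS}(B)
```

At each step, Ward merges the pair of clusters A and B with the smallest Δ(A, B).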
Ward tends to be the most effective linkage in practice and performs well in the presence of noise. It tends to find compact, balanced, and well-separated clusters.
Ward makes the most sense with the Euclidean distance (since it deals with minimizing the variance). Thus, if the distance metric is not Euclidean, one might opt for the average linkage instead of Ward.
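A quick way to see how the choice of linkage changes the hierarchy is to build the linkage matrix for the same toy data with each method and compare the merge distances. This is only a small sketch with arbitrary example data:

```python
import numpy as np
from scipy.cluster.hierarchy import linkage

X = np.array([[0, 0], [0, 1], [4, 0], [4, 1], [10, 10]], dtype=float)

for method in ('single', 'complete', 'average', 'ward'):
    Z = linkage(X, method=method)
    # The third column of the linkage matrix holds the distance at which
    # each merge happened; the merge heights differ from method to method.
    print(method, np.round(Z[:, 2], 2))
```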
We can see how various linkages affect the clustering results on four different 2-dimensional datasets:
Single linkage severely deteriorates in the presence of noise. However, it can correctly separate the first two datasets, unlike other linkages. Average linkage produces less defined structures, and complete and Ward linkage appear to be somewhat similar to each other in those particular scenarios.
There is another quite important concept in agglomerative clustering, namely connectivity, which directly influences the shapes of the produced clusters. Connectivity sets a constraint on how clusters can be merged; for example, only the k-nearest points may be merged into a cluster. If a constraint of "only the 5 closest neighbors of a point can be merged together" is applied, we get the following results for the first two datasets:
By setting the connectivity constraints, agglomerative clustering can be more focused on the local structures present in the data, thus improving the clustering results.
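As a sketch of such a constraint in scikit-learn, a k-nearest-neighbors connectivity graph can be passed to the estimator. The dataset, the number of neighbors, and the linkage here are only illustrative choices:

```python
from sklearn.cluster import AgglomerativeClustering
from sklearn.datasets import make_circles
from sklearn.neighbors import kneighbors_graph

# Two nested circles, similar in spirit to the datasets discussed above.
X, _ = make_circles(n_samples=500, factor=0.5, noise=0.05, random_state=0)

# Each point may only be connected to its 5 nearest neighbors.
connectivity = kneighbors_graph(X, n_neighbors=5, include_self=False)

model = AgglomerativeClustering(
    n_clusters=2,
    linkage='average',
    connectivity=connectivity,
)
labels = model.fit_predict(X)
```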
Ward is still off on the nested circles dataset, primarily for two reasons. The first reason is related to how the hierarchy itself is built: for two circles that share the same center, a small Euclidean distance between points doesn't necessarily mean they belong to the same circle, so the two circles can end up being interpreted as a single unified structure. The second reason is that the inner and outer circles are centered at the same point, so there is no meaningful increase in variance when points from the inner circle are merged into a cluster of points from the outer circle, or vice versa.
Considerations for agglomerative clustering
Let's look at the comparison between centroid-based (K-Means), density-based (DBSCAN), and agglomerative clustering:
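To reproduce such a comparison yourself, all three algorithms can be fit on the same data. The sketch below uses a simple blob dataset and illustrative parameter values:

```python
from sklearn.cluster import DBSCAN, KMeans, AgglomerativeClustering
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=300, centers=3, cluster_std=0.8, random_state=0)

models = {
    'k-means': KMeans(n_clusters=3, n_init=10, random_state=0),
    'DBSCAN': DBSCAN(eps=0.7, min_samples=5),
    'agglomerative': AgglomerativeClustering(n_clusters=3, linkage='ward'),
}
for name, model in models.items():
    labels = model.fit_predict(X)
    # DBSCAN marks noise points with the label -1, hence the exclusion.
    print(name, 'found', len(set(labels) - {-1}), 'clusters')
```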
Agglomerative clustering tends to suffer from the following limitations:
- Sensitivity to noise;
- The curse of dimensionality — as the dimensionality increases, data becomes sparse and distances start to make less sense. Eventually, we might end up in a situation where all pairwise distances converge to the same value, which undermines the notion of similar or dissimilar points (see the short numeric illustration after this list);
- It deals with two types of distances: the distance metric and how the cluster distances are defined, both of which impact the clustering results;
- It requires a lot of guesses and eyeballing with respect to what the dendrogram actually outputs to get remotely meaningful results;
- Points are assigned to clusters greedily: once a merge has occurred, it cannot be undone later on;
- It is not optimal for large datasets: all pairwise distances have to be calculated, so the time and memory costs grow at least quadratically with the number of samples.
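The curse-of-dimensionality point can be illustrated with a tiny numeric experiment: as the number of dimensions grows, the relative spread of pairwise distances between random points shrinks, so "near" and "far" become harder to tell apart. The sample sizes and dimensions below are arbitrary:

```python
import numpy as np
from scipy.spatial.distance import pdist

rng = np.random.default_rng(42)
for dim in (2, 10, 100, 1000):
    d = pdist(rng.uniform(size=(200, dim)))               # all pairwise distances
    print(dim, round((d.max() - d.min()) / d.mean(), 3))  # relative spread shrinks
```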
On the other hand, agglomerative clustering can be useful when there is an actual hierarchy present in the data (e.g., an employee dataset with various grades, or image processing, where features might be decomposed in various ways; for instance, handwritten digits can be split into fragments, with different digits sharing overlapping strokes).
Conclusion
As a result, you are now familiar with:
- the general procedure of agglomerative clustering;
- how to interpret the dendrogram;
- the limitations of the algorithm;
- how agglomerative clustering compares to K-Means and DBSCAN.