In today's topic, we will consider the approaches to evaluating clustering algorithms. Clustering is often formulated in an unsupervised setting, where there are no ground truth labels for direct comparison like in supervised learning. Metrics for clustering evaluation can be roughly split into two categories: internal measures, which don't depend on the ground truth labels, and external measures, where the ground truth labels are required.
Internal evaluation
Internal evaluation is performed in the unsupervised setting and mainly relies on two components: the intra-cluster similarity, which measures how similar the data points within a cluster are, and the inter-cluster separation, which shows how well the clusters are separated from each other. Internal metrics are more widely applicable than external metrics since ground truth labels aren't involved in the evaluation process. Some of the most frequently mentioned internal metrics include the silhouette score, the Calinski-Harabasz index, and the Davies-Bouldin index. Let's quickly go over each of them.
The silhouette score takes into account the distances within and between clusters. Silhouette values lie in the [-1,1] interval, with -1 indicating incorrect cluster assignments, 0 indicating overlapping clusters, and 1 indicating a perfect clustering.
There are some limitations of the silhouette score to keep in mind:
- The silhouette score assumes that clusters have a roughly spherical shape; if this isn't the case, it won't realistically measure the clustering quality. For the same reason, the silhouette is sensitive to noise.
- It shouldn't be used to compare different algorithms, because the silhouette score favors round clusters that are far from each other. This suits K-Means, for example, but if the underlying data has differently shaped groups (and the algorithm is capable of producing non-round clusters), the scores won't reflect the true performance.
To illustrate this preference for spherical clusters, let's consider the following example of clustering a generated moons dataset in scikit-learn:
One can observe how HDBSCAN (a density-based algorithm) outperforms K-Means (a centroid-based algorithm) in this particular scenario but scores much lower on the silhouette score.
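A minimal sketch of such an experiment could look like this (the dataset parameters and `min_cluster_size` are illustrative assumptions; `sklearn.cluster.HDBSCAN` requires scikit-learn 1.3 or newer):

```python
from sklearn.cluster import HDBSCAN, KMeans
from sklearn.datasets import make_moons
from sklearn.metrics import silhouette_score

# Two interleaving half-circles: a clearly non-spherical cluster structure.
X, _ = make_moons(n_samples=500, noise=0.05, random_state=42)

kmeans_labels = KMeans(n_clusters=2, n_init=10, random_state=42).fit_predict(X)
# HDBSCAN may label a few points as noise (-1); for simplicity they are
# treated as one more cluster by silhouette_score below.
hdbscan_labels = HDBSCAN(min_cluster_size=15).fit_predict(X)

# HDBSCAN typically recovers the two moons almost perfectly, yet its
# silhouette score comes out lower than the one for K-Means, because the
# recovered clusters are neither round nor far apart.
print("K-Means silhouette:", silhouette_score(X, kmeans_labels))
print("HDBSCAN silhouette:", silhouette_score(X, hdbscan_labels))
```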
For a single point, $i$, the silhouette score is calculated as

$$s(i) = \frac{b(i) - a(i)}{\max\big(a(i),\, b(i)\big)}$$

- $b(i)$ — the smallest average distance between $i$ and the points of some other cluster;
- $a(i)$ — the average distance between $i$ and all other points of its cluster.
The silhouette coefficient is the average of the silhouette scores for each point in the set.
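As a sanity check, here is a small sketch (on an assumed synthetic blobs dataset) that applies the per-point formula above directly and compares the result with scikit-learn's `silhouette_samples`:

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.metrics import pairwise_distances, silhouette_samples

X, labels = make_blobs(n_samples=100, centers=3, random_state=0)
D = pairwise_distances(X)  # pairwise Euclidean distances

def point_silhouette(i):
    own = labels[i]
    # a(i): average distance to the other points of i's own cluster
    same_mask = (labels == own) & (np.arange(len(X)) != i)
    a = D[i, same_mask].mean()
    # b(i): smallest average distance to the points of any other cluster
    b = min(D[i, labels == c].mean() for c in np.unique(labels) if c != own)
    return (b - a) / max(a, b)

manual = np.array([point_silhouette(i) for i in range(len(X))])
print(np.allclose(manual, silhouette_samples(X, labels)))  # expected: True
```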
The Calinski-Harabasz index (CHI, also known as the Variance Ratio Criterion) compares the dispersion (sum of squared distances) within and between the clusters. The CHI is calculated by dividing the between-cluster dispersion by the within-cluster dispersion, with the ratio multiplied by a scaling factor that depends on the cluster count and the point count. A higher score signifies better-defined clusters, but there is no upper bound that would indicate a good score. The CHI also assigns higher scores to round, well-separated clusters. It is cheaper to compute than the silhouette score but similarly won't show the real performance if the clusters aren't spherical.
The Calinski-Harabasz index is calculated as

$$CH = \frac{B_k}{W_k} \cdot \frac{n - k}{k - 1}$$

- $k$ — number of clusters;
- $B_k$ — the between-cluster dispersion: the sum of squared distances between each cluster centroid and the overall centroid of the data, weighted by the cluster size;
- $W_k$ — the within-cluster dispersion: the sum of squared distances between each point and its cluster centroid, summed over all clusters;
- $n$ — number of data points.
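A quick sketch (again on an assumed synthetic dataset) that reproduces this definition and checks it against `sklearn.metrics.calinski_harabasz_score`:

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.metrics import calinski_harabasz_score

X, labels = make_blobs(n_samples=300, centers=4, random_state=0)
n, k = len(X), len(np.unique(labels))
overall_centroid = X.mean(axis=0)

between, within = 0.0, 0.0
for c in np.unique(labels):
    cluster = X[labels == c]
    centroid = cluster.mean(axis=0)
    # Between-cluster dispersion: size-weighted squared distance of the
    # cluster centroid from the overall centroid.
    between += len(cluster) * np.sum((centroid - overall_centroid) ** 2)
    # Within-cluster dispersion: squared distances of points to their centroid.
    within += np.sum((cluster - centroid) ** 2)

chi = (between / within) * (n - k) / (k - 1)
print(chi, calinski_harabasz_score(X, labels))  # the two values match
```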
The Davies-Bouldin index (DBI) is the average similarity measure between each cluster and its most similar one, where the size of the clusters is compared to the distance between them. Lower scores indicate better cluster separation. Similarly to the previous metrics, the DBI favors spherical clusters (assigning them better, i.e., lower, scores), and thus it might be unsuitable for comparing solutions that can produce clusters of different shapes (e.g., DBSCAN compared to K-Means). The cluster size is computed from the distances of the cluster's points to its centroid.
The DBI is calculated as:

$$DBI = \frac{1}{k} \sum_{i=1}^{k} \max_{j \neq i} \frac{s_i + s_j}{d_{ij}}$$

- $k$ — number of clusters;
- $s_i$ — the average distance between each point of cluster $i$ and its centroid (and likewise $s_j$ for cluster $j$);
- $d_{ij}$ — the distance between the $i$'th and the $j$'th cluster centroids.
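The same kind of check works for the DBI; the sketch below (synthetic data assumed) follows the definition above and compares the result with `sklearn.metrics.davies_bouldin_score`:

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.metrics import davies_bouldin_score

X, labels = make_blobs(n_samples=300, centers=4, random_state=0)
clusters = np.unique(labels)
k = len(clusters)
centroids = np.array([X[labels == c].mean(axis=0) for c in clusters])
# s_i: average distance from the points of cluster i to its centroid
s = np.array([np.linalg.norm(X[labels == c] - centroids[i], axis=1).mean()
              for i, c in enumerate(clusters)])

dbi = 0.0
for i in range(k):
    # Largest similarity of cluster i to any other cluster.
    dbi += max((s[i] + s[j]) / np.linalg.norm(centroids[i] - centroids[j])
               for j in range(k) if j != i)
dbi /= k
print(dbi, davies_bouldin_score(X, labels))  # the two values match
```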
Below is the comparative table for the discussed internal metrics:
| | The silhouette coefficient | The CHI | The DBI |
|---|---|---|---|
| Is there an assumption on clustering structure? | Yes | Yes | Yes |
| Score interpretation | $[-1, 1]$, higher scores indicate better performance | Unbounded, higher scores are better | Unbounded, lower scores are better |
| Can be used to compare different algorithms? | Prefers clusters that are well-separated and spherical | Prefers clusters that are well-separated and spherical | Prefers clusters that are well-separated and spherical |
External evaluation
External evaluation relies on the presence of ground truth labels, which restricts the usage of these methods in real-life scenarios, where the labels might be unavailable. As the most intuitive approach, we could look at the simplest method of comparing two label assignments: accuracy, the proportion of correct assignments to the total sample count. Applied to pairs of data points, this idea is known as the Rand index.
The Rand index measures the similarity between the clustering output and the targets. A pair of data points is considered correctly assigned if the prediction and the target agree on it: either both points belong to the same cluster in both the prediction and the ground truth (a true positive), or they belong to different clusters in both (a true negative). The Rand index doesn't make an assumption on the cluster shape and can be used for comparing different approaches. The scores lie in the [0,1] range, with 1 indicating total agreement between the ground truth and the predicted clustering.
The classical Rand index can be formulated as

$$RI = \frac{TP + TN}{TP + TN + FP + FN}$$

where $TP$ — true positives, $TN$ — true negatives, $FP$ — false positives, and $FN$ — false negatives.
The Rand index doesn't distinguish between different types of errors, which might lead to an incomplete assessment, and it is sensitive to the number of clusters.
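In scikit-learn the plain Rand index is available as `rand_score`, and a chance-adjusted variant, `adjusted_rand_score`, mitigates the sensitivity to the number of clusters. The label vectors below are made up for illustration:

```python
from sklearn.metrics import adjusted_rand_score, rand_score

ground_truth = [0, 0, 0, 1, 1, 1, 2, 2]
predicted    = [1, 1, 0, 0, 2, 2, 2, 2]  # hypothetical clustering output

# Only the pairwise groupings matter, not the actual cluster ids.
print(rand_score(ground_truth, predicted))           # plain Rand index in [0, 1]
print(adjusted_rand_score(ground_truth, predicted))  # chance-adjusted variant
```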
Mutual Information (MI) evaluates the clustering quality by measuring the degree of dependency between Y (the generated clustering) and X (the ground truth data). The MI can capture complex relationships between the two sets and could be applied to compare different clusterings, but might overestimate the clustering quality if there is a high number of clusters present.
MI has a lower bound of 0 and no upper bound, with higher values indicating better performance. There are two variants of MI with a bounded range: the normalized mutual information (NMI) and the adjusted mutual information (AMI), which output scores in the [0, 1] range.
The regular mutual information between clusterings $U$ and $V$ could be calculated as

$$MI(U, V) = \sum_{i=1}^{|U|} \sum_{j=1}^{|V|} \frac{|U_i \cap V_j|}{N} \log \frac{N\,|U_i \cap V_j|}{|U_i|\,|V_j|}$$

- $U$, $V$ — the predicted clustering and the ground truth clustering;
- $U_i$, $V_j$ — clusters from their respective clusterings;
- $|U_i|$ — number of samples in cluster $U_i$;
- $|V_j|$ — number of samples in cluster $V_j$;
- $N$ — number of samples in the dataset.
The raw MI doesn't penalize a larger cluster count (it is maximized when every data point gets its own cluster), so the normalized and adjusted variants introduce a penalty for generating a higher number of clusters.
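The sketch below (with made-up label vectors) illustrates this: an over-split clustering keeps the raw MI at its maximum, while the normalized and adjusted variants score it lower:

```python
from sklearn.metrics import (adjusted_mutual_info_score, mutual_info_score,
                             normalized_mutual_info_score)

truth  = [0, 0, 0, 0, 1, 1, 1, 1]
coarse = [0, 0, 0, 0, 1, 1, 1, 1]  # matches the ground truth exactly
fine   = [0, 0, 1, 1, 2, 2, 3, 3]  # the same structure split into more clusters

for name, pred in [("coarse", coarse), ("fine", fine)]:
    print(name,
          mutual_info_score(truth, pred),             # raw MI: identical for both
          normalized_mutual_info_score(truth, pred),  # drops for the over-split labels
          adjusted_mutual_info_score(truth, pred))    # drops even further
```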
The V-measure is based on two main components: homogeneity, which measures the degree to which each cluster contains only data points from a single class, and completeness, which shows the degree to which the data points from the same class are assigned to the same cluster. The final score is computed as the harmonic mean of homogeneity and completeness.
The V-measure is sensitive to the data imbalance (will favor clusters of similar size).
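A short sketch with made-up label vectors shows the three related scores in scikit-learn:

```python
from sklearn.metrics import completeness_score, homogeneity_score, v_measure_score

truth = [0, 0, 0, 1, 1, 1]
pred  = [0, 0, 1, 1, 2, 2]  # hypothetical clustering output

print(homogeneity_score(truth, pred))   # does each cluster hold a single class?
print(completeness_score(truth, pred))  # does each class end up in one cluster?
print(v_measure_score(truth, pred))     # harmonic mean of the two
```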
Below is a comparative table for the discussed external metrics:
| | The Rand index | Mutual Information | The V-measure |
|---|---|---|---|
| Is there an assumption on clustering structure? | No | No | No |
| Score interpretation | $[0, 1]$, where $1$ indicates the perfect agreement between two clusterings | The unmodified version has a lower bound of $0$ and no upper bound; NMI and AMI lie in the $[0, 1]$ range. Higher values indicate better performance | $[0, 1]$, with $1$ indicating the perfect labeling |
| Can be used to compare different algorithms? | Yes | Yes | Yes |
Conclusion
In this topic, we have reviewed two groups of some of the most important ways to evaluate a clustering algorithm's performance and considered their main characteristics:
- In general, internal evaluation methods have specific assumptions about the underlying data — for example, that clusters have a round shape, are well separated, and a single point could only belong to a single cluster;
- Internal metrics are suitable for centroid-based methods (e.g., K-Means) but won't represent density-based algorithms (such as DBSCAN) well;
- External metrics do not make data structure assumptions but require ground truth labels for evaluation.