In today's topic, we will consider the approaches to evaluating clustering algorithms. Clustering is often formulated in an unsupervised setting, where there are no ground truth labels for direct comparison like in supervised learning. Metrics for clustering evaluation can be roughly split into two categories: internal measures, which don't depend on the ground truth labels, and external measures, where the ground truth labels are required.
Internal evaluation
Internal evaluation is performed in the unsupervised setting and mainly relies on two components: the intra-cluster similarity, which measures how similar the data points within a cluster are, and the inter-cluster separation, which shows how well the clusters are separated from each other. Internal metrics are more widely applicable than external metrics since ground truth labels aren't involved in the evaluation process. Some of the most frequently mentioned internal metrics include the silhouette score, the Calinski-Harabasz index, and the Davies-Bouldin index. Let's quickly go over each of them.
The silhouette score takes into account the distances within and between clusters. Silhouette values lie in the [-1,1] interval, with -1 indicating incorrect cluster assignments, 0 indicating overlapping clusters, and 1 indicating a perfect clustering.
There are some limitations of the silhouette score to keep in mind:
- The silhouette score assumes that clusters have a roughly spherical shape; if this isn't the case, it won't realistically measure the clustering quality. For the same reason, the silhouette is sensitive to noise.
- It shouldn't be used to compare different algorithms, because the silhouette score favors round clusters that are far from each other. This suits K-Means, for example, but if the underlying data has differently shaped groups (and the algorithm is capable of producing non-round clusters), the scores won't reflect the true performance.
To illustrate this preference for spherical clusters, let's consider the following example of clustering a generated moons dataset in scikit-learn:
One can observe how HDBSCAN (a density-based algorithm) outperforms K-Means (a centroid-based algorithm) in this particular scenario but scores much lower on the silhouette score.
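A minimal sketch of such an experiment could look like this (the dataset parameters and `min_cluster_size` are illustrative assumptions; `sklearn.cluster.HDBSCAN` requires scikit-learn 1.3 or newer):

```python
from sklearn.cluster import HDBSCAN, KMeans
from sklearn.datasets import make_moons
from sklearn.metrics import silhouette_score

# Two interleaving half-circles: a clearly non-spherical cluster structure.
X, _ = make_moons(n_samples=500, noise=0.05, random_state=42)

kmeans_labels = KMeans(n_clusters=2, n_init=10, random_state=42).fit_predict(X)
# HDBSCAN may label a few points as noise (-1); for simplicity they are
# treated as one more cluster by silhouette_score below.
hdbscan_labels = HDBSCAN(min_cluster_size=15).fit_predict(X)

# HDBSCAN typically recovers the two moons almost perfectly, yet its
# silhouette score comes out lower than the one for K-Means, because the
# recovered clusters are neither round nor far apart.
print("K-Means silhouette:", silhouette_score(X, kmeans_labels))
print("HDBSCAN silhouette:", silhouette_score(X, hdbscan_labels))
```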
For a single point, $i$, the silhouette score is calculated as

$$s(i) = \frac{b(i) - a(i)}{\max\big(a(i),\, b(i)\big)}$$

- $b(i)$ — the smallest average distance between $i$ and the points of some other cluster;
- $a(i)$ — the average distance between $i$ and all other points of its cluster.
The silhouette coefficient is the average of the silhouette scores for each point in the set.
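As a sanity check, here is a small sketch (on an assumed synthetic blobs dataset) that applies the per-point formula above directly and compares the result with scikit-learn's `silhouette_samples`:

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.metrics import pairwise_distances, silhouette_samples

X, labels = make_blobs(n_samples=100, centers=3, random_state=0)
D = pairwise_distances(X)  # pairwise Euclidean distances

def point_silhouette(i):
    own = labels[i]
    # a(i): average distance to the other points of i's own cluster
    same_mask = (labels == own) & (np.arange(len(X)) != i)
    a = D[i, same_mask].mean()
    # b(i): smallest average distance to the points of any other cluster
    b = min(D[i, labels == c].mean() for c in np.unique(labels) if c != own)
    return (b - a) / max(a, b)

manual = np.array([point_silhouette(i) for i in range(len(X))])
print(np.allclose(manual, silhouette_samples(X, labels)))  # expected: True
```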
The Calinski-Harabasz index (CHI, also known as the Variance Ratio Criterion) compares the dispersion (sum of squared distances) within and between the clusters. The CHI is calculated by dividing the between-cluster dispersion by the within-cluster dispersion, with the ratio multiplied by a scaling factor that depends on the cluster count and the point count. A higher score signifies better-defined clusters, but there is no upper bound that would indicate a good score. The CHI also assigns higher scores to round, well-separated clusters. It is cheaper to compute than the silhouette score but similarly won't show the real performance if the clusters aren't spherical.
The Calinski-Harabasz index is calculated as

$$CH = \frac{B_k}{W_k} \cdot \frac{n - k}{k - 1}$$

- $k$ — number of clusters;
- $B_k$ — the between-cluster dispersion: the sum of squared distances between each cluster centroid and the overall centroid of the data, weighted by the cluster size;
- $W_k$ — the within-cluster dispersion: the sum of squared distances between each point and its cluster centroid, summed over all clusters;
- $n$ — number of data points.
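A quick sketch (again on an assumed synthetic dataset) that reproduces this definition and checks it against `sklearn.metrics.calinski_harabasz_score`:

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.metrics import calinski_harabasz_score

X, labels = make_blobs(n_samples=300, centers=4, random_state=0)
n, k = len(X), len(np.unique(labels))
overall_centroid = X.mean(axis=0)

between, within = 0.0, 0.0
for c in np.unique(labels):
    cluster = X[labels == c]
    centroid = cluster.mean(axis=0)
    # Between-cluster dispersion: size-weighted squared distance of the
    # cluster centroid from the overall centroid.
    between += len(cluster) * np.sum((centroid - overall_centroid) ** 2)
    # Within-cluster dispersion: squared distances of points to their centroid.
    within += np.sum((cluster - centroid) ** 2)

chi = (between / within) * (n - k) / (k - 1)
print(chi, calinski_harabasz_score(X, labels))  # the two values match
```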
The Davies-Bouldin index (DBI) is the average similarity measure between each cluster and its most similar one, where the size of the clusters is compared to the distance between them. Lower scores indicate better cluster separation. Similarly to the previous metrics, the DBI favors spherical clusters (assigning them better, i.e., lower, scores), and thus it might be unsuitable for comparing solutions that can produce clusters of different shapes (e.g., DBSCAN compared to K-Means). The cluster size is computed from the distances of the cluster's points to its centroid.
The DBI is calculated as:

$$DBI = \frac{1}{k} \sum_{i=1}^{k} \max_{j \neq i} \frac{s_i + s_j}{d_{ij}}$$

- $k$ — number of clusters;
- $s_i$ — the average distance between each point of cluster $i$ and its centroid (and likewise $s_j$ for cluster $j$);
- $d_{ij}$ — the distance between the $i$'th and the $j$'th cluster centroids.
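The same kind of check works for the DBI; the sketch below (synthetic data assumed) follows the definition above and compares the result with `sklearn.metrics.davies_bouldin_score`:

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.metrics import davies_bouldin_score

X, labels = make_blobs(n_samples=300, centers=4, random_state=0)
clusters = np.unique(labels)
k = len(clusters)
centroids = np.array([X[labels == c].mean(axis=0) for c in clusters])
# s_i: average distance from the points of cluster i to its centroid
s = np.array([np.linalg.norm(X[labels == c] - centroids[i], axis=1).mean()
              for i, c in enumerate(clusters)])

dbi = 0.0
for i in range(k):
    # Largest similarity of cluster i to any other cluster.
    dbi += max((s[i] + s[j]) / np.linalg.norm(centroids[i] - centroids[j])
               for j in range(k) if j != i)
dbi /= k
print(dbi, davies_bouldin_score(X, labels))  # the two values match
```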
Below is the comparative table for the discussed internal metrics:
| | The silhouette coefficient | The CHI | The DBI |
|---|---|---|---|
| Is there an assumption on clustering structure? | Yes | Yes | Yes |
| Score interpretation | $[-1, 1]$, higher scores indicate better performance | Unbounded, higher scores are better | Unbounded, lower scores are better |
| Can be used to compare different algorithms? | Prefers clusters that are well-separated and spherical | Prefers clusters that are well-separated and spherical | Prefers clusters that are well-separated and spherical |
External evaluation
External evaluation relies on the presence of ground truth labels, which restricts the usage of these methods in real-life scenarios, where the labels might be unavailable. As the most intuitive approach, we could look at the simplest method of comparing two label assignments: accuracy, the proportion of correct assignments to the total sample count. Applied to pairs of data points, this idea is known as the Rand index.
The Rand index measures the similarity between the clustering output and the targets. A pair of data points is considered correctly assigned if the prediction and the target agree on it: either both points belong to the same cluster in both the prediction and the ground truth (a true positive), or they belong to different clusters in both (a true negative). The Rand index doesn't make an assumption on the cluster shape and can be used for comparing different approaches. The scores lie in the [0,1] range, with 1 indicating total agreement between the ground truth and the predicted clustering.
The classical Rand index can be formulated as

$$RI = \frac{TP + TN}{TP + TN + FP + FN}$$

where $TP$ — true positives, $TN$ — true negatives, $FP$ — false positives, and $FN$ — false negatives.
The Rand index doesn't distinguish between different types of errors, which might lead to an incomplete assessment, and it is sensitive to the number of clusters.
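In scikit-learn the plain Rand index is available as `rand_score`, and a chance-adjusted variant, `adjusted_rand_score`, mitigates the sensitivity to the number of clusters. The label vectors below are made up for illustration:

```python
from sklearn.metrics import adjusted_rand_score, rand_score

ground_truth = [0, 0, 0, 1, 1, 1, 2, 2]
predicted    = [1, 1, 0, 0, 2, 2, 2, 2]  # hypothetical clustering output

# Only the pairwise groupings matter, not the actual cluster ids.
print(rand_score(ground_truth, predicted))           # plain Rand index in [0, 1]
print(adjusted_rand_score(ground_truth, predicted))  # chance-adjusted variant
```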
Mutual Information (MI) evaluates the clustering quality by measuring the degree of dependency between Y (the generated clustering) and X (the ground truth data). The MI can capture complex relationships between the two sets and could be applied to compare different clusterings, but might overestimate the clustering quality if there is a high number of clusters present.
MI has a lower bound of 0 and no upper bound, with higher values indicating better performance. There are two variants of MI with a bounded range: the normalized mutual information (NMI) and the adjusted mutual information (AMI), which output scores in the [0, 1] range.
The regular mutual information between clusterings $U$ and $V$ could be calculated as

$$MI(U, V) = \sum_{i=1}^{|U|} \sum_{j=1}^{|V|} \frac{|U_i \cap V_j|}{N} \log \frac{N\,|U_i \cap V_j|}{|U_i|\,|V_j|}$$

- $U$, $V$ — the predicted clustering and the ground truth clustering;
- $U_i$, $V_j$ — clusters from their respective clusterings;
- $|U_i|$ — number of samples in cluster $U_i$;
- $|V_j|$ — number of samples in cluster $V_j$;
- $N$ — number of samples in the dataset.
The raw MI doesn't penalize a larger cluster count (it is maximized when every data point gets its own cluster), so the normalized and adjusted variants introduce a penalty for generating a higher number of clusters.
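The sketch below (with made-up label vectors) illustrates this: an over-split clustering keeps the raw MI at its maximum, while the normalized and adjusted variants score it lower:

```python
from sklearn.metrics import (adjusted_mutual_info_score, mutual_info_score,
                             normalized_mutual_info_score)

truth  = [0, 0, 0, 0, 1, 1, 1, 1]
coarse = [0, 0, 0, 0, 1, 1, 1, 1]  # matches the ground truth exactly
fine   = [0, 0, 1, 1, 2, 2, 3, 3]  # the same structure split into more clusters

for name, pred in [("coarse", coarse), ("fine", fine)]:
    print(name,
          mutual_info_score(truth, pred),             # raw MI: identical for both
          normalized_mutual_info_score(truth, pred),  # drops for the over-split labels
          adjusted_mutual_info_score(truth, pred))    # drops even further
```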
The V-measure is based on two main components: homogeneity, which measures the degree to which each cluster contains only data points from a single class, and completeness, which shows the degree to which the data points from the same class are assigned to the same cluster. The final score is computed as the harmonic mean of homogeneity and completeness.
The V-measure is sensitive to the data imbalance (will favor clusters of similar size).
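A short sketch with made-up label vectors shows the three related scores in scikit-learn:

```python
from sklearn.metrics import completeness_score, homogeneity_score, v_measure_score

truth = [0, 0, 0, 1, 1, 1]
pred  = [0, 0, 1, 1, 2, 2]  # hypothetical clustering output

print(homogeneity_score(truth, pred))   # does each cluster hold a single class?
print(completeness_score(truth, pred))  # does each class end up in one cluster?
print(v_measure_score(truth, pred))     # harmonic mean of the two
```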
Below is a comparative table for the discussed external metrics:
| | The Rand index | Mutual Information | The V-measure |
|---|---|---|---|
| Is there an assumption on clustering structure? | No | No | No |
| Score interpretation | $[0, 1]$, where $1$ indicates the perfect agreement between two clusterings | The unmodified version has a lower bound of $0$ and no upper bound; NMI and AMI lie in the $[0, 1]$ range. Higher values indicate better performance | $[0, 1]$, with $1$ indicating the perfect labeling |
| Can be used to compare different algorithms? | Yes | Yes | Yes |
Conclusion
In this topic, we have reviewed two groups of some of the most important ways to evaluate a clustering algorithm's performance and considered their main characteristics:
- In general, internal evaluation methods have specific assumptions about the underlying data — for example, that clusters have a round shape, are well separated, and a single point could only belong to a single cluster;
- Internal metrics are suitable for centroid-based methods (e.g., K-Means) but won't represent density-based algorithms (such as DBSCAN) well;
- External metrics do not make data structure assumptions but require ground truth labels for evaluation.