In this topic, we will look at one way to assess the quality of a clustering and to choose the number of clusters: the silhouette coefficient and its plot. The silhouette coefficient does not rely on ground-truth labels, so it is categorized as an intrinsic measure of clustering performance.
The silhouette value
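The silhouette score compares how close a data point is to the other points in its own cluster with how far it is from the points in the other clusters. The calculation involves two core components: cohesion and separation. Cohesion measures the similarity of the points within the cluster and could be introduced as:

$$a(i) = \frac{1}{|C_I| - 1} \sum_{j \in C_I,\, j \neq i} d(i, j)$$

where $C_I$ is the cluster associated with the $i$'th data point, $|C_I|$ is the number of data points in the cluster $C_I$, and $d(i, j)$ is the distance between the $i$'th and the $j$'th points.
Separation shows the degree to which the clusters don't overlap. It is the smallest average distance from the $i$'th point to the points of any other cluster $C_J$:

$$b(i) = \min_{J \neq I} \frac{1}{|C_J|} \sum_{j \in C_J} d(i, j)$$

Combining cohesion and separation, the silhouette value for a single point is calculated as:

$$s(i) = \frac{b(i) - a(i)}{\max\{a(i), b(i)\}}$$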
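The three definitions above can be sketched in a few lines of plain Python. This is only a minimal illustration, assuming Euclidean distance (via `math.dist`) and points stored as tuples; the function names `cohesion`, `separation`, and `silhouette_value` are made up for this example, and returning 0 for a single-point cluster is a common convention rather than something stated in the text.

```python
from math import dist  # Euclidean distance, available since Python 3.8


def cohesion(point, own_cluster):
    """Mean distance from `point` to the other members of its own cluster.

    `own_cluster` must contain `point`; its zero distance to itself adds
    nothing to the sum, so we divide by the number of *other* members.
    """
    return sum(dist(point, p) for p in own_cluster) / (len(own_cluster) - 1)


def separation(point, other_clusters):
    """Smallest mean distance from `point` to the members of any other cluster."""
    return min(sum(dist(point, p) for p in c) / len(c) for c in other_clusters)


def silhouette_value(point, own_cluster, other_clusters):
    """Silhouette value (b - a) / max(a, b) for a single point."""
    if len(own_cluster) == 1:
        return 0.0  # assumed convention for single-point clusters
    a = cohesion(point, own_cluster)
    b = separation(point, other_clusters)
    return (b - a) / max(a, b)


# Toy usage: a = 1, the two other clusters are at distances 5 and 10, so b = 5.
s = silhouette_value((0, 0), [(0, 0), (0, 1)], [[(3, 4)], [(6, 8)]])
print(s)  # 0.8
```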
Silhouette values lie in the interval $[-1, 1]$: a value near $-1$ suggests a misclassified point, a value near 1 indicates that the point is closely tied to its cluster and poorly matched with the neighboring clusters, and a value near 0 means that there is no clear separation between the clusters and they might overlap. Generally speaking, scores above 0.7 are taken to indicate a strong clustering structure. Here is an illustration of the silhouette score components:
After the silhouette values are calculated, they can be averaged over a single cluster (as a measure of how tightly that cluster is grouped) or over the entire dataset to determine how well the data has been clustered. The latter is known as the silhouette coefficient (SC) and could be calculated as:

$$SC = \frac{1}{N} \sum_{i=1}^{N} s(i)$$

where $s(i)$ is the silhouette value of the $i$'th point and $N$ is the number of data points.
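As a rough sketch of this averaging step, the snippet below computes every point's silhouette value from a list of points and a parallel list of cluster labels, then takes the mean. It assumes Euclidean distance and at least two clusters; `silhouette_values` and `silhouette_coefficient` are hypothetical helper names, and the four points are invented for the example.

```python
from math import dist


def silhouette_values(points, labels):
    """Silhouette value for every point, given one cluster label per point."""
    clusters = {}
    for p, lab in zip(points, labels):
        clusters.setdefault(lab, []).append(p)

    values = []
    for p, lab in zip(points, labels):
        own = clusters[lab]
        if len(own) == 1:
            values.append(0.0)  # assumed convention for single-point clusters
            continue
        # Cohesion: mean distance to the other members of the same cluster.
        a = sum(dist(p, q) for q in own) / (len(own) - 1)
        # Separation: smallest mean distance to the members of another cluster.
        b = min(
            sum(dist(p, q) for q in members) / len(members)
            for other, members in clusters.items()
            if other != lab
        )
        values.append((b - a) / max(a, b))
    return values


def silhouette_coefficient(points, labels):
    """Mean silhouette value over the whole dataset."""
    values = silhouette_values(points, labels)
    return sum(values) / len(values)


# Invented data: two tight pairs of points that are far apart.
points = [(0, 0), (0, 1), (10, 0), (10, 1)]
labels = [0, 0, 1, 1]
sc = silhouette_coefficient(points, labels)
print(round(sc, 3))
```

Averaging the values returned by `silhouette_values` over the members of one cluster instead gives the per-cluster score mentioned above.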
Below is a table that helps to interpret the silhouette coefficients:
| Silhouette coefficient | Interpretation |
| --- | --- |
| 0.71 – 1 | A strong structure is found |
| 0.51 – 0.7 | A reasonable structure is found |
| 0.26 – 0.5 | The found structure might be artificial |
| < 0.26 | No structure is found |
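The table can be turned into a small lookup helper. The sketch below is only one possible reading: the table leaves tiny gaps between its ranges (for example between 0.7 and 0.71), so the cut-offs `> 0.7`, `> 0.5`, and `> 0.25` are an assumption, and `interpret_silhouette` is a hypothetical name.

```python
def interpret_silhouette(sc):
    """Map a silhouette coefficient to the verbal interpretation in the table.

    The exact boundaries are an assumption; the table only gives rounded ranges.
    """
    if sc > 0.7:
        return "A strong structure is found"
    if sc > 0.5:
        return "A reasonable structure is found"
    if sc > 0.25:
        return "The found structure might be artificial"
    return "No structure is found"


print(interpret_silhouette(0.62))  # A reasonable structure is found
```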
A walk-through example
Let's run the silhouette calculations on a small synthetic dataset. The distance is presumed to be Euclidean, which is defined as

$$d(p, q) = \sqrt{(p_1 - q_1)^2 + (p_2 - q_2)^2}$$

for two points $p = (p_1, p_2)$ and $q = (q_1, q_2)$ in the two-dimensional case.
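For reference, the two-dimensional Euclidean distance can be written out by hand and checked against the standard library's `math.dist`. The function name `euclidean` and the sample points are chosen only for this illustration.

```python
from math import dist, sqrt


def euclidean(p, q):
    """Euclidean distance between two 2-D points given as (x, y) tuples."""
    return sqrt((p[0] - q[0]) ** 2 + (p[1] - q[1]) ** 2)


# Differences of 3 and 4 along the two axes give a distance of 5.
print(euclidean((1, 1), (4, 5)))  # 5.0
print(dist((1, 1), (4, 5)))       # 5.0, same result from the standard library
```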
Suppose there are 3 clusters present:
We start with point (1,1) from the first cluster. We first calculate its average distance to all other points in the first cluster (cohesion):
To calculate the separation, we'll calculate the average distance from (1,1) in the first cluster to all objects in cluster 2 and cluster 3, and then take the minimum average distance.
The average distance from (1,1) to all points in cluster 2:
The average distance from (1,1) to all points in cluster 3:
Then, the separation is equal to
Substituting the cohesion and separation values, the silhouette value for the point (1,1) is:
We can see that the score is closer to 1, which indicates good cluster separation.
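The same sequence of steps can be replayed in code. The cluster coordinates below are hypothetical and chosen only for illustration (they are not the dataset from the walk-through); only the starting point (1,1) and the order of the steps follow the text.

```python
from math import dist

# Hypothetical clusters, invented for this sketch.
cluster_1 = [(1, 1), (1, 2), (2, 1)]
cluster_2 = [(5, 5), (6, 5), (5, 6)]
cluster_3 = [(9, 1), (9, 2), (10, 1)]

point = (1, 1)

# Step 1: cohesion, the mean distance to the other points of cluster 1.
a = sum(dist(point, p) for p in cluster_1 if p != point) / (len(cluster_1) - 1)

# Step 2: mean distance to every point of each of the other clusters.
mean_to_2 = sum(dist(point, p) for p in cluster_2) / len(cluster_2)
mean_to_3 = sum(dist(point, p) for p in cluster_3) / len(cluster_3)

# Step 3: separation is the smaller of the two means (the nearest cluster).
b = min(mean_to_2, mean_to_3)

# Step 4: combine cohesion and separation into the silhouette value.
s = (b - a) / max(a, b)

print(f"a = {a:.3f}, mean to cluster 2 = {mean_to_2:.3f}, "
      f"mean to cluster 3 = {mean_to_3:.3f}, b = {b:.3f}, s = {s:.3f}")
```

With these made-up coordinates, cluster 2 is the nearest neighboring cluster, and the resulting value is well above 0.7.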
Visual interpretation
The silhouette coefficients can be used to determine the optimal number of clusters. A silhouette plot is built for every number of clusters under consideration. The plot displays the silhouette score for the data points inside each cluster: the thickness of the bars represents the number of samples in the cluster, and the bar height shows the silhouette scores.
Let's consider a clustered dataset, where the real number of clusters is equal to 3, and different values of n_clusters, along with their silhouette plots:
The coloring of the silhouette plots on the left corresponds to the coloring of the scatter plot of the clusters on the right. One can observe that for n_clusters = 3 the silhouette scores are higher than for n_clusters = 5, and that the cluster sizes are also more balanced in the n_clusters = 3 case.
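The numeric side of this comparison can be sketched without any plotting. In the snippet below, the points and the three candidate labelings are hand-written assumptions that stand in for the output of a clustering algorithm run with different values of n_clusters; `mean_silhouette` is a hypothetical helper that assumes Euclidean distance.

```python
from math import dist


def mean_silhouette(points, labels):
    """Mean silhouette value over all points (needs at least two clusters)."""
    clusters = {}
    for p, lab in zip(points, labels):
        clusters.setdefault(lab, []).append(p)

    total = 0.0
    for p, lab in zip(points, labels):
        own = clusters[lab]
        if len(own) == 1:
            continue  # single-point clusters contribute 0 (assumed convention)
        a = sum(dist(p, q) for q in own) / (len(own) - 1)
        b = min(
            sum(dist(p, q) for q in members) / len(members)
            for other, members in clusters.items()
            if other != lab
        )
        total += (b - a) / max(a, b)
    return total / len(points)


# Hypothetical data: three tight groups of four points each.
points = [
    (0, 0), (0, 1), (1, 0), (1, 1),
    (10, 0), (10, 1), (11, 0), (11, 1),
    (5, 9), (5, 10), (6, 9), (6, 10),
]

# Hand-written labelings: two groups merged, the true grouping, one group split.
candidates = {
    2: [0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1],
    3: [0, 0, 0, 0, 1, 1, 1, 1, 2, 2, 2, 2],
    4: [0, 0, 1, 1, 2, 2, 2, 2, 3, 3, 3, 3],
}

scores = {k: mean_silhouette(points, labels) for k, labels in candidates.items()}
best_k = max(scores, key=scores.get)

for k, score in scores.items():
    print(f"n_clusters = {k}: mean silhouette = {score:.3f}")
print("best n_clusters:", best_k)
```

On this toy data, both merging two groups and splitting one group lower the mean silhouette, so the labeling with three clusters scores highest.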
Silhouette issues
The silhouette values are easy to interpret, but they have several limitations:
- Silhouette can't reliably be used to compare different methods: it is tied to a specific distance, so, for one, applying feature scaling will produce different results;
- It is assumed that the clusters are linearly separable and have a spherical shape;
- Silhouette calculation in the original version has $O(N^2)$ complexity, where $N$ is the number of data points;
- It does not work properly with many dimensions.
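The dependence on feature scaling is easy to demonstrate. In this sketch, the same four hypothetical points with the same labels are scored twice: once as given and once after the second feature is multiplied by 0.1. The helper `mean_silhouette`, the data, and the scaling factor are all assumptions made for the illustration.

```python
from math import dist


def mean_silhouette(points, labels):
    """Mean silhouette value over all points (needs at least two clusters)."""
    clusters = {}
    for p, lab in zip(points, labels):
        clusters.setdefault(lab, []).append(p)

    total = 0.0
    for p, lab in zip(points, labels):
        own = clusters[lab]
        if len(own) == 1:
            continue  # single-point clusters contribute 0 (assumed convention)
        a = sum(dist(p, q) for q in own) / (len(own) - 1)
        b = min(
            sum(dist(p, q) for q in members) / len(members)
            for other, members in clusters.items()
            if other != lab
        )
        total += (b - a) / max(a, b)
    return total / len(points)


# Hypothetical data: clusters differ along x, but y has a much larger range.
raw = [(0, 0), (1, 10), (5, 0), (6, 10)]
labels = [0, 0, 1, 1]

# Same points and labels, with the second feature rescaled.
scaled = [(x, y * 0.1) for x, y in raw]

score_raw = mean_silhouette(raw, labels)
score_scaled = mean_silhouette(scaled, labels)
print(f"raw: {score_raw:.3f}, rescaled: {score_scaled:.3f}")
```

The labels never change, yet the raw score comes out negative while the rescaled one is clearly positive, which is why silhouette scores computed under different preprocessing or distances are not directly comparable.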
Keeping these drawbacks in mind, it is unwise to rely on the silhouette scores alone; other methods of performance evaluation (e.g., the Rand index) should be used alongside them.
Conclusion
Now, let's summarize and highlight what we have learned in this topic:
- What the silhouette score is;
- How to calculate the silhouette score for a single point;
- How to interpret the silhouette plots;
- The limitations of applying the silhouette score.