8 minutes read

In today's topic, we will look at one way to assess the quality of clustering and determine the optimal number of clusters — the silhouette coefficient and its plot. The silhouette coefficient does not rely on the previously known target labels and is categorized as an intrinsic measure of clustering performance.

The silhouette value

The silhouette score measures how similar the data points inside the cluster are compared to how different the clusters are from each other. The silhouette score calculation involves two core components: cohesion and separation. Cohesion measures the similarity of the points within the cluster and could be introduced as:

ai=1Ci1jCi,ijd(i,j),a_i = \frac{1}{|C_i| - 1}\sum_{j \in C_i, i \neq j} d(i, j),

where CiC_i — the cluster associated with the ii'th data point; Ci|C_i| — number of data points in the cluster ii, and d(i,j)d(i, j) — the distance between the ii'th and the jj'th points.

Separation shows the degree to which the clusters don't overlap:

bi=minki1CkjCkd(i,j)b_i = \min_{k \neq i} \frac{1}{|C_k|}\sum_{j \in C_k} d(i, j)

Combining cohesion and separation, the silhouette value for a single point is calculated as:

si=biaimax(bi,ai)s_i = \frac{b_i - a_i}{\max(b_i, a_i)}

Silhouette values lie in the [1,1][-1, 1] interval, with -1 indicating a misclassified point, 1 indicating that the point is closely tied with its cluster and poorly matched with the neighboring clusters, and a score of 0 means that there is no clear separation between the clusters and they might overlap. Generally speaking, scores of 0.7 and higher are considered acceptable. Here is an illustration of the silhouette score components:

A walkthrough of cohesion, separation, and the silhouette value on a synthetic 3 cluster example.

After the silhouette values are calculated, they could be averaged for a single cluster (as a measure of grouping inside the cluster), or for the entire dataset to determine how well the data has been clustered. The latter is known as the silhouette coefficient (SC) and could be calculated as:

SC=1ni=1nsi\text{SC} = \frac{1}{n} \sum_{i=1}^n s_i

Below is a table that helps to interpret the silhouette coefficients:

Silhouette coefficient Interpretation
0.71 — 1 A strong structure is found
0.51 — 0.7 A reasonable structure is found
0.26 — 0.5 The found structure might be artificial
< 0.26 No structure is found

A walk-through example

Let's run the silhouette calculations on a small synthetic dataset. The distance is presumed to be Euclidean, which is defined as

d(p,q)=(q1p1)2+(q2p2)2d(p, q) = \sqrt{(q_1 - p_1)^2 + (q_2 - p_2)^2}for two points p,qp, q in the two-dimensional case.

Suppose there are 3 clusters present:

Cluster 1=[(1,1),(1,2)]Cluster 2=[(3,4),(4,4),(4,5)]Cluster 3=[(6,2),(7,2),(7,3)]\text{Cluster 1} = [(1, 1), (1,2) ] \\ \text{Cluster 2} = [(3,4), (4,4), (4,5) ] \\ \text{Cluster 3} = [(6,2), (7,2), (7,3) ] \\We start with point (1,1) from the first cluster. Calculating the average distance to all other points in the first cluster(cohesion):

ai=1Ci1jCi,ijd(i,j)=121(11)2+(12)2=1a_i = \frac{1}{|C_i| - 1}\sum_{j \in C_i, i \neq j} d(i, j) = \frac{1}{2 - 1}\sqrt{(1-1)^2 + (1-2)^2} = 1

To calculate the separation, we'll calculate the average distance from (1,1) in the first cluster to all objects in cluster 2 and cluster 3, and then take the minimum average distance.

The average distance of (1,1)Cluster 1(1,1) \in \text{Cluster 1} to all points in cluster 2:

(13)2+(14)2+(14)2+(14)2+(14)2+(15)2312.8534.283\frac{\sqrt{(1-3)^2 + (1-4)^2} + \sqrt{(1-4)^2 + (1-4)^2} + \sqrt{(1-4)^2 + (1-5)^2}}{3} \approx \frac{12.85}{3} \approx 4.283

The average distance of (1,1)Cluster 1(1,1) \in \text{Cluster 1} to all points in cluster 3:

(16)2+(12)2+(17)2+(12)2+(17)2+(13)2317.5135.84\frac{\sqrt{(1-6)^2 + (1-2)^2} + \sqrt{(1-7)^2 + (1-2)^2} + \sqrt{(1-7)^2 + (1-3)^2}}{3} \approx \frac{17.51}{3} \approx 5.84

Then, the separation is equal to b1=min(4.283,5.84)=4.283b_1 = \min(4.283,\, 5.84) = 4.283

Substituting a1=1a_1 = 1 and b1=4.283b_1 = 4.283, the silhouette value for the point (1,1) is:

s1=biaimax(bi,ai)=4.28314.2830.77s_1 = \frac{b_i - a_i}{\max(b_i, a_i)} = \frac{4.283 - 1}{4.283} \approx 0.77

We can see that the score is closer to 1, which indicates good cluster separation.

Visual interpretation

The silhouette coefficients could be utilized to determine the optimal number of clusters. The silhouette plot is built for every number of clusters under consideration. The plot displays the silhouette score for the data points inside each cluster, the thickness of the bars represents the number of samples in the cluster, and the bar height shows the silhouette scores.

Let's consider a clustered dataset, where the real number of clusters is equal to 3, and different values of n_clusters, along with their silhouette plots:

Silhouette plots for 3 clusters, along with the plot of the clustered data.

Silhouette plots for 5 clusters, along with the plot of the clustered data.

The coloring of silhouette plots on the left corresponds to the coloring of the scatter plot of the clusters on the right, and one can observe that for n_clusters=3, the silhouette scores are higher when compared to the n_clusters = 5, and also in the n_clusters=3 case there is less cluster size disbalance.

Silhouette issues

The silhouette values are easy to interpret, but they have a couple of limitations:

  1. Silhouette can't be used to compare different methods, for one, because applying feature scaling will produce different results, and the silhouette itself is based on a specific distance;
  2. It is assumed that the clusters are linearly separable and have a spherical shape;
  3. Silhouette calculation in the original version has a O(n2)O(n^2)complexity;
  4. It does not work properly with many dimensions.

Keeping these drawbacks in mind, it's not possible to rely on the silhouette scores alone and the introduction of other methods(e.g., the Rand index) of performance evaluation is required.

Conclusion

Now, let's summarize and highlight what we have learned in this topic:

  • What the silhouette score is;
  • How to calculate the silhouette score for a single point;
  • How to interpret the silhouette plots;
  • The limitations of applying the silhouette score.
4 learners liked this piece of theory. 1 didn't like it. What about you?
Report a typo