WEBVTT

1
00:00:00.050 --> 00:00:03.000
- [Instructor] I want to talk about the whole concept

2
00:00:03.000 --> 00:00:05.060
of the silhouette statistic.

3
00:00:05.060 --> 00:00:08.010
You may have noticed in MODELLER

4
00:00:08.010 --> 00:00:10.030
as part of the MODELLER summary,

5
00:00:10.030 --> 00:00:15.020
you get this very visually compelling red-yellow-green

6
00:00:15.020 --> 00:00:19.010
thermometer indicator of the cluster quality.

7
00:00:19.010 --> 00:00:21.040
That's not unique to MODELLER.

8
00:00:21.040 --> 00:00:26.000
The silhouette concept goes all the way back to the '80s,

9
00:00:26.000 --> 00:00:28.020
in a paper that was written back then.

10
00:00:28.020 --> 00:00:32.060
This is really the closest we come in cluster analysis

11
00:00:32.060 --> 00:00:36.010
to something like R-squared and the like.

12
00:00:36.010 --> 00:00:38.030
There's just not the same consensus.

13
00:00:38.030 --> 00:00:40.080
There are other challenges as well,

14
00:00:40.080 --> 00:00:42.050
because remember there are a lot of things

15
00:00:42.050 --> 00:00:45.030
to keep in mind with why we choose the particular

16
00:00:45.030 --> 00:00:47.040
cluster analysis that we do.

17
00:00:47.040 --> 00:00:51.030
But let's talk about the logic of silhouette.

18
00:00:51.030 --> 00:00:53.080
Here's a quick scatterplot that I put together.

19
00:00:53.080 --> 00:00:57.010
To keep it simple, I'm using just two variables,

20
00:00:57.010 --> 00:00:59.010
obviously, to make it two-dimensional.

21
00:00:59.010 --> 00:01:02.000
And I'm only plotting a small number

22
00:01:02.000 --> 00:01:04.060
of points that all belong to one cluster.

23
00:01:04.060 --> 00:01:07.000
The whole concept of silhouette is

24
00:01:07.000 --> 00:01:10.020
to take all of these points one at a time,

25
00:01:10.020 --> 00:01:14.000
and then measure how far they are from other points

26
00:01:14.000 --> 00:01:16.050
that belong to the same cluster.

27
00:01:16.050 --> 00:01:18.080
Naturally, we want these distances

28
00:01:18.080 --> 00:01:21.000
on average to be small.

29
00:01:21.000 --> 00:01:24.000
That's our measure of cohesion.

30
00:01:24.000 --> 00:01:26.060
Now, in addition to that, we then

31
00:01:26.060 --> 00:01:29.080
have to take this and bring the other cluster in.

32
00:01:29.080 --> 00:01:31.060
So once we've established the amount

33
00:01:31.060 --> 00:01:34.040
of cohesion, we want to then compare

34
00:01:34.040 --> 00:01:37.010
that to the amount of separation.

35
00:01:37.010 --> 00:01:39.070
And unfortunately, this little tiny

36
00:01:39.070 --> 00:01:42.020
scatterplot is very real-world in the sense

37
00:01:42.020 --> 00:01:45.030
that we don't get this incredibly clean separation

38
00:01:45.030 --> 00:01:47.060
that you so often see in practice

39
00:01:47.060 --> 00:01:49.050
files for cluster analysis.

40
00:01:49.050 --> 00:01:52.020
With real-world data, you'll have co-mixing.

41
00:01:52.020 --> 00:01:56.070
So by measuring the ratio between cohesion

42
00:01:56.070 --> 00:02:00.080
within the cluster, and separation between clusters,

43
00:02:00.080 --> 00:02:04.070
you get the ratio that's the building block of silhouette.

44
00:02:04.070 --> 00:02:07.020
So, whatever your tool of choice might

45
00:02:07.020 --> 00:02:10.080
be, MODELLER, something else like R

46
00:02:10.080 --> 00:02:13.090
or another commercial package, let's say like SAS,

47
00:02:13.090 --> 00:02:16.000
you should be able to find some kind

48
00:02:16.000 --> 00:02:20.000
of measure like this to evaluate the clusters.
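
The cohesion-versus-separation logic described above can be sketched in a few lines of plain Python. This is only an illustration, not how MODELLER, R, or SAS implement it. It assumes the standard silhouette definition, where cohesion a (mean distance to the other points in the same cluster) and separation b (mean distance to the points of the nearest other cluster) are combined as (b - a) / max(a, b). The function name `silhouette_values` and the sample points are made up for this example.

```python
from math import dist
from statistics import mean


def silhouette_values(points, labels):
    """Return one silhouette value per point.

    Assumes at least two clusters; the name and signature are illustrative.
    """
    # Group the points by cluster label.
    clusters = {}
    for p, lab in zip(points, labels):
        clusters.setdefault(lab, []).append(p)

    values = []
    for p, lab in zip(points, labels):
        own = clusters[lab]
        if len(own) == 1:
            # Common convention: a singleton cluster gets a value of 0.
            values.append(0.0)
            continue
        # Cohesion: mean distance to the OTHER points in the same cluster
        # (the distance from p to itself is 0, so divide by n - 1).
        a = sum(dist(p, q) for q in own) / (len(own) - 1)
        # Separation: mean distance to the nearest other cluster.
        b = min(
            mean(dist(p, q) for q in members)
            for other, members in clusters.items()
            if other != lab
        )
        denom = max(a, b)
        values.append((b - a) / denom if denom > 0 else 0.0)
    return values


# Hypothetical two-variable data with some overlap between clusters,
# loosely mimicking the small scatterplot in the video.
points = [(1.0, 1.0), (1.5, 1.2), (1.2, 0.8), (2.6, 2.4),
          (5.0, 5.0), (5.3, 4.7), (4.8, 5.4), (3.0, 2.8)]
labels = ["A", "A", "A", "A", "B", "B", "B", "B"]

scores = silhouette_values(points, labels)
for p, lab, s in zip(points, labels, scores):
    print(f"{lab} {p}: {s:+.3f}")
print(f"average silhouette: {mean(scores):+.3f}")
```

Each value falls between -1 and 1. Points sitting well inside their own cluster score close to 1, while points that are about as close to another cluster as to their own score near zero or below. Averaging the per-point values gives a single summary number of the kind a tool might turn into a quality indicator.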