WEBVTT

1
00:00:00.040 --> 00:00:01.030
- [Instructor] We're going to pick up

2
00:00:01.030 --> 00:00:04.020
where we left off, in the same stream.

3
00:00:04.020 --> 00:00:06.070
What I want to talk about now is a tricky subject,

4
00:00:06.070 --> 00:00:09.010
because folks want a straightforward answer,

5
00:00:09.010 --> 00:00:11.090
and there's no straightforward answer to give,

6
00:00:11.090 --> 00:00:15.030
and that is, what's the best cluster.

7
00:00:15.030 --> 00:00:19.050
There are techniques we can use to try to narrow down

8
00:00:19.050 --> 00:00:22.020
the possibilities, but remember there are

9
00:00:22.020 --> 00:00:25.090
some considerations that fall outside the math.

10
00:00:25.090 --> 00:00:30.030
For instance, some years ago I was working with

11
00:00:30.030 --> 00:00:33.030
a cruise ship company, and what they wanted to do

12
00:00:33.030 --> 00:00:36.070
is try to anticipate where someone's next cruise

13
00:00:36.070 --> 00:00:40.090
would be, because if they could guess accurately

14
00:00:40.090 --> 00:00:45.050
they would then put an appropriate picture on the cover

15
00:00:45.050 --> 00:00:47.050
of the catalog that got mailed out.

16
00:00:47.050 --> 00:00:50.020
The idea was that if you were interested in trips

17
00:00:50.020 --> 00:00:54.080
like Alaska or the fjords, or Patagonia and you got

18
00:00:54.080 --> 00:00:57.050
a close-up of an umbrella drink in the Caribbean,

19
00:00:57.050 --> 00:01:00.050
that might turn you off because that would be a disconnect.

20
00:01:00.050 --> 00:01:03.020
So let's say we did that, and we looked at folks'

21
00:01:03.020 --> 00:01:08.060
cruise patterns and we came up with 11 clusters,

22
00:01:08.060 --> 00:01:10.050
or even seven or eight clusters.

23
00:01:10.050 --> 00:01:12.090
The marketing team might say, "We don't want to spend all

24
00:01:12.090 --> 00:01:15.060
"the money on that many catalog covers,

25
00:01:15.060 --> 00:01:18.010
"we want to do four or five catalog covers."

26
00:01:18.010 --> 00:01:19.020
So be careful.

27
00:01:19.020 --> 00:01:21.070
We're going to walk through the logic of this,

28
00:01:21.070 --> 00:01:24.050
but I want you to remember there's a lot of considerations

29
00:01:24.050 --> 00:01:27.050
that fall outside the cluster analysis itself

30
00:01:27.050 --> 00:01:30.070
that will influence the value of K.

31
00:01:30.070 --> 00:01:33.020
But let's take a look at the easiest way

32
00:01:33.020 --> 00:01:35.030
in Modeler to tackle this, and you should

33
00:01:35.030 --> 00:01:37.020
be able to do something similar, even if

34
00:01:37.020 --> 00:01:39.090
your tool of choice is not Modeler.

35
00:01:39.090 --> 00:01:43.000
So I'm going to go in and I'm going to go to Modeling

36
00:01:43.000 --> 00:01:48.030
and I'm going to choose the so-called Auto Cluster node.

37
00:01:48.030 --> 00:01:52.030
But what I'm going to do is go to the Expert settings,

38
00:01:52.030 --> 00:01:58.000
and go with just K-means now, just K-means.

39
00:01:58.000 --> 00:02:04.020
But I'm going to specify

40
00:02:04.020 --> 00:02:09.050
that K is going to be

41
00:02:09.050 --> 00:02:17.050
three, four, five, all the way up to eight.

42
00:02:17.050 --> 00:02:20.040
This is the easiest way to do it.

43
00:02:20.040 --> 00:02:23.060
There are other ways; for instance, TwoStep

44
00:02:23.060 --> 00:02:27.070
uses an alternate way of trying to tell you

45
00:02:27.070 --> 00:02:30.000
what the best value of K is.

46
00:02:30.000 --> 00:02:34.070
But what this is going to use is the Silhouette calculation,

47
00:02:34.070 --> 00:02:37.060
but it's going to do it for all six of our models

48
00:02:37.060 --> 00:02:39.030
all at once, it's very straightforward.

49
00:02:39.030 --> 00:02:44.000
So I'm going to go ahead and return to this other window,

50
00:02:44.000 --> 00:02:47.030
and tell it that we want to keep all six,

51
00:02:47.030 --> 00:02:51.060
and we're going to run.

52
00:02:51.060 --> 00:02:55.000
Produces my model, I'm going to take a look inside,

53
00:02:55.000 --> 00:02:58.020
and look at that, it's already sorted on Silhouette.

54
00:02:58.020 --> 00:03:03.090
So, according to this, the best solution is three clusters,

55
00:03:03.090 --> 00:03:07.000
and the second best solution is six.

56
00:03:07.000 --> 00:03:09.070
Followed by five, and so on.

57
00:03:09.070 --> 00:03:12.070
So this is interesting stuff, and certainly if

58
00:03:12.070 --> 00:03:16.040
I'm going to start working through a thorough exploration

59
00:03:16.040 --> 00:03:21.090
of my solutions, I might want to start with three and six.

60
00:03:21.090 --> 00:03:27.050
But again, don't forget that business considerations and/or

61
00:03:27.050 --> 00:03:30.080
information and variables outside of the cluster,

62
00:03:30.080 --> 00:03:34.090
like distance to store, or what cities they live in,

63
00:03:34.090 --> 00:03:38.000
or whether they're rural or urban, or if they've come to a

64
00:03:38.000 --> 00:03:42.020
brick and mortar location or an online location,

65
00:03:42.020 --> 00:03:46.020
that also, all that stuff is going to influence our decision.

66
00:03:46.020 --> 00:03:50.040
But according to Silhouette, three is the best value of K,

67
00:03:50.040 --> 00:03:52.050
and we'll certainly consider that solution

68
00:03:52.050 --> 00:03:53.050
along with the others.

69
00:03:53.050 --> 00:03:56.000
But the final decision is ours.
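
For anyone following along without Modeler, the comparison described in the demo (fit K-means for K = 3 through 8, then sort the six models on Silhouette) can be sketched in plain Python. This is only an illustrative sketch, not what the Auto Cluster node runs internally. The function names (`farthest_point_init`, `kmeans`, `silhouette`, `rank_k_by_silhouette`), the deterministic seeding rule, and the toy data are all assumptions made for the example. In practice a library such as scikit-learn (`KMeans`, `silhouette_score`) would normally do this work.

```python
import math

def farthest_point_init(points, k):
    """Deterministic seeding (an assumption of this sketch): start from the
    first point, then keep adding the point farthest from the chosen centers."""
    centers = [points[0]]
    while len(centers) < k:
        centers.append(max(points, key=lambda p: min(math.dist(p, c) for c in centers)))
    return centers

def kmeans(points, k, iters=100):
    """Plain Lloyd's algorithm; returns one cluster label per point."""
    centers = farthest_point_init(points, k)
    labels = None
    for _ in range(iters):
        # Assign every point to its nearest center.
        new_labels = [min(range(k), key=lambda c: math.dist(p, centers[c])) for p in points]
        if new_labels == labels:
            break  # assignments stopped changing
        labels = new_labels
        # Move each center to the mean of its members (left alone if empty).
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                centers[c] = tuple(sum(axis) / len(members) for axis in zip(*members))
    return labels

def silhouette(points, labels):
    """Mean silhouette width; points in single-member clusters score 0."""
    clusters = {}
    for p, lab in zip(points, labels):
        clusters.setdefault(lab, []).append(p)
    if len(clusters) < 2:
        raise ValueError("silhouette needs at least two clusters")
    total = 0.0
    for p, lab in zip(points, labels):
        own = clusters[lab]
        if len(own) == 1:
            continue
        # a: mean distance to the other members of the point's own cluster.
        a = sum(math.dist(p, q) for q in own) / (len(own) - 1)
        # b: mean distance to the members of the nearest other cluster.
        b = min(sum(math.dist(p, q) for q in other) / len(other)
                for other_lab, other in clusters.items() if other_lab != lab)
        if max(a, b) > 0:
            total += (b - a) / max(a, b)
    return total / len(points)

def rank_k_by_silhouette(points, ks):
    """Fit one K-means model per K and sort them by silhouette, best first."""
    scored = [(k, silhouette(points, kmeans(points, k))) for k in ks]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)

# Hypothetical toy data: three tight, well-separated 3x3 grids of points.
offsets = (-0.5, 0.0, 0.5)
points = [(cx + dx, cy + dy)
          for cx, cy in [(0, 0), (20, 0), (10, 20)]
          for dx in offsets for dy in offsets]

ranking = rank_k_by_silhouette(points, range(3, 9))  # K = 3 through 8
for k, score in ranking:
    print(f"K={k}: silhouette={score:.3f}")
```

Because the toy points were built as three separate groups, K = 3 should sort to the top here. On real customer data the ranking is only a starting point, and the business considerations raised in the video still decide the final K.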