WEBVTT 1 00:00:00.050 --> 00:00:02.050 - [Instructor] Okay, when you are taking a look 2 00:00:02.050 --> 00:00:05.070 at a data set for the first time and you're about to build 3 00:00:05.070 --> 00:00:07.070 a multiple regression model, 4 00:00:07.070 --> 00:00:10.060 there's no substitute for just taking your time 5 00:00:10.060 --> 00:00:13.050 and thoroughly examining it visually. 6 00:00:13.050 --> 00:00:15.000 So let's do just that. 7 00:00:15.000 --> 00:00:18.020 We're gonna go into one of the case study data files, 8 00:00:18.020 --> 00:00:22.060 specifically waste.sav. 9 00:00:22.060 --> 00:00:25.000 Let me orient you to the data set. 10 00:00:25.000 --> 00:00:26.090 It's pretty straightforward. 11 00:00:26.090 --> 00:00:32.080 We've got amounts of municipal waste in tons. 12 00:00:32.080 --> 00:00:36.060 So this first case here is just over 13 00:00:36.060 --> 00:00:40.070 a third of a million tons of waste in a calendar year, 14 00:00:40.070 --> 00:00:45.060 and then these other variables are land utilization. 15 00:00:45.060 --> 00:00:49.060 So we have industrial land 16 00:00:49.060 --> 00:00:54.020 in acres, we've got fabricated metals in acres, 17 00:00:54.020 --> 00:00:59.050 trucking, retail, and restaurants, all acreage. 18 00:00:59.050 --> 00:01:01.090 So we're trying to predict the amount 19 00:01:01.090 --> 00:01:04.080 of municipal waste per year based upon 20 00:01:04.080 --> 00:01:07.000 how the land is being used. 21 00:01:07.000 --> 00:01:10.080 So how should we go about exploring this visually? 22 00:01:10.080 --> 00:01:12.070 Again, you just have to jump in 23 00:01:12.070 --> 00:01:14.030 and get comfortable with the variables. 24 00:01:14.030 --> 00:01:16.060 And since these are all scale variables 25 00:01:16.060 --> 00:01:18.030 and since we're doing regression, 26 00:01:18.030 --> 00:01:22.020 the core visual approaches will be the histogram 27 00:01:22.020 --> 00:01:24.020 and the scatter plot.
28 00:01:24.020 --> 00:01:26.050 So let's just start with the first pair. 29 00:01:26.050 --> 00:01:30.020 And by the first pair I mean the dependent, waste, 30 00:01:30.020 --> 00:01:33.020 and the first of the independent variables. 31 00:01:33.020 --> 00:01:37.060 So we'll go into Chart Builder, 32 00:01:37.060 --> 00:01:43.000 and we're gonna do a scatter plot with a fit line. 33 00:01:43.000 --> 00:01:44.060 Waste is our dependent. 34 00:01:44.060 --> 00:01:47.040 The dependent will always go into the y axis. 35 00:01:47.040 --> 00:01:51.060 And industrial land will go into the x axis. 36 00:01:51.060 --> 00:01:54.030 I'm gonna go ahead and click on OK. 37 00:01:54.030 --> 00:01:56.020 Now as I look at this it's hard for me 38 00:01:56.020 --> 00:01:58.010 not to have a visceral reaction to it, 39 00:01:58.010 --> 00:01:59.070 even though I'm familiar with the data. 40 00:01:59.070 --> 00:02:02.030 Let's remind ourselves of the assumptions, though, 41 00:02:02.030 --> 00:02:04.060 so you get a sense of what I'm thinking about 42 00:02:04.060 --> 00:02:06.020 when I look at this. 43 00:02:06.020 --> 00:02:10.010 I want my variables to be normally distributed. 44 00:02:10.010 --> 00:02:13.010 I want the relationships to be linear, 45 00:02:13.010 --> 00:02:15.040 and oh my goodness, we have neither 46 00:02:15.040 --> 00:02:16.090 of those things going on here. 47 00:02:16.090 --> 00:02:18.010 We've got trouble. 48 00:02:18.010 --> 00:02:20.040 This really doesn't look like 49 00:02:20.040 --> 00:02:23.030 a very pretty scatter plot at all. 50 00:02:23.030 --> 00:02:25.030 And let's talk about why. 51 00:02:25.030 --> 00:02:27.060 So for instance, let me go ahead and right click 52 00:02:27.060 --> 00:02:29.080 and I'm gonna edit this in a separate window 53 00:02:29.080 --> 00:02:32.080 so it's clear what we're talking about.
54 00:02:32.080 --> 00:02:38.010 I actually can use this ID tool to identify 55 00:02:38.010 --> 00:02:42.010 which row of data we're talking about, 56 00:02:42.010 --> 00:02:46.000 and these three in particular are all kind of off 57 00:02:46.000 --> 00:02:47.040 in all directions. 58 00:02:47.040 --> 00:02:52.070 But then also notice in the extreme bottom left corner 59 00:02:52.070 --> 00:02:54.040 there's a whole bunch of data points 60 00:02:54.040 --> 00:02:56.060 that look like they're going straight up. 61 00:02:56.060 --> 00:03:00.070 They're not within a country mile of that regression line. 62 00:03:00.070 --> 00:03:04.010 So we have a very weak relationship here. 63 00:03:04.010 --> 00:03:05.080 One of the nice things about what 64 00:03:05.080 --> 00:03:09.020 SPSS automatically does with Chart Builder 65 00:03:09.020 --> 00:03:13.070 is it shows us the R squared in the upper right hand corner, 66 00:03:13.070 --> 00:03:17.030 and sure enough, if we multiply that times 100 67 00:03:17.030 --> 00:03:20.080 that is telling us that only 3.4% 68 00:03:20.080 --> 00:03:24.030 of variance is explained by industrial. 69 00:03:24.030 --> 00:03:26.040 And what can happen, actually, 70 00:03:26.040 --> 00:03:30.000 is when you have two data points like row ID 31 there 71 00:03:30.000 --> 00:03:35.090 and row ID 40, those two points seem to be 72 00:03:35.090 --> 00:03:37.080 pulling the regression line towards 73 00:03:37.080 --> 00:03:39.060 themselves on the right-hand side. 74 00:03:39.060 --> 00:03:40.080 That can happen. 75 00:03:40.080 --> 00:03:43.090 Data points can have what's called undue influence. 76 00:03:43.090 --> 00:03:47.030 There are very specific ways of measuring 77 00:03:47.030 --> 00:03:51.020 and diagnosing undue influence, 78 00:03:51.020 --> 00:03:54.010 but we seem to be seeing it visually here.
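A minimal Python sketch of the two ideas in this passage: R squared read as a percentage of variance, and a far-off point pulling the fit line toward itself. The helper name `fit_line` and every data value are made up for illustration; they are not the waste.sav values, and this is only the textbook least-squares formula for one predictor, not a claim about how SPSS computes its chart annotation.

```python
def fit_line(xs, ys):
    """Least-squares line for one predictor: returns (slope, intercept, r_squared)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    r_squared = (sxy * sxy) / (sxx * syy)
    return slope, intercept, r_squared


# Hypothetical points (not from waste.sav): a small cluster trending downward.
cluster_x = [1, 2, 3, 4, 5]
cluster_y = [5, 4, 3, 2, 1]
slope_cluster, _, r2_cluster = fit_line(cluster_x, cluster_y)

# Add one hypothetical extreme point far to the right and refit.
slope_all, _, r2_all = fit_line(cluster_x + [20], cluster_y + [20])

# R squared times 100 is the percentage of variance explained,
# so an R squared of 0.034 corresponds to the 3.4% mentioned above.
percent_explained = round(0.034 * 100, 1)

print("slope without the extreme point:", slope_cluster)
print("slope with the extreme point:   ", slope_all)
print("percent of variance explained:  ", percent_explained)
```

In this toy data a single distant point is enough to flip the sign of the slope, which is the kind of undue influence the scatter plot hints at visually.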
79 00:03:54.010 --> 00:03:57.040 So let me show you a different visual approach 80 00:03:57.040 --> 00:04:00.050 so we can start looking at the other variables. 81 00:04:00.050 --> 00:04:02.030 But we would wanna take the time to run 82 00:04:02.030 --> 00:04:04.090 that scatter plot on all of them. 83 00:04:04.090 --> 00:04:07.020 But let's use this alternate approach. 84 00:04:07.020 --> 00:04:10.040 We can go to the Graphboard Template Chooser. 85 00:04:10.040 --> 00:04:14.020 It's another way of doing graphics in SPSS. 86 00:04:14.020 --> 00:04:15.080 Because it's got a cool choice 87 00:04:15.080 --> 00:04:17.050 that's gonna be helpful to us. 88 00:04:17.050 --> 00:04:21.000 We can do what's called, I'm scrolling down to it now, 89 00:04:21.000 --> 00:04:25.040 a scatter plot matrix. 90 00:04:25.040 --> 00:04:28.030 And so that we can see what we're doing, 91 00:04:28.030 --> 00:04:31.090 I'm gonna go ahead and choose just three variables here. 92 00:04:31.090 --> 00:04:34.010 But make sure to include the dependent 93 00:04:34.010 --> 00:04:36.070 because that's really our focus right now. 94 00:04:36.070 --> 00:04:39.020 We wanna look at each of those independent variables 95 00:04:39.020 --> 00:04:41.060 against the dependent, but for now, 96 00:04:41.060 --> 00:04:44.030 we'll grab just two of them, retail and restaurants. 97 00:04:44.030 --> 00:04:47.040 After all, we've just seen industrial. 98 00:04:47.040 --> 00:04:48.050 And here you go, this is what 99 00:04:48.050 --> 00:04:50.000 our scatter plot matrix looks like. 100 00:04:50.000 --> 00:04:52.060 So it's a terribly useful way of looking at 101 00:04:52.060 --> 00:04:55.060 more than one variable at a time. 102 00:04:55.060 --> 00:05:00.020 The way this works is we can see that this row is retail 103 00:05:00.020 --> 00:05:03.090 and then over here on the right the column is waste tons. 104 00:05:03.090 --> 00:05:07.090 So this scatter plot shows one dot way up on the right.
105 00:05:07.090 --> 00:05:11.010 Again, there's probably a bit of an outlier there. 106 00:05:11.010 --> 00:05:13.000 It seems to be an extreme value 107 00:05:13.000 --> 00:05:17.010 on both retail and waste tons. 108 00:05:17.010 --> 00:05:20.000 Restaurants and hotels is here. 109 00:05:20.000 --> 00:05:21.050 But be careful because this is 110 00:05:21.050 --> 00:05:23.020 the relationship between restaurants 111 00:05:23.020 --> 00:05:26.010 and hotels in this column against retail. 112 00:05:26.010 --> 00:05:30.030 So if we wanna see restaurants against waste, 113 00:05:30.030 --> 00:05:32.080 I have to scroll up a little bit. 114 00:05:32.080 --> 00:05:35.030 And that's gonna be over here. 115 00:05:35.030 --> 00:05:37.080 We can see that that's the row for restaurants, 116 00:05:37.080 --> 00:05:40.090 and again, the column for waste tons. 117 00:05:40.090 --> 00:05:43.070 So in general, we could certainly say 118 00:05:43.070 --> 00:05:47.050 that the relationship for restaurants and retail 119 00:05:47.050 --> 00:05:51.030 isn't quite as nice and linear as we might like, 120 00:05:51.030 --> 00:05:53.080 but it's certainly better than 121 00:05:53.080 --> 00:05:56.060 industrial versus waste tons isn't it? 122 00:05:56.060 --> 00:05:59.000 Let's revisit restaurants against retail. 123 00:05:59.000 --> 00:06:00.030 That one's up here. 124 00:06:00.030 --> 00:06:03.030 This one actually is quite linear. 125 00:06:03.030 --> 00:06:06.000 You can imagine a regression line here 126 00:06:06.000 --> 00:06:08.050 that would pass through these points in such a way 127 00:06:08.050 --> 00:06:11.000 that they'd be fairly close. 128 00:06:11.000 --> 00:06:13.000 So oddly enough the relationship 129 00:06:13.000 --> 00:06:17.000 between restaurants and retail appears to be stronger 130 00:06:17.000 --> 00:06:21.010 than the relationship of either of them with waste tons. 
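A scatter plot matrix is a visual check; the same comparison can be put into numbers with pairwise Pearson correlations. The sketch below is only illustrative: the helper `pearson_r` and the three short columns are hypothetical stand-ins chosen so that the two predictors track each other more closely than either tracks the dependent, not the actual waste.sav data.

```python
from math import sqrt


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    return sxy / sqrt(sxx * syy)


# Hypothetical values for illustration only (not from waste.sav).
columns = {
    "retail": [0, 0, 1, 2, 4, 8],
    "restaurants": [0, 1, 1, 3, 5, 9],
    "waste": [3, 1, 6, 2, 9, 7],
}

names = list(columns)

# One correlation per cell, mirroring the layout of a scatter plot matrix.
corr = {(a, b): pearson_r(columns[a], columns[b]) for a in names for b in names}

for a in names:
    row = "  ".join(f"{corr[(a, b)]:6.3f}" for b in names)
    print(a.ljust(12), row)
```

When two predictors correlate more strongly with each other than with the dependent, as in this toy table, that is worth noting before the regression model is built.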
131 00:06:21.010 --> 00:06:22.060 Keep in mind that at this stage 132 00:06:22.060 --> 00:06:24.030 as you're exploring the data, 133 00:06:24.030 --> 00:06:27.080 the keyword to be thinking about is familiarity. 134 00:06:27.080 --> 00:06:29.080 You're trying to get familiar with the data. 135 00:06:29.080 --> 00:06:32.070 You're not drawing any concrete conclusions yet. 136 00:06:32.070 --> 00:06:36.010 But we're certainly learning some things about the data. 137 00:06:36.010 --> 00:06:38.040 Now we've looked at a bunch of scatter plots, 138 00:06:38.040 --> 00:06:41.080 but I've been ignoring the histogram up until now. 139 00:06:41.080 --> 00:06:44.000 So let's look at all three histograms, 140 00:06:44.000 --> 00:06:48.040 starting with retail in the upper left-hand corner. 141 00:06:48.040 --> 00:06:51.010 Does that look like a bell curve to us? 142 00:06:51.010 --> 00:06:53.020 Gosh, it really, really doesn't. 143 00:06:53.020 --> 00:06:56.040 There's a pile up of data points in the far left-hand side, 144 00:06:56.040 --> 00:06:58.070 which frankly is communities 145 00:06:58.070 --> 00:07:01.030 that have zero utilization of retail. 146 00:07:01.030 --> 00:07:03.030 There's just this huge spike at zero. 147 00:07:03.030 --> 00:07:05.040 A bell curve, we would imagine, 148 00:07:05.040 --> 00:07:08.050 fewer points on the left, lots of points in the middle, 149 00:07:08.050 --> 00:07:10.000 fewer points on the right. 150 00:07:10.000 --> 00:07:12.060 So this doesn't look anything like a bell curve at all. 151 00:07:12.060 --> 00:07:16.060 If we scroll down and we look at restaurants, 152 00:07:16.060 --> 00:07:18.040 that one looks about the same. 153 00:07:18.040 --> 00:07:21.080 There's this huge pile up at the far left-hand side. 154 00:07:21.080 --> 00:07:23.030 This shape, by the way everybody, 155 00:07:23.030 --> 00:07:24.050 has a specific name. 156 00:07:24.050 --> 00:07:26.070 This is called a skewed data set. 
157 00:07:26.070 --> 00:07:31.010 When it's piled up on the left and it's thin, 158 00:07:31.010 --> 00:07:33.090 pulled out to the right, that's called a positive skew. 159 00:07:33.090 --> 00:07:37.060 And the mirror image of this, with a pileup on the right 160 00:07:37.060 --> 00:07:40.060 and pulled thin out to the left 161 00:07:40.060 --> 00:07:43.010 would be called a negative skew. 162 00:07:43.010 --> 00:07:45.060 But when you describe data as being skewed, 163 00:07:45.060 --> 00:07:48.020 it's the same as saying that it doesn't look normal, 164 00:07:48.020 --> 00:07:50.010 it doesn't look like a bell curve. 165 00:07:50.010 --> 00:07:52.050 Finally let's look at the bottom right, 166 00:07:52.050 --> 00:07:55.030 and we see the histogram for waste tons. 167 00:07:55.030 --> 00:07:57.090 It seems somewhat less extreme 168 00:07:57.090 --> 00:08:00.080 in its skew compared to the other two. 169 00:08:00.080 --> 00:08:03.080 But this also does not look like a bell curve. 170 00:08:03.080 --> 00:08:06.090 Let me close on a minor technical detail. 171 00:08:06.090 --> 00:08:08.040 As it turns out, when you 172 00:08:08.040 --> 00:08:11.020 violate normality in regression, 173 00:08:11.020 --> 00:08:13.010 and clearly that's not a good thing, 174 00:08:13.010 --> 00:08:14.030 we don't want to embrace that, 175 00:08:14.030 --> 00:08:17.020 but when you violate normality in regression, 176 00:08:17.020 --> 00:08:19.000 if they all have a similar shape 177 00:08:19.000 --> 00:08:23.010 it's not quite as bad as if they all had different shapes. 178 00:08:23.010 --> 00:08:26.000 So the fact that they all have a positive skew here 179 00:08:26.000 --> 00:08:27.040 helps us out a little bit, 180 00:08:27.040 --> 00:08:29.080 but certainly the conclusion that we would draw 181 00:08:29.080 --> 00:08:34.080 is that the relationship between industrial and waste 182 00:08:34.080 --> 00:08:37.010 was not very promising at all.
183 00:08:37.010 --> 00:08:39.070 Restaurants and retail looks better. 184 00:08:39.070 --> 00:08:42.090 The data does not seem to be normally distributed, 185 00:08:42.090 --> 00:08:46.020 and then finally, one more that's important to remember, 186 00:08:46.020 --> 00:08:49.050 restaurants and retail seem to be more strongly related 187 00:08:49.050 --> 00:08:51.060 with each other than either of them 188 00:08:51.060 --> 00:08:53.040 is related to waste tons. 189 00:08:53.040 --> 00:08:54.070 We're just looking, at this point, 190 00:08:54.070 --> 00:08:56.090 but those are all important things to observe 191 00:08:56.090 --> 00:08:56.090 as we look at this data for the first time.
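The positive and negative skew described in the histogram discussion can also be summarized as a single number. This is a rough sketch using the moment-based skewness coefficient (third central moment divided by the second central moment raised to the power 1.5); the function name `skewness` and the sample lists are hypothetical, and statistical packages often report a slightly different, bias-adjusted version of this statistic.

```python
def skewness(values):
    """Moment-based skewness: positive when the long tail is on the right."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n  # second central moment
    m3 = sum((v - mean) ** 3 for v in values) / n  # third central moment
    return m3 / m2 ** 1.5


# Hypothetical samples, not taken from waste.sav.
piled_left = [0, 0, 0, 0, 1, 1, 2, 3, 10]     # spike at zero, thin tail to the right
piled_right = [10 - v for v in piled_left]    # mirror image of the sample above
symmetric = [1, 2, 3, 4, 5]                   # evenly balanced around its mean

print("piled up on the left: ", skewness(piled_left))
print("piled up on the right:", skewness(piled_right))
print("symmetric:            ", skewness(symmetric))
```

A value above zero matches the pileup-on-the-left shape, a value below zero matches its mirror image, and a value near zero is what a symmetric, bell-like shape would give.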