The following content is provided under a Creative Commons license. Your support will help MIT OpenCourseWare continue to offer high quality educational resources for free. To make a donation, or to view additional materials from hundreds of MIT courses, visit MIT OpenCourseWare at ocw.mit.edu.

JOHN GUTTAG: Hello, everybody. Some announcements. The last reading assignment of the semester, at least from us. Course evaluations are still available through this Friday, but only till noon. Again, I urge you all to do it.

And then finally, for the final exam, we're going to be giving you some code to study in advance of the exam. And then we will ask questions about that code on the exam itself. This was described in the announcement for the exam. And we will be making this code available later today. Now, I would suggest that you try and get your heads around it. If you are confused, that's a good thing to talk about in office hours, to get some help with it, as opposed to waiting till 20 minutes before the exam and realizing you're confused.

All right. I want to pick up where we left off on Monday. So you may recall that we were comparing results of KNN and logistic regression on our Titanic data. And we had this up, using 10 80/20 splits for KNN with k equals 3 and logistic regression with p equals 0.5. And what I observed is that logistic regression happened to perform slightly better, but certainly nothing that you would choose to write home about. It's a little bit better. That isn't to say it will always be better. It happens to be here.

But the point I closed with is that one of the things we care about when we use machine learning is not only our ability to make predictions with the model, but what we can learn by studying the model itself. Remember, the idea is that the model is somehow capturing the system or the process that generated the data. And by studying the model we can learn something useful.
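For reference, a minimal sketch of the 10-splits comparison described above, assuming scikit-learn and a feature matrix X with 0/1 labels y already built from the Titanic data; the function and variable names are illustrative, not the course code:

import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.linear_model import LogisticRegression

def compare_models(X, y, num_splits=10):
    # Average accuracy of KNN (k=3) and logistic regression over
    # num_splits random 80/20 train/test splits.
    knn_acc, lr_acc = [], []
    for trial in range(num_splits):
        X_train, X_test, y_train, y_test = train_test_split(
            X, y, test_size=0.2, random_state=trial)        # one 80/20 split
        knn = KNeighborsClassifier(n_neighbors=3).fit(X_train, y_train)
        knn_acc.append(knn.score(X_test, y_test))
        lr = LogisticRegression().fit(X_train, y_train)      # predict() uses the p = 0.5 cutoff
        lr_acc.append(lr.score(X_test, y_test))
    print('Mean accuracy, KNN (k=3):          ', round(float(np.mean(knn_acc)), 3))
    print('Mean accuracy, logistic regression:', round(float(np.mean(lr_acc)), 3))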
So to do that for logistic regression, we begin by looking at the weights of the different variables. And we had this up in the last slide. The model classes are "Died" and "Survived." For the label Survived, we said that if you were in a first-class cabin, that had a positive impact on your survival, a pretty strong positive impact.

You can't interpret these weights in and of themselves. If I said it's 1.6, that really doesn't mean anything. So what you have to look at is the relative weights, not the absolute weights. And we see that it's a pretty strong relative weight. A second-class cabin also has a positive weight, in this case, of 0.46. So it was indicating you had a better-than-average chance of surviving, but much less strong than first class. And if you were one of those poor people in a third-class cabin, well, that had a negative weight on survival. You were less likely to survive.

Age had a very small effect here, slightly negative. What that meant is the older you were, the less likely you were to have survived. But it's a very small negative value. The male gender had a relatively large negative weight, suggesting that if you were a male you were more likely to die than if you were a female. This might be true in the general population, but it was especially true on the Titanic.

Finally, I warned you that while what I just went through is something you will read in lots of papers that use machine learning, and will hear in lots of talks about people who have used machine learning, you should be very wary when people speak that way. It's not nonsense, but there are some cautionary notes. In particular, there's a big issue because the features are often correlated with one another. And so you can't interpret the weights one feature at a time.

To get a little bit technical, there are two major ways people use logistic regression. They're called L1 and L2. We used an L2. I'll come back to that in a minute, because that's the default in Python, or in [INAUDIBLE].
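If the library in question is scikit-learn, the parameter being referred to looks roughly like this; a sketch, not the course code (the solver argument is only there because scikit-learn's L1 penalty needs a solver that supports it):

from sklearn.linear_model import LogisticRegression

l2_model = LogisticRegression(penalty='l2')                      # the default
l1_model = LogisticRegression(penalty='l1', solver='liblinear')  # drives some weights to 0

# After fitting, the weights being discussed live in the coef_ attribute, e.g.:
#   l2_model.fit(X_train, y_train); print(l2_model.coef_)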
You can set that parameter to L2, and change it to L1 if you want. I experimented with it. It didn't change the results that much. But what an L1 regression is designed to do is to find some weights and drive them to 0. This is particularly useful when you have a very high-dimensional problem relative to the number of examples. And this gets back to that question we've talked about many times, of overfitting. If you've got 1,000 variables and 1,000 examples, you're very likely to overfit. L1 is designed to avoid overfitting by taking many of those 1,000 variables and just giving them 0 weight. And it does typically generalize better. But if you have two variables that are correlated, L1 will drive one to 0, and it will look like it's unimportant. But in fact, it might be important. It's just correlated with another one, which has gotten all the credit.

L2, which is what we did, does the opposite. It spreads the weight across the variables. So if you have a bunch of correlated variables, it might look like none of them are very important, because each of them gets a small amount of the weight. Again, not so important when you have four or five variables, which is what I'm showing you. But it matters when you have 100 or 1,000 variables.

Let's look at an example. So the cabin classes, the way we set it up, c1 plus c2 plus c3 -- whoops -- is not equal to 0. What is it equal to? I'll fix this right now. What should that have said? What's the invariant here? Well, a person is in exactly one class. I guess if you're really rich, maybe you rented two cabins, one in first and one in second. But probably not. Or if you did, you put your servants in second or third. But what does this have to add up to? Yeah?

AUDIENCE: 1.

JOHN GUTTAG: Has to add up to 1. Thank you. So it adds up to 1. Whoa. Got his attention, at least.
So what this tells us is the values are not independent. Because if c1 is 1, then c2 and c3 must be 0. Right? And so now we could go back to the previous slide and ask the question, well, is it that being in first class is protective? Or is it that being in second or third class is risky? And there's no simple answer to that.

So let's do an experiment. We have these correlated variables. Suppose we eliminate c1 altogether. So I did that by changing the init method of class Passenger. It takes the same arguments, but we'll look at the code, because it's a little bit clearer there. So there was the original one. And I'm going to replace that by this. You can compare that with the original one. So what you see is that instead of having five features, I now have four. I've eliminated the c1 binary feature. And then the code is straightforward. I've just come through here, and I've just enumerated the possibilities. So if you're in first class, then second and third are both 0. Otherwise, one of them is a 1. So my invariant is gone now, right? It's not the case that we know that these two things have to add up to 1, because maybe I'm in the third case.

OK, let's go run that code and see what happens. Well, if you remember, we see that our accuracy has not really declined much. Pretty much the same results we got before. But our weights are really quite different. Now, suddenly, c2 and c3 have large negative weights. We can look at them side by side here. So you see, not much difference. It actually performs maybe -- well, really no real difference in performance. But you'll notice that the weights are really quite different. What had been a strong positive weight and relatively weak negative weights is now replaced by two strong negative weights. And age and gender change just a little bit.
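For concreteness, here is roughly what the change amounts to; a sketch with made-up names, not the actual init method of class Passenger:

def features_with_c1(cabin_class, age, is_male):
    # One-hot cabin class: c1 + c2 + c3 is always 1.
    c1 = 1 if cabin_class == 1 else 0
    c2 = 1 if cabin_class == 2 else 0
    c3 = 1 if cabin_class == 3 else 0
    return [c1, c2, c3, age, is_male]      # five features

def features_without_c1(cabin_class, age, is_male):
    # Drop c1: first class is now encoded as c2 == c3 == 0,
    # so the sum-to-1 invariant is gone.
    c2 = 1 if cabin_class == 2 else 0
    c3 = 1 if cabin_class == 3 else 0
    return [c2, c3, age, is_male]          # four features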
So the whole point here is that we have to be very careful, when you have correlated features, about over-interpreting the weights. It is generally pretty safe to rely on the sign, whether it's negative or positive.

All right, changing the topic but sticking with logistic regression, there is this parameter, you may recall, p, which is the probability. And that was the cutoff. And we set it to 0.5, saying if it estimates the probability of survival to be 0.5 or higher, then we're going to guess survived, predict survived. Otherwise, deceased. You can change that. And so I'm going to try two extreme values, setting p to 0.1 and p to 0.9.

Now, what do we think that's likely to change? Remember, we looked at a bunch of different attributes. In particular, what attributes do we think are most likely to change? Anyone who has not answered a question want to volunteer? I have nothing against you, it's just I'm trying to spread the wealth. And I don't want to give you diabetes, with all the candy. All right, you get to go again.

AUDIENCE: Sensitivity.

JOHN GUTTAG: Pardon?

AUDIENCE: The sensitivity and specificity.

JOHN GUTTAG: Sensitivity and specificity, positive predictive value. Because we're shifting. And we're saying, well, by changing the probability, we're making a decision that it's more important to not miss survivors than it is to, say, let the number of false positives get too high.

So let's look at what happens when we run that. I won't run it for you. But these are the results we got. So as it happens, 0.9 gave me higher accuracy. But the key thing is, notice the big difference here. So what is that telling me? Well, it's telling me that if I predict you're going to survive, you probably did. But look what it did to the sensitivity. It means that for most of the survivors, I'm predicting they died. Why is the accuracy still OK?
Well, because most people died on the boat -- on the ship, right? So we would have done pretty well, you recall, if we just guessed died for everybody. So it's important to understand these things.

I once did some work using machine learning for an insurance company that was trying to set rates. And I asked them what they wanted to do. And they said they didn't want to lose money. They didn't want to insure people who were going to get in accidents. So I was able to change this p parameter so that it did a great job. The problem was they got to write almost no policies. Because I could pretty much guarantee the people I said wouldn't get in an accident wouldn't. But there were a whole bunch of people who didn't get in accidents who they wouldn't write policies for. So they ended up not making any money. It was a bad decision.

So we can change the cutoff. That leads to a really important concept, something called the Receiver Operating Characteristic. And it's a funny name, having to do with it originally going back to radio receivers. But we can ignore that. The goal here is to say, suppose I don't want to make a decision about where the cutoff is, but I want to look at, in some sense, all possible cutoffs and look at the shape of it. And that's what this code is designed to do.

So the way it works is I'll take a training set and a test set, the usual thing. I'll build one model. And that's an important thing, that there's only one model getting built. And then I'm going to vary p. And I'm going to call apply model with the same model and the same test set, but different p's, and keep track of all of those results. I'm then going to plot a two-dimensional plot. The y-axis will have sensitivity. And the x-axis will have 1 minus specificity. So I am accumulating a bunch of results. And then I'm going to produce this curve calling sklearn.metrics.auc -- that's not the curve. AUC stands for Area Under the Curve.
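A rough sketch of that procedure, assuming a scikit-learn style model with predict_proba and numpy arrays of 0/1 test labels; the lecture's apply_model and plotting code may differ:

import numpy as np
import sklearn.metrics
import matplotlib.pyplot as plt

def build_roc(model, X_test, y_test, title='ROC'):
    probs = model.predict_proba(X_test)[:, 1]     # estimated P(survived)
    x_vals, y_vals = [], []                       # 1 - specificity, sensitivity
    for p in np.arange(0.0, 1.01, 0.01):          # sweep the cutoff p
        preds = probs >= p
        tp = np.sum(preds & (y_test == 1))
        fn = np.sum(~preds & (y_test == 1))
        tn = np.sum(~preds & (y_test == 0))
        fp = np.sum(preds & (y_test == 0))
        y_vals.append(tp / (tp + fn))             # sensitivity
        x_vals.append(1 - tn / (tn + fp))         # 1 - specificity
    auroc = sklearn.metrics.auc(x_vals, y_vals)   # area under the curve
    plt.plot(x_vals, y_vals, label='model')
    plt.plot([0, 1], [0, 1], '--', label='random classifier')
    plt.xlabel('1 - specificity')
    plt.ylabel('Sensitivity')
    plt.title(title + ', AUROC = ' + str(round(auroc, 3)))
    plt.legend()
    plt.show()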
And we'll see why we want to get that area under the curve. When I run that, it produces this. So here's the curve, the blue line. And there are some things to note about it.

Way down at this end I can have 0, right? I can set it so that I don't make any predictions. And this is interesting. So at this end it is saying what? Remember that my x-axis is not specificity, but 1 minus specificity. So what we see is that this corner is highly sensitive and very unspecific. So I'll get a lot of false positives. This corner is very specific, because 1 minus specificity is 0, and very insensitive. So way down at the bottom, I'm declaring nobody to be positive. And way up here, everybody. Clearly, I don't want to be at either of these places on the curve, right? Typically I want to be somewhere in the middle. And here, we can see, there's a nice knee in the curve. We can choose a place.

What does this green line represent, do you think? The green line represents a random classifier. I flip a coin and I just classify something positive or negative, depending on heads or tails, in this case.

So now we can look at an interesting region, which is this region, the area between the curve and a random classifier. And that sort of tells me how much better I am than random. I can look at the whole area, the area under the curve. And that's this, the area under the Receiver Operating Curve. In the best of all worlds, that area would be 1. That would be a perfect classifier. In the worst of all worlds, it would be 0. But in practice it's never 0, because we never have to do worse than 0.5. We hope not to do worse than random. And if we do, we can just reverse our predictions, and then we're better than random. So random is as bad as you can do, really. And so this is a very important concept. And it lets us evaluate how good a classifier is independently of what we choose to be the cutoff.
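For reference, the standard definitions behind those two axes, written in terms of true/false positives and negatives (TP, FP, TN, FN):

\[
\text{sensitivity} = \frac{TP}{TP + FN}, \qquad
\text{specificity} = \frac{TN}{TN + FP}, \qquad
1 - \text{specificity} = \frac{FP}{TN + FP}
\]

So the x-axis is the false positive rate, which is why a random classifier traces the diagonal.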
So when you read the literature and people say, I have this wonderful method of making predictions, you'll almost always see them cite the AUROC.

Any questions about this, or about machine learning in general? If so, this would be a good time to ask them, since I'm about to totally change the topic. Yes?

AUDIENCE: At what level does AUROC start to be statistically significant? And how many data points do you need to also prove that [INAUDIBLE]?

JOHN GUTTAG: Right. So the question is, at what point does the AUROC become statistically significant? And that is, essentially, an unanswerable question.

Whoops, relay it back. Needed to put more air under the throw. I look like the quarterback for the Rams, if you saw them play lately.

So if you ask this question about significance, it will depend upon a number of things. So you're always asking, is it significantly better than x? And so the question is, is it significantly better than random? And you can't just say, for example, that 0.6 isn't and 0.7 is, because it depends how many points you have. If you have a lot of points, it could be only a tiny bit better than 0.5 and still be statistically significant. It may be uninterestingly better. It may not be significant in the English sense, but you still get statistical significance. So that's a problem when studies have lots of points.

In general, it depends upon the application. For a lot of applications, you'll see things in the 0.7s being considered pretty useful. And the real question shouldn't be whether it's significant, but whether it's useful. Can you make useful decisions based upon it? And the other thing is, typically, when you're talking about that, you're selecting some point and really talking about a region relative to that point. We usually don't really care what it does out here, because we hardly ever operate out there anyway. We're usually somewhere in the middle.
But good question. Yeah?

AUDIENCE: Why are we doing 1 minus specificity?

JOHN GUTTAG: Why are we doing 1 minus specificity instead of specificity? Is that the question? And the answer is, essentially, so we can do this trick of computing the area. It gives us this nice curve, this nice, if you will, concave curve, which lets us compute this area under here nicely. If you were to take specificity and just draw it, it would look different. Obviously, mathematically, they're, in some sense, the same, right? If you have 1 minus x and x, you can get either from the other. So it really just has to do with the way people want to draw this picture.

AUDIENCE: [INAUDIBLE]?

JOHN GUTTAG: Pardon?

AUDIENCE: Does that not change [INAUDIBLE]?

JOHN GUTTAG: Does it not--

AUDIENCE: Doesn't it change the meaning of what you're [INAUDIBLE]?

JOHN GUTTAG: Well, you'd have to use a different statistic. You couldn't cite the AUROC if you did specificity directly, which is why they do 1 minus. The goal is you want to have this point at (0, 0) and this one at (1, 1). And plotting 1 minus gives you this trick of anchoring those two points. And so then you get a curve connecting them, which you can then easily compare to the random curve. It's just one of these little tricks that statisticians like to play to make things easy to visualize and easy to compute statistics about. It's not a fundamentally important issue.

Anything else? All right, so I told you I was going to change topics -- finally got one completed -- and I am. And this is a topic I approach with some reluctance. So you have probably all heard this expression, that there are three kinds of lies: lies, damn lies, and statistics. And we've been talking a lot about statistics.
And now I want to spend the rest of today's lecture and the start of Wednesday's lecture talking about how to lie with statistics. So at this point, I usually put on my "Numbers Never Lie" hat. But I do say that numbers never lie, but liars use numbers. And I hope none of you will ever go work for a politician and put this knowledge to bad use.

This quote is well known. It's variously attributed, often to Mark Twain, the fellow on the left. He claimed not to have invented it, but said it was invented by Benjamin Disraeli. And I prefer to believe that, since it does seem like something a Prime Minister would invent.

So let's think about this. The issue here is the way the human mind works and statistics. Darrell Huff, a well-known statistician who did write a book called How to Lie with Statistics, says, "If you can't prove what you want to prove, demonstrate something else and pretend they are the same thing. In the daze that follows the collision of statistics with the human mind, hardly anyone will notice the difference." And indeed, empirically, he seems to be right.

So let's look at some examples. Here's one I like. This is from another famous statistician called Anscombe. And he invented this thing called Anscombe's Quartet. I'll take my hat off now. It's too hot in here.

A bunch of numbers, 11 x, y pairs. I know you don't want to look at the numbers, so here are some statistics about them. Each of those sets of pairs has the same mean value for x, the same mean for y, the same variance for x, the same variance for y. And then I went and I fit a linear regression model to it. And lo and behold, I got the same equation for every one: y equals 0.5x plus 3. So that raises the question, if we go back, is there really much difference between these sets of x, y pairs? Are they really similar? And the answer is, that's what they look like if you plot them.
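If you want to reproduce that check yourself, a quick sketch using the copy of Anscombe's quartet that ships with seaborn (assuming seaborn, numpy, and matplotlib are available):

import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt

quartet = sns.load_dataset('anscombe')                 # columns: dataset, x, y
for name, group in quartet.groupby('dataset'):
    slope, intercept = np.polyfit(group.x, group.y, 1)
    print(name,
          'mean x =', round(group.x.mean(), 2), 'mean y =', round(group.y.mean(), 2),
          'var x =', round(group.x.var(), 2), 'var y =', round(group.y.var(), 2),
          'fit: y =', round(slope, 2), '* x +', round(intercept, 2))

# The summary statistics all match; the pictures do not.
sns.lmplot(data=quartet, x='x', y='y', col='dataset', col_wrap=2)
plt.show()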
So even though statistically they appear to be kind of the same, they could hardly be more different, right? Those are not the same distributions. So there's an important moral here, which is that statistics about data is not the same thing as the data itself. And this seems obvious, but it's amazing how easy it is to forget it. The number of papers I've read where I see a bunch of statistics about the data but don't see the data is enormous. And it's easy to lose track of the fact that the statistics don't tell the whole story.

So the answer is the old Chinese proverb, a picture is worth a thousand words. I urge you, the first thing you should do when you get a data set is plot it. If it's got too many points to plot all the points, subsample it and plot the subsample. Use some visualization tool to look at the data itself.

Now, that said, pictures are wonderful, but you can lie with pictures. So here's an interesting chart. These are grades in 6.0001 by gender. So the males are blue and the females are pink. Sorry for being such a traditionalist. And as you can see, the women did way better than the men. Now, I know for some of you this is confirmation bias. You say, of course. Others say, impossible. But in fact, if you look carefully, you'll see that's not what this chart says at all. Because if you look at the axis here, you'll see that actually there's not much difference.

Here's what I get if I plot it from 0 to 5. Yeah, the women did a little bit better. But that's not a statistically significant difference. And by the way, when I plotted it last year for 6.0002, the blue was about that much higher than the pink. Don't read much into either of them. But the trick was, here I took the y-axis and ran it from 3.9 to 4.05. I cleverly chose my baseline in such a way as to make the difference look much bigger than it is. Here I did the honest thing of putting the baseline at 0 and running it to 5.
Because that's the range of grades at MIT. And so when you look at a chart, it's important to keep in mind that you need to look at the axis labels and the scales.

Let's look at another chart, just in case you think I'm the only one who likes to play with graphics. This is a chart from Fox News. And they're arguing here -- it's the shocking statistic that there are 108.6 million people on welfare, and 101.7 million with a full-time job. And you can imagine the rhetoric that accompanies this chart. This is actually correct. It is true from the Census Bureau data. Sort of.

But notice that I said you should read the labels on the axes. There is no label here. But you can bet that the y-intercept is not 0 on this, because you can see how small 101.7 looks. So it makes the difference look bigger than it is.

Now, that's not the only funny thing about it. I said you should look at the labels on the x-axis. Well, they've labeled them. But what do these things mean? Well, I looked it up, and I'll tell you what they actually mean. People on welfare counts the number of people in a household in which at least one person is on welfare. So if there are, say, two parents, one is working and one is collecting welfare, and there are four kids, that counts as six people on welfare. People with a full-time job, on the other hand, does not count households. So in the same family, you would have six on the bar on the left, and one on the bar on the right. Clearly giving a very different impression.

And so again, pictures can be good. But if you don't dive deep into them, they really can fool you. Now, before I leave this slide, I should say that it's not the case that you can't believe anything you read on Fox News. Because in fact, the Red Sox did beat the St. Louis Cardinals 4 to 2 that day.

So the moral here is to ask whether the things being compared are actually comparable.
Or whether you're really comparing apples and oranges, as they say.

OK, this is probably the most common statistical sin. It's called GIGO. And perhaps this picture can make you guess what the G's stand for. GIGO is Garbage In, Garbage Out.

So here's a great quote about it. Charles Babbage designed the first digital computer, the first actual computation engine. He was unable to build it. But more than a hundred years after he died, one was built according to his design, and it actually worked. No electronics, really. So he was a famous person. And he was asked by Parliament about his machine, which he was asking them to fund: well, if you put wrong numbers into the machine, will the machine have right numbers come out the other end? And of course, he was a very smart guy, and he was totally baffled. This question seemed so stupid, he couldn't believe anyone would even ask it. It was just computation. And the answers you get are based on the data you put in. If you put in garbage, you get out garbage.

So here is an example from the 1840s. They did a census in the 1840s. And for those of you who are not familiar with American history, it was a very contentious time in the US. The country was divided between states that had slavery and states that didn't. And that was the dominant political issue of the day. John Calhoun, who was Secretary of State and a leader in the Senate, was from South Carolina and probably the strongest proponent of slavery. And he used the census data to say that slavery was actually good for the slaves. Kind of an amazing thought. Basically saying that this data claimed that freed slaves were more likely to be insane than enslaved slaves.

He was rebutted in the House by John Quincy Adams, who had formerly been President of the United States. After he stopped being President, he ran for Congress, from Braintree, Massachusetts.
Actually, the part he's from is now called Quincy, after his family. And he claimed that atrocious misrepresentations had been made on a subject of deep importance. He was an abolitionist. So you don't even have to look at the statistics to know who to believe. Just look at these pictures. Are you going to believe this nice gentleman from Braintree or this scary guy from South Carolina?

But setting looks aside, Calhoun eventually admitted that the census was indeed full of errors. But he said that was fine, because there were so many of them that they would balance each other out and lead to the same conclusion as if they were all correct. So he didn't believe in garbage in, garbage out. He said, yeah, it is garbage, but it'll all come out in the end OK.

Well, now we know enough to ask the question. This isn't totally brain dead, in that we've already looked at experiments and said we get experimental error. And under some circumstances, you can manage the error. The data isn't garbage, it just has errors. But that's true only if the measurement errors are unbiased and independent of each other, and almost identically distributed on either side of the mean, right? That's why we spend so much time looking at the normal distribution, and why it's called Gaussian. Because Gauss said, yes, I know I have errors in my astronomical measurements. But I believe my errors are distributed in what we now call a Gaussian curve. And therefore, I can still work with them and get an accurate estimate of the values.

Now, of course, that wasn't true here. The errors were not random. They were, in fact, quite systematic, designed to produce a certain conclusion. And the last word was from another abolitionist who claimed it was the census that was insane.

All right, that's Garbage In, Garbage Out. The moral here is that analysis of bad data is worse than no analysis at all, really.
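A tiny simulation of the distinction being drawn, with made-up numbers: unbiased, independent measurement errors average out, while a systematic error does not.

import random

random.seed(0)
true_value = 100.0
n = 10000

# Unbiased errors: Gaussian noise centered on 0.
unbiased = [true_value + random.gauss(0, 5) for _ in range(n)]
# Systematic errors: every measurement is pushed the same way (+3 here).
biased = [true_value + random.gauss(3, 5) for _ in range(n)]

print('True value:                 ', true_value)
print('Estimate, unbiased errors:  ', round(sum(unbiased) / n, 2))
print('Estimate, systematic errors:', round(sum(biased) / n, 2))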
Time and again we see people doing, actually quite often, a correct statistical analysis of incorrect data and reaching conclusions. And that's really risky. So before one goes off and starts using statistical techniques of the sort we've been discussing, the first question you have to ask is, is the data itself worth analyzing? And it often isn't.

Now, you could argue that this is a thing of the past, and no modern politician would make these kinds of mistakes. I'm not going to insert a photo here. But I leave it to you to think which politician's photo you might paste in this frame.

All right, on to another statistical sin. This is a picture of a World War II fighter plane. I don't know enough about planes to know what kind of plane it is. Anyone here? There must be an Aero student who will be able to tell me what plane this is. Don't they teach you guys anything in Aero these days? Shame on them. All right. Anyway, it's a plane. That much I know. And it has a propeller. And that's all I can tell you about the airplane.

So this was a photo taken at an airfield in Britain. And the Allies would send planes over Germany for bombing runs, and fighters to protect the bombers. And when they came back, the planes were often damaged. And they would inspect the damage and say, look, there's a lot of flak there -- the Germans shot flak at the planes -- and that would be a part of the plane that maybe we should reinforce in the future. So when it gets hit by flak it survives it; the flak does less damage. So you could analyze where the Germans were hitting the planes, and you would add a little extra armor to that part of the plane.

What's the flaw in that? Yeah?

AUDIENCE: They didn't look at the planes that actually got shot down.

JOHN GUTTAG: Yeah. This is what's called, in the jargon, survivor bias. S-U-R-V-I-V-O-R.
The planes they really should have been analyzing were the ones that got shot down. But those were hard to analyze. So they analyzed the ones they had and drew conclusions, and perhaps totally the wrong conclusion. Maybe the conclusion they should have drawn is, well, it's OK if you get hit here; let's reinforce the other places. I don't know enough to know what the right answer was. I do know that this was statistically the wrong thing to be thinking about doing.

And this is an issue we have whenever we do sampling. All statistical techniques are based upon the assumption that by sampling a subset of the population we can infer things about the population as a whole. Everything we've done this term has been based on that. When we were fitting curves, we were doing that. When we were talking about the empirical rule and Monte Carlo simulation, we were doing that. When we were building models with machine learning, we were doing that.

And if random sampling is used, you can make meaningful mathematical statements about the relation of the sample to the entire population. And that's why so much of what we did works. And when we're doing simulations, that's really easy. When we were choosing random values of the needles for trying to find pi, or random values for the roulette wheel spins, we could be pretty sure our samples were, indeed, random.

In the field, it's not so easy. Right? Because some samples are much more convenient to acquire than others. It's much easier to acquire a plane on an airfield in Britain than a plane on the ground in France. Convenience sampling, as it's often called, is not usually random. So you have survivor bias.

So I asked you to do course evaluations. Well, there's survivor bias there. The people who really hated this course have already dropped it. And so we won't sample them. That's good for me, at least. But we see that. We see that with grades.
The people who are really struggling, who were most likely to fail, have probably dropped the course too. That's one of the reasons I don't think it's fair to say, we're going to have a curve, and we're going to always fail this fraction and give A's to this fraction. Because by the end of the term, we have a lot of survivor bias. The students who are left are, on average, better than the students who started the semester. So you need to take that into account.

Another kind of non-representative sampling, or convenience sampling, is opinion polls, in that you have something there called non-response bias. So I don't know about you, but I get phone calls asking my opinion about things. Surveys about products, whatever. I never answer. I just hang up the phone. I get a zillion emails. Every time I stay in a hotel, I get an email asking me to rate the hotel. When I fly, I get emails from the airline. I don't answer any of those surveys. But some people do, presumably, or they wouldn't send them out. But why should they think that the people who answer the survey are representative of all the people who stay in the hotel, or all the people who fly on the plane? They're not. They're the kind of people who maybe have time to answer surveys. And so you get a non-response bias. And that tends to distort your results.

When samples are not random and independent, we can still run statistics on them. We can compute means and standard deviations. And that's fine. But we can't draw conclusions using things like the empirical rule, the central limit theorem, or the standard error. Because the basic assumption underlying all of that is that the samples are random and independent.

This is one of the reasons why political polls are so unreliable. They compute statistics using the standard error, assuming that the samples are random and independent. But they, for example, get them mostly by calling landlines.
And so they get a bias towards people who actually answer the phone on a landline. How many of you have a landline where you live? Not many, right? Mostly you rely on your cell phones. And so any survey that depends on landlines is going to leave a lot of the population out. They'll get a lot of people of my vintage, not of your vintage. And that gets you in trouble.

So the moral here is: always understand how the data was collected, what the assumptions in the analysis were, and whether they're satisfied. If these things are not true, you need to be very wary of the results.

All right, I think I'll stop here. We'll finish up our panoply of statistical sins on Wednesday, in the first half. Then we'll do a course wrap-up. Then I'll wish you all godspeed and a good final. See you Wednesday.