The following content is provided under a Creative Commons license. Your support will help MIT OpenCourseWare continue to offer high quality educational resources for free. To make a donation or view additional materials from hundreds of MIT courses, visit MIT OpenCourseWare at ocw.mit.edu.

PROFESSOR: All right. We'll back up and start again since I just turned on my microphone.

I started with the observation that for most of recorded history people thought qualitatively, not quantitatively. They didn't know what statistics were. They must have had some intuitive sense, for example, that if you're old, you're more likely to have bad hearing than if you're young. If you're old, you're more likely to die than if you're young. Things like that. But they were just anecdotes, and they had no careful way to go from statements about individuals to statements about populations or expectations.

This changed in about the middle of the 17th century, changed fairly dramatically, when an Englishman named John Graunt published something called The Natural and Political Observations Made Upon the Bills of Mortality. This was the first work in recorded history that actually used statistics. And what he did is he looked at the fairly comprehensive statistics of when people died in the city of London. And from that, attempted to produce a model that could be used to predict the spread of the plague. And it turns out, he was pretty good at it. And that just changed the way people started to think.

And since that time, people have used statistics both to inform and, unfortunately, to mislead. Some have willfully used statistics to mislead, others have merely been incompetent. And that gets me to what I want to talk about today, which is statistics. This is often attributed to Mark Twain but, in fact, he copied it from Benjamin Disraeli, who said, "There are three kinds of lies: lies, damned lies, and statistics." And there is, unfortunately, a lot of truth in that.
More recently, in the '50s, Darrell Huff wrote a wonderful book, I recommend it, called How to Lie with Statistics. Now here's a quote from the book: "If you can't prove what you want to prove, demonstrate something else and pretend that they're the same thing. In the daze that follows the collision of statistics with the human mind, hardly anyone will notice the difference." Alas, that seems to be true.

So what I want to do today is talk about a few ways in which one can be fooled into drawing inappropriate conclusions from statistical data. Now I trust that you will use this information only for good. And it will make you a better consumer and purveyor of data rather than a better liar, but that's up to you.

All right. So let's start with the first thing I want to talk about. And it's important to always remember that no matter how good your statistics are, statistical measures don't tell the whole story. We've seen examples of this already earlier in the term.

There is an enormous set of statistics that can be extracted from a data set. And by carefully picking and choosing among them, it's possible to convey almost any impression you want about the same data set. The best antidote, of course, is to look at the data itself.

In 1973, the statistician John Anscombe published a paper with this set of data: four different examples where he gave values for x and y, nothing very sophisticated. The interesting thing is that in many ways, the statistics for these four data sets are very similar. They have the same mean, the same median, the same variance for x and y both, the same correlation between x and y. And even if we use linear regression to get a fit, we get very similar things.

So let's look at it. I actually wrote some code to do that. So this just reads in the data set that we just had up, and then plots some things with it. And so let's look at it.

So what you can see, and you have these in your handout, is we have four graphs, four plots.
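A minimal sketch of the kind of check being described, using the standard published values for Anscombe's first and fourth data sets (the other two are handled the same way); the function and variable names here are illustrative, not the actual course code:

```python
import pylab

# Anscombe's first and fourth data sets, as published in his 1973 paper.
x1 = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]
y1 = [8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68]
x4 = [8, 8, 8, 8, 8, 8, 8, 19, 8, 8, 8]
y4 = [6.58, 5.76, 7.71, 8.84, 8.47, 7.04, 5.25, 12.50, 5.56, 7.91, 6.89]

def summarize(x, y, label):
    """Print the summary statistics that make the data sets look alike."""
    a, b = pylab.polyfit(x, y, 1)          # least-squares line y = a*x + b
    print(label,
          'mean y =', round(pylab.mean(y), 2),
          'var y =', round(pylab.var(y), 2),
          'corr =', round(pylab.corrcoef(x, y)[0, 1], 2),
          'fit: y =', round(a, 2), '* x +', round(b, 2))

summarize(x1, y1, 'Data set 1:')
summarize(x4, y4, 'Data set 4:')

# Plotting the points is what actually reveals the difference.
pylab.figure()
pylab.plot(x1, y1, 'bo')
pylab.figure()
pylab.plot(x4, y4, 'bo')
pylab.show()
```

The printed summaries come out essentially the same for both sets; only the scatter plots show how different they really are.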
And I won't show you them all because they're all identical. What they all give me is a mean of 7.5 and a whole bunch of things, the same to 17 decimal places or so, the same median, and the same linear fit, y equals 0.5x plus 3.

So I might write a paper in which I would tell you that, well, from a statistical sense, these data sets are all the same. Every statistical result I ran said these things are indistinguishable. Well, is it true that these data sets are indistinguishable?

Well, one way to look at it is to actually plot the points. So we can run it with True, which says: in addition to plotting the statistics, put the actual points on the plots. And now we see something fairly dramatic. So for example, figure 2 and figure 1 don't look a lot like each other when we actually look at the data. Figure 3 looks quite different, and figure 4 is remarkably different. Compare figure 4 to, say, figure 1. Again, these are all in your handout.

So the moral is pretty simple here, and it's one we looked at before, which is, don't ever ignore the data. Don't just look at statistics about the data, try and find a way to look at the data itself.

Of course, it's easy enough to look at the data, but how do you look at it? And the next thing to remember is that pictures can be deceiving. There can be no doubt about the utility of graphics for quickly conveying information. However, when used carelessly or maliciously, a plot can be highly misleading.

So let's go back to the PowerPoint and look at this plot of housing prices in the Midwest. So we've got a plot here, and we've got years 2006, 2007, and then in 2008 and 2009, we have quarters. You may remember that there was an event in the housing market in 2008 precipitating the global financial crisis. And if we look at this, what impression do we get about housing prices in the Midwest during this period? Well, I would get the impression that they are remarkably stable.
You publish this and you say, OK, look, they really haven't changed very much. Maybe they've gone down a little, but nothing very serious.

If we compare that to this plot, which is exactly the same data, and now I ask you about housing prices in the Midwest, well, what you might tell me is that they're remarkably unstable. And in fact, there was clearly some sort of horrible event here. Exactly the same data, two plots, both of which are truthful, but which give a radically different impression about what happened. The chart on the right-- on the right in your handout-- was designed to show that they were highly unstable.

So what's the difference? What trick did I use to produce these two plots? Yeah?

AUDIENCE: In the more stable one, you used the logarithmic scale. And then here only has selected numbers so--

PROFESSOR: So that was certainly one trick that I performed. In the first chart, I plotted the y-axis logarithmically, which always makes things look like they are changing less than if I plot it linearly. And in this chart, I used a linear plot. Go ahead.

AUDIENCE: You see a much narrower scale on the second plot. So the magnitude of the difference is much less compared to the magnitude of the whole graph of the scale.

PROFESSOR: So that's the other thing I did. If you look at it, I sort of cheated. I had full years here and then I went to quarters. So in part of my chart, the resolution on the x-axis is pretty wide, a whole year, and then part of it's on a quarter. And, not surprisingly, since we know that housing prices change seasonally, they're different in the spring than in the winter, once I start plotting quarters, even if there had not been a crash, it would have looked much less stable in the out years, because I changed the resolution on the x-axis.

I didn't lie. You can tell, reading the legend, that I did that.
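A minimal sketch of how the same numbers can be made to look stable or unstable just by choosing axes, using invented quarterly prices purely for illustration; it shows the logarithmic-axis trick described here plus the tight-y-range trick that comes up in the next example (the years-versus-quarters resolution trick isn't reproduced):

```python
import pylab

# Hypothetical median house prices (thousands of dollars); invented
# numbers just to illustrate the plotting tricks, not real market data.
quarters = range(1, 13)                      # 12 quarters
prices = [210, 212, 213, 214, 215, 213, 208, 196, 185, 180, 178, 177]

# Trick 1: a logarithmic y-axis with a wide range makes the drop vanish.
pylab.figure()
pylab.semilogy(quarters, prices)
pylab.ylim(1, 1000)                          # wide range flattens everything
pylab.title('Housing prices (log scale)')

# Trick 2: a linear y-axis clipped tightly around the data exaggerates it.
pylab.figure()
pylab.plot(quarters, prices)
pylab.ylim(170, 220)                         # tight range magnifies the drop
pylab.title('Housing prices (linear, tight y-range)')

pylab.show()
```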
But I sure could have fooled a lot of people with these charts.

Here's another nice example of statistics. So this plots-- this is from a paper by two professors, I think from Yale, shows what you can do if you're in the Ivy League-- it plots initials against GPA for students at Yale. And so you can see that if your first name starts with the letter A-- I think it was first initials-- your GPA is considerably higher than if it starts with C or D. And if your parents weren't nice enough to give you an A name, you could hope they at least gave you a B name. And you certainly don't want them to give you a C or a D name. So if you're Charlene or David, you could have a real problem. I have to say, my first child was named David. His GPA might have been somewhere in there.

All right-- clearly it matters. Well, what tricks did I perform here?

AUDIENCE: There is very little disparity in x. You are going from 3.32 to--

PROFESSOR: Right. So what I did here is I made the range on the y-axis very small, ranging from 3.32 to 3.38, not a big difference. However, because that's the whole thing, it looks like it's a big difference. You will often see this when you, say, look at things in newspapers where, in fact, someone has manipulated one of the axes to make things look big or small. If I had ranged this from a GPA of 0.0 to a GPA of 4.0, the highest GPA at Yale, this difference would have looked very tiny.

But I didn't-- actually, it wasn't I. I copied this from their paper. This was the way they presented it in their paper. Because they were trying to argue in the paper that your name had a big influence on your life. And they used many statistics, including your grades. And so they actually formatted it kind of like this to try and give you this impression.

Now later we'll see another statistical sin they committed in this paper, which, basically, was designed to show that your name was destiny. And they had many other things.
If you're a baseball player and your name starts with K, you're more likely to strike out, because K is the symbol for strikeouts in a baseball scorebook. A lot of implausible things.

All right. Moving right along, probably the most serious and common statistical error is the one known as Garbage In, Garbage Out. And it's so common that people typically refer to it by its acronym, GIGO-- Garbage In, Garbage Out.

A classic example of this occurred in 1840. We could look at more recent examples, but I don't want to offend any of you. The 1840 United States census showed that insanity among free blacks and mulattoes was roughly 10 times more common than insanity among enslaved blacks or mulattoes. And the conclusion from this was obvious, I guess. US Senator, former Vice President, and later Secretary of State John C. Calhoun concluded from the census, and I quote, "The data on sanity revealed in this census is unimpeachable. From it, our nation must conclude that the abolition of slavery would be to the African a curse." Because, after all, if you freed them from slavery, they would all go insane. That's what the statistics reported, he said.

Never mind that it was soon clear that, in fact, the census was riddled with errors. John Quincy Adams, a former President and Massachusetts resident, responded to Calhoun and said, no, that's a ridiculous conclusion. The census is full of errors. Calhoun, being a very patient person, explained to Adams the following: "There were so many errors that they balanced one another out, and led to the same conclusion just as much as if they were all correct." There were just enough errors that you could be OK.

Well, what was he relying on? What should he have said if he wanted to make this statement more mathematically precise? What he was basically implying is that the measurement errors are unbiased and independent of each other and, therefore, almost identically distributed on either side of the mean. I see a typo, I might as well fix it.
Might as well make it big enough to fix it. That's interesting.

If he had made this much more precise statement, then you could have had a meaningful discussion-- assuming it was possible to have a meaningful discussion with John Calhoun, which is perhaps dubious-- about whether or not, in fact, the errors are independent. Because if they're not, if, for example, they represent bias in the people compiling the data, then you cannot rely upon statistical methods to say that they'll balance each other out.

You remember, way back in Gauss' time, Gauss talked about this when he talked about the normal distribution. And he said, well, if we take these astronomical measurements and we assume our errors are independent and normally distributed, then we can look at the mean and assume that that's close to the truth. Well, those are important assumptions, which in this case turned out to be not correct. And, in fact, it was later shown that the errors did not balance each other out nicely. And in fact, today you can say that no statistical conclusion can be drawn from that census.

On the other hand, recently the US National Research Council, perhaps the most prestigious academic organization in the United States, published a ranking of all universities in the country. And it was later shown that it was full of garbagy input. They did extensive statistical analysis and published it on data that turned out to be just wrong. And it was very embarrassing. Now the good news is MIT came out near the top of this analysis. And the bad news is we can't conclude that it actually should have been near the top, because who knows about the quality of the data. But kind of embarrassing.

All right. Moving right along, another very common way to lie with statistics is to exploit what's called the cum hoc ergo propter hoc fallacy. So anyone here study Latin? Bunch of techies-- oh, OK. Well, what does it mean?

AUDIENCE: With this, therefore, because of this?

PROFESSOR: Boy, your Latin is good.
Either that or you just know statistics. But I have to say, that was the most fluent translation I've had in all the years I've asked this question. I hit the relay man-- the relay woman on the throw. All right. Yes, with this, therefore because of this. I don't know why, but statisticians, like physicians and attorneys, like to show off by phrasing things in Latin.

So for example, it is a statistical fact that college students, including MIT students, who regularly attend lectures have higher GPAs than students who attend lectures only sporadically. So that would tell us that those of you in the room are likely to have a higher GPA than the various students in 6.00 who are not in this room. I hope it's true.

Now if you're a professor who gives these lectures, what you want to believe is that it's because the lectures are so incredibly informative that we make the students who come much smarter and, therefore, they do better. And so we'd like to assume causality. Because I give beautiful lectures and you choose to come, you will get a better grade in 6.00.

Well, yes, there's a correlation. It's unquestionably true, but causation is hard to jump to. For example, maybe it's that students who bother to come to lecture also bother to do the problem sets, and are just more conscientious. And whether they came to lecture or not, the fact that they're more conscientious would give them better GPAs. There's no way I know to separate those two things, other than doing a controlled experiment, right? Maybe kicking half of you out of lecture every day and just seeing how it goes.

It's dangerous. But again, you can read things like the faculty newsletter, which will talk about how important it is to come to lecture because you'll do better. Because whoever wrote that article for the faculty newsletter didn't understand this fallacy, or was just thinking wishfully.

Another nice example, one that was in the news not too long ago, has to do with the flu.
This was the cases of flu in New York State in recent years. And you'll notice that there was a peak in 2009, and that was the famous swine flu epidemic, which I'm sure you all remember.

Now, if you look at this carefully, or even not too carefully, you'll notice a correlation between when schools are in session and when the flu occurs. In fact, during those months when schools are in session, there are more cases of flu than in the months when school is not in session-- high schools, colleges, whatever. Quite a strong correlation, in fact.

This led many to conclude that going to school is an important causative factor in getting the flu. And so maybe you shouldn't have come to the lectures, because you would have just gotten the flu by doing so. And in fact, because of this, you had many parents not sending their kids to school during the swine flu epidemic. And you had many schools closing in some communities because of the swine flu epidemic.

Well, let's think about it. Just as you could use this correlation to conclude that going to school causes the swine flu, you could have also used it to prove that the flu causes you to go to school. Because more people are in school when the flu season is at its height, and therefore, it's the growth of flu that causes people to go to school. That's an equally valid statistical assumption from this data. Kind of a weird thing, but it's true, right? Just as we could conclude that having a high GPA causes people to come to the lecture. You look at your GPA every morning, and if it's high enough, you come to lecture; otherwise, you don't. You could draw that conclusion from the data as well.

The issue here that you have to think about is whether or not there is what's called a lurking variable, some other variable that's related to the other two, and maybe that's the causative one. So for example, a lurking variable here is that the school season-- or the non-school season, maybe I should say-- coincides with the summer.
And in fact, if you study the flu virus in a lab, you will discover that it survives longer in cold weather than in hot and humid weather. When it's cold and dry, the flu virus will survive for a longer time on a surface than it will when it's warm and humid. And so, in fact, maybe it's the weather, not the presence of schools, that causes the flu to be more virulent during certain times of the year. In fact, that's probably true. So there is a lurking variable that we have to consider, and maybe that's the causative factor.

Now, this can actually lead to some really bad decisions in the world. I'm particularly interested in issues related to health care and public health. In 2002, roughly 6 million American women were taking hormone replacement therapy in the belief that this would substantially lower their risk of cardiovascular disease. It was argued that for women of a certain age, if you took extra hormones, you were less likely to have a heart attack. This belief was supported by several published studies in highly reputable journals, in which they showed a strong correlation between being on hormone replacement therapy and not having cardiovascular disease. And this data had been around a while and, as I said, by 2002 in the US, roughly 6 million women were on this therapy.

Later that year, the Journal of the American Medical Association published an article asserting that, in fact, being on this therapy increased women's risk of cardiovascular disease. It made you more likely to have a heart attack. Well, how could this have happened?

After the new study came out, people went back and reanalyzed the old study and discovered that the women in that study who'd been on hormone replacement therapy were more likely than the other women in the group to also have a better diet and be on a better exercise regimen. In fact, they were women who were more health conscious.
So there were the lurking variables of diet and exercise and other things that were, in fact, probably the causative factors in better health, not the replacement therapy. But there was this lurking variable that had not been discovered in the initial analysis of the data. So what we saw is that taking hormone replacement therapy and improved cardiac health were coincident effects of a common cause, that is to say, being health conscious. Kind of a strange thing, but a true and sad story.

All right. Moving right along, another thing to be cautious of is non-response bias and the related problem of a non-representative sample.

You'll probably recall that when I first started talking about statistics and the use of randomness, I said that all statistical techniques are based upon the assumption that by sampling a subset of a population, we can infer things about the population as a whole. And that's true, typically, because if random sampling is used, you can make the assumption that the distribution of results from the random sample, if it's large enough, will be the same as the distribution of results from the whole population. And that's why we typically want to sample randomly.

And so for all the simulations we looked at, we used random sampling to try and ensure that a small number of samples would give us something representative of the population. And then we used statistical techniques to answer the question about how many random samples we needed. But those techniques were only valid if the samples were indeed random. Otherwise, you can analyze it to your heart's content and any conclusions you've drawn are likely to be fallacious.

Unfortunately, many studies, particularly in the social sciences, are based on what is often called convenience sampling. So for example, if you look at psychology journals, you'll find that many psychological studies use populations of undergraduates for their studies. Why do they do this?
Is it because they believe that undergraduates are representative of the population as a whole? No. It's because they're captive. They have to agree to participate, right? It's a convenience, if you happen to be at a university, to do your experiments on undergraduates. And so they do that, and then they say, well, the undergraduates are just like the population as a whole. You may have observed that, at least at this institution, the undergraduates are probably not representative of the population as a whole.

A well-known example of what you can do with this occurred during World War II. Whenever an Allied plane would return from a bombing run over Germany, the plane would be inspected to see where flak had hit it. So the planes would fly over to drop bombs, the Germans would shoot flak at the planes to try and knock them out of the air. They'd come back to England, they'd get inspected, and they'd say, well, the flak hit this part of the plane more often than that part of the plane, on average. And so they would reinforce the skin of those parts of the plane where they expected the flak to hit, to try and make the planes less likely to be damaged in future runs. What's wrong with this? Yeah?

AUDIENCE: They're not getting the planes that dropped?

PROFESSOR: Exactly. What they're not sampling is the planes that never made it back from the bombing runs, because they weren't there to sample. And in fact, maybe it's the case that what they were doing was reinforcing those parts of the planes where it didn't matter if you got hit by flak, because it wouldn't cause a plane to crash, and not reinforcing those parts of the planes that were most vulnerable to being damaged by flak. They did a convenience sample, they drew conclusions, and they probably did exactly the wrong thing in what they chose to reinforce in the airplanes.
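A minimal sketch of the survivorship problem being described, using an invented two-location model of where flak hits and how lethal each hit is; all the probabilities are made up for illustration:

```python
import random

# A toy model of the WWII example: each plane takes one flak hit in a
# random location. Hits to the engine usually bring the plane down;
# hits to the fuselage usually do not. All numbers are invented.
def simulate_planes(num_planes=10000):
    random.seed(0)
    returned = {'engine': 0, 'fuselage': 0}   # hits on planes that made it back
    all_hits = {'engine': 0, 'fuselage': 0}   # hits on every plane, seen or not
    for _ in range(num_planes):
        where = random.choice(['engine', 'fuselage'])
        all_hits[where] += 1
        prob_lost = 0.8 if where == 'engine' else 0.1
        if random.random() > prob_lost:        # plane survives and is inspected
            returned[where] += 1
    print('Hits observed on returning planes:', returned)
    print('Hits on all planes (unobservable): ', all_hits)

simulate_planes()
```

The inspectors only ever see the first dictionary, in which fuselage hits dominate, which is why reinforcing the most-hit spots on the survivors is exactly backwards.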
This particular error is called non-response bias, where you do some sort of survey, for example, and some people don't respond and, therefore, you ignore what they would have said.

It's perhaps something we see when we do the underground guide to Course 6. In fact, I should point out that it's now online. And it would be good if each of you would go and rate this course, rate the lectures, rate the TAs, et cetera. We actually do read them, and it makes a difference in how we teach the course in subsequent terms. But there's clearly a bias. You know, maybe only the people who really feel strongly about the course, either positively or negatively, bother to fill out the survey. And we draw the conclusion that there's a bimodal distribution, and nobody thinks it's kind of mediocre, because those people don't bother responding. Or maybe only the people who hate the course respond, and we think everybody hates the course. Who knows. It's a big problem.

We see it today; it's a big problem with telephone polls, where you get convenience sampling and non-representative samples, because a lot of polls are done using telephones. By law, these pollsters cannot call cell phones, so they only call landlines. How many of you have a landline? Let the record show, nobody. How many of your parents have a landline? Let the record show, pretty much everybody. Well, that means your parents are more likely to get sampled than you when there's a poll of, say, who should be nominated for president. And so any of these polls that are based on telephones will be biased. And, unfortunately, their poll may just say "a telephone sample," and people may not realize the implication of that-- that a whole part of the population is undersampled. There are lots of examples of this.

All right. Moving along, another problem we often see is data enhancement. It's easy to read much more into data than it actually implies, especially when viewed out of context.
So on April 29, 2009, CNN reported, and I quote, "Mexican health officials suspect that the swine flu outbreak has caused more than 159 deaths and roughly 2,500 illnesses." It was pretty scary stuff at the time, and people got all worried about the swine flu. On the other hand, how many deaths a year do you think are attributable to the conventional seasonal flu in the US? Anyone want to hazard a guess? 36,000. So 36,000 people a year, on average, will die from the seasonal flu, which sort of puts in perspective that 159 deaths from the swine flu maybe shouldn't be so terrifying. But again, people typically did not report both of those numbers.

Another great statistic, and an accurate one, is that most auto accidents happen within 10 miles of home. I'm sure many of you have heard that. So what does that mean? Almost nothing. Most driving is done within 10 miles of home. And besides that, what does home mean in this context? What home means is the registration address of the car. So if I were to choose to register my car in Alaska, does that mean I'm less likely to have an accident driving around MIT? I don't think so. Again, it's kind of meaningless.

Another aspect of this is that people often extrapolate from data. So we can look at an example of internet usage. This is kind of a fun one, too. So what I've plotted here is internet usage in the United States as a percentage of population. And I plotted this starting at 1994. The blue line there is the points, and the green line is a linear fit. If you looked at my code, you'd see I was using polyfit with a 1 to get a line to fit, and you can see it's a pretty darn good fit.

So people actually looked at these things and used this to extrapolate internet usage going forward. So we can do that. Now, we'll run the same code with the extrapolation turned on.
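A minimal sketch of the fit-and-extrapolate code being described, with invented usage percentages standing in for the real data:

```python
import pylab

# Hypothetical US internet usage as a percentage of the population:
# invented values that grow roughly linearly, standing in for the real data.
years = list(range(1994, 2011))
usage = [5 + 4.5 * (y - 1994) + (y % 3 - 1) for y in years]

# Fit a line (degree-1 polynomial), as described in the lecture.
a, b = pylab.polyfit(years, usage, 1)

# Extrapolate the same line a decade past the data.
future = list(range(1994, 2021))
predicted = [a * y + b for y in future]

pylab.plot(years, usage, 'bo', label='data')
pylab.plot(future, predicted, 'g-', label='linear fit, extrapolated')
pylab.axhline(100, color='r', linestyle=':', label='100% of population')
pylab.legend(loc='best')
pylab.show()
```

Run this and the extrapolated line sails past the 100% mark, which is exactly the kind of nonsense the next plot shows.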
And so figure 1 is the same figure 1 as before, same data, same fit. And here's figure 2. And you'll notice that, as of last year, about 115% of the US population was using the internet. Probably not true. It may be possible in sports to give 110%, but in statistics it's not.

Again, you see this all the time when people are doing these projections. They'll fit some data, then they extrapolate into the future without understanding why maybe that isn't a good thing to do. We saw that when we were modeling springs, right? We could accurately project linearly until we exceeded the spring's elastic limit, at which point our linear model was totally broken. So you always need to have some reason, other than just fitting the data, to believe that what you're doing makes actual sense.

All right. The final one I want to talk about is what is typically called in the literature the Texas sharpshooter fallacy. And this is a little bit tricky to understand sometimes. Actually, is there anyone here from Texas? Oh good, so no one will be offended by this.

Well, imagine that you're driving down some country road in Texas and you see a barn. And that barn has six targets painted on it, and in the dead center of each target, you find a bullet hole. So you're driving, you're pretty impressed, and you stop. And you see the owner of the barn and you say, you must be a damn good shot. And he says, absolutely, I never miss. At which point the farmer's wife walks out and says, that's right, there ain't a man in the state of Texas who is more accurate with a paint gun.

What did he do? He shot six bullets into the barn, and he was a terrible shot. They were all over the place. Then he went and painted a target around each of them. And it looked like he was a great shot. Now you might think, well, that's silly, no one would do that in practice. But in fact, it happens all of the time in practice.
A classic of this genre appeared in the magazine New Scientist in 2001. It reported that a research team led by John Eagles of the Royal Cornhill Hospital in Aberdeen had discovered that, and I quote, "Anorexic women are most likely to have been born in the spring or early summer between March and June." In fact, there were 13% more anorexics born, on average, in those months, and 30% more anorexics, on average, in June.

Now, let's look at this worrisome statistic. Are any of you women here born in June? All right. Well, I won't ask about your health history. But maybe you should be worried, or maybe not. So let's look at how they did this study. You may wonder why so many of these studies are studies about women's health. Perhaps it's because they're all done by male doctors.

Anyway, the team studied 446 women who had been diagnosed as anorexic. So if you divide that by 12, what you discover is that, on average, there should have been about 37 women born in each of those months, of the 446. And in fact, in June, there were 48 anorexic women born. So they said, well, how likely is this to have occurred simply by chance?

Well, as I am wont to do on such occasions, I checked their analysis, and I wrote a little piece of code to do that. So, trying to figure out what's the probability of 48 women being born in June, I ran a simulation in which I simulated 446 births, choosing a month at random for each, and looked at the probability. And let's see what it was when we run it. Oops, well, we didn't want these graphs. The probability of at least 48 births in June was 0.042. So, in fact, pretty low.

You might say, well, what's the odds of this happening by accident? Pretty small. Therefore, maybe we are really on to something. Maybe it has to do with the conditions of the birth and the weather, or who knows what.
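A minimal sketch of the kind of simulation being described, assuming each of the 446 births is equally likely to fall in any of the 12 months (ignoring differing month lengths); the estimate it prints will vary a little from run to run:

```python
import random

def june_prob(num_trials=10000):
    """Estimate the probability that, out of 446 births spread uniformly
    at random over 12 months, at least 48 land in June."""
    hits = 0
    for _ in range(num_trials):
        births = [random.randint(1, 12) for _ in range(446)]
        if births.count(6) >= 48:          # month 6 is June
            hits += 1
    return hits / num_trials

print('P(at least 48 June births) =', june_prob())
```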
Well, what's wrong with this analysis? Well, one way to look at it is this: the analysis would have been perfectly valid if the researchers had started with the hypothesis that there are more future anorexics born in June than in any other month, and then run this experiment to test it, and validated it. So if they had started with the hypothesis, and then from the hypothesis conducted what's called a prospective study, then they would have had, perhaps, a valid reason to believe that the study supports the hypothesis.

But that's not what they did. Instead, what they did is they looked at the data and then chose a hypothesis that matched the data, the Texas sharpshooter fallacy. Given that that was the experiment they performed, the right question to ask is not what is the probability that you had 48 future anorexics born in June, but what is the probability that you had 48 future anorexics born in at least one of the 12 months? Because that's what they were really doing, right?

So, therefore, we should really have run this simulation, similar to the previous one, again, these are in your handout, but asking: is there at least one month in which there were 48 births? And if we run that, we'll see that the probability is over 40%, not nearly as impressive as 4%. So, in fact, we see that we probably shouldn't draw any conclusion. The probability of this happening by pure accident is almost 50%. So why should we believe that it's somehow meaningful?

Again, an example of the Texas sharpshooter fallacy that appeared in the literature, and a lot of people fell for it. And if we had more time, I would give you many more examples, but we don't. I'll see you on Thursday, on Tuesday, rather. Two more lectures to go. On Tuesday, I'm going to go over some code that I'll be asking you to look at in preparation for the final exam. And then on Thursday, we'll wrap things up.
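For reference, a sketch of the corrected simulation discussed above, asking for the probability that at least one of the 12 months gets 48 or more of the 446 births, under the same equal-months assumption:

```python
import random

def any_month_prob(num_trials=10000):
    """Estimate the probability that at least one month out of 12 gets
    48 or more of 446 uniformly random births."""
    hits = 0
    for _ in range(num_trials):
        births = [random.randint(1, 12) for _ in range(446)]
        if max(births.count(m) for m in range(1, 13)) >= 48:
            hits += 1
    return hits / num_trials

print('P(some month has at least 48 births) =', any_month_prob())
```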