1 00:00:00,000 --> 00:00:02,000 OPERATOR: The following content is provided under a 2 00:00:02,000 --> 00:00:03,840 Creative Commons license. 3 00:00:03,840 --> 00:00:06,840 Your support will help MIT OpenCourseWare continue to 4 00:00:06,840 --> 00:00:10,530 offer high quality educational resources for free. 5 00:00:10,530 --> 00:00:13,390 To make a donation or view additional materials from 6 00:00:13,390 --> 00:00:17,600 hundreds of MIT courses, visit MIT OpenCourseWare at 7 00:00:17,600 --> 00:00:19,980 ocw.mit.edu. 8 00:00:19,980 --> 00:00:25,420 PROFESSOR: So, if you remember, just before the 9 00:00:25,420 --> 00:00:30,710 break, as long ago as it was, we had looked at the problem 10 00:00:30,710 --> 00:00:35,520 of fitting curves to data. 11 00:00:35,520 --> 00:00:42,690 And the example we had seen, is that it's often possible, 12 00:00:42,690 --> 00:00:49,060 in fact, usually possible, to find a good fit to old values. 13 00:00:49,060 --> 00:00:51,460 What we looked at was, we looked at a small number of 14 00:00:51,460 --> 00:00:54,800 points, we took a high degree polynomial, sure enough, we 15 00:00:54,800 --> 00:00:57,970 got a great fit. 16 00:00:57,970 --> 00:01:03,210 The difficulty was, a great fit to old values does not 17 00:01:03,210 --> 00:01:24,840 necessarily imply a good fit to new values. 18 00:01:24,840 --> 00:01:31,150 And in general, that's somewhat worrisome. 19 00:01:31,150 --> 00:01:34,560 So now I want to spend a little bit of time looking 20 00:01:34,560 --> 00:01:38,850 at some tools, that we can use to better understand the 21 00:01:38,850 --> 00:01:43,190 notion of, when we have a bunch of points, what 22 00:01:43,190 --> 00:01:45,220 do they look like? 23 00:01:45,220 --> 00:01:47,610 How does the variation work? 24 00:01:47,610 --> 00:01:51,680 This gets back to a concept that we've used a number of 25 00:01:51,680 --> 00:01:59,540 times, which is a notion of a distribution.
26 00:01:59,540 --> 00:02:04,860 Remember, the whole logic behind our idea of using 27 00:02:04,860 --> 00:02:11,340 simulation, or polling, or any kind of statistical technique, 28 00:02:11,340 --> 00:02:16,160 is the assumption that the values we would draw were 29 00:02:16,160 --> 00:02:21,650 representative of the values of the larger population. 30 00:02:21,650 --> 00:02:25,540 We're sampling some subset of the population, and we're 31 00:02:25,540 --> 00:02:30,150 assuming that that sample is representative of the greater 32 00:02:30,150 --> 00:02:31,700 population. 33 00:02:31,700 --> 00:02:33,170 We talked about several different 34 00:02:33,170 --> 00:02:36,130 issues related to that. 35 00:02:36,130 --> 00:02:40,400 I now want to look at that a little bit more formally. 36 00:02:40,400 --> 00:02:47,760 And we'll start with the very old problem of rolling dice. 37 00:02:47,760 --> 00:02:50,026 I presume you've all seen what a pair of 38 00:02:50,026 --> 00:02:51,570 dice look like, right? 39 00:02:51,570 --> 00:02:54,590 They've got the numbers 1 through 6 on them, you roll 40 00:02:54,590 --> 00:02:57,730 them and something comes up. 41 00:02:57,730 --> 00:03:01,700 If you haven't seen it, if you look at the very back, at the 42 00:03:01,700 --> 00:03:05,040 back page of the handout today, you'll see a picture of 43 00:03:05,040 --> 00:03:06,740 a very old die. 44 00:03:06,740 --> 00:03:11,040 Some time from the fourth to the second century BC. 45 00:03:11,040 --> 00:03:15,260 Looks remarkably like a modern dice, except it's not made out 46 00:03:15,260 --> 00:03:18,610 of plastic, it's made out of bones. 47 00:03:18,610 --> 00:03:20,970 And in fact, if you were interested in the history of 48 00:03:20,970 --> 00:03:23,790 gambling, or if you happen to play with dice, people do call 49 00:03:23,790 --> 00:03:25,610 them bones. 
50 00:03:25,610 --> 00:03:28,320 And that just dates back to the fact that the original 51 00:03:28,320 --> 00:03:30,770 ones were made that way. 52 00:03:30,770 --> 00:03:35,505 And in fact, what we'll see is, that in the history of 53 00:03:35,505 --> 00:03:39,020 probability and statistics, an awful lot of the math that we 54 00:03:39,020 --> 00:03:42,240 take for granted today, came from people's attempts to 55 00:03:42,240 --> 00:03:46,290 understand various games of chance. 56 00:03:46,290 --> 00:03:50,230 So, let's look at it. 57 00:03:50,230 --> 00:03:53,610 So we'll look at this program. 58 00:03:53,610 --> 00:04:05,660 You should have this in the front of the handout. 59 00:04:05,660 --> 00:04:08,960 So I'm going to start with a fair dice. 60 00:04:08,960 --> 00:04:11,660 That is to say, when you roll it, it's equally probable that 61 00:04:11,660 --> 00:04:18,360 you get 1, 2, 3, 4, 5, or 6. 62 00:04:18,360 --> 00:04:20,830 And I'm going to throw a pair. 63 00:04:20,830 --> 00:04:23,930 You can see it's very simple. 64 00:04:23,930 --> 00:04:30,910 I'll take d 1, first die is random dot choice from vals 1. 65 00:04:30,910 --> 00:04:34,370 d 2 will be random dot choice from vals 2. 66 00:04:34,370 --> 00:04:40,100 So I'm going to pass it in two sets of possible values, and 67 00:04:40,100 --> 00:04:45,710 randomly choose one or the other, and then return them. 68 00:04:45,710 --> 00:04:49,660 And the way I'll conduct a trial is, I'll take some 69 00:04:49,660 --> 00:04:53,940 number of throws, and two different kinds of dice. 70 00:04:53,940 --> 00:04:58,840 Throws will be the empty set, actually, yeah. 71 00:04:58,840 --> 00:05:01,070 And then I'll just do it. 72 00:05:01,070 --> 00:05:06,270 For i in range number of throws, d 1, d 2 is equal to 73 00:05:06,270 --> 00:05:08,480 throw a pair, and then I'll append it, and 74 00:05:08,480 --> 00:05:12,940 then I'll return it. 75 00:05:12,940 --> 00:05:14,660 Very simple, right? 
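The two functions just described can be sketched roughly as follows. This is a minimal reconstruction from the spoken description, not the handout's actual code: the names `throw_pair` and `conduct_trials` are assumptions, and so is the choice to append the sum of the two dice (the later histogram of totals suggests it).

```python
import random

def throw_pair(vals1, vals2):
    """Pick one face at random from each die and return both values."""
    d1 = random.choice(vals1)
    d2 = random.choice(vals2)
    return d1, d2

def conduct_trials(num_throws, die1, die2):
    """Throw the pair num_throws times and return the list of totals."""
    throws = []  # starts out empty (a list, not a set)
    for _ in range(num_throws):
        d1, d2 = throw_pair(die1, die2)
        throws.append(d1 + d2)  # assumption: the total is what gets recorded
    return throws

fair = [1, 2, 3, 4, 5, 6]
totals = conduct_trials(1000, fair, fair)
print(totals[:10])
```

Passing the face values in as lists is what later makes it easy to swap in a different die without touching the functions.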
76 00:05:14,660 --> 00:05:19,130 Could hardly imagine a simpler little program. 77 00:05:19,130 --> 00:05:21,740 And then, we'll analyze it. 78 00:05:21,740 --> 00:05:22,960 And we're going to analyze it. 79 00:05:22,960 --> 00:05:25,430 Well, first let's analyze it one way, and then we'll look 80 00:05:25,430 --> 00:05:41,050 at something slightly different. 81 00:05:41,050 --> 00:05:43,200 I'm going to conduct some number of trials 82 00:05:43,200 --> 00:05:46,120 with two fair die. 83 00:05:46,120 --> 00:05:50,270 Then I'm going to make a histogram, because I happen to 84 00:05:50,270 --> 00:05:55,730 know that there are only 11 possible values, 85 00:05:55,730 --> 00:06:01,600 I'll make 11 bins. 86 00:06:01,600 --> 00:06:03,620 You may not have seen this locution 87 00:06:03,620 --> 00:06:06,710 here, Pylab dot x ticks. 88 00:06:06,710 --> 00:06:10,680 That's telling it where to put the markers on the x-axis, and 89 00:06:10,680 --> 00:06:12,460 what they should be. 90 00:06:12,460 --> 00:06:16,450 In this case 2 through 12, and then I'll label it. 91 00:06:16,450 --> 00:06:29,960 So let's run this program. 92 00:06:29,960 --> 00:06:34,760 And here we see the distribution of values. 93 00:06:34,760 --> 00:06:39,220 So we see that I get more 7s than anything else, and fewer 94 00:06:39,220 --> 00:06:42,320 2s and 12s. 95 00:06:42,320 --> 00:06:47,540 Snake eyes and boxcars to you gamblers. 96 00:06:47,540 --> 00:06:51,090 And it's a beautiful distribution, in some sense. 97 00:06:51,090 --> 00:06:53,150 I ran it enough trials. 98 00:06:53,150 --> 00:07:01,240 This kind of distribution is called normal. 99 00:07:01,240 --> 00:07:03,780 Also sometimes called Gaussian, after the 100 00:07:03,780 --> 00:07:07,060 mathematician Gauss. 101 00:07:07,060 --> 00:07:12,110 Sometimes called the bell curve, because in someone's 102 00:07:12,110 --> 00:07:16,650 imagination it looks like a bell. 103 00:07:16,650 --> 00:07:22,370 We see these things all the time. 
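As a cross-check on the histogram, the exact probabilities of each total can be enumerated instead of sampled. The sketch below is not from the handout. It assumes two fair six-sided dice and simply counts the 36 equally likely outcomes.

```python
from collections import Counter
from fractions import Fraction
from itertools import product

faces = range(1, 7)

# Count how many of the 36 equally likely (die1, die2) pairs give each total.
counts = Counter(a + b for a, b in product(faces, faces))
probs = {total: Fraction(n, 36) for total, n in sorted(counts.items())}

for total, p in probs.items():
    print(total, p, round(float(p), 3))
```

The enumeration gives 11 possible totals, with 7 the most likely at 6/36 and with 2 and 12 the least likely at 1/36 each, which matches the shape of the simulated histogram.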
104 00:07:22,370 --> 00:07:27,110 They're called normal, or sometimes even natural, 105 00:07:27,110 --> 00:07:30,800 because it's probably the most commonly observed probability 106 00:07:30,800 --> 00:07:35,460 distribution in nature. 107 00:07:35,460 --> 00:07:40,240 First documented, although I'm sure not first seen, by 108 00:07:40,240 --> 00:07:44,470 deMoivre and Laplace in the 1700s. 109 00:07:44,470 --> 00:07:48,340 And then in the 1800s, Gauss used it to analyze 110 00:07:48,340 --> 00:07:51,980 astronomical data. 111 00:07:51,980 --> 00:07:54,280 And it got to be called, in that case, the Gaussian 112 00:07:54,280 --> 00:07:57,250 distribution. 113 00:07:57,250 --> 00:07:58,890 So where do we see it occur? 114 00:07:58,890 --> 00:08:00,870 We see it occurring all over the place. 115 00:08:00,870 --> 00:08:04,000 We certainly see it rolling dice. 116 00:08:04,000 --> 00:08:09,180 We see it occur in things like the distribution of heights. 117 00:08:09,180 --> 00:08:12,700 If we were to take the height of all the students at MIT and 118 00:08:12,700 --> 00:08:17,100 plot the distribution, I would be astonished if it didn't 119 00:08:17,100 --> 00:08:18,700 look more or less like that. 120 00:08:18,700 --> 00:08:22,280 It would be a normal distribution. 121 00:08:22,280 --> 00:08:24,050 A lot of things in the same height. 122 00:08:24,050 --> 00:08:26,980 Now presumably, we'd have to round off to the nearest 123 00:08:26,980 --> 00:08:29,380 millimeter or something. 124 00:08:29,380 --> 00:08:35,250 And a few really tall people, and a few really short people. 125 00:08:35,250 --> 00:08:39,010 It's just astonishing in nature how often we look at 126 00:08:39,010 --> 00:08:40,690 these things. 127 00:08:40,690 --> 00:08:44,900 The graph looks exactly like that, or similar to that. 128 00:08:44,900 --> 00:08:47,880 The shape is roughly that. 
129 00:08:47,880 --> 00:08:53,460 The normal distribution can be described, interestingly 130 00:08:53,460 --> 00:08:57,700 enough, with just two numbers. 131 00:08:57,700 --> 00:09:13,400 The mean and the standard deviation. 132 00:09:13,400 --> 00:09:15,920 So if I give you those two numbers, you 133 00:09:15,920 --> 00:09:18,540 can draw that curve. 134 00:09:18,540 --> 00:09:21,920 Now you might not be able to label, you couldn't label the 135 00:09:21,920 --> 00:09:25,210 axes, because how would you know how many 136 00:09:25,210 --> 00:09:26,540 trials I did, right? 137 00:09:26,540 --> 00:09:29,720 Whether I did 100, or 1,000 or a million, but the shape would 138 00:09:29,720 --> 00:09:34,970 always be the same. 139 00:09:34,970 --> 00:09:39,180 And if I were to, instead of doing, 100,000 throws of the 140 00:09:39,180 --> 00:09:44,030 dice, as I did here, I did a million, the label on the 141 00:09:44,030 --> 00:09:47,310 y-axis would change, but the shape would 142 00:09:47,310 --> 00:09:51,370 be absolutely identical. 143 00:09:51,370 --> 00:09:54,430 This is what's called a stable distribution. 144 00:09:54,430 --> 00:10:14,160 As you change the scale, the shape doesn't change. 145 00:10:14,160 --> 00:10:18,370 So the mean tells us where it's centered, and the 146 00:10:18,370 --> 00:10:23,500 standard deviation, basically, is a measure of statistical 147 00:10:23,500 --> 00:10:42,570 dispersion. 148 00:10:42,570 --> 00:10:45,400 It tells us how widely spread the points of 149 00:10:45,400 --> 00:10:48,460 the data set are. 150 00:10:48,460 --> 00:10:52,420 If many points are going to be very close to the mean, then 151 00:10:52,420 --> 00:10:57,650 the standard deviation is what, big or small? 152 00:10:57,650 --> 00:10:59,090 Pardon? 153 00:10:59,090 --> 00:11:04,720 Small. 154 00:11:04,720 --> 00:11:08,950 If they're spread out, and it's kind of a flat bell, then 155 00:11:08,950 --> 00:11:13,750 the standard deviation will be large. 
156 00:11:13,750 --> 00:11:16,980 And I'm sure you've all seen standard deviations. 157 00:11:16,980 --> 00:11:19,500 We give exams, and we say here's the mean, here's the 158 00:11:19,500 --> 00:11:22,630 standard deviation. 159 00:11:22,630 --> 00:11:25,170 And the notion is, that's trying to tell you what the 160 00:11:25,170 --> 00:11:29,280 average score was, and how spread out they are. 161 00:11:29,280 --> 00:11:35,080 Now as it happens, rarely do we have exams that actually 162 00:11:35,080 --> 00:11:37,050 fall on a bell curve. 163 00:11:37,050 --> 00:11:38,270 Like this. 164 00:11:38,270 --> 00:11:44,710 So in a way, don't be deceived by thinking that we're really 165 00:11:44,710 --> 00:11:48,790 giving you a good measure of the dispersion, in the sense, 166 00:11:48,790 --> 00:11:52,880 that we would get with the bell curve. 167 00:11:52,880 --> 00:11:58,420 So the standard deviation does have a formal value, usually 168 00:11:58,420 --> 00:12:11,440 written sigma, and it's the expectation of x squared minus 169 00:12:11,440 --> 00:12:18,840 the expectation of x, and then I take all of this, the 170 00:12:18,840 --> 00:12:26,790 expectation of x, right, squared. 171 00:12:26,790 --> 00:12:29,300 So I don't worry much about this, but what I'm basically 172 00:12:29,300 --> 00:12:36,200 doing is, x is all of the values I have. And I can 173 00:12:36,200 --> 00:12:44,500 square each of the values, and then I subtract from that, the 174 00:12:44,500 --> 00:12:50,670 sum of the values, squaring that. 175 00:12:50,670 --> 00:12:54,730 What's more important than this formula, for most uses, 176 00:12:54,730 --> 00:13:03,480 is what people think of as the -- why didn't I write it down 177 00:13:03,480 --> 00:13:07,580 -- this is interesting. 178 00:13:07,580 --> 00:13:09,260 There is a some number -- 179 00:13:09,260 --> 00:13:10,930 I see why, I did write it down, it just 180 00:13:10,930 --> 00:13:12,280 got printed on two-sided.
181 00:13:12,280 --> 00:13:26,540 Is the Empirical Rule. 182 00:13:26,540 --> 00:13:35,370 And this applies for normal distributions. 183 00:13:35,370 --> 00:13:38,680 So anyone know how much of the data that you should expect to 184 00:13:38,680 --> 00:13:44,120 fall within one standard deviation of the mean? 185 00:13:44,120 --> 00:13:49,200 68. 186 00:13:49,200 --> 00:13:59,330 So 68% within one, 95% of the data falls within two, and 187 00:13:59,330 --> 00:14:06,720 almost all of the data within three. 188 00:14:06,720 --> 00:14:11,080 These values are approximations, by the way. 189 00:14:11,080 --> 00:14:20,070 So, this is, really 95% falls within 1.96 standard 190 00:14:20,070 --> 00:14:23,530 deviations, it's not two. 191 00:14:23,530 --> 00:14:28,330 But this gives you a sense of how spread out it is. 192 00:14:28,330 --> 00:14:33,770 And again, this applies only for a normal distribution. 193 00:14:33,770 --> 00:14:37,650 If you compute the standard deviation this way, and apply 194 00:14:37,650 --> 00:14:40,910 it to something other than a normal distribution, there's 195 00:14:40,910 --> 00:14:52,130 no reason to expect that Empirical Rule will hold. 196 00:14:52,130 --> 00:14:55,330 OK people with me on this? 197 00:14:55,330 --> 00:14:58,560 It's amazing to me how many people in society talk about 198 00:14:58,560 --> 00:15:01,190 standard deviations without actually 199 00:15:01,190 --> 00:15:02,860 knowing what they are. 200 00:15:02,860 --> 00:15:06,010 And there's another way to look at the same data, or 201 00:15:06,010 --> 00:15:07,120 almost the same data. 202 00:15:07,120 --> 00:15:08,990 Since it's a random experiment, it won't be 203 00:15:08,990 --> 00:15:22,220 exactly the same. 204 00:15:22,220 --> 00:15:28,670 So as before, we had the distribution, and fortunately 205 00:15:28,670 --> 00:15:30,890 it looks pretty much like the last one. 206 00:15:30,890 --> 00:15:32,950 We would've expected that. 
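The Empirical Rule stated above can be checked numerically. The sketch below is an illustration, not part of the lecture's code. It assumes the formula on the board was the usual population form, sigma = sqrt(E[x^2] - (E[x])^2), draws samples from a normal distribution with `random.gauss`, and counts how many fall within 1, 1.96, 2 and 3 standard deviations of the mean. The seed and sample size are arbitrary.

```python
import math
import random
import statistics

random.seed(0)  # arbitrary seed so the run is repeatable
samples = [random.gauss(0.0, 1.0) for _ in range(100_000)]

# Population standard deviation: square root of (mean of squares - squared mean).
mean = statistics.fmean(samples)
mean_of_squares = statistics.fmean([x * x for x in samples])
std = math.sqrt(mean_of_squares - mean ** 2)

def fraction_within(k):
    """Fraction of samples within k standard deviations of the mean."""
    return sum(abs(x - mean) <= k * std for x in samples) / len(samples)

for k in (1, 1.96, 2, 3):
    print(k, round(fraction_within(k), 4))
```

Replacing `random.gauss` with a non-normal source such as `random.uniform` is a quick way to see that the 68/95 figures stop holding once the data are not normally distributed.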
207 00:15:32,950 --> 00:15:35,940 And I've now done something, another way of looking at the 208 00:15:35,940 --> 00:15:42,050 same information, really, is, I printed, I plotted, the 209 00:15:42,050 --> 00:15:46,900 probabilities of different values. 210 00:15:46,900 --> 00:15:52,950 So we can see here that the probability of getting a 7 is 211 00:15:52,950 --> 00:15:59,120 about 0.17 or something like that. 212 00:15:59,120 --> 00:16:03,200 Now, since I threw 100,000 die, it's not surprising that 213 00:16:03,200 --> 00:16:07,050 the probability of 0.17 looks about the same 214 00:16:07,050 --> 00:16:11,540 as 17,000 over here. 215 00:16:11,540 --> 00:16:15,770 But it's just a different way of looking at things. 216 00:16:15,770 --> 00:16:21,480 Right, had I thrown some different number, it might 217 00:16:21,480 --> 00:16:23,900 have been harder to visualize what the probability 218 00:16:23,900 --> 00:16:26,000 distribution looked like. 219 00:16:26,000 --> 00:16:28,140 But we often do talk about that. 220 00:16:28,140 --> 00:16:35,380 How probable is a certain value? 221 00:16:35,380 --> 00:16:39,210 People who design games of chance, by the way, something 222 00:16:39,210 --> 00:16:39,960 I've been meaning to say. 223 00:16:39,960 --> 00:16:42,690 You'll notice down here there's just, when we want to 224 00:16:42,690 --> 00:16:45,840 save these things, there's this little icon that's a 225 00:16:45,840 --> 00:16:48,800 floppy disk, to indicate store. 226 00:16:48,800 --> 00:16:51,260 And I thought maybe many of you'd never seen a floppy 227 00:16:51,260 --> 00:16:53,730 disk, so I decided to bring one in. 228 00:16:53,730 --> 00:16:55,500 You've seen the icons. 229 00:16:55,500 --> 00:16:58,590 And probably by the time most of you got, any you ever had a 230 00:16:58,590 --> 00:17:01,310 machine with a quote floppy drive? 231 00:17:01,310 --> 00:17:04,030 Did it actually flop the disk? 
232 00:17:04,030 --> 00:17:07,540 No, they were pretty rigid, but the old floppy disks were 233 00:17:07,540 --> 00:17:09,840 really floppy. 234 00:17:09,840 --> 00:17:12,670 Hence they got the name. 235 00:17:12,670 --> 00:17:16,580 And, you know it's kind of like a giant size version. 236 00:17:16,580 --> 00:17:19,800 And it's amazing how people will probably continue to talk 237 00:17:19,800 --> 00:17:21,770 about floppy disks as long as they talk 238 00:17:21,770 --> 00:17:24,270 about dialing a telephone. 239 00:17:24,270 --> 00:17:27,190 And probably none of you've ever actually dialed a phone, 240 00:17:27,190 --> 00:17:29,820 for that matter, just pushed buttons . 241 00:17:29,820 --> 00:17:33,040 But they used to have dials that you would twirl. 242 00:17:33,040 --> 00:17:35,060 Anyway, I just thought everyone should at least see a 243 00:17:35,060 --> 00:17:37,250 floppy disk once. 244 00:17:37,250 --> 00:17:40,140 This is, by the way, a very good way to get data security. 245 00:17:40,140 --> 00:17:42,510 There's probably no way in the world to read the information 246 00:17:42,510 --> 00:17:48,870 on this disk anymore. 247 00:17:48,870 --> 00:17:55,060 All right, as I said people who design games of chance 248 00:17:55,060 --> 00:17:58,430 understand these probabilities very well. 249 00:17:58,430 --> 00:18:02,010 So I'm gonna now look at, show how we can understand these 250 00:18:02,010 --> 00:18:06,950 things in some other ways of popular example. 251 00:18:06,950 --> 00:18:09,410 A game of dice. 252 00:18:09,410 --> 00:18:13,520 Has anyone here ever played the game called craps? 253 00:18:13,520 --> 00:18:16,390 Did you win or lose money? 254 00:18:16,390 --> 00:18:18,050 You won. 255 00:18:18,050 --> 00:18:21,880 All right, you beat the odds. 256 00:18:21,880 --> 00:18:23,780 Well, it's a very popular game, and I'm going to 257 00:18:23,780 --> 00:18:25,020 explain it to you. 
258 00:18:25,020 --> 00:18:28,150 As you will see, this is not an endorsement of gambling, 259 00:18:28,150 --> 00:18:30,370 because one of the things you will notice is, you are likely 260 00:18:30,370 --> 00:18:32,590 to lose money if you do this. 261 00:18:32,590 --> 00:18:35,800 So I tend not, I don't do it. 262 00:18:35,800 --> 00:18:45,830 All right, so how does a game of craps work? 263 00:18:45,830 --> 00:18:50,630 You start by rolling two dice. 264 00:18:50,630 --> 00:18:55,840 If you get a 7 or an 11, the roller, we'll call that the 265 00:18:55,840 --> 00:19:01,490 shooter, you win. 266 00:19:01,490 --> 00:19:08,870 If you get a 2, 3, or a 12, you lose. 267 00:19:08,870 --> 00:19:10,720 I'm assuming here, you're betting what's 268 00:19:10,720 --> 00:19:12,690 called the pass line. 269 00:19:12,690 --> 00:19:14,720 There are different ways to bet, this is the most common 270 00:19:14,720 --> 00:19:17,820 way to bet, we'll just deal with that. 271 00:19:17,820 --> 00:19:27,150 If it's not any of these, what you get is, otherwise, the 272 00:19:27,150 --> 00:19:37,330 number becomes what's called the point. 273 00:19:37,330 --> 00:19:41,470 Once you've got the point, you keep rolling the dice until one 274 00:19:41,470 --> 00:19:44,540 of two things happens. 275 00:19:44,540 --> 00:19:56,430 You get a 7, in which case you lose, or you get the point, in 276 00:19:56,430 --> 00:20:04,250 which case you win. 277 00:20:04,250 --> 00:20:07,420 So it's a pretty simple game. 278 00:20:07,420 --> 00:20:12,740 Very popular game. 279 00:20:12,740 --> 00:20:14,430 So I've implemented. 280 00:20:14,430 --> 00:20:18,400 So one of the interesting things about this is, if you 281 00:20:18,400 --> 00:20:22,160 try and actually figure out what the probabilities are 282 00:20:22,160 --> 00:20:25,700 using pencil and paper, you can, but it gets 283 00:20:25,700 --> 00:20:33,670 a little bit involved.
284 00:20:33,670 --> 00:20:35,960 Gets involved because you have to, all right, what are the 285 00:20:35,960 --> 00:20:38,890 odds of winning or losing on the first throw? 286 00:20:38,890 --> 00:20:41,160 Well, you can compute those pretty easily, and you can see 287 00:20:41,160 --> 00:20:43,550 that you'd actually win more than you lose, 288 00:20:43,550 --> 00:20:44,920 on the first throw. 289 00:20:44,920 --> 00:20:48,030 But if you look at the distribution of 7s and 11s and 290 00:20:48,030 --> 00:20:51,690 2s, 3s, and 12s, you add them up, you'll see, well, this is 291 00:20:51,690 --> 00:20:53,890 more likely than this. 292 00:20:53,890 --> 00:20:55,900 But then you say, suppose I don't get those. 293 00:20:55,900 --> 00:20:58,610 What's the likelihood of getting each other possible 294 00:20:58,610 --> 00:21:02,070 point value, and then given that point value, what's the 295 00:21:02,070 --> 00:21:04,900 likelihood of getting that before a 7? 296 00:21:04,900 --> 00:21:07,720 And you can do it, but it gets very tedious. 297 00:21:07,720 --> 00:21:11,450 So those of us who are inclined to think 298 00:21:11,450 --> 00:21:15,430 computationally, and I hope by now that's all of you as well 299 00:21:15,430 --> 00:21:19,270 as me, say well, instead of doing the probabilities by 300 00:21:19,270 --> 00:21:22,420 hand, I'm just going to write a little program. 301 00:21:22,420 --> 00:21:27,330 And it's a program that took me maybe 10 minutes to write. 302 00:21:27,330 --> 00:21:34,240 You can see it's quite small, I did it yesterday. 303 00:21:34,240 --> 00:21:39,620 So the first function here is craps, it returns true if the 304 00:21:39,620 --> 00:21:43,350 shooter wins by betting the pass line. 305 00:21:43,350 --> 00:21:46,380 And it just does what I said. 306 00:21:46,380 --> 00:21:48,860 Rolls them, if the total is 7 or 11, it 307 00:21:48,860 --> 00:21:51,270 returns true, you win.
308 00:21:51,270 --> 00:21:54,750 If it's 2, 3, or 12, it returns false, you lose. 309 00:21:54,750 --> 00:21:58,190 Otherwise the point becomes the total. 310 00:21:58,190 --> 00:22:01,830 And then while true, I'll just keep rolling. 311 00:22:01,830 --> 00:22:07,290 Until either, if the total gets the point, I return true. 312 00:22:07,290 --> 00:22:09,930 Or if the total's equal 7, I return false. 313 00:22:09,930 --> 00:22:12,110 And that's it. 314 00:22:12,110 --> 00:22:15,590 So essentially I just took these rules, typed them down, 315 00:22:15,590 --> 00:22:20,270 and I had my game. 316 00:22:20,270 --> 00:22:24,620 And then I'll simulate it with some number of bets. 317 00:22:24,620 --> 00:22:28,980 Keeping track of the numbers of wins and losses. 318 00:22:28,980 --> 00:22:31,950 Just by incrementing 1 or the other, depending upon whether 319 00:22:31,950 --> 00:22:34,500 I return true or false. 320 00:22:34,500 --> 00:22:37,220 I'm going to, just to show what we do, print the number 321 00:22:37,220 --> 00:22:39,410 of wins and losses. 322 00:22:39,410 --> 00:22:44,210 And then compute, how does the house do? 323 00:22:44,210 --> 00:22:48,120 Not the gambler, but the person who's running the game, 324 00:22:48,120 --> 00:22:49,610 the casino. 325 00:22:49,610 --> 00:22:53,720 Or in other circumstances, other places. 326 00:22:53,720 --> 00:22:56,910 And then we'll see how that goes. 327 00:22:56,910 --> 00:23:00,570 And I'll try it with 100,000 games. 328 00:23:00,570 --> 00:23:03,940 Now, this is more than 100,000 rolls of the dice, right? 329 00:23:03,940 --> 00:23:08,730 Because if I don't get a 7 or 11, I keep rolling. 330 00:23:08,730 --> 00:23:15,790 So before I do it, I'll ask the easy question first. Who 331 00:23:15,790 --> 00:23:21,990 thinks the casino wins more often than the player? 332 00:23:21,990 --> 00:23:26,150 Who thinks the player wins more often than the casino?
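The rules and the simulation loop described above translate almost line for line into code. The following is a sketch written from the spoken description rather than a copy of the handout. The names `roll`, `craps`, `sim_craps` and `FAIR` are assumptions, as is the fixed seed.

```python
import random

FAIR = [1, 2, 3, 4, 5, 6]

def roll(die1=FAIR, die2=FAIR):
    """Return the total of one throw of the two dice."""
    return random.choice(die1) + random.choice(die2)

def craps(die1=FAIR, die2=FAIR):
    """Return True if the shooter wins a pass-line bet, False otherwise."""
    total = roll(die1, die2)
    if total in (7, 11):
        return True           # win on the first throw
    if total in (2, 3, 12):
        return False          # lose on the first throw
    point = total             # any other total becomes the point
    while True:
        total = roll(die1, die2)
        if total == point:
            return True       # made the point before a 7
        if total == 7:
            return False      # a 7 came first

def sim_craps(num_bets, die1=FAIR, die2=FAIR):
    """Play num_bets games and return (player wins, player losses)."""
    wins = sum(craps(die1, die2) for _ in range(num_bets))
    return wins, num_bets - wins

random.seed(0)  # arbitrary seed so the run is repeatable
wins, losses = sim_craps(100_000)
print("player wins:", wins, "house wins:", losses)
print("house win rate:", losses / (wins + losses))
```

Taking the dice as parameters is a design choice made here so that the same functions can later be rerun with a different set of face values.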
333 00:23:26,150 --> 00:23:30,090 Well, very logical, casinos are not in business of giving 334 00:23:30,090 --> 00:23:32,870 away money. 335 00:23:32,870 --> 00:23:34,800 So now the more interesting question. 336 00:23:34,800 --> 00:23:38,770 How steep do you think the odds are in the house's favor? 337 00:23:38,770 --> 00:23:43,980 Anyone want to guess? 338 00:23:43,980 --> 00:23:45,040 Actually pretty thin. 339 00:23:45,040 --> 00:23:57,850 Let's run it and see. 340 00:23:57,850 --> 00:24:02,800 So what we see is, the house wins 50, in this case 50.424% 341 00:24:02,800 --> 00:24:04,770 of the time. 342 00:24:04,770 --> 00:24:06,360 Not a lot. 343 00:24:06,360 --> 00:24:08,570 On the other hand, if people bet 100,000, 344 00:24:08,570 --> 00:24:14,200 the house wins 848. 345 00:24:14,200 --> 00:24:23,010 Now, 100,000 is actually a small number. 346 00:24:23,010 --> 00:24:25,780 Let's get rid of these, should have gotten rid of these 347 00:24:25,780 --> 00:24:53,560 figures, you don't need to see them every time. 348 00:24:53,560 --> 00:24:58,240 We'll keep one figure, just for fun. 349 00:24:58,240 --> 00:25:04,850 Let's try it again. 350 00:25:04,850 --> 00:25:06,300 Probably get a little different answer. 351 00:25:06,300 --> 00:25:11,650 Considerably different, but still, less than 51% of the 352 00:25:11,650 --> 00:25:13,270 time in this trial. 353 00:25:13,270 --> 00:25:16,070 But you can see that the house is slowly but surely going to 354 00:25:16,070 --> 00:25:20,570 get rich playing this game. 355 00:25:20,570 --> 00:25:24,650 Now let's ask the other interesting question. 356 00:25:24,650 --> 00:25:29,330 Just for fun, suppose we want to cheat. 357 00:25:29,330 --> 00:25:32,670 Now, I realize none of you would ever do that. 358 00:25:32,670 --> 00:25:36,750 But let's consider using a pair of loaded dice.
359 00:25:36,750 --> 00:25:39,870 So there's a long history, well you can imagine when you 360 00:25:39,870 --> 00:25:43,480 looked at that old bone I showed you, that it wasn't 361 00:25:43,480 --> 00:25:44,960 exactly fair. 362 00:25:44,960 --> 00:25:47,940 That some sides were a little heavier than others, and in 363 00:25:47,940 --> 00:25:52,970 fact you didn't get, say, a 5 exactly 1/6 of the time. 364 00:25:52,970 --> 00:25:55,430 And therefore, if you were using your own dice, instead 365 00:25:55,430 --> 00:25:57,950 of somebody else's, and you knew what was most likely, you 366 00:25:57,950 --> 00:26:00,070 might do better. 367 00:26:00,070 --> 00:26:02,560 Well, the modern version of that is, people do cheat by 368 00:26:02,560 --> 00:26:06,430 putting little weights in dice, to just make tiny 369 00:26:06,430 --> 00:26:09,210 changes in the probability of one number or 370 00:26:09,210 --> 00:26:11,240 another coming up. 371 00:26:11,240 --> 00:26:14,570 So let's do that. 372 00:26:14,570 --> 00:26:16,560 And let's first ask the question, well, what would be 373 00:26:16,560 --> 00:26:20,250 a nice way to do that? 374 00:26:20,250 --> 00:26:24,540 It's very easy here. 375 00:26:24,540 --> 00:26:37,290 If we look at it, all I've done is, I've changed the 376 00:26:37,290 --> 00:26:41,210 distribution of values, so instead of here being 1, 2, 3, 377 00:26:41,210 --> 00:26:46,270 4, 5, and 6, it's 1, 2, 3, 4, 5, 5, and 6. 378 00:26:46,270 --> 00:26:51,520 I snuck in an extra 5 on one of the two dice. 379 00:26:51,520 --> 00:26:59,300 So this has changed the odds of rolling a 5 from 1 in 6 to 380 00:26:59,300 --> 00:27:04,150 roughly 3 in 12. 381 00:27:04,150 --> 00:27:09,510 Now 1/6, which is 2/12, vs. 3/12, it's not a big 382 00:27:09,510 --> 00:27:10,790 difference. 
383 00:27:10,790 --> 00:27:12,950 And you can imagine, if you were sitting there watching 384 00:27:12,950 --> 00:27:17,130 it, you wouldn't notice that 5 was coming up a little bit 385 00:27:17,130 --> 00:27:19,830 more often than you expected. 386 00:27:19,830 --> 00:27:21,150 Normally. 387 00:27:21,150 --> 00:27:23,750 Close enough that you wouldn't notice it. 388 00:27:23,750 --> 00:27:28,800 But let's see if, what difference it makes? 389 00:27:28,800 --> 00:27:30,590 What difference do you think it will make here? 390 00:27:30,590 --> 00:27:32,840 First of all, is it going to be better or 391 00:27:32,840 --> 00:27:36,110 worse for the player? 392 00:27:36,110 --> 00:27:39,830 Who thinks better? 393 00:27:39,830 --> 00:27:42,500 Who thinks worse? 394 00:27:42,500 --> 00:27:45,650 Who thinks they haven't a clue? 395 00:27:45,650 --> 00:27:48,640 All right, we have an honest man. 396 00:27:48,640 --> 00:27:52,130 Where is Diogenes when we need him? 397 00:27:52,130 --> 00:27:55,650 The reward for honesty. 398 00:27:55,650 --> 00:27:58,030 I could reward you and wake him up at the same time. 399 00:27:58,030 --> 00:28:01,340 It's good. 400 00:28:01,340 --> 00:28:16,810 All right, well, let's see what happens. 401 00:28:16,810 --> 00:28:21,095 All right, so suddenly, the odds have swung in favor of 402 00:28:21,095 --> 00:28:23,090 the player. 403 00:28:23,090 --> 00:28:28,030 This tiny little change has now made it likely that the 404 00:28:28,030 --> 00:28:33,440 player win money, instead of the house. 405 00:28:33,440 --> 00:28:34,300 So what's the point? 406 00:28:34,300 --> 00:28:36,730 The point is not, you should go out and try and cheat 407 00:28:36,730 --> 00:28:41,070 casinos, because you'll probably find an unpleasant 408 00:28:41,070 --> 00:28:42,880 consequence of that. 409 00:28:42,880 --> 00:28:47,700 The point is that, once I've written this simulation, I can 410 00:28:47,700 --> 00:28:52,480 play thought experiments in a very easy way.
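The pencil-and-paper calculation that the lecture calls tedious can also be done exactly by a short program, which gives a useful check on the simulated swing from house to player. This sketch is not from the handout. The function name `pass_line_win_probability` is hypothetical, and the loaded die is modeled as seven equally likely faces with 5 appearing twice, which is an assumption about how the extra 5 behaves.

```python
from collections import Counter
from fractions import Fraction
from itertools import product

def pass_line_win_probability(die1, die2):
    """Exact probability that the shooter wins a pass-line bet."""
    counts = Counter(a + b for a, b in product(die1, die2))
    n = len(die1) * len(die2)
    p = {total: Fraction(c, n) for total, c in counts.items()}
    p7 = p.get(7, Fraction(0))

    win = p7 + p.get(11, Fraction(0))  # immediate wins on the first throw
    for point, p_point in p.items():
        if point in (2, 3, 7, 11, 12):
            continue
        # Once a point is set, only the point and 7 end the game, so the
        # chance of making the point is p_point / (p_point + p7).
        win += p_point * p_point / (p_point + p7)
    return win

fair = [1, 2, 3, 4, 5, 6]
loaded = [1, 2, 3, 4, 5, 5, 6]  # one extra 5, as in the lecture

print("fair dice:  ", float(pass_line_win_probability(fair, fair)))
print("one loaded: ", float(pass_line_win_probability(loaded, fair)))
```

With two fair dice the player's chance comes out to 244/495, a little over 49%. Under the modeling assumption above, one loaded die pushes it to roughly 51%, in line with the direction of the simulated result.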
411 00:28:52,480 --> 00:28:53,870 So-called what if games. 412 00:28:53,870 --> 00:28:54,820 What if we did this? 413 00:28:54,820 --> 00:28:56,690 What if we did that? 414 00:28:56,690 --> 00:28:59,680 And it's trivial to do those kinds of things. 415 00:28:59,680 --> 00:29:03,000 And that's one of the reasons we typically do try and write 416 00:29:03,000 --> 00:29:04,640 these simulations. 417 00:29:04,640 --> 00:29:09,080 So that we can experiment with things. 418 00:29:09,080 --> 00:29:11,070 Are there any other experiments people would like 419 00:29:11,070 --> 00:29:13,500 to perform while we're here? 420 00:29:13,500 --> 00:29:18,280 Any other sets of die you might like to try? 421 00:29:18,280 --> 00:29:22,060 All right, someone give me a suggestion of something that 422 00:29:22,060 --> 00:29:24,320 might work in the house's favor. 423 00:29:24,320 --> 00:29:27,120 Suppose a casino wanted to cheat. 424 00:29:27,120 --> 00:29:28,430 What do you think would help them out? 425 00:29:28,430 --> 00:29:28,690 Yeah? 426 00:29:28,690 --> 00:29:32,594 STUDENT: Increase prevalence of 1, instead of 5? 427 00:29:32,594 --> 00:29:34,760 PROFESSOR: All right, so let's see if we increase the 428 00:29:34,760 --> 00:29:52,210 probability of 1, what it does? 429 00:29:52,210 --> 00:29:58,490 Yep, clearly helped the house out, didn't it? 430 00:29:58,490 --> 00:30:04,750 So that would be a good thing for the house. 431 00:30:04,750 --> 00:30:08,840 Again, you know, three key strokes and we get to try it. 432 00:30:08,840 --> 00:30:15,350 It's really a very nice kind of thing to be able to do. 433 00:30:15,350 --> 00:30:19,320 OK, this works nicely. 434 00:30:19,320 --> 00:30:21,350 We'll get normal distributions. 435 00:30:21,350 --> 00:30:24,210 We can look at some things. 436 00:30:24,210 --> 00:30:26,130 There are two other kinds of distributions I 437 00:30:26,130 --> 00:30:27,190 want to talk about. 438 00:30:27,190 --> 00:30:40,680 We can get rid of this distraction. 
439 00:30:40,680 --> 00:30:43,492 As you can imagine, I played a lot with these things, just 440 00:30:43,492 --> 00:30:52,310 'cause it was fun once I had it. 441 00:30:52,310 --> 00:30:57,170 You have these in your handout. 442 00:30:57,170 --> 00:31:01,800 So the one on the upper right is the Gaussian, or normal, 443 00:31:01,800 --> 00:31:07,800 distribution we've been talking about. 444 00:31:07,800 --> 00:31:12,080 As I said earlier, quite common, we see it a lot. 445 00:31:12,080 --> 00:31:17,090 The upper left is what's called a, and these, by the 446 00:31:17,090 --> 00:31:19,550 way, all of these distributions are symmetric, 447 00:31:19,550 --> 00:31:23,110 just in this particular picture. 448 00:31:23,110 --> 00:31:26,750 How do you spell symmetric, one or two m's? 449 00:31:26,750 --> 00:31:29,420 I need help here. 450 00:31:29,420 --> 00:31:30,290 That right? 451 00:31:30,290 --> 00:31:32,720 OK, thank you. 452 00:31:32,720 --> 00:31:35,720 And they're symmetric in the sense that, if you take the 453 00:31:35,720 --> 00:31:40,280 mean, it looks the same on both sides of the mean. 454 00:31:40,280 --> 00:31:43,370 Now in general, you can have asymmetric 455 00:31:43,370 --> 00:31:48,260 distributions as well. 456 00:31:48,260 --> 00:31:51,830 But for simplicity, we'll here look at symmetric ones. 457 00:31:51,830 --> 00:31:54,950 So we've seen the bell curve, and then on the upper left is 458 00:31:54,950 --> 00:32:04,290 what's called the uniform. 459 00:32:04,290 --> 00:32:09,690 In a uniform distribution, each value in the range is 460 00:32:09,690 --> 00:32:15,810 equally likely. 461 00:32:15,810 --> 00:32:18,960 So to characterize it, you only need to give 462 00:32:18,960 --> 00:32:21,930 the range of values. 463 00:32:21,930 --> 00:32:26,250 I say the values range from 0 to 100, and it tells me 464 00:32:26,250 --> 00:32:30,050 everything I need to know about the uniform distribution.
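The claim that a range is all you need to describe a uniform distribution can be checked with a short sketch. It assumes integer values from 0 to 100 inclusive; the names `LOW`, `HIGH` and `counts` are illustrative and do not come from the lecture.

```python
import random
from collections import Counter

rng = random.Random(0)

LOW, HIGH = 0, 100  # the only parameters the uniform distribution needs

# Draw many integers uniformly from the range and tally each value.
samples = [rng.randint(LOW, HIGH) for _ in range(101_000)]
counts = Counter(samples)

# With 101 possible values and 101,000 draws, each value should turn up
# roughly 1,000 times; the tallies are close to each other, not identical.
print("distinct values seen:", len(counts))
print("smallest tally:", min(counts.values()))
print("largest tally: ", max(counts.values()))
```

In a finite sample the tallies only come out approximately equal; "equally likely" is a statement about probabilities, not about exact counts.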
465 00:32:30,050 --> 00:32:35,850 Each value in that will occur the same number of times. 466 00:32:35,850 --> 00:32:39,970 Have we seen a uniform distribution? 467 00:32:39,970 --> 00:32:46,350 What have we seen that's uniform here? 468 00:32:46,350 --> 00:32:46,650 Pardon? 469 00:32:46,650 --> 00:32:48,620 STUDENT: Playing dice. 470 00:32:48,620 --> 00:32:49,990 PROFESSOR: Playing dice. 471 00:32:49,990 --> 00:32:51,580 Exactly right. 472 00:32:51,580 --> 00:32:57,780 Each roll of the die was equally likely. 473 00:32:57,780 --> 00:32:59,900 Between 1 and 6. 474 00:32:59,900 --> 00:33:03,610 So, we got a normal distribution when I summed 475 00:33:03,610 --> 00:33:09,710 them, but if I gave you the distribution of a single die, 476 00:33:09,710 --> 00:33:11,870 it would have been uniform, right? 477 00:33:11,870 --> 00:33:14,340 So there's an interesting lesson there. 478 00:33:14,340 --> 00:33:18,060 One die, the distribution was uniform, but when I summed 479 00:33:18,060 --> 00:33:27,700 them, I ended up getting a normal distribution. 480 00:33:27,700 --> 00:33:29,870 So where else do we see them? 481 00:33:29,870 --> 00:33:34,070 In principle, lottery winners are uniformly distributed. 482 00:33:34,070 --> 00:33:38,480 Each number is equally likely to come up. 483 00:33:38,480 --> 00:33:41,640 To a first approximation, birthdays are uniformly 484 00:33:41,640 --> 00:33:44,670 distributed, things like that. 485 00:33:44,670 --> 00:33:49,710 But, in fact, they rarely arise in nature. 486 00:33:49,710 --> 00:33:52,350 You'll hardly ever run a physics experiment, or a 487 00:33:52,350 --> 00:33:55,440 biology experiment, or anything like that, and come 488 00:33:55,440 --> 00:33:58,100 up with a uniform distribution. 489 00:33:58,100 --> 00:34:04,810 Nor do they arise very often in complex systems. 
So if you 490 00:34:04,810 --> 00:34:07,970 look at what happens in financial markets, none of the 491 00:34:07,970 --> 00:34:10,500 interesting distributions are uniform. 492 00:34:10,500 --> 00:34:13,680 You know, the prices of stocks, for example, are 493 00:34:13,680 --> 00:34:16,250 clearly not uniformly distributed. 494 00:34:16,250 --> 00:34:19,820 Up days and down days in the stock market are not uniformly 495 00:34:19,820 --> 00:34:22,780 distributed. 496 00:34:22,780 --> 00:34:28,700 Winners of football games are not uniformly distributed. 497 00:34:28,700 --> 00:34:32,160 People seem to like to use them in games of chance, 498 00:34:32,160 --> 00:34:35,120 because they seem fair, but mostly you see them only in 499 00:34:35,120 --> 00:34:39,750 invented things, rather than real things. 500 00:34:39,750 --> 00:34:41,800 The third kind of distribution, the one in the 501 00:34:41,800 --> 00:34:55,010 bottom, is the exponential distribution. 502 00:34:55,010 --> 00:34:59,360 That's actually quite common in the real world. 503 00:34:59,360 --> 00:35:05,560 It's often used, for example, to model arrival times. 504 00:35:05,560 --> 00:35:08,170 If you want to model the frequency at which, say, 505 00:35:08,170 --> 00:35:16,040 automobiles arrive, get on the Mass Turnpike, you would find 506 00:35:16,040 --> 00:35:20,750 that the arrivals are exponential. 507 00:35:20,750 --> 00:35:25,180 What we see with an exponential is, things fall off much more 508 00:35:25,180 --> 00:35:34,700 steeply around the mean than with the normal distribution. 509 00:35:34,700 --> 00:35:40,520 All right, that make sense? 510 00:35:40,520 --> 00:35:43,960 What else is exponentially distributed? 511 00:35:43,960 --> 00:35:48,090 Requests for web pages are often exponentially 512 00:35:48,090 --> 00:35:49,170 distributed. 513 00:35:49,170 --> 00:35:52,890 The amount of traffic at a website. 514 00:35:52,890 --> 00:35:55,290 How frequently they arrive.
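One common way to read "the arrivals are exponential" is that the gaps between successive arrivals follow an exponential distribution. A minimal sketch under that assumption, with a made-up rate of 2 arrivals per minute (`RATE` is hypothetical, not a measured figure):

```python
import random

rng = random.Random(0)

RATE = 2.0  # hypothetical: average number of arrivals per minute

# random.expovariate(RATE) returns one exponentially distributed gap
# whose mean is 1 / RATE minutes.
gaps = [rng.expovariate(RATE) for _ in range(100_000)]

mean_gap = sum(gaps) / len(gaps)
below_mean = sum(g < mean_gap for g in gaps) / len(gaps)

print("mean gap in minutes:        ", round(mean_gap, 3))
print("fraction of gaps below mean:", round(below_mean, 3))
```

Unlike the symmetric curves discussed earlier, the exponential is lopsided: well over half of the gaps are shorter than the mean, with a long tail of occasional large gaps.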
515 00:35:55,290 --> 00:35:59,640 We'll see much more starting next week, or maybe even 516 00:35:59,640 --> 00:36:03,260 starting Thursday, about exponential distributions, as 517 00:36:03,260 --> 00:36:06,370 we go on with a final case study that we'll be dealing 518 00:36:06,370 --> 00:36:09,960 with in the course. 519 00:36:09,960 --> 00:36:15,460 You can think of each of these, by the way, as 520 00:36:15,460 --> 00:36:19,350 increasing order of predictability. 521 00:36:19,350 --> 00:36:23,590 Uniform distribution means the result is most unpredictable, 522 00:36:23,590 --> 00:36:27,390 it could be anything. 523 00:36:27,390 --> 00:36:34,750 A normal distribution says, well, it's pretty predictable. 524 00:36:34,750 --> 00:36:37,170 Again, depending on the standard deviation. 525 00:36:37,170 --> 00:36:44,090 If you guess the mean, you're pretty close to right. 526 00:36:44,090 --> 00:36:47,730 The exponential is very predictable. 527 00:36:47,730 --> 00:36:55,100 Most of the answers are right around the mean. 528 00:36:55,100 --> 00:36:57,220 Now there are many other distributions, there are 529 00:36:57,220 --> 00:37:01,330 Pareto distributions which have fat tails, there are 530 00:37:01,330 --> 00:37:05,920 fractal distributions, there are all sorts of things. 531 00:37:05,920 --> 00:37:09,580 We won't go into those details. 532 00:37:09,580 --> 00:37:13,090 Now, I hope you didn't find this short excursion into 533 00:37:13,090 --> 00:37:17,610 statistics either too boring or too confusing. 534 00:37:17,610 --> 00:37:21,350 The point was not to teach you statistics or probability; we 535 00:37:21,350 --> 00:37:24,040 have multiple courses to do that. 536 00:37:24,040 --> 00:37:27,040 But to give you some tools that would help improve your 537 00:37:27,040 --> 00:37:31,450 intuition in thinking about data. 538 00:37:31,450 --> 00:37:36,400 In closing, I want to give a few words about 539 00:37:36,400 --> 00:37:38,830 the misuse of data.
540 00:37:38,830 --> 00:37:43,400 Since I think we misuse data an awful lot. 541 00:37:43,400 --> 00:37:57,960 So, point number 0, as in the most important, is beware of 542 00:37:57,960 --> 00:38:25,100 people who give you properties of data, but not the data. 543 00:38:25,100 --> 00:38:29,130 We see that sort of thing all the time. 544 00:38:29,130 --> 00:38:33,940 Where people come in, and they say, OK, here it is, here's 545 00:38:33,940 --> 00:38:37,120 the mean value of the quiz, and here's the standard 546 00:38:37,120 --> 00:38:42,460 deviation of the quiz, and that just doesn't really tell 547 00:38:42,460 --> 00:38:44,840 you where you stand, in some sense. 548 00:38:44,840 --> 00:38:47,630 Because it's probably not normally distributed. 549 00:38:47,630 --> 00:38:50,030 You want to see the data. 550 00:38:50,030 --> 00:38:53,310 At the very least, if you see the data, you can then say, 551 00:38:53,310 --> 00:38:57,060 yeah, it is normally distributed, so the standard 552 00:38:57,060 --> 00:39:01,530 deviation is meaningful, or not meaningful. 553 00:39:01,530 --> 00:39:05,880 So, whenever you can, try and get, at 554 00:39:05,880 --> 00:39:14,840 least, to see the data. 555 00:39:14,840 --> 00:39:18,500 So that's 1, or 0. 556 00:39:18,500 --> 00:39:24,830 1 is, well, all right. 557 00:39:24,830 --> 00:39:38,060 I'm going to test your Latin. 558 00:39:38,060 --> 00:39:41,060 Cum hoc ergo propter hoc. 559 00:39:41,060 --> 00:39:43,920 All right. 560 00:39:43,920 --> 00:39:49,170 I need a Latin scholar to translate this. 561 00:39:49,170 --> 00:39:53,870 Did not one of you take Latin in high school? 562 00:39:53,870 --> 00:39:55,110 We have someone who did. 563 00:39:55,110 --> 00:39:55,570 Go ahead. 564 00:39:55,570 --> 00:39:59,119 STUDENT: I think it means, with this, 565 00:39:59,119 --> 00:40:01,150 therefore, because of this. 566 00:40:01,150 --> 00:40:03,070 PROFESSOR: Exactly right. 
567 00:40:03,070 --> 00:40:08,090 With this, therefore, because of this. 568 00:40:08,090 --> 00:40:09,660 I'm glad that at least one person 569 00:40:09,660 --> 00:40:12,800 has a classical education. 570 00:40:12,800 --> 00:40:16,660 I don't, by the way. 571 00:40:16,660 --> 00:40:19,610 Essentially what this is telling us, is that 572 00:40:19,610 --> 00:40:39,970 correlation does not imply causation. 573 00:40:39,970 --> 00:40:42,830 So sometimes two things go together. 574 00:40:42,830 --> 00:40:45,310 They both go up, they both go down. 575 00:40:45,310 --> 00:40:46,990 And people jump to the conclusion that 576 00:40:46,990 --> 00:40:49,590 one causes the other. 577 00:40:49,590 --> 00:40:53,250 That there's a cause and effect relationship. 578 00:40:53,250 --> 00:40:57,660 That is just not true. 579 00:40:57,660 --> 00:41:01,100 It's what's called a logical fallacy. 580 00:41:01,100 --> 00:41:04,310 So we see some examples of this. 581 00:41:04,310 --> 00:41:05,710 And you can get into big trouble. 582 00:41:05,710 --> 00:41:08,230 So here's a very interesting one. 583 00:41:08,230 --> 00:41:11,110 There was a very widely reported epidemiological 584 00:41:11,110 --> 00:41:14,830 study, that's a medical study where you get statistics about 585 00:41:14,830 --> 00:41:16,870 large populations. 586 00:41:16,870 --> 00:41:20,570 And it showed that women, who are taking hormone replacement 587 00:41:20,570 --> 00:41:24,850 therapy, were found to have a lower incidence of coronary 588 00:41:24,850 --> 00:41:28,570 heart disease than women who didn't. 589 00:41:28,570 --> 00:41:32,320 This was a big study of a lot of women. 590 00:41:32,320 --> 00:41:38,230 This led doctors to propose that hormone replacement 591 00:41:38,230 --> 00:41:41,430 therapy for middle aged women was protective against 592 00:41:41,430 --> 00:41:44,210 coronary heart disease. 
593 00:41:44,210 --> 00:41:48,720 And in fact, in response to this, a large number of 594 00:41:48,720 --> 00:41:50,640 medical societies recommended this. 595 00:41:50,640 --> 00:41:55,310 And a large number of women were given this therapy. 596 00:41:55,310 --> 00:42:00,810 Later, controlled trials showed that in fact, hormone 597 00:42:00,810 --> 00:42:04,960 replacement therapy in women caused a small and significant 598 00:42:04,960 --> 00:42:10,430 increase in coronary heart disease. 599 00:42:10,430 --> 00:42:13,350 So they had taken the fact that these were correlated, 600 00:42:13,350 --> 00:42:17,760 said one causes the other, made a prescription, and it 601 00:42:17,760 --> 00:42:20,840 turned out to be the wrong one. 602 00:42:20,840 --> 00:42:25,100 Now, how could this be? 603 00:42:25,100 --> 00:42:27,960 How could this be? 604 00:42:27,960 --> 00:42:33,320 It turned out that the women in the original study who were 605 00:42:33,320 --> 00:42:39,120 taking the hormone replacement therapy, tended to be from a 606 00:42:39,120 --> 00:42:43,760 higher socioeconomic group than those who didn't. 607 00:42:43,760 --> 00:42:46,460 Because the therapy was not covered by insurance, so the 608 00:42:46,460 --> 00:42:48,960 women who took it were wealthy. 609 00:42:48,960 --> 00:42:52,000 Turns out wealthy people do a lot of other things that are 610 00:42:52,000 --> 00:42:54,010 protective of their hearts. 611 00:42:54,010 --> 00:42:57,520 And, therefore, are in general healthier than poor people. 612 00:42:57,520 --> 00:42:59,370 This is not a surprise. 613 00:42:59,370 --> 00:43:02,590 Rich people are healthier than poor people. 614 00:43:02,590 --> 00:43:09,090 And so in fact, it was this third variable that was 615 00:43:09,090 --> 00:43:13,150 actually the meaningful one. 616 00:43:13,150 --> 00:43:14,460 This is what is called in 617 00:43:14,460 --> 00:43:30,500 statistics, a lurking variable. 
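The way a lurking variable can reverse an apparent effect can be sketched with a small simulation. Every probability below is invented for illustration and none comes from the study described; the sketch assumes wealth both raises the chance of taking the therapy and lowers baseline heart-disease risk, while the therapy itself adds a small amount of risk.

```python
import random

rng = random.Random(0)
N = 300_000

# (wealthy, therapy) -> [number of women, number with heart disease]
counts = {}
for _ in range(N):
    wealthy = rng.random() < 0.3                         # hypothetical share
    therapy = rng.random() < (0.6 if wealthy else 0.1)   # wealthy take it more often
    risk = (0.05 if wealthy else 0.15) + (0.03 if therapy else 0.0)
    sick = rng.random() < risk
    cell = counts.setdefault((wealthy, therapy), [0, 0])
    cell[0] += 1
    cell[1] += sick


def rate(cells):
    """Heart-disease rate pooled over the given (wealthy, therapy) groups."""
    people = sum(counts[c][0] for c in cells)
    cases = sum(counts[c][1] for c in cells)
    return cases / people


# Ignoring wealth, the therapy looks protective.
naive_therapy = rate([(True, True), (False, True)])
naive_none = rate([(True, False), (False, False)])
print("all women:", round(naive_therapy, 3), "with therapy vs", round(naive_none, 3), "without")

# Comparing like with like, the therapy is harmful in both groups.
for wealthy in (True, False):
    label = "wealthy" if wealthy else "poor"
    with_t = rate([(wealthy, True)])
    without_t = rate([(wealthy, False)])
    print(label + ":", round(with_t, 3), "with therapy vs", round(without_t, 3), "without")
```

In this toy population the pooled comparison and the within-group comparison point in opposite directions, which is the trap the anecdote describes.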
618 00:43:30,500 --> 00:43:34,390 Both of the things they were looking at in this study, who 619 00:43:34,390 --> 00:43:39,180 took the therapy, and who had heart disease, each of those 620 00:43:39,180 --> 00:43:42,320 was correlated with the lurking variable of 621 00:43:42,320 --> 00:43:46,680 socioeconomic position. 622 00:43:46,680 --> 00:43:49,960 And so, in effect, there was no cause and effect 623 00:43:49,960 --> 00:43:51,370 relationship. 624 00:43:51,370 --> 00:43:54,990 And once they did another study, in which the lurking 625 00:43:54,990 --> 00:43:59,400 variable was controlled, and they looked at heart disease 626 00:43:59,400 --> 00:44:02,620 among rich women separately from poor women, with this 627 00:44:02,620 --> 00:44:07,060 therapy, they discovered that therapy was not good. 628 00:44:07,060 --> 00:44:10,700 It was, in fact, harmful. 629 00:44:10,700 --> 00:44:16,290 So this is a very important moral to remember. 630 00:44:16,290 --> 00:44:21,790 When you look at correlations, don't assume cause and effect. 631 00:44:21,790 --> 00:44:25,540 And don't assume that there isn't a lurking variable that 632 00:44:25,540 --> 00:44:29,940 really is the dominant factor. 633 00:44:29,940 --> 00:44:35,310 So that's one statistical, a second statistical worry. 634 00:44:35,310 --> 00:44:48,260 Number 2 is, beware of what's called, non-response bias. 635 00:44:48,260 --> 00:44:51,030 Which is another fancy way of saying, 636 00:44:51,030 --> 00:45:05,120 non-representative samples. 637 00:45:05,120 --> 00:45:08,130 No one doing a study beyond the trivial can sample 638 00:45:08,130 --> 00:45:11,590 everybody or everything. 639 00:45:11,590 --> 00:45:17,080 And only mind readers can be sure of what they've missed. 640 00:45:17,080 --> 00:45:18,640 Unless, of course, people choose to 641 00:45:18,640 --> 00:45:20,600 miss things on purpose. 642 00:45:20,600 --> 00:45:22,030 Which you also see.
643 00:45:22,030 --> 00:45:24,720 And that brings me to my next anecdote. 644 00:45:24,720 --> 00:45:31,240 A former professor at the University of Nebraska, who 645 00:45:31,240 --> 00:45:34,300 later headed a group called The Family Research Institute, 646 00:45:34,300 --> 00:45:37,340 which some of you may have heard about, claimed that gay 647 00:45:37,340 --> 00:45:43,610 men have an average life expectancy of 43 years. 648 00:45:43,610 --> 00:45:46,570 And they did a study full of statistics showing that this 649 00:45:46,570 --> 00:45:49,550 was the case. 650 00:45:49,550 --> 00:45:53,690 And the key was, they calculated the figure by 651 00:45:53,690 --> 00:45:58,160 checking gay newspapers for obituaries and news about 652 00:45:58,160 --> 00:46:00,170 stories of death. 653 00:46:00,170 --> 00:46:04,410 So they went through the gay newspapers, took a list of 654 00:46:04,410 --> 00:46:08,190 everybody whose obituary appeared, how old they were 655 00:46:08,190 --> 00:46:13,080 when they died, took the average, and said it was 43. 656 00:46:13,080 --> 00:46:15,430 Then they did a bunch of statistics, with all sorts of 657 00:46:15,430 --> 00:46:19,190 tests, showing how, you know, what the curves look like, the 658 00:46:19,190 --> 00:46:21,310 distributions, and the significance. 659 00:46:21,310 --> 00:46:25,130 All the math was valid. 660 00:46:25,130 --> 00:46:28,890 The problem was, it was a very unrepresentative sample. 661 00:46:28,890 --> 00:46:29,680 What was the most 662 00:46:29,680 --> 00:46:32,990 unrepresentative thing about it? 663 00:46:32,990 --> 00:46:33,480 Somebody? 664 00:46:33,480 --> 00:46:38,595 STUDENT: Not all deaths have obituaries. 665 00:46:38,595 --> 00:46:40,930 PROFESSOR: Well, that's one thing. 666 00:46:40,930 --> 00:46:41,820 That's certainly true. 667 00:46:41,820 --> 00:46:44,420 But what else? 668 00:46:44,420 --> 00:46:48,170 Well, not all gay people are dead, right? 
669 00:46:48,170 --> 00:46:51,370 So if you're looking at obituaries, you're in fact 670 00:46:51,370 --> 00:46:52,690 only getting -- 671 00:46:52,690 --> 00:46:56,310 I'm sure that's what you were planning to say -- sorry. 672 00:46:56,310 --> 00:46:59,400 You're only getting the people who are dead, so it's clearly 673 00:46:59,400 --> 00:47:02,970 going to make the number look smaller, right? 674 00:47:02,970 --> 00:47:05,460 Furthermore, you're only getting the ones that were 675 00:47:05,460 --> 00:47:09,720 reported in newspapers, the newspapers are typically 676 00:47:09,720 --> 00:47:17,140 urban, rather than out in rural areas, so it turns out, 677 00:47:17,140 --> 00:47:22,230 it's also biased against gays who chose not to come out of the 678 00:47:22,230 --> 00:47:26,400 closet, and therefore didn't appear in these. 679 00:47:26,400 --> 00:47:29,410 Lots and lots of problems with this. 680 00:47:29,410 --> 00:47:31,260 Believe it or not, this paper was published 681 00:47:31,260 --> 00:47:34,040 in a reputable journal. 682 00:47:34,040 --> 00:47:38,330 And someone checked all the math, but missed the fact that 683 00:47:38,330 --> 00:47:49,100 all of that was irrelevant because the sample was wrong. 684 00:47:49,100 --> 00:47:50,790 Data enhancement. 685 00:47:50,790 --> 00:47:53,500 It even sounds bad, right? 686 00:47:53,500 --> 00:47:56,820 You run an experiment, you get your data, and you enhance it. 687 00:47:56,820 --> 00:47:59,460 It's kind of like when you ran those physics experiments in 688 00:47:59,460 --> 00:48:02,640 high school, and you got answers that you knew didn't 689 00:48:02,640 --> 00:48:05,340 match the theory, so you fudged the data? 690 00:48:05,340 --> 00:48:08,910 I know none of you would have ever done that, but some 691 00:48:08,910 --> 00:48:09,850 people have been known to do it. 692 00:48:09,850 --> 00:48:12,380 That's not actually what this means.
693 00:48:12,380 --> 00:48:16,260 What this means is, reading more into the data than it 694 00:48:16,260 --> 00:48:19,390 actually implies. 695 00:48:19,390 --> 00:48:23,410 So well-meaning people are often the guiltiest here. 696 00:48:23,410 --> 00:48:26,010 So here's another one of my favorites. 697 00:48:26,010 --> 00:48:29,730 For example, there are people who try to scare us into 698 00:48:29,730 --> 00:48:31,330 driving safely. 699 00:48:31,330 --> 00:48:33,920 Driving safely is a good thing. 700 00:48:33,920 --> 00:48:36,280 By telling us about holiday deaths. 701 00:48:36,280 --> 00:48:39,790 So you'll read things like, 400 killed on the highways 702 00:48:39,790 --> 00:48:42,240 over long weekend. 703 00:48:42,240 --> 00:48:46,600 It sounds really bad, until you observe the fact that 704 00:48:46,600 --> 00:48:51,940 roughly 400 people are killed on any 3-day period. 705 00:48:51,940 --> 00:48:56,790 And in fact, it's no higher on the holiday weekends. 706 00:48:56,790 --> 00:48:58,360 I'll bet you all thought more people got 707 00:48:58,360 --> 00:48:59,580 killed on holiday weekends. 708 00:48:59,580 --> 00:49:02,330 Well, typically not. 709 00:49:02,330 --> 00:49:05,210 They just report how many died, but they don't tell you 710 00:49:05,210 --> 00:49:12,440 the context, say, oh, by the way, take any 3-day period. 711 00:49:12,440 --> 00:49:17,260 So the moral there is, you really want to place the data 712 00:49:17,260 --> 00:49:20,250 in context. 713 00:49:20,250 --> 00:49:25,530 Data taken out of context without comparison is usually 714 00:49:25,530 --> 00:49:27,460 meaningless. 715 00:49:27,460 --> 00:49:39,540 Another variant of this is extrapolation. 716 00:49:39,540 --> 00:49:42,150 A commonly quoted statistic. 717 00:49:42,150 --> 00:49:45,300 Most auto accidents happen within 10 miles of home. 718 00:49:45,300 --> 00:49:48,750 Anyone here heard that statistic? 719 00:49:48,750 --> 00:49:53,080 It's true, but what does it mean?
720 00:49:53,080 --> 00:49:55,730 Well, people tend to say it means, it's dangerous to drive 721 00:49:55,730 --> 00:49:57,620 when you're near home. 722 00:49:57,620 --> 00:50:02,430 But in fact, most driving is done within 10 miles of home. 723 00:50:02,430 --> 00:50:06,090 Furthermore, we don't actually know where home is. 724 00:50:06,090 --> 00:50:09,370 Home is where the car is supposedly garaged on the 725 00:50:09,370 --> 00:50:14,800 state registration forms. So, data enhancements would 726 00:50:14,800 --> 00:50:18,650 suggest that I should register my car in Alaska. 727 00:50:18,650 --> 00:50:20,980 And then I would never be driving within 10 miles of 728 00:50:20,980 --> 00:50:24,100 home, and I would be much safer. 729 00:50:24,100 --> 00:50:32,330 Well, it's probably not a fact. 730 00:50:32,330 --> 00:50:37,020 So there are all sorts of things on that. 731 00:50:37,020 --> 00:50:39,090 Well, I think I will come back to this, because I have a 732 00:50:39,090 --> 00:50:42,090 couple more good stories which I hate not to give you. 733 00:50:42,090 --> 00:50:45,750 So we'll come back on Thursday and look at a couple of more 734 00:50:45,750 --> 00:50:48,180 things that can go wrong with statistics.
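The point about accidents near home comes down to comparing against a baseline, and a few lines of arithmetic make that concrete. Both shares below are hypothetical placeholders chosen only to illustrate the reasoning; they are not figures from the lecture or from any data set.

```python
# Hypothetical shares, for illustration only.
SHARE_OF_ACCIDENTS_NEAR_HOME = 0.60  # "most accidents happen within 10 miles of home"
SHARE_OF_DRIVING_NEAR_HOME = 0.70    # "most driving is done within 10 miles of home"

# Accidents per unit of driving, near home and far from home.
risk_near = SHARE_OF_ACCIDENTS_NEAR_HOME / SHARE_OF_DRIVING_NEAR_HOME
risk_far = (1 - SHARE_OF_ACCIDENTS_NEAR_HOME) / (1 - SHARE_OF_DRIVING_NEAR_HOME)

relative_risk = risk_near / risk_far
print("risk per mile near home relative to far from home:", round(relative_risk, 2))
```

With these made-up shares, most accidents do happen near home, yet each mile driven near home is less risky than a mile driven elsewhere. The headline statistic alone cannot distinguish that case from the opposite one.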