The following content is provided under a Creative Commons license. Your support will help MIT OpenCourseWare continue to offer high-quality educational resources for free. To make a donation or view additional materials from hundreds of MIT courses, visit MIT OpenCourseWare at ocw.mit.edu.

PROFESSOR: We're going to finish today our discussion of limit theorems. I'm going to remind you what the central limit theorem is, which we introduced briefly last time. We're going to discuss what exactly it says and its implications. And then we're going to apply it to a couple of examples, mostly on the binomial distribution.

OK, so the situation is that we are dealing with a large number of independent, identically distributed random variables. And we want to look at the sum of them and say something about the distribution of the sum. We might want to say that the sum is distributed approximately as a normal random variable, although, formally, this is not quite right.
As n goes to infinity, the distribution of the sum becomes very spread out, and it doesn't converge to a limiting distribution. In order to get an interesting limit, we first need to take the sum and standardize it. By standardizing it, what we mean is to subtract the mean and then divide by the standard deviation -- that is, Zn = (Sn - n*mu) / (sigma * sqrt(n)). Now, the mean is, of course, n times the expected value of each one of the X's. And the standard deviation is the square root of the variance. The variance is n times sigma squared, where sigma squared is the variance of the X's -- so the standard deviation is sigma times the square root of n. And after we do this, we obtain a random variable that has 0 mean -- it's centered -- and variance equal to 1. And the variance stays the same, no matter how large n is going to be. So the distribution of Zn keeps changing with n, but it cannot change too much. It stays in place: the mean is 0, and the width remains roughly the same because the variance is 1. The surprising thing is that, as n grows, the distribution of Zn settles into a certain asymptotic shape.
And that's the shape of a standard normal random variable. So standard normal means that it has 0 mean and unit variance. More precisely, what the central limit theorem tells us is a relation between the cumulative distribution function of Zn and the cumulative distribution function of the standard normal. So for any given number c, the probability that Zn is less than or equal to c, in the limit, becomes the same as the probability that the standard normal is less than or equal to c. And of course, this is useful because these probabilities are available from the normal tables, whereas the distribution of Zn might be a very complicated expression if you were to calculate it exactly.

So, some comments about the central limit theorem. The first thing is that it's quite amazing that it's universal. It doesn't matter what the distribution of the X's is. It can be any distribution whatsoever, as long as it has finite mean and finite variance. And when you go and do your approximations using the central limit theorem, the only things that you need to know about the distribution of the X's are the mean and the variance.
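To make the statement concrete, here is a small simulation sketch. The choice of X's uniform on [0, 1], the sample size n = 16, the trial count, and the function names are all illustrative assumptions, not part of the lecture:

```python
import math
import random

def phi(c):
    # standard normal CDF, written using the error function
    return 0.5 * (1.0 + math.erf(c / math.sqrt(2.0)))

def empirical_zn_cdf(n, c, trials=20000, seed=0):
    """Empirical P(Zn <= c), where Zn = (Sn - n*mu) / (sigma*sqrt(n))
    and the X's are uniform on [0, 1] (mu = 1/2, sigma^2 = 1/12)."""
    rng = random.Random(seed)
    mu, sigma = 0.5, math.sqrt(1.0 / 12.0)
    hits = 0
    for _ in range(trials):
        s = sum(rng.random() for _ in range(n))
        z = (s - n * mu) / (sigma * math.sqrt(n))
        if z <= c:
            hits += 1
    return hits / trials

# P(Zn <= 1) should already be close to Phi(1) for moderate n
print(phi(1.0), empirical_zn_cdf(16, 1.0))
```

Even at n = 16 the empirical CDF of Zn sits within simulation noise of the standard normal CDF, which is the content of the theorem.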
You need those in order to standardize Sn. I mean -- to subtract the mean and divide by the standard deviation -- you need to know the mean and the variance. But these are the only things that you need to know in order to apply it.

In addition, it's a very accurate computational shortcut. The distribution of these Zn's, in principle, you could calculate by convolving the distribution of the X's with itself many, many times. But this is tedious, and if you try to do it analytically, it might come out a very complicated expression. Whereas by just appealing to the table for the standard normal random variable, things are done in a very quick way. So it's a nice computational shortcut if you don't need an exact answer to a probability problem.

Now, at a more philosophical level, it justifies why we are really interested in normal random variables.
Whenever you have a phenomenon which is noisy, and the noise that you observe is created by adding up lots of little pieces of randomness that are independent of each other, the overall effect that you're going to observe can be described by a normal random variable.

So, in a classic example that goes back 100 years or so, suppose that you have a fluid, and inside that fluid there's a little particle of dust or whatever that's suspended in there. That little particle gets hit by molecules completely at random -- and so what you're going to see is that particle kind of moving randomly inside that liquid. Now, for that random motion, if you ask, after one second, how much is my particle displaced, let's say, along the x direction -- that displacement is very, very well modeled by a normal random variable. And the reason is that the position of that particle is decided by the cumulative effect of lots of random hits by molecules. So that's a celebrated physical model that goes under the name of Brownian motion.
And it's the same model that some people use to describe the movement in the financial markets. The argument might go that the movement of prices has to do with lots of little decisions and lots of little events by many, many different actors that are involved in the market. So the distribution of stock prices might be well described by normal random variables. At least that's what people wanted to believe until somewhat recently. Now, the evidence is that, actually, these distributions are a little more heavy-tailed, in the sense that extreme events are a little more likely to occur than what normal random variables would seem to indicate. But as a first model, again, it could be a plausible argument to have, at least as a starting model, one that involves normal random variables.

So this is the philosophical side of things. On the more precise, mathematical side, it's important to appreciate exactly what kind of statement the central limit theorem is. It's a statement about the convergence of the CDF of these standardized random variables to the CDF of a normal. So it's a statement about convergence of CDFs.
It's not a statement about convergence of PMFs or convergence of PDFs. Now, if one makes additional mathematical assumptions, there are variations of the central limit theorem that talk about PDFs and PMFs. But in general, that's not necessarily the case. And I'm going to illustrate this with -- I have a plot here which is not in your slides, but just to make the point. Consider two different discrete distributions. One discrete distribution takes the values 1, 4, and 7. The other discrete distribution takes the values 1, 2, 4, 6, and 7. So the first one has a sort of periodicity of 3; for the other one, the range of values is a little more interesting. The numbers in these two distributions are cooked up so that they have the same mean and the same variance.

Now, what I'm going to do is to take eight independent copies of the random variable and plot the PMF of the sum of the eight random variables. If I plot the PMF of the sum of 8 of these, I get the plot which corresponds to the bullets in this diagram.
If I take 8 random variables according to the other distribution, add them up, and compute their PMF, the PMF I get is the one denoted here by the X's. The two PMFs look really different, at least when you eyeball them. On the other hand, if you were to plot their CDFs and compare them with the normal CDF, which is this continuous curve -- the CDF, of course, goes up in steps because we're looking at discrete random variables -- it's very close to the normal CDF. And if, instead of n equal to 8, we were to take 16, then the agreement would be even better. So in terms of CDFs, when we add 8 or 16 of these, we get very close to the normal CDF. We would get essentially the same picture if I were to take 8 or 16 of the other kind. So the CDFs sit essentially on top of each other, although the two PMFs look quite different. So this is to appreciate that, formally speaking, we only have a statement about CDFs, not about PMFs.

Now, in practice, how do you use the central limit theorem? Well, it tells us that we can calculate probabilities by treating Zn as if it were a standard normal random variable.
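The CDF comparison in those plots can be reproduced numerically by exact convolution. A sketch, using one illustrative three-point PMF on the values 1, 4, 7 (the exact weights on the lecturer's slide aren't reproduced here, so uniform weights are assumed):

```python
import math

def convolve(p, q):
    """Exact PMF of the sum of two independent discrete random variables."""
    out = {}
    for x, px in p.items():
        for y, py in q.items():
            out[x + y] = out.get(x + y, 0.0) + px * py
    return out

def pmf_power(p, n):
    """PMF of the sum of n i.i.d. copies of p (n-fold convolution)."""
    out = {0: 1.0}
    for _ in range(n):
        out = convolve(out, p)
    return out

def max_cdf_gap(p, n):
    """Largest gap between the CDF of the n-fold sum and the
    normal CDF with matching mean and variance."""
    mu = sum(x * px for x, px in p.items())
    var = sum((x - mu) ** 2 * px for x, px in p.items())
    pn = pmf_power(p, n)
    m, s = n * mu, math.sqrt(n * var)
    gap, cdf = 0.0, 0.0
    for x in sorted(pn):
        cdf += pn[x]
        normal = 0.5 * (1.0 + math.erf((x - m) / (s * math.sqrt(2.0))))
        gap = max(gap, abs(cdf - normal))
    return gap

p = {1: 1/3, 4: 1/3, 7: 1/3}
print(round(max_cdf_gap(p, 8), 3), round(max_cdf_gap(p, 16), 3))
```

The PMF of the sum keeps its period-3 spikes, but the worst-case CDF gap shrinks as n goes from 8 to 16, which is exactly the sense in which the theorem holds.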
Now, Zn is a linear function of Sn. Conversely, Sn is a linear function of Zn. Linear functions of normals are normal. So if I pretend that Zn is normal, it's essentially the same as if we pretend that Sn is normal. And so we can calculate probabilities that have to do with Sn as if Sn were normal. Now, the central limit theorem does not tell us that Sn is approximately normal -- the formal statement is about Zn -- but, practically speaking, when you use the result, you can just pretend that Sn is normal.

Finally, it's a limit theorem, so it tells us about what happens when n goes to infinity. If we are to use it in practice, of course, n is not going to be infinity. Maybe n is equal to 15. Can we use a limit theorem when n is a number as small as 15? Well, it turns out that it's a very good approximation. Even for quite small values of n, it gives us very accurate answers. So n on the order of 15, or 20, or so gives us very good results in practice.
There are no good theorems that will give us hard guarantees, because the quality of the approximation does depend on the details of the distribution of the X's. If the X's have a distribution that, from the outset, looks a little bit like the normal, then for small values of n you are going to see, essentially, a normal distribution for the sum. If the distribution of the X's is very different from the normal, it's going to take a larger value of n for the central limit theorem to take effect.

So let's illustrate this with a few representative plots. Here, we're starting with a discrete uniform distribution that goes from 1 to 8. Let's add 2 of these random variables -- 2 random variables with this PMF -- and find the PMF of the sum. This is a convolution of 2 discrete uniforms, and I believe you have seen this exercise before. When you convolve this with itself, you get a triangle. So this is the PMF for the sum of two discrete uniforms. Now let's continue. Let's convolve this with itself. This is going to give us the PMF of a sum of 4 discrete uniforms. And we get this, which starts looking like a normal.
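That two-fold convolution step can be checked directly. A minimal sketch (the dictionary representation of the PMF is just one convenient choice):

```python
# PMF of one draw, uniform on 1..8
p = {k: 1 / 8 for k in range(1, 9)}

# convolve p with itself: PMF of the sum of two independent draws
s2 = {}
for x in p:
    for y in p:
        s2[x + y] = s2.get(x + y, 0.0) + p[x] * p[y]

# the result is triangular: it rises linearly from 2 up to a peak
# at 9 (probability 8/64), then falls back down to 16
print({k: round(v, 4) for k, v in sorted(s2.items())})
```

The endpoints 2 and 16 each have probability 1/64, and every step toward the middle adds one more way to make the total, hence the triangle.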
If we go to n equal to 32, then it looks, essentially, exactly like a normal, and it's an excellent approximation. So this is the PMF of the sum of 32 discrete random variables with this uniform distribution.

Now, this distribution is symmetric around its mean. If we start instead with a PMF which is non-symmetric -- here, this is a truncated geometric PMF -- then things do not work out as nicely when I add 8 of these. That is, if I convolve this with itself 8 times, I get this PMF, which maybe resembles the normal one a little bit. But you can really tell that it's different from the normal if you focus on the details here and there. Here it sort of rises sharply; here it tails off a bit more slowly. So there's an asymmetry present, which is a consequence of the asymmetry of the distribution we started with. If we go to 16, it looks a little better, but you can still see the asymmetry between this tail and that tail. If we get to 32, there's still a little bit of asymmetry, but at least now it starts looking like a normal distribution.
So the moral from these plots is that the value of n you need before you get a really good approximation may vary a little bit. But for values of n in the range of 20 to 30 or so, usually you expect to get a pretty good approximation. At least that's what visual inspection of these graphs tells us.

So now that we know that we have a good approximation in our hands, let's use it. Let's use it by revisiting an example from last time. This is the polling problem. We're interested in the fraction f of the population that has a certain habit, and we try to find what f is. The way we do it is by polling people at random and recording the answers that they give -- whether they have the habit or not. So for each person, we get a Bernoulli random variable: with probability f, a person is going to respond 1, or yes; and with the remaining probability 1-f, the person responds no. We record this number, Mn, which is how many people answered yes, divided by the total number of people -- that's over the fraction of the population that we asked.
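The polling setup can be sanity-checked by simulation. A sketch, assuming for illustration that f = 1/2 (which lets us count "yes" answers as the set bits of a block of random fair bits -- that trick is specific to f = 1/2); the function name, seed, and trial count are arbitrary choices:

```python
import random

def poll_miss_probability(n, trials=20000, seed=1):
    """Monte Carlo estimate of P(|Mn - f| >= 0.01) when f = 1/2:
    each respondent is a fair coin, so the number of 'yes' answers
    among n respondents is the popcount of n random bits."""
    rng = random.Random(seed)
    misses = 0
    for _ in range(trials):
        yes = bin(rng.getrandbits(n)).count("1")
        if abs(yes / n - 0.5) >= 0.01:
            misses += 1
    return misses / trials

# with n = 10,000, the CLT calculation later in the lecture
# (using the conservative sigma = 1/2) predicts roughly 4.5%
print(poll_miss_probability(10000))
```

Running this gives an estimate in the vicinity of 4 to 5 percent, matching the normal-approximation answer derived below with n = 10,000.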
This is the fraction inside our sample that answered yes. And as we discussed last time, you might start with some specs for the poll. The specs have two parameters -- the accuracy that you want, and the confidence that you want to have that you really did obtain the desired accuracy. So the spec here is that we want probability 95% that our estimate is within 1 percentage point of the true answer. So the event of interest is this: that the distance of the result of the poll from the true answer is bigger than 1 percentage point. And we're interested in calculating, or approximating, this particular probability. We want to do it using the central limit theorem. And one way of arranging the mechanics of this calculation is to take the event of interest and massage it, by subtracting and dividing things on both sides of the inequality, so that you bring into the picture the standardized random variable, the Zn, and then apply the central limit theorem. So the event of interest -- let me write it in full -- Mn is this quantity, so I'm putting it here, minus f, which I write as nf divided by n. So this is the same as that event.
We're going to calculate the probability of this. This is not exactly in the form in which we apply the central limit theorem. To apply the central limit theorem, we need, down here, to have sigma times the square root of n. So how can I put sigma square root n here? I can divide both sides of this inequality by sigma, and then I can take a factor of square root n from here and send it to the other side. So this event is the same as that event -- this will happen if and only if that will happen. So calculating the probability of this event here is the same as calculating the probability that this event happens. And now we are in business, because the random variable that we have in here is Zn -- or rather the absolute value of Zn -- and we're talking about the probability that the absolute value of Zn is bigger than a certain number. Since Zn is to be approximated by a standard normal random variable, our approximation is going to be: instead of asking for the absolute value of Zn to be bigger than this number, we will ask for the absolute value of Z to be bigger than this number. So this is the probability that we want to calculate, and now Z is a standard normal random variable.
There's a small difficulty -- the one that we also encountered last time. And the difficulty is that the standard deviation, sigma, of the Xi's is not known. Sigma, in this example, is the square root of f times (1-f), and the only thing that we know about sigma is that it's going to be a number less than or equal to 1/2.

OK, so we're going to have to use an inequality here. We're going to use a conservative value of sigma -- the value sigma equal to 1/2 -- instead of the exact value of sigma. And this gives us an inequality going this way. Let's just make sure why the inequality goes this way. We've got, on our axis, two numbers. One number is 0.01 square root n divided by sigma. The other number is 0.02 square root n. And my claim is that the numbers are related to each other in this particular way. Why is this? Sigma is less than 1/2, so 1/sigma is bigger than 2. And since 1/sigma is bigger than 2, this means that this number sits to the right of that number.
So here we have the probability that Z is bigger than this number. The probability of falling out there is less than the probability of falling in this region. So that's what the last inequality is saying -- this probability is smaller than that probability. The first one is the probability that we're interested in, but since we don't know sigma, we take the conservative value, and we use an upper bound in terms of the probability of this region here.

And now we are in business. We can start using our normal tables to calculate probabilities of interest. So, for example, let's say that we take n to be 10,000. How is the calculation going to go? We want to calculate the probability that the absolute value of Z is bigger than 0.02 times the square root of 10,000 -- that is, 0.02 times 100 -- which is the probability that the absolute value of Z is larger than or equal to 2. And here let's do some mechanics, just to stay in shape. The probability of being larger than or equal to 2 in absolute value -- since the normal is symmetric around its mean -- is going to be twice the probability that Z is larger than or equal to 2.
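This numerical step can be checked directly with the error function. A sketch (note the table value 0.9772 used in the lecture is a rounding of Phi(2); computing with more digits gives about 0.0455 rather than 0.0456):

```python
import math

def phi(c):
    # standard normal CDF, via the error function
    return 0.5 * (1.0 + math.erf(c / math.sqrt(2.0)))

n = 10000
sigma_bound = 0.5                      # conservative bound on sigma
c = 0.01 * math.sqrt(n) / sigma_bound  # = 0.02 * sqrt(n) = 2.0
p_error = 2.0 * (1.0 - phi(c))         # P(|Z| >= c), by symmetry
print(round(c, 2), round(p_error, 4))
```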
Can we use the cumulative distribution function of Z to calculate this? Well, almost -- the CDF gives us probabilities of being less than something, not bigger than something. So we need one more step, and we write this as 1 minus the probability that Z is less than or equal to 2. And this probability, now, you can read off from the normal tables. The normal tables will tell you that this probability is 0.9772. And you do get an answer: the answer is 0.0456.

OK, so we tried 10,000, and we find that our probability of error is about 4.5%, so we're doing better than the spec that we had. This tells us that maybe we have some leeway. Maybe we can use a smaller sample size and still stay within our specs. Let's try to find out how much we can push the envelope. How much smaller can we take n?

To answer that question, we need to do this kind of calculation, essentially, going backwards. We're going to fix this number to be 0.05 and work backwards here to find -- did I make a mistake here? 10,000 -- so I'm missing a 0 here.
417 00:23:53,700 --> 00:23:57,440 418 00:23:57,440 --> 00:24:07,540 Ah, but I'm taking the square root, so it's 100. 419 00:24:07,540 --> 00:24:11,080 Where did the 0.02 come in from? 420 00:24:11,080 --> 00:24:12,020 Ah, from here. 421 00:24:12,020 --> 00:24:15,870 OK, all right. 422 00:24:15,870 --> 00:24:19,330 0.02 times 100, that gives us 2. 423 00:24:19,330 --> 00:24:22,130 OK, all right. 424 00:24:22,130 --> 00:24:24,240 Very good, OK. 425 00:24:24,240 --> 00:24:27,570 So we'll have to do this calculation now backwards, 426 00:24:27,570 --> 00:24:33,510 figure out if this is 0.05, what kind of number we're 427 00:24:33,510 --> 00:24:41,380 going to need here and then here, and from this we will be 428 00:24:41,380 --> 00:24:45,240 able to tell what value of n we need. 429 00:24:45,240 --> 00:24:53,670 OK, so we want to find n such that the probability that Z is 430 00:24:53,670 --> 00:25:04,870 bigger than 0.02 square root n is 0.05. 431 00:25:04,870 --> 00:25:09,320 OK, so Z is a standard normal random variable. 432 00:25:09,320 --> 00:25:16,810 And we want the probability that we are 433 00:25:16,810 --> 00:25:18,640 outside this range. 434 00:25:18,640 --> 00:25:21,940 We want the probability of those two tails together. 435 00:25:21,940 --> 00:25:24,960 436 00:25:24,960 --> 00:25:26,920 Those two tails together should have 437 00:25:26,920 --> 00:25:29,990 probability of 0.05. 438 00:25:29,990 --> 00:25:33,280 This means that this tail, by itself, should have 439 00:25:33,280 --> 00:25:36,900 probability 0.025. 440 00:25:36,900 --> 00:25:45,960 And this means that this probability should be 0.975. 441 00:25:45,960 --> 00:25:52,350 Now, if this probability is to be 0.975, what 442 00:25:52,350 --> 00:25:54,970 should that number be? 443 00:25:54,970 --> 00:25:59,980 You go to the normal tables, and you find which is the 444 00:25:59,980 --> 00:26:03,190 entry that corresponds to that number.
445 00:26:03,190 --> 00:26:07,020 I actually brought a normal table with me. 446 00:26:07,020 --> 00:26:12,740 And 0.975 is down here. 447 00:26:12,740 --> 00:26:15,420 And it tells you that the number that 448 00:26:15,420 --> 00:26:19,820 corresponds to it is 1.96. 449 00:26:19,820 --> 00:26:24,890 So this tells us that this number 450 00:26:24,890 --> 00:26:31,790 should be equal to 1.96. 451 00:26:31,790 --> 00:26:36,380 And now, from here, you do the calculations. 452 00:26:36,380 --> 00:26:47,510 And you find that n is 9604. 453 00:26:47,510 --> 00:26:53,200 So with a sample of 10,000, we got probability of error 4.5%. 454 00:26:53,200 --> 00:26:57,910 With a slightly smaller sample size of 9,600, we can get the 455 00:26:57,910 --> 00:27:01,880 probability of a mistake to be 0.05, which 456 00:27:01,880 --> 00:27:04,070 was exactly our spec. 457 00:27:04,070 --> 00:27:07,450 So these are essentially the two ways that you're going to 458 00:27:07,450 --> 00:27:09,830 be using the central limit theorem. 459 00:27:09,830 --> 00:27:12,690 Either you're given n and you try to calculate 460 00:27:12,690 --> 00:27:13,610 probabilities. 461 00:27:13,610 --> 00:27:15,590 Or you're given the probabilities, and you want to 462 00:27:15,590 --> 00:27:18,210 work backwards to find n itself. 463 00:27:18,210 --> 00:27:20,990 464 00:27:20,990 --> 00:27:27,710 So in this example, the random variable that we dealt with 465 00:27:27,710 --> 00:27:30,450 was, of course, a binomial random variable. 466 00:27:30,450 --> 00:27:38,590 The Xi's were Bernoulli, so the sum of 467 00:27:38,590 --> 00:27:40,950 the Xi's was binomial. 468 00:27:40,950 --> 00:27:44,100 So the central limit theorem certainly applies to the 469 00:27:44,100 --> 00:27:45,950 binomial distribution. 470 00:27:45,950 --> 00:27:49,440 To be more precise, of course, it applies to the standardized 471 00:27:49,440 --> 00:27:52,730 version of the binomial random variable.
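[Editor's note: both directions of the calculation just described can be checked in a few lines of Python. This is an illustrative sketch, not part of the lecture; the `phi` helper, built on `math.erf`, stands in for the printed normal tables.]

```python
from math import erf, sqrt

def phi(x):
    """Standard normal CDF, approximating a normal-table lookup."""
    return 0.5 * (1.0 + erf(x / sqrt(2.0)))

# Direction 1: given n, compute the probability of error.
n = 10_000
z = 0.02 * sqrt(n)                 # 0.02 * 100 = 2
p_error = 2 * (1 - phi(z))         # two symmetric tails
print(round(p_error, 4))           # about 0.0455, the 4.5% from the lecture

# Direction 2: given the 0.05 spec, work backwards to n.
# Each tail gets 0.025, so we need 0.02 * sqrt(n) = 1.96.
n_needed = (1.96 / 0.02) ** 2
print(round(n_needed))             # 9604
```

The 1.96 here is the table entry with Phi(1.96) = 0.975, exactly as read off the table in the lecture.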
472 00:27:52,730 --> 00:27:55,140 So here's what we did, essentially, in 473 00:27:55,140 --> 00:27:57,300 the previous example. 474 00:27:57,300 --> 00:28:00,690 We fixed the number p, which is the probability of success 475 00:28:00,690 --> 00:28:02,010 in our experiments. 476 00:28:02,010 --> 00:28:06,550 p corresponds to f in the previous example. 477 00:28:06,550 --> 00:28:10,570 Let every Xi be a Bernoulli random variable, and our 478 00:28:10,570 --> 00:28:13,790 standing assumption is that these random variables are 479 00:28:13,790 --> 00:28:15,040 independent. 480 00:28:15,040 --> 00:28:17,580 481 00:28:17,580 --> 00:28:20,730 When we add them, we get a random variable that has a 482 00:28:20,730 --> 00:28:22,030 binomial distribution. 483 00:28:22,030 --> 00:28:25,220 We know the mean and the variance of the binomial, so 484 00:28:25,220 --> 00:28:29,130 we take Sn, we subtract the mean, which is this, divide by 485 00:28:29,130 --> 00:28:30,470 the standard deviation. 486 00:28:30,470 --> 00:28:32,790 The central limit theorem tells us that the cumulative 487 00:28:32,790 --> 00:28:36,130 distribution function of this random variable converges to 488 00:28:36,130 --> 00:28:39,860 that of a standard normal random variable in the limit. 489 00:28:39,860 --> 00:28:43,730 So let's do one more example of a calculation. 490 00:28:43,730 --> 00:28:47,160 Let's take n to be-- 491 00:28:47,160 --> 00:28:50,110 let's choose some specific numbers to work with. 492 00:28:50,110 --> 00:28:52,950 493 00:28:52,950 --> 00:28:58,300 So in this example, first thing to do is to find the 494 00:28:58,300 --> 00:29:02,390 expected value of Sn, which is n times p. 495 00:29:02,390 --> 00:29:04,150 It's 18. 496 00:29:04,150 --> 00:29:08,100 Then we need to write down the standard deviation. 497 00:29:08,100 --> 00:29:12,430 498 00:29:12,430 --> 00:29:16,530 The variance of Sn is the sum of the variances. 499 00:29:16,530 --> 00:29:19,940 It's np times (1-p).
500 00:29:19,940 --> 00:29:25,920 And in this particular example, p times (1-p) is 1/4, 501 00:29:25,920 --> 00:29:28,320 n is 36, so this is 9. 502 00:29:28,320 --> 00:29:33,120 And that tells us that the standard deviation of Sn 503 00:29:33,120 --> 00:29:34,370 is equal to 3. 504 00:29:34,370 --> 00:29:37,170 505 00:29:37,170 --> 00:29:40,650 So what we're going to do is to take the event of interest, 506 00:29:40,650 --> 00:29:46,400 which is Sn less than 21, and rewrite it in a way that 507 00:29:46,400 --> 00:29:48,910 involves the standardized random variable. 508 00:29:48,910 --> 00:29:51,700 So to do that, we need to subtract the mean. 509 00:29:51,700 --> 00:29:55,680 So we write this as Sn-3 should be less 510 00:29:55,680 --> 00:29:58,460 than or equal to 21-3. 511 00:29:58,460 --> 00:30:00,360 This is the same event. 512 00:30:00,360 --> 00:30:02,890 And then divide by the standard deviation, which is 513 00:30:02,890 --> 00:30:06,450 3, and we end up with this. 514 00:30:06,450 --> 00:30:08,300 So the event itself of-- 515 00:30:08,300 --> 00:30:09,550 AUDIENCE: [INAUDIBLE]. 516 00:30:09,550 --> 00:30:13,700 517 00:30:13,700 --> 00:30:24,150 PROFESSOR: I should subtract 18, yes, which gives me a much nicer 518 00:30:24,150 --> 00:30:26,640 number out here, which is 1. 519 00:30:26,640 --> 00:30:31,650 So the event of interest, that Sn is less than 21, is the 520 00:30:31,650 --> 00:30:37,330 same as the event that a standard normal random 521 00:30:37,330 --> 00:30:41,580 variable is less than or equal to 1. 522 00:30:41,580 --> 00:30:44,690 And once more, you can look this up at the normal tables. 523 00:30:44,690 --> 00:30:50,690 And you find that the answer that you get is 0.8413. 524 00:30:50,690 --> 00:30:53,390 Now it's interesting to compare this answer that we 525 00:30:53,390 --> 00:30:57,230 got through the central limit theorem with the exact answer. 526 00:30:57,230 --> 00:31:01,920 The exact answer involves the exact binomial distribution.
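[Editor's note: both the table lookup and the exact binomial answer can be checked with a short sketch; this is illustrative, and the `phi` helper built on `math.erf` is a stand-in for the printed tables.]

```python
from math import comb, erf, sqrt

def phi(x):
    """Standard normal CDF, approximating a normal-table lookup."""
    return 0.5 * (1.0 + erf(x / sqrt(2.0)))

n, p = 36, 0.5

# Central limit theorem approximation: Phi((21 - 18) / 3) = Phi(1).
clt = phi((21 - 18) / 3)
print(round(clt, 4))     # 0.8413

# Exact answer: sum the binomial PMF over k = 0, ..., 21.
exact = sum(comb(n, k) * p**k * (1 - p)**(n - k) for k in range(22))
print(round(exact, 4))   # 0.8785
```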
527 00:31:01,920 --> 00:31:08,780 What we have here is the binomial probability that Sn 528 00:31:08,780 --> 00:31:10,970 is equal to k. 529 00:31:10,970 --> 00:31:15,230 Sn being equal to k is given by this formula. 530 00:31:15,230 --> 00:31:22,610 And we add, over all values for k going from 0 up to 21, 531 00:31:22,610 --> 00:31:28,670 we write two lines of code to calculate this sum, and we get 532 00:31:28,670 --> 00:31:32,530 the exact answer, which is 0.8785. 533 00:31:32,530 --> 00:31:35,760 So there's pretty good agreement between the two, 534 00:31:35,760 --> 00:31:38,600 although you wouldn't 535 00:31:38,600 --> 00:31:40,395 necessarily call it excellent agreement. 536 00:31:40,395 --> 00:31:45,080 537 00:31:45,080 --> 00:31:47,060 Can we do a little better than that? 538 00:31:47,060 --> 00:31:51,570 539 00:31:51,570 --> 00:31:53,750 OK. 540 00:31:53,750 --> 00:31:56,510 It turns out that we can. 541 00:31:56,510 --> 00:31:58,625 And here's the idea. 542 00:31:58,625 --> 00:32:02,300 543 00:32:02,300 --> 00:32:07,750 So our random variable Sn has a mean of 18. 544 00:32:07,750 --> 00:32:09,540 It has a binomial distribution. 545 00:32:09,540 --> 00:32:14,050 It's described by a PMF that has a shape roughly like this 546 00:32:14,050 --> 00:32:16,690 and which keeps going on. 547 00:32:16,690 --> 00:32:20,960 Using the central limit theorem is basically 548 00:32:20,960 --> 00:32:26,650 pretending that Sn is normal with the 549 00:32:26,650 --> 00:32:28,650 right mean and variance. 550 00:32:28,650 --> 00:32:35,200 So we approximate Zn, which has 0 mean and unit variance, 551 00:32:35,200 --> 00:32:38,850 with Z, which also has 0 mean and unit variance. 552 00:32:38,850 --> 00:32:42,190 If you were to pretend that Sn is normal, you would 553 00:32:42,190 --> 00:32:45,407 approximate it with a normal that has the correct mean and 554 00:32:45,407 --> 00:32:46,250 correct variance.
555 00:32:46,250 --> 00:32:49,390 So it would still be centered at 18. 556 00:32:49,390 --> 00:32:53,800 And it would have the same variance as the binomial PMF. 557 00:32:53,800 --> 00:32:57,350 So using the central limit theorem essentially means that 558 00:32:57,350 --> 00:33:00,420 we keep the mean and the variance what they are but we 559 00:33:00,420 --> 00:33:03,960 pretend that our distribution is normal. 560 00:33:03,960 --> 00:33:06,780 We want to calculate the probability that Sn is less 561 00:33:06,780 --> 00:33:09,590 than or equal to 21. 562 00:33:09,590 --> 00:33:14,310 I pretend that my random variable is normal, so I draw 563 00:33:14,310 --> 00:33:18,680 a line here and I calculate the area under the normal 564 00:33:18,680 --> 00:33:22,000 curve going up to 21. 565 00:33:22,000 --> 00:33:23,500 That's essentially what we did. 566 00:33:23,500 --> 00:33:26,260 567 00:33:26,260 --> 00:33:29,730 Now, a smart person comes around and says, Sn is a 568 00:33:29,730 --> 00:33:31,360 discrete random variable. 569 00:33:31,360 --> 00:33:34,750 So the event that Sn is less than or equal to 21 is the 570 00:33:34,750 --> 00:33:38,480 same as Sn being strictly less than 22 because nothing in 571 00:33:38,480 --> 00:33:41,240 between can happen. 572 00:33:41,240 --> 00:33:43,700 So I'm going to use the central limit theorem 573 00:33:43,700 --> 00:33:48,290 approximation by pretending again that Sn is normal and 574 00:33:48,290 --> 00:33:51,650 finding the probability of this event while pretending 575 00:33:51,650 --> 00:33:53,720 that Sn is normal. 576 00:33:53,720 --> 00:33:57,870 So what this person would do would be to draw a line here, 577 00:33:57,870 --> 00:34:02,780 at 22, and calculate the area under the normal curve 578 00:34:02,780 --> 00:34:05,490 all the way to 22. 579 00:34:05,490 --> 00:34:06,700 Who is right? 580 00:34:06,700 --> 00:34:08,820 Which one is better? 
581 00:34:08,820 --> 00:34:15,639 Well, neither, but we can do better than both if we sort of 582 00:34:15,639 --> 00:34:17,949 split the difference. 583 00:34:17,949 --> 00:34:21,969 So another way of writing the same event for Sn is to write 584 00:34:21,969 --> 00:34:25,940 it as Sn being less than 21.5. 585 00:34:25,940 --> 00:34:29,570 In terms of the discrete random variable Sn, all three 586 00:34:29,570 --> 00:34:32,239 of these are exactly the same event. 587 00:34:32,239 --> 00:34:35,090 But when you do the continuous approximation, they give you 588 00:34:35,090 --> 00:34:36,250 different probabilities. 589 00:34:36,250 --> 00:34:39,760 It's a matter of whether you integrate the area under the 590 00:34:39,760 --> 00:34:46,159 normal curve up to here, up to the midway point, or up to 22. 591 00:34:46,159 --> 00:34:50,659 It turns out that integrating up to the midpoint is what 592 00:34:50,659 --> 00:34:54,469 gives us the better numerical results. 593 00:34:54,469 --> 00:34:59,170 So we take here 21 and 1/2, and we integrate the area 594 00:34:59,170 --> 00:35:01,170 under the normal curve up to here. 595 00:35:01,170 --> 00:35:14,100 596 00:35:14,100 --> 00:35:18,560 So let's do this calculation and see what we get. 597 00:35:18,560 --> 00:35:21,330 What would we change here? 598 00:35:21,330 --> 00:35:27,730 Instead of 21, we would now write 21 and 1/2. 599 00:35:27,730 --> 00:35:32,810 This 18 becomes, no, that 18 stays what it is. 600 00:35:32,810 --> 00:35:36,890 But this 21 becomes 21 and 1/2. 601 00:35:36,890 --> 00:35:44,790 And so this one becomes 1 plus 0.5 over 3. 602 00:35:44,790 --> 00:35:48,210 This is 1.17. 603 00:35:48,210 --> 00:35:51,980 So we now look up into the normal tables and ask for the 604 00:35:51,980 --> 00:36:00,000 probability that Z is less than 1.17. 605 00:36:00,000 --> 00:36:06,070 So this here gets approximated by the probability that the 606 00:36:06,070 --> 00:36:09,240 standard normal is less than 1.17.
607 00:36:09,240 --> 00:36:15,960 And the normal tables will tell us this is 0.879. 608 00:36:15,960 --> 00:36:23,550 Going back to the previous slide, what we got this time 609 00:36:23,550 --> 00:36:30,310 with this improved approximation is 0.879. 610 00:36:30,310 --> 00:36:33,730 This is a really good approximation 611 00:36:33,730 --> 00:36:35,730 of the correct number. 612 00:36:35,730 --> 00:36:39,160 This is what we got using the 21. 613 00:36:39,160 --> 00:36:42,360 This is what we get using the 21 and 1/2. 614 00:36:42,360 --> 00:36:45,940 And it's an approximation that's sort of right on-- a 615 00:36:45,940 --> 00:36:48,350 very good one. 616 00:36:48,350 --> 00:36:54,120 The moral from this numerical example is that doing this 617 00:36:54,120 --> 00:37:00,933 1/2 correction does give us better approximations. 618 00:37:00,933 --> 00:37:06,070 619 00:37:06,070 --> 00:37:12,010 In fact, we can use this 1/2 idea to even calculate 620 00:37:12,010 --> 00:37:14,340 individual probabilities. 621 00:37:14,340 --> 00:37:17,130 So suppose you want to approximate the probability 622 00:37:17,130 --> 00:37:21,010 that Sn is equal to 19. 623 00:37:21,010 --> 00:37:25,620 If you were to pretend that Sn is normal and calculate this 624 00:37:25,620 --> 00:37:28,470 probability, the probability that the normal random 625 00:37:28,470 --> 00:37:31,670 variable is equal to 19 is 0. 626 00:37:31,670 --> 00:37:34,150 So you don't get an interesting answer. 627 00:37:34,150 --> 00:37:37,610 You get a more interesting answer by writing this event, 628 00:37:37,610 --> 00:37:41,460 19 as being the same as the event of falling between 18 629 00:37:41,460 --> 00:37:45,910 and 1/2 and 19 and 1/2 and using the normal approximation 630 00:37:45,910 --> 00:37:48,230 to calculate this probability. 631 00:37:48,230 --> 00:37:51,890 In terms of our previous picture, this corresponds to 632 00:37:51,890 --> 00:37:53,140 the following.
633 00:37:53,140 --> 00:37:59,400 634 00:37:59,400 --> 00:38:04,650 We are interested in the probability that 635 00:38:04,650 --> 00:38:07,130 Sn is equal to 19. 636 00:38:07,130 --> 00:38:11,230 So we're interested in the height of this bar. 637 00:38:11,230 --> 00:38:15,720 We're going to consider the area under the normal curve 638 00:38:15,720 --> 00:38:21,500 going from here to here, and use this area as an 639 00:38:21,500 --> 00:38:25,110 approximation for the height of that particular bar. 640 00:38:25,110 --> 00:38:30,670 So what we're basically doing is, we take the probability 641 00:38:30,670 --> 00:38:33,830 under the normal curve that's assigned over a continuum of 642 00:38:33,830 --> 00:38:38,280 values and attribute it to the different discrete values. 643 00:38:38,280 --> 00:38:43,510 Whatever is above the midpoint gets attributed to 19. 644 00:38:43,510 --> 00:38:45,640 Whatever is below that midpoint gets 645 00:38:45,640 --> 00:38:47,250 attributed to 18. 646 00:38:47,250 --> 00:38:54,280 So this green area is our approximation of the value of 647 00:38:54,280 --> 00:38:56,500 the PMF at 19. 648 00:38:56,500 --> 00:39:00,740 So similarly, if you wanted to approximate the value of the 649 00:39:00,740 --> 00:39:04,440 PMF at this point, you would take this interval and 650 00:39:04,440 --> 00:39:06,580 integrate the area under the normal 651 00:39:06,580 --> 00:39:09,350 curve over that interval. 652 00:39:09,350 --> 00:39:13,410 It turns out that this gives a very good approximation of the 653 00:39:13,410 --> 00:39:15,660 PMF of the binomial. 654 00:39:15,660 --> 00:39:22,580 And actually, this was the context in which the central 655 00:39:22,580 --> 00:39:26,310 limit theorem was proved in the first place, when this 656 00:39:26,310 --> 00:39:27,990 business started. 657 00:39:27,990 --> 00:39:33,060 So this business goes back a few hundred years.
658 00:39:33,060 --> 00:39:35,700 And the central limit theorem was first proved by 659 00:39:35,700 --> 00:39:39,420 considering the PMF of a binomial random variable when 660 00:39:39,420 --> 00:39:41,840 p is equal to 1/2. 661 00:39:41,840 --> 00:39:45,590 People did the algebra, and they found out that the exact 662 00:39:45,590 --> 00:39:49,700 expression for the PMF is quite well approximated by 663 00:39:49,700 --> 00:39:51,980 the expression that you would get from a normal 664 00:39:51,980 --> 00:39:53,380 distribution. 665 00:39:53,380 --> 00:39:57,510 Then the proof was extended to binomials for more general 666 00:39:57,510 --> 00:39:59,690 values of p. 667 00:39:59,690 --> 00:40:04,220 So here we talk about this as a refinement of the general 668 00:40:04,220 --> 00:40:07,480 central limit theorem, but, historically, that refinement 669 00:40:07,480 --> 00:40:09,830 was where the whole business got started 670 00:40:09,830 --> 00:40:11,820 in the first place. 671 00:40:11,820 --> 00:40:18,700 All right, so let's go through the mechanics of approximating 672 00:40:18,700 --> 00:40:21,970 the probability that Sn is equal to 19-- 673 00:40:21,970 --> 00:40:23,810 exactly 19. 674 00:40:23,810 --> 00:40:27,340 As we said, we're going to write this event as an event 675 00:40:27,340 --> 00:40:31,040 that covers an interval of unit length from 18 and 1/2 to 676 00:40:31,040 --> 00:40:31,970 19 and 1/2. 677 00:40:31,970 --> 00:40:33,730 This is the event of interest. 678 00:40:33,730 --> 00:40:37,070 First step is to massage the event of interest so that it 679 00:40:37,070 --> 00:40:40,010 involves our Zn random variable. 680 00:40:40,010 --> 00:40:43,290 So subtract 18 from all sides. 681 00:40:43,290 --> 00:40:46,860 Divide by the standard deviation of 3 from all sides. 682 00:40:46,860 --> 00:40:50,850 That's the equivalent representation of the event. 683 00:40:50,850 --> 00:40:54,200 This is our standardized random variable Zn.
684 00:40:54,200 --> 00:40:56,950 These are just these numbers. 685 00:40:56,950 --> 00:41:00,530 And to do an approximation, we want to find the probability 686 00:41:00,530 --> 00:41:04,380 of this event, but Zn is approximately normal, so we 687 00:41:04,380 --> 00:41:08,030 plug in here the Z, which is the standard normal. 688 00:41:08,030 --> 00:41:10,150 So we want to find the probability that the standard 689 00:41:10,150 --> 00:41:12,890 normal falls inside this interval. 690 00:41:12,890 --> 00:41:15,630 You find these using CDFs because this is the 691 00:41:15,630 --> 00:41:18,760 probability that you're less than this but 692 00:41:18,760 --> 00:41:22,370 not less than that. 693 00:41:22,370 --> 00:41:25,370 So it's a difference between two cumulative probabilities. 694 00:41:25,370 --> 00:41:27,400 Then, you look up your normal tables. 695 00:41:27,400 --> 00:41:30,560 You find two numbers for these quantities, and, finally, you 696 00:41:30,560 --> 00:41:35,140 get a numerical answer for an individual entry of the PMF of 697 00:41:35,140 --> 00:41:36,480 the binomial. 698 00:41:36,480 --> 00:41:39,350 This is a pretty good approximation, it turns out. 699 00:41:39,350 --> 00:41:42,910 If you were to do the calculations using the exact 700 00:41:42,910 --> 00:41:47,130 formula, you would get something 701 00:41:47,130 --> 00:41:49,360 which is pretty close-- 702 00:41:49,360 --> 00:41:52,800 an error in the third digit-- 703 00:41:52,800 --> 00:41:56,980 this is pretty good. 704 00:41:56,980 --> 00:41:59,650 So I guess what we did here with our discussion of the 705 00:41:59,650 --> 00:42:04,560 binomial slightly contradicts what I said before-- 706 00:42:04,560 --> 00:42:07,330 that the central limit theorem is a statement about 707 00:42:07,330 --> 00:42:09,240 cumulative distribution functions. 708 00:42:09,240 --> 00:42:13,240 In general, it doesn't tell you what to do to approximate 709 00:42:13,240 --> 00:42:15,270 PMFs themselves. 
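[Editor's note: the mechanics above can be checked with a short sketch; the `phi` helper via `math.erf` is an editorial stand-in for the normal tables.]

```python
from math import comb, erf, sqrt

def phi(x):
    """Standard normal CDF, approximating a normal-table lookup."""
    return 0.5 * (1.0 + erf(x / sqrt(2.0)))

# P(Sn = 19) ~ P(18.5 <= Sn <= 19.5), standardized with mean 18, sd 3.
approx = phi((19.5 - 18) / 3) - phi((18.5 - 18) / 3)
print(round(approx, 4))            # about 0.1253

# The exact binomial PMF entry, for comparison.
exact = comb(36, 19) * 0.5**36
print(round(exact, 4))             # about 0.1251
```

The two agree to roughly three digits, matching the "error in the third digit" mentioned in the lecture.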
710 00:42:15,270 --> 00:42:17,440 And that's indeed the case in general. 711 00:42:17,440 --> 00:42:20,220 On the other hand, for the special case of a binomial 712 00:42:20,220 --> 00:42:23,610 distribution, the central limit theorem approximation, 713 00:42:23,610 --> 00:42:28,200 with this 1/2 correction, is a very good approximation even 714 00:42:28,200 --> 00:42:29,560 for the individual PMF. 715 00:42:29,560 --> 00:42:33,290 716 00:42:33,290 --> 00:42:40,210 All right, so we spent quite a bit of time on mechanics. 717 00:42:40,210 --> 00:42:46,050 So let's spend the last few minutes today thinking a bit 718 00:42:46,050 --> 00:42:47,930 and look at a small puzzle. 719 00:42:47,930 --> 00:42:51,390 720 00:42:51,390 --> 00:42:54,240 So the puzzle is the following. 721 00:42:54,240 --> 00:43:02,460 Consider a Poisson process that runs over a unit interval. 722 00:43:02,460 --> 00:43:07,770 And the arrival rate is equal to 1. 723 00:43:07,770 --> 00:43:09,790 So this is the unit interval. 724 00:43:09,790 --> 00:43:12,720 And let X be the number of arrivals. 725 00:43:12,720 --> 00:43:15,430 726 00:43:15,430 --> 00:43:19,930 And this is Poisson, with mean 1. 727 00:43:19,930 --> 00:43:25,000 728 00:43:25,000 --> 00:43:28,160 Now, let me take this interval and divide it 729 00:43:28,160 --> 00:43:30,650 into n little pieces. 730 00:43:30,650 --> 00:43:34,270 So each piece has length 1/n. 731 00:43:34,270 --> 00:43:41,225 And let Xi be the number of arrivals during 732 00:43:41,225 --> 00:43:43,490 the i-th little interval. 733 00:43:43,490 --> 00:43:48,000 734 00:43:48,000 --> 00:43:51,630 OK, what do we know about the random variables Xi? 735 00:43:51,630 --> 00:43:55,260 They are themselves Poisson. 736 00:43:55,260 --> 00:43:58,490 It's the number of arrivals during a small interval.
737 00:43:58,490 --> 00:44:02,340 We also know that when n is big, so the length of the 738 00:44:02,340 --> 00:44:08,190 interval is small, these Xi's are approximately Bernoulli, 739 00:44:08,190 --> 00:44:11,730 with mean 1/n. 740 00:44:11,730 --> 00:44:13,970 I guess it doesn't matter whether we model them as 741 00:44:13,970 --> 00:44:15,720 Bernoulli or not. 742 00:44:15,720 --> 00:44:19,660 What matters is that the Xi's are independent. 743 00:44:19,660 --> 00:44:20,970 Why are they independent? 744 00:44:20,970 --> 00:44:24,410 Because, in a Poisson process, disjoint intervals are 745 00:44:24,410 --> 00:44:26,770 independent of each other. 746 00:44:26,770 --> 00:44:28,955 So the Xi's are independent. 747 00:44:28,955 --> 00:44:31,840 748 00:44:31,840 --> 00:44:35,570 And they also have the same distribution. 749 00:44:35,570 --> 00:44:40,360 And we have that X, the total number of arrivals, is the sum 750 00:44:40,360 --> 00:44:41,610 of the Xi's. 751 00:44:41,610 --> 00:44:44,470 752 00:44:44,470 --> 00:44:49,510 So the central limit theorem tells us that, approximately, 753 00:44:49,510 --> 00:44:53,670 the sum of independent, identically distributed random 754 00:44:53,670 --> 00:44:57,720 variables, when we have lots of these random variables, 755 00:44:57,720 --> 00:45:01,530 behaves like a normal random variable. 756 00:45:01,530 --> 00:45:07,475 So by using this decomposition of X into a sum of i.i.d. 757 00:45:07,475 --> 00:45:11,540 random variables, and by using values of n that are bigger 758 00:45:11,540 --> 00:45:16,540 and bigger, by taking the limit, it should follow that X 759 00:45:16,540 --> 00:45:19,510 has a normal distribution. 760 00:45:19,510 --> 00:45:22,120 On the other hand, we know that X has a Poisson 761 00:45:22,120 --> 00:45:23,370 distribution. 762 00:45:23,370 --> 00:45:25,270 763 00:45:25,270 --> 00:45:32,640 So something must be wrong in this argument here.
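[Editor's note: it is easy to check numerically that the sum in this puzzle really does stay Poisson. For instance, P(X = 0) for the sum of n pieces, each with success probability 1/n, settles at e^-1 no matter how large n gets. This quick sketch is an editorial addition, not from the lecture.]

```python
from math import exp

# Split the unit interval into n slots, each with arrival
# probability roughly 1/n.  P(X = 0) for the sum of the slots:
for n in (10, 100, 1000):
    print(n, round((1 - 1 / n) ** n, 4))   # approaches e**-1 as n grows

print(round(exp(-1), 4))                   # 0.3679, the Poisson(1) value of P(X = 0)
```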
764 00:45:32,640 --> 00:45:34,900 Can we really use the central limit 765 00:45:34,900 --> 00:45:38,330 theorem in this situation? 766 00:45:38,330 --> 00:45:41,300 So what do we need for the central limit theorem? 767 00:45:41,300 --> 00:45:44,160 We need to have independent, identically 768 00:45:44,160 --> 00:45:46,700 distributed random variables. 769 00:45:46,700 --> 00:45:49,060 We have it here. 770 00:45:49,060 --> 00:45:53,410 We want them to have a finite mean and finite variance. 771 00:45:53,410 --> 00:45:57,610 We also have it here; means and variances are finite. 772 00:45:57,610 --> 00:46:02,050 What is another assumption that was never made explicit, 773 00:46:02,050 --> 00:46:04,080 but essentially was there? 774 00:46:04,080 --> 00:46:07,680 775 00:46:07,680 --> 00:46:13,260 Or in other words, what is the flaw in this argument that 776 00:46:13,260 --> 00:46:15,520 uses the central limit theorem here? 777 00:46:15,520 --> 00:46:16,770 Any thoughts? 778 00:46:16,770 --> 00:46:24,110 779 00:46:24,110 --> 00:46:29,640 So in the central limit theorem, we said, consider-- 780 00:46:29,640 --> 00:46:34,820 fix a probability distribution, and let the Xi's 781 00:46:34,820 --> 00:46:38,280 be distributed according to that probability distribution, 782 00:46:38,280 --> 00:46:42,935 and add a larger and larger number of Xi's. 783 00:46:42,935 --> 00:46:47,410 But the underlying, unstated assumption is that we fix the 784 00:46:47,410 --> 00:46:49,490 distribution of the Xi's. 785 00:46:49,490 --> 00:46:52,810 As we let n increase, the statistics of 786 00:46:52,810 --> 00:46:55,930 each Xi do not change. 787 00:46:55,930 --> 00:46:59,010 Whereas here, I'm playing a trick on you. 788 00:46:59,010 --> 00:47:03,700 As I'm taking more and more random variables, I'm actually 789 00:47:03,700 --> 00:47:07,850 changing what those random variables are.
790 00:47:07,850 --> 00:47:12,960 When I take a larger n, the Xi's are random variables with 791 00:47:12,960 --> 00:47:15,720 a different mean and different variance. 792 00:47:15,720 --> 00:47:19,800 So I'm adding more of these, but at the same time, in this 793 00:47:19,800 --> 00:47:23,420 example, I'm changing their distributions. 794 00:47:23,420 --> 00:47:26,380 That's something that doesn't fit the setting of the central 795 00:47:26,380 --> 00:47:27,000 limit theorem. 796 00:47:27,000 --> 00:47:29,910 In the central limit theorem, you first fix the distribution 797 00:47:29,910 --> 00:47:31,200 of the X's. 798 00:47:31,200 --> 00:47:35,290 You keep it fixed, and then you consider adding more and 799 00:47:35,290 --> 00:47:38,950 more according to that particular fixed distribution. 800 00:47:38,950 --> 00:47:40,020 So that's the catch. 801 00:47:40,020 --> 00:47:42,240 That's why the central limit theorem does not 802 00:47:42,240 --> 00:47:43,970 apply to this situation. 803 00:47:43,970 --> 00:47:46,230 And we're lucky that it doesn't apply because, 804 00:47:46,230 --> 00:47:50,220 otherwise, we would have a huge contradiction destroying 805 00:47:50,220 --> 00:47:52,770 probability theory. 806 00:47:52,770 --> 00:48:02,240 OK, but now that still leaves us with a 807 00:48:02,240 --> 00:48:05,040 little bit of a dilemma. 808 00:48:05,040 --> 00:48:08,510 Suppose that, here, essentially we're adding 809 00:48:08,510 --> 00:48:12,815 independent Bernoulli random variables. 810 00:48:12,815 --> 00:48:22,650 811 00:48:22,650 --> 00:48:25,300 So the issue is that the central limit theorem has to 812 00:48:25,300 --> 00:48:28,920 do with asymptotics as n goes to infinity.
813 00:48:28,920 --> 00:48:34,260 And if we consider a binomial, and somebody gives us specific 814 00:48:34,260 --> 00:48:38,870 numbers about the parameters of that binomial, it might not 815 00:48:38,870 --> 00:48:40,830 necessarily be obvious what kind of 816 00:48:40,830 --> 00:48:42,790 approximation to use. 817 00:48:42,790 --> 00:48:45,660 In particular, we do have two different approximations for 818 00:48:45,660 --> 00:48:47,100 the binomial. 819 00:48:47,100 --> 00:48:51,610 If we fix p, then the binomial is the sum of Bernoulli's that 820 00:48:51,610 --> 00:48:54,930 come from a fixed distribution, and we consider more 821 00:48:54,930 --> 00:48:56,450 and more of these. 822 00:48:56,450 --> 00:48:58,990 When we add them, the central limit theorem tells us that we 823 00:48:58,990 --> 00:49:01,190 get the normal distribution. 824 00:49:01,190 --> 00:49:04,430 There's another sort of limit, which has the flavor of this 825 00:49:04,430 --> 00:49:10,770 example, in which we still deal with a binomial, a sum of n 826 00:49:10,770 --> 00:49:11,170 Bernoulli's. 827 00:49:11,170 --> 00:49:14,310 We let that sum, the number of the 828 00:49:14,310 --> 00:49:16,090 Bernoulli's, go to infinity. 829 00:49:16,090 --> 00:49:18,890 But each Bernoulli has a probability of success that 830 00:49:18,890 --> 00:49:23,830 goes to 0, and we do this in a way so that np, the expected 831 00:49:23,830 --> 00:49:27,090 number of successes, stays finite. 832 00:49:27,090 --> 00:49:30,660 This is the situation that we dealt with when we first 833 00:49:30,660 --> 00:49:32,960 defined our Poisson process. 834 00:49:32,960 --> 00:49:37,540 We have a very, very large number, so lots of time slots, 835 00:49:37,540 --> 00:49:40,920 but during each time slot, there's a tiny probability of 836 00:49:40,920 --> 00:49:42,950 obtaining an arrival.
837 00:49:42,950 --> 00:49:48,460 Under that setting, in discrete time, we have a 838 00:49:48,460 --> 00:49:51,670 binomial distribution, or Bernoulli process, but when we 839 00:49:51,670 --> 00:49:54,530 take the limit, we obtain the Poisson process and the 840 00:49:54,530 --> 00:49:56,470 Poisson approximation. 841 00:49:56,470 --> 00:49:58,510 So these are two equally valid 842 00:49:58,510 --> 00:50:00,550 approximations of the binomial. 843 00:50:00,550 --> 00:50:03,300 But they're valid in different asymptotic regimes. 844 00:50:03,300 --> 00:50:06,180 In one regime, we fixed p, let n go to infinity. 845 00:50:06,180 --> 00:50:09,360 In the other regime, we let both n and p change 846 00:50:09,360 --> 00:50:11,540 simultaneously. 847 00:50:11,540 --> 00:50:14,240 Now, in real life, you're never dealing with the 848 00:50:14,240 --> 00:50:15,290 limiting situations. 849 00:50:15,290 --> 00:50:17,870 You're dealing with actual numbers. 850 00:50:17,870 --> 00:50:21,820 So if somebody tells you that the numbers are like this, 851 00:50:21,820 --> 00:50:25,160 then you should probably say that this is the situation 852 00:50:25,160 --> 00:50:27,380 that fits the Poisson description-- 853 00:50:27,380 --> 00:50:30,180 large number of slots with each slot having a tiny 854 00:50:30,180 --> 00:50:32,460 probability of success. 855 00:50:32,460 --> 00:50:36,890 On the other hand, if p is something like this, and n is 856 00:50:36,890 --> 00:50:40,460 500, then you expect to get the distribution for the 857 00:50:40,460 --> 00:50:41,680 number of successes. 858 00:50:41,680 --> 00:50:45,740 It's going to have a mean of 50 and have a fair amount 859 00:50:45,740 --> 00:50:47,280 of spread around there. 860 00:50:47,280 --> 00:50:50,150 It turns out that the normal approximation would be better 861 00:50:50,150 --> 00:50:51,500 in this context.
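[Editor's note: the two regimes can be compared concretely with a short sketch. The numbers below, n = 500 with a tiny p, and n = 500 with p = 0.1, are illustrative stand-ins for the values on the slide, which the transcript does not show; the `phi` helper via `math.erf` replaces the normal tables.]

```python
from math import comb, erf, exp, factorial, sqrt

def phi(x):
    """Standard normal CDF, approximating a normal-table lookup."""
    return 0.5 * (1.0 + erf(x / sqrt(2.0)))

def compare(n, p, k):
    """Exact binomial P(Sn = k) next to its Poisson and normal approximations."""
    exact = comb(n, k) * p**k * (1 - p)**(n - k)
    lam = n * p
    poisson = exp(-lam) * lam**k / factorial(k)
    mu, sigma = n * p, sqrt(n * p * (1 - p))
    normal = phi((k + 0.5 - mu) / sigma) - phi((k - 0.5 - mu) / sigma)
    return exact, poisson, normal

# Tiny p with np = 1: the Poisson approximation is the closer one.
print(compare(500, 0.002, 1))

# p = 0.1 with np = 50: the normal approximation is the closer one.
print(compare(500, 0.1, 50))
```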
862 00:50:51,500 --> 00:50:57,120 As a rule of thumb, if n times p is bigger than 10 or 20, you 863 00:50:57,120 --> 00:50:59,320 can start using the normal approximation. 864 00:50:59,320 --> 00:51:04,310 If n times p is a small number, then you prefer to use 865 00:51:04,310 --> 00:51:06,090 the Poisson approximation. 866 00:51:06,090 --> 00:51:08,840 But there are no hard theorems or rules about 867 00:51:08,840 --> 00:51:11,650 how to go about this. 868 00:51:11,650 --> 00:51:15,440 OK, so from next time we're going to switch gears again. 869 00:51:15,440 --> 00:51:17,830 And we're going to put together everything we learned 870 00:51:17,830 --> 00:51:20,620 in this class to start solving inference problems. 871 00:51:20,620 --> 00:51:22,050