The following content is provided under a Creative Commons license. Your support will help MIT OpenCourseWare continue to offer high-quality educational resources for free. To make a donation or view additional materials from hundreds of MIT courses, visit MIT OpenCourseWare at ocw.mit.edu.

PROFESSOR: We're going to finish today our discussion of limit theorems. I'm going to remind you what the central limit theorem is, which we introduced briefly last time. We're going to discuss what exactly it says and its implications, and then we're going to apply it to a couple of examples, mostly on the binomial distribution.

OK, so the situation is that we are dealing with a large number of independent, identically distributed random variables, and we want to look at their sum and say something about the distribution of the sum. We might want to say that the sum is distributed approximately as a normal random variable, although, formally, this is not quite right. As n goes to infinity, the distribution of the sum becomes very spread out, and it doesn't converge to a limiting distribution.

In order to get an interesting limit, we first need to take the sum and standardize it. By standardizing it, what we mean is to subtract the mean and then divide by the standard deviation, so Zn = (Sn - n E[X]) / (sigma sqrt(n)). Now, the mean of the sum is, of course, n times the expected value of each one of the X's. And the standard deviation is the square root of the variance. The variance of the sum is n times sigma squared, where sigma is the standard deviation of the X's, so the standard deviation of the sum is sigma times the square root of n. After we do this, we obtain a random variable that has 0 mean -- it's centered -- and variance equal to 1. And the variance stays the same, no matter how large n is going to be. So the distribution of Zn keeps changing with n, but it cannot change too much. It stays in place. The mean is 0, and the width remains roughly the same because the variance is 1.
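Here is a quick simulation sketch of that standardization step. The choice of exponential summands is arbitrary and purely illustrative; the point is that the mean and variance of Zn stay pinned at 0 and 1 no matter what n is.

```python
import numpy as np

rng = np.random.default_rng(0)

# Standardize the sum: Zn = (Sn - n*mu) / (sigma * sqrt(n)).
# Exponential(1) summands are an arbitrary illustrative choice.
mu, sigma = 1.0, 1.0                 # mean and std of Exponential(1)
for n in (5, 50, 500):
    S = rng.exponential(1.0, size=(10_000, n)).sum(axis=1)
    Z = (S - n * mu) / (sigma * np.sqrt(n))
    print(n, round(Z.mean(), 3), round(Z.var(), 3))  # roughly 0 and 1 for every n
```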
The surprising thing is that, as n grows, the distribution of Zn settles into a certain asymptotic shape, and that's the shape of a standard normal random variable. Standard normal means that it has 0 mean and unit variance. More precisely, what the central limit theorem tells us is a relation between the cumulative distribution function of Zn and the cumulative distribution function of the standard normal. For any given number c, the probability that Zn is less than or equal to c becomes, in the limit, the same as the probability that the standard normal is less than or equal to c. And of course, this is useful because these probabilities are available from the normal tables, whereas the distribution of Zn might be a very complicated expression if you were to calculate it exactly.

So, some comments about the central limit theorem. The first thing is that it's quite amazing that it's universal. It doesn't matter what the distribution of the X's is. It can be any distribution whatsoever, as long as it has finite mean and finite variance. And when you go and do your approximations using the central limit theorem, the only things that you need to know about the distribution of the X's are the mean and the variance. You need those in order to standardize Sn -- to subtract the mean and divide by the standard deviation, you need to know the mean and the variance. But these are the only things you need to know in order to apply it.

In addition, it's a very accurate computational shortcut. The distribution of Zn, in principle, you can calculate by convolving the distribution of the X's with itself many, many times. But this is tedious, and if you try to do it analytically, it might be a very complicated expression. Whereas by just appealing to the table for the standard normal random variable, things are done in a very quick way. So it's a nice computational shortcut if you don't need an exact answer to a probability problem.
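Here's a minimal Monte Carlo check of exactly that statement -- that P(Zn <= c) is close to the standard normal CDF at c. The uniform summands are an illustrative choice, with mu = 1/2 and sigma^2 = 1/12.

```python
import numpy as np
from statistics import NormalDist

rng = np.random.default_rng(0)

# Compare P(Zn <= c) with Phi(c) for Uniform(0,1) summands.
mu, sigma = 0.5, (1 / 12) ** 0.5
n, c = 30, 1.0
S = rng.uniform(size=(200_000, n)).sum(axis=1)
Z = (S - n * mu) / (sigma * np.sqrt(n))
print((Z <= c).mean())         # Monte Carlo estimate of P(Zn <= c)
print(NormalDist().cdf(c))     # Phi(1) = 0.8413...
```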
Now, at a more philosophical level, it justifies why we are really interested in normal random variables. Whenever you have a phenomenon which is noisy, and the noise that you observe is created by adding lots of little pieces of randomness that are independent of each other, the overall effect that you're going to observe can be described by a normal random variable.

So in a classic example that goes 100 years back or so, suppose that you have a fluid, and inside that fluid there's a little particle of dust or whatever that's suspended in there. That little particle gets hit by molecules completely at random, and so what you're going to see is that particle moving randomly inside that liquid. Now, for that random motion, if you ask, after one second, how much is my particle displaced -- let's say along the x direction -- that displacement is very, very well modeled by a normal random variable. And the reason is that the position of that particle is decided by the cumulative effect of lots of random hits by molecules. So that's a celebrated physical model that goes under the name of Brownian motion.

And it's the same model that some people use to describe movements in the financial markets. The argument might go that the movement of prices has to do with lots of little decisions and lots of little events by many, many different actors that are involved in the market. So the distribution of stock prices might be well described by normal random variables. At least that's what people wanted to believe until somewhat recently. Now, the evidence is that, actually, these distributions are a little more heavy-tailed, in the sense that extreme events are a little more likely to occur than what normal random variables would seem to indicate. But as a first model, again, it could be a plausible argument to have, at least as a starting model, one that involves normal random variables. So this is the philosophical side of things.
On the more accurate, mathematical side, it's important to appreciate exactly what kind of statement the central limit theorem is. It's a statement about the convergence of the CDF of these standardized random variables to the CDF of a normal. So it's a statement about convergence of CDFs. It's not a statement about convergence of PMFs, or convergence of PDFs. Now, if one makes additional mathematical assumptions, there are variations of the central limit theorem that talk about PDFs and PMFs. But in general, that's not necessarily the case.

And I'm going to illustrate this with a plot which is not in your slides, but just to make the point. Consider two different discrete distributions. The first discrete distribution takes the values 1, 4, 7. The second can take the values 1, 2, 4, 6, and 7. So the first one has a sort of periodicity of 3; for the second one, the range of values is a little more interesting. The numbers in these two distributions are cooked up so that they have the same mean and the same variance.

Now, what I'm going to do is to take eight independent copies of the random variable and plot the PMF of the sum of the eight random variables. If I plot the PMF of the sum of 8 of these, I get the plot which corresponds to the bullets in this diagram. If I take 8 random variables according to the second distribution, add them up, and compute their PMF, the PMF I get is the one denoted here by the X's. The two PMFs look really different, at least when you eyeball them. On the other hand, if you were to plot their CDFs and compare them with the normal CDF, which is this continuous curve -- the CDF, of course, goes up in steps, because we're looking at discrete random variables -- each of them is very close to the normal CDF. And if, instead of n equal to 8, we were to take 16, then the agreement would be even better. So in terms of CDFs, when we add 8 or 16 of these, we get very close to the normal CDF, and we would get essentially the same picture whichever of the two distributions we started from.
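Here's a sketch of that experiment in code. The first PMF is uniform on {1, 4, 7}, which has mean 4 and variance 6; the weights on {1, 2, 4, 6, 7} are one possible choice matching those moments -- the exact weights used on the plot aren't given, so these are illustrative.

```python
import numpy as np
from statistics import NormalDist

# Two PMFs on {0,...,7} with the same mean (4) and variance (6).
pmf_a = np.zeros(8); pmf_a[[1, 4, 7]] = 1 / 3
pmf_b = np.zeros(8); pmf_b[[1, 7]] = 0.25; pmf_b[[2, 6]] = 0.1875; pmf_b[4] = 0.125

def pmf_of_sum(pmf, n):
    """PMF of the sum of n i.i.d. copies, by repeated convolution."""
    out = pmf
    for _ in range(n - 1):
        out = np.convolve(out, pmf)
    return out

n = 8
sum_a, sum_b = pmf_of_sum(pmf_a, n), pmf_of_sum(pmf_b, n)
normal = NormalDist(n * 4, (n * 6) ** 0.5)       # matching mean and variance
for k in range(28, 37):                          # a few values around the mean 32
    print(k,
          round(sum_a[: k + 1].sum(), 3),        # CDF of the first sum
          round(sum_b[: k + 1].sum(), 3),        # CDF of the second sum
          round(normal.cdf(k + 0.5), 3))         # normal CDF, midpoint cutoff
```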
So the CDFs sit, essentially, on top of each other, although the two PMFs look quite different. This is to appreciate that, formally speaking, we only have a statement about CDFs, not about PMFs.

Now, in practice, how do you use the central limit theorem? Well, it tells us that we can calculate probabilities by treating Zn as if it were a standard normal random variable. Now, Zn is a linear function of Sn; conversely, Sn is a linear function of Zn. Linear functions of normals are normal. So if I pretend that Zn is normal, it's essentially the same as pretending that Sn is normal. And so we can calculate probabilities that have to do with Sn as if Sn were normal. Now, the central limit theorem does not tell us that Sn is approximately normal -- the formal statement is about Zn -- but, practically speaking, when you use the result, you can just pretend that Sn is normal.

Finally, it's a limit theorem, so it tells us about what happens when n goes to infinity. If we are to use it in practice, of course, n is not going to be infinity. Maybe n is equal to 15. Can we use a limit theorem when n is a number as small as 15? Well, it turns out that it's a very good approximation. Even for quite small values of n, it gives us very accurate answers. So n on the order of 15, or 20, or so gives us very good results in practice. There are no good theorems that will give us hard guarantees, because the quality of the approximation does depend on the details of the distribution of the X's. If the X's have a distribution that, from the outset, looks a little bit like the normal, then for small values of n you are going to see, essentially, a normal distribution for the sum. If the distribution of the X's is very different from the normal, it's going to take a larger value of n for the central limit theorem to take effect.

So let's illustrate this with a few representative plots. Here, we're starting with a discrete uniform distribution that goes from 1 to 8.
Let's add 2 of these random variables -- 2 random variables with this PMF -- and find the PMF of the sum. This is a convolution of 2 discrete uniforms, and I believe you have seen this exercise before. When you convolve this with itself, you get a triangle. So this is the PMF for the sum of two discrete uniforms. Now let's continue: let's convolve this with itself. This is going to give us the PMF of a sum of 4 discrete uniforms. And we get this, which starts looking like a normal. If we go to n equal to 32, then it looks, essentially, exactly like a normal, and it's an excellent approximation. So this is the PMF of the sum of 32 discrete random variables with this uniform distribution.

Now, this distribution is symmetric around the mean. If we start with a PMF which is non-symmetric -- this one here is a truncated geometric PMF -- then things do not work out as nicely when I add 8 of these. That is, if I convolve this with itself 8 times, I get this PMF, which maybe resembles the normal one a little bit. But you can really tell that it's different from the normal if you focus on the details here and there. Here it rises sharply; here it tails off a bit more slowly. So there's an asymmetry present, which is a consequence of the asymmetry of the distribution we started with. If we go to 16, it looks a little better, but still you can see the asymmetry between this tail and that tail. If you get to 32, there's still a little bit of asymmetry, but at least now it starts looking like a normal distribution.

So the moral from these plots is that it might vary a little bit what kind of values of n you need before you get a really good approximation. But for values of n in the range of 20 to 30 or so, usually you expect to get a pretty good approximation. At least that's what visual inspection of these graphs tells us.
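One way to quantify the asymmetry seen in these plots: the skewness of a sum of n i.i.d. random variables equals the skewness of one of them divided by the square root of n, so it fades, but only slowly. A small sketch, using an illustrative truncated geometric on {1, ..., 8} (the parameter 0.3 is an arbitrary choice, not the one on the slide):

```python
import numpy as np

# Truncated geometric PMF on {1,...,8}; the parameter 0.3 is illustrative.
p = 0.3
pmf = np.zeros(9)
pmf[1:] = (1 - p) ** np.arange(8) * p
pmf /= pmf.sum()

def skewness(pmf):
    """Skewness E[(X - mu)^3] / sigma^3 of a PMF on {0, 1, 2, ...}."""
    k = np.arange(len(pmf))
    mu = (k * pmf).sum()
    var = ((k - mu) ** 2 * pmf).sum()
    return ((k - mu) ** 3 * pmf).sum() / var ** 1.5

conv = pmf.copy()
print(1, round(skewness(pmf), 3))
for n in range(2, 33):
    conv = np.convolve(conv, pmf)           # PMF of the sum of n copies
    if n in (8, 16, 32):
        print(n, round(skewness(conv), 3))  # decays like 1/sqrt(n)
```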
So now that we know we have a good approximation in our hands, let's use it, by revisiting an example from last time: the polling problem. We're interested in the fraction f of the population that has a certain habit, and we try to find what f is. The way we do it is by polling people at random and recording the answers that they give -- whether they have the habit or not. So for each person, we get a Bernoulli random variable: with probability f, a person is going to respond 1, or yes; and with the remaining probability 1 minus f, the person responds no. We record Mn, which is how many people answered yes divided by the total number of people that we asked. That's the fraction inside our sample that answered yes.

As we discussed last time, you might start with some specs for the poll, and the specs have two parameters: the accuracy that you want, and the confidence that you want to have that you really did obtain the desired accuracy. So the spec here is that we want probability 95% that our estimate is within 1 percentage point of the true answer. So the event of interest -- the bad event -- is that the distance of the result of the poll, Mn, from the true answer f is bigger than 1 percentage point. And we're interested in calculating, or approximating, this particular probability using the central limit theorem.

One way of arranging the mechanics of this calculation is to take the event of interest and massage it, by subtracting and dividing things on both sides of this inequality, so that you bring into the picture the standardized random variable Zn, and then apply the central limit theorem. So the event of interest, written in full: Mn is the sum of the Xi's divided by n, and f is the same as nf divided by n. So this is the same as that event, and we're going to calculate the probability of this. But this is not exactly in the form in which we apply the central limit theorem.
To apply the central limit theorem, we need, down here in the denominator, to have sigma square root n. So how can I put sigma square root n here? I can divide both sides of this inequality by sigma, and then I can take a factor of square root n from here and send it to the other side. So this event is the same as that event -- this will happen if and only if that will happen -- so calculating the probability of this event is the same as calculating the probability that that event happens. And now we are in business, because the random variable that we have in here is Zn -- or rather, the absolute value of Zn -- and we're talking about the probability that the absolute value of Zn is bigger than a certain number. Since Zn is to be approximated by a standard normal random variable, our approximation is going to be: instead of asking for the absolute value of Zn to be bigger than this number, we will ask for the absolute value of Z to be bigger than this number. So this is the probability that we want to calculate, where Z is a standard normal random variable.

There's a small difficulty, the one that we also encountered last time. The difficulty is that the standard deviation sigma of the Xi's is not known. Sigma, in this example, is the square root of f times (1 minus f), and since f is unknown, the only thing that we know about sigma is that it's going to be a number at most 1/2.

OK, so we're going to have to use an inequality here. We're going to use a conservative value of sigma -- the value 1/2 -- instead of the exact value of sigma. And this gives us an inequality going this way. Let's just make sure why the inequality goes this way. We've got, on our axis, two numbers. One number is 0.01 square root n divided by sigma, and the other number is 0.02 square root n. And my claim is that the numbers are related to each other in this particular way. Why is this? Sigma is at most 1/2, so 1 over sigma is at least 2. And since 1 over sigma is at least 2, this means that this number sits to the right of that number.
So here we have the probability that Z is bigger in absolute value than this number. The probability of falling beyond the farther point out there is less than the probability of falling beyond the nearer point. So that's what that last inequality is saying -- this probability is smaller than that probability. This is the probability that we're interested in, but since we don't know sigma, we take the conservative value, and we use an upper bound in terms of the probability of being outside this smaller interval.

And now we are in business. We can start using our normal tables to calculate probabilities of interest. So, for example, let's say that we take n to be 10,000. How is the calculation going to go? We want to calculate the probability that the absolute value of Z is bigger than 0.02 times the square root of 10,000 -- that is, 0.02 times 100 -- which is the probability that the absolute value of Z is larger than or equal to 2.

And here let's do some mechanics, just to stay in shape. Since the normal is symmetric around its mean, the probability of being larger than or equal to 2 in absolute value is twice the probability that Z is larger than or equal to 2. Can we use the cumulative distribution function of Z to calculate this? Well, almost -- the CDF gives us probabilities of being less than something, not bigger than something. So we need one more step and write this as twice the quantity 1 minus the probability that Z is less than or equal to 2. This probability, now, you can read off from the normal tables, and the normal tables will tell you that it is 0.9772. And you do get an answer: the answer is 0.0456. OK, so we tried 10,000, and we find that our probability of error is about 4.5%, so we're doing better than the spec that we had.
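Here is that forward calculation as a couple of lines of Python -- a sketch using the standard library's NormalDist for the standard normal:

```python
from statistics import NormalDist

Z = NormalDist()               # standard normal, mean 0, std 1
n = 10_000
c = 0.02 * n ** 0.5            # conservative threshold: 0.02 * sqrt(n) = 2

p_error = 2 * (1 - Z.cdf(c))   # P(|Z| >= c), by symmetry of the normal
print(p_error)                 # about 0.0455 (the table's 0.9772 gives 0.0456)
```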
So this tells us that maybe we have some leeway. Maybe we can use a smaller sample size and still stay within our specs. Let's try to find how much we can push the envelope -- how much smaller can we take n?

To answer that question, we need to do this kind of calculation, essentially, going backwards. We're going to fix this probability to be 0.05 and work backwards to find what value of n we need. So we want to find n such that the probability that Z is bigger in absolute value than 0.02 square root n is 0.05. Z is a standard normal random variable, and we want the probability that we are outside this range -- the probability of those two tails together. Those two tails together should have probability 0.05. This means that this tail, by itself, should have probability 0.025, and this means that this cumulative probability should be 0.975.

Now, if this probability is to be 0.975, what should that number be? You go to the normal tables, and you find which entry corresponds to that number. I actually brought a normal table with me, and 0.975 is down here. It tells you that the number that corresponds to it is 1.96. So this tells us that 0.02 square root n should be equal to 1.96. And now, from here, you do the calculations, and you find that n is 9,604. So with a sample of 10,000, we got a probability of error of 4.5%. With a slightly smaller sample size of 9,604, we can get the probability of a mistake to be 0.05, which was exactly our spec.
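The same backwards calculation in code -- a sketch, with inv_cdf playing the role of the table lookup:

```python
from statistics import NormalDist

Z = NormalDist()

# Two-tail budget 0.05 -> one tail 0.025 -> Phi(z) = 0.975.
z = Z.inv_cdf(0.975)       # about 1.96, the table lookup
n = (z / 0.02) ** 2        # solve 0.02 * sqrt(n) = z for n
print(z, n)                # 1.9599..., about 9603.6 -> a sample of 9,604
```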
So these are essentially the two ways that you're going to be using the central limit theorem: either you're given n, and you try to calculate probabilities; or you're given the probabilities, and you want to work backwards to find n itself.

So in this example, the random variable that we dealt with was, of course, a binomial random variable. The Xi's were Bernoulli, so the sum of the Xi's was binomial. So the central limit theorem certainly applies to the binomial distribution. To be more precise, of course, it applies to the standardized version of the binomial random variable. So here's what we did, essentially, in the previous example. We fix the number p, which is the probability of success in our experiment -- p corresponds to f in the previous example. We let every Xi be a Bernoulli random variable, and our standing assumption is that these random variables are independent. When we add them, we get a random variable that has a binomial distribution. We know the mean and the variance of the binomial, so we take Sn, subtract the mean, and divide by the standard deviation. The central limit theorem tells us that the cumulative distribution function of this standardized random variable converges to the CDF of a standard normal.

So let's do one more example of a calculation, with specific numbers to work with: let's take n to be 36 and p to be 1/2, and let's approximate the probability that Sn is less than or equal to 21. The first thing to do is to find the expected value of Sn, which is n times p -- it's 18. Then we need to write down the standard deviation. The variance of Sn is the sum of the variances; it's np times (1-p). In this particular example, p times (1-p) is 1/4 and n is 36, so the variance is 9, and that tells us that the standard deviation of Sn is equal to 3.

So what we're going to do is to take the event of interest, which is Sn less than or equal to 21, and rewrite it in a way that involves the standardized random variable. To do that, we need to subtract the mean, so we write this as Sn minus 18 less than or equal to 21 minus 18. This is the same event.
And then we divide by the standard deviation, which is 3, and we end up with a nice number on the right-hand side, which is 1. So the event of interest, that Sn is less than 21, is the same as the event that the standardized random variable -- approximately a standard normal -- is less than or equal to 1. And once more, you can look this up in the normal tables, and you find that the answer you get is 0.8413.

Now it's interesting to compare this answer that we got through the central limit theorem with the exact answer. The exact answer involves the exact binomial distribution. What we have here is the binomial probability that Sn is equal to k, given by this formula, and we add over all values of k going from 0 up to 21. We write two lines of code to calculate this sum, and we get the exact answer, which is 0.8785. So there's pretty good agreement between the two, although you wouldn't necessarily call it excellent agreement.

Can we do a little better than that? OK. It turns out that we can, and here's the idea. So our random variable Sn has a mean of 18. It has a binomial distribution, described by a PMF that has a shape roughly like this and which keeps going on. Using the central limit theorem is basically pretending that Sn is normal with the right mean and variance. So, just as we approximate Zn, which has 0 mean and unit variance, by Z, which has 0 mean and unit variance, pretending that Sn is normal means approximating it with a normal that has the correct mean and correct variance. It would still be centered at 18, and it would have the same variance as the binomial PMF. So using the central limit theorem essentially means that we keep the mean and the variance what they are, but we pretend that our distribution is normal. We want to calculate the probability that Sn is less than or equal to 21.
I pretend that my random variable is normal, so I draw a line here, at 21, and I calculate the area under the normal curve going up to 21. That's essentially what we did.

Now, a smart person comes around and says: Sn is a discrete random variable, so the event that Sn is less than or equal to 21 is the same as Sn being strictly less than 22, because nothing in between can happen. So I'm going to use the central limit theorem approximation by again pretending that Sn is normal, and finding the probability of this event. What this person would do is draw a line here, at 22, and calculate the area under the normal curve all the way up to 22. Who is right? Which one is better? Well, neither, but we can do better than both if we sort of split the difference. So another way of writing the same event for Sn is to write it as Sn being less than 21.5. In terms of the discrete random variable Sn, all three of these are exactly the same event. But when you do the continuous approximation, they give you different probabilities. It's a matter of whether you integrate the area under the normal curve up to 21, up to the midway point, or up to 22. It turns out that integrating up to the midpoint is what gives us the better numerical results. So we take 21 and 1/2, and we integrate the area under the normal curve up to there.

So let's do this calculation and see what we get. What would we change here? Instead of 21, we would now write 21 and 1/2. The 18 stays what it is, but the 21 becomes 21 and 1/2, and so the number on the right-hand side becomes (21.5 minus 18) divided by 3, which is 1 plus 0.5/3, approximately 1.17. So we now look up in the normal tables the probability that Z is less than 1.17 -- this here gets approximated by the probability that the standard normal is less than 1.17 -- and the normal tables will tell us this is 0.879. Going back to the previous slide, what we got this time with this improved approximation is 0.879.
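Here's the whole comparison in code -- a sketch with n = 36 and p = 1/2, including the "two lines of code" that sum the exact binomial PMF:

```python
from math import comb
from statistics import NormalDist

n, p = 36, 0.5
mean, std = n * p, (n * p * (1 - p)) ** 0.5   # 18 and 3

# The exact answer: sum the binomial PMF from k = 0 up to 21.
exact = sum(comb(n, k) * p**k * (1 - p)**(n - k) for k in range(22))
print(exact)                                  # 0.8785...

# Normal approximation with three cutoffs: 21, 22, and the midpoint 21.5.
Z = NormalDist()
for cutoff in (21, 22, 21.5):
    print(cutoff, round(Z.cdf((cutoff - mean) / std), 4))
# 21 -> Phi(1) = 0.8413, 22 -> Phi(4/3) = 0.9088,
# 21.5 -> Phi(7/6) = 0.8783 (the table's 1.17 rounds this to 0.879)
```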
So 0.879 is a really good approximation of the correct number, 0.8785. The 0.8413 is what we got using 21; the 0.879 is what we get using 21 and 1/2, and it's an approximation that's right on -- a very good one. The moral from this numerical example is that doing this 1/2 correction does give us better approximations.

In fact, we can use this 1/2 idea to calculate even individual probabilities. So suppose you want to approximate the probability that Sn is equal to 19. If you were to pretend that Sn is normal and calculate this probability directly, the probability that a normal random variable is exactly equal to 19 is 0, so you don't get an interesting answer. You get a more interesting answer by writing the event that Sn is 19 as the event of falling between 18 and 1/2 and 19 and 1/2, and using the normal approximation to calculate this probability.

In terms of our previous picture, this corresponds to the following. We are interested in the probability that Sn is equal to 19, so we're interested in the height of this bar. We consider the area under the normal curve going from 18.5 to 19.5, and use this area as an approximation for the height of that particular bar. So what we're basically doing is taking the probability under the normal curve, which is assigned over a continuum of values, and attributing it to the different discrete values. Whatever is above the midpoint gets attributed to 19; whatever is below that midpoint gets attributed to 18. So this green area is our approximation of the value of the PMF at 19. Similarly, if you wanted to approximate the value of the PMF at this other point, you would take that interval and integrate the area under the normal curve over it.

It turns out that this gives a very good approximation of the PMF of the binomial. And actually, this was the context in which the central limit theorem was proved in the first place, when this business started. So this business goes back a few hundred years.
And the central limit theorem was first proved by considering the PMF of a binomial random variable when p is equal to 1/2. People did the algebra, and they found out that the exact expression for the PMF is quite well approximated by the expression that you would get from a normal distribution. Then the proof was extended to binomials for more general values of p. So here we talk about this as a refinement of the general central limit theorem, but, historically, that refinement was where the whole business got started in the first place.

All right, so let's go through the mechanics of approximating the probability that Sn is equal to 19 -- exactly 19. As we said, we're going to write this event as an event that covers an interval of unit length, from 18 and 1/2 to 19 and 1/2. This is the event of interest. The first step is to massage the event of interest so that it involves our Zn random variable: subtract 18 on all sides, and divide by the standard deviation of 3 on all sides. That's an equivalent representation of the event. This in the middle is our standardized random variable Zn, and these on either side are just numbers. To do the approximation, we want to find the probability of this event; but Zn is approximately normal, so we plug in Z, the standard normal. So we want to find the probability that the standard normal falls inside this interval. You find this using CDFs, because this is the probability that you're less than this number but not less than that one -- a difference between two cumulative probabilities. Then you look up your normal tables, you find the two numbers for these quantities, and, finally, you get a numerical answer for an individual entry of the PMF of the binomial. This is a pretty good approximation, it turns out. If you were to do the calculation using the exact formula, you would get something which is pretty close -- an error in the third digit. This is pretty good.
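The same mechanics in code -- a sketch, again with n = 36 and p = 1/2:

```python
from math import comb
from statistics import NormalDist

n, p = 36, 0.5
mean, std = n * p, (n * p * (1 - p)) ** 0.5   # 18 and 3
Z = NormalDist()

# P(Sn = 19) as the normal area between 18.5 and 19.5, after standardizing.
approx = Z.cdf((19.5 - mean) / std) - Z.cdf((18.5 - mean) / std)

# The exact binomial PMF entry, for comparison.
exact = comb(n, 19) * p**19 * (1 - p)**(n - 19)
print(round(approx, 4), round(exact, 4))      # 0.1253 vs 0.1251
```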
So I guess what we did here with our discussion of the binomial slightly contradicts what I said before -- that the central limit theorem is a statement about cumulative distribution functions. In general, it doesn't tell you how to approximate PMFs themselves, and that's indeed the case in general. On the other hand, for the special case of the binomial distribution, the central limit theorem approximation, with this 1/2 correction, is a very good approximation even for the individual PMF.

All right, so we spent quite a bit of time on mechanics. Let's spend the last few minutes today thinking a bit, and look at a small puzzle. The puzzle is the following. Consider a Poisson process that runs over a unit interval, where the arrival rate is equal to 1. So this is the unit interval, and let X be the number of arrivals. X is Poisson, with mean 1. Now, let me take this interval and divide it into n little pieces, so that each piece has length 1/n, and let Xi be the number of arrivals during the i-th little interval.

OK, what do we know about the random variables Xi? They are themselves Poisson -- each is the number of arrivals during a small interval. We also know that when n is big, so that the length of each interval is small, these Xi's are approximately Bernoulli, with mean 1/n. I guess it doesn't matter whether we model them as Bernoulli or not. What matters is that the Xi's are independent. Why are they independent? Because, in a Poisson process, disjoint intervals are independent of each other. So the Xi's are independent, and they also have the same distribution. And we have that X, the total number of arrivals, is the sum of the Xi's. So the central limit theorem tells us that, approximately, the sum of independent, identically distributed random variables, when we have lots of these random variables, behaves like a normal random variable.
725 00:45:01,530 --> 00:45:07,475 So by using this decomposition of X into a sum of i.i.d 726 00:45:07,475 --> 00:45:11,540 random variables, and by using values of n that are bigger 727 00:45:11,540 --> 00:45:16,540 and bigger, by taking the limit, it should follow that X 728 00:45:16,540 --> 00:45:19,510 has a normal distribution. 729 00:45:19,510 --> 00:45:22,120 On the other hand, we know that X has a Poisson 730 00:45:22,120 --> 00:45:23,370 distribution. 731 00:45:25,270 --> 00:45:32,640 So something must be wrong in this argument here. 732 00:45:32,640 --> 00:45:34,900 Can we really use the central limit 733 00:45:34,900 --> 00:45:38,330 theorem in this situation? 734 00:45:38,330 --> 00:45:41,300 So what do we need for the central limit theorem? 735 00:45:41,300 --> 00:45:44,160 We need to have independent, identically 736 00:45:44,160 --> 00:45:46,700 distributed random variables. 737 00:45:46,700 --> 00:45:49,060 We have it here. 738 00:45:49,060 --> 00:45:53,410 We want them to have a finite mean and finite variance. 739 00:45:53,410 --> 00:45:57,610 We also have it here, means variances are finite. 740 00:45:57,610 --> 00:46:02,050 What is another assumption that was never made explicit, 741 00:46:02,050 --> 00:46:04,080 but essentially was there? 742 00:46:07,680 --> 00:46:13,260 Or in other words, what is the flaw in this argument that 743 00:46:13,260 --> 00:46:15,520 uses the central limit theorem here? 744 00:46:15,520 --> 00:46:16,770 Any thoughts? 745 00:46:24,110 --> 00:46:29,640 So in the central limit theorem, we said, consider-- 746 00:46:29,640 --> 00:46:34,820 fix a probability distribution, and let the Xi's 747 00:46:34,820 --> 00:46:38,280 be distributed according to that probability distribution, 748 00:46:38,280 --> 00:46:42,935 and add a larger and larger number or Xi's. 749 00:46:42,935 --> 00:46:47,410 But the underlying, unstated assumption is that we fix the 750 00:46:47,410 --> 00:46:49,490 distribution of the Xi's. 751 00:46:49,490 --> 00:46:52,810 As we let n increase, the statistics of 752 00:46:52,810 --> 00:46:55,930 each Xi do not change. 753 00:46:55,930 --> 00:46:59,010 Whereas here, I'm playing a trick on you. 754 00:46:59,010 --> 00:47:03,700 As I'm taking more and more random variables, I'm actually 755 00:47:03,700 --> 00:47:07,850 changing what those random variables are. 756 00:47:07,850 --> 00:47:12,960 When I take a larger n, the Xi's are random variables with 757 00:47:12,960 --> 00:47:15,720 a different mean and different variance. 758 00:47:15,720 --> 00:47:19,800 So I'm adding more of these, but at the same time, in this 759 00:47:19,800 --> 00:47:23,420 example, I'm changing their distributions. 760 00:47:23,420 --> 00:47:26,380 That's something that doesn't fit the setting of the central 761 00:47:26,380 --> 00:47:27,000 limit theorem. 762 00:47:27,000 --> 00:47:29,910 In the central limit theorem, you first fix the distribution 763 00:47:29,910 --> 00:47:31,200 of the X's. 764 00:47:31,200 --> 00:47:35,290 You keep it fixed, and then you consider adding more and 765 00:47:35,290 --> 00:47:38,950 more according to that particular fixed distribution. 766 00:47:38,950 --> 00:47:40,020 So that's the catch. 767 00:47:40,020 --> 00:47:42,240 That's why the central limit theorem does not 768 00:47:42,240 --> 00:47:43,970 apply to this situation. 
769 00:47:43,970 --> 00:47:46,230 And we're lucky that it doesn't apply because, 770 00:47:46,230 --> 00:47:50,220 otherwise, we would have a huge contradiction destroying 771 00:47:50,220 --> 00:47:52,770 probability theory. 772 00:47:52,770 --> 00:48:02,240 OK, but now that still leaves us with a 773 00:48:02,240 --> 00:48:05,040 little bit of a dilemma. 774 00:48:05,040 --> 00:48:08,510 Suppose that, here, essentially, we're adding 775 00:48:08,510 --> 00:48:12,815 independent Bernoulli random variables. 776 00:48:22,650 --> 00:48:25,300 So the issue is that the central limit theorem has to 777 00:48:25,300 --> 00:48:28,920 do with asymptotics as n goes to infinity. 778 00:48:28,920 --> 00:48:34,260 And if we consider a binomial, and somebody gives us specific 779 00:48:34,260 --> 00:48:38,870 numbers about the parameters of that binomial, it might not 780 00:48:38,870 --> 00:48:40,830 necessarily be obvious what kind of 781 00:48:40,830 --> 00:48:42,790 approximation to use. 782 00:48:42,790 --> 00:48:45,660 In particular, we do have two different approximations for 783 00:48:45,660 --> 00:48:47,100 the binomial. 784 00:48:47,100 --> 00:48:51,610 If we fix p, then the binomial is the sum of Bernoulli's that 785 00:48:51,610 --> 00:48:54,930 come from a fixed distribution, and we consider more 786 00:48:54,930 --> 00:48:56,450 and more of these. 787 00:48:56,450 --> 00:48:58,990 When we add them, the central limit theorem tells us that we 788 00:48:58,990 --> 00:49:01,190 get the normal distribution. 789 00:49:01,190 --> 00:49:04,430 There's another sort of limit, which has the flavor of this 790 00:49:04,430 --> 00:49:10,770 example, in which we still deal with a binomial, a sum of n 791 00:49:10,770 --> 00:49:11,170 Bernoulli's. 792 00:49:11,170 --> 00:49:14,310 We let the number of 793 00:49:14,310 --> 00:49:16,090 Bernoulli's in that sum go to infinity. 794 00:49:16,090 --> 00:49:18,890 But each Bernoulli has a probability of success that 795 00:49:18,890 --> 00:49:23,830 goes to 0, and we do this in a way so that np, the expected 796 00:49:23,830 --> 00:49:27,090 number of successes, stays finite. 797 00:49:27,090 --> 00:49:30,660 This is the situation that we dealt with when we first 798 00:49:30,660 --> 00:49:32,960 defined our Poisson process. 799 00:49:32,960 --> 00:49:37,540 We have a very, very large number of time slots, 800 00:49:37,540 --> 00:49:40,920 but during each time slot, there's a tiny probability of 801 00:49:40,920 --> 00:49:42,950 obtaining an arrival. 802 00:49:42,950 --> 00:49:48,460 Under that setting, in discrete time, we have a 803 00:49:48,460 --> 00:49:51,670 binomial distribution, or a Bernoulli process, but when we 804 00:49:51,670 --> 00:49:54,530 take the limit, we obtain the Poisson process and the 805 00:49:54,530 --> 00:49:56,470 Poisson approximation. 806 00:49:56,470 --> 00:49:58,510 So these are two equally valid 807 00:49:58,510 --> 00:50:00,550 approximations of the binomial. 808 00:50:00,550 --> 00:50:03,300 But they're valid in different asymptotic regimes. 809 00:50:03,300 --> 00:50:06,180 In one regime, we fix p and let n go to infinity. 810 00:50:06,180 --> 00:50:09,360 In the other regime, we let both n and p change 811 00:50:09,360 --> 00:50:11,540 simultaneously. 812 00:50:11,540 --> 00:50:14,240 Now, in real life, you're never dealing with the 813 00:50:14,240 --> 00:50:15,290 limiting situations. 814 00:50:15,290 --> 00:50:17,870 You're dealing with actual numbers.
815 00:50:17,870 --> 00:50:21,820 So if somebody tells you that the numbers are like this, 816 00:50:21,820 --> 00:50:25,160 then you should probably say that this is the situation 817 00:50:25,160 --> 00:50:27,380 that fits the Poisson description-- 818 00:50:27,380 --> 00:50:30,180 a large number of slots, with each slot having a tiny 819 00:50:30,180 --> 00:50:32,460 probability of success. 820 00:50:32,460 --> 00:50:36,890 On the other hand, if p is something like this, and n is 821 00:50:36,890 --> 00:50:40,460 500, then you can look at the distribution of the 822 00:50:40,460 --> 00:50:41,680 number of successes. 823 00:50:41,680 --> 00:50:45,740 It's going to have a mean of 50 and a fair amount 824 00:50:45,740 --> 00:50:47,280 of spread around there. 825 00:50:47,280 --> 00:50:50,150 It turns out that the normal approximation would be better 826 00:50:50,150 --> 00:50:51,500 in this context. 827 00:50:51,500 --> 00:50:57,120 As a rule of thumb, if n times p is bigger than 10 or 20, you 828 00:50:57,120 --> 00:50:59,320 can start using the normal approximation. 829 00:50:59,320 --> 00:51:04,310 If n times p is a small number, then you prefer to use 830 00:51:04,310 --> 00:51:06,090 the Poisson approximation. 831 00:51:06,090 --> 00:51:08,840 But there are no hard theorems or rules about 832 00:51:08,840 --> 00:51:11,650 how to go about this. 833 00:51:11,650 --> 00:51:15,440 OK, so from next time we're going to switch gears again. 834 00:51:15,440 --> 00:51:17,830 And we're going to put together everything we learned 835 00:51:17,830 --> 00:51:20,620 in this class to start solving inference problems.
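[As a rough numerical illustration of this rule of thumb, here is a sketch comparing the two approximations in both regimes, assuming Python's standard library. The pair n = 100, p = 0.01 is a hypothetical stand-in for the small-np case; n = 500 with p = 0.1 matches the mean of 50 mentioned above.]

```python
import math

def binom_pmf(k, n, p):
    """Exact PMF of Binomial(n, p)."""
    return math.comb(n, k) * p**k * (1 - p)**(n - k)

def poisson_pmf(k, mu):
    # computed in log space to avoid overflow for large k
    return math.exp(k * math.log(mu) - mu - math.lgamma(k + 1))

def norm_cdf(x):
    """CDF of the standard normal, via the error function."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def normal_pmf(k, n, p):
    """Normal approximation to the binomial PMF with the 1/2 correction."""
    mu, sigma = n * p, math.sqrt(n * p * (1 - p))
    return norm_cdf((k + 0.5 - mu) / sigma) - norm_cdf((k - 0.5 - mu) / sigma)

for n, p in ((100, 0.01), (500, 0.1)):   # np = 1 versus np = 50
    err_poisson = max(abs(binom_pmf(k, n, p) - poisson_pmf(k, n * p))
                      for k in range(n + 1))
    err_normal = max(abs(binom_pmf(k, n, p) - normal_pmf(k, n, p))
                     for k in range(n + 1))
    print(f"n = {n}, p = {p}: Poisson error = {err_poisson:.5f}, "
          f"normal error = {err_normal:.5f}")
```

[For np = 1, the Poisson approximation wins by a wide margin; for np = 50, the normal approximation with the 1/2 correction is the better one.]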