1 00:00:00,520 --> 00:00:03,450 An important application of the central limit theorem is 2 00:00:03,450 --> 00:00:07,190 in the approximate calculation of the binomial probabilities. 3 00:00:07,190 --> 00:00:09,270 Here is what is involved. 4 00:00:09,270 --> 00:00:11,060 We start with random variables-- 5 00:00:11,060 --> 00:00:11,840 Xi-- 6 00:00:11,840 --> 00:00:13,280 that are independent. 7 00:00:13,280 --> 00:00:14,800 And they have the same distribution. 8 00:00:14,800 --> 00:00:17,050 They're all Bernoulli with parameter p. 9 00:00:17,050 --> 00:00:20,550 We add n of those random variables, and the resulting 10 00:00:20,550 --> 00:00:24,450 random variable, Sn, we know that it has a binomial PMF 11 00:00:24,450 --> 00:00:26,110 with parameters n and p. 12 00:00:26,110 --> 00:00:30,010 We also know its mean, and we do know its variance. 13 00:00:30,010 --> 00:00:33,430 What the central limit theorem tells us, in this case, since 14 00:00:33,430 --> 00:00:35,790 we're dealing with the sum of independent identically 15 00:00:35,790 --> 00:00:38,640 distributed random variables, is the following. 16 00:00:38,640 --> 00:00:41,340 If we take this random variable here that we have 17 00:00:41,340 --> 00:00:45,610 been denoting by Zn, which is a standardized version of Sn-- 18 00:00:45,610 --> 00:00:48,550 we subtract the mean of Sn and divide by the standard 19 00:00:48,550 --> 00:00:49,630 deviation-- 20 00:00:49,630 --> 00:00:53,710 this random variable has a CDF that approaches as n goes to 21 00:00:53,710 --> 00:00:57,070 infinity, the CDF of a standard normal. 22 00:00:57,070 --> 00:01:00,500 So let us use what we now know to calculate some 23 00:01:00,500 --> 00:01:01,910 probabilities. 24 00:01:01,910 --> 00:01:03,300 Let us fix some parameters. 25 00:01:03,300 --> 00:01:04,349 n is 36. 26 00:01:04,349 --> 00:01:05,620 p is 0.5.
27 00:01:05,620 --> 00:01:08,260 And we wish to calculate the probability that Sn is less 28 00:01:08,260 --> 00:01:10,200 than or equal to 21. 29 00:01:10,200 --> 00:01:13,930 Now, in this case, we can calculate it exactly using the 30 00:01:13,930 --> 00:01:15,420 binomial formula. 31 00:01:15,420 --> 00:01:18,830 The probability of being less than or equal to 21 is the sum 32 00:01:18,830 --> 00:01:22,710 of the probabilities of all the numbers from 0 to 21. 33 00:01:22,710 --> 00:01:26,670 And this is the probability of obtaining a number k. 34 00:01:26,670 --> 00:01:29,950 And by calculating this expression, we obtain this 35 00:01:29,950 --> 00:01:32,880 number, which is the exact answer. 36 00:01:32,880 --> 00:01:35,950 Now, let us proceed using the central limit theorem. 37 00:01:35,950 --> 00:01:40,039 We are interested in this probability, but we will use 38 00:01:40,039 --> 00:01:43,590 the fact about the CDF of this related random variable. 39 00:01:43,590 --> 00:01:45,720 So the first step is to calculate n 40 00:01:45,720 --> 00:01:47,850 times p, which is 18. 41 00:01:47,850 --> 00:01:50,470 The second step is to calculate this denominator 42 00:01:50,470 --> 00:01:54,600 here, which in our case evaluates to 3. 43 00:01:54,600 --> 00:01:58,580 Now, since we know something about the CDF of this random 44 00:01:58,580 --> 00:02:02,450 variable, what we need to do is to take this event and 45 00:02:02,450 --> 00:02:05,830 rewrite it in terms of this random variable. 46 00:02:05,830 --> 00:02:11,460 So we have the event of interest, which is that Sn is 47 00:02:11,460 --> 00:02:14,150 less than or equal to 21. 48 00:02:14,150 --> 00:02:17,460 This is the same as the event that Sn minus 18 is less than 49 00:02:17,460 --> 00:02:19,770 or equal to 21 minus 18. 50 00:02:19,770 --> 00:02:22,800 And it's the same as this event here, where we divide 51 00:02:22,800 --> 00:02:25,829 both sides by 3. 
52 00:02:25,829 --> 00:02:29,520 Now, what we have here is the probability that this random 53 00:02:29,520 --> 00:02:33,060 variable Zn is less than or equal to 1. 54 00:02:33,060 --> 00:02:37,079 But now, Zn is approximately a standard normal, so we can use 55 00:02:37,079 --> 00:02:40,920 here the CDF of the standard normal distribution, 56 00:02:40,920 --> 00:02:42,350 which is Phi of 1. 57 00:02:42,350 --> 00:02:45,550 And at this point, we look at the tables for the normal 58 00:02:45,550 --> 00:02:46,590 distribution. 59 00:02:46,590 --> 00:02:48,440 We'll find this entry here. 60 00:02:48,440 --> 00:02:54,079 And this gives us an answer of 0.8413. 61 00:02:54,079 --> 00:02:56,340 This is a pretty good approximation of the exact 62 00:02:56,340 --> 00:02:59,340 answer, which is 0.8785. 63 00:02:59,340 --> 00:03:01,460 But it is not a great approximation. 64 00:03:01,460 --> 00:03:04,150 It is off by about four percentage points. 65 00:03:04,150 --> 00:03:07,960 Can we do better than that? 66 00:03:07,960 --> 00:03:11,550 It turns out that we can get a better approximation. 67 00:03:11,550 --> 00:03:14,160 And let us see how this can be done. 68 00:03:14,160 --> 00:03:17,320 Recall that we approximated this probability using the 69 00:03:17,320 --> 00:03:20,380 central limit theorem and found this numerical value. 70 00:03:20,380 --> 00:03:23,690 But we make an observation that this probability is equal 71 00:03:23,690 --> 00:03:25,300 to this probability here. 72 00:03:25,300 --> 00:03:26,450 Why is that? 73 00:03:26,450 --> 00:03:28,550 Sn is an integer random variable. 74 00:03:28,550 --> 00:03:31,790 Therefore, if I tell you that it is strictly less than 22, 75 00:03:31,790 --> 00:03:35,640 I'm also telling you that it is 21 or less. 76 00:03:35,640 --> 00:03:39,960 Therefore, this event here is the same as that event here. 77 00:03:39,960 --> 00:03:42,480 And therefore, their probabilities are the same. 
78 00:03:42,480 --> 00:03:46,630 So instead of using the central limit approximation to 79 00:03:46,630 --> 00:03:49,630 calculate this probability, let us follow the same 80 00:03:49,630 --> 00:03:53,160 procedure but try to calculate this probability here. 81 00:03:53,160 --> 00:03:56,820 And this probability here is equal to the 82 00:03:56,820 --> 00:04:00,880 probability that Sn minus-- 83 00:04:00,880 --> 00:04:03,690 we subtract the mean, divide by the standard 84 00:04:03,690 --> 00:04:05,520 deviation of Sn-- 85 00:04:05,520 --> 00:04:11,740 is strictly less than 22 minus 18 divided by 3, which is the 86 00:04:11,740 --> 00:04:16,060 probability that the random variable that we denote by Zn, 87 00:04:16,060 --> 00:04:19,630 which is this expression here, is strictly less than 22 88 00:04:19,630 --> 00:04:21,120 minus 18 over 3. 89 00:04:21,120 --> 00:04:23,910 And this is 1.33. 90 00:04:23,910 --> 00:04:27,380 Now, at this point, we pretend that Zn is a standard normal 91 00:04:27,380 --> 00:04:28,210 random variable-- 92 00:04:28,210 --> 00:04:31,510 the probability that the standard normal is less than a 93 00:04:31,510 --> 00:04:32,760 certain number. 94 00:04:32,760 --> 00:04:37,800 This is the standard normal CDF evaluated at that number. 95 00:04:37,800 --> 00:04:43,020 And then we look in the normal tables at 1.33 and we 96 00:04:43,020 --> 00:04:49,490 find this value of 0.9082. 97 00:04:49,490 --> 00:04:52,580 Now, we compare this value with the exact 98 00:04:52,580 --> 00:04:54,720 answer for this problem. 99 00:04:54,720 --> 00:04:58,159 And we see that we again missed it. 100 00:04:58,159 --> 00:05:02,370 Using this approximation to this quantity gave us an 101 00:05:02,370 --> 00:05:04,510 underestimate of this number. 102 00:05:04,510 --> 00:05:07,430 Now, we obtained an overestimate. 103 00:05:07,430 --> 00:05:10,420 The true value is somewhere in the middle.
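The comparison just described can be checked numerically. The following is a minimal sketch, assuming only the Python standard library; the names `phi`, `under`, and `over` are illustrative and not from the lecture. Because it evaluates the normal CDF at the unrounded value 4/3 rather than the table entry 1.33, the overestimate differs slightly in the last digits from the table lookup.

```python
from math import comb, erf, sqrt

def phi(z):
    """Standard normal CDF, written in terms of the error function."""
    return 0.5 * (1.0 + erf(z / sqrt(2.0)))

n, p = 36, 0.5
mean = n * p                    # n * p = 18
std = sqrt(n * p * (1 - p))     # sqrt(36 * 0.25) = 3

# Exact binomial probability P(Sn <= 21): sum the PMF over k = 0..21.
exact = sum(comb(n, k) * p**k * (1 - p)**(n - k) for k in range(22))

under = phi((21 - mean) / std)  # CLT applied to the event {Sn <= 21}
over = phi((22 - mean) / std)   # CLT applied to the event {Sn < 22}

print(f"exact={exact:.4f}  under={under:.4f}  over={over:.4f}")
```

The two events are identical for an integer-valued Sn, yet the two normal approximations bracket the exact value.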
104 00:05:10,420 --> 00:05:13,250 So this suggests that we may want to do something that 105 00:05:13,250 --> 00:05:17,560 combines these two alternative choices here. 106 00:05:17,560 --> 00:05:20,750 But before doing that, it's good to understand what 107 00:05:20,750 --> 00:05:24,350 exactly we have been doing all along. 108 00:05:24,350 --> 00:05:27,000 What we're doing is the following. 109 00:05:27,000 --> 00:05:31,370 We have the PMF of the binomial centered at 18, which 110 00:05:31,370 --> 00:05:32,250 is the mean. 111 00:05:32,250 --> 00:05:34,250 It's a discrete random variable. 112 00:05:34,250 --> 00:05:37,530 But when we use the central limit theorem, we pretend that 113 00:05:37,530 --> 00:05:43,190 the binomial is normal, while we keep the same mean 114 00:05:43,190 --> 00:05:44,440 and variance. 115 00:05:46,720 --> 00:05:50,130 Now, when we calculate probabilities, if we want to 116 00:05:50,130 --> 00:05:54,550 find the discrete probability that Sn is less than or equal 117 00:05:54,550 --> 00:05:59,020 to 21, which is the sum of these probabilities, what we 118 00:05:59,020 --> 00:06:05,380 do is we look at the area under the normal 119 00:06:05,380 --> 00:06:09,610 PDF from 21 and below. 120 00:06:09,610 --> 00:06:14,700 In the alternative approach, when we use the central limit 121 00:06:14,700 --> 00:06:18,180 theorem to approximate the probability of this event, we 122 00:06:18,180 --> 00:06:24,100 go to 22, and we look at the event of falling below 22. 123 00:06:24,100 --> 00:06:30,060 This means that we're looking at the area from 22 and lower. 124 00:06:30,060 --> 00:06:36,650 So in one approach, this particular region is not used 125 00:06:36,650 --> 00:06:37,690 in the calculation. 126 00:06:37,690 --> 00:06:39,250 That's what we did here. 127 00:06:39,250 --> 00:06:42,560 But in the second approach, it was used in the calculation. 128 00:06:42,560 --> 00:06:45,690 Should it be used or not?
129 00:06:45,690 --> 00:06:52,180 It makes more sense to use only part of this solid region 130 00:06:52,180 --> 00:06:55,150 and assign it to the calculation of the probability 131 00:06:55,150 --> 00:06:57,470 of being at 21 or less. 132 00:06:57,470 --> 00:07:01,890 Namely, we can take the midpoint here, where the 133 00:07:01,890 --> 00:07:07,690 midpoint is at 21.5, and calculate the area under the 134 00:07:07,690 --> 00:07:13,340 normal PDF only going up to 21.5. 135 00:07:13,340 --> 00:07:17,170 What this amounts to is looking at this particular 136 00:07:17,170 --> 00:07:18,420 event here. 137 00:07:18,420 --> 00:07:21,520 Now, this event is, of course, identical to this event that 138 00:07:21,520 --> 00:07:25,130 we have been considering, because again, Sn is a 139 00:07:25,130 --> 00:07:29,470 discrete random variable that takes integer values. 140 00:07:29,470 --> 00:07:32,510 But when we approximate it by a normal, it does make a 141 00:07:32,510 --> 00:07:34,530 difference whether we write the event 142 00:07:34,530 --> 00:07:36,840 this way or that way. 143 00:07:36,840 --> 00:07:40,570 So here, we're going to obtain the probability that the 144 00:07:40,570 --> 00:07:43,760 standardized version of Sn is less than. 145 00:07:43,760 --> 00:07:46,180 We follow the same calculation, but now we have 146 00:07:46,180 --> 00:07:52,100 21.5 minus 18 divided by 3. 147 00:07:52,100 --> 00:07:56,730 And this number here is 1.17. 148 00:07:56,730 --> 00:08:01,550 And using the central limit theorem calculation, this is 149 00:08:01,550 --> 00:08:08,960 the CDF of the standard normal evaluated at 1.17, which we 150 00:08:08,960 --> 00:08:12,090 can go and look up in the normal table to find 151 00:08:12,090 --> 00:08:16,960 the value of 0.8790. 152 00:08:16,960 --> 00:08:21,840 And now, we notice that this value is remarkably close to 153 00:08:21,840 --> 00:08:23,320 the true value.
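The midpoint calculation can be reproduced in a few lines. This is a minimal sketch using only the Python standard library; `phi` and `corrected` are illustrative names, not notation from the lecture. It uses the unrounded standardized value 3.5/3, so the result is close to, but not identical with, the table lookup at 1.17.

```python
from math import comb, erf, sqrt

def phi(z):
    """Standard normal CDF, written in terms of the error function."""
    return 0.5 * (1.0 + erf(z / sqrt(2.0)))

n, p = 36, 0.5
mean = n * p                    # 18
std = sqrt(n * p * (1 - p))     # 3

# Exact binomial probability P(Sn <= 21), for comparison.
exact = sum(comb(n, k) * p**k * (1 - p)**(n - k) for k in range(22))

# Normal approximation with the cutoff moved to the midpoint 21.5.
corrected = phi((21.5 - mean) / std)

# The two earlier choices of cutoff, 21 and 22.
at_21 = phi((21 - mean) / std)
at_22 = phi((22 - mean) / std)

print(f"exact={exact:.4f}  corrected={corrected:.4f}")
print(f"errors: 21 -> {abs(at_21 - exact):.4f}, "
      f"22 -> {abs(at_22 - exact):.4f}, "
      f"21.5 -> {abs(corrected - exact):.4f}")
```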
154 00:08:23,320 --> 00:08:26,730 It is much better as an approximation than what we 155 00:08:26,730 --> 00:08:31,980 obtained using either this choice or that choice. 156 00:08:31,980 --> 00:08:37,270 And since this approximation is so good, we may consider 157 00:08:37,270 --> 00:08:41,370 even using it to approximate individual probabilities of 158 00:08:41,370 --> 00:08:43,350 the binomial PMF. 159 00:08:43,350 --> 00:08:46,130 Let's see what that takes. 160 00:08:46,130 --> 00:08:49,580 Let us try to approximate, as an example, the probability 161 00:08:49,580 --> 00:08:53,680 that Sn takes a value of exactly 19. 162 00:08:53,680 --> 00:08:58,610 So what we will do will be to write the event that Sn is 163 00:08:58,610 --> 00:09:08,770 equal to 19 as the event that Sn lies between 18.5 and 19.5. 164 00:09:08,770 --> 00:09:12,010 In terms of the picture that we were discussing before, 165 00:09:12,010 --> 00:09:15,290 what we are doing, essentially, is to take the 166 00:09:15,290 --> 00:09:23,560 area under the normal PDF that extends from 18.5 to 19.5 and 167 00:09:23,560 --> 00:09:27,640 declare that this area corresponds to the discrete 168 00:09:27,640 --> 00:09:32,220 event that our binomial random variable takes a value of 19. 169 00:09:32,220 --> 00:09:35,660 Similarly, if we wanted to calculate approximately the 170 00:09:35,660 --> 00:09:40,060 value of the probability that Sn takes a value of 21, we 171 00:09:40,060 --> 00:09:43,200 would consider the area under the normal PDF 172 00:09:43,200 --> 00:09:46,890 from 20.5 to 21.5. 173 00:09:46,890 --> 00:09:49,630 So let us now continue with this approach. 174 00:09:49,630 --> 00:09:54,940 We do the usual calculations, which is to express this event 175 00:09:54,940 --> 00:09:57,420 in terms of standardized values. 176 00:09:57,420 --> 00:10:02,080 That is, we subtract throughout the mean of Sn and 177 00:10:02,080 --> 00:10:04,420 divide by standard deviation.
178 00:10:04,420 --> 00:10:08,560 So what we obtain here is the standardized version of Sn. 179 00:10:08,560 --> 00:10:15,340 And that has to be, now, less than or equal to 19.5 minus 18 180 00:10:15,340 --> 00:10:19,980 divided by 3, which is the probability that our 181 00:10:19,980 --> 00:10:30,430 standardized random variable lies between 0.17 and 0.5. 182 00:10:30,430 --> 00:10:35,230 And now, if we pretend that Zn is a standard normal random 183 00:10:35,230 --> 00:10:38,000 variable, which is what the central limit theorem 184 00:10:38,000 --> 00:10:42,170 suggests, this is going to be equal to the probability that 185 00:10:42,170 --> 00:10:48,060 the standard normal is less than or equal to 0.5 minus the 186 00:10:48,060 --> 00:10:53,530 probability that it is less than 0.17. 187 00:10:53,530 --> 00:10:57,750 And if we look up those entries in the normal tables, 188 00:10:57,750 --> 00:11:05,750 what we find is an answer of 0.6915 minus this number, 189 00:11:05,750 --> 00:11:10,090 which evaluates to 0.124. 190 00:11:10,090 --> 00:11:14,070 And what is the exact answer if we were to use the binomial 191 00:11:14,070 --> 00:11:15,740 probability formulas? 192 00:11:15,740 --> 00:11:20,720 The exact answer is remarkably close to what we obtained in 193 00:11:20,720 --> 00:11:23,070 our approximation. 194 00:11:23,070 --> 00:11:26,720 This example illustrates a more general fact that this 195 00:11:26,720 --> 00:11:30,090 approach of calculating individual entries of the 196 00:11:30,090 --> 00:11:34,130 binomial PMF gives very accurate answers. 
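The approximation of the single PMF entry at 19 can be sketched the same way, assuming only the Python standard library; the helper names `phi` and `approx_pmf` are hypothetical and introduced here for illustration. The code uses the unrounded standardized endpoints 0.5/3 and 1.5/3, so its result differs in the third decimal from the table-based value.

```python
from math import comb, erf, sqrt

def phi(z):
    """Standard normal CDF, written in terms of the error function."""
    return 0.5 * (1.0 + erf(z / sqrt(2.0)))

n, p = 36, 0.5
mean = n * p                    # 18
std = sqrt(n * p * (1 - p))     # 3

def approx_pmf(k):
    """Area under the matching normal PDF from k - 0.5 to k + 0.5."""
    upper = phi((k + 0.5 - mean) / std)
    lower = phi((k - 0.5 - mean) / std)
    return upper - lower

k = 19
approx = approx_pmf(k)
exact = comb(n, k) * p**k * (1 - p)**(n - k)   # binomial PMF at k

print(f"P(Sn = {k}): approx={approx:.4f}  exact={exact:.4f}")
```

Calling `approx_pmf(21)` gives the corresponding area from 20.5 to 21.5 for the other entry mentioned earlier.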
197 00:11:34,130 --> 00:11:36,370 And in fact, there are theorems, there are 198 00:11:36,370 --> 00:11:40,260 theoretical results to this effect, that tell us that this 199 00:11:40,260 --> 00:11:42,460 way of approximating-- 200 00:11:42,460 --> 00:11:45,590 asymptotically, as n goes to infinity and 201 00:11:45,590 --> 00:11:47,380 in a certain regime-- 202 00:11:47,380 --> 00:11:50,140 does give us very accurate approximations.