1 00:00:02,050 --> 00:00:04,500 We now revisit the polling problem that we 2 00:00:04,500 --> 00:00:06,390 have started earlier. 3 00:00:06,390 --> 00:00:09,210 When we first looked at that problem, we used the Chebyshev 4 00:00:09,210 --> 00:00:13,820 inequality to obtain certain bounds and numerical results. 5 00:00:13,820 --> 00:00:18,000 What we want to do now is instead to use a central limit 6 00:00:18,000 --> 00:00:22,110 theorem-type approximation, which we hope will be 7 00:00:22,110 --> 00:00:24,920 more accurate and more informative. 8 00:00:24,920 --> 00:00:27,350 Let us remind ourselves of the setting. 9 00:00:27,350 --> 00:00:30,940 We want to estimate a certain number, p, which is the 10 00:00:30,940 --> 00:00:33,510 fraction of the population that will vote yes in a 11 00:00:33,510 --> 00:00:34,960 certain referendum. 12 00:00:34,960 --> 00:00:39,010 And we estimate p by picking a sample out of the population. 13 00:00:39,010 --> 00:00:40,650 We pick n people. 14 00:00:40,650 --> 00:00:44,360 We pick them randomly, uniformly over the population 15 00:00:44,360 --> 00:00:46,090 and independently. 16 00:00:46,090 --> 00:00:49,360 For each one of the people in the sample, we ask them if 17 00:00:49,360 --> 00:00:52,520 they will vote yes or no, and then we record their 18 00:00:52,520 --> 00:00:56,360 answers in Bernoulli random variables, Xi. 19 00:00:56,360 --> 00:00:59,590 So by the assumptions that we have made, these Xi's are 20 00:00:59,590 --> 00:01:03,950 independent Bernoulli random variables, and their mean is 21 00:01:03,950 --> 00:01:06,580 equal to p. 22 00:01:06,580 --> 00:01:09,760 We count how many X's were equal to 1. 23 00:01:09,760 --> 00:01:11,000 That's the number of yeses. 24 00:01:11,000 --> 00:01:13,700 We divide by n, and that gives us the fraction of the 25 00:01:13,700 --> 00:01:16,070 sample that has responded yes. 26 00:01:16,070 --> 00:01:18,610 This is the sample mean of the X's. 
27 00:01:18,610 --> 00:01:21,210 And we use this sample mean to estimate the 28 00:01:21,210 --> 00:01:24,200 unknown fraction p. 29 00:01:24,200 --> 00:01:28,960 We would like the error in our estimation to be small, that 30 00:01:28,960 --> 00:01:31,990 is the difference between the sample mean and the true value 31 00:01:31,990 --> 00:01:35,280 p to be small, less, let's say, than 32 00:01:35,280 --> 00:01:37,030 one percentage point. 33 00:01:37,030 --> 00:01:40,300 Now there's no way of guaranteeing that this spec 34 00:01:40,300 --> 00:01:44,070 will be met with certainty, unless we sample almost 35 00:01:44,070 --> 00:01:45,780 everyone in the population. 36 00:01:45,780 --> 00:01:49,740 But what we can do instead is to ask that these 37 00:01:49,740 --> 00:01:54,520 specifications are violated with only a small probability. 38 00:01:54,520 --> 00:01:58,150 So we look at the probability that our estimation error is 39 00:01:58,150 --> 00:02:01,290 larger than what we want. 40 00:02:01,290 --> 00:02:04,310 This is the case that we do not meet the specs, and we 41 00:02:04,310 --> 00:02:07,410 would like this probability to be small. 42 00:02:07,410 --> 00:02:11,880 One possible question is what the value of n should be in 43 00:02:11,880 --> 00:02:13,590 order to meet the specs. 44 00:02:13,590 --> 00:02:16,310 But in order to do any calculations, we first need a 45 00:02:16,310 --> 00:02:19,520 way of approximating this probability. 46 00:02:19,520 --> 00:02:22,700 We will do that using the central limit theorem. 47 00:02:22,700 --> 00:02:26,070 The central limit theorem involves this standardized 48 00:02:26,070 --> 00:02:30,730 version of the random variable Sn, where Sn stands for the 49 00:02:30,730 --> 00:02:33,390 sum of the X's. 50 00:02:33,390 --> 00:02:35,090 We know that this random variable is 51 00:02:35,090 --> 00:02:36,620 approximately normal. 
52 00:02:36,620 --> 00:02:41,050 And what we want to do now is to take this event and rewrite 53 00:02:41,050 --> 00:02:44,490 it in an equivalent way but which involves this random 54 00:02:44,490 --> 00:02:46,900 variable Zn. 55 00:02:46,900 --> 00:02:48,030 Let us start. 56 00:02:48,030 --> 00:02:52,590 First, we note that here we have a mu and a sigma, so we should 57 00:02:52,590 --> 00:02:54,280 know what these are. 58 00:02:54,280 --> 00:02:58,480 For a Bernoulli random variable, the mean is what we 59 00:02:58,480 --> 00:03:03,340 already wrote down, and sigma is the square root of p 60 00:03:03,340 --> 00:03:04,930 times 1 minus p. 61 00:03:07,850 --> 00:03:09,930 Now let's look at this event. 62 00:03:09,930 --> 00:03:17,050 Mn is the same as Sn/n, by definition. 63 00:03:17,050 --> 00:03:22,980 And we can write p in this form, minus n times 64 00:03:22,980 --> 00:03:26,790 p divided by n. 65 00:03:26,790 --> 00:03:30,220 And we want this quantity to be larger 66 00:03:30,220 --> 00:03:33,150 than or equal to 0.01. 67 00:03:33,150 --> 00:03:35,550 So this event here is identical to 68 00:03:35,550 --> 00:03:37,490 that event up there. 69 00:03:37,490 --> 00:03:40,940 This starts to look like this expression. 70 00:03:40,940 --> 00:03:42,900 p is the same as mu. 71 00:03:42,900 --> 00:03:45,340 But there is a little bit of a difference in 72 00:03:45,340 --> 00:03:47,190 the denominator terms. 73 00:03:47,190 --> 00:03:49,820 So let's see what we can do. 74 00:03:49,820 --> 00:03:57,860 Let's take this same event but multiply both sides of the 75 00:03:57,860 --> 00:04:00,380 inequality by a square root of n. 76 00:04:00,380 --> 00:04:04,490 This causes this denominator term to become just square 77 00:04:04,490 --> 00:04:09,470 root of n, and we get a square root of n term in the 78 00:04:09,470 --> 00:04:12,260 numerator on the other side. 79 00:04:12,260 --> 00:04:15,030 This is an equivalent description of the event. 
80 00:04:15,030 --> 00:04:20,720 Now we can multiply both sides of this inequality by sigma-- 81 00:04:20,720 --> 00:04:25,580 actually the denominators on both sides by sigma-- 82 00:04:25,580 --> 00:04:28,930 and we obtain this equivalent representation. 83 00:04:28,930 --> 00:04:36,110 But now we notice that here we do have the random variable Zn 84 00:04:36,110 --> 00:04:37,860 that we wanted. 85 00:04:37,860 --> 00:04:42,050 And so we managed to express this event in terms of the 86 00:04:42,050 --> 00:04:44,000 random variable Zn. 87 00:04:44,000 --> 00:04:47,960 In particular what we have is that this probability is the 88 00:04:47,960 --> 00:04:53,380 same as the probability that the absolute value of Zn is 89 00:04:53,380 --> 00:04:58,030 larger than or equal to 0.01 square root of 90 00:04:58,030 --> 00:05:02,140 n divided by sigma. 91 00:05:02,140 --> 00:05:06,080 Then we can use the central limit theorem approximation to 92 00:05:06,080 --> 00:05:09,360 approximate this probability by the corresponding 93 00:05:09,360 --> 00:05:13,630 probability where we now use a standard normal random 94 00:05:13,630 --> 00:05:20,100 variable instead of the Zn random variable. 95 00:05:20,100 --> 00:05:23,860 So here, Z stands for a standard normal random 96 00:05:23,860 --> 00:05:28,780 variable with mean 0 and variance equal to 1. 97 00:05:28,780 --> 00:05:31,720 Let us now continue on a new slide so that we have some 98 00:05:31,720 --> 00:05:33,010 working space. 99 00:05:33,010 --> 00:05:37,520 And here is the result that we have derived so far. 100 00:05:37,520 --> 00:05:40,630 If somebody gives us the value of n, we would like to be able 101 00:05:40,630 --> 00:05:44,100 to calculate this probability using this approximation. 102 00:05:44,100 --> 00:05:49,040 However, there's a slight difficulty because sigma is a 103 00:05:49,040 --> 00:05:55,210 function that depends on p, and it is not known. 
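Written out, the chain of equivalent events described in the derivation so far might look as follows. This is a sketch of the spoken argument: the symbols M_n, S_n, Z_n, mu = p and sigma = sqrt(p(1 - p)) follow the lecture, while the exact layout on the slide is an assumption.

```latex
\begin{align*}
\mathbf{P}\big(|M_n - p| \ge 0.01\big)
  &= \mathbf{P}\left(\left|\frac{S_n - np}{n}\right| \ge 0.01\right) \\
  &= \mathbf{P}\left(\left|\frac{S_n - np}{\sqrt{n}}\right| \ge 0.01\sqrt{n}\right) \\
  &= \mathbf{P}\left(\left|\frac{S_n - np}{\sqrt{n}\,\sigma}\right| \ge \frac{0.01\sqrt{n}}{\sigma}\right) \\
  &= \mathbf{P}\left(|Z_n| \ge \frac{0.01\sqrt{n}}{\sigma}\right)
   \approx \mathbf{P}\left(|Z| \ge \frac{0.01\sqrt{n}}{\sigma}\right),
\end{align*}
% where Z_n = (S_n - n p) / (\sqrt{n}\,\sigma) and Z is a standard normal.
```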
104 00:05:55,210 --> 00:05:58,670 However, as we discussed when we first started the polling 105 00:05:58,670 --> 00:06:02,460 problem, we do know that sigma is always less 106 00:06:02,460 --> 00:06:04,520 than or equal to 1/2. 107 00:06:04,520 --> 00:06:09,340 And this suggests that we could use here the worst-case 108 00:06:09,340 --> 00:06:13,000 value of the standard deviation, replace sigma by 109 00:06:13,000 --> 00:06:17,290 1/2 and instead look at this probability here. 110 00:06:17,290 --> 00:06:19,990 How are these two probabilities related? 111 00:06:19,990 --> 00:06:22,800 Which direction does the inequality go? 112 00:06:22,800 --> 00:06:25,500 A sketch will be useful here. 113 00:06:25,500 --> 00:06:33,200 Z is a standard normal, and it's centered at 0. 114 00:06:33,200 --> 00:06:39,650 Somewhere here, we have a value of 0.02 square root n. 115 00:06:39,650 --> 00:06:45,210 And somewhere further out, we have the value of 0.01 square 116 00:06:45,210 --> 00:06:49,040 root n divided by sigma. 117 00:06:49,040 --> 00:06:51,860 Why are these two values ordered this way? 118 00:06:51,860 --> 00:06:58,450 Since sigma is less than 1/2, 1 over sigma is bigger than 2. 119 00:06:58,450 --> 00:07:01,170 So this expression here is bigger than 120 00:07:01,170 --> 00:07:04,180 this expression there. 121 00:07:04,180 --> 00:07:09,250 Since the inequality goes this way, now we can compare these 122 00:07:09,250 --> 00:07:12,230 two events. 123 00:07:12,230 --> 00:07:16,570 This event, that Z is larger in absolute value than this 124 00:07:16,570 --> 00:07:22,170 number, is the probability of this tail of the distribution. 125 00:07:22,170 --> 00:07:26,540 And we will have a similar probability from the other end 126 00:07:26,540 --> 00:07:29,280 of the tail of the distribution. 
127 00:07:29,280 --> 00:07:31,880 Here we're talking about the probability of being larger 128 00:07:31,880 --> 00:07:36,159 than or equal to this number, which would correspond only to 129 00:07:36,159 --> 00:07:39,830 this part of the tail and, similarly, a small part of the 130 00:07:39,830 --> 00:07:42,200 tail from the other side. 131 00:07:42,200 --> 00:07:45,590 The blue event is smaller than the red event. 132 00:07:45,590 --> 00:07:49,159 This is the probability of the blue event, so it's going to 133 00:07:49,159 --> 00:07:54,970 be no larger than the probability of the red event. 134 00:07:54,970 --> 00:07:58,690 Now if somebody gives us a value of n, we should be able 135 00:07:58,690 --> 00:08:00,780 to calculate this probability. 136 00:08:00,780 --> 00:08:03,740 How do we calculate it? 137 00:08:03,740 --> 00:08:07,480 The probability that the absolute value is above a 138 00:08:07,480 --> 00:08:13,600 certain number is equal to the probability of this tail plus 139 00:08:13,600 --> 00:08:15,570 the probability of that tail. 140 00:08:15,570 --> 00:08:18,960 But because of the symmetry of the normal distribution, this 141 00:08:18,960 --> 00:08:24,450 is twice the probability of each one of the tails. 142 00:08:24,450 --> 00:08:26,390 What is the probability of this tail? 143 00:08:26,390 --> 00:08:30,740 It's 1 minus the probability of whatever is below that. 144 00:08:30,740 --> 00:08:32,669 So it's 1 minus. 145 00:08:32,669 --> 00:08:36,270 And the probability of being below that, this is the 146 00:08:36,270 --> 00:08:44,850 standard normal CDF evaluated at 0.02 square root n. 147 00:08:44,850 --> 00:08:48,870 So we do have now an expression for the desired 148 00:08:48,870 --> 00:08:52,890 probability, or at least a bound for it, which is 149 00:08:52,890 --> 00:08:57,640 expressed in terms of the standard normal CDF. 150 00:08:57,640 --> 00:09:01,970 If somebody gives you a value of n, you can plug in here. 
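The bound just described, 2 times (1 minus Phi of 0.02 square root n), can be evaluated numerically instead of with a table. The following is a minimal sketch, assuming Python 3.8 or later for `statistics.NormalDist`; the function name `error_probability_bound` and the `accuracy` parameter are illustrative and not part of the lecture.

```python
from math import sqrt
from statistics import NormalDist


def error_probability_bound(n, accuracy=0.01):
    """CLT-based upper bound on P(|M_n - p| >= accuracy).

    Uses the worst-case standard deviation sigma = 1/2, so the threshold
    accuracy * sqrt(n) / sigma becomes 2 * accuracy * sqrt(n).
    """
    threshold = 2 * accuracy * sqrt(n)  # 0.02 * sqrt(n) when accuracy = 0.01
    # Two symmetric tails of the standard normal: 2 * (1 - Phi(threshold)).
    return 2 * (1 - NormalDist().cdf(threshold))


if __name__ == "__main__":
    # Hypothetical sample sizes, chosen only for illustration.
    for n in (2_500, 10_000):
        print(n, error_probability_bound(n))
```

Because the central limit theorem is only an approximation for finite n, the returned number is an approximate bound rather than a guarantee.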
151 00:09:01,970 --> 00:09:06,410 If n is 10,000, then square root of n is 100. 152 00:09:06,410 --> 00:09:10,640 And this number becomes equal to 2. 153 00:09:10,640 --> 00:09:14,730 And so in this case, what we obtain is that the probability 154 00:09:14,730 --> 00:09:19,680 of interest is less than or equal to 2 times 1 155 00:09:19,680 --> 00:09:23,230 minus Phi of 2. 156 00:09:23,230 --> 00:09:27,960 Now we invoke the standard normal table. 157 00:09:27,960 --> 00:09:32,730 From the normal table, we obtain that this quantity is 158 00:09:32,730 --> 00:09:45,060 equal to twice 1 minus 0.9772, which evaluates to 0.046. 159 00:09:45,060 --> 00:09:53,300 So if we use 10,000 people in our sample, then we will get 160 00:09:53,300 --> 00:09:57,280 an accuracy of one percentage point with very high 161 00:09:57,280 --> 00:09:58,460 probability. 162 00:09:58,460 --> 00:10:02,620 The probability that we do not meet the specification so that 163 00:10:02,620 --> 00:10:05,620 the accuracy that we get is worse than one percentage 164 00:10:05,620 --> 00:10:08,660 point, that probability is quite small. 165 00:10:08,660 --> 00:10:12,450 It's 0.046. 166 00:10:12,450 --> 00:10:18,080 That is 4 and something percent. 167 00:10:18,080 --> 00:10:19,880 This is pretty good. 168 00:10:19,880 --> 00:10:25,070 And suppose that your boss now tells you, I only want the 169 00:10:25,070 --> 00:10:30,610 probability of not meeting the specs to be 5%. 170 00:10:30,610 --> 00:10:34,670 You look at this result, and you say, with 10,000, I 171 00:10:34,670 --> 00:10:38,810 achieved a probability of a large error 172 00:10:38,810 --> 00:10:42,210 that's less than 5%. 173 00:10:42,210 --> 00:10:45,990 This means that I probably have some leeway and that I 174 00:10:45,990 --> 00:10:49,820 can reduce the size of my sample. 175 00:10:49,820 --> 00:10:52,690 What could the size of the sample be and 176 00:10:52,690 --> 00:10:56,060 still meet those specs? 
177 00:10:56,060 --> 00:10:59,630 What we're trying to do here is that we have this 178 00:10:59,630 --> 00:11:04,880 approximation for the probability of interest, and 179 00:11:04,880 --> 00:11:07,500 we want to set this probability 180 00:11:07,500 --> 00:11:15,250 to a value of 0.05. 181 00:11:15,250 --> 00:11:18,860 Then we want to ask, what is the value of n that will 182 00:11:18,860 --> 00:11:22,900 result in this particular probability of 183 00:11:22,900 --> 00:11:26,060 not meeting the specs? 184 00:11:26,060 --> 00:11:28,090 Now we can do the algebra. 185 00:11:28,090 --> 00:11:33,080 And we find that this corresponds to requiring 186 00:11:33,080 --> 00:11:42,320 Phi of 0.02 square root n to be equal to 0.975. 187 00:11:42,320 --> 00:11:45,380 What's the interpretation of this? 188 00:11:45,380 --> 00:11:49,500 We want to choose n so that the probability of the two 189 00:11:49,500 --> 00:11:53,780 tails is 5%. 190 00:11:53,780 --> 00:11:58,620 This means that we want this probability here to be 2 and 191 00:11:58,620 --> 00:12:00,510 1/2 percent. 192 00:12:00,510 --> 00:12:03,240 This means that the probability of whatever is to 193 00:12:03,240 --> 00:12:13,550 the left of this number should be 0.975, including the tail. 194 00:12:13,550 --> 00:12:18,540 This means, again, that we have to look at the standard 195 00:12:18,540 --> 00:12:25,150 normal table and ask, what's the value for which the CDF is 196 00:12:25,150 --> 00:12:28,190 equal to 0.975? 197 00:12:28,190 --> 00:12:34,230 So we look around, and we find 0.975 to be here, and it 198 00:12:34,230 --> 00:12:38,130 corresponds to 1.96. 199 00:12:38,130 --> 00:12:42,750 This tells us that 0.02 square root n 200 00:12:42,750 --> 00:12:48,840 should be equal to 1.96. 
201 00:12:48,840 --> 00:12:54,240 Then we solve for n, and we find that the value of n is 202 00:12:54,240 --> 00:13:01,110 9,604, which is indeed some reduction from the 10,000 that 203 00:13:01,110 --> 00:13:02,360 we had originally. 204 00:13:04,450 --> 00:13:08,020 How does this relate to the real world? 205 00:13:08,020 --> 00:13:12,330 When you read newspapers about polls, you will never see 206 00:13:12,330 --> 00:13:16,100 sample sizes that are about 10,000. 207 00:13:16,100 --> 00:13:20,290 You will usually see sample sizes of the order of 1,000, 208 00:13:20,290 --> 00:13:22,530 sometimes even smaller. 209 00:13:22,530 --> 00:13:24,400 How can they do that? 210 00:13:24,400 --> 00:13:28,100 Well, they can do that because the specs that they impose are 211 00:13:28,100 --> 00:13:31,140 not as tight as the specs that we have here. 212 00:13:31,140 --> 00:13:35,100 Usually, they tell you that the results are accurate 213 00:13:35,100 --> 00:13:38,850 within three percentage points, let's say, instead of 214 00:13:38,850 --> 00:13:40,600 one percentage point. 215 00:13:40,600 --> 00:13:46,420 And if you move from 0.01 to 0.03 and repeat those 216 00:13:46,420 --> 00:13:50,090 calculations, you will find that a sample size of about 217 00:13:50,090 --> 00:13:53,690 1,000 will actually do.
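The sample-size calculation can be repeated for other specs by inverting the standard normal CDF in code rather than reading the table. The sketch below assumes Python 3.8 or later for `statistics.NormalDist`; the function name `required_sample_size` and its parameters are hypothetical, and rounding up to the next whole person is a choice made here, not something stated in the lecture.

```python
from math import ceil
from statistics import NormalDist


def required_sample_size(accuracy, failure_prob=0.05):
    """Smallest n with 2 * (1 - Phi(2 * accuracy * sqrt(n))) <= failure_prob.

    Relies on the CLT approximation and the worst-case value sigma = 1/2.
    """
    # Value z with Phi(z) = 1 - failure_prob / 2 (about 1.96 for 5%).
    z = NormalDist().inv_cdf(1 - failure_prob / 2)
    # Solve 2 * accuracy * sqrt(n) = z for n, then round up.
    return ceil((z / (2 * accuracy)) ** 2)


if __name__ == "__main__":
    print(required_sample_size(0.01))  # one percentage point
    print(required_sample_size(0.03))  # three percentage points
```

With a 5% failure probability this gives 9,604 for an accuracy of 0.01 and 1,068 for an accuracy of 0.03, in line with the figures quoted in the lecture.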