1 00:00:00,920 --> 00:00:04,240 We will now consider a very practical application of the 2 00:00:04,240 --> 00:00:07,060 weak law of large numbers, and the calculations 3 00:00:07,060 --> 00:00:08,670 associated with it. 4 00:00:08,670 --> 00:00:10,970 The application has to do with polling. 5 00:00:10,970 --> 00:00:14,200 There's a certain referendum that's going to take place. 6 00:00:14,200 --> 00:00:16,940 We're close enough to the day of the referendum so that 7 00:00:16,940 --> 00:00:20,270 voters have made up their minds, and there is a fraction 8 00:00:20,270 --> 00:00:24,300 p of the population that represents the voters that are 9 00:00:24,300 --> 00:00:25,780 going to vote yes. 10 00:00:25,780 --> 00:00:28,920 But the referendum has not yet taken place, and you want to 11 00:00:28,920 --> 00:00:34,990 find out, to predict or estimate what p actually is. 12 00:00:34,990 --> 00:00:39,280 What you do is that you go ahead, and you select at 13 00:00:39,280 --> 00:00:42,750 random a number of people out of the population. 14 00:00:42,750 --> 00:00:46,690 And for each person, you record their answer, whether 15 00:00:46,690 --> 00:00:49,810 they intend to vote yes, or whether they 16 00:00:49,810 --> 00:00:52,120 intend to vote no. 17 00:00:52,120 --> 00:00:55,630 When we say that the people are randomly selected, what we 18 00:00:55,630 --> 00:01:00,890 mean is that we choose them uniformly from the population. 19 00:01:00,890 --> 00:01:05,110 And since there's a fraction p that will vote yes, this means 20 00:01:05,110 --> 00:01:09,210 that this random variable will be 1 with probability p, and 21 00:01:09,210 --> 00:01:14,860 therefore, the expected value of Xi is equal to p. 22 00:01:14,860 --> 00:01:18,680 In addition, we assume that we select those people 23 00:01:18,680 --> 00:01:19,930 independently. 24 00:01:21,810 --> 00:01:25,789 Now note that if we select people independently, there's 25 00:01:25,789 --> 00:01:30,289 always a chance that the first person polled will be the same 26 00:01:30,289 --> 00:01:33,530 as the second person polled, something that we do not 27 00:01:33,530 --> 00:01:35,140 really want to happen. 28 00:01:35,140 --> 00:01:39,729 However, if we assume that the population is very large, or 29 00:01:39,729 --> 00:01:43,360 even idealize the situation by assuming that the population 30 00:01:43,360 --> 00:01:46,360 is infinite, then this is never going to happen, and 31 00:01:46,360 --> 00:01:49,620 this will not be a concern. 32 00:01:49,620 --> 00:01:51,820 So how do we proceed? 33 00:01:51,820 --> 00:01:54,350 We look at the results that we got from the 34 00:01:54,350 --> 00:01:55,720 people that we polled. 35 00:01:55,720 --> 00:02:00,420 We count how many said yes, divide by n, and this gives us 36 00:02:00,420 --> 00:02:04,310 the fraction of yeses in the sample that we have obtained. 37 00:02:04,310 --> 00:02:08,380 And this is a pretty reasonable estimate for the 38 00:02:08,380 --> 00:02:11,920 unknown fraction p, the fraction of yeses in the 39 00:02:11,920 --> 00:02:14,770 overall population. 40 00:02:14,770 --> 00:02:18,260 Now perhaps your boss has asked you to find out the 41 00:02:18,260 --> 00:02:20,120 exact value of p. 42 00:02:20,120 --> 00:02:22,310 What should your response be? 43 00:02:22,310 --> 00:02:27,490 Well, there is no way to calculate p exactly on the 44 00:02:27,490 --> 00:02:31,170 basis of a finite and random poll. 45 00:02:31,170 --> 00:02:35,090 Therefore, there is going to be some error in our 46 00:02:35,090 --> 00:02:37,320 estimation of p. 47 00:02:37,320 --> 00:02:42,030 Then, perhaps your boss comes back and says, OK, then try to 48 00:02:42,030 --> 00:02:45,970 give me an estimate of p which is very accurate. 49 00:02:45,970 --> 00:02:49,030 I would like you to come up with an estimate which is 50 00:02:49,030 --> 00:02:52,250 correct within one percentage point. 51 00:02:52,250 --> 00:02:54,710 Can you do this for me? 52 00:02:54,710 --> 00:03:00,370 Your answer might be, OK, let me try polling 10,000 people, 53 00:03:00,370 --> 00:03:05,080 and see if I can guarantee for you such a small error. 54 00:03:05,080 --> 00:03:08,050 But after you think about the situation a little more, you 55 00:03:08,050 --> 00:03:12,350 realize that there is no way of guaranteeing such a small 56 00:03:12,350 --> 00:03:13,820 error with certainty. 57 00:03:13,820 --> 00:03:17,329 What if your unlucky, and the people that you poll happen to 58 00:03:17,329 --> 00:03:20,870 be not representative of the true population? 59 00:03:20,870 --> 00:03:24,070 So you come back to your boss and you say, I cannot 60 00:03:24,070 --> 00:03:27,460 guarantee with certainty that the error is going to be 61 00:03:27,460 --> 00:03:31,460 small, but perhaps I can guarantee for you that the 62 00:03:31,460 --> 00:03:36,420 error that I get is small with high probability. 63 00:03:36,420 --> 00:03:40,610 Or alternatively, I'm going to guarantee for you that the 64 00:03:40,610 --> 00:03:43,890 probability that we get an error that's bigger than that 65 00:03:43,890 --> 00:03:46,070 is very small. 66 00:03:46,070 --> 00:03:48,750 So how small is it going to be? 67 00:03:48,750 --> 00:03:53,000 Let's try to derive a bound on this probability of an error 68 00:03:53,000 --> 00:03:55,560 larger than one percentage point. 69 00:03:55,560 --> 00:03:59,250 Using the calculations that we carried out when we derived 70 00:03:59,250 --> 00:04:03,450 the weak law of large numbers, we know that this probability 71 00:04:03,450 --> 00:04:08,200 of a large error is bounded above by this quantity. 72 00:04:08,200 --> 00:04:10,040 What is this quantity? 73 00:04:10,040 --> 00:04:14,650 Sigma squared is the variance of the random variable that 74 00:04:14,650 --> 00:04:15,580 we're sampling. 75 00:04:15,580 --> 00:04:19,450 And since this is Bernoulli, this variance is p times 1 76 00:04:19,450 --> 00:04:25,090 minus p, and then we divide by n, which in our case is 10 to 77 00:04:25,090 --> 00:04:27,960 the fourth times epsilon squared. 78 00:04:27,960 --> 00:04:31,340 Epsilon is 10 to the minus 2, so here we have 10 79 00:04:31,340 --> 00:04:33,570 to the minus 4. 80 00:04:33,570 --> 00:04:35,620 OK, but now, what is this quantity? 81 00:04:35,620 --> 00:04:40,620 This quantity depends on p, and we do not know what p is. 82 00:04:40,620 --> 00:04:44,420 However, if you take this expression, and plot it as a 83 00:04:44,420 --> 00:04:51,580 function of p, what you obtain is a plot of this type. 84 00:04:51,580 --> 00:04:57,030 And the maximum happens when p is equal to 1/2, in which case 85 00:04:57,030 --> 00:04:59,970 we get a value of 1/4. 86 00:04:59,970 --> 00:05:04,060 That is, the variance of the Bernoulli is, at most, 1/4. 87 00:05:04,060 --> 00:05:07,970 And therefore, we obtain this bound here where the 88 00:05:07,970 --> 00:05:10,710 denominator terms have disappeared because they're 89 00:05:10,710 --> 00:05:12,250 equal to 1. 90 00:05:12,250 --> 00:05:16,710 So you tell your boss, if I sample 10,000 people, then the 91 00:05:16,710 --> 00:05:20,590 probability of an error more than the one percentage point 92 00:05:20,590 --> 00:05:23,860 is going to be less than 25%. 93 00:05:23,860 --> 00:05:28,600 At which point, your boss might reply and say, well, a 94 00:05:28,600 --> 00:05:32,659 probability of a large error of 25% is too big. 95 00:05:32,659 --> 00:05:34,580 This is unacceptable. 96 00:05:34,580 --> 00:05:38,850 I would like you to have a probability of error that's 97 00:05:38,850 --> 00:05:43,170 less than 5%. 98 00:05:43,170 --> 00:05:48,260 So suppose now that we want to change this, and make it only 99 00:05:48,260 --> 00:05:50,700 a 5% error-- 100 00:05:50,700 --> 00:05:52,909 5% or less. 101 00:05:52,909 --> 00:05:55,240 How are you going to proceed? 102 00:05:55,240 --> 00:05:58,870 Well, you have this quantity here, this upper bound, which 103 00:05:58,870 --> 00:06:05,720 we know to be less than or equal to 1/4 divided by n 104 00:06:05,720 --> 00:06:08,960 times epsilon squared, which is, in our case, 10 105 00:06:08,960 --> 00:06:10,850 to the minus 4. 106 00:06:10,850 --> 00:06:17,020 We would like this quantity to be less than or equal to 5%, 107 00:06:17,020 --> 00:06:21,420 which is 5/10 to the second power. 108 00:06:21,420 --> 00:06:25,380 And after you solve this inequality, you find that this 109 00:06:25,380 --> 00:06:30,180 is equivalent to taking n larger than or equal 110 00:06:30,180 --> 00:06:32,790 to 10 to the sixth. 111 00:06:32,790 --> 00:06:36,010 And then the five together with that four gives us a 112 00:06:36,010 --> 00:06:38,960 denominator of 20. 113 00:06:38,960 --> 00:06:43,860 And this number is equal to 50,000. 114 00:06:43,860 --> 00:06:47,560 So at this point, you can go back to your boss and tell 115 00:06:47,560 --> 00:06:51,200 them that one way of guaranteeing that the 116 00:06:51,200 --> 00:06:55,909 probability of a large error is less than or equal to 5% is 117 00:06:55,909 --> 00:06:58,850 to take n equal to 50,000. 118 00:06:58,850 --> 00:07:05,260 So 50,000 will suffice to achieve the desired specs. 119 00:07:05,260 --> 00:07:08,880 Notice that the desired specs have two parameters. 120 00:07:08,880 --> 00:07:14,140 One is the accuracy that you want, and the other is the 121 00:07:14,140 --> 00:07:17,870 confidence with which the accuracy 122 00:07:17,870 --> 00:07:21,040 is going to be achieved. 123 00:07:21,040 --> 00:07:24,420 Now 50,000 is a pretty large number. 124 00:07:24,420 --> 00:07:27,890 If you notice the results of polls, the way that they are 125 00:07:27,890 --> 00:07:31,690 presented in newspapers, they usually tell you that there's 126 00:07:31,690 --> 00:07:36,570 an accuracy of plus or minus three percentage points, not 127 00:07:36,570 --> 00:07:38,380 one percentage point. 128 00:07:38,380 --> 00:07:41,250 That helps things a little. 129 00:07:41,250 --> 00:07:43,670 It means that you can do with a somewhat 130 00:07:43,670 --> 00:07:46,060 smaller sample size. 131 00:07:46,060 --> 00:07:48,440 And then, there's another effect. 132 00:07:48,440 --> 00:07:52,790 Our calculation here was based on this inequality, which is 133 00:07:52,790 --> 00:07:54,390 the Chebyshev inequality. 134 00:07:54,390 --> 00:07:58,650 But the Chebyshev inequality is not that accurate. 135 00:07:58,650 --> 00:08:03,120 It turns out that if we use more accurate estimates of 136 00:08:03,120 --> 00:08:07,620 this probability, we will find that actually much smaller 137 00:08:07,620 --> 00:08:10,910 values of n will be enough for our purposes.