The following content is provided under a Creative Commons license. Your support will help MIT OpenCourseWare continue to offer high quality educational resources for free. To make a donation or view additional materials from hundreds of MIT courses, visit MIT OpenCourseWare at ocw.mit.edu.

JOHN TSITSIKLIS: We're going to start a new unit today, so we will be talking about limit theorems. Just to introduce the topic, let's think of the following situation. There's a population of penguins down at the South Pole. If you were to pick a penguin at random and measure its height, the expected value of that height would be the average of the heights of the different penguins in the population. So suppose that when you pick one, every penguin is equally likely. Then the expected value is just the average over all the penguins out there.

So your boss asks you to find out what that expected value is. One way would be to go and measure each and every penguin. That might be a little time consuming. So alternatively, what you can do is go and pick penguins at random -- pick a few of them, let's say a number n of them. You measure the height of each one, and then you calculate the average of the heights of those penguins that you have collected. This is your estimate of the expected value.

Now, we call this the sample mean, which is the mean value, but within the sample that you have collected. This is something that sort of feels the same as the expected value, which is, again, the mean. But the expected value is a different kind of mean. The expected value is the mean over the entire population, whereas the sample mean is the average over the smaller sample that you have measured. The expected value is a number. The sample mean is a random variable. It's a random variable because the sample you have collected is random.

Now, we think that this is a reasonable way of estimating the expectation.
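(To make the sample-mean idea concrete, here is a minimal Python sketch, not part of the lecture. The population size, the Gaussian height model, and the specific numbers are illustrative assumptions only.)

```python
# Sketch: the sample mean M_n of n randomly chosen penguin heights,
# compared with the population average (the expected value).
# All parameters below are made up for illustration.
import random

random.seed(0)
population = [random.gauss(100.0, 10.0) for _ in range(1_000_000)]  # hypothetical heights
true_mean = sum(population) / len(population)                        # the expected value E[X]

for n in [10, 100, 10_000]:
    sample = [random.choice(population) for _ in range(n)]           # pick n penguins at random
    sample_mean = sum(sample) / n                                    # M_n = (X_1 + ... + X_n) / n
    print(f"n={n}: sample mean = {sample_mean:.2f}, true mean = {true_mean:.2f}")
```

As the sample size n grows, the printed sample mean tends to land closer and closer to the true mean, which is exactly the behavior the limit theorems discussed below make precise.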
So in the limit as n goes to infinity, it's plausible that the sample mean, the estimate that we are constructing, should somehow get close to the expected value. What does this mean? What does it mean to get close? In what sense? And is this statement true? This is the kind of statement that we deal with when dealing with limit theorems. That's the subject of limit theorems: what happens when you're dealing with lots and lots of random variables, and perhaps take averages and so on.

So why do we bother with this? Well, if you're in the sampling business, it would be reassuring to know that this particular way of estimating the expected value actually gets you close to the true answer. There's also a higher level reason, which is a little more abstract and mathematical. Probability problems are easy to deal with if you have one or two random variables in your hands. You can write down their mass functions, joint density functions, and so on. You can calculate on paper or on a computer, and you can get the answers. Probability problems become computationally intractable if you're dealing, let's say, with 100 random variables and you're trying to get exact answers for anything. In principle, the same formulas that we have still apply. But they involve summations over large ranges of combinations of indices, and that makes life extremely difficult.

But when you push the envelope and you go to a situation where you're dealing with a very, very large number of variables, then you can start taking limits. And when you take limits, wonderful things happen. Many formulas start simplifying, and you can actually get useful answers by considering those limits. That's sort of the big reason why looking at limit theorems is a useful thing to do.

So what we're going to do today: first, we're going to start with a useful, simple tool that allows us to relate probabilities to expected values. The Markov inequality is the first inequality we're going to write down.
And then, using that, we're going to get the Chebyshev inequality, a related inequality. Then we need to define what we mean by convergence when we talk about random variables. It's a notion that's a generalization of the usual notion of convergence of a sequence of numbers. And once we have our notion of convergence, we're going to see that, indeed, the sample mean converges to the true mean -- converges to the expected value of the X's. This statement is called the weak law of large numbers. The reason it's called the weak law is because there's also a strong law, which is a statement with the same flavor but with a somewhat different mathematical content. It's a little more abstract, and we will not be getting into it. So the weak law is all that you're going to get.

All right. So now we start our digression. And our first tool will be the so-called Markov inequality.

So let's take a random variable that's always non-negative. No matter what, it takes no negative values. To keep things simple, let's assume it's a discrete random variable. So the expected value is the sum, over all possible values that the random variable can take, of those values weighted according to their corresponding probabilities.

Now, this is a sum over all x's. But x takes non-negative values, and the PMF is also non-negative. So if I take a sum over fewer things, I'm going to get a smaller value. So the sum that I get if I only add those terms that are bigger than a certain constant is less than or equal to the sum when I add over everything.

Now, if I'm adding over x's that are bigger than a, the x that shows up there will always be larger than or equal to a. So we get this inequality. And now, a is a constant. I can pull it outside the summation.
And then I'm left with the probabilities of all the x's that are bigger than a. And that's just the probability of being bigger than a.

OK, so that's the Markov inequality. It basically tells us that the expected value is larger than or equal to this number. It relates expected values to probabilities. It tells us that if the expected value is small, then the probability that X is big is also going to be small. So it translates a statement about smallness of expected values to a statement about smallness of probabilities.

OK. What we actually need is a somewhat different version of this same statement. What we're going to do is apply this inequality to a non-negative random variable of a special type. You can think of applying this same calculation to a random variable of this form, (X minus mu)-squared, where mu is the expected value of X. Now, this is a non-negative random variable.

So the expected value of this random variable, which is the variance, by following the same reasoning as we had in that derivation up to there, is bigger than or equal to the probability that this random variable is bigger than some value -- let me use a-squared instead of an a -- times that value a-squared. So now, of course, this probability is the same as the probability that the absolute value of X minus mu is bigger than or equal to a, times a-squared. And this side is equal to the variance of X. So this relates the variance of X to the probability that our random variable is far away from its mean. If the variance is small, then it means that the probability of being far away from the mean is also small.

So I derived this by applying the Markov inequality to this particular non-negative random variable. Or, just to reinforce the message, perhaps, and increase your confidence in this inequality, let's look at the derivation once more, where I'm going to start from first principles, but use the same idea as the one that was used in the proof out here.
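(Before that second derivation, here is a quick numerical sanity check of both bounds -- a rough sketch that is not part of the lecture; the exponential distribution used here is an arbitrary illustrative choice.)

```python
# Monte Carlo check of
#   Markov:    P(X >= a)        <= E[X] / a        for X >= 0
#   Chebyshev: P(|X - mu| >= a) <= Var(X) / a^2
# using exponential samples purely for illustration.
import random

random.seed(1)
samples = [random.expovariate(1.0) for _ in range(200_000)]   # E[X] = 1, Var(X) = 1
mu = sum(samples) / len(samples)
var = sum((x - mu) ** 2 for x in samples) / len(samples)

for a in [2.0, 3.0, 5.0]:
    p_markov = sum(x >= a for x in samples) / len(samples)
    p_cheb = sum(abs(x - mu) >= a for x in samples) / len(samples)
    print(f"a={a}: P(X>=a)={p_markov:.4f} <= {mu / a:.4f}    "
          f"P(|X-mu|>=a)={p_cheb:.4f} <= {var / a ** 2:.4f}")
```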
OK. So just for variety, now let's think of X as being a continuous random variable. The derivation is the same whether it's discrete or continuous. By definition, the variance is this particular integral. Now, the integral is going to become smaller if, instead of integrating over the full range, I only integrate over x's that are far away from the mean. So mu is the mean. Think of c as some big number. These are the x's that are far away from the mean to the left, from minus infinity to mu minus c. And these are the x's that are far away from the mean on the positive side.

So by integrating over less, I'm getting a smaller integral. Now, for any x in this range, this distance, x minus mu, is at least c. So that squared is at least c-squared. So this term, over this range of integration, is at least c-squared, and I can take it outside the integral. And I'm left just with the integral of the density. Same thing on the other side. So what factors out is this term c-squared, and inside, we're left with the probability of being to the left of mu minus c, plus the probability of being to the right of mu plus c, which is the same as the probability that the absolute value of the distance from the mean is larger than or equal to c.

So that's the same inequality that we proved there, except that here I'm using c. There I used a, but it's exactly the same one. This inequality is maybe easier to understand if you take that term, send it to the other side, and write it in this form. What does it tell us? It tells us that if c is a big number, the probability of being more than c away from the mean is going to be a small number. When c is big, this is small. Now, this is intuitive. The variance is a measure of the spread of the distribution, how wide it is.
It tells us that if the variance is small, the distribution is not very wide. And mathematically, this translates to the statement that when the variance is small, the probability of being far away is going to be small. And the further away you're looking -- that is, if c is a bigger number -- that probability also becomes small.

Maybe an even more intuitive way to think about the content of this inequality is to use, instead of c, the number k sigma, where k is positive and sigma is the standard deviation. So let's just plug k sigma in the place of c. Then this becomes (k sigma)-squared. The sigma-squared's cancel, and we're left with 1 over k-squared. Now, what is this? This is the event that you are k standard deviations away from the mean. So, for example, this statement here tells you that if you look at the test scores from a quiz, what fraction of the class are 3 standard deviations away from the mean? It's possible, but it's not going to be a lot of people. It's going to be, at most, 1/9 of the class that can be 3 standard deviations or more away from the mean.

So the Chebyshev inequality is a really useful one. It comes in handy whenever you want to relate probabilities and expected values. If you know that your expected values or, in particular, your variance is small, this tells you something about tail probabilities.

So this is the end of our first digression. We have this inequality in our hands. Our second digression is to talk about limits. We want to eventually talk about limits of random variables, but as a warm up, we're going to start with limits of sequences. So you're given a sequence of numbers, a1, a2, a3, and so on. And we want to define the notion that a sequence converges to a number. You sort of know what this means, but let's just go through it some more. So here's a. We have our sequence of values as n increases.
What we mean by the sequence converging to a is that when you look at those values, they get closer and closer to a. So this value here is your typical a sub n. They get closer and closer to a, and they stay close.

So let's try to make that more precise. What it means is, let's first fix a sense of what it means to be close. Let me look at an interval that goes from a - epsilon to a + epsilon. Then if my sequence converges to a, this means that as n increases, eventually the values of the sequence that I get stay inside this band. Since they converge to a, this means that eventually they will be smaller than a + epsilon and bigger than a - epsilon. So convergence means that, given a band of positive length around the number a, the values of the sequence that you get eventually get inside and stay inside that band. That's sort of the picture definition of what convergence means.

So now let's translate this into a mathematical statement. Given a band of positive length, no matter how wide that band is or how narrow it is -- so for every epsilon positive -- eventually the sequence gets inside the band. What does eventually mean? There exists a time, such that after that time something happens. And the something that happens is that after that time, we are inside that band. So this is a formal mathematical definition, which actually translates what I was saying in the wordy way before, and showing in terms of the picture. Given a certain band, even if it's narrow, eventually, after a certain time n0, the values of the sequence are going to stay inside this band. Now, if I were to take epsilon to be very small, this thing would still be true -- eventually I'm going to get inside the band -- except that I may have to wait longer for the values to get inside here.

All right, that's what it means for a deterministic sequence to converge to something. Now, how about random variables? What does it mean for a sequence of random variables to converge to a number?
We're just going to twist the word definition a little bit. For numbers, we said that eventually the numbers get inside that band. But if instead of numbers we have random variables with a certain distribution -- so here, instead of a_n, we're dealing with a random variable that has a distribution, let's say, of this kind -- what we want is that this distribution gets inside this band, that it gets concentrated inside here.

What does it mean that the distribution gets inside this band? I mean, a random variable has a distribution. It may have some tails, so maybe not the entire distribution gets concentrated inside the band. But we want more and more of this distribution to be concentrated in this band. So that, in a sense, the probability of falling outside the band converges to 0 -- becomes smaller and smaller.

So in words, we're going to say that the sequence of random variables, or the sequence of probability distributions, which would be the same thing, converges to a particular number a if the following is true. If I consider a small band around a, then the probability that my random variable falls outside this band, which is the area under this curve, becomes smaller and smaller as n goes to infinity. The probability of being outside this band converges to 0.

So that's the intuitive idea. In the beginning, maybe our distribution is sitting everywhere. As n increases, the distribution starts to get concentrated inside the band. When n is even bigger, our distribution is even more inside that band, so that these outside probabilities become smaller and smaller.

So the corresponding mathematical statement is the following. I fix a band around a, a +/- epsilon. Given that band, the probability of falling outside this band converges to 0. Or, another way to say it, the limit of this probability is equal to 0.
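(Written compactly -- a notation sketch, not shown on the slide -- the definition of convergence in probability reads:)

```latex
% Convergence in probability of Y_n to the number a:
% for every tolerance epsilon > 0, the tail probability vanishes.
\[
Y_n \xrightarrow{\ p\ } a
\quad\Longleftrightarrow\quad
\forall \varepsilon > 0:\;
\lim_{n \to \infty} \mathbf{P}\bigl( |Y_n - a| \ge \varepsilon \bigr) = 0 .
\]
```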
376 00:20:26,560 --> 00:20:29,720 If you were to translate this into a complete mathematical 377 00:20:29,720 --> 00:20:31,800 statement, you would have to write down the 378 00:20:31,800 --> 00:20:34,150 following messy thing. 379 00:20:34,150 --> 00:20:37,220 For every epsilon positive -- 380 00:20:37,220 --> 00:20:39,480 that's this statement -- 381 00:20:39,480 --> 00:20:41,240 the limit is 0. 382 00:20:41,240 --> 00:20:44,610 What does it mean that the limit of something is 0? 383 00:20:44,610 --> 00:20:47,670 We flip back to the previous slide. 384 00:20:47,670 --> 00:20:48,110 Why? 385 00:20:48,110 --> 00:20:51,430 Because a probability is a number. 386 00:20:51,430 --> 00:20:54,720 So here we're talking about a sequence of numbers 387 00:20:54,720 --> 00:20:56,340 convergent to 0. 388 00:20:56,340 --> 00:20:58,190 What does it mean for a sequence of numbers to 389 00:20:58,190 --> 00:20:59,180 converge to 0? 390 00:20:59,180 --> 00:21:05,320 It means that for any epsilon prime positive, there exists 391 00:21:05,320 --> 00:21:11,230 some n0 such that for every n bigger than n0 the 392 00:21:11,230 --> 00:21:12,770 following is true -- 393 00:21:12,770 --> 00:21:16,450 that this probability is less than or 394 00:21:16,450 --> 00:21:17,860 equal to epsilon prime. 395 00:21:17,860 --> 00:21:20,610 396 00:21:20,610 --> 00:21:27,660 So the mathematical statement is a little hard to parse. 397 00:21:27,660 --> 00:21:32,270 For every size of that band, and then you take the 398 00:21:32,270 --> 00:21:34,990 definition of what it means for the limit of a sequence of 399 00:21:34,990 --> 00:21:37,720 numbers to converge to 0. 400 00:21:37,720 --> 00:21:42,340 But it's a lot easier to describe this in words and, 401 00:21:42,340 --> 00:21:45,010 basically, think in terms of this picture. 402 00:21:45,010 --> 00:21:48,690 That as n increases, the probability of falling outside 403 00:21:48,690 --> 00:21:51,305 those bands just become smaller and smaller. 404 00:21:51,305 --> 00:21:56,590 So the statement is that our distribution gets concentrated 405 00:21:56,590 --> 00:22:01,340 in arbitrarily narrow little bands around that 406 00:22:01,340 --> 00:22:05,050 particular number a. 407 00:22:05,050 --> 00:22:05,350 OK. 408 00:22:05,350 --> 00:22:07,790 So let's look at an example. 409 00:22:07,790 --> 00:22:11,660 Suppose a random variable Yn has a discrete distribution of 410 00:22:11,660 --> 00:22:13,720 this particular type. 411 00:22:13,720 --> 00:22:17,150 Does it converge to something? 412 00:22:17,150 --> 00:22:19,570 Well, the probability distribution of this random 413 00:22:19,570 --> 00:22:22,370 variable gets concentrated at 0 -- 414 00:22:22,370 --> 00:22:26,520 there's more and more probability of being at 0. 415 00:22:26,520 --> 00:22:29,710 If I fix a band around 0 -- 416 00:22:29,710 --> 00:22:34,850 so if I take the band from minus epsilon to epsilon and 417 00:22:34,850 --> 00:22:36,520 look at that band-- 418 00:22:36,520 --> 00:22:42,350 the probability of falling outside this band is 1/n. 419 00:22:42,350 --> 00:22:45,780 As n goes to infinity, that probability goes to 0. 420 00:22:45,780 --> 00:22:50,550 So in this case, we do have convergence. 421 00:22:50,550 --> 00:22:56,780 And Yn converges in probability to the number 0. 
So this just captures the fact, obvious from this picture, that more and more of our probability distribution gets concentrated around 0 as n goes to infinity.

Now, an interesting thing to notice is the following. Even though Yn converges to 0, if you were to write down the expected value of Yn, what would it be? It's going to be n times the probability of this value, which is 1/n. So the expected value turns out to be 1. And if you were to look at the expected value of Yn-squared, this would be 0 times this probability, plus n-squared times this probability, which is equal to n. And this actually goes to infinity.

So we have this, perhaps, strange situation where a random variable goes to 0, but the expected value of this random variable does not go to 0. And the second moment of that random variable actually goes to infinity. So this tells us that convergence in probability tells you something, but it doesn't tell you the whole story. Convergence to 0 of a random variable doesn't imply anything about convergence of expected values or of variances and so on. The reason is that convergence in probability tells you that this tail probability here is very small. But it doesn't tell you how far out that tail goes. As in this example, the tail probability is small, but that tail sits far away, so it gives a disproportionate contribution to the expected value or to the expected value of the square.

OK. So now we've got everything that we need to go back to the sample mean and study its properties. So the setting is that we have a sequence of random variables. They're independent. They have the same distribution. And we assume that they have a finite mean and a finite variance. We're looking at the sample mean.

Now, in principle, you can calculate the probability distribution of the sample mean, because we know how to find the distributions of sums of independent random variables.
You use the convolution formula over and over. But this is pretty complicated, so let's not look at that. Let's just look at expected values, variances, and the probability that the sample mean is far away from the true mean.

So what is the expected value of this random variable? The expected value of a sum of random variables is the sum of the expected values. And then we have this factor of n in the denominator. Each one of these expected values is mu, so we get mu. So the sample mean -- the average value of this Mn, in expectation -- is the same as the true mean inside our population.

Now, here is a fine conceptual point: there are two kinds of averages involved when you write down this expression. We understand that expectations are some kind of average. The sample mean is also an average, over the values that we have observed. But these are two different kinds of averages. The sample mean is the average of the heights of the penguins that we collected over a single expedition. The expected value is to be thought of as follows: my probabilistic experiment is one expedition to the South Pole. Expected value here means thinking of the average over a huge number of expeditions. So my expedition is a random experiment, I collect random samples, and I record Mn. The average result of an expedition is what we would get if we were to carry out a zillion expeditions and average the averages that we get at each particular expedition. So this Mn is the average during a single expedition. This expectation is the average over an imagined infinite sequence of expeditions.

And of course, the other thing to always keep in mind is that expectations give you numbers, whereas the sample mean is actually a random variable.

All right. So this random variable -- how random is it? How big is its variance? The variance of a sum of independent random variables is the sum of the variances.
But since we're dividing by n, when you calculate variances this brings in a factor of n-squared. So the variance is sigma-squared over n. In particular, the variance of the sample mean becomes smaller and smaller. It means that when you estimate the average height of penguins, if you take a large sample, then your estimate is not going to be too random. The randomness in your estimate becomes small if you have a large sample size. Having a large sample size kind of removes the randomness from your experiment.

Now let's apply the Chebyshev inequality to say something about tail probabilities for the sample mean. The probability that you are more than epsilon away from the true mean is less than or equal to the variance of this quantity divided by this number squared. That's just the translation of the Chebyshev inequality to the particular context we've got here. We found the variance; it's sigma-squared over n. So we end up with this expression.

So what does this expression do? For any given epsilon -- if I fix epsilon -- this probability, which is less than sigma-squared over (n epsilon-squared), converges to 0 as n goes to infinity. And this is just the definition of convergence in probability. If this happens -- the probability of being more than epsilon away from the mean goes to 0, and this is true no matter how I choose my epsilon -- then by definition we have convergence in probability. So we have proved that the sample mean converges in probability to the true mean. And this is what the weak law of large numbers tells us.

So in some vague sense, it tells us that when you take the average of many, many measurements in your sample, the sample mean is a good estimate of the true mean, in the sense that it approaches the true mean as your sample size increases. It approaches the true mean, but of course in a very specific sense -- in probability, according to this notion of convergence that we have used.
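(Here is a minimal simulation sketch of the weak law of large numbers, not from the lecture; the Uniform(0,1) choice for the X's is an illustrative assumption.)

```python
# Check that the sample mean M_n concentrates around the true mean mu, and
# compare the empirical tail probability P(|M_n - mu| >= eps) with the
# Chebyshev bound sigma^2 / (n * eps^2).
import random

random.seed(2)
mu, sigma2 = 0.5, 1.0 / 12.0        # mean and variance of Uniform(0, 1)
eps, trials = 0.05, 2_000

for n in [10, 100, 1_000]:
    exceed = 0
    for _ in range(trials):
        m_n = sum(random.random() for _ in range(n)) / n
        if abs(m_n - mu) >= eps:
            exceed += 1
    print(f"n={n}: empirical tail = {exceed / trials:.3f}, "
          f"Chebyshev bound = {sigma2 / (n * eps ** 2):.3f}")
```

The empirical tail probability shrinks toward 0 as n grows, which is the convergence in probability that the weak law asserts; the Chebyshev bound is loose, but it also goes to 0.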
So since we're talking about sampling, let's go over an example, which is the typical situation faced by someone who's constructing a poll. You're interested in some property of the population -- say, what fraction of the population prefers Coke to Pepsi? So there's a number f, which is that fraction of the population, and this is an exact number. If out of a population of 100 million, 20 million prefer Coke, then f would be 0.2.

We want to find out what that fraction is. We cannot ask everyone. What we're going to do is take a random sample of people and ask them for their preferences. So the ith person either says yes, for Coke, or no. And we record that by putting a 1 each time that we get a yes answer. And then we form the average of these X's. What is this average? It's the number of 1's that we got, divided by n. So this is a fraction, but calculated only on the basis of the sample that we have. So you can think of this as being an estimate, f_hat, based on the sample that we have. Now, even though we used the lower case letter here, this f_hat is, of course, a random variable. f is a number; this is the true fraction in the overall population. f_hat is the estimate that we get by using our particular sample.

OK. So your boss tells you, I need to know what f is, so go and do some sampling. What are you going to respond? Unless I ask everyone in the whole population, there's no way for me to know f exactly. Right? There's no way.

OK, so the boss tells you, well, OK, then tell me f within some accuracy. I want an answer from you -- that's your answer -- which is close to the correct answer within 1 percentage point. So if the true f is 0.4, your answer should be somewhere between 0.39 and 0.41. I want a really accurate answer. What are you going to say? Well, there's no guarantee that my answer will be within 1 percentage point.
Maybe I'm unlucky and I just happen to sample the wrong set of people, and my answer comes out to be wrong. So I cannot give you a hard guarantee that this inequality will be satisfied. But perhaps I can give you a guarantee that this inequality -- this accuracy requirement -- will be satisfied with high confidence. That is, there's going to be a small probability that things go wrong -- that I'm unlucky and I use a bad sample. But leaving aside that small probability of being unlucky, my answer will be accurate within the accuracy requirement that you have.

So these two numbers are the usual specs that one has when designing polls. This number is the accuracy that we want -- the desired accuracy. And this number has to do with the confidence that we want. So 1 minus that number we could call the confidence that we want out of our sample; this is really 1 minus the confidence.

So now your job is to figure out how large an n -- how large a sample -- you should be using in order to satisfy the specs that your boss gave you. All you know at this stage is the Chebyshev inequality, so you just try to use it. The probability of getting an answer that's more than 0.01 away from the true answer is, by the Chebyshev inequality, at most the variance of this random variable divided by this number squared. The variance, as we argued a little earlier, is the variance of the X's divided by n. So we get this expression. And we would like this number to be less than or equal to 0.05.

OK, here we hit a little bit of a difficulty. The variance, (sigma_x)-squared -- what is it? (Sigma_x)-squared, if you remember the variance of a Bernoulli random variable, is this quantity. But we don't know it; f is what we're trying to estimate in the first place. So the variance is not known, and I cannot plug in a number here. What I can do is be conservative and use an upper bound on the variance. How large can this number get?
Well, you can plot f times (1-f). It's a parabola. It has a root at 0 and at 1. So the maximum value is going to be, by symmetry, at 1/2, and when f is 1/2, this variance becomes 1/4. So I don't know (sigma_x)-squared, but I'm going to use the worst-case value for (sigma_x)-squared, which is 1/4. And this is now an inequality that I know to be always true.

I've got my specs, and my specs tell me that I want this number to be less than 0.05. And given what I know, the best thing I can do is to say, OK, I'm going to take this number and make it less than 0.05. If I choose my n so that this is less than 0.05, then I'm certain that this probability is also less than 0.05. What does it take for this inequality to be true? You can solve for n here, and you find that to satisfy this inequality, n should be larger than or equal to 50,000. So you can just let n be equal to 50,000. So the Chebyshev inequality tells us that if you take n equal to 50,000, then we're guaranteed to satisfy the specs that we were given.

OK. Now, 50,000 is a bit of a large sample size. Right? If you read anything in the newspapers where they say such and such a fraction of the voters think this and that, this was determined on the basis of a sample of 1,200 likely voters or so. The numbers that you will typically see in these news items about polling usually involve sample sizes of about 1,000 or so. You will never see a sample size of 50,000. That's too much.

So where can we cut some corners? Well, we can cut corners basically in three places. This requirement is a little too tight. Newspaper stories will usually tell you, we have an accuracy of plus or minus 3 percentage points, instead of 1 percentage point. And because this number comes up as a square, making it 3 percentage points instead of 1 saves you a factor of about 10. Then, the 5 percent confidence -- I guess that's usually OK. If we use that factor of 10 savings that we gain from here, then we get a sample size of about 5,000.
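(To make this arithmetic concrete, here is a small sketch of the Chebyshev-based sample-size calculation: solving 1/(4 n epsilon^2) <= delta for n gives roughly 50,000 for 1-point accuracy and about 5,600 for 3-point accuracy, in line with the factor-of-10 savings just mentioned.)

```python
# Chebyshev bound with the worst-case Bernoulli variance 1/4:
#   P(|f_hat - f| >= eps) <= 1 / (4 * n * eps^2) <= delta
# requires n >= 1 / (4 * eps^2 * delta).
import math

def chebyshev_sample_size(eps, delta):
    return math.ceil(1.0 / (4.0 * eps ** 2 * delta))

print(chebyshev_sample_size(0.01, 0.05))   # 1-point accuracy  -> 50000
print(chebyshev_sample_size(0.03, 0.05))   # 3-point accuracy  -> 5556
```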
And that's, again, a little too big. So where can we fix things? Well, it turns out that the inequality we're using here, the Chebyshev inequality, is just an inequality. It's not that tight. It's not very accurate. Maybe there's a better way of calculating or estimating this quantity that shows it is smaller than this. And using a more accurate inequality, or a more accurate bound, we can convince ourselves that we can settle for a smaller sample size. This more accurate kind of inequality comes out of a different limit theorem, which is the next limit theorem we're going to consider. We're going to start the discussion today, but we're going to continue with it next week.

Before I tell you exactly what that other limit theorem says, let me give you the big picture of what's involved here. We're dealing with sums of i.i.d. random variables. Each X has a distribution of its own. So suppose that X has a distribution which is something like this; this is the density of X. If I add lots of X's together, what kind of distribution do I expect? The mean is going to be n times the mean of an individual X. So if this is mu, I'm going to get a mean of n times mu. But my variance will also increase. When I add the random variables, I'm adding the variances. So since the variance increases, we're going to get a distribution that's pretty wide. So this is the density of X1 plus all the way up to Xn. As n increases, my distribution shifts, because the mean is positive -- I keep adding things. And also, my distribution becomes wider and wider; the variance increases.

Well, we considered a different scaling. We looked at a scaled version of this quantity when we studied the weak law of large numbers. In the weak law of large numbers, we take this random variable and divide it by n.
And what the weak law tells us is that we're going to get a distribution that's very highly concentrated around the true mean, which is mu. So this here would be the density of (X1 plus ... plus Xn) divided by n. Because I've divided by n, the mean has become the original mean, which is mu. But the weak law of large numbers tells us that the distribution of this random variable is very concentrated around the mean. So we get a distribution that's very narrow, of this kind. In the limit, this distribution becomes one that's just concentrated on top of mu -- so it's sort of a degenerate distribution.

So these are two extremes: no scaling for the sum, and a scaling where we divide by n. In one extreme, we get the trivial case of a distribution that flattens out completely. In the other, we get a distribution that gets concentrated around a single point. So now we look at some intermediate scaling that makes things more interesting. Things do become interesting if we scale by dividing the sum by the square root of n instead of dividing by n.

What effect does this have? When we scale by dividing by the square root of n, the variance of Sn over square root of n is going to be the variance of Sn divided by n. That's how variances behave. The variance of Sn is n sigma-squared; divide that by n, and you get sigma-squared. Which means that when we scale in this particular way, as n changes, the variance doesn't change. So the width of our distribution will be sort of constant. The distribution changes shape, but it doesn't become narrower, as was the case here. It doesn't become wider; it kind of keeps the same width. So perhaps in the limit, this distribution is going to take an interesting shape. And that's indeed the case.

So let's do what we did before. We're looking at the sum, and we want to divide the sum by something that goes like the square root of n. The variance of Sn is n sigma-squared. The standard deviation of Sn is the square root of that.
798 00:44:38,240 --> 00:44:39,570 It's this number. 799 00:44:39,570 --> 00:44:43,930 So effectively, we're scaling by order of square root n. 800 00:44:43,930 --> 00:44:47,570 Now, I'm doing another thing here. 801 00:44:47,570 --> 00:44:52,350 If my random variable has a positive mean, then this 802 00:44:52,350 --> 00:44:55,470 quantity is going to have a mean that's 803 00:44:55,470 --> 00:44:56,950 positive and growing. 804 00:44:56,950 --> 00:44:59,450 It's going to be shifting to the right. 805 00:44:59,450 --> 00:45:01,350 Why is that? 806 00:45:01,350 --> 00:45:04,370 Sn has a mean that's proportional to n. 807 00:45:04,370 --> 00:45:09,510 When I divide by square root n, then it means that the mean 808 00:45:09,510 --> 00:45:11,990 scales like square root of n. 809 00:45:11,990 --> 00:45:14,740 So my distribution would still keep shifting 810 00:45:14,740 --> 00:45:16,720 after I do this division. 811 00:45:16,720 --> 00:45:20,860 I want to keep my distribution in place, so I subtract out 812 00:45:20,860 --> 00:45:23,920 the mean of Sn. 813 00:45:23,920 --> 00:45:29,580 So what we're doing here is a standard technique or 814 00:45:29,580 --> 00:45:32,670 transformation where you take a random variable and you 815 00:45:32,670 --> 00:45:34,890 so-called standardize it. 816 00:45:34,890 --> 00:45:38,500 I remove the mean of that random variable and I divide 817 00:45:38,500 --> 00:45:40,100 by the standard deviation. 818 00:45:40,100 --> 00:45:43,030 This results in a random variable that has 0 mean and 819 00:45:43,030 --> 00:45:44,960 unit variance. 820 00:45:44,960 --> 00:45:49,880 What Zn measures is the following: Zn tells me how 821 00:45:49,880 --> 00:45:55,520 many standard deviations I am away from the mean. 822 00:45:55,520 --> 00:45:59,380 Sn minus (n times expected value of X) tells me how far 823 00:45:59,380 --> 00:46:02,980 Sn is away from the mean value of Sn. 824 00:46:02,980 --> 00:46:06,250 And by dividing by the standard deviation of Sn -- 825 00:46:06,250 --> 00:46:09,830 this tells me how many standard deviations away from 826 00:46:09,830 --> 00:46:12,550 the mean I am. 827 00:46:12,550 --> 00:46:15,360 So we're going to look at this random variable, which is just 828 00:46:15,360 --> 00:46:17,260 a transformation, Zn. 829 00:46:17,260 --> 00:46:20,840 It's a linear transformation of Sn. 830 00:46:20,840 --> 00:46:24,740 And we're going to compare this random variable to a 831 00:46:24,740 --> 00:46:27,230 standard normal random variable. 832 00:46:27,230 --> 00:46:30,610 So a standard normal is the random variable that you are 833 00:46:30,610 --> 00:46:35,200 familiar with, given by the usual formula, and for which 834 00:46:35,200 --> 00:46:37,400 we have tables. 835 00:46:37,400 --> 00:46:40,400 This Zn has 0 mean and unit variance. 836 00:46:40,400 --> 00:46:44,220 So in that respect, it has the same statistics as the 837 00:46:44,220 --> 00:46:45,655 standard normal. 838 00:46:45,655 --> 00:46:48,960 The distribution of Zn could be anything -- 839 00:46:48,960 --> 00:46:50,770 it can be pretty messy.
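[Editor's note: a minimal simulation sketch, not part of the lecture. It assumes Python with NumPy (the lecture uses no code) and an arbitrary illustrative choice of Exponential(1) terms, so mu = 1 and sigma = 1. It checks numerically the three scalings discussed above: Sn/n concentrates around mu, Sn/sqrt(n) keeps a roughly constant variance of sigma squared, and the standardized Zn = (Sn - n*mu)/(sqrt(n)*sigma) has zero mean and unit variance.]

import numpy as np

rng = np.random.default_rng(0)
mu, sigma = 1.0, 1.0      # mean and standard deviation of an Exponential(1) term
trials = 10_000           # independent realizations of the sum Sn

for n in (10, 100, 1000):
    X = rng.exponential(scale=1.0, size=(trials, n))
    Sn = X.sum(axis=1)

    sample_mean = Sn / n                        # variance shrinks like sigma^2 / n
    scaled = Sn / np.sqrt(n)                    # variance stays near sigma^2
    Zn = (Sn - n * mu) / (np.sqrt(n) * sigma)   # standardized: mean near 0, variance near 1

    print(n, sample_mean.var(), scaled.var(), Zn.mean(), Zn.var())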
840 00:46:50,770 --> 00:46:53,320 But there is this amazing theorem called the central 841 00:46:53,320 --> 00:46:58,250 limit theorem that tells us that the distribution of Zn 842 00:46:58,250 --> 00:47:01,930 approaches the distribution of the standard normal in the 843 00:47:01,930 --> 00:47:06,270 following sense: probabilities that you can 844 00:47:06,270 --> 00:47:07,080 calculate -- 845 00:47:07,080 --> 00:47:07,930 of this type -- 846 00:47:07,930 --> 00:47:10,350 that you can calculate for Zn -- 847 00:47:10,350 --> 00:47:13,330 in the limit become the same as the probabilities that you 848 00:47:13,330 --> 00:47:17,590 would get from the standard normal tables for Z. 849 00:47:17,590 --> 00:47:19,750 It's a statement about the cumulative 850 00:47:19,750 --> 00:47:21,960 distribution functions. 851 00:47:21,960 --> 00:47:25,060 This quantity, as a function of c, is the cumulative 852 00:47:25,060 --> 00:47:27,920 distribution function of the random variable Zn. 853 00:47:27,920 --> 00:47:30,860 This is the cumulative distribution function of the 854 00:47:30,860 --> 00:47:32,190 standard normal. 855 00:47:32,190 --> 00:47:34,530 The central limit theorem tells us that the cumulative 856 00:47:34,530 --> 00:47:39,340 distribution function of the sum of a number of random 857 00:47:39,340 --> 00:47:43,040 variables, after they're appropriately standardized, 858 00:47:43,040 --> 00:47:46,480 approaches the cumulative distribution function of the 859 00:47:46,480 --> 00:47:50,580 standard normal distribution. 860 00:47:50,580 --> 00:47:53,620 In particular, this tells us that we can calculate 861 00:47:53,620 --> 00:47:59,480 probabilities for Zn when n is large by calculating instead 862 00:47:59,480 --> 00:48:02,800 probabilities for Z. And that's going to be a good 863 00:48:02,800 --> 00:48:04,020 approximation. 864 00:48:04,020 --> 00:48:07,670 Probabilities for Z are easy to calculate because they're 865 00:48:07,670 --> 00:48:09,250 well tabulated. 866 00:48:09,250 --> 00:48:12,820 So we get a very nice shortcut for calculating 867 00:48:12,820 --> 00:48:14,990 probabilities for Zn. 868 00:48:14,990 --> 00:48:17,990 Now, it's not Zn that you're interested in. 869 00:48:17,990 --> 00:48:20,890 What you're interested in is Sn. 870 00:48:20,890 --> 00:48:23,820 And Sn -- 871 00:48:23,820 --> 00:48:29,080 inverting this relation here -- 872 00:48:29,080 --> 00:48:38,330 Sn is square root n times sigma times Zn plus n times the expected 873 00:48:38,330 --> 00:48:42,602 value of X. All right. 874 00:48:42,602 --> 00:48:46,620 Now, if you can calculate probabilities for Zn, even 875 00:48:46,620 --> 00:48:49,380 approximately, then you can certainly calculate 876 00:48:49,380 --> 00:48:53,290 probabilities for Sn, because one is a linear 877 00:48:53,290 --> 00:48:55,206 function of the other. 878 00:48:55,206 --> 00:48:58,710 And we're going to do a little bit of that next time. 879 00:48:58,710 --> 00:49:02,220 You're going to get, also, some practice in recitation. 880 00:49:02,220 --> 00:49:04,975 At a more vague level, you could describe the central 881 00:49:04,975 --> 00:49:08,270 limit theorem as saying the following: when n is large, 882 00:49:08,270 --> 00:49:12,160 you can pretend that Zn is a standard normal random 883 00:49:12,160 --> 00:49:15,440 variable and do the calculations as if Zn were 884 00:49:15,440 --> 00:49:16,680 standard normal.
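[Editor's note: another sketch under the same assumptions (Python with NumPy, plus SciPy for the normal CDF; the Bernoulli(0.3) terms and all numbers are purely illustrative). It compares the empirical cumulative distribution function of Zn at a few points c with the standard normal CDF Phi(c), which is the sense of convergence stated above.]

import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(1)

p = 0.3                                  # Bernoulli parameter (illustrative)
mu, sigma = p, np.sqrt(p * (1 - p))      # mean and standard deviation of one term
n, trials = 300, 50_000

X = rng.binomial(1, p, size=(trials, n))
Sn = X.sum(axis=1)
Zn = (Sn - n * mu) / (np.sqrt(n) * sigma)

for c in (-1.0, 0.0, 1.0, 2.0):
    print(c, (Zn <= c).mean(), norm.cdf(c))   # empirical P(Zn <= c) vs. Phi(c)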
885 00:49:16,680 --> 00:49:21,530 Now, pretending that Zn is normal is the same as 886 00:49:21,530 --> 00:49:25,900 pretending that Sn is normal, because Sn is a linear 887 00:49:25,900 --> 00:49:27,700 function of Zn. 888 00:49:27,700 --> 00:49:30,400 And we know that linear functions of normal random 889 00:49:30,400 --> 00:49:32,140 variables are normal. 890 00:49:32,140 --> 00:49:36,290 So the central limit theorem essentially tells us that we 891 00:49:36,290 --> 00:49:40,070 can pretend that Sn is a normal random variable and do 892 00:49:40,070 --> 00:49:44,760 the calculations just as if it were a normal random variable. 893 00:49:44,760 --> 00:49:47,020 Mathematically speaking though, the central limit 894 00:49:47,020 --> 00:49:50,480 theorem does not talk about the distribution of Sn, 895 00:49:50,480 --> 00:49:54,940 because the distribution of Sn becomes degenerate in the 896 00:49:54,940 --> 00:49:57,650 limit, just a very flat and long thing. 897 00:49:57,650 --> 00:49:59,810 So strictly speaking mathematically, it's a 898 00:49:59,810 --> 00:50:03,060 statement about cumulative distributions of Zn's. 899 00:50:03,060 --> 00:50:06,420 Practically, the way you use it is by just pretending that 900 00:50:06,420 --> 00:50:08,415 Sn is normal. 901 00:50:08,415 --> 00:50:09,400 Very good. 902 00:50:09,400 --> 00:50:11,080 Enjoy the Thanksgiving Holiday. 903 00:50:11,080 --> 00:50:12,330
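[Editor's note: a final sketch of the practical use just described, again assuming Python with NumPy and SciPy, and hypothetical numbers. Pretending that Sn is normal with mean n*mu and standard deviation sqrt(n)*sigma, a probability for Sn is read off the standard normal CDF.]

import numpy as np
from scipy.stats import norm

def approx_prob_sn_at_most(s, n, mu, sigma):
    # Normal approximation: P(Sn <= s) is approximately
    # Phi((s - n*mu) / (sqrt(n)*sigma)).
    return norm.cdf((s - n * mu) / (np.sqrt(n) * sigma))

# Hypothetical example: 100 terms, each with mean 2 and standard
# deviation 3; probability that the sum is at most 230.
# (230 - 200) / (10 * 3) = 1, so this is about Phi(1), roughly 0.84.
print(approx_prob_sn_at_most(230, n=100, mu=2.0, sigma=3.0))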