The following content is provided under a Creative Commons license. Your support will help MIT OpenCourseWare continue to offer high-quality educational resources for free. To make a donation or view additional materials from hundreds of MIT courses, visit MIT OpenCourseWare at ocw.mit.edu.

PROFESSOR: OK, let's get started. We'll start a minute early and then maybe we can finish a minute early if we're lucky. Today we want to talk about the laws of large numbers. We want to talk about convergence a little bit. We will not really get into the strong law of large numbers, which we're going to do later, because that's kind of a mysterious, difficult topic. And I wanted to put off really talking about that until we got to the point where we could make some use of it, which is not quite yet.

So first, I want to review what we've done just a little bit. We've said that probability models are very natural things for real-world situations, particularly those that are repeatable. And by repeatable, I mean they use trials which have essentially the same initial conditions. They're essentially isolated from each other. When I say they're isolated from each other, I mean there isn't any apparent contact that they have with each other. So for example, when you're flipping coins, there's not one very unusual coin, and that's the coin you use all the time, and then you try to use those results to be typical of all coins. The trials have a fixed set of possible outcomes, and they have essentially random individual outcomes.

Now, you'll see there's a real problem here when I use the word "random" there and "probability models" there, because there is something inherently circular in this argument. It's something that always happens when you get into modeling, where you're trying to take the messy real world and turn it into a nice, clean mathematical model.
So really, what we all do, and we do this instinctively, is after we start getting used to a particular model, we assume that the real world is like that model. If you don't think you do that, think again, because I think everyone does. So you really have that problem of trying to figure out what's wrong with models, how to go to better models, and we do this all the time.

OK, for any model, an extended model -- in other words, an extended mathematical model for a sequence or an n-tuple of independent identically distributed repetitions -- is always well-defined mathematically. We haven't proven that; it's not trivial. But in fact, it's true.

Relative frequencies and sample averages: relative frequencies apply to events, and you can represent events in terms of indicator functions and then use everything you know about random variables to deal with them. Therefore, you can use sample averages. In this extended model, relative frequencies and sample averages essentially become deterministic. And that's what the laws of large numbers say in various different ways. And beyond knowing that they become deterministic, our problem today is to decide exactly what that means.

The laws of large numbers specify what "become deterministic" means. They only operate within the extended model. In other words, laws of large numbers don't apply to the real world. Well, we hope they apply to the real world, but they only apply to the real world when the model is good, because you can only prove the laws of large numbers within this model domain.

Probability theory provides an awful lot of consistency checks and ways to avoid experimentation. In other words, I'm not claiming here that you have to do experimentation with a large number of so-called independent trials very often, because you have so many ways of checking on things. But every once in a while, you have to do experimentation. And when you do, somehow or other you need this idea of a large number of trials -- either IID trials, or trials which are somehow isolated from each other.
And we will soon get to talk about Markov models. We will see that with Markov models, you don't have the IID property, but you have enough independence over time that you can still get these sorts of results. So anyway, the determinism in this large number of trials really underlies much of the value of probability. OK, in other words, you don't need to use this experimentation very often. But when you do, you really need it, because that's what you use to resolve conflicts, to settle on things, and to let different people who are all trying to understand what's going on have some idea of something they can agree on.

OK, so that's enough for probability models. That's enough for philosophy. We will come back to this with little bits and pieces now and then. But at this point, we're really going into talking about the mathematical models themselves.

OK, so let's talk about the Markov bound, the Chebyshev bound, and the Chernoff bound. You should be reading the notes, so I hope you know what all these things are so I can go through them relatively quickly. If you think that using these lecture slides that I'm passing out, plus doing the problems, is sufficient for understanding this course, you're really kidding yourself. I mean, the course is based on this text, which explains things much more fully. It still has errors in it. It still has typos in it. But a whole lot fewer than these lecture slides, so you should be reading the text and then using it to try to get a better idea of what these lectures mean, and using that to get a better idea of what the exercises you're doing mean.

Doing the exercises by themselves does not do you any good whatsoever. The only thing that does you some good is to do an exercise and then think about what it has to do with anything. And if you don't do that second part, then all you're doing is building a very, very second-rate computer. Your abilities as a computer are about the same, for the most part, as the computer in a coffee maker.
You are really not up to what a TV set does anymore. I mean, TV sets can do so much computation that they're way beyond your abilities at this point. So the only edge you have, the only thing you can do to try to make yourself worthwhile, is to understand these things, because computers cannot do any of that understanding at all. So you're way ahead of them there.

OK, so what is the Markov bound? What it says is, if Y is a non-negative random variable -- in other words, it's a random variable which only takes on non-negative sample values -- and if it has an expectation, the expected value of Y, then for any real y greater than 0, the probability that Y is greater than or equal to little y is at most the expected value of Y divided by little y.

The proof of it is by picture. If you don't like proofs by pictures, you should get used to it, because we will prove a great number of things by pictures here. And I claim that a proof by picture is better than a proof by algebra, because if there's anything wrong with it, you can see from looking at it what it is.

So we know that the expected value of Y is the integral under the complementary distribution function. This rectangle in here has area y times the probability that Y is greater than or equal to y. The probability that the random variable Y is greater than or equal to the number little y is just that point right up there. This is the point y. This is the point probability of capital Y greater than little y. It doesn't make any difference when you're integrating whether you use a greater-than-or-equal-to sign or a greater-than sign. If you have a discontinuity, the integral is the same no matter which way you look at it. So this area here is y times the probability that the random variable Y is greater than or equal to the number y. And all that the Markov bound says is that this little rectangle here is less than or equal to the integral under that curve. That's a perfectly rigorous proof.
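As a quick numerical illustration of the bound just stated -- a minimal simulation sketch, not part of the lecture; the exponential distribution, sample size, and seed are arbitrary choices made only for illustration:

```python
# A minimal sketch (not from the lecture): checking the Markov bound
# P(Y >= y) <= E[Y]/y by simulation for a non-negative random variable.
import numpy as np

rng = np.random.default_rng(0)
samples = rng.exponential(scale=2.0, size=100_000)   # Y >= 0, E[Y] = 2
mean_estimate = samples.mean()

for y in [1.0, 2.0, 5.0, 10.0]:
    empirical = np.mean(samples >= y)     # estimate of P(Y >= y)
    markov = mean_estimate / y            # the Markov bound E[Y]/y
    print(f"y={y:5.1f}  P(Y>=y)~{empirical:.4f}  Markov bound {markov:.4f}")
```

Note that for small y the bound can exceed 1; it is always valid, just weak.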
We don't really care about rigorous proofs here anyway, since we're trying to get at the issue of how you use probability. But we don't want to have proofs which mislead you about things -- in other words, proofs which aren't right. So we try to be right, and I want you to learn to be right. But I don't want you to start to worry too much about looking like a mathematician when you prove things.

OK, the Chebyshev inequality. If Z has a mean -- and when you say Z has a mean, what you really mean is that the expected value of the absolute value of Z is finite -- and it has a variance sigma squared of Z (that's saying a little more than just having a mean), then for any epsilon greater than 0 -- in other words, this bound works for any epsilon -- the probability that the absolute value of Z minus the mean is greater than or equal to epsilon, in other words that Z is further away from the mean than epsilon, is less than or equal to the variance of Z divided by epsilon squared.

Again, this is a very weak bound, but it's very general. And therefore, it's very useful. And the proof is simplicity itself. You define a new random variable Y, which is Z minus the expected value of Z, quantity squared. The expected value of Y then is the expected value of this, which is just the variance of Z. So for any y greater than 0, we use the Markov bound, which says the probability that the random variable Y is greater than or equal to the number little y is less than or equal to sigma Z squared -- namely, the expected value of the random variable Y -- divided by the number y. That's just the Markov bound. And then, the random variable Y is greater than or equal to the number y if and only if the positive square root of capital Y -- we're dealing only with non-negative things here -- is greater than or equal to the square root of the number y. And that's less than or equal to sigma Z squared over y. And the square root of Y is just the magnitude of Z minus Z bar. So setting epsilon equal to the square root of y yields the Chebyshev bound.
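Here is the same construction carried out numerically -- a minimal sketch, not from the lecture; the uniform distribution, epsilon values, sample size, and seed are arbitrary illustrative choices:

```python
# A minimal sketch (not from the lecture): the Chebyshev inequality
# P(|Z - E[Z]| >= eps) <= var(Z)/eps**2, obtained by applying the Markov
# bound to Y = (Z - E[Z])**2 at the level y = eps**2.
import numpy as np

rng = np.random.default_rng(1)
z = rng.uniform(low=0.0, high=1.0, size=200_000)   # E[Z] = 0.5, var = 1/12
mean_z = z.mean()
y_sq = (z - mean_z) ** 2                 # the Y = (Z - E[Z])^2 of the proof

for eps in [0.1, 0.2, 0.4]:
    empirical = np.mean(y_sq >= eps**2)  # same event as |Z - E[Z]| >= eps
    chebyshev = y_sq.mean() / eps**2     # Markov applied to Y at y = eps^2
    print(f"eps={eps:.1f}  P(|Z-mean|>=eps)~{empirical:.4f}  bound {chebyshev:.4f}")
```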
Now, that's something which -- I don't believe in memorizing proofs. I think that's a terrible idea. But that's something so simple and so often used that you just ought to think in those terms. You ought to be able to see that diagram of the Markov inequality. You ought to be able to see why it is true. And you ought to understand it well enough that you can use it. In other words, there's a big difference in mathematics between knowing what a theorem says and knowing that the theorem is true, and really having a gut feeling for that theorem. I mean, you know this when you deal with numbers. You know it when you deal with integration or differentiation, or any of those things you've known about for a long time. There's a big difference between the things that you can really work with, because you understand them and you see them, and those things that you just know as something you don't really understand. This is something that you really ought to understand and be able to see.

OK, the Chernoff bound is the last of these. We will use this a great deal. It's a generating function bound. And it says, for any positive number z and any positive number r greater than 0 such that the moment generating function exists -- the moment generating function of a random variable Z is a function of a real number r, and that function is the expected value of e to the rZ. It's called the generating function because if you start taking derivatives of this and evaluate them at r equal to 0, what you get is the various moments of Z. You've probably seen that at some point. If you haven't seen it, it's not important here, because we don't use that at all. What we really use is the fact that this is a function. It's a function which is increasing as r increases because -- well, it just does.
And what it says is the probability that this random variable is greater than or equal to the number z -- I should really use different letters for these things; it's hard to talk about them -- is less than or equal to the moment generating function times e to the minus r times little z. And the proof is exactly the same as the proof before. You might get the picture that you can prove many, many different things from the Markov inequality. And in fact, you can. You just put in whatever you want to and you get a new inequality. And you can call it after yourself if you want.

I mean, Chernoff -- Chernoff is still alive. Chernoff is a faculty member at Harvard. And this is kind of curious, because he sort of slipped this in in a paper that he wrote where he was trying to prove something difficult. And this is a relationship that many mathematicians have used over many, many years. And it's so simple that they didn't make any fuss about it. And he didn't make any fuss about it. And he was sort of embarrassed that many engineers, starting with Claude Shannon, found this to be extraordinarily useful and started calling it the Chernoff bound. He was slightly embarrassed at having this totally trivial thing suddenly be named after him. But anyway, that's the way it happened. And now it's a widely used tool that we use all the time.

So it's the same proof that we had before: for any y greater than 0, Markov says this. And therefore, you get that. This decreases exponentially with z, and that's why it's useful. I mean, the Markov inequality only decays as 1 over little y. The Chebyshev inequality decays as 1 over y squared. This decays exponentially with y. And therefore, when you start dealing with large deviations, trying to talk about things that are very, very unlikely when you get very, very far from the mean, this is a very useful way to do it. And it's sort of the standard way of doing it at this point.
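The different decay rates are easy to see side by side -- a minimal sketch, not from the lecture; the exponential distribution and the particular tail points are illustrative choices, and the Chebyshev line here is my own application of that bound to the same tail event:

```python
# A minimal sketch (not from the lecture): comparing how fast the Markov,
# Chebyshev, and (optimized) Chernoff bounds decay for an exponential
# random variable with rate 1, so E[Y] = 1, var(Y) = 1, and the true tail
# is P(Y >= y) = exp(-y).  The MGF is E[e^{rY}] = 1/(1-r) for r < 1.
import math

mean, var = 1.0, 1.0
for y in [2.0, 5.0, 10.0, 20.0]:
    true_tail = math.exp(-y)
    markov = mean / y                        # decays like 1/y
    chebyshev = var / (y - mean) ** 2        # via P(|Y-1| >= y-1), like 1/y^2
    r = 1.0 - 1.0 / y                        # r that minimizes exp(-r*y)/(1-r)
    chernoff = math.exp(-r * y) / (1.0 - r)  # decays exponentially in y
    print(f"y={y:5.1f}  true={true_tail:.2e}  Markov={markov:.2e}  "
          f"Chebyshev={chebyshev:.2e}  Chernoff={chernoff:.2e}")
```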
We won't use it right now, but this is the right time to talk about it a little bit.

OK, the next topic we want to take up is really these laws of large numbers, and something about convergence. We want to understand a little bit about that. And in this picture that we've seen before, we take X1 up to Xn as n independent identically distributed random variables. They each have mean equal to the expected value of X. They each have variance sigma squared. You let Sn be the sum of all of them. What we want to understand is, how does S sub n behave? And more particularly, how does Sn over n -- namely, not the relative frequency, but the sample average of X -- behave when you take n samples?

So this curve shows the distribution function of S4, of S20, and of S50 when you have a binary random variable with probability of 1 equal to a quarter and probability of 0 equal to 3/4. And what you see graphically is what you can see mathematically very easily, too. The mean value of S sub n is n times X bar. So the center point in these curves is moving out with n. You see the center point here, center point somewhere around there. Actually, the center point is at -- yeah, just about what it looks like. And the center point here is out somewhere around there. And you see the variance going up linearly with n. Not with n squared, but with n, which means the standard deviation is going up with the square root of n. That's sort of why the law of large numbers works. It's because the standard deviation of these sums only goes up with the square root of n. So these curves, along with moving out, become relatively more compressed relative to how far out they are. This curve here is relatively more compressed around its mean than this one is here. And that's more compressed relative to this one.

We see this a lot more easily if we look at the sample average -- namely, S sub n over n. This is a random variable of mean X bar.
That's a random variable of variance sigma squared over n. That's something that you ought to just recognize and have very close to the top of your consciousness, because that, again, is sort of why the sample average starts to converge.

So what happens then is, for n equal to 4, you get this very "blech" curve. For n equal to 20, it starts looking a little more reasonable. For n equal to 50, it's starting to scrunch in and starting to look like a unit step. And what we'll find is that the intuitive way of looking at the law of large numbers, or one of the more intuitive ways of looking at it, is that the sample average starts to have a distribution function which looks like a unit step. And that step occurs at the mean. So this curve here keeps scrunching in. This part down here is moving over that way. This part over here is moving over that way. And it all gets close to a unit step.

OK, the variance of Sn over n, as we've said, is equal to sigma squared over n. The limit of the variance as n goes to infinity -- take the limit of that. You don't even have to know the definition of a limit; you can see that when n gets large, this gets small, and this goes to 0. So the limit of this goes to 0.

Now, here's the important thing. This equation says a whole lot more than this equation says, because this equation says how quickly that approaches 0. All this says is that it approaches 0. So we've thrown away a lot that we know, and now all we know is this. This equation 3 says that the convergence is as 1/n. This doesn't say that. This just says that it converges. Why would anyone in their right mind want to replace an informative statement like this with an uninformative statement like this? Any ideas of why you might want to do that? Any suggestions?

AUDIENCE: Convenience.

PROFESSOR: What?
AUDIENCE: Convenience. Convenience. Sometimes you don't need to --

PROFESSOR: Well, yes, convenience. But there's a much stronger reason. This is a statement for IID random variables. This law of large numbers -- we want it to apply to as many different situations as possible. To things that aren't quite IID. To things that don't have a variance. And this statement here is going to apply more generally than this one. You can have situations where the variance goes to 0 more slowly than 1/n if these random variables are not independent. But you still have this statement. And this statement is what we really need. So this really says something which is called convergence in mean square. Why mean square? Because this is the mean squared. So obvious terminology. Mathematicians aren't always very good at choosing terminology that makes sense when you look at it, but this one does.

The definition is: a sequence of random variables Y1, Y2, Y3, and so forth converges in mean square to a random variable Y if this limit here is equal to 0. So in this case, Y, this random variable Y, is really a deterministic random variable, which is just the deterministic value, the expected value of X. This random variable here is the sample average here. And this is saying that the expected value of the squared difference between the sample average and the expected value of X is going to 0. This isn't saying anything extra. This is just saying, if you're not interested in the law of large numbers, you might be interested in how a bunch of random variables approaches some other random variable.
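Written out in symbols -- reconstructed from the verbal description above, using the notation of the lecture -- the definition and the sample-average case it is being applied to are:

```latex
\lim_{n\to\infty} \mathrm{E}\!\left[\,|Y_n - Y|^2\,\right] = 0,
\qquad\text{and here}\qquad
\mathrm{E}\!\left[\Big(\tfrac{S_n}{n} - \overline{X}\Big)^{2}\right]
  = \frac{\sigma^2}{n} \;\longrightarrow\; 0 .
```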
Now, if you look at a sequence of real numbers and you ask, does that sequence of real numbers approach something, you have sort of a complicated-looking definition for that, which really says that the numbers approach this constant. But a sequence of numbers is so much more simple-minded than a sequence of random variables. I mean, a sequence of random variables is -- I mean, not even their distribution functions really explain what they are. There's also the relationship between the distribution functions. So you're not going to find anything very easy that says random variables converge. And you can expect to find a number of different kinds of statements about convergence. And this is just going to be one of them. This is something called convergence in mean square -- yes?

AUDIENCE: Going from 3 to 4, we don't need IID anymore? So they can be just --

PROFESSOR: You can certainly find examples where it's not IID, and this doesn't hold and this does hold. The most interesting case where this doesn't hold and this does hold is where you -- no, you still need a variance for this to hold. Yeah, so I guess I can't really construct any nice examples where this holds and this doesn't hold. But there are some if you talk about random variables that are not IID. I ought to have a problem that does that. But so far, I don't.

Now, the fact that this sample average converges in mean square doesn't tell us directly what might be more interesting. I mean, you look at that statement and it doesn't really tell you what this complementary distribution function looks like. I mean, to me the thing that is closest to what I would think of as convergence is that this sequence of random variables minus the random variable -- the convergence of that difference -- approaches a distribution function which is the unit step. Which means that the probability that you're anywhere off of that center point is going to 0. I mean, that's a very easy-to-interpret statement.

The fact that the variance is going to 0 -- I don't quite know how to interpret it, except through Chebyshev's inequality, which gets me to the other statement. So what I'm saying here is, if we apply Chebyshev to that statement before, number 3 -- this one -- which says what the variance is.
If we apply Chebyshev, then what we get is the probability that the sample average minus the mean -- the absolute value of that -- is greater than or equal to epsilon. That probability is less than or equal to sigma squared over n times epsilon squared. You'll notice this is a very peculiar statement in terms of epsilon, because if you want to make epsilon very small, so you get something strong here, this term blows up. So the way you have to look at this is, pick some epsilon you're happy with. I mean, you might want these two things to be within 1% of each other. Then epsilon squared here is -- excuse me, epsilon squared is 1/10,000, so 1 over epsilon squared is 10,000. So you need to make n very large; by making n big enough, that gets submerged. Yes?

AUDIENCE: So that's why at times when n is too small and epsilon is too small as well, you can get obvious things, like it's less than or equal to a number greater than 1?

PROFESSOR: Yes. And then this inequality is not much good, because there's a very obvious inequality that works. Yes. But the other thing is, this is a very weak inequality. So all this is doing is giving you a bound. All it's doing is saying that when n gets big enough, this number gets as small as you want it to be. So you can get an arbitrary accuracy of epsilon between sample average and mean. You can get that with a probability 1 minus this quantity. And you can make that as close to 1 as you wish if you increase n.

So that gives us the law of large numbers, and I haven't stated it formally -- all the formal jazz is in the notes. But it says, if you have IID random variables with a finite variance, the limit of the probability that Sn over n minus X bar -- the absolute value of that -- is greater than or equal to epsilon is equal to 0, no matter how you choose epsilon.
Namely, this is one of those peculiar things in mathematics. It depends on who gets the first choice. If I get to choose epsilon and you get to choose n, then you win. You can make this go to 0. If you choose n and then I choose epsilon, you lose. So it's only when you choose first that you win. But still, this statement works. For every epsilon greater than 0, this limit here is equal to 0.

Now, let's go immediately a couple of pages beyond and look at this figure a little bit, because I think this figure tells what's going on better than anything else. You have the mean of X, which is right in the center of this distribution function. As n gets larger and larger, this distribution function here is going to be scrunching in, which we sort of know because the variance is going to 0. And we also sort of know it because of what this weak law of large numbers tells us.

And we have these -- if we pick some given epsilon, and we look at a range of two epsilon, epsilon on one side of the mean and epsilon on the other side of the mean, then we can ask the question, how well does this distribution function conform to a unit step? Well, one easy way of looking at that is saying, if we draw a rectangle here of width 2 epsilon around X bar, when does this distribution function get inside that rectangle and when does it leave the rectangle? And what the weak law of large numbers says is that if you pick epsilon and hold it fixed, then delta 1 is going to 0 and delta 2 is going to 0. And eventually --

I think this is dying out. Well, no problem.

What this says is that as n gets larger and larger, this quantity shrinks down to 0. That quantity up there shrinks down to 0. And suddenly, you have something which is, for all practical purposes, a unit step. Namely, if you think about it a little bit, how can you take an increasing curve, which increases from 0 to 1, and say that's close to a unit step? Isn't this a nice way of doing it? I mean, the function is increasing, so it can't do anything after it crosses this point here. All it can do is increase, and eventually it leaves again.
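A minimal simulation sketch of that picture -- not from the lecture slides; it uses the binary random variable of the earlier example, and the epsilon, the n values, the sample counts, and the seed are illustrative choices:

```python
# A minimal sketch: for the binary rv with P(X=1) = 1/4 (mean 0.25),
# estimate the two "fudge factors" of the figure,
#   delta1 ~ P(S_n/n <= mean - eps)  and  delta2 ~ P(S_n/n > mean + eps),
# and watch both go to 0 as n grows, while the variance falls as sigma^2/n.
import numpy as np

rng = np.random.default_rng(2)
p, eps = 0.25, 0.1
for n in [4, 20, 50, 500]:
    sample_avg = rng.binomial(n=1, p=p, size=(100_000, n)).mean(axis=1)
    delta1 = np.mean(sample_avg <= p - eps)   # mass to the left of the rectangle
    delta2 = np.mean(sample_avg > p + eps)    # mass to the right of the rectangle
    print(f"n={n:4d}  var~{sample_avg.var():.5f} (sigma^2/n={p*(1-p)/n:.5f})  "
          f"delta1~{delta1:.3f}  delta2~{delta2:.3f}")
```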
Now, another thing. When you think about the weak law of large numbers and you don't state it formally, one of the important things is you can't make epsilon 0 here and you can't make delta 0. You need both an epsilon and a delta in this argument. And you can see that just by looking at a reasonable distribution function. If you make epsilon equal to 0, then you're asking, what's the probability that this sample average is exactly equal to the mean? And in most cases, that's equal to 0. Namely, you can't win on that argument. And if you try to make delta equal to 0 -- in other words, you ask -- then suddenly you're stuck over here, and you're stuck way over there, and you can't make epsilon small. So to say that a curve looks like a step function, you really need two fudge factors. So the weak law of large numbers, in terms of dealing with how close you are to a step function, says about the only thing you can reasonably say.

OK, now let's go back to the slide before. The weak law of large numbers says that the limit as n goes to infinity of the probability that the sample average minus X bar is greater than or equal to epsilon equals 0. And it says that for every epsilon greater than 0.

An equivalent statement is this statement here. The probability that Sn over n minus X bar is greater than or equal to epsilon is a complicated-looking animal, but it's just a number. It's just a number between 0 and 1. It's a probability. So for every n, you get a number up there. And what this is saying is that that sequence of numbers is approaching 0. Another way to say that a sequence of numbers approaches 0 is this way down here. It says that for every epsilon greater than 0 and every delta greater than 0, the probability that this quantity is greater than or equal to epsilon is less than or equal to delta for all large enough n.
In other words, these funny little things on the edge here -- the delta 1 and the delta 2 on this next slide -- are going to 0. So it's important to understand this both ways.

And now again, these two equations look very much alike, except this one tells you something more about convergence than this one does. This says how this goes to 0. This only says that it goes to 0. So again, we have the same thing. The weak law of large numbers says this weaker thing, and it says this weaker thing because sometimes you need the weaker thing. And in this case, there is a good example. The weak law of large numbers is true even if you don't have a variance. It's true under the single condition that you have a mean. There's a nice proof in the text about that.

It's a proof that does something which we're going to do many, many times. You look at a random variable. You can't say what you want to say about it, so you truncate it. I mean, if you think a problem is too hard, you look at a simpler problem. If you're drunk and you drop a coin, you look for it underneath a light. You don't look for it where it's dark, even though you dropped it where it's dark. So all of us do that. If we can't solve a problem, we try to pose a simpler, similar problem that we can solve.

So you truncate this random variable. When you truncate a random variable, I mean you just take its distribution function and you chop it off at a certain point. And what happens then? Well, at that point you have a variance. You have a moment generating function. You have all the things you want. Nothing peculiar can happen because the thing is bounded. So then the trick in proving the weak law of large numbers under these more general circumstances is to first truncate the random variable. You then have the weak law of large numbers. And then the thing that you do, in a very ticklish way, is you start increasing n, and you increase the truncation parameter. And if you do this in just the right way, you wind up proving the theorem you want to prove.
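A minimal sketch of the phenomenon being proved -- not the text's proof, and not from the lecture; the Pareto distribution, the truncation level, and the sample sizes are illustrative choices of mine:

```python
# A minimal sketch: a random variable with a mean but no variance (a
# Pareto tail with exponent 1.5) still obeys the weak law of large
# numbers, although convergence is slower than in the finite-variance
# case; truncating it restores a finite variance (and an MGF).
import numpy as np

rng = np.random.default_rng(3)
alpha = 1.5                            # E[X] = alpha/(alpha-1) = 3, variance infinite
true_mean = alpha / (alpha - 1.0)

for n in [100, 10_000, 1_000_000]:
    x = 1.0 + rng.pareto(alpha, size=n)          # classical Pareto on [1, infinity)
    print(f"n={n:9,d}  sample average {x.mean():.4f}  (true mean {true_mean:.4f})")

# Truncation: chop the variable off at a level b, so it is bounded.
b = 100.0
x = 1.0 + rng.pareto(alpha, size=1_000_000)
x_trunc = np.minimum(x, b)
print(f"truncated at b={b}: mean {x_trunc.mean():.4f}, variance {x_trunc.var():.2f}")
```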
Now, I'm not saying you ought to read that proof now. If you're sailing along with no problems at all, you ought to read that proof now. If you don't quite have the kind of mathematical background that I seem to be often assuming in this course, you ought to skip it. You will have many opportunities to understand the technique later. It's the technique which is important -- I mean, it's not so much the actual proof.

Now, the thing we didn't talk about, about this picture here, is we say that a sequence of random variables -- Y1, Y2, et cetera -- converges in probability to a random variable Y if for every epsilon greater than 0 and every delta greater than 0, the probability that the magnitude of the n-th random variable minus this funny random variable is greater than or equal to epsilon is less than or equal to delta for all large enough n. That's saying the same thing as this picture says. In this picture here, you can draw each one of the Y sub n's minus Y. You think of Y sub n minus Y as a single random variable. And then you get this kind of curve here, and the same interpretation works. So again, what you're saying with convergence in probability is that the distribution function of Yn minus Y is approaching a unit step as n gets big. So that's really the meaning of convergence in probability. I mean, you get this unit step as n gets bigger and bigger.

OK, so let's review all of what we've done in the last half hour. If a generic random variable X has a standard deviation -- in other words, if it has a finite variance -- and if X1, X2, and so on are IID with that standard deviation, then the standard deviation of the sample average is equal to the standard deviation of X divided by the square root of n.
So the standard deviation of the sample average is going to 0 as n gets big. In the same way, if you have a sequence of arbitrary random variables which is converging to Y in mean square, then Chebyshev shows that it converges in probability. OK, so mean square convergence implies convergence in probability. Mean square convergence is a funny statement, which says that this sequence of random variables has a standard deviation which is going to 0. And it's hard to see exactly what that means, because that standard deviation is a complicated integral. And I don't know what it means. But if you use the Chebyshev inequality, then it means this very simple statement, which says that this sequence has to converge in probability to Y. Mean square convergence then implies convergence in probability.

The reverse isn't true, because -- and I can't give you an example of it now, but I've already told you something about it. Because I've said that the weak law of large numbers continues to hold if the generic random variable has a mean but doesn't have a variance, because of this truncation argument. Well, I mean, what it says then is that a variance is not required for the weak law of large numbers to hold. And if the variance doesn't exist, then you certainly don't have convergence in mean square. So we have an example, even though we haven't proven that that example works. You have an example where the weak law of large numbers holds, but convergence in mean square does not hold.

OK, and the final thing is, convergence in probability really means that the distribution of Yn minus Y approaches the unit step. Yes?

AUDIENCE: So in general, convergence in probability doesn't imply convergence in distribution. But it holds in this special case because --

PROFESSOR: It does imply convergence in distribution. We haven't talked about convergence in distribution yet. Except it does not imply convergence in mean square, which is a thing that requires a variance. So you can have convergence in probability without convergence in mean square, but not the other way. I mean, with convergence in mean square, you just apply Chebyshev to it, and suddenly -- presto, you have convergence in probability.
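A standard textbook-style example of this one-way implication -- not one given in the lecture; the construction, sample sizes, and seed are my own illustrative choices:

```python
# A minimal sketch: a sequence that converges in probability but not in
# mean square.  Y_n = n with probability 1/n and 0 otherwise, so
# P(|Y_n| >= eps) = 1/n -> 0, while E[Y_n^2] = n^2 * (1/n) = n -> infinity.
import numpy as np

rng = np.random.default_rng(4)
for n in [10, 100, 1000, 10_000]:
    y = np.where(rng.random(200_000) < 1.0 / n, float(n), 0.0)
    print(f"n={n:6d}  P(|Y_n| >= 0.5)~{np.mean(np.abs(y) >= 0.5):.5f}  "
          f"E[Y_n^2]~{np.mean(y**2):.1f}  (exact: 1/n={1.0/n:.5f}, n={n})")
```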
770 00:44:48,040 --> 00:44:52,060 Except it does not imply convergence in mean square, 771 00:44:52,060 --> 00:44:55,190 which is a thing that requires a variance. 772 00:44:55,190 --> 00:44:59,230 So you can have convergence in probability without 773 00:44:59,230 --> 00:45:02,570 convergence in mean square, but not the other way. 774 00:45:02,570 --> 00:45:04,860 I mean, convergence in mean square, you just apply 775 00:45:04,860 --> 00:45:07,210 Chebyshev to it, and suddenly-- 776 00:45:07,210 --> 00:45:09,995 presto, you have convergence in probability. 777 00:45:16,960 --> 00:45:18,530 And incidentally, I wish all of you 778 00:45:18,530 --> 00:45:21,450 would ask more questions. 779 00:45:21,450 --> 00:45:24,720 Because we're taking this video, which is going to be 780 00:45:24,720 --> 00:45:28,860 shown to many people in many different countries. 781 00:45:28,860 --> 00:45:32,830 And they ask themselves, would it be better if I came to MIT 782 00:45:32,830 --> 00:45:35,450 and then I could sit in class and ask questions? 783 00:45:35,450 --> 00:45:37,760 And then they see these videos and they say, ah, it doesn't 784 00:45:37,760 --> 00:45:39,590 make any difference, nobody asks questions anyway. 785 00:45:42,820 --> 00:45:45,750 And because of that, MIT will simply wither 786 00:45:45,750 --> 00:45:47,810 away at some point. 787 00:45:47,810 --> 00:45:50,720 So it's very important for you to ask questions now and then. 788 00:45:55,470 --> 00:45:58,400 Now, let's go on to the central limit theorem. 789 00:46:03,900 --> 00:46:08,110 This sum of n IID random variables 790 00:46:08,110 --> 00:46:12,110 minus n times the mean-- 791 00:46:12,110 --> 00:46:17,010 in other words, we just normalized it to 0 mean. 792 00:46:17,010 --> 00:46:20,945 S sub n minus n x bar is a 0 mean random variable. 793 00:46:20,945 --> 00:46:23,660 And it has variance n times sigma squared. 794 00:46:23,660 --> 00:46:28,140 It also has second moment n times sigma squared. 795 00:46:28,140 --> 00:46:35,220 And what that means is that you take Sn minus n times the 796 00:46:35,220 --> 00:46:39,590 mean of x and divide it by the square root of n times sigma. 797 00:46:39,590 --> 00:46:41,490 What you get is something which is 0 798 00:46:41,490 --> 00:46:44,640 mean and unit variance. 799 00:46:44,640 --> 00:46:49,070 So as you keep increasing n, this random variable here, Sn 800 00:46:49,070 --> 00:46:53,550 minus n x bar over the square root of n sigma, just sits 801 00:46:53,550 --> 00:46:57,720 there rock solid with the same mean and the same variance, 802 00:46:57,720 --> 00:47:00,190 nothing ever happens to it. 803 00:47:00,190 --> 00:47:03,670 Except it has a distribution function, and the distribution 804 00:47:03,670 --> 00:47:05,020 function changes. 805 00:47:05,020 --> 00:47:08,220 I mean, you see the distribution function changing 806 00:47:08,220 --> 00:47:12,380 here as you let n get larger and larger. 807 00:47:12,380 --> 00:47:16,310 In some sense, these steps are getting smaller and smaller. 808 00:47:16,310 --> 00:47:19,680 So it looks like you're approaching some particular 809 00:47:19,680 --> 00:47:26,470 curve and when we looked at the Bernoulli case-- 810 00:47:26,470 --> 00:47:28,290 I guess it was just last time.
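As a rough illustration of the picture being described, one can simulate the Bernoulli case and watch the distribution function of the normalized sum settle onto a fixed curve; this is only a hypothetical sketch, not the figure from the slides, and the bias p = 1/4, the sample sizes, and the number of sample paths are arbitrary choices.

    import numpy as np
    from math import erf, sqrt

    # Sketch: empirical distribution function of the normalized sum
    # (S_n - n*p) / (sqrt(n)*sigma) for Bernoulli(p) trials, compared with the
    # standard normal distribution function Phi at a few points.
    rng = np.random.default_rng(0)
    p, paths = 0.25, 200_000        # bias and number of sample paths (arbitrary choices)
    sigma = sqrt(p * (1 - p))       # per-trial standard deviation

    def phi(z):
        # Standard normal distribution function, written with the error function.
        return 0.5 * (1.0 + erf(z / sqrt(2.0)))

    for n in (4, 20, 50, 200):
        s_n = rng.binomial(n, p, size=paths)         # S_n along many sample paths
        z_n = (s_n - n * p) / (sqrt(n) * sigma)      # normalized sums: mean 0, variance 1
        for z in (-1.0, 0.0, 1.0):
            emp = np.mean(z_n <= z)                  # empirical distribution function at z
            print(f"n={n:4d}  z={z:+.1f}  empirical={emp:.3f}  Phi(z)={phi(z):.3f}")

For every n the normalized sums have mean 0 and variance 1; what changes with n is how close their distribution function is to Phi, which is exactly the convergence being described.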
811 00:47:28,290 --> 00:47:32,080 When we looked at the Bernoulli case, what we saw is 812 00:47:32,080 --> 00:47:37,610 that these steps here were going as e to the minus 813 00:47:37,610 --> 00:47:42,050 difference from the mean squared divided by 2 times 814 00:47:42,050 --> 00:47:43,570 sigma squared. 815 00:47:43,570 --> 00:47:47,270 Bunch of stuff, but what we saw was that these steps were 816 00:47:47,270 --> 00:47:52,790 proportional to the density of a Gaussian. 817 00:47:52,790 --> 00:47:57,030 In other words, this curve that we're converging to is 818 00:47:57,030 --> 00:48:00,170 proportional to the distribution function of the 819 00:48:00,170 --> 00:48:02,050 Gaussian random variable. 820 00:48:02,050 --> 00:48:06,120 We didn't completely prove that because all we did was to 821 00:48:06,120 --> 00:48:08,590 show what happened to the PMF. 822 00:48:08,590 --> 00:48:11,500 We didn't really integrate these things. 823 00:48:11,500 --> 00:48:16,830 We didn't really deal with all of the small quantities. 824 00:48:16,830 --> 00:48:19,100 We said they weren't important. 825 00:48:19,100 --> 00:48:23,620 But you sort of got the picture of exactly why this 826 00:48:23,620 --> 00:48:26,860 convergence to a normal distribution 827 00:48:26,860 --> 00:48:29,900 function takes place. 828 00:48:29,900 --> 00:48:33,310 And the theorem says this more general thing that this 829 00:48:33,310 --> 00:48:37,340 convergence does, in fact, take place. 830 00:48:37,340 --> 00:48:41,310 And that is what the central limit theorem says. 831 00:48:41,310 --> 00:48:45,460 It says that that happens not only for the Bernoulli case, 832 00:48:45,460 --> 00:48:48,360 but it happens for all random variables, 833 00:48:48,360 --> 00:48:51,210 which have a variance. 834 00:48:51,210 --> 00:48:54,890 And the convergence is relatively good if the random 835 00:48:54,890 --> 00:48:57,110 variables have a certain moment; it 836 00:48:57,110 --> 00:48:58,575 can be awful otherwise. 837 00:49:08,450 --> 00:49:12,980 So this expression on top then is really the expression of 838 00:49:12,980 --> 00:49:15,320 the central limit theorem. 839 00:49:15,320 --> 00:49:24,320 It says not only does the normalized sample average-- 840 00:49:24,320 --> 00:49:28,290 I'll call this whole thing the normalized sample average 841 00:49:28,290 --> 00:49:33,700 because Sn minus n X bar has standard deviation square root of n 842 00:49:33,700 --> 00:49:35,350 times sigma sub x. 843 00:49:35,350 --> 00:49:40,900 So this normalized sample average has mean 0 and 844 00:49:40,900 --> 00:49:42,760 standard deviation 1. 845 00:49:42,760 --> 00:49:46,370 Not only does it have mean 0 and variance 1, but it also 846 00:49:46,370 --> 00:49:50,620 becomes closer and closer to this Gaussian distribution. 847 00:49:50,620 --> 00:49:51,870 Why is that important? 848 00:49:54,500 --> 00:49:57,020 Well, if you start studying noise and things like that, 849 00:49:57,020 --> 00:49:58,870 it's very important. 850 00:49:58,870 --> 00:50:03,290 Because it says that if you have the sum of lots and lots 851 00:50:03,290 --> 00:50:09,450 of very, very small, unimportant things, then what 852 00:50:09,450 --> 00:50:12,830 those things add up to if they're relatively independent 853 00:50:12,830 --> 00:50:15,310 is something which is almost Gaussian.
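The "expression on top" being referred to is, written out, the statement below; this is a reconstruction in standard notation rather than the slide itself. With $S_n = X_1 + \cdots + X_n$, the $X_i$ IID with mean $\bar{X}$ and standard deviation $\sigma < \infty$,

$$ \lim_{n \to \infty} \Pr\!\left\{ \frac{S_n - n\bar{X}}{\sqrt{n}\,\sigma} \le z \right\} \;=\; \Phi(z) \;=\; \int_{-\infty}^{z} \frac{1}{\sqrt{2\pi}}\, e^{-t^2/2}\, dt \qquad \text{for every } z . $$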
854 00:50:15,310 --> 00:50:22,080 You pick up a book on noise theory or you pick up a book 855 00:50:22,080 --> 00:50:26,360 which is on communication, or which is on control, or 856 00:50:26,360 --> 00:50:30,430 something like that, and after you read a few chapters, you 857 00:50:30,430 --> 00:50:35,550 get the idea that all random variables are Gaussian. 858 00:50:35,550 --> 00:50:38,040 This is particularly true if you look at books on 859 00:50:38,040 --> 00:50:39,630 statistics. 860 00:50:39,630 --> 00:50:42,160 Many, many books on statistics, particularly for 861 00:50:42,160 --> 00:50:45,360 undergraduates, the only random variable they ever talk 862 00:50:45,360 --> 00:50:48,610 about is the normal random variable. 863 00:50:48,610 --> 00:50:51,160 For some reason or other, you're led to believe that all 864 00:50:51,160 --> 00:50:53,860 random variables are Gaussian. 865 00:50:53,860 --> 00:50:55,300 Well, of course, they aren't. 866 00:50:55,300 --> 00:50:59,080 But this says that a lot of random variables, which are 867 00:50:59,080 --> 00:51:03,210 sums of large numbers of little things, in fact are 868 00:51:03,210 --> 00:51:04,540 close to Gaussian. 869 00:51:04,540 --> 00:51:06,925 But we're interested in it here for another reason. 870 00:51:11,540 --> 00:51:15,530 And we'll come to that in a little bit. 871 00:51:15,530 --> 00:51:19,160 But let me make the comment that the proofs that I gave 872 00:51:19,160 --> 00:51:22,150 you about the central limit theorem for 873 00:51:22,150 --> 00:51:24,650 the Bernoulli case-- 874 00:51:24,650 --> 00:51:27,300 and if you fill in those epsilons and deltas there, 875 00:51:27,300 --> 00:51:30,520 that really was a valid proof. 876 00:51:30,520 --> 00:51:34,270 That technique does not work at all when you have a 877 00:51:34,270 --> 00:51:36,180 non-Bernoulli situation. 878 00:51:36,180 --> 00:51:38,760 Because the situation is very, very complicated. 879 00:51:38,760 --> 00:51:41,400 You wind up-- 880 00:51:41,400 --> 00:51:43,740 I mean, if you have a Bernoulli case, you wind up 881 00:51:43,740 --> 00:51:46,800 with this nice, nice distribution, which says that 882 00:51:46,800 --> 00:51:51,390 every step in the Bernoulli distribution, you have terms 883 00:51:51,390 --> 00:51:54,420 that are increasing and then terms that are decreasing. 884 00:51:54,420 --> 00:51:57,030 If you look at what happens for a discrete random 885 00:51:57,030 --> 00:52:01,870 variable, which is not binary, you have the most god awful 886 00:52:01,870 --> 00:52:05,570 distribution if you try to look at the 887 00:52:05,570 --> 00:52:07,130 probability mass function. 888 00:52:07,130 --> 00:52:09,500 It is just awful. 889 00:52:09,500 --> 00:52:12,140 And the only thing which looks nice is the 890 00:52:12,140 --> 00:52:13,105 distribution function. 891 00:52:13,105 --> 00:52:16,110 The distribution function looks relatively nice. 892 00:52:16,110 --> 00:52:18,713 And why that is, is hard to tell. 893 00:52:18,713 --> 00:52:24,110 And if you look at proofs of it, it goes through Fourier 894 00:52:24,110 --> 00:52:25,090 transforms. 895 00:52:25,090 --> 00:52:28,330 In probability theory, Fourier transforms are called 896 00:52:28,330 --> 00:52:32,300 characteristic functions, but it's really the same thing. 897 00:52:32,300 --> 00:52:35,610 And you go through this very complicated argument. 898 00:52:35,610 --> 00:52:38,720 I've been through it a number of times. 899 00:52:38,720 --> 00:52:41,200 And to me, it's all algebra.
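For the curious, the algebra being alluded to compresses to the following characteristic-function calculation, with all of the error terms suppressed; a careful proof has to justify each step. With $Z_n = (S_n - n\bar{X})/(\sqrt{n}\,\sigma)$ and $W = (X - \bar{X})/\sigma$, so that $W$ has mean 0 and variance 1,

$$ \mathsf{E}\big[e^{itZ_n}\big] \;=\; \Big( \mathsf{E}\big[e^{itW/\sqrt{n}}\big] \Big)^{\!n} \;=\; \left( 1 - \frac{t^2}{2n} + o\!\Big(\frac{1}{n}\Big) \right)^{\!n} \;\longrightarrow\; e^{-t^2/2}, $$

which is the characteristic function of a normal random variable with mean 0 and variance 1; the continuity theorem for characteristic functions then converts this into convergence of the distribution functions.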
900 00:52:41,200 --> 00:52:44,340 And I'm not a person that just accepts the fact that 901 00:52:44,340 --> 00:52:46,690 something is all algebra easily. 902 00:52:46,690 --> 00:52:50,170 I keep trying to find ways of making sense out of it. 903 00:52:50,170 --> 00:52:52,700 And I've never been able to make sense out of it, but I'm 904 00:52:52,700 --> 00:52:54,400 convinced that it's true. 905 00:52:54,400 --> 00:52:58,180 So you just have to sort of live with that. 906 00:52:58,180 --> 00:53:06,550 So anyway, the central limit theorem does apply to the 907 00:53:06,550 --> 00:53:08,420 distribution function. 908 00:53:08,420 --> 00:53:11,090 Namely, exactly what that says. 909 00:53:11,090 --> 00:53:16,620 The distribution function of this normalized sample average 910 00:53:16,620 --> 00:53:21,330 does go into the distribution function of the Gaussian. 911 00:53:21,330 --> 00:53:26,240 The PMFs do not converge at all. 912 00:53:26,240 --> 00:53:30,410 And nothing else converges, but just that one thing. 913 00:53:30,410 --> 00:53:34,780 OK, a sequence of random variables converges in 914 00:53:34,780 --> 00:53:36,060 distribution-- 915 00:53:36,060 --> 00:53:38,450 this is what someone was asking about just a second 916 00:53:38,450 --> 00:53:43,080 ago, but I don't think you were really asking about that. 917 00:53:43,080 --> 00:53:46,980 But this is what convergence in distribution means. 918 00:53:46,980 --> 00:53:51,610 It means that the limit of the distribution functions of a 919 00:53:51,610 --> 00:53:54,240 sequence of random variables 920 00:53:54,240 --> 00:53:59,140 turns into the distribution function of some 921 00:53:59,140 --> 00:54:00,390 other random variable. 922 00:54:06,530 --> 00:54:09,990 Then you say that these random variables converge in 923 00:54:09,990 --> 00:54:14,930 distribution to Z. And that's a nice, useful thing. 924 00:54:14,930 --> 00:54:18,920 And the CLT, the Central Limit Theorem, then says that this 925 00:54:18,920 --> 00:54:24,990 normalized sample average converges in distribution to 926 00:54:24,990 --> 00:54:28,440 the distribution of a normal random variable. 927 00:54:28,440 --> 00:54:33,280 Many people call that density phi and call the normal 928 00:54:33,280 --> 00:54:34,650 distribution phi. 929 00:54:34,650 --> 00:54:35,280 I don't know why. 930 00:54:35,280 --> 00:54:39,600 I mean, you've got to call it something, so many people call 931 00:54:39,600 --> 00:54:40,850 it the same thing. 932 00:54:43,110 --> 00:54:46,660 This convergence in distribution is really almost 933 00:54:46,660 --> 00:54:48,450 a misnomer. 934 00:54:48,450 --> 00:54:52,240 Because when random variables converge in distribution to 935 00:54:52,240 --> 00:54:56,740 another random variable, I mean, if you say something 936 00:54:56,740 --> 00:55:03,290 converges, usually you have the idea that the thing which 937 00:55:03,290 --> 00:55:06,710 is converging to something else is getting close to it in 938 00:55:06,710 --> 00:55:08,190 some sense. 939 00:55:08,190 --> 00:55:11,240 And the random variables aren't getting close at all, 940 00:55:11,240 --> 00:55:12,885 it's only the distribution functions 941 00:55:12,885 --> 00:55:15,180 that are getting close. 942 00:55:15,180 --> 00:55:21,680 If I take a sequence of IID random variables, all of them 943 00:55:21,680 --> 00:55:24,910 have the same distribution function. 944 00:55:24,910 --> 00:55:27,430 And therefore, a sequence of IID 945 00:55:27,430 --> 00:55:30,250 random variables converges.
946 00:55:30,250 --> 00:55:35,040 And in fact, it's converged right from the beginning to 947 00:55:35,040 --> 00:55:37,100 the same random variable, to the same 948 00:55:37,100 --> 00:55:39,140 generic random variable. 949 00:55:39,140 --> 00:55:41,330 But they're not at all close to each other. 950 00:55:41,330 --> 00:55:46,130 But you still call this convergence in distribution. 951 00:55:46,130 --> 00:55:48,820 Why do we make such a big fuss about convergence in 952 00:55:48,820 --> 00:55:50,020 distribution? 953 00:55:50,020 --> 00:55:53,850 Well, primarily because of the central limit theorem because 954 00:55:53,850 --> 00:55:57,740 you would like to see that a sequence of random variables, 955 00:55:57,740 --> 00:56:01,600 in fact, starts to look like something that is interesting, 956 00:56:01,600 --> 00:56:04,250 which is the Gaussian random variable after a while. 957 00:56:04,250 --> 00:56:08,340 It says we can do these crazy things that 958 00:56:08,340 --> 00:56:11,910 statisticians do, and that-- 959 00:56:11,910 --> 00:56:14,390 well, fortunately, most communication theorists are a 960 00:56:14,390 --> 00:56:18,160 little more careful than statisticians. 961 00:56:18,160 --> 00:56:20,980 Somebody's going to hit me for saying that, but 962 00:56:20,980 --> 00:56:22,860 I think it's true. 963 00:56:22,860 --> 00:56:29,420 But the central limit theorem, in fact, does say that many of 964 00:56:29,420 --> 00:56:33,180 these sums of random variables you look at can be reasonably 965 00:56:33,180 --> 00:56:35,135 approximated as being Gaussian. 966 00:56:38,750 --> 00:56:44,730 So what we have now is convergence in probability 967 00:56:44,730 --> 00:56:46,550 implies convergence in distribution. 968 00:56:49,110 --> 00:56:51,620 And the proof, I will-- 969 00:56:51,620 --> 00:56:56,680 on the slides, I always abbreviate proof by Pf. 970 00:56:56,680 --> 00:57:00,190 And sometimes Pf is just what it sounds like, it's 971 00:57:00,190 --> 00:57:05,430 "poof." It is not quite a proof, and you have to look at 972 00:57:05,430 --> 00:57:07,790 those to get the actual proof. 973 00:57:07,790 --> 00:57:11,530 But this says that convergence of a sequence of Yn's in 974 00:57:11,530 --> 00:57:14,970 probability means that it converges to a unit step. 975 00:57:14,970 --> 00:57:17,360 That's exactly what convergence 976 00:57:17,360 --> 00:57:20,400 in probability means. 977 00:57:20,400 --> 00:57:24,860 It converges to a unit step, and it converges everywhere 978 00:57:24,860 --> 00:57:27,580 but at the step itself. 979 00:57:27,580 --> 00:57:30,040 If you look at the definition of convergence in 980 00:57:30,040 --> 00:57:33,970 distribution, and I might not have said it carefully enough 981 00:57:33,970 --> 00:57:38,630 when I defined it back here. 982 00:57:38,630 --> 00:57:40,310 Oh, yes, I did. 983 00:57:40,310 --> 00:57:42,590 Remarkable. 984 00:57:42,590 --> 00:57:46,130 Often I make up these slides when I'm half asleep, and they 985 00:57:46,130 --> 00:57:49,050 don't always say what I intended them to say. 986 00:57:49,050 --> 00:57:53,290 And my evil twin brother comes in and changes them later. 987 00:57:53,290 --> 00:57:54,430 But here I said it right.
988 00:57:54,430 --> 00:57:58,660 A sequence of random variables converges in distribution to 989 00:57:58,660 --> 00:58:05,910 another random variable Z if the limit of the distribution 990 00:58:05,910 --> 00:58:10,736 function is equal to the limit-- 991 00:58:10,736 --> 00:58:14,530 if the limit of the distribution function of the Z 992 00:58:14,530 --> 00:58:18,780 sub n is equal to the distribution function of Z. 993 00:58:18,780 --> 00:58:23,470 But it only says for all z where this distribution 994 00:58:23,470 --> 00:58:26,550 function is continuous. 995 00:58:26,550 --> 00:58:33,840 You can't really expect much more than that because if 996 00:58:33,840 --> 00:58:36,180 you're looking at a distribution-- 997 00:58:36,180 --> 00:58:38,380 if you're looking at a limiting distribution 998 00:58:38,380 --> 00:58:43,990 function, which looks like this, especially for the law 999 00:58:43,990 --> 00:58:50,080 of large numbers, all we've been able to show is that 1000 00:58:50,080 --> 00:58:56,930 these distributions come in down here very close, go up 1001 00:58:56,930 --> 00:58:59,660 and get out very close up there. 1002 00:58:59,660 --> 00:59:01,650 We haven't said anything about where they 1003 00:59:01,650 --> 00:59:03,680 cross this actual line. 1004 00:59:03,680 --> 00:59:07,160 And there's nothing in the argument about the weak law of 1005 00:59:07,160 --> 00:59:11,970 large numbers, which says anything about what happens 1006 00:59:11,970 --> 00:59:15,640 right exactly at the mean. 1007 00:59:15,640 --> 00:59:18,700 But that's something that the central 1008 00:59:18,700 --> 00:59:20,460 limit theorem says-- 1009 00:59:20,460 --> 00:59:21,632 Yes? 1010 00:59:21,632 --> 00:59:22,574 AUDIENCE: What's the Zn? 1011 00:59:22,574 --> 00:59:25,400 That's not the sample mean? 1012 00:59:31,370 --> 00:59:37,530 PROFESSOR: When we use the Zn's for the central limit 1013 00:59:37,530 --> 00:59:42,510 theorem, then what I mean by the Z sub n's here is those 1014 00:59:42,510 --> 00:59:47,380 normalized random variables Sn minus n x bar over square root 1015 00:59:47,380 --> 00:59:48,630 of n times sigma. 1016 00:59:50,760 --> 00:59:54,080 And in all of these definitions of convergence, 1017 00:59:54,080 --> 00:59:57,950 the random variables, which are converging to something, 1018 00:59:57,950 --> 00:59:59,890 are always rather peculiar. 1019 00:59:59,890 --> 01:00:02,660 Sometimes they're the sample averages. 1020 01:00:02,660 --> 01:00:06,840 Sometimes they're the normalized sample averages. 1021 01:00:06,840 --> 01:00:08,090 God knows what they are. 1022 01:00:10,980 --> 01:00:13,430 But what mathematicians like to do-- 1023 01:00:13,430 --> 01:00:15,950 and there's a good reason for what they like to do-- 1024 01:00:15,950 --> 01:00:20,340 is they like to define different kinds of convergence 1025 01:00:20,340 --> 01:00:25,620 in general terms, and then apply them to the specific 1026 01:00:25,620 --> 01:00:28,770 thing that you're interested in. 1027 01:00:28,770 --> 01:00:31,620 OK, so the central limit theorem says that this 1028 01:00:31,620 --> 01:00:38,610 normalized sum converges in distribution to phi, but it 1029 01:00:38,610 --> 01:00:42,180 only has to converge where the distribution function is 1030 01:00:42,180 --> 01:00:43,287 continuous. 1031 01:00:43,287 --> 01:00:45,275 Yes? 1032 01:00:45,275 --> 01:00:47,760 AUDIENCE: So the theorem applies to the distribution. 1033 01:00:47,760 --> 01:00:52,730 Why doesn't it apply to PMF?
1034 01:00:52,730 --> 01:01:02,560 PROFESSOR: Well, if you look at the example we have here, 1035 01:01:02,560 --> 01:01:10,400 if you look at the PDF for this normalized random 1036 01:01:10,400 --> 01:01:13,030 variable, you find something which is 1037 01:01:13,030 --> 01:01:15,550 jumping up, jumping up. 1038 01:01:15,550 --> 01:01:18,780 If we look at it for n equals 50, it's still jumping up. 1039 01:01:18,780 --> 01:01:20,760 The jumps are smaller. 1040 01:01:20,760 --> 01:01:23,150 But if you look at the PDF for this-- 1041 01:01:23,150 --> 01:01:28,810 well, if you look at the distribution function for the 1042 01:01:28,810 --> 01:01:30,530 normal, it has a density. 1043 01:01:34,260 --> 01:01:41,130 The distribution functions for the things which you want to 1044 01:01:41,130 --> 01:01:45,690 approach a limit never have a density. 1045 01:01:45,690 --> 01:01:47,760 All the time they have a PMF. 1046 01:01:47,760 --> 01:01:51,020 The steps are getting smaller and smaller. 1047 01:01:51,020 --> 01:01:54,390 And you can see that here as you're up to n equals 50. 1048 01:01:54,390 --> 01:01:57,740 You can see these little tiny steps here. 1049 01:01:57,740 --> 01:02:03,350 But you still have a PMF. 1050 01:02:03,350 --> 01:02:06,850 If you want to look at it in terms of a density, 1051 01:02:06,850 --> 01:02:09,980 you have to look at it in terms of impulses. 1052 01:02:09,980 --> 01:02:12,630 And there's no way you can say an impulse is starting to 1053 01:02:12,630 --> 01:02:14,765 approach a smooth curve. 1054 01:02:36,640 --> 01:02:39,610 OK, so we have this proof that convergence in probability 1055 01:02:39,610 --> 01:02:43,770 implies convergence in distribution. 1056 01:02:43,770 --> 01:02:47,680 And since convergence in mean square implies convergence in 1057 01:02:47,680 --> 01:02:51,430 probability, and convergence in probability implies 1058 01:02:51,430 --> 01:02:54,760 convergence in distribution, we suddenly have that 1059 01:02:54,760 --> 01:02:58,030 convergence in mean square implies convergence in 1060 01:02:58,030 --> 01:02:59,425 distribution also. 1061 01:02:59,425 --> 01:03:03,740 And you have this nice picture in the book of all the things 1062 01:03:03,740 --> 01:03:05,990 that converge in distribution. 1063 01:03:05,990 --> 01:03:07,420 Inside of that-- 1064 01:03:07,420 --> 01:03:10,250 this is distribution. 1065 01:03:10,250 --> 01:03:12,850 Inside of that is all the things that converge in 1066 01:03:12,850 --> 01:03:14,100 probability. 1067 01:03:17,440 --> 01:03:20,490 And inside of that is all the things that 1068 01:03:20,490 --> 01:03:21,740 converge in mean square. 1069 01:03:31,510 --> 01:03:34,200 Now, there's a paradox here. 1070 01:03:34,200 --> 01:03:38,590 And what the paradox is, is that the central limit theorem 1071 01:03:38,590 --> 01:03:43,470 says something very, very strong about how Sn over n-- 1072 01:03:43,470 --> 01:03:45,520 namely, the sample average-- 1073 01:03:45,520 --> 01:03:48,250 converges to the mean. 1074 01:03:48,250 --> 01:03:50,640 The convergence in distribution is a very weak 1075 01:03:50,640 --> 01:03:52,570 form of convergence. 1076 01:03:52,570 --> 01:03:55,610 So how is this weak form of convergence telling you 1077 01:03:55,610 --> 01:04:02,460 something so specific about how a sample average 1078 01:04:02,460 --> 01:04:03,680 converges to the mean? 1079 01:04:03,680 --> 01:04:05,750 It tells you much more than the weak law of 1080 01:04:05,750 --> 01:04:06,910 large numbers does.
1081 01:04:06,910 --> 01:04:09,780 Because it tells you that if you look at this thing, it's starting to 1082 01:04:09,780 --> 01:04:13,340 approach a normal distribution function. 1083 01:04:13,340 --> 01:04:15,970 And the resolution of that paradox-- 1084 01:04:15,970 --> 01:04:17,840 and this is important I think-- 1085 01:04:17,840 --> 01:04:21,470 is that the random variables that converge in distribution in 1086 01:04:21,470 --> 01:04:23,950 the central limit theorem are these 1087 01:04:23,950 --> 01:04:28,860 normalized random variables. 1088 01:04:28,860 --> 01:04:36,830 The ones that converge in probability are the things 1089 01:04:36,830 --> 01:04:40,030 which you are normalizing in terms of the mean, but you're not 1090 01:04:40,030 --> 01:04:43,330 normalizing them in terms of variance. 1091 01:04:43,330 --> 01:04:47,700 So when you look at one curve relative to the other curve, 1092 01:04:47,700 --> 01:04:51,500 one curve is a squashed down version of the other curve. 1093 01:04:51,500 --> 01:04:53,740 I mean, look at those pictures we have for that example. 1094 01:04:59,230 --> 01:05:03,980 If you look at a sequence of distribution functions for S 1095 01:05:03,980 --> 01:05:09,230 sub n over n, what you find is things which are squashing 1096 01:05:09,230 --> 01:05:13,520 down into a unit step. 1097 01:05:13,520 --> 01:05:17,100 If you look at what you have for the normalized random 1098 01:05:17,100 --> 01:05:24,990 variables, normalized to unit variance, what you have is 1099 01:05:24,990 --> 01:05:27,910 something which is not squashing down at all. 1100 01:05:27,910 --> 01:05:30,310 It gives the whole shape of the thing. 1101 01:05:30,310 --> 01:05:34,580 You can get from one curve to the other just by squashing or 1102 01:05:34,580 --> 01:05:37,290 expanding on the x-axis. 1103 01:05:37,290 --> 01:05:39,790 That's the only difference between them. 1104 01:05:39,790 --> 01:05:44,750 So the central limit theorem says when you don't squash, 1105 01:05:44,750 --> 01:05:49,340 you get this nice Gaussian distribution function. 1106 01:05:49,340 --> 01:05:54,040 The weak law of large numbers says when you do squash, you 1107 01:05:54,040 --> 01:05:56,010 get a unit step. 1108 01:05:56,010 --> 01:05:57,960 Now, which tells you more? 1109 01:05:57,960 --> 01:06:01,610 Well, if you have the central limit theorem, it tells you a 1110 01:06:01,610 --> 01:06:08,900 lot more because it says, if you look at this unit step 1111 01:06:08,900 --> 01:06:14,160 here, and you expand it out by a factor of square root of n, 1112 01:06:14,160 --> 01:06:19,340 what you're going to get is something that goes like this. 1113 01:06:19,340 --> 01:06:22,490 The central limit theorem tells you exactly what the 1114 01:06:22,490 --> 01:06:25,780 distribution function is at x bar. 1115 01:06:25,780 --> 01:06:30,960 It tells you that that's converging to what? 1116 01:06:30,960 --> 01:06:36,750 What's the probability that the sum of a large number of 1117 01:06:36,750 --> 01:06:39,500 random variables is greater than n times the mean? 1118 01:06:42,670 --> 01:06:44,790 What is it approximately? 1119 01:06:44,790 --> 01:06:45,090 AUDIENCE: 1/2. 1120 01:06:45,090 --> 01:06:46,340 PROFESSOR: 1/2. 1121 01:06:47,880 --> 01:06:50,370 That's what this says. 1122 01:06:50,370 --> 01:06:52,460 This is a distribution function. 1123 01:06:52,460 --> 01:06:54,400 It's converging to the distribution 1124 01:06:54,400 --> 01:06:57,240 function of the normal.
1125 01:06:57,240 --> 01:07:00,500 It hits that point, the normal is centered 1126 01:07:00,500 --> 01:07:02,980 on this x bar here. 1127 01:07:02,980 --> 01:07:05,280 And it hits that point exactly at 1/2. 1128 01:07:05,280 --> 01:07:08,800 This says the probability of being on that side is 1/2. 1129 01:07:08,800 --> 01:07:11,976 The probability of being on this side is 1/2. 1130 01:07:11,976 --> 01:07:15,500 So you see, the central limit theorem is telling you a whole 1131 01:07:15,500 --> 01:07:19,580 lot more about how this is converging than the weak law 1132 01:07:19,580 --> 01:07:21,980 of large numbers is. 1133 01:07:21,980 --> 01:07:23,440 Now, I come back to the question I asked 1134 01:07:23,440 --> 01:07:24,690 you a long time ago. 1135 01:07:27,760 --> 01:07:31,080 Why is the weak law of large numbers-- 1136 01:07:31,080 --> 01:07:35,170 why do you see it used more often than the central limit 1137 01:07:35,170 --> 01:07:37,005 theorem since it's so much less powerful? 1138 01:07:39,710 --> 01:07:42,010 Well, it's the same answer as before. 1139 01:07:42,010 --> 01:07:45,480 It's less powerful, but it applies to a much larger 1140 01:07:45,480 --> 01:07:47,390 number of cases. 1141 01:07:47,390 --> 01:07:51,620 And in many situations, all you want is that weaker 1142 01:07:51,620 --> 01:07:54,210 statement that tells you everything you want to know, 1143 01:07:54,210 --> 01:07:57,670 but it tells you that weaker statement for this enormous 1144 01:07:57,670 --> 01:08:01,820 variety of different situations. 1145 01:08:01,820 --> 01:08:05,980 Mean square convergence applies to fewer things. 1146 01:08:05,980 --> 01:08:08,570 Well, of course, convergence in distribution applies for 1147 01:08:08,570 --> 01:08:09,990 even more things. 1148 01:08:09,990 --> 01:08:13,080 But we saw that when you're dealing with the central limit 1149 01:08:13,080 --> 01:08:18,120 theorem, all bets are off on that because it's talking 1150 01:08:18,120 --> 01:08:21,109 about a different sequence of random variables, which might 1151 01:08:21,109 --> 01:08:23,430 or might not converge. 1152 01:08:23,430 --> 01:08:29,520 OK, so finally, convergence with probability 1. 1153 01:08:32,109 --> 01:08:38,060 Many people call convergence with probability 1 convergence 1154 01:08:38,060 --> 01:08:42,560 almost surely, or convergence almost everywhere. 1155 01:08:42,560 --> 01:08:45,640 You will see this almost everywhere. 1156 01:08:45,640 --> 01:08:50,330 Now, why do I want to use convergence with probability 1, 1157 01:08:50,330 --> 01:08:54,109 and why is that a dangerous thing to do? 1158 01:08:54,109 --> 01:08:58,270 When you say things are converging with probability 1, 1159 01:08:58,270 --> 01:09:01,770 it sounds very much like you're saying they converge in 1160 01:09:01,770 --> 01:09:03,970 probability because you're using the word 1161 01:09:03,970 --> 01:09:05,439 "probability" in each. 1162 01:09:05,439 --> 01:09:08,499 The two are very, very different concepts. 1163 01:09:11,790 --> 01:09:15,729 And therefore, it would seem like you should avoid the word 1164 01:09:15,729 --> 01:09:19,330 "probability" in this second one and say convergence almost 1165 01:09:19,330 --> 01:09:23,689 surely or convergence almost everywhere. 1166 01:09:23,689 --> 01:09:27,120 And the reason I don't like those is they don't make any 1167 01:09:27,120 --> 01:09:30,800 sense, unless you understand measure theory.
1168 01:09:30,800 --> 01:09:33,069 And we're not assuming that you understand 1169 01:09:33,069 --> 01:09:35,220 measure theory here. 1170 01:09:35,220 --> 01:09:37,720 If you wanted to do that first problem in the last problem 1171 01:09:37,720 --> 01:09:40,899 set, you had to understand measure theory. 1172 01:09:40,899 --> 01:09:45,555 And I apologize for that, I didn't mean to do that to you. 1173 01:09:45,555 --> 01:09:52,390 But this notion of convergence with probability 1, I think 1174 01:09:52,390 --> 01:09:53,649 you can understand that. 1175 01:09:53,649 --> 01:09:56,590 I think you can get a good sense of what it means without 1176 01:09:56,590 --> 01:09:57,950 knowing any measure theory. 1177 01:09:57,950 --> 01:10:01,050 And at least that's what we're trying to do. 1178 01:10:01,050 --> 01:10:04,180 OK, so let's go on. 1179 01:10:06,950 --> 01:10:09,410 We've already said that a random variable is a lot more 1180 01:10:09,410 --> 01:10:14,200 complicated thing than a number is. 1181 01:10:14,200 --> 01:10:16,060 I think for those of you who thought you understood 1182 01:10:16,060 --> 01:10:19,410 probability theory pretty well, I probably managed to 1183 01:10:19,410 --> 01:10:23,790 confuse you enough to get you to the point where you think 1184 01:10:23,790 --> 01:10:26,690 you're not on totally safe ground talking 1185 01:10:26,690 --> 01:10:28,330 about random variables. 1186 01:10:28,330 --> 01:10:31,310 And you're certainly not on very safe ground talking about 1187 01:10:31,310 --> 01:10:34,850 how random variables converge to each other. 1188 01:10:34,850 --> 01:10:38,960 And that's good because to reach a greater understanding 1189 01:10:38,960 --> 01:10:41,840 of something, you have to get to the point where you're a 1190 01:10:41,840 --> 01:10:44,090 little bit confused first. 1191 01:10:44,090 --> 01:10:46,910 So I've intentionally tried to-- 1192 01:10:46,910 --> 01:10:48,190 well, I haven't tried to make this more 1193 01:10:48,190 --> 01:10:51,140 confusing than necessary. 1194 01:10:51,140 --> 01:10:56,490 But in fact, it's not as simple as what elementary 1195 01:10:56,490 --> 01:10:59,320 courses would make you believe. 1196 01:10:59,320 --> 01:11:04,140 OK, this notion of convergence with probability 1, which we 1197 01:11:04,140 --> 01:11:13,140 abbreviate WP1, is something that we're not going to talk 1198 01:11:13,140 --> 01:11:17,030 about a great deal until we come to renewal processes. 1199 01:11:17,030 --> 01:11:19,880 And the reason is we won't need it a great deal until we 1200 01:11:19,880 --> 01:11:21,960 come to renewal processes. 1201 01:11:21,960 --> 01:11:25,150 But you ought to know that there's something like that 1202 01:11:25,150 --> 01:11:26,770 hanging out there. 1203 01:11:26,770 --> 01:11:29,580 And you ought to have some idea of what it is. 1204 01:11:29,580 --> 01:11:31,890 So here's the definition of it. 1205 01:11:31,890 --> 01:11:37,120 A sequence of random variables converges with probability 1 1206 01:11:37,120 --> 01:11:40,600 to some other random variable Z, all in 1207 01:11:40,600 --> 01:11:43,190 the same sample space, 1208 01:11:43,190 --> 01:11:46,920 if the probability of sample points omega in Omega-- 1209 01:11:46,920 --> 01:11:53,750 now remember that a sample point implies a value for each 1210 01:11:53,750 --> 01:11:56,560 one of these random variables.
1211 01:11:56,560 --> 01:12:00,270 So in a sense, you can think of a sample point as, more or 1212 01:12:00,270 --> 01:12:04,630 less, equivalent to a sample path of this sequence of 1213 01:12:04,630 --> 01:12:06,340 random variables here. 1214 01:12:06,340 --> 01:12:11,810 OK, so for omega in capital Omega, it says the limit as n 1215 01:12:11,810 --> 01:12:16,670 goes to infinity of these random variables at the point 1216 01:12:16,670 --> 01:12:25,690 omega is equal to what this extra random variable is at 1217 01:12:25,690 --> 01:12:26,210 the point-- 1218 01:12:26,210 --> 01:12:28,000 and it says that the probability of that whole 1219 01:12:28,000 --> 01:12:31,050 thing is equal to 1. 1220 01:12:31,050 --> 01:12:33,230 Now, how many of you can look at that statement 1221 01:12:33,230 --> 01:12:35,926 and see what it means? 1222 01:12:35,926 --> 01:12:39,550 Well, I'm sure some of you can because you've seen it before. 1223 01:12:39,550 --> 01:12:43,580 But understanding what that statement means, even though 1224 01:12:43,580 --> 01:12:47,810 it's a very simple statement, is not very easy. 1225 01:12:47,810 --> 01:12:49,790 So there's the statement up there. 1226 01:12:49,790 --> 01:12:52,000 Let's try to parse it. 1227 01:12:52,000 --> 01:12:56,470 In other words, break it down into what it's talking about. 1228 01:12:56,470 --> 01:13:01,210 For each sample point omega, that sample point is going to 1229 01:13:01,210 --> 01:13:04,960 map into a sequence of-- 1230 01:13:21,680 --> 01:13:29,390 so each sample point maps into this sample path of values for 1231 01:13:29,390 --> 01:13:31,210 this sequence of random variables. 1232 01:13:35,220 --> 01:13:37,590 Some of those sequences-- 1233 01:13:37,590 --> 01:13:40,985 OK, this now is a sequence of numbers. 1234 01:13:53,880 --> 01:13:58,790 So each omega goes into some sequence of numbers. 1235 01:13:58,790 --> 01:14:06,630 And the question is whether that is close to this 1236 01:14:06,630 --> 01:14:13,460 final generic random variable, capital Z evaluated at omega. 1237 01:14:13,460 --> 01:14:18,370 Now, some of these sequences here, sequences of real 1238 01:14:18,370 --> 01:14:24,470 numbers, we all know what a limit of real numbers is. 1239 01:14:24,470 --> 01:14:26,280 I hope you do. 1240 01:14:26,280 --> 01:14:28,690 I know that many of you don't. 1241 01:14:28,690 --> 01:14:31,230 And we'll talk about it later when we start talking about 1242 01:14:31,230 --> 01:14:33,000 the strong law of large numbers. 1243 01:14:33,000 --> 01:14:38,600 But this does, perhaps, have a limit. 1244 01:14:38,600 --> 01:14:40,780 It perhaps doesn't have a limit. 1245 01:14:40,780 --> 01:14:46,280 If you look at a sequence 1, 2, 1, 2, 1, 2, 1, 2, forever, 1246 01:14:46,280 --> 01:14:48,660 that doesn't have a limit because it doesn't start to 1247 01:14:48,660 --> 01:14:50,000 get close to anything. 1248 01:14:50,000 --> 01:14:53,060 It keeps wandering around forever. 1249 01:14:53,060 --> 01:15:00,670 If you look at a sequence which is 1 for 10 terms, then 1250 01:15:00,670 --> 01:15:08,200 it's 2 at the 11th term, and then it's 1 for 100 terms, 2 for 1 1251 01:15:08,200 --> 01:15:13,900 more term, 1 for 1,000 terms, then 2 for the next term, and 1252 01:15:13,900 --> 01:15:17,120 so forth, that's a much more tricky case. 1253 01:15:17,120 --> 01:15:21,130 Because in that sequence, pretty soon 1254 01:15:21,130 --> 01:15:22,210 all you see is 1's. 1255 01:15:22,210 --> 01:15:24,950 You look for an awful long way and you don't see any 2's.
1256 01:15:24,950 --> 01:15:28,340 That does not converge, just because of the definition of 1257 01:15:28,340 --> 01:15:30,160 convergence. 1258 01:15:30,160 --> 01:15:33,640 And when you work with convergence for a long time, 1259 01:15:33,640 --> 01:15:35,850 after a while you're very happy that that doesn't 1260 01:15:35,850 --> 01:15:39,230 converge because it would play all sorts of 1261 01:15:39,230 --> 01:15:42,300 havoc with all of analysis. 1262 01:15:42,300 --> 01:15:45,730 So anyway, there is this idea that these numbers either 1263 01:15:45,730 --> 01:15:47,710 converge or they don't converge. 1264 01:15:47,710 --> 01:15:51,600 When these numbers converge, they might or might not 1265 01:15:51,600 --> 01:15:53,500 converge to this. 1266 01:15:53,500 --> 01:15:58,350 So for every omega in this sample space, you have this 1267 01:15:58,350 --> 01:16:00,130 sequence here. 1268 01:16:00,130 --> 01:16:02,830 That sequence might converge. 1269 01:16:02,830 --> 01:16:05,460 If it does converge, it might converge to this or it might 1270 01:16:05,460 --> 01:16:08,070 converge to something else. 1271 01:16:08,070 --> 01:16:13,090 And what this is saying here is you take this entire set of 1272 01:16:13,090 --> 01:16:14,700 sequences here. 1273 01:16:14,700 --> 01:16:17,960 Namely, you take the entire set of omega. 1274 01:16:17,960 --> 01:16:21,440 And for each one of those omegas, this might or might 1275 01:16:21,440 --> 01:16:22,800 not converge. 1276 01:16:22,800 --> 01:16:26,700 You look at the set of omega for which this 1277 01:16:26,700 --> 01:16:28,590 sequence does converge. 1278 01:16:28,590 --> 01:16:30,810 And for which it does converge to Z of omega. 1279 01:16:34,210 --> 01:16:41,950 And now you look at that set and what convergence with 1280 01:16:41,950 --> 01:16:47,010 probability 1 means is that that set turns out to be an 1281 01:16:47,010 --> 01:16:53,720 event, and that event turns out to have probability 1. 1282 01:16:53,720 --> 01:16:58,370 Which says that for almost everything that happens, you 1283 01:16:58,370 --> 01:17:01,500 look at this sample sequence and it has a 1284 01:17:01,500 --> 01:17:05,260 limit, which is this. 1285 01:17:05,260 --> 01:17:08,050 And that's true with probability 1. 1286 01:17:08,050 --> 01:17:10,570 It's not true for most sequences. 1287 01:17:10,570 --> 01:17:15,180 Let me give you a very quick and simple example. 1288 01:17:15,180 --> 01:17:20,620 Look at this Bernoulli case, and suppose the probability of 1289 01:17:20,620 --> 01:17:24,830 a 1 is one quarter and the probability of a 0 is 3/4. 1290 01:17:24,830 --> 01:17:28,450 Look at what happens when you take an extraordinarily large 1291 01:17:28,450 --> 01:17:36,620 number of trials and you ask for those sample sequences 1292 01:17:36,620 --> 01:17:38,880 that you take. 1293 01:17:38,880 --> 01:17:41,800 What's going to happen to them? 1294 01:17:41,800 --> 01:17:46,760 Well, this says that if you look at the relative 1295 01:17:46,760 --> 01:17:50,330 frequency of 1's-- 1296 01:17:50,330 --> 01:17:57,220 well, if they converge in this sense, if the set of relative 1297 01:17:57,220 --> 01:17:59,330 frequencies-- 1298 01:17:59,330 --> 01:18:00,440 excuse me. 1299 01:18:00,440 --> 01:18:06,370 If the set of sample averages converges in this sense, then 1300 01:18:06,370 --> 01:18:12,960 it says with probability 1, that sample average is going 1301 01:18:12,960 --> 01:18:16,750 to converge to one quarter.
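To put numbers on the comparison that's about to be made: for these Bernoulli trials with p = 1/4, any one particular binary sequence of length n that contains k ones has probability $(1/4)^k (3/4)^{n-k}$, so

$$ \Pr\{\text{a specific sequence with } k = n/4 \text{ ones}\} = \Big[(\tfrac{1}{4})^{1/4} (\tfrac{3}{4})^{3/4}\Big]^{n} \approx (0.57)^{n}, \qquad \Pr\{\text{a specific sequence with } k = n/2 \text{ ones}\} = \Big[(\tfrac{1}{4})^{1/2} (\tfrac{3}{4})^{1/2}\Big]^{n} \approx (0.43)^{n}. $$

Each individual sequence with a quarter 1's is therefore more likely than each individual sequence with half 1's by a factor that grows exponentially in n, even though there are far more sequences of the latter kind; this rough calculation is only meant to illustrate the point made next.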
1302 01:18:16,750 --> 01:18:19,340 Now, that doesn't mean that most sequences are going to 1303 01:18:19,340 --> 01:18:21,790 converge that way. 1304 01:18:21,790 --> 01:18:26,810 Because most sequences are going to converge to 1/2. 1305 01:18:26,810 --> 01:18:31,010 There are many more sequences with half 1's and half 0's 1306 01:18:31,010 --> 01:18:33,370 than there are with three quarter 1's 1307 01:18:33,370 --> 01:18:37,690 and one quarter 0's-- 1308 01:18:37,690 --> 01:18:40,270 with three quarter 0's and one quarter 1's. 1309 01:18:40,270 --> 01:18:42,980 So there are many more of one than the other. 1310 01:18:42,980 --> 01:18:48,090 But those particular sequences, which have 1311 01:18:48,090 --> 01:18:49,640 probability-- 1312 01:18:49,640 --> 01:18:54,650 those particular sequences which have relative frequency 1313 01:18:54,650 --> 01:19:00,110 one quarter are much more likely than those which have 1314 01:19:00,110 --> 01:19:02,290 relative frequency one half. 1315 01:19:02,290 --> 01:19:04,450 Because the ones with relative frequency 1316 01:19:04,450 --> 01:19:06,135 one half are so unlikely. 1317 01:19:09,180 --> 01:19:11,670 That's a complicated set of ideas. 1318 01:19:11,670 --> 01:19:14,860 You sort of know that it has to be true because of these 1319 01:19:14,860 --> 01:19:16,960 other laws of large numbers. 1320 01:19:16,960 --> 01:19:20,630 And this is simply extending those laws of large numbers 1321 01:19:20,630 --> 01:19:25,870 one more step to say not only does the distribution function 1322 01:19:25,870 --> 01:19:28,420 of that sample average converge to what it should 1323 01:19:28,420 --> 01:19:33,750 converge to, but also for these sequences with 1324 01:19:33,750 --> 01:19:36,380 probability 1, they converge. 1325 01:19:36,380 --> 01:19:40,450 If you look at the sequence for long enough, the sample 1326 01:19:40,450 --> 01:19:43,720 average is going to converge to what it should for that one 1327 01:19:43,720 --> 01:19:45,525 particular sample sequence. 1328 01:19:50,030 --> 01:19:55,510 OK, the strong law of large numbers then says that if X1, 1329 01:19:55,510 --> 01:20:02,060 X2, and so forth are IID random variables, and they 1330 01:20:02,060 --> 01:20:06,210 have an expected value, which is less than infinity, then 1331 01:20:06,210 --> 01:20:11,950 the sample average converges to the actual average with 1332 01:20:11,950 --> 01:20:13,550 probability 1. 1333 01:20:13,550 --> 01:20:16,910 In other words, it says with probability 1, you look at 1334 01:20:16,910 --> 01:20:18,640 this sequence forever. 1335 01:20:18,640 --> 01:20:20,740 I don't know how you look at a sequence forever. 1336 01:20:20,740 --> 01:20:22,270 I've never figured that out. 1337 01:20:22,270 --> 01:20:25,930 But if you could look at it forever, then with probability 1338 01:20:25,930 --> 01:20:30,580 1, it would come out with the right relative frequency. 1339 01:20:30,580 --> 01:20:34,870 It'll take a lot of investment of time when we get to chapter 1340 01:20:34,870 --> 01:20:38,160 4 to try to sort that out. 1341 01:20:38,160 --> 01:20:40,765 I wanted to tell you a little bit about it; read about it in 1342 01:20:40,765 --> 01:20:43,920 the notes a little bit in chapter 1, and we will 1343 01:20:43,920 --> 01:20:45,590 come back to it. 1344 01:20:45,590 --> 01:20:48,520 And I think you will then understand it. 1345 01:20:48,520 --> 01:20:53,770 OK, with that, we are done with chapter 1. 1346 01:20:53,770 --> 01:20:57,600 Next time we will go into Poisson processes.
1347 01:20:57,600 --> 01:21:03,230 If you're upset by all of the abstraction in chapter 1, you 1348 01:21:03,230 --> 01:21:06,780 will be very happy when we get into Poisson processes because 1349 01:21:06,780 --> 01:21:08,870 there's nothing abstract there at all. 1350 01:21:08,870 --> 01:21:11,340 Everything you could reasonably say about Poisson 1351 01:21:11,340 --> 01:21:16,790 processes is either obviously true or obviously false. 1352 01:21:16,790 --> 01:21:19,810 I mean, there's nothing that's strange there at all. 1353 01:21:19,810 --> 01:21:23,100 After you understand it, everything works. 1354 01:21:23,100 --> 01:21:24,740 So we'll do that next time.