The following content is provided under a Creative Commons license. Your support will help MIT OpenCourseWare continue to offer high quality educational resources for free. To make a donation or view additional materials from hundreds of MIT courses, visit MIT OpenCourseWare at ocw.mit.edu.

PROFESSOR: So today, we're going to talk about the probability that a random variable deviates by a certain amount from its expectation.

Now, we've seen examples where a random variable is very unlikely to deviate much from its expectation. For example, if you flip 100 mutually independent fair coins, you're very likely to wind up with close to 50 heads, and very unlikely to wind up with 25 or fewer heads. We've also seen examples of distributions where you are very likely to be far from your expectation, for example, that problem where we had the communications channel and we were measuring the latency of a packet crossing the channel. There, most of the time your latency would be 10 milliseconds, but the expected latency was infinite. So you're very likely to deviate a lot from your expectation in that case.

Last time, we looked at the variance. And we saw how that gave us some feel for the likelihood of being far from the expectation, high variance meaning you're more likely to deviate from the expectation. Today, we're going to develop specific tools for bounding, or limiting, the probability that you deviate by a specified amount from the expectation.

And the first tool is known as Markov's theorem. Markov's theorem says that if a random variable is always non-negative, then it is unlikely to greatly exceed its expectation. In particular, if R is a non-negative random variable, then for all x bigger than 0, the probability that R is at least x is at most the expected value of R, the mean, divided by x.

So in other words, if R is never negative and, say, the expected value is small, then the probability that R is large will be a small number.
Because I'll have a small number over a big number. So it says that you are unlikely to greatly exceed the expected value. So let's prove that.

Now, from the theorem of total expectation that you did in recitation last week, we can compute the expected value of R by looking at two cases: the case when R is at least x, and the case when R is less than x. That's from the theorem of total expectation. I look at two cases. I take the expected value of R given that R is at least x, times the probability of that case happening, plus the same thing for the case when R is less than x.

OK, now since R is non-negative, the second term is at least 0. R can't ever be negative, so its expectation can't be negative, and a probability can't be negative. So that term is at least 0. And the first expectation is trivially at least x, because I'm taking the expected value of R in the case when R is at least x. So R is always at least x in this case, so its expected value is at least x.

So that means that the expected value of R is at least x times the probability that R is greater than or equal to x. And now I can get the theorem by just dividing by x: the probability that R is at least x is less than or equal to the expected value of R divided by x.

So it's a very easy theorem to prove. But it's going to have amazing consequences that we're going to build up through a series of results today. Any questions about Markov's theorem and the proof?

All right, there's a simple corollary, which is useful. Again, if R is a non-negative random variable, then for all c bigger than 0, the probability that R is at least c times its expected value is at most 1 over c. So the probability that you're at least twice your expected value is at most 1/2. And the proof is very easy. We just set x to be equal to c times the expected value of R in the theorem. So I just plug in x is c times the expected value of R, and I get the expected value of R over c times the expected value of R, which is 1/c. So you just plug that value into Markov's theorem, and it comes out.
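Here is a minimal sketch, in Python, that checks Markov's bound numerically. The exponential distribution with mean 10 and the sample size are arbitrary choices for illustration; the bound itself is distribution-free for non-negative random variables.

```python
import random

random.seed(0)
# Arbitrary non-negative distribution for illustration: exponential, E[R] = 10.
samples = [random.expovariate(1 / 10) for _ in range(100_000)]
mean = sum(samples) / len(samples)

for x in [10, 20, 50, 100]:
    empirical = sum(s >= x for s in samples) / len(samples)
    bound = mean / x  # Markov: P(R >= x) <= E[R] / x
    print(f"x = {x:>3}: empirical {empirical:.4f} <= bound {bound:.4f}")
```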
All right, let's do some examples. Let's let R be the weight of a random person, uniformly selected. And I don't know what the distribution of weights is in the country. But suppose that the expected value of R, which is the average weight, is 100 pounds. So if I average over all people, their weight is 100 pounds. And suppose I want to know the probability that the random person weighs at least 200 pounds.

What can I say about that probability? Do I know it exactly? I don't think so, because I don't know what the distribution of weights is. But I can still get an upper bound on this probability. What bound can I get on the probability that a random person has a weight of at least 200, given the facts here? Yeah.

AUDIENCE: [INAUDIBLE]

PROFESSOR: Yes, well, it's 100 over 200, right. It's at most the expected value, which is 100, over the x, which is 200. And that's equal to 1/2. So the probability that a random person weighs 200 pounds or more is at most 1/2. Or I could plug it into the corollary. The expected value is 100, and 200 is twice that, so c would be 2 here. So the probability of being twice the expectation is at most 1/2. Now of course, I'm using the fact that weight is never negative. That's obviously true, but it is implicitly being used here.

So what fraction of the population can weigh at least 200 pounds? Slightly different question. Before, I asked you: if I take a random person, what's the probability they weigh at least 200 pounds? Now I'm asking: what fraction of the population can weigh at least 200 pounds if the average is 100? What is it? Yeah?

AUDIENCE: At most 1/2.

PROFESSOR: At most 1/2. In fact, it's the same answer. And why?
Why can't everybody weigh at least 200 pounds, so that the whole population weighs at least 200 pounds?

AUDIENCE: [INAUDIBLE]

PROFESSOR: The probability would be 1, and that can't happen. And in fact, intuitively, if everybody weighs at least 200 pounds, the average is going to be at least 200 pounds. And we said the average was 100. And this is illustrating this interesting thing, that probability implies things about averages and fractions, because it's really the same thing in disguise. The connection is, if I've got a bunch of people, say, in the country, I can convert a fraction that have some property into a probability by just selecting a random person. Yeah.

AUDIENCE: [INAUDIBLE]

PROFESSOR: No, the variance could be very big, because I might have a person that weighs a million pounds, say. So you'd have to get into that, but it gets a little bit more complicated. Yeah.

AUDIENCE: [INAUDIBLE]

PROFESSOR: No, there's nothing being assumed about the distribution, nothing at all, OK? So that's the beauty of Markov's theorem. Well, I've assumed one thing: I assume that there are no negative values. That's it.

AUDIENCE: [INAUDIBLE]

PROFESSOR: That's correct. The weights can be distributed any way at all with non-negative values. But we have a fact here we've used, that the average was 100. So that does limit your distribution. In other words, you couldn't have a distribution where everybody weighs 200 pounds, because then the average would be 200, not 100. But for anything else where the weights are all non-negative and they average 100, you know that at most half can be 200 or more. Because if you pick a random one, the probability of getting one that's at least 200 is at most 1/2, which follows from Markov's theorem. And that's partly why it's so powerful. You didn't know anything about the distribution, really, except its expectation and that it was non-negative. Any other questions about this?
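Here is a minimal sketch of the extremal case for the weight example: a hypothetical population with half the people at 0 pounds and half at 200 pounds. The weights are unrealistic, but the point is that the bound can be met exactly.

```python
# A population of 1,000 hypothetical people: half weigh 0, half weigh 200.
weights = [0] * 500 + [200] * 500
mean = sum(weights) / len(weights)                        # 100
fraction = sum(w >= 200 for w in weights) / len(weights)  # exactly 1/2
print(mean, fraction, mean / 200)  # Markov's bound E[R]/200 = 1/2 is met exactly
```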
I'll give you some more examples. All right, here's another one. Is it possible on the final exam for everybody in the class to do better than the mean score? No, of course not. Because if they did, the mean would be higher, because the mean is the average.

OK, let's do another example. Remember the Chinese appetizer problem? You're at the restaurant, at a big circular table. There are n people at the table, and everybody has one appetizer in front of them. And then the joker spins the thing in the middle of the table, so it goes around and around, and it stops in a uniformly random position. And we wanted to know, what's the expected number of people to get the right appetizer back? What was the answer? Does anybody remember? One. So you expect one person to get the right appetizer back.

Well, say I want to know the probability that all n people get the right appetizer back. What does Markov tell you about the probability that all n people get the right appetizer back? 1/n. The expected value is 1, and now you're asking the probability that R is at least n. So x is n, so the bound is 1 in n.

And what is the actual probability? In this case, you know the distribution. The probability that everybody gets the right appetizer back, all n, is 1 in n. So in the case of the Chinese appetizer problem, Markov's bound is actually the right answer, right on target, which gives you an example where you can't improve it. By itself, if you just know the expected value, there's no stronger theorem that way. Because the Chinese appetizer problem is an example where the bound you get, 1/n, for all n people getting the right appetizer, is in fact the true probability.
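Here is a minimal sketch simulating the appetizer spin; the table size n and the trial count are arbitrary choices. Everyone gets the right appetizer exactly when the random rotation is zero, so the simulated probability comes out near 1/n, matching Markov's bound.

```python
import random

random.seed(1)
n, trials = 10, 100_000  # arbitrary table size and trial count
all_correct = 0
for _ in range(trials):
    # A uniformly random rotation of the table; everyone matches iff it is 0.
    if random.randrange(n) == 0:
        all_correct += 1
print(all_correct / trials, "vs 1/n =", 1 / n)
```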
OK, what about the hat check problem? Remember that? There are n men who put their hats in the coat closet, and the hats get uniformly randomly scrambled. So it's a random permutation applied to the hats. Now each man gets a hat back. What's the expected number of men to get the right hat back? One, same as the other problem. Because you've got n men, each with a 1 in n chance, so it's 1.

Markov says the probability that all n men get the right hat back is at most 1 in n, same as before. What's the actual probability that all n men get the right hat back?

AUDIENCE: [INAUDIBLE]

PROFESSOR: 1 in n factorial. So in this case, Markov is way off the mark. It says 1 in n, but in fact the real probability is much smaller. So Markov is not always tight. It's always an upper bound, but it sometimes is not the right answer. And to get the right answer, often you need to know more about the distribution.

OK, what if R can be negative? Is it possible that Markov's theorem still holds? Because I used the non-negativity assumption in the theorem. Can anybody give me an example where it doesn't work if R can be negative?

AUDIENCE: [INAUDIBLE]

PROFESSOR: Yeah, good. So for example, say the probability that R equals 1,000 is 1/2, and the probability that R equals minus 1,000 is 1/2. Then the expected value of R is 0. And say we ask the probability that R is at least 1,000. Well, that's going to be 1/2. But that does not equal the expected value of R over 1,000, which would be 0. So Markov's theorem really does need R to be non-negative.

In fact, let's see where we used that in the proof. Anybody see where we used the fact that R can't be negative? What is it?

AUDIENCE: [INAUDIBLE]

PROFESSOR: Well, no, because x is positive. We said x is positive, so it's not used there. But that's a good one to look at. Yeah?

AUDIENCE: [INAUDIBLE] is greater than or equal to 0.

PROFESSOR: Yeah, if R can be negative, then the expected value in the case when R is less than x is not necessarily non-negative. It could be a negative number, and then the inequality doesn't hold. OK, good.
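Here is a minimal sketch of that counterexample as a two-point distribution, with the numbers from the lecture, showing that Markov's formula would claim a bound of 0 while the true probability is 1/2.

```python
# R is +1000 or -1000, each with probability 1/2 (the lecture's numbers).
values, probs = [1000, -1000], [0.5, 0.5]
expectation = sum(v * p for v, p in zip(values, probs))     # 0
p_big = sum(p for v, p in zip(values, probs) if v >= 1000)  # 0.5
print(expectation / 1000, "claimed bound vs actual", p_big) # 0.0 vs 0.5: fails
```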
All right, now it turns out there is a variation of Markov's theorem you can use when R can be negative. Yeah.

AUDIENCE: [INAUDIBLE] but would it be OK just to shift everything up?

PROFESSOR: Yeah, yeah, that's great. If R has a limit on how negative it can be, then you make an R prime, which just adds that limit to R, making it non-negative. And now you use Markov's theorem there. And that gives an analogous form of Markov's theorem for when R can be negative but there's a lower limit on it. I won't stop to prove that here, but it's in the text, and it's something you want to be familiar with.

What I do want to do in class is another case where you can use Markov's theorem to analyze, or upper bound, the probability that R is very small, less than its expectation. And it's the same idea as you just suggested. So let's state that.

If R is upper bounded, has a hard upper limit u, for some u in the real numbers, then for all x less than u, the probability that R is less than or equal to x is at most u minus the expected value of R, over u minus x. So in this case, we're bounding the probability that R is less than something, instead of R being bigger than something.

And we're going to do it using a simple trick that we'll be sort of using all day, really. The event that R is less than or equal to x is the same as the event that u minus R is at least u minus x. So what have I done? I put negative R over here, subtracted x from each side, and added u to each side. So R is less than or equal to x if and only if u minus R is at least u minus x. It's simple math there.

And now I'm going to apply Markov to the random variable u minus R, and u minus x will play the role of x in Markov's theorem. Why is it OK to apply Markov to u minus R?
AUDIENCE: You could just define the new random variable to be u minus R.

PROFESSOR: Yeah, so I've got a new random variable. But what do I need to know about that new random variable to apply Markov?

AUDIENCE: u is always greater than R.

PROFESSOR: u is always greater than R, or at least as big as R. So u minus R is always non-negative, and I can apply Markov now. When I apply Markov, the probability that u minus R is at least u minus x is at most the expected value of that random variable, u minus R, over u minus x. And now I just use the linearity of expectation. u is a scalar here, so this is u minus the expected value of R, over u minus x. So I've used Markov's theorem to get a different version of it.

All right, let's do an example. Say I'm looking at test scores, and I'll let R be the score of a random student, uniformly selected. And say that the max score is 100, so that's u. All scores are at most 100. And say that I tell you the class average, the expected value of R, is 75. And now I want to know, what's the probability that a random student scores 50 or below?

Can we figure that out? I don't know anything about the distribution, just that the max score is 100 and the average score is 75. What's the probability that a random student scores 50 or less? I want to upper bound that. So we just plug it into the formula. u is 100, the expected value is 75, and x is 50. And that's 25 over 50, which is 1/2.

So at most half the class can score 50 or below. And I can state that as a probability question or as a deterministic fact, if I know the average is 75 and the max is 100.
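Here is a minimal sketch of the two bounds as small helpers, applied to the test-score numbers from the lecture (u = 100, mean 75, x = 50). The function names are made up for illustration.

```python
def markov_high(mean, x):
    """P(R >= x) <= E[R] / x, for R always non-negative."""
    return mean / x

def markov_low(mean, u, x):
    """P(R <= x) <= (u - E[R]) / (u - x), for R always at most u."""
    return (u - mean) / (u - x)

print(markov_low(mean=75, u=100, x=50))  # (100 - 75) / (100 - 50) = 0.5
```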
Of course, another way of thinking about that is that if more than half the class scored 50 or below, the average would have had to be lower, even if everybody else was right at 100. It wouldn't average out to 75.

All right, any questions about that? OK, so sometimes Markov is dead-on right, gives the right answer. For example, half the class could have scored 50 and half could have gotten 100, to make the average be 75. And sometimes it's way off, like in the hat check problem.

Now, if you know more about the distribution, then you can get better bounds, especially in the cases where Markov is far off. For example, if you know the variance in addition to the expectation, then you can get better bounds on the probability that the random variable is large. And in this case, the result is known as Chebyshev's theorem. I'll do that over here. It's the analog of Markov's theorem based on variance.

It says: for all x bigger than 0, and any random variable R, which could even be negative, the probability that R deviates from its expected value in either direction by at least x is at most the variance of R divided by x squared. So this is like Markov's theorem, except that we're now bounding the deviation in either direction. Instead of the expected value, you have the variance; instead of x, you've got x squared. But it's the same idea. In fact, the proof uses Markov's theorem.

Well, the probability that R deviates from its expected value by at least x: this is the same event as, or happens if and only if, R minus the expected value, squared, is at least x squared. I'm just squaring both sides here. And since both sides are non-negative, I can square both sides and maintain the inequality. Now I'm going to apply Markov's theorem to that random variable. It is a random variable: R minus the expected value, squared.
And what's nice about this random variable that lets me apply Markov's theorem? It's a square, so it's always non-negative. So I can apply Markov's theorem. And by Markov's theorem, this probability is at most the expected value of that square divided by x squared. That's what Markov's theorem says, as long as the random variable is always non-negative.

All right, what's a simpler expression for this, the expected value of the square of the deviation of a random variable? That's the variance. That's the definition of the variance. So the bound is just the variance of R over x squared, and we're done. So Chebyshev's theorem is really just another version of Markov's theorem, but now it's based on the variance. OK, any questions?

OK, so there's a nice corollary for this, just as with Markov's theorem. It says the probability that the absolute value of the deviation is at least c times the standard deviation of R is at most 1 over c squared. So I'm looking at the probability that R differs from its expectation by at least some scalar c times the standard deviation. Well, what's that? That's the variance of R over the square of this thing, c squared times the standard deviation squared. And what's the square of the standard deviation? That's the variance. They cancel, so it's just 1 over c squared.

So the probability of being more than twice the standard deviation off the expectation is at most 1/4, for example. All right, let's do some examples of that.
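Here is a minimal sketch checking Chebyshev's bound numerically; the uniform distribution on [0, 1], with mean 1/2 and variance 1/12, is an arbitrary choice for illustration.

```python
import random

random.seed(2)
# Arbitrary distribution: uniform on [0, 1], mean 1/2, variance 1/12.
samples = [random.random() for _ in range(100_000)]
mu, var = 0.5, 1 / 12

for x in [0.3, 0.4, 0.45]:
    empirical = sum(abs(s - mu) >= x for s in samples) / len(samples)
    bound = min(1.0, var / x**2)  # Chebyshev: P(|R - E[R]| >= x) <= Var(R)/x^2
    print(f"x = {x}: empirical {empirical:.4f} <= bound {bound:.4f}")
```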
Maybe we'll leave Markov up there. OK, say we're looking at IQs. In this case, we're going to let R be the IQ of a random person. Now we're going to assume, and this actually is the case, that R is always at least 0, despite the fact that probably most of you know somebody who you think has a negative IQ. IQs can't be negative; they have to be non-negative. In fact, IQs are adjusted so that the expected IQ is supposed to be 100, although actually the averages may be in the 90s. And it's set up so that the standard deviation of IQ is supposed to be 15. So we're just going to assume those are facts about IQ; that's what it's meant to be.

And now we want to know, what's the probability a random person has an IQ of at least 250? Now Marilyn, from "Ask Marilyn," has an IQ pretty close to 250, and she thinks that's pretty special, pretty rare. So what can we say about that? In particular, say we used Markov. What could you say about the probability of having an IQ of at least 250? What does Markov tell us?

AUDIENCE: [INAUDIBLE]

PROFESSOR: What is it?

AUDIENCE: [INAUDIBLE]

PROFESSOR: Not quite 1 in 25, but you're on the right track. And it's not quite 2/3. It's the expected value, which is 100, over the x value, which is 250. So it's 1 in 2.5, or 0.4. So the probability is at most 0.4, a 40% chance. It could happen, potentially, but no bigger than that.

What about Chebyshev? See if you can figure out what Chebyshev says about the probability of having an IQ of at least 250. It's a little tricky; you've got to plug it into the equation there and get it to fit the right form.

Let's get it into the right form. I've got the probability that R is at least 250, and I've got to get it into that form up there. So that's the probability that R minus 100 is at least 150. So I've got R minus the expected value; I'm getting it ready to apply Chebyshev here. And 150 is how many standard deviations? 10, all right? So this is the probability that R minus the expected value of R is at least 10 standard deviations. That's what I'm asking.

I'm not quite there. I'm going to use the corollary, but I've got to get that absolute value in. Well, this is upper bounded by the probability that the absolute value of R minus the expected value is greater than or equal to 10 standard deviations.
Because this allows for two cases: R is 10 standard deviations high, and R is 10 standard deviations or more low. So the one-sided probability is upper bounded by the two-sided one. And now I can plug in Chebyshev in the corollary form. And what's the answer when I do that? 1 in 100. The probability of being off by 10 standard deviations or more is at most 1 in 100, 1 in 10 squared.

So it's a lot better bound: 1% instead of 40%. So knowing the variance, or the standard deviation, gives you a lot more information and generally gives you much better bounds on the probability of deviating from the mean. And the reason it gives you better bounds is that the variance squares the deviations, so they count a lot more.

All right, now let's look at this step a little bit more. Let's say here is a line, and here's the expected value of R. And say here's 10 standard deviations on the high side, so beyond this point you're more than 10 standard deviations high. And over here is 10 standard deviations on the low side; here, I'm low. Now, this statement with the absolute value is figuring out the probability of being low or high. It's the probability that the absolute value of R minus its expected value is at least 10 standard deviations. What we really wanted to bound was just the high side.

Now, is it true that, since the probability of high or low is at most 1 in 100, the probability of being high is at most 1 in 200, half of it? Is that true? Yeah?

AUDIENCE: [INAUDIBLE]

PROFESSOR: Yeah, it is not necessarily true that the high and the low probabilities are equal, and therefore that the high one is half the total. It might be true, but it's not necessarily true. And that's a mistake that often gets made, where you take this fact, that the two-sided probability is at most 1 in 100, to conclude the one-sided one is at most 1 in 200. And that you can't do, unless the distribution is symmetric around the expected value. Then you could do it, if it's a symmetric distribution around the expected value. But usually it's not.
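Here is a minimal sketch of why halving fails, using a two-point distribution of my own construction that puts essentially all of the deviation mass on the high side.

```python
import math

# R = 100 with probability 0.01, else 0: all the deviation mass is high.
p, hi = 0.01, 100.0
mu = p * hi                           # 1.0
sigma = math.sqrt(p * hi**2 - mu**2)  # sqrt(99), about 9.95
c = (hi - mu) / sigma                 # being at 100 is ~9.95 deviations high
print(p, 1 / c**2, 1 / c**2 / 2)      # 0.0100 <= 0.0101, but half of it, 0.0051, fails
```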
Now, there is something better you can say. So let me tell you what it is, but we won't prove it. I think we might prove it in the text; I'm not sure. If you just want the high side, or just want the low side, you can do slightly better than 1 over c squared. It's the following theorem.

For any random variable R, the probability that R is on the high side by at least c standard deviations is at most 1 over c squared plus 1. So it's not 1 over 2 c squared; it's 1 over c squared plus 1. And the same thing holds for the probability of being on the low side. Let's see, have I written this right? For the low side, I want R minus the expected value to be less than or equal to negative c times the standard deviation. So here I'm high by c or more standard deviations; here I'm low, meaning R is below the expected value by at least c standard deviations. And that is also at most 1 over c squared plus 1.

And it is possible to find distributions that hit these targets, not both at the same time, but one or the other. So that's the best you can say in general.

All right, so using this bound, what's the probability that a random person has an IQ of at least 250? It's a little better than 1 in 100.

AUDIENCE: [INAUDIBLE]

PROFESSOR: Yeah, 1/101. So in fact, the best we can say, without knowing any more information about IQs, is that it's at most 1/101, slightly better. Now in fact, with IQs, they know more about the distribution, and the probability is a lot less, because you know more about the distribution than we've assumed here. In fact, I don't think anybody has an IQ over 250, as far as I know. Any questions about this?
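Here is a minimal sketch of this one-sided bound, often called Cantelli's inequality, applied to the IQ numbers; the function name is made up for illustration.

```python
def one_sided_bound(c):
    """P(R - E[R] >= c * sigma) <= 1 / (c^2 + 1); same for the low side."""
    return 1 / (c**2 + 1)

print(one_sided_bound(10))  # the IQ case: 1/101, slightly better than 1/100
print(one_sided_bound(2))   # two standard deviations high: 1/5
```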
OK, all right, say we give the exam. What fraction of the class can score two standard deviations or more away from the average, above or below? Could half the class be two standard deviations off the mean? No? What's the biggest fraction for which that could happen? What do I do? What fraction of the class can be two standard deviations or more from the mean? What is it?

AUDIENCE: 1/4.

PROFESSOR: 1/4, because c is 2. You don't even know what the mean is. You don't know what the standard deviation is. You don't need to. I just asked: you're two standard deviations off or more. At most, 1/4.

How many could be two standard deviations high or more, at most? 1/5: 1 over 4 plus 1, good. OK, this holds true no matter what the distribution of test scores is. Yeah?

AUDIENCE: [INAUDIBLE]

PROFESSOR: Which one? This one?

AUDIENCE: Yeah.

PROFESSOR: Oh, that's more complicated. That'll take us several boards to prove. And I forget if we put it in the text or not; it might be in the text. Any other questions?

OK, so Markov and Chebyshev are sometimes close, sometimes not. Now, for the rest of today, we're going to talk about a much more powerful technique. But it only works in a special case. The good news is this special case happens all the time in practice. It's the case when you're analyzing a random variable that itself is the sum of a bunch of other random variables, and we've already seen examples like that. And the other random variables have to be mutually independent. In this case, you get a bound that's called a Chernoff bound. And this is the same Chernoff who figured out how to beat the lottery.

And it's interesting. Originally, this material on Chernoff bounds was only taught to graduate students, long after it was discovered. And now we teach it here, because it's so important.
And it really is accessible. It'll probably be the most complicated proof we've done, to establish a Chernoff bound. But Chernoff himself, when he discovered this, thought it was no big deal. In fact, he couldn't figure out why everybody in computer science was always writing papers with Chernoff bounds in them. And that's because he didn't put any emphasis on the bounds in his own work. But computer scientists who came later found all sorts of important applications, and we'll see some of those today.

So let me tell you what the bound is. And the nice thing is it really is Markov's theorem again in disguise, just a little more complicated.

Theorem, and it's called a Chernoff bound: let T1, T2, up to Tn be any mutually independent, and that's really important, random variables such that each of them takes values only between 0 and 1. And if they don't, just normalize them so they do. So we're going to take a bunch of random variables that are mutually independent, all between 0 and 1. Then we look at the sum of those random variables; call it T. Then for any c at least 1, the probability that the sum T is at least c times its expected value, so this is the high side, is at most e to the minus z times the expected value of T, where z is c natural log of c, plus 1, minus c. And it turns out that if c is bigger than 1, this z is positive.

So that's a lot, one of the longest theorems we've written down here. But what it says is that the probability of being high is exponentially small. If the expected value is big, the chance of being that far above it gets really, really tiny.
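Here is a minimal sketch of the bound as stated, with z computed from the formula; the c = 2, expected value 100 numbers anticipate the worked example that comes next.

```python
import math

def chernoff_bound(c, expected_T):
    """P(T >= c * E[T]) <= exp(-z * E[T]) with z = c ln c + 1 - c, for c >= 1."""
    z = c * math.log(c) + 1 - c  # positive whenever c > 1
    return math.exp(-z * expected_T)

print(chernoff_bound(2, 100))  # z ~ 0.386: roughly e^-38.6, astronomically small
```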
Now, I'm going to prove it in a minute. But let's just plug in some examples first to see what's going on. So for example, suppose the expected value of T is 100, and suppose c is 2. So we expect the sum to come out around 100. The probability we get at least 200: well, let's figure out what that is. With c being 2, we can evaluate z. It's 2 natural log of 2, plus 1, minus 2, and that's close to, but a little larger than, 0.38. So we can plug z into the exponent up there and find that the probability that T is at least twice its expected value, namely at least 200, is at most e to the minus 0.38 times 100, which is e to the minus 38, which is just really small. That's way better than any result you'd get with Markov or Chebyshev.

So if you have a bunch of mutually independent random variables between 0 and 1, and you add them up, and you expect 100 as the answer, the chance of getting 200 or more? Forget about it, not going to happen.

Now, of course Chernoff doesn't apply to all distributions. It has to be this type. But this is a pretty broad class. In fact, it contains all the binomial distributions. Because remember the binomial distribution? That's where T is the sum of the Tj's, and in the binomial case each Tj is 0 or 1; it can't be in between. And in the binomial case, all the Tj's have the same distribution. With Chernoff, they can all be different. So Chernoff is much broader than the binomial. The individual variables here can have different distributions and take values anywhere between 0 and 1, as opposed to just one or the other.

Any questions about this theorem and what it says? One nice thing about it is that the number of random variables doesn't even show up in the answer; n doesn't even appear. Yeah.

AUDIENCE: [INAUDIBLE]

PROFESSOR: Does not apply to what?

AUDIENCE: [INAUDIBLE]

PROFESSOR: Yeah, when c equals 1, what happens is z is 0, because the log of 1 is 0, and 1 minus 1 is 0. And if z is 0, it says your probability is upper bounded by 1. Well, that's not too interesting, because any probability is upper bounded by 1.
So it doesn't give you any information when c is 1, none at all. But as soon as c starts being bigger than 1, which is the case you're interested in, being bigger than your expectation, then it gives very powerful results. Yeah.

AUDIENCE: [INAUDIBLE]

PROFESSOR: Yeah, you can. It's true for n equals 1 as well. Now, it doesn't give you a lot of information. Because if c is bigger than 1 and n is 1, so it's just one variable, what's the probability that a single random variable exceeds c times its expectation?

AUDIENCE: [INAUDIBLE]

PROFESSOR: Yeah, let's see now. Maybe it does give you information, because the random variable has a distribution on the interval from 0 to 1. That's right, so it does give you some information, but I don't think it gives you a lot. I have to think about that. What happens when there's just one variable? The same statement is true; it's just that now, for a single random variable on the interval from 0 to 1, you're asking the chance that you're twice the expected value. I have to think about that. That's a good question. Does it do anything interesting there?

OK, all right, so let's do an example of how you might apply this. Say that you're playing Pick 4, and 10 million people are playing. And say in this version of Pick 4, you're picking a four-digit number, four single digits. And you win if you get an exact match. So the probability of a person winning: well, they've got to get all four digits right. That's 1 in 10,000, 1 in 10 to the fourth.
818 00:50:21,840 --> 00:50:23,670 So the expected number of winners 819 00:50:23,670 --> 00:50:28,720 is 1 in 10,000 added 10 million times, which is this. 820 00:50:32,470 --> 00:50:32,990 Is that OK? 821 00:50:32,990 --> 00:50:34,823 Everybody should be really familiar with how 822 00:50:34,823 --> 00:50:36,560 to whip these things out. 823 00:50:36,560 --> 00:50:38,790 The final for sure will probably have at least a couple 824 00:50:38,790 --> 00:50:40,290 of questions where you're going to need 825 00:50:40,290 --> 00:50:42,380 to be able to do that kind of thing. 826 00:50:46,710 --> 00:50:50,390 All right, say I want to know the probability of getting 827 00:50:50,390 --> 00:51:05,940 at least 2,000 winners, and I want to upper bound 828 00:51:05,940 --> 00:51:09,358 that just with the information I've given you. 829 00:51:12,320 --> 00:51:17,920 Well, any thoughts about an upper bound? 830 00:51:17,920 --> 00:51:19,260 AUDIENCE: [INAUDIBLE] 831 00:51:19,260 --> 00:51:20,364 PROFESSOR: What's that? 832 00:51:20,364 --> 00:51:22,360 AUDIENCE: [INAUDIBLE] 833 00:51:22,360 --> 00:51:25,970 PROFESSOR: Yeah, that's a good upper bound. 834 00:51:25,970 --> 00:51:27,845 What did you have to assume to get there? 835 00:51:31,330 --> 00:51:33,680 e to the minus 380 is a great bound. 836 00:51:33,680 --> 00:51:39,070 Because you're going to plug in an expected value of 1,000. 837 00:51:39,070 --> 00:51:41,810 And we're asking for more than twice the expected value. 838 00:51:41,810 --> 00:51:44,640 So it's e to the minus 0.38 times 1,000. 839 00:51:44,640 --> 00:51:50,790 And that for sure is small-- so you computed this. 840 00:51:50,790 --> 00:51:54,890 And that equals e to the minus 380. 841 00:51:54,890 --> 00:51:56,280 So that's really small. 842 00:51:56,280 --> 00:52:01,670 But what did you have to assume to apply Chernoff? 843 00:52:01,670 --> 00:52:02,710 Mutual independence. 844 00:52:02,710 --> 00:52:05,020 Mutual independence of what? 845 00:52:05,020 --> 00:52:06,880 AUDIENCE: [INAUDIBLE] 846 00:52:06,880 --> 00:52:09,760 PROFESSOR: The numbers people picked. 847 00:52:09,760 --> 00:52:13,249 And we already know, if people are picking numbers, 848 00:52:13,249 --> 00:52:15,040 they don't tend to be mutually independent. 849 00:52:15,040 --> 00:52:18,030 They tend to gang up. 850 00:52:18,030 --> 00:52:20,500 But if you had a computer picking the numbers randomly 851 00:52:20,500 --> 00:52:27,010 and mutually independently, then you would get e to the minus 380 852 00:52:27,010 --> 00:52:33,726 by Chernoff, with mutually independent picks. 853 00:52:37,180 --> 00:52:39,270 Everybody see why we did that? 854 00:52:39,270 --> 00:52:43,320 Because it's the probability of twice your expectation. 855 00:52:43,320 --> 00:52:48,080 The total number of winners is the sum of 10 million indicator 856 00:52:48,080 --> 00:52:49,640 variables. 857 00:52:49,640 --> 00:52:51,660 And indicator variables are 0 or 1. 858 00:52:51,660 --> 00:52:53,600 So they fit that definition up there. 859 00:52:56,380 --> 00:53:03,080 And we already figured out z is at least 0.38. 860 00:53:03,080 --> 00:53:05,330 And you're multiplying by the expected value of 1,000. 861 00:53:05,330 --> 00:53:10,610 That's e to the minus 380, so very, very unlikely. 862 00:53:10,610 --> 00:53:12,360 What if they weren't mutually independent? 863 00:53:12,360 --> 00:53:19,630 Can you say anything about this, anything at all better than 1, 864 00:53:19,630 --> 00:53:21,202 which we know for any probability?
865 00:53:21,202 --> 00:53:23,090 Yeah? 866 00:53:23,090 --> 00:53:25,310 AUDIENCE: It's possible that everyone 867 00:53:25,310 --> 00:53:26,270 chose the same numbers. 868 00:53:28,832 --> 00:53:31,290 PROFESSOR: Yes, everyone could have chosen the same number. 869 00:53:31,290 --> 00:53:40,950 But that number only comes up with a 1 in 10,000 chance. 870 00:53:40,950 --> 00:53:43,282 So you can say something. 871 00:53:43,282 --> 00:53:44,490 AUDIENCE: You can use Markov. 872 00:53:44,490 --> 00:53:45,406 PROFESSOR: Use Markov. 873 00:53:45,406 --> 00:53:46,647 What does Markov give you? 874 00:53:54,667 --> 00:53:55,750 What does Markov give you? 875 00:53:58,480 --> 00:54:00,480 1/2, yeah. 876 00:54:00,480 --> 00:54:05,050 Because you've got the expected value, 877 00:54:05,050 --> 00:54:11,400 1,000, divided by the threshold, 2,000, 878 00:54:11,400 --> 00:54:13,220 which is 1/2 by Markov. 879 00:54:13,220 --> 00:54:15,896 And that holds true without any independence assumption. 880 00:54:19,230 --> 00:54:22,720 Now, there is an enormous difference between 1/2 and e 881 00:54:22,720 --> 00:54:24,810 to the minus 380. 882 00:54:24,810 --> 00:54:28,062 Independence really makes a huge difference 883 00:54:28,062 --> 00:54:29,270 in the bound you can compute. 884 00:54:33,700 --> 00:54:39,050 OK, now there's another way we could've gone about this. 885 00:54:39,050 --> 00:54:43,300 What kind of distribution does T have in this case? 886 00:54:45,952 --> 00:54:48,160 It's binomial. 887 00:54:48,160 --> 00:54:51,370 Because it's the sum of indicator random variables, 0, 888 00:54:51,370 --> 00:54:52,370 1's. 889 00:54:52,370 --> 00:54:54,160 Each of these is 0, 1. 890 00:54:54,160 --> 00:54:55,920 And they're all the same distribution. 891 00:54:55,920 --> 00:55:00,160 There's a 1 in 10,000 chance of winning for each one of them. 892 00:55:00,160 --> 00:55:01,512 So it's a binomial. 893 00:55:01,512 --> 00:55:02,970 So we could have gone back and used 894 00:55:02,970 --> 00:55:05,860 the formulas we had for the binomial distribution, 895 00:55:05,860 --> 00:55:11,140 plugged it all in, and we'd have gotten something pretty similar 896 00:55:11,140 --> 00:55:12,030 here. 897 00:55:12,030 --> 00:55:13,580 But Chernoff is so much easier. 898 00:55:13,580 --> 00:55:15,380 Remember that pain we would go through 899 00:55:15,380 --> 00:55:18,580 with a binomial distribution, the approximation, Stirling's 900 00:55:18,580 --> 00:55:22,285 formula, [INAUDIBLE] whatever, the factorials and stuff? 901 00:55:22,285 --> 00:55:24,410 And that's a nightmare. 902 00:55:24,410 --> 00:55:25,510 This was easy. 903 00:55:25,510 --> 00:55:28,490 e to the minus 380 was very easy to compute. 904 00:55:28,490 --> 00:55:30,520 And really at that point it doesn't 905 00:55:30,520 --> 00:55:33,950 matter if it's minus 381 or minus 382 or whatever. 906 00:55:33,950 --> 00:55:35,870 Because it's really small. 907 00:55:35,870 --> 00:55:39,020 So often, even when you have a binomial distribution, 908 00:55:39,020 --> 00:55:40,250 well, Chernoff will apply. 909 00:55:40,250 --> 00:55:42,560 And that's a great way to go. 910 00:55:42,560 --> 00:55:46,180 Because it gives you good bounds generally. 911 00:55:46,180 --> 00:55:49,170 All right, let's figure out the probability 912 00:55:49,170 --> 00:55:56,600 of at least 1,100 winners instead of 1,000.
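(To consolidate the two bounds on the 2,000-winner event in code: this is just an illustrative Python sketch under the lecture's assumptions, and the variable names are mine.)

    import math

    n_players, p_win = 10_000_000, 1.0 / 10_000
    expected_winners = n_players * p_win         # 1,000 expected winners

    # Chernoff, valid only for mutually independent picks:
    # P(T >= 2 E[T]) <= e^(-z(2) E[T]), where z(c) = c ln c + 1 - c.
    chernoff_exponent = -(2 * math.log(2) + 1 - 2) * expected_winners
    print(chernoff_exponent)                     # ~ -386, even below e^-380

    # Markov, valid with no independence assumption at all:
    markov_bound = expected_winners / 2_000      # P(T >= 2000) <= 1/2
    print(markov_bound)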
913 00:55:56,600 --> 00:56:02,110 So let's look at the probability of at least 100 extra winners 914 00:56:02,110 --> 00:56:05,520 over what we expect out of 10 million. 915 00:56:05,520 --> 00:56:06,790 We've got 10 million people. 916 00:56:06,790 --> 00:56:07,730 You expect 1,000. 917 00:56:07,730 --> 00:56:10,830 We're going to analyze the probability of 1,100. 918 00:56:10,830 --> 00:56:12,600 What's c in this case? 919 00:56:12,600 --> 00:56:15,530 We're going to use Chernoff. 920 00:56:15,530 --> 00:56:17,550 1.1. 921 00:56:17,550 --> 00:56:21,090 So this is 1.1 times 1,000. 922 00:56:21,090 --> 00:56:27,150 And that means that z is 1.1 times the natural log 923 00:56:27,150 --> 00:56:31,160 of 1.1 plus 1 minus 1.1. 924 00:56:31,160 --> 00:56:39,020 And that is close to, but at least, 0.0048. 925 00:56:39,020 --> 00:56:42,840 So this probability is at most, by Chernoff, 926 00:56:42,840 --> 00:56:50,150 e to the minus 0.0048 times the expected number of winners, 927 00:56:50,150 --> 00:56:53,010 which is 1,000. 928 00:56:53,010 --> 00:56:59,930 So that is e to the minus 4.8, which is less than 1%, 929 00:56:59,930 --> 00:57:03,817 1 in 100. 930 00:57:03,817 --> 00:57:04,900 So that's pretty powerful. 931 00:57:04,900 --> 00:57:08,720 It says, you've got 10 million people who could win. 932 00:57:08,720 --> 00:57:13,250 The chance of even having 100 more than the 1,000 you expect 933 00:57:13,250 --> 00:57:17,880 is at most 1%-- very, very powerful. 934 00:57:17,880 --> 00:57:20,220 It says you really expect to get really 935 00:57:20,220 --> 00:57:25,660 close to the mean in this situation. 936 00:57:25,660 --> 00:57:29,280 OK, a lot better-- Markov here gives you, 937 00:57:29,280 --> 00:57:32,380 what, 1,000 over 1,100. 938 00:57:32,380 --> 00:57:35,430 It says your probability could be 90% or something-- 939 00:57:35,430 --> 00:57:36,290 not very useful. 940 00:57:36,290 --> 00:57:38,760 Chebyshev won't give you much here either. 941 00:57:38,760 --> 00:57:41,730 So if you're in a situation to apply Chernoff, 942 00:57:41,730 --> 00:57:42,915 always go there. 943 00:57:42,915 --> 00:57:44,800 It gives you the best bounds. 944 00:57:44,800 --> 00:57:45,434 Any questions? 945 00:57:48,100 --> 00:57:50,700 This of course is why computer scientists use it all the time. 946 00:57:54,010 --> 00:57:56,990 OK, actually, before I do more examples, 947 00:57:56,990 --> 00:57:59,770 let me prove the theorem in a special case 948 00:57:59,770 --> 00:58:03,050 to give you a feel for what's involved. 949 00:58:03,050 --> 00:58:05,950 The full proof is in the text. 950 00:58:05,950 --> 00:58:07,960 I'm going to prove it in the special case 951 00:58:07,960 --> 00:58:11,090 where the Tj are 0, 1. 952 00:58:11,090 --> 00:58:12,860 So they're indicator random variables. 953 00:58:12,860 --> 00:58:16,020 But they don't have to have the same distribution. 954 00:58:16,020 --> 00:58:17,730 So it's still more general than you get 955 00:58:17,730 --> 00:58:20,930 with a binomial distribution. 956 00:58:20,930 --> 00:58:28,420 All right, so we're going to do a proof of Chernoff 957 00:58:28,420 --> 00:58:37,524 for the special case where the Tj are either 0 or 1. 958 00:58:37,524 --> 00:58:38,815 So they're indicator variables. 959 00:58:42,980 --> 00:58:46,140 OK, so the first step is going to seem pretty mysterious. 960 00:58:46,140 --> 00:58:48,130 But we've been doing something like it all day.
961 00:58:52,090 --> 00:58:54,320 I'm trying to compute the probability T is bigger 962 00:58:54,320 --> 00:58:57,670 than c times its expectation. 963 00:58:57,670 --> 00:59:02,500 Well, what I'm going to do is exponentiate both of these guys 964 00:59:02,500 --> 00:59:06,230 and compute the probability that c to the T 965 00:59:06,230 --> 00:59:13,120 is at least c to the c times the expected value of T. 966 00:59:13,120 --> 00:59:16,650 Now, this is not the first thing you'd expect to do, 967 00:59:16,650 --> 00:59:18,580 probably, if you were trying to prove this. 968 00:59:18,580 --> 00:59:20,600 So it's one of those divine insights 969 00:59:20,600 --> 00:59:22,080 that you'd make this step. 970 00:59:24,920 --> 00:59:27,070 And then I'm going to apply Markov, like we've 971 00:59:27,070 --> 00:59:30,960 been doing all day, to this. 972 00:59:30,960 --> 00:59:35,820 Now, since c is bigger than 1, exponentiating 973 00:59:35,820 --> 00:59:39,520 preserves the inequality, so these two events are equal. 974 00:59:39,520 --> 00:59:43,250 And c to the T is never negative. 975 00:59:43,250 --> 00:59:50,190 So now by Markov, this is simply upper bounded 976 00:59:50,190 --> 00:59:51,260 by the expected value of that, expected value of c to the T, 977 00:59:57,130 --> 00:59:58,851 divided by this. 978 00:59:58,851 --> 01:00:01,351 And that's by Markov. 979 01:00:05,500 --> 01:00:07,342 So everything we've done today is really Markov in disguise. 980 01:00:07,342 --> 01:00:09,300 Any questions so far? 981 01:00:09,300 --> 01:00:11,700 You start looking at this, you go, oh my god, I 982 01:00:11,700 --> 01:00:13,150 got the random variable in the exponent. 983 01:00:13,150 --> 01:00:14,922 This is looking like a nightmare. 984 01:00:14,922 --> 01:00:15,880 What is the expected value of c to the T, and this kind of stuff? 985 01:00:15,880 --> 01:00:18,580 But we're going to hack through it. 986 01:00:18,580 --> 01:00:20,900 Because it gives you just an amazingly powerful result 987 01:00:20,900 --> 01:00:21,608 when you're done. 988 01:00:27,880 --> 01:00:31,510 All right, so we've got to evaluate the expected value 989 01:00:31,510 --> 01:00:36,140 of c to the T. And we're going to use the fact that T is 990 01:00:36,140 --> 01:00:37,278 the sum of the Tj's. 991 01:00:45,510 --> 01:00:52,110 And that means that c to the T equals c to the T1 times 992 01:00:52,110 --> 01:00:56,770 c to the T2, all the way up to c to the Tn. 993 01:00:56,770 --> 01:00:59,480 The weird thing about this proof is that every step sort of 994 01:00:59,480 --> 01:01:01,710 makes it more complicated looking 995 01:01:01,710 --> 01:01:03,760 until we get to the end. 996 01:01:03,760 --> 01:01:08,750 So it's one of those that's hard to figure out the first time. 997 01:01:08,750 --> 01:01:13,470 All right, that means the expected value of c to the T 998 01:01:13,470 --> 01:01:18,150 is the expected value of the product of these things. 999 01:01:26,310 --> 01:01:30,484 Now I'm going to use the product rule for expectation. 1000 01:01:39,570 --> 01:01:41,810 Now, why can I use the product rule? 1001 01:01:41,810 --> 01:01:43,835 What am I assuming to be able to do that? 1002 01:01:47,160 --> 01:01:49,150 That they are mutually independent, 1003 01:01:49,150 --> 01:01:51,290 that the c to the Tj's are mutually 1004 01:01:51,290 --> 01:01:53,630 independent of each other. 1005 01:01:53,630 --> 01:01:58,454 And that follows, because the Tj's are mutually independent.
1006 01:01:58,454 --> 01:02:00,870 So if a bunch of random variables are mutually independent, 1007 01:02:00,870 --> 01:02:05,680 then their exponentiations are mutually independent. 1008 01:02:05,680 --> 01:02:12,320 So this is by the product rule for expectation 1009 01:02:12,320 --> 01:02:13,320 and mutual independence. 1010 01:02:18,800 --> 01:02:23,790 OK, so now we've got to evaluate the expected value 1011 01:02:23,790 --> 01:02:25,345 of c to the Tj. 1012 01:02:32,139 --> 01:02:33,680 And this is where we're going to make 1013 01:02:33,680 --> 01:02:40,070 it simpler by assuming that Tj is just a 0, 1 random variable. 1014 01:02:40,070 --> 01:02:42,322 So the simplification comes in here. 1015 01:02:47,400 --> 01:02:53,420 So the expected value of c to the Tj-- well, there's two cases. 1016 01:02:53,420 --> 01:02:55,080 Tj is 1, or it's 0. 1017 01:02:55,080 --> 01:02:57,760 Because we made this simplification. 1018 01:02:57,760 --> 01:03:01,110 If it's 1, I get c to the 1. 1019 01:03:01,110 --> 01:03:03,010 If it's 0, I get c to the 0. 1020 01:03:03,010 --> 01:03:04,286 Let's write that out. 1021 01:03:07,600 --> 01:03:10,060 It could be 1, in which case I get a contribution of c 1022 01:03:10,060 --> 01:03:17,590 to the 1 times the probability Tj equals 1, plus the case at 0. 1023 01:03:17,590 --> 01:03:21,055 So I get c to the 0 times the probability Tj is 0. 1024 01:03:24,230 --> 01:03:25,910 Well, c to the 1 is just c. 1025 01:03:32,370 --> 01:03:34,620 c to the 0 is 1. 1026 01:03:34,620 --> 01:03:37,470 And I'm going to rewrite the probability of Tj being 1027 01:03:37,470 --> 01:03:39,900 0 as 1 minus the probability Tj is 1. 1028 01:03:45,020 --> 01:03:47,530 All right, this equals that. 1029 01:03:47,530 --> 01:03:49,980 And now I can simplify. 1030 01:03:49,980 --> 01:03:51,750 I'm going to collect terms here 1031 01:03:51,750 --> 01:04:00,984 to get 1 plus c minus 1 times the probability Tj equals 1. 1032 01:04:06,680 --> 01:04:10,720 OK, then I'm going to do one more step here. 1033 01:04:10,720 --> 01:04:17,730 This is 1 plus c minus 1 times the expected value of Tj. 1034 01:04:17,730 --> 01:04:21,210 Because if I have an indicator random variable, 1035 01:04:21,210 --> 01:04:24,880 the expected value is the same as the probability that it's 1. 1036 01:04:24,880 --> 01:04:28,780 Because in the other case it's 0. 1037 01:04:28,780 --> 01:04:31,720 And now I'm going to use the trick from last time. 1038 01:04:31,720 --> 01:04:37,564 Remember 1 plus x is always at most e to the x from last time? 1039 01:04:37,564 --> 01:04:39,730 It's not obvious why we're doing any of these steps. 1040 01:04:39,730 --> 01:04:43,560 But we're going to do them anyway. 1041 01:04:43,560 --> 01:04:53,416 So this is at most e to this, c minus 1 times the expected value of Tj. 1042 01:04:57,110 --> 01:05:01,730 Because 1 plus anything is at most the exponential of that. 1043 01:05:01,730 --> 01:05:06,830 And I'm doing this step because I got a product of these guys. 1044 01:05:06,830 --> 01:05:08,556 And I want to put them in the exponent 1045 01:05:08,556 --> 01:05:10,180 so I can then sum them so it gets easy. 1046 01:05:23,570 --> 01:05:32,880 OK, now we just plug this back in here. 1047 01:05:32,880 --> 01:05:41,230 So that means that the expected value of c to the T 1048 01:05:41,230 --> 01:05:51,020 is at most the product of these bounds-- 1049 01:05:51,020 --> 01:05:59,280 the product of e to the c minus 1 times the expected value of Tj.
1050 01:05:59,280 --> 01:06:01,678 And now I can convert this product to a sum in the exponent. 1051 01:06:12,630 --> 01:06:16,320 And this is j equals 1 to n. 1052 01:06:16,320 --> 01:06:18,061 And what do I do to simplify that? 1053 01:06:23,760 --> 01:06:24,825 Linearity of expectation. 1054 01:06:27,500 --> 01:06:35,895 c minus 1 times the sum, j equals 1 to n, of the expected value of Tj. 1055 01:06:41,060 --> 01:06:42,400 Ooh, let's see. 1056 01:06:42,400 --> 01:06:45,390 Actually, linearity is the step I need right here. 1057 01:06:45,390 --> 01:06:47,510 By linearity, the sum of the expected values of the Tj's 1058 01:06:47,510 --> 01:06:50,530 is the expected value of their sum. 1059 01:06:50,530 --> 01:06:54,520 So I can take the sum up here 1060 01:06:54,520 --> 01:06:55,574 inside the expectation. 1061 01:06:55,574 --> 01:06:56,740 And what is the sum of the Tj's? 1062 01:06:56,740 --> 01:06:59,810 It's T. 1063 01:06:59,810 --> 01:07:01,810 Yeah, that's what I needed to do here. 1064 01:07:07,130 --> 01:07:09,360 OK, we're now almost done. 1065 01:07:09,360 --> 01:07:11,870 We've got now an upper bound on the expected value of c 1066 01:07:11,870 --> 01:07:14,450 to the T. And it is this. 1067 01:07:14,450 --> 01:07:17,290 And we just plug that in back up here. 1068 01:07:20,750 --> 01:07:28,870 So now this is at most e to the c minus 1 expected value of T 1069 01:07:28,870 --> 01:07:33,810 over c to the c times the expected value of T. 1070 01:07:33,810 --> 01:07:36,390 And now I just do manipulation. 1071 01:07:36,390 --> 01:07:38,800 c to something is the same as e to the log 1072 01:07:38,800 --> 01:07:40,770 of c times that something. 1073 01:07:40,770 --> 01:07:50,560 So this is e to the minus c ln c times the expected value of T, plus that. 1074 01:07:57,220 --> 01:08:00,760 And then I'm running out of room. 1075 01:08:00,760 --> 01:08:05,350 That equals-- I can just pull out the expected value of T. I 1076 01:08:05,350 --> 01:08:14,590 get e to the, minus c log of c plus c minus 1, times the expected 1077 01:08:14,590 --> 01:08:23,707 value of T. And that's e to the minus z expected value of T. 1078 01:08:23,707 --> 01:08:25,290 All right, so that's a marathon proof. 1079 01:08:25,290 --> 01:08:27,117 It's the worst proof I think. 1080 01:08:27,117 --> 01:08:28,950 Well, maybe minimum spanning tree was worse. 1081 01:08:28,950 --> 01:08:31,340 But this is one of the worst proofs we've seen this year. 1082 01:08:31,340 --> 01:08:34,384 But I wanted to show it to you. 1083 01:08:34,384 --> 01:08:36,300 Because it's one of the most important results 1084 01:08:36,300 --> 01:08:39,130 that we cover, certainly in probability, 1085 01:08:39,130 --> 01:08:41,352 that can be very useful in practice. 1086 01:08:41,352 --> 01:08:43,060 And it gives you some feel for, hey, this 1087 01:08:43,060 --> 01:08:45,300 wasn't so obvious to do it the first time, 1088 01:08:45,300 --> 01:08:47,640 and also some of the techniques that are used, 1089 01:08:47,640 --> 01:08:51,210 which is really Markov's theorem. 1090 01:08:51,210 --> 01:08:53,069 Any questions? 1091 01:08:53,069 --> 01:08:54,043 Yeah. 1092 01:08:54,043 --> 01:08:58,381 AUDIENCE: Over there, you define z as 1 minus c. 1093 01:08:58,381 --> 01:08:59,589 PROFESSOR: Did I do it wrong? 1094 01:08:59,589 --> 01:09:01,834 AUDIENCE: c natural log of c, 1 minus c. 1095 01:09:01,834 --> 01:09:02,688 Maybe it's plus c? 1096 01:09:02,688 --> 01:09:04,479 PROFESSOR: Oh, I've got to change the sign. 1097 01:09:04,479 --> 01:09:06,399 Because I pulled a negative out in front.
1098 01:09:06,399 --> 01:09:09,180 So inside it's got to be negative of c minus 1, 1099 01:09:09,180 --> 01:09:11,990 which means minus c plus 1. 1100 01:09:11,990 --> 01:09:15,600 Yeah, good. 1101 01:09:15,600 --> 01:09:16,418 Yeah, this was OK. 1102 01:09:16,418 --> 01:09:18,043 I just made the mistake going to that step. 1103 01:09:18,043 --> 01:09:18,949 Any other questions? 1104 01:09:24,670 --> 01:09:33,779 OK, so the common theme here in using Markov to get Chebyshev, 1105 01:09:33,779 --> 01:09:35,840 to get Chernoff, to get the Markov extensions, 1106 01:09:35,840 --> 01:09:37,619 is always the same. 1107 01:09:37,619 --> 01:09:40,649 And let me show you what that theme is. 1108 01:09:46,300 --> 01:09:49,801 Because you can use it to get even other results. 1109 01:09:52,340 --> 01:09:54,590 When we're trying to figure out the probability that T 1110 01:09:54,590 --> 01:10:01,380 is at least c times its expected value, 1111 01:10:01,380 --> 01:10:03,000 or actually even more generally 1112 01:10:03,000 --> 01:10:08,030 than that, the probability that A is bigger than B, 1113 01:10:08,030 --> 01:10:10,800 well, that's 1114 01:10:10,800 --> 01:10:16,150 the same as the probability that f of A is bigger than f of B, 1115 01:10:16,150 --> 01:10:20,010 as long as f is increasing, so it doesn't change the order. 1116 01:10:20,010 --> 01:10:25,100 And then by Markov, this is at most the expected value of f of A, 1117 01:10:25,100 --> 01:10:28,110 as long as it's non-negative, divided by f of B. 1118 01:10:31,600 --> 01:10:34,250 In Chebyshev, what function f did we 1119 01:10:34,250 --> 01:10:39,540 use for Chebyshev in deriving Chebyshev's theorem? 1120 01:10:39,540 --> 01:10:41,260 What was f doing in Chebyshev? 1121 01:10:41,260 --> 01:10:43,254 Actually I probably just erased it. 1122 01:10:46,043 --> 01:10:47,876 What operation were we doing with Chebyshev? 1123 01:10:47,876 --> 01:10:49,620 AUDIENCE: Variance. 1124 01:10:49,620 --> 01:10:50,520 PROFESSOR: Variance. 1125 01:10:50,520 --> 01:10:54,500 And that meant we were squaring it. 1126 01:10:54,500 --> 01:10:57,000 So the technique used to prove Chebyshev 1127 01:10:57,000 --> 01:10:59,650 was to take f to be the square function. 1128 01:10:59,650 --> 01:11:04,380 For Chernoff, f is the exponentiation function, 1129 01:11:04,380 --> 01:11:06,400 which turns out to be-- in fact, when we did it 1130 01:11:06,400 --> 01:11:09,647 for Chernoff, that's the optimal choice of function 1131 01:11:09,647 --> 01:11:10,561 to get good bounds. 1132 01:11:13,776 --> 01:11:18,301 All right, any questions on that? 1133 01:11:18,301 --> 01:11:22,452 All right, let's do one more example here with numbers. 1134 01:11:26,740 --> 01:11:29,480 And this is a load balancing application, 1135 01:11:29,480 --> 01:11:33,450 for example, one you might have with web servers. 1136 01:11:33,450 --> 01:11:36,080 Say you've got to build a load balancing device, 1137 01:11:36,080 --> 01:11:42,480 and it's got to balance N jobs, B1, B2, 1138 01:11:42,480 --> 01:11:55,910 to BN, across a set of M servers, S1, S2, to SM. 1139 01:11:55,910 --> 01:11:59,600 And say you're doing this for a decent sized website. 1140 01:11:59,600 --> 01:12:03,290 So maybe N is 100,000. 1141 01:12:03,290 --> 01:12:06,060 You get 100,000 requests a minute. 1142 01:12:06,060 --> 01:12:11,700 And say you've got 10 servers to handle those requests. 1143 01:12:11,700 --> 01:12:16,740 And say each request takes 1144 01:12:16,740 --> 01:12:21,130 some amount of time to serve.
1145 01:12:21,130 --> 01:12:25,600 The j-th request takes Lj time. 1146 01:12:25,600 --> 01:12:27,330 And the time is the same on any server. 1147 01:12:27,330 --> 01:12:29,480 The servers are all equivalent. 1148 01:12:29,480 --> 01:12:34,850 And let's assume it's normalized so that Lj is between 0 and 1. 1149 01:12:34,850 --> 01:12:40,490 Maybe the worst job takes a second to do, let's say. 1150 01:12:40,490 --> 01:12:44,050 And say that if you sum up the length of all the jobs, 1151 01:12:44,050 --> 01:12:53,780 you get L. Total workload is the sum of all of them, 1152 01:12:53,780 --> 01:12:58,180 j equals 1 to N. 1153 01:12:58,180 --> 01:13:01,480 And we're going to assume that the average job length is 1/4 1154 01:13:01,480 --> 01:13:02,890 second. 1155 01:13:02,890 --> 01:13:08,820 So we're going to assume that the total amount of work 1156 01:13:08,820 --> 01:13:12,810 is 25,000 seconds, say. 1157 01:13:12,810 --> 01:13:16,580 So the average job length is 1/4 second. 1158 01:13:16,580 --> 01:13:21,410 And the job is to assign these tasks to the 10 servers so that 1159 01:13:21,410 --> 01:13:25,970 hopefully every server is doing L/M work, 1160 01:13:25,970 --> 01:13:33,360 which would be 25,000/10, 1161 01:13:33,360 --> 01:13:34,300 or 2,500 seconds of work, 1162 01:13:34,300 --> 01:13:36,032 something like that. 1163 01:13:36,032 --> 01:13:37,740 Because when you're doing load balancing, 1164 01:13:37,740 --> 01:13:40,130 you want to take your load and spread it evenly and equally 1165 01:13:40,130 --> 01:13:41,046 among all the servers. 1166 01:13:43,540 --> 01:13:46,720 Any questions about the problem? 1167 01:13:46,720 --> 01:13:48,800 You've got a bunch of jobs, a bunch of servers. 1168 01:13:48,800 --> 01:13:50,550 You want to assign the jobs to the servers 1169 01:13:50,550 --> 01:13:53,570 to balance the load. 1170 01:13:53,570 --> 01:13:56,720 Well, what is the simplest algorithm 1171 01:13:56,720 --> 01:13:58,410 you could think of to do this? 1172 01:13:58,410 --> 01:14:00,134 AUDIENCE: [INAUDIBLE] 1173 01:14:00,134 --> 01:14:02,050 PROFESSOR: That's a good algorithm to do this. 1174 01:14:02,050 --> 01:14:04,540 In practice, the first thing people 1175 01:14:04,540 --> 01:14:11,910 do is, well, take the first N/M jobs, put them on server one, 1176 01:14:11,910 --> 01:14:14,560 the next N/M on server two. 1177 01:14:14,560 --> 01:14:17,970 Or they'll use something called round robin-- first job goes 1178 01:14:17,970 --> 01:14:22,460 here, second here, third here, 10th here, back and start over. 1179 01:14:22,460 --> 01:14:25,230 And they hope that it will balance the load. 1180 01:14:25,230 --> 01:14:27,340 But it might well not. 1181 01:14:27,340 --> 01:14:31,486 Because maybe every 10th job is a big one. 1182 01:14:31,486 --> 01:14:33,110 So what's much better to do in practice 1183 01:14:33,110 --> 01:14:35,370 is to assign them randomly. 1184 01:14:35,370 --> 01:14:37,370 So a job comes in. 1185 01:14:37,370 --> 01:14:39,290 You don't even pay attention to how hard 1186 01:14:39,290 --> 01:14:41,510 it is, how much time you think it'll take. 1187 01:14:41,510 --> 01:14:44,530 You might not even know before you start the job how long it's 1188 01:14:44,530 --> 01:14:45,920 going to take to complete. 1189 01:14:45,920 --> 01:14:48,280 Give it to a random server. 1190 01:14:48,280 --> 01:14:50,990 Don't even look at how much work that server has. 1191 01:14:50,990 --> 01:14:53,550 Just give it to a random one.
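(Here is a minimal Python sketch of the random assignment policy just described. It is only an illustration: the helper name assign_randomly is mine, and the uniform job-length distribution is an assumption chosen to give lengths in [0, 1] with mean 1/4, matching the lecture's numbers.)

    import random

    def assign_randomly(job_lengths, num_servers):
        # The policy from the lecture: ignore job sizes and server state,
        # and send each incoming job to a uniformly random server.
        loads = [0.0] * num_servers
        for length in job_lengths:
            loads[random.randrange(num_servers)] += length
        return loads

    # N = 100,000 jobs, M = 10 servers, average job length 1/4 second.
    jobs = [random.uniform(0.0, 0.5) for _ in range(100_000)]
    print(max(assign_randomly(jobs, 10)))  # typically very close to L/M = 2,500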
1192 01:14:53,550 --> 01:14:56,550 And it turns out this does very, very well. 1193 01:14:56,550 --> 01:14:58,950 Without knowing anything, just that simple approach 1194 01:14:58,950 --> 01:15:00,940 does great in practice. 1195 01:15:00,940 --> 01:15:05,395 And today, state of the art load balancers do this. 1196 01:15:08,610 --> 01:15:11,120 We've been doing randomized things like this 1197 01:15:11,120 --> 01:15:12,780 at Akamai now for a decade. 1198 01:15:12,780 --> 01:15:16,820 And it's just stunning how well it works. 1199 01:15:16,820 --> 01:15:18,580 And so let's see why that is. 1200 01:15:24,730 --> 01:15:28,320 Of course we're going to use the Chernoff bound to do it. 1201 01:15:28,320 --> 01:15:47,080 So let's let Rij be the load on server Si from job Bj. 1202 01:15:47,080 --> 01:15:54,470 Now, if Bj is not assigned to Si, it's zero load. 1203 01:15:54,470 --> 01:15:57,040 Because it's not even doing the work there. 1204 01:15:57,040 --> 01:16:03,950 So we know that Rij equals the load of Bj 1205 01:16:03,950 --> 01:16:06,971 if it's assigned to Si. 1206 01:16:11,020 --> 01:16:15,890 And that happens with probability 1/M. 1207 01:16:15,890 --> 01:16:19,610 The job picks one of the M servers at random. 1208 01:16:19,610 --> 01:16:22,750 And otherwise, the load is 0. 1209 01:16:22,750 --> 01:16:25,770 Because it's not assigned to that server. 1210 01:16:25,770 --> 01:16:31,860 And that is probability 1 minus 1/M. 1211 01:16:31,860 --> 01:16:33,850 Now let's look at how much load gets 1212 01:16:33,850 --> 01:16:39,140 assigned by this random algorithm to server i. 1213 01:16:39,140 --> 01:16:48,890 So we'll let Ri be the sum of all the load assigned 1214 01:16:48,890 --> 01:16:49,630 to server i. 1215 01:16:58,830 --> 01:17:02,710 So we've got something like indicator variables, except the random variables 1216 01:17:02,710 --> 01:17:03,510 are not 0, 1. 1217 01:17:03,510 --> 01:17:05,760 They're 0 or whatever this load happens to be 1218 01:17:05,760 --> 01:17:08,540 for the j-th job, which is at most 1. 1219 01:17:08,540 --> 01:17:12,180 And we sum up the value for the contribution 1220 01:17:12,180 --> 01:17:14,390 to Si over all the jobs. 1221 01:17:17,360 --> 01:17:19,500 So now we compute the expected value 1222 01:17:19,500 --> 01:17:22,628 of Ri, the expected load on the i-th server. 1223 01:17:28,500 --> 01:17:32,930 So the expected load on the i-th server 1224 01:17:32,930 --> 01:17:35,975 is-- well, we use linearity of expectation. 1225 01:17:44,870 --> 01:17:50,070 And the expected value of Rij-- well, it's 0 or Lj. 1226 01:17:50,070 --> 01:17:53,650 It's Lj with probability 1/M. So this 1227 01:17:53,650 --> 01:18:06,060 is just the sum of Lj over M. And the sum of the Lj's is just L. 1228 01:18:06,060 --> 01:18:10,050 So the expected load of the i-th server 1229 01:18:10,050 --> 01:18:11,740 is the total load divided by the number 1230 01:18:11,740 --> 01:18:14,210 of servers, which is perfect. 1231 01:18:14,210 --> 01:18:16,910 It's optimal-- can't do better than that. 1232 01:18:20,345 --> 01:18:20,970 It makes sense. 1233 01:18:20,970 --> 01:18:23,600 If you assign all the jobs randomly, 1234 01:18:23,600 --> 01:18:26,960 every server is expecting to get 1/M of the total load. 1235 01:18:29,402 --> 01:18:30,860 Now we want to know the probability 1236 01:18:30,860 --> 01:18:33,260 it deviates from that, that you have too 1237 01:18:33,260 --> 01:18:34,996 much load on the i-th server.
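(The expectation just derived, E[Ri] = L/M, is easy to mirror numerically before turning to deviations; an illustrative sketch with my own variable names and the same assumed job lengths as before.)

    import random

    random.seed(6042)  # arbitrary seed, just to make the run repeatable
    M = 10
    job_lengths = [random.uniform(0.0, 0.5) for _ in range(100_000)]
    L = sum(job_lengths)                    # total workload, about 25,000

    # Each job contributes Lj/M to server i in expectation, so by linearity
    # E[Ri] is the sum of the Lj/M, which is L/M.
    expected_load = sum(lj / M for lj in job_lengths)
    print(expected_load, L / M)             # the two agree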
1238 01:18:51,460 --> 01:18:55,760 All right, so the probability that the i-th server has 1239 01:18:55,760 --> 01:19:02,530 at least c times the optimal load is at most, by Chernoff, 1240 01:19:02,530 --> 01:19:06,910 if the jobs are assigned independently, e to the minus zL over M-- 1241 01:19:06,910 --> 01:19:12,780 minus z times the expected load-- where z is c 1242 01:19:12,780 --> 01:19:16,530 ln c plus 1 minus c. 1243 01:19:16,530 --> 01:19:18,367 This is Chernoff now, just straight 1244 01:19:18,367 --> 01:19:20,700 from the formula of Chernoff, as long as these loads are 1245 01:19:20,700 --> 01:19:23,110 mutually independent. 1246 01:19:23,110 --> 01:19:29,320 All right, so we know that when c gets to be-- I don't know, 1247 01:19:29,320 --> 01:19:36,970 you pick 10% above optimal, c equals 1.1, 1248 01:19:36,970 --> 01:19:40,860 well, we know that this is going to be a very small number. 1249 01:19:40,860 --> 01:19:42,670 L/M is 2,500. 1250 01:19:42,670 --> 01:19:48,730 And z, in this case, we found was 0.0048. 1251 01:19:48,730 --> 01:19:56,970 So we get e to the minus 0.0048 times 2,500. 1252 01:19:56,970 --> 01:19:58,400 And that is really tiny. 1253 01:19:58,400 --> 01:20:04,730 That's less than 1 in 160,000. 1254 01:20:04,730 --> 01:20:06,960 So Chernoff tells us the probability 1255 01:20:06,960 --> 01:20:08,840 that any server, a particular server, 1256 01:20:08,840 --> 01:20:15,220 gets 10% more load than you expect is minuscule. 1257 01:20:15,220 --> 01:20:18,230 Now, we're not quite done. 1258 01:20:18,230 --> 01:20:21,150 That tells us the probability the first server gets 1259 01:20:21,150 --> 01:20:26,830 10% too much load, or the probability the second server got 10% too 1260 01:20:26,830 --> 01:20:30,030 much load, and so forth. 1261 01:20:30,030 --> 01:20:35,030 But what we really care about is the worst server. 1262 01:20:35,030 --> 01:20:36,881 If all of them are good except for one, 1263 01:20:36,881 --> 01:20:37,880 you're still in trouble. 1264 01:20:37,880 --> 01:20:40,570 Because the one ruined your day. 1265 01:20:40,570 --> 01:20:43,200 Because it didn't get the work done. 1266 01:20:43,200 --> 01:20:45,720 So what do you do to bound the probability 1267 01:20:45,720 --> 01:20:52,510 that any of the servers got too much load, any of the 10? 1268 01:20:52,510 --> 01:20:55,820 So what I really want to know is the probability 1269 01:20:55,820 --> 01:21:03,725 that the worst server of M takes more than cL 1270 01:21:03,725 --> 01:21:10,190 over M. Well, that's the probability that the first one 1271 01:21:10,190 --> 01:21:16,320 has more than cL over M, union the second one has more than cL 1272 01:21:16,320 --> 01:21:19,962 over M, union, and so on, up to the M-th one. 1273 01:21:25,850 --> 01:21:29,760 What do I do to get that probability, the probability 1274 01:21:29,760 --> 01:21:33,510 of a union of events, upper bounded? 1275 01:21:33,510 --> 01:21:35,970 AUDIENCE: [INAUDIBLE] 1276 01:21:35,970 --> 01:21:38,560 PROFESSOR: Upper bounded by the sum of the individual guys. 1277 01:21:38,560 --> 01:21:44,456 It's the sum, i equals 1 to M, of the probability Ri is greater than 1278 01:21:44,456 --> 01:21:48,230 or equal to cL over M. And each of these 1279 01:21:48,230 --> 01:21:50,740 is at most 1 in 160,000. 1280 01:21:50,740 --> 01:21:57,320 This is at most M/160,000. 1281 01:21:57,320 --> 01:22:02,899 And that is equal to 1 in 16,000. 1282 01:22:02,899 --> 01:22:04,440 All right, so now we have the answer.
1283 01:22:04,440 --> 01:22:07,770 The chance that any server got 10% or more above its expected load 1284 01:22:07,770 --> 01:22:14,790 is 1 in 16,000 at most, which is why randomized load balancing 1285 01:22:14,790 --> 01:22:16,700 is used a lot in practice. 1286 01:22:16,700 --> 01:22:23,250 Now tomorrow, you're going to do a real-world example where 1287 01:22:23,250 --> 01:22:25,840 people use this kind of analysis, 1288 01:22:25,840 --> 01:22:28,550 and it led to utter disaster. 1289 01:22:28,550 --> 01:22:33,460 And the reason was that the components they were looking at 1290 01:22:33,460 --> 01:22:35,282 were not independent. 1291 01:22:35,282 --> 01:22:37,865 And the example has to do with the subprime mortgage disaster. 1292 01:22:37,865 --> 01:22:39,906 And I don't have time today to go through it all. 1293 01:22:39,906 --> 01:22:42,060 But it's in the text, and you'll see it tomorrow. 1294 01:22:42,060 --> 01:22:44,270 But basically what happened is that they 1295 01:22:44,270 --> 01:22:47,950 took a whole bunch of loans, subprime loans, 1296 01:22:47,950 --> 01:22:50,780 put them into these things called bonds, 1297 01:22:50,780 --> 01:22:52,970 and then did an analysis about how many failures 1298 01:22:52,970 --> 01:22:55,040 they'd expect to have. 1299 01:22:55,040 --> 01:22:58,300 And they assumed the loans were all mutually independent. 1300 01:22:58,300 --> 01:23:00,700 And they applied their Chernoff bounds. 1301 01:23:00,700 --> 01:23:02,800 And they concluded that the chances 1302 01:23:02,800 --> 01:23:05,270 of being off from the expectation were nil, like e 1303 01:23:05,270 --> 01:23:07,890 to the minus 380. 1304 01:23:07,890 --> 01:23:10,950 In reality, the loans were highly dependent. 1305 01:23:10,950 --> 01:23:13,100 When one failed, a lot tended to fail. 1306 01:23:13,100 --> 01:23:14,207 And that led to disaster. 1307 01:23:14,207 --> 01:23:15,790 And you'll go through some of the math 1308 01:23:15,790 --> 01:23:18,220 on that tomorrow in recitation.
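(The load-balancing numbers above are easy to reproduce: a short Python sketch of the per-server Chernoff bound and the union bound over servers; the function and variable names are mine, not from the lecture.)

    import math

    def z(c):
        # Chernoff exponent: z = c ln c + 1 - c
        return c * math.log(c) + 1 - c

    M, load_per_server = 10, 2_500        # M servers, expected load L/M each
    one_server = math.exp(-z(1.1) * load_per_server)
    print(one_server)                      # ~6e-6, less than 1 in 160,000

    # Union bound: the chance that any of the M servers exceeds 1.1 * L/M
    print(M * one_server)                  # at most 1 in 16,000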