1 00:00:00,600 --> 00:00:03,060 In this segment we look into the probability 2 00:00:03,060 --> 00:00:05,570 that the sum of n independent identically 3 00:00:05,570 --> 00:00:07,760 distributed random variables takes 4 00:00:07,760 --> 00:00:10,790 an abnormally large value. 5 00:00:10,790 --> 00:00:14,460 We will get an upper bound on this quantity, which 6 00:00:14,460 --> 00:00:16,800 is known as Hoeffding's inequality. 7 00:00:16,800 --> 00:00:19,800 This is an upper bound that applies to a special case, 8 00:00:19,800 --> 00:00:23,040 although the method actually generalizes. 9 00:00:23,040 --> 00:00:25,800 Here is the special case that we will consider. 10 00:00:25,800 --> 00:00:28,900 The random variables, the Xi's, 11 00:00:28,900 --> 00:00:33,030 take the values minus 1 and plus 12 00:00:33,030 --> 00:00:34,755 1, with equal probability. 13 00:00:37,420 --> 00:00:40,140 And we're interested in the random variable Y, 14 00:00:40,140 --> 00:00:42,270 which is the sum of the X's. 15 00:00:46,300 --> 00:00:49,180 What do we know about this random variable? 16 00:00:49,180 --> 00:00:54,000 Well, the expected value of each one of the Xi's is equal to 0, 17 00:00:54,000 --> 00:00:56,410 because the distribution is symmetric. 18 00:00:56,410 --> 00:01:00,340 And also, the distance of Xi from the mean 19 00:01:00,340 --> 00:01:02,410 always has magnitude 1. 20 00:01:02,410 --> 00:01:04,379 And for this reason, the variance 21 00:01:04,379 --> 00:01:08,130 of the Xi's is equal to 1. 22 00:01:08,130 --> 00:01:14,180 For this reason, the random variable Y has a mean of 0 23 00:01:14,180 --> 00:01:16,690 and a variance equal to n. 24 00:01:21,050 --> 00:01:24,630 Now, what do we know about the random variable Y? 25 00:01:24,630 --> 00:01:30,350 By the central limit theorem, Y has an approximately normal 26 00:01:30,350 --> 00:01:32,370 distribution. 27 00:01:32,370 --> 00:01:35,990 The distribution is centered at 0. 
28 00:01:35,990 --> 00:01:42,670 And also, the random variable Y over square root of n, 29 00:01:42,670 --> 00:01:45,700 this is a standardized random variable. 30 00:01:45,700 --> 00:01:47,520 So it's approximately normal. 31 00:01:47,520 --> 00:01:50,910 And so the probability that this number is larger than or equal 32 00:01:50,910 --> 00:01:57,840 to some a is approximately 1 minus the cumulative 33 00:01:57,840 --> 00:02:00,350 of the standard normal. 34 00:02:00,350 --> 00:02:04,815 So Phi here stands for the standard normal CDF. 35 00:02:10,400 --> 00:02:12,310 What does this tell us? 36 00:02:12,310 --> 00:02:15,130 It tells us that if we take somewhere here, 37 00:02:15,130 --> 00:02:22,860 the number square root of n times a, then this probability 38 00:02:22,860 --> 00:02:26,940 down here in the tail is approximately constant, 39 00:02:26,940 --> 00:02:29,250 no matter what n is. 40 00:02:29,250 --> 00:02:32,260 And this in particular tells us that values 41 00:02:32,260 --> 00:02:38,940 of the order of square root n are fairly likely to occur. 42 00:02:38,940 --> 00:02:42,190 However, what we're interested in here is not 43 00:02:42,190 --> 00:02:44,680 being larger than square root n times a. 44 00:02:44,680 --> 00:02:48,040 We're interested in being larger than n times a. 45 00:02:48,040 --> 00:02:52,970 So we're talking about what happens further down 46 00:02:52,970 --> 00:02:55,450 in the tail of the distribution. 47 00:02:55,450 --> 00:02:58,710 So if we take here n times a, we're 48 00:02:58,710 --> 00:03:01,530 looking at this probability down here. 49 00:03:01,530 --> 00:03:06,210 And we want to ask, how small is that probability? 50 00:03:06,210 --> 00:03:08,430 Well, we have Chebyshev's inequality. 
51 00:03:08,430 --> 00:03:10,775 And Chebyshev's inequality tells us 52 00:03:10,775 --> 00:03:14,950 that the probability of Y being larger than a certain number 53 00:03:14,950 --> 00:03:19,850 is less than or equal to the variance of Y divided 54 00:03:19,850 --> 00:03:21,829 by the square of that number. 55 00:03:24,510 --> 00:03:27,100 And in this case, since the variance is n and the number is n times a, 56 00:03:27,100 --> 00:03:30,470 this is 1 over n times a squared. 57 00:03:30,470 --> 00:03:32,210 So Chebyshev's inequality tells us 58 00:03:32,210 --> 00:03:35,460 that this probability goes to 0, and it 59 00:03:35,460 --> 00:03:40,829 goes to 0 at least as fast as 1/n goes to 0. 60 00:03:40,829 --> 00:03:44,640 However, it turns out that this is extremely conservative. 61 00:03:44,640 --> 00:03:47,930 Hoeffding's inequality, which we're going to establish, 62 00:03:47,930 --> 00:03:50,350 tells us something much stronger. 63 00:03:50,350 --> 00:03:54,230 It tells us that this tail probability down here 64 00:03:54,230 --> 00:03:59,110 falls exponentially with n. 65 00:03:59,110 --> 00:04:01,440 So this is what we want to show. 66 00:04:01,440 --> 00:04:07,130 And let us get started to see how the derivation goes. 67 00:04:07,130 --> 00:04:10,900 The derivation relies on a beautiful trick. 68 00:04:10,900 --> 00:04:13,900 Instead of looking at this event here, 69 00:04:13,900 --> 00:04:17,440 we're going to look at the following equivalent event. 70 00:04:20,399 --> 00:04:22,780 Let us fix some number s. 71 00:04:25,390 --> 00:04:30,520 We're going to leave the choice of s free for now. 72 00:04:30,520 --> 00:04:33,130 It is only that we're going to assume 73 00:04:33,130 --> 00:04:35,720 that s is a positive number. 74 00:04:38,680 --> 00:04:40,850 And throughout, we're also assuming 75 00:04:40,850 --> 00:04:43,380 that a is also a positive number. 
76 00:04:46,070 --> 00:04:49,300 Now, we look at this quantity, and the event 77 00:04:49,300 --> 00:04:52,240 that this quantity is larger than or equal 78 00:04:52,240 --> 00:04:57,920 to e to the sn times a. 79 00:04:57,920 --> 00:05:02,460 Now, this sum is larger than or equal to na if and only 80 00:05:02,460 --> 00:05:07,216 if this quantity is larger than or equal to e to the sna. 81 00:05:07,216 --> 00:05:10,520 This is because a and s are both positive. 82 00:05:10,520 --> 00:05:12,510 So the direction of the inequalities 83 00:05:12,510 --> 00:05:14,840 does not get reversed, and also because 84 00:05:14,840 --> 00:05:18,360 the exponential function is monotonic. 85 00:05:18,360 --> 00:05:23,000 So this event is the same as that event. 86 00:05:23,000 --> 00:05:26,050 So we will try to say something about the probability 87 00:05:26,050 --> 00:05:27,660 of this event. 88 00:05:27,660 --> 00:05:29,680 How are we going to do it? 89 00:05:29,680 --> 00:05:31,880 We will use the Markov inequality, 90 00:05:31,880 --> 00:05:37,740 where Z is the random variable that appears here. 91 00:05:37,740 --> 00:05:41,450 So by Markov's inequality, this probability 92 00:05:41,450 --> 00:05:45,150 is less than or equal to the expected value 93 00:05:45,150 --> 00:05:46,930 of the random variable that we are 94 00:05:46,930 --> 00:05:56,550 dealing with divided by this value. 95 00:06:02,190 --> 00:06:05,660 Now, the exponential of a sum, we 96 00:06:05,660 --> 00:06:08,410 can factor it as a product of exponentials. 97 00:06:20,950 --> 00:06:26,220 And then we use the assumption that the X's are independent. 98 00:06:26,220 --> 00:06:28,870 Since the X's are independent, e to the sX1 99 00:06:28,870 --> 00:06:32,510 is independent from e to the sX2 and so on. 100 00:06:32,510 --> 00:06:34,409 And so we have the expected value 101 00:06:34,409 --> 00:06:37,950 of a product of independent random variables. 
102 00:06:37,950 --> 00:06:42,670 And so this is equal to the product of the expectations. 103 00:06:42,670 --> 00:06:46,909 So we're going to multiply the expected value of e 104 00:06:46,909 --> 00:06:52,230 to the sX1 with the expected value of e to the sX2 105 00:06:52,230 --> 00:06:52,890 and so on. 106 00:06:52,890 --> 00:06:56,710 But because all the Xi's are identically distributed, 107 00:06:56,710 --> 00:06:59,440 the terms we get are all the same. 108 00:06:59,440 --> 00:07:04,880 So we get this term to the nth power divided, again, 109 00:07:04,880 --> 00:07:05,990 by e to the sna. 110 00:07:09,320 --> 00:07:14,270 Or we can write this in more suggestive form, as follows. 111 00:07:14,270 --> 00:07:21,720 It's the expected value of e to the sX1 divided by e 112 00:07:21,720 --> 00:07:26,750 to the sa, all of that to the nth power. 113 00:07:26,750 --> 00:07:31,590 So think of that as being some number rho to the nth power. 114 00:07:34,330 --> 00:07:37,580 When is this bound going to be interesting? 115 00:07:37,580 --> 00:07:41,930 It's going to be interesting if rho is less than 1, 116 00:07:41,930 --> 00:07:47,450 because in that case, this bound falls exponentially with n. 117 00:07:47,450 --> 00:07:49,470 And so this probability in particular 118 00:07:49,470 --> 00:07:51,610 will fall exponentially with n. 119 00:07:55,050 --> 00:07:58,560 The key here is that we have freedom to choose s. 120 00:07:58,560 --> 00:08:02,940 For any value of s, we obtain an upper bound. 121 00:08:02,940 --> 00:08:07,390 We're going to choose s so that we get the most informative 122 00:08:07,390 --> 00:08:10,860 or a most powerful upper bound. 123 00:08:10,860 --> 00:08:14,690 So let us continue to see what we can do. 124 00:08:14,690 --> 00:08:19,600 First, let us write down what this expected value is. 
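
The chain of steps up to this point can be written out as a worked derivation. This is a sketch in LaTeX, assuming Y denotes the sum X_1 + ... + X_n and rho denotes the per-variable ratio described in the narration; the symbols follow the spoken description rather than the slide itself.

```latex
\begin{align*}
P(Y \ge na)
  &= P\left(e^{sY} \ge e^{sna}\right)
     && \text{($s > 0$, exponential is monotonic)} \\
  &\le \frac{E\left[e^{sY}\right]}{e^{sna}}
     && \text{(Markov's inequality)} \\
  &= \frac{\prod_{i=1}^{n} E\left[e^{sX_i}\right]}{e^{sna}}
     && \text{(independence)} \\
  &= \left(\frac{E\left[e^{sX_1}\right]}{e^{sa}}\right)^{n}
   = \rho^{n}
     && \text{(identical distributions)}
\end{align*}
```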
125 00:08:19,600 --> 00:08:27,110 Since X1 takes values minus 1 or plus 1 with equal probability, 126 00:08:27,110 --> 00:08:29,380 this expectation is the following. 127 00:08:29,380 --> 00:08:34,130 With probability 1/2, X1 takes the value of 1. 128 00:08:34,130 --> 00:08:37,250 And so this random variable is e to the s. 129 00:08:37,250 --> 00:08:39,440 And with probability 1/2, it takes the value 130 00:08:39,440 --> 00:08:43,159 minus one, in which case this random variable is 131 00:08:43,159 --> 00:08:45,070 e to the minus s. 132 00:08:45,070 --> 00:08:47,650 So this is the expectation in the numerator. 133 00:08:47,650 --> 00:08:52,400 And we write again the term in the denominator. 134 00:08:52,400 --> 00:08:56,870 And we have all this to the power n. 135 00:08:56,870 --> 00:09:00,820 If we can choose s so that this quantity is less than 1, 136 00:09:00,820 --> 00:09:05,450 we will have achieved our objective. 137 00:09:05,450 --> 00:09:07,730 Can we do that? 138 00:09:07,730 --> 00:09:08,700 Let's see. 139 00:09:08,700 --> 00:09:12,070 Let's look at the numerator as a function of s. 140 00:09:14,720 --> 00:09:18,680 When s is equal to 0, we have 1 plus 1 divided by 2. 141 00:09:18,680 --> 00:09:21,570 That gives us 1. 142 00:09:21,570 --> 00:09:25,640 And then as s moves away from 0, this function 143 00:09:25,640 --> 00:09:28,520 will have this kind of shape. 144 00:09:28,520 --> 00:09:32,873 And it is symmetric around 0, because we have an s 145 00:09:32,873 --> 00:09:34,400 and a minus s here. 146 00:09:37,110 --> 00:09:42,260 In particular, the derivative of this function is 0 at 0. 147 00:09:42,260 --> 00:09:44,990 Let's look at the denominator term. 148 00:09:44,990 --> 00:09:49,490 The denominator term is an exponential. 149 00:09:49,490 --> 00:09:52,850 a is a positive number, so it's an exponential 150 00:09:52,850 --> 00:09:55,925 that has a shape of this kind. 
151 00:09:58,930 --> 00:10:01,740 The important thing to notice is that this exponential 152 00:10:01,740 --> 00:10:06,060 has a positive derivative at 0. 153 00:10:06,060 --> 00:10:07,200 What does that tell us? 154 00:10:07,200 --> 00:10:14,100 That at least in the vicinity of 0, this term, the denominator, 155 00:10:14,100 --> 00:10:17,490 is going to be larger than the numerator term. 156 00:10:17,490 --> 00:10:21,260 And that implies that in the vicinity of 0, 157 00:10:21,260 --> 00:10:25,350 this fraction is going to be less than 1. 158 00:10:25,350 --> 00:10:27,890 And we will have achieved our goal 159 00:10:27,890 --> 00:10:29,880 of an exponentially decaying bound. 160 00:10:32,890 --> 00:10:41,090 So the conclusion is that for small s, 161 00:10:41,090 --> 00:10:45,100 we have that rho is less than 1. 162 00:10:45,100 --> 00:10:49,190 Now, we would like to get an explicit value for rho. 163 00:10:49,190 --> 00:10:53,300 And we will do that by fixing a specific value for s. 164 00:10:53,300 --> 00:10:59,970 It turns out that if we set s to be equal to a, then the bound 165 00:10:59,970 --> 00:11:05,710 that we get is going to be that this probability here 166 00:11:05,710 --> 00:11:16,130 is less than or equal to e to the minus na squared over 2. 167 00:11:16,130 --> 00:11:18,090 And this is the Hoeffding bound. 168 00:11:22,130 --> 00:11:26,190 At this point, you may just pause. 169 00:11:26,190 --> 00:11:30,490 Or if you're curious, you can continue with this video 170 00:11:30,490 --> 00:11:33,800 to see the algebraic manipulations involved 171 00:11:33,800 --> 00:11:36,560 in order to show that this expression is less than 172 00:11:36,560 --> 00:11:39,210 or equal to that expression. 173 00:11:39,210 --> 00:11:43,770 But before going there, I would like to make a general comment. 
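
The claims that rho is less than 1 for small positive s, and that the choice s equal to a gives a factor no larger than e to the minus a squared over 2, can be checked numerically. The following is a minimal sketch; the function name `rho` and the sample value a = 0.3 are illustrative choices, not something taken from the lecture.

```python
import math

def rho(s: float, a: float) -> float:
    """Ratio E[e^{s X_1}] / e^{s a} when X_1 is +1 or -1 with probability 1/2 each."""
    numerator = 0.5 * (math.exp(s) + math.exp(-s))  # E[e^{s X_1}]
    return numerator / math.exp(s * a)

a = 0.3  # illustrative threshold, must be positive

# At s = 0 the ratio is exactly 1; for small positive s it drops below 1.
for s in (0.0, 0.01, 0.1, a):
    print(f"s = {s:.2f}   rho = {rho(s, a):.6f}")

# With s = a, the ratio should not exceed the Hoeffding factor e^{-a^2/2}.
hoeffding_factor = math.exp(-a * a / 2)
print("rho(a, a) <= exp(-a^2/2):", rho(a, a) <= hoeffding_factor)
```

Raising `rho(a, a)` to the n-th power then gives a bound that is at most e to the minus n a squared over 2.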
174 00:11:43,770 --> 00:11:49,770 Even if the X's had a different distribution but with 0 mean, 175 00:11:49,770 --> 00:11:53,640 the derivation up to this point would go through. 176 00:11:53,640 --> 00:11:56,660 Here you would have a somewhat different expression 177 00:11:56,660 --> 00:12:00,590 for the expected value of e to the sX1. 178 00:12:00,590 --> 00:12:04,380 However, it turns out that the expression that you get here 179 00:12:04,380 --> 00:12:09,740 will always have this property that it has a 0 derivative at 0. 180 00:12:09,740 --> 00:12:11,840 This is a consequence of the assumption 181 00:12:11,840 --> 00:12:14,480 that the mean is zero. 182 00:12:14,480 --> 00:12:16,520 And because of that, we will still 183 00:12:16,520 --> 00:12:18,380 have a picture of this kind. 184 00:12:18,380 --> 00:12:22,210 And so this fraction will always be less than 1 185 00:12:22,210 --> 00:12:25,810 when we choose s to be suitably small. 186 00:12:25,810 --> 00:12:29,210 And so this is going to give us a result for more 187 00:12:29,210 --> 00:12:30,770 general distributions. 188 00:12:30,770 --> 00:12:37,520 And that more general result is known as the Chernoff bound. 189 00:12:37,520 --> 00:12:40,290 However, we will not develop in this video 190 00:12:40,290 --> 00:12:43,850 the Chernoff bound in its full generality. 191 00:12:43,850 --> 00:12:46,190 We will just stay with Hoeffding's inequality 192 00:12:46,190 --> 00:12:48,590 that gives us the basic idea. 193 00:12:48,590 --> 00:12:54,630 And what we will do next will be to derive this inequality. 194 00:12:54,630 --> 00:12:57,870 So I'm carrying over what we figured out 195 00:12:57,870 --> 00:13:02,150 in the previous slide-- and this is the quantity here 196 00:13:02,150 --> 00:13:03,390 that we wish to bound. 197 00:13:06,210 --> 00:13:08,570 We will look at the numerator term. 
198 00:13:08,570 --> 00:13:11,620 And we're going to use a Taylor series 199 00:13:11,620 --> 00:13:15,020 for the exponential function. 200 00:13:15,020 --> 00:13:17,440 Remember, the Taylor series for the exponential function 201 00:13:17,440 --> 00:13:18,800 takes this form. 202 00:13:18,800 --> 00:13:24,960 And using that, we have 1/2 times e to the s plus e to the minus 203 00:13:24,960 --> 00:13:28,840 s is equal to the following. 204 00:13:28,840 --> 00:13:31,870 We first write the Taylor series for e to the s. 205 00:13:31,870 --> 00:13:33,620 I'm just copying from here. 206 00:13:33,620 --> 00:13:37,020 It's 1 plus s plus s squared over 207 00:13:37,020 --> 00:13:43,170 2 factorial plus s cubed over 3 factorial. 208 00:13:43,170 --> 00:13:45,730 And we continue similarly. 209 00:13:45,730 --> 00:13:48,845 And then for the term e to the minus 210 00:13:48,845 --> 00:13:53,840 s, we have a similar expansion, except that we put a minus s 211 00:13:53,840 --> 00:13:55,510 in the place of s. 212 00:13:55,510 --> 00:14:01,310 Now, minus s squared is the same as s squared, with a plus sign. 213 00:14:01,310 --> 00:14:04,320 But for s cubed, when we have minus s, 214 00:14:04,320 --> 00:14:08,050 this becomes minus s cubed and so on. 215 00:14:08,050 --> 00:14:11,460 And so we see that in this expansion here, 216 00:14:11,460 --> 00:14:15,710 we will alternate between positive and negative signs. 217 00:14:15,710 --> 00:14:19,280 This means that all of the odd power terms 218 00:14:19,280 --> 00:14:21,630 will cancel each other. 219 00:14:21,630 --> 00:14:25,460 But the even power terms will survive. 220 00:14:25,460 --> 00:14:33,295 So what we obtain is the sum of all of those terms. 221 00:14:35,990 --> 00:14:39,210 But we only have the even power terms. 222 00:14:39,210 --> 00:14:43,100 So we have powers of the form 2i. 223 00:14:43,100 --> 00:14:46,270 These are the even integers. 
224 00:14:46,270 --> 00:14:49,030 And in the denominators, we will always 225 00:14:49,030 --> 00:14:53,300 have the factorial of whatever exponent we have at the top. 226 00:14:59,100 --> 00:15:05,270 Now, let us get a bound on this term in the denominator. 227 00:15:05,270 --> 00:15:13,490 2i factorial is 1 times 2 times 3, all the way up to i. 228 00:15:13,490 --> 00:15:22,810 And then we continue-- i plus 1, i plus 2, all the way up to 2i. 229 00:15:22,810 --> 00:15:26,390 And what we have is, first, i factorial. 230 00:15:29,120 --> 00:15:34,120 But then each one of these terms is larger than or equal to 2. 231 00:15:34,120 --> 00:15:37,100 And we have i such terms. 232 00:15:37,100 --> 00:15:39,760 And this gives us this inequality. 233 00:15:39,760 --> 00:15:42,170 So we're going to use the substitution here. 234 00:15:42,170 --> 00:15:44,630 Because this term is in the denominator, 235 00:15:44,630 --> 00:15:48,590 the direction of the inequality is going to be reversed. 236 00:15:48,590 --> 00:15:49,780 And we obtain this. 237 00:16:05,380 --> 00:16:09,520 Now, we can rewrite this by taking this term 2 to the i 238 00:16:09,520 --> 00:16:12,285 and combining it with the other term in the numerator. 239 00:16:18,190 --> 00:16:22,710 And what we have is s squared divided by 2-- 240 00:16:22,710 --> 00:16:24,395 all of that to the i'th power. 241 00:16:34,810 --> 00:16:38,070 Now, does this expression look familiar? 242 00:16:38,070 --> 00:16:41,870 It is of exactly the same form as this expansion. 243 00:16:41,870 --> 00:16:45,990 But instead of s, we now have s squared over 2. 244 00:16:45,990 --> 00:16:54,230 Therefore, this is equal to e to the s squared over 2. 245 00:16:54,230 --> 00:16:58,840 So we managed to bound this term. 246 00:16:58,840 --> 00:17:04,319 Using now this bound, we go back to this inequality. 
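
The bound on the numerator just described can be collected into one line. This is a sketch in LaTeX of the steps as narrated, assuming the summation index i starts at 0; the only inequality used is that (2i)! is at least i! times 2 to the i.

```latex
\begin{align*}
\frac{1}{2}\left(e^{s} + e^{-s}\right)
  &= \sum_{i=0}^{\infty} \frac{s^{2i}}{(2i)!}
     && \text{(odd powers cancel)} \\
  &\le \sum_{i=0}^{\infty} \frac{s^{2i}}{i!\,2^{i}}
     && \text{(since } (2i)! \ge i!\,2^{i}\text{)} \\
  &= \sum_{i=0}^{\infty} \frac{\left(s^{2}/2\right)^{i}}{i!}
   = e^{s^{2}/2}
\end{align*}
```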
247 00:17:04,319 --> 00:17:09,890 And we have that this is less than or equal to-- 248 00:17:09,890 --> 00:17:17,060 in the numerator, we have e to the s squared over 2. 249 00:17:17,060 --> 00:17:21,252 In the denominator, we have e to the sa, 250 00:17:21,252 --> 00:17:26,118 and all that is raised to the n'th power. 251 00:17:26,118 --> 00:17:29,880 Or another way to write this is, e 252 00:17:29,880 --> 00:17:38,940 to the s squared over 2 minus sa, 253 00:17:38,940 --> 00:17:42,850 and all that to the n'th power. 254 00:17:42,850 --> 00:17:53,460 And now, if I choose s equal to a, what I obtain here 255 00:17:53,460 --> 00:17:58,110 is going to be e to the a squared over 2 minus a squared. 256 00:17:58,110 --> 00:18:02,620 That leaves me with e to the minus a squared over 2. 257 00:18:02,620 --> 00:18:05,590 And then I take this factor of n as well. 258 00:18:05,590 --> 00:18:09,810 And the final conclusion is that this quantity 259 00:18:09,810 --> 00:18:14,220 becomes equal to this term. 260 00:18:14,220 --> 00:18:16,900 And so we have completed the derivation 261 00:18:16,900 --> 00:18:21,500 that this expression is less than or equal to this quantity 262 00:18:21,500 --> 00:18:23,770 when we choose s equal to a. 263 00:18:23,770 --> 00:18:26,590 And this is Hoeffding's inequality.
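
To see how much stronger the exponential bound is than the 1/n bound from Chebyshev's inequality, both can be compared with the exact tail probability. The sketch below assumes the setting of this segment (n independent steps of plus or minus 1 with equal probability); the function names and the sample value a = 0.5 are illustrative and not part of the lecture.

```python
import math

def exact_tail(n: int, a: float) -> float:
    """Exact P(Y >= n*a) for Y = sum of n independent +/-1 steps, each with probability 1/2."""
    # Y = 2*K - n, where K ~ Binomial(n, 1/2) counts the +1 steps,
    # so Y >= n*a is the same event as K >= (n*a + n) / 2.
    k_min = max(math.ceil((n * a + n) / 2), 0)
    return sum(math.comb(n, k) for k in range(k_min, n + 1)) / 2 ** n

def chebyshev_bound(n: int, a: float) -> float:
    """var(Y) / (n*a)^2 = 1 / (n * a^2)."""
    return 1.0 / (n * a * a)

def hoeffding_bound(n: int, a: float) -> float:
    """e^{-n a^2 / 2}, the bound derived in this segment."""
    return math.exp(-n * a * a / 2)

a = 0.5  # illustrative threshold
for n in (10, 50, 100, 200):
    print(f"n = {n:3d}   exact = {exact_tail(n, a):.3e}   "
          f"Chebyshev = {chebyshev_bound(n, a):.3e}   "
          f"Hoeffding = {hoeffding_bound(n, a):.3e}")
```

Both bounds hold for every n, but the Chebyshev column shrinks only like 1/n while the Hoeffding column shrinks exponentially, which matches the behavior of the exact tail much more closely.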