We will now continue with the problem of inferring the unknown bias of a certain coin, for which we have a certain prior distribution, and of which we observe the number of heads in n independent coin tosses. We have already seen that if we assume a uniform prior, the posterior takes this particular form, which comes from the family of Beta distributions.

What we want to do now is to actually derive point estimates. That is, instead of just providing the posterior, we would like to select a specific estimate for the unknown bias.

Let us look at the maximum a posteriori probability estimate. How can we find it? By definition, the MAP estimate is the value of theta that maximizes the posterior, the value of theta at which the posterior is largest. Now, instead of maximizing the posterior, it is more convenient in this example to maximize the logarithm of the posterior. And the logarithm is k times log theta, plus n minus k times the log of 1 minus theta. To carry out the maximization over theta, we form the derivative with respect to theta and set that derivative to 0.
So the derivative of the first term is k over theta. And the derivative of the second term is n minus k over 1 minus theta. But because of the minus sign inside the logarithm, when we apply the chain rule, the plus sign here actually becomes a minus sign.

We now set this derivative to 0 and carry out the algebra, which is rather simple. The end result that you will find is that the estimate is equal to k over n. Notice that this is lowercase k: we are told the specific number of heads that has been observed. So little k is a number, and our estimate, accordingly, is a number.

This answer makes perfect sense. A very reasonable way of estimating the probability of heads of a certain coin is to look at the number of heads obtained and divide by the total number of trials. So we see that the MAP estimate turns out to be quite a natural one.

How about the corresponding estimator? Recall the distinction: the estimator is a random variable that tells us what the estimate is going to be, as a function of the random variable that is going to be observed.
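The maximization above can be checked numerically. Here is a minimal sketch (the values of n and k are just illustrative) that evaluates the log posterior on a fine grid of theta values and confirms that the maximizer is close to k over n:

```python
import numpy as np

n, k = 10, 7  # illustrative example: 7 heads observed in 10 tosses

# Log posterior (up to an additive constant that does not affect the maximizer):
# k * log(theta) + (n - k) * log(1 - theta)
theta = np.linspace(0.001, 0.999, 100_000)
log_post = k * np.log(theta) + (n - k) * np.log(1 - theta)

theta_map = theta[np.argmax(log_post)]
print(theta_map)  # close to k / n = 0.7
```

The endpoints 0 and 1 are excluded from the grid so that the logarithms stay finite.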
The estimator is uppercase K divided by little n. So it is a random variable whose value is determined by the value of the random variable capital K. If the random variable capital K happens to take on a specific value, little k, then our estimator, this random variable, will take the specific value k over n, which is the estimate.

And let us now compare with an alternative way of estimating Theta. We will consider estimating Theta by forming the conditional expectation of Theta, given the specific number of heads that we have observed. This is what we call the LMS, or least mean squares, estimate. To calculate this conditional expectation, all that we need to do is to form the integral of theta times the density of Theta. But since it is a conditional expectation, we need to take the conditional density of Theta. And the integral ranges from 0 to 1, because this is the range of our random variable Theta.

Now, what is this? We have a formula for the posterior density, so we just need to multiply that expression by theta and then integrate. The normalizing constant, 1 over d(n, k), does not involve theta, so it can be pulled outside the integral.
And inside the integral, we are left with this term times theta, which changes the exponent of theta to k plus 1. Then we have 1 minus theta to the power n minus k, d theta.

At this point, we need to do some calculations. What is d(n, k)? d(n, k) is the normalizing constant of this PDF. For this to be a PDF and to integrate to 1, d(n, k) has to be equal to the integral of this expression from 0 to 1. So we need to somehow be able to evaluate this integral.

Here, we will be helped by the following very nice formula. This formula tells us that the integral from 0 to 1 of theta to the power alpha, times 1 minus theta to the power beta, is equal to a very nice and simple expression: alpha factorial times beta factorial, divided by alpha plus beta plus 1 factorial. Of course, this formula is only valid when these factorials make sense, so we assume that alpha and beta are non-negative integers.

How is this formula derived? There are various algebraic or calculus-style derivations. One possibility is to use integration by parts, and there are also other tricks for deriving it. It turns out that there is also a very clever probabilistic proof of this fact. But in any case, we will not derive it. We will just take it as a fact that comes to us from calculus.
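Since we are taking the formula on faith, it is easy to at least spot-check it numerically. A small sketch (the particular values of alpha and beta are arbitrary) compares a midpoint-rule approximation of the integral against the factorial expression:

```python
from math import factorial

import numpy as np

alpha, beta = 3, 5  # arbitrary non-negative integers

# Left side: numerical integral of theta^alpha * (1 - theta)^beta over [0, 1],
# using the midpoint rule on N equal subintervals.
N = 1_000_000
theta = (np.arange(N) + 0.5) / N
lhs = np.mean(theta**alpha * (1 - theta)**beta)

# Right side: alpha! * beta! / (alpha + beta + 1)!
rhs = factorial(alpha) * factorial(beta) / factorial(alpha + beta + 1)

print(lhs, rhs)  # both approximately 0.001984 (= 1/504)
```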
And now, let us apply this formula. d(n, k) is equal to the integral of this expression, which is of this form, with alpha equal to k and beta equal to n minus k. So d(n, k) takes the form k factorial times n minus k factorial, divided by a denominator that is the sum of the two indices plus 1, factorial. The sum of the indices is k plus n minus k, which gives us n, and then there is a plus 1. So the denominator is n plus 1 factorial.

And how about the other integral? Well, this integral is also of the form that we have up here. But now, alpha is equal to k plus 1, and beta is n minus k. So the numerator is k plus 1 factorial times n minus k factorial. And in the denominator, we again have the sum of the indices plus 1. When we add the indices, we get n plus 1, and then another factor of 1 gives us n plus 2. So the denominator is n plus 2 factorial.

This looks formidable, but actually there are a lot of simplifications. The n minus k factorial term here cancels with that term. And k plus 1 factorial divided by k factorial, what is it? It is just a factor of k plus 1. And what do we have here? The n plus 1 factorial term is in the denominator of the denominator, so it can be moved up to the numerator. We have n plus 1 factorial divided by n plus 2 factorial.
This is just 1 over n plus 2. And this gives the final form of the answer: the conditional expectation of Theta, given that we observed k heads, is k plus 1, divided by n plus 2.

So now, we can compare the two estimates that we have, the MAP estimate and the conditional expectation estimate. They are fairly similar, but not exactly the same. This means that the mean of a Beta distribution is not the same as the point at which the distribution is highest. On the other hand, if n is a very large number, this expression is going to be approximately equal to k over n. And so in the limit of large n, the two estimators will not be very different from each other.
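The comparison in this last step is easy to see concretely. A quick sketch (the chosen values of n and k are arbitrary) evaluates both estimates and shows how the gap shrinks as n grows:

```python
def map_estimate(k: int, n: int) -> float:
    """MAP estimate under a uniform prior: k / n."""
    return k / n

def lms_estimate(k: int, n: int) -> float:
    """LMS (conditional expectation) estimate: (k + 1) / (n + 2)."""
    return (k + 1) / (n + 2)

for n, k in [(10, 7), (1000, 700)]:
    print(n, map_estimate(k, n), lms_estimate(k, n))
# For n = 10:   0.7 vs 0.666...
# For n = 1000: 0.7 vs 0.6996...
```

With 10 tosses the two estimates differ in the second decimal place; with 1000 tosses they agree to about three decimal places, illustrating that the mean and the mode of the Beta posterior merge in the large-n limit.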