In this segment, we will go through two examples of maximum likelihood estimation, just in order to get a feel for the procedure involved and the calculations that one has to go through.

Our first example will be very simple. We have a binomial random variable with parameters n and theta. So think of having a coin that you flip n times, and theta is the probability of heads at each one of the tosses. So we flip it n times and we observe a certain numerical value, little k, for the random variable K. And on the basis of that numerical value, we would like to estimate theta.

According to the maximum likelihood methodology, the first step is to write down the likelihood function. This is the probability of obtaining this particular piece of data if the true parameter is theta. Now, since K is a binomial random variable, the probability of obtaining k heads in n tosses is given by this expression here. So what we need to do is to take the data that we have observed, plug it into this formula, leave theta free-- we have here a function of theta-- and then maximize this function of theta over all theta. Let us now do this calculation.
Actually, instead of maximizing this expression, it's a little easier to maximize the logarithm of this expression. And the logarithm of this expression is as follows. There's a first term, which is the logarithm of the n choose k term. Then, the logarithm of theta to the k is k times log theta. And finally, the logarithm of the last term is n minus k times log of 1 minus theta.

So we need to maximize this expression with respect to theta. In order to do that, we take the derivative with respect to theta. Here, there is no theta involved, so we get a contribution of 0. This term has a derivative of k divided by theta. And this term here has a derivative, which is n minus k times the derivative of this logarithmic term, which is 1 over what is inside the logarithm. But by the chain rule, because of this minus sign here, we also get a minus sign, and we obtain this expression. Now, at the maximum, the derivative has to be equal to 0. And this gives us an equation for theta that we can solve.
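The derivative condition here, k over theta minus (n minus k) over (1 minus theta) equals 0, is easy to check numerically. Below is a minimal sketch, using made-up values of n and k (they are not from the lecture), that finds the root of this derivative by bisection; the derivative is strictly decreasing on (0, 1), so bisection applies.

```python
# Sketch with hypothetical data: the derivative of the binomial
# log-likelihood is k/theta - (n - k)/(1 - theta), and the maximizer
# is where it crosses zero.
def score(theta, n, k):
    return k / theta - (n - k) / (1 - theta)

def bisect_root(f, lo, hi, tol=1e-12):
    # f is positive at lo and negative at hi (it is strictly decreasing),
    # so we shrink the bracket until it is tiny.
    while hi - lo > tol:
        mid = (lo + hi) / 2
        if f(mid) > 0:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2

n, k = 20, 7  # hypothetical: 7 heads in 20 tosses
root = bisect_root(lambda t: score(t, n, k), 1e-9, 1 - 1e-9)
print(root)  # very close to k/n = 0.35
```

The root agrees with the closed-form answer k over n that the algebra below produces.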
Let us take this term, move it to the right-hand side, and then cross-multiply with the denominators to obtain the relation that k minus k theta-- this is obtained by multiplying this k with this 1 minus theta factor-- has to be equal to this term times theta, which is n theta minus k theta. The k theta terms cancel, and we're left with this expression, which tells us that theta should be equal to k over n. So this is the maximum likelihood estimate for this particular problem, which is a pretty reasonable answer.

If you would like to rephrase what we just found in terms of estimators and random variables, the maximum likelihood estimator is as follows. We take the random variable that we observe, our observation, and divide it by n. And this is now a random variable, which will be our estimator.

Now, notice that in this particular example, the answer that we got is exactly the same as the answer that we got in the context of Bayesian inference when we were finding the maximum a posteriori probability estimator, but for the special case where the prior was a uniform distribution.
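We can also verify the estimate directly, without the derivative, by evaluating the log-likelihood log C(n, k) + k log theta + (n - k) log(1 - theta) on a fine grid. This is a sketch only, again with hypothetical n and k:

```python
import math

# Evaluate the binomial log-likelihood on a grid over (0, 1) and check
# that it peaks at theta = k/n.
def binomial_log_likelihood(theta, n, k):
    return (math.log(math.comb(n, k))
            + k * math.log(theta)
            + (n - k) * math.log(1 - theta))

n, k = 20, 7  # hypothetical: 7 heads in 20 tosses
grid = [i / 1000 for i in range(1, 1000)]  # theta values in (0, 1)
best = max(grid, key=lambda t: binomial_log_likelihood(t, n, k))
print(best)  # 0.35, which is exactly k/n
```

Since the log-likelihood is concave in theta, the grid maximizer sits at the true maximizer k over n whenever k over n lies on the grid.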
So if we assume that theta is actually a random variable with a uniform distribution, so that we have a flat prior, and we carry out maximum a posteriori probability estimation, we obtain exactly the same estimate. And this is consistent with the comments that we made earlier, that maximum likelihood estimation can also be interpreted as MAP estimation with a flat prior.

Let us now move to our second example, which will be a little more complicated. Here, we have n random variables that are independent, identically distributed. They all have a normal distribution with a certain mean and variance. But both the mean and the variance are unknown, and we want to estimate them on the basis of these observations.

The first step is to write down the likelihood function. That is the probability density function for the vector of observations given some set of parameters. Because of independence, the joint density of the vector of X's that we have obtained is the product of the PDFs of the individual X's, of the Xi's. So the PDF of the typical Xi, which has variance v and mean mu, is of this form.
So this is the likelihood function in this case. This is the probability density of obtaining a particular vector X of observations when we have these particular parameters. We would like to maximize this function. As in our previous example, it is actually a little easier to maximize the logarithm of this expression. And this is the same as minimizing the negative of the logarithm of this expression.

Now, when we take the logarithm of this expression, we have a product, so we're going to get a sum of logarithms. And I leave it to you to verify that the negative logarithm of this expression is of this form, plus some other constant that does not involve the parameters, and which comes from this factor of 1 over the square root of 2 pi. In particular, this term here appears when we take the logarithm of this. And this happens n times because we have a product of n terms. And this term here appears when we take the logarithm of this expression, and after we put in the minus sign, because we're actually considering the negative of the logarithm.
Now, to carry out the minimization, what we need to do is to take the derivative of this expression with respect to mu and set it to zero, take the derivative with respect to v and set it to zero as well, and then solve those equations to find the optimal mu and v.

So let's start by optimizing with respect to mu. We're going to take the derivative of this expression with respect to mu and set it to zero. This term does not involve mu, so we only need to take the derivative of this. And the derivative of this is going to be-- there's a term 1 over v, and then the derivative of a quadratic divided by 2 is just xi minus mu. And we have one term for each possible i. We get this equation. Now we can cancel out v, and we're left with the equation that the sum of the xi's is equal to the sum of the mus, which is n times mu. And now we can send n to the denominator to obtain that the estimate of mu is going to be the sum of the xi's divided by n. So the maximum likelihood estimate of the mean takes a very simple and very natural form. It is just the sample mean.
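This can be seen numerically as well. The sketch below uses a handful of made-up observations (not from the lecture) and checks that the negative log-likelihood, (n/2) log v plus the sum of (xi minus mu) squared over 2v, is smallest at the sample mean, for any fixed v:

```python
import math

# Negative log-likelihood of i.i.d. normal samples, up to the constant
# term that does not involve mu or v.
def neg_log_likelihood(mu, v, xs):
    n = len(xs)
    return n / 2 * math.log(v) + sum((x - mu) ** 2 for x in xs) / (2 * v)

xs = [2.1, 1.9, 2.5, 1.7, 2.3]  # hypothetical observations
mu_hat = sum(xs) / len(xs)      # the sample mean
v = 1.0                         # any fixed variance works here

# Perturbing mu in either direction only increases the objective.
assert all(neg_log_likelihood(mu_hat, v, xs)
           <= neg_log_likelihood(mu_hat + d, v, xs)
           for d in (-0.5, -0.01, 0.01, 0.5))
print(mu_hat)  # 2.1
```

The comparison holds for any fixed v because v only scales the quadratic term, which the sample mean minimizes on its own.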
Now, let us continue with the minimization with respect to v. In order to carry out that minimization, we need to take the derivative of this expression with respect to v and set it to zero. The derivative of the first term is equal to n over 2 times 1 over v. And then from here, when we take the derivative, we obtain the sum of all these terms divided by 2 v squared. But actually, when we take the derivative of 1 over v, the derivative is minus 1 over v squared. And for this reason, here we will have a minus sign.

So this is the derivative with respect to v. We set it equal to zero and carry out some algebra. What is the algebra involved here? We can delete this factor of 2 that appears here and there. This term v cancels out this exponent here. Then we take this v, move it to the other side, and then take this n and move it to this side, underneath this term. And finally, what we obtain after carrying out this algebra is this expression: the estimate of the variance is some form of the sample variance, where we use the optimal value of mu. And the optimal value of mu we have already found. It's given by this expression here.
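Spelled out in code, with the same made-up observations as before, the variance estimate is the average squared deviation from the sample mean-- the population variance with an n (not n minus 1) denominator:

```python
import math
import statistics

xs = [2.1, 1.9, 2.5, 1.7, 2.3]  # hypothetical observations
n = len(xs)
mu_hat = sum(xs) / n                              # sample mean
v_hat = sum((x - mu_hat) ** 2 for x in xs) / n    # MLE of the variance

# Sanity check against the standard library's population variance,
# which also divides by n.
assert math.isclose(v_hat, statistics.pvariance(xs))
print(v_hat)
```

Note that dividing by n rather than n minus 1 makes this estimator slightly biased, which is a known property of the maximum likelihood variance estimate.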
So we obtain a pretty natural estimate for the variance as well by using this maximum likelihood methodology.

Now, these two examples were particularly nice because the algebra was not too complicated, and the answers turned out to be what you might have guessed without using any fancy methods. But in other problems, the calculations may be more complicated and the answers may not be so obvious.