We will finish our discussion of classical statistical methods by discussing a general method for estimation, the so-called maximum likelihood method.

If an unknown parameter can be expressed as an expectation, we have seen that there's a natural way of estimating it. But what if this is not the case? Suppose there's no apparent way of interpreting theta as an expectation. Then we need to do something else.

So rather than using this approach, we will use a different approach, which is the following. We will find a value of theta that makes the data that we have seen most likely. That is, we will find the value of theta under which the probability of obtaining the particular x that we have seen is as large as possible. And that value of theta is going to be our estimate, the maximum likelihood estimate.

Here, I wrote a PMF. That's what you would do if X were a discrete random variable. But the same procedure, of course, applies when X is a continuous random variable. And more generally, this procedure also applies when X is a vector of observations and when theta is a vector of parameters.

But what does this method really do? It is instructive to compare maximum likelihood estimation to a Bayesian approach. In a Bayesian setting, what we do is find the posterior distribution of the unknown parameter, which is now treated as a random variable. And then we look for the most likely value of theta: we look at this distribution and try to find its peak. So we want to maximize this quantity over theta. The denominator does not involve any thetas, so we ignore it.

And suppose now that we use a prior for theta which is flat. Suppose that this prior is constant over the range of possible values of theta. In that case, what we need to do is just take this expression and maximize it over all thetas. And this looks very similar to what is happening here, where we take this expression and maximize it over all thetas.
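In symbols, the two maximizations being compared can be written as follows. This is a sketch in generic notation, since the slide's exact formulas are not reproduced in the transcript:

```latex
% Maximum likelihood: choose the theta that makes the observed x most likely
\hat{\theta}_{\mathrm{ML}} = \arg\max_{\theta} \; p_X(x; \theta)

% Bayesian posterior peak (MAP): the denominator p_X(x) does not involve theta,
% so it can be dropped from the maximization
\hat{\theta}_{\mathrm{MAP}}
  = \arg\max_{\theta} \; f_{\Theta \mid X}(\theta \mid x)
  = \arg\max_{\theta} \; f_{\Theta}(\theta) \, f_{X \mid \Theta}(x \mid \theta)
```

When the prior f_Theta is constant, the second maximization reduces to maximizing f_{X|Theta}(x | theta) alone, which is exactly the maximum likelihood criterion.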
So operationally, maximum likelihood estimation is the same as Bayesian estimation in which we find the peak of the posterior, for the special case where we are using a constant, or flat, prior. But despite this similarity, the two methods are philosophically very different.

In the Bayesian setting, you're asking the question: what is the most likely value of theta? Whereas in the maximum likelihood setting, you're asking: what is the value of theta that makes my data most likely? Or: what is the value of theta under which my data are the least surprising? So the interpretation of the two methods is quite different, even though the mechanics can be fairly similar.

The maximum likelihood method has some remarkable properties that we would like now to discuss. But first, one comment. We need to take the probability of the observed data given theta, view it as a function of theta, and maximize it over theta. In some problems, we can find closed-form solutions for the optimal value of theta, which is going to be our estimate. But more often, and especially for large problems, one has to do this maximization numerically. This is possible these days, and people routinely solve very high-dimensional problems, with lots of data and lots of parameters, using the maximum likelihood methodology.
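As an illustration of the numerical route, here is a minimal sketch that fits a one-parameter exponential model by minimizing the negative log-likelihood with SciPy. The model, the simulated data, and all names here are illustrative assumptions, not the lecture's example:

```python
# A minimal sketch of numerical maximum likelihood, assuming NumPy and
# SciPy are available. The exponential model and simulated data are
# illustrative only.
import numpy as np
from scipy.optimize import minimize_scalar

rng = np.random.default_rng(0)
data = rng.exponential(scale=2.0, size=500)  # true parameter (mean) is 2.0

def negative_log_likelihood(theta):
    # Exponential density with mean theta: f(x; theta) = (1/theta) * exp(-x/theta),
    # so the negative log-likelihood is n*log(theta) + sum(x)/theta.
    return len(data) * np.log(theta) + data.sum() / theta

# Maximizing the likelihood is the same as minimizing the negative log-likelihood.
result = minimize_scalar(negative_log_likelihood, bounds=(1e-6, 100.0), method="bounded")
theta_hat = result.x
print(theta_hat)
```

In this particular model the maximization also has a closed-form answer, the sample mean, which makes it easy to check that the numerical optimizer agrees; in large problems no such formula exists and the numerical route is the only one.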
The maximum likelihood methodology is very popular because it has a very sound theoretical basis. I will list a few facts, which we will not attempt to prove or even justify, but they're useful to know as general background.

Suppose that we have n pieces of data that are drawn from a model with a certain structure. Then, under mild assumptions, the maximum likelihood estimator has the property that it is consistent. That is, as we draw more and more data, our estimate is going to converge to the true value of the parameter.

In addition, we know quite a bit more. Asymptotically, the maximum likelihood estimator behaves like a normal random variable. That is, after we normalize, subtracting the target and dividing by its standard deviation, it approaches a standard normal distribution. So in this sense, it behaves the same way that the sample mean behaves.

Notice that this expression here involves the standard error of the maximum likelihood estimator. This is an important quantity, and for this reason, people have developed either analytical or simulation-based methods for calculating or approximating this standard error. Once you have an estimate or an approximation of the standard error in your hands, you can further use it to construct confidence intervals. Using the asymptotic normality, we can construct a confidence interval in exactly the same way as we did for the case of the sample mean estimator. And this, for example, would be a 95% confidence interval.

Finally, one last important property is that the maximum likelihood estimator is what is called an asymptotically efficient estimator. That is, it is asymptotically the best possible estimator, in the sense that it achieves the smallest possible variance. So all of these are very strong properties. And this is the reason why maximum likelihood estimation is the most common approach for problems that do not have any particular special structure that you can exploit otherwise.
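To make the confidence interval construction concrete, here is a sketch continuing the illustrative exponential example above. For that model the standard error happens to have an analytical approximation, theta_hat / sqrt(n), coming from its Fisher information; this formula is my assumption for the example, not something from the lecture, and in general the standard error would be approximated analytically or by simulation. Since the normalized quantity (theta_hat - theta) / SE is approximately standard normal, the usual 95% interval uses 1.96 standard errors on each side:

```python
# A sketch of a 95% confidence interval from asymptotic normality,
# reusing data, theta_hat, and np from the previous snippet. The
# standard error formula theta/sqrt(n) is specific to the exponential
# model used there.
n = len(data)
standard_error = theta_hat / np.sqrt(n)
ci = (theta_hat - 1.96 * standard_error, theta_hat + 1.96 * standard_error)
print(ci)  # an interval that should cover the true parameter 2.0 about 95% of the time
```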