We will finish our discussion of classical statistical methods by discussing a general method for estimation, the so-called maximum likelihood method.

If an unknown parameter can be expressed as an expectation, we have seen that there's a natural way of estimating it. But what if this is not the case? Suppose there's no apparent way of interpreting theta as an expectation. Then we need to do something else.

So rather than using this approach, we will use a different approach, which is the following. We will find a value of theta that makes the data that we have seen most likely. That is, we will find the value of theta under which the probability of obtaining the particular x that we have seen is as large as possible. And that value of theta is going to be our estimate, the maximum likelihood estimate.

Here, I wrote a PMF. That's what you would do if X were a discrete random variable. But the same procedure, of course, applies when X is a continuous random variable. And more generally, this procedure also applies when X is a vector of observations and when theta is a vector of parameters.

But what does this method really do? It is instructive to compare maximum likelihood estimation to a Bayesian approach. In a Bayesian setting, what we do is find the posterior distribution of the unknown parameter, which is now treated as a random variable. And then we look for the most likely value of theta: we look at this distribution and try to find its peak. So we want to maximize this quantity over theta. The denominator does not involve any thetas, so we ignore it.

And suppose now that we use a prior for theta which is flat. Suppose that this prior is constant over the range of possible values of theta. In that case, what we need to do is just take this expression and maximize it over all thetas. And this looks very similar to what is happening here, where we take this expression and maximize it over all thetas.
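In symbols, the two maximizations being compared can be written as follows. This is a sketch in generic notation, since the slide's exact formulas are not reproduced in the transcript:

```latex
% Maximum likelihood: choose the theta that makes the observed x most likely
\hat{\theta}_{\mathrm{ML}} = \arg\max_{\theta} \; p_X(x; \theta)

% Bayesian posterior peak (MAP): the denominator p_X(x) does not involve theta,
% so it can be dropped from the maximization
\hat{\theta}_{\mathrm{MAP}}
  = \arg\max_{\theta} \; f_{\Theta \mid X}(\theta \mid x)
  = \arg\max_{\theta} \; f_{\Theta}(\theta) \, f_{X \mid \Theta}(x \mid \theta)
```

When the prior f_Theta is constant, the second maximization reduces to maximizing f_{X|Theta}(x | theta) alone, which is exactly the maximum likelihood criterion.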
So operationally, maximum likelihood estimation is the same as Bayesian estimation in which we find the peak of the posterior, for the special case where we are using a constant, or flat, prior. But despite this similarity, the two methods are philosophically very different.

In the Bayesian setting, you're asking the question: what is the most likely value of theta? Whereas in the maximum likelihood setting, you're asking: what is the value of theta that makes my data most likely? Or: what is the value of theta under which my data are the least surprising? So the interpretation of the two methods is quite different, even though the mechanics can be fairly similar.

The maximum likelihood method has some remarkable properties that we would like now to discuss. But first, one comment. We need to take the probability of the observed data given theta, view it as a function of theta, and maximize it over theta. In some problems, we can find closed-form solutions for the optimal value of theta, which is going to be our estimate. But more often, and especially for large problems, one has to do this maximization numerically. This is possible these days, and people routinely solve very high-dimensional problems, with lots of data and lots of parameters, using the maximum likelihood methodology.
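As an illustration of the numerical route, here is a minimal sketch that fits a one-parameter exponential model by minimizing the negative log-likelihood with SciPy. The model, the simulated data, and all names here are illustrative assumptions, not the lecture's example:

```python
# A minimal sketch of numerical maximum likelihood, assuming NumPy and
# SciPy are available. The exponential model and simulated data are
# illustrative only.
import numpy as np
from scipy.optimize import minimize_scalar

rng = np.random.default_rng(0)
data = rng.exponential(scale=2.0, size=500)  # true parameter (mean) is 2.0

def negative_log_likelihood(theta):
    # Exponential density with mean theta: f(x; theta) = (1/theta) * exp(-x/theta),
    # so the negative log-likelihood is n*log(theta) + sum(x)/theta.
    return len(data) * np.log(theta) + data.sum() / theta

# Maximizing the likelihood is the same as minimizing the negative log-likelihood.
result = minimize_scalar(negative_log_likelihood, bounds=(1e-6, 100.0), method="bounded")
theta_hat = result.x
print(theta_hat)
```

In this particular model the maximization also has a closed-form answer, the sample mean, which makes it easy to check that the numerical optimizer agrees; in large problems no such formula exists and the numerical route is the only one.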
The maximum likelihood methodology is very popular because it has a very sound theoretical basis. I will list a few facts, which we will not attempt to prove or even justify, but they're useful to know as general background.

Suppose that we have n pieces of data that are drawn from a model with a certain structure. Then, under mild assumptions, the maximum likelihood estimator has the property that it is consistent. That is, as we draw more and more data, our estimate is going to converge to the true value of the parameter.

In addition, we know quite a bit more. Asymptotically, the maximum likelihood estimator behaves like a normal random variable. That is, after we normalize, subtracting the target and dividing by its standard deviation, it approaches a standard normal distribution. So in this sense, it behaves the same way that the sample mean behaves.

Notice that this expression here involves the standard error of the maximum likelihood estimator. This is an important quantity, and for this reason, people have developed either analytical or simulation-based methods for calculating or approximating this standard error. Once you have an estimate or an approximation of the standard error in your hands, you can further use it to construct confidence intervals. Using the asymptotic normality, we can construct a confidence interval in exactly the same way as we did for the case of the sample mean estimator. And this, for example, would be a 95% confidence interval.

Finally, one last important property is that the maximum likelihood estimator is what is called an asymptotically efficient estimator. That is, it is asymptotically the best possible estimator, in the sense that it achieves the smallest possible variance. So all of these are very strong properties. And this is the reason why maximum likelihood estimation is the most common approach for problems that do not have any particular special structure that you can exploit otherwise.
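To make the confidence interval construction concrete, here is a sketch continuing the illustrative exponential example above. For that model the standard error happens to have an analytical approximation, theta_hat / sqrt(n), coming from its Fisher information; this formula is my assumption for the example, not something from the lecture, and in general the standard error would be approximated analytically or by simulation. Since the normalized quantity (theta_hat - theta) / SE is approximately standard normal, the usual 95% interval uses 1.96 standard errors on each side:

```python
# A sketch of a 95% confidence interval from asymptotic normality,
# reusing data, theta_hat, and np from the previous snippet. The
# standard error formula theta/sqrt(n) is specific to the exponential
# model used there.
n = len(data)
standard_error = theta_hat / np.sqrt(n)
ci = (theta_hat - 1.96 * standard_error, theta_hat + 1.96 * standard_error)
print(ci)  # an interval that should cover the true parameter 2.0 about 95% of the time
```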