1
00:00:00,500 --> 00:00:02,940
In this segment, we introduce
the subject of least

2
00:00:02,940 --> 00:00:04,690
mean squares estimation.

3
00:00:04,690 --> 00:00:07,370
But as a warm-up, we will
start with a very simple

4
00:00:07,370 --> 00:00:08,860
special case.

5
00:00:08,860 --> 00:00:11,330
Suppose that we have
some random variable

6
00:00:11,330 --> 00:00:14,800
that we wish to estimate and
that we have the probability

7
00:00:14,800 --> 00:00:17,500
distribution of this random
variable-- a probability

8
00:00:17,500 --> 00:00:20,440
mass function if it's discrete
or a probability density

9
00:00:20,440 --> 00:00:22,310
function if it's continuous.

10
00:00:22,310 --> 00:00:25,060
As a concrete instance, suppose
that our random variable

11
00:00:25,060 --> 00:00:29,050
is uniformly distributed
over a certain interval.

12
00:00:29,050 --> 00:00:32,320
We would like to estimate
this random variable.

13
00:00:32,320 --> 00:00:34,660
We're interested in
a point estimate.

14
00:00:34,660 --> 00:00:37,780
However, we look at
the very special case

15
00:00:37,780 --> 00:00:40,450
where there are no
observations available.

16
00:00:40,450 --> 00:00:43,580
All that we have is this
probability distribution.

17
00:00:43,580 --> 00:00:46,150
How do we estimate
this random variable?

18
00:00:46,150 --> 00:00:48,340
Well, we can use the
rules that we have already

19
00:00:48,340 --> 00:00:50,530
developed, for
example, the maximum

20
00:00:50,530 --> 00:00:54,079
a posteriori probability
rule, what would it do?

21
00:00:54,079 --> 00:00:55,995
In this case, since there
are no observations,

22
00:00:55,995 --> 00:01:00,530
the posterior distribution of
Theta is the same as the prior.

23
00:01:00,530 --> 00:01:03,280
There are no observations
that could change the prior.

24
00:01:03,280 --> 00:01:06,540
So we need to find a point
at which this distribution is

25
00:01:06,540 --> 00:01:07,710
highest.

26
00:01:07,710 --> 00:01:10,170
Well, because this
distribution is flat,

27
00:01:10,170 --> 00:01:13,920
the MAP rule does not
give us a unique answer.

28
00:01:13,920 --> 00:01:18,840
Any value of theta inside
the interval from 4 to 10

29
00:01:18,840 --> 00:01:21,880
would be an acceptable answer.

30
00:01:21,880 --> 00:01:29,100
So any particular estimate in
this interval would be fine.

31
00:01:29,100 --> 00:01:31,900
So this rule is not
particularly helpful.

32
00:01:31,900 --> 00:01:34,450
How about a different estimator?

33
00:01:34,450 --> 00:01:37,180
We have seen the conditional
expectation estimator.

34
00:01:37,180 --> 00:01:40,840
Once more, in our case, since
we do not have any observations,

35
00:01:40,840 --> 00:01:43,920
the conditional expectation is
the same as the expectation.

36
00:01:43,920 --> 00:01:46,000
And for this
particular example, it

37
00:01:46,000 --> 00:01:48,780
would give us the midpoint
of the distribution, namely

38
00:01:48,780 --> 00:01:52,270
an estimate equal to 7.

39
00:01:52,270 --> 00:01:55,750
Now, this rule was inconclusive.

40
00:01:55,750 --> 00:01:58,000
This rule gave us a number.

41
00:01:58,000 --> 00:02:02,510
How can we choose and
decide that one of these

42
00:02:02,510 --> 00:02:04,560
is the right estimate?

43
00:02:04,560 --> 00:02:08,650
We can do that if we introduce a
specific performance criterion.

44
00:02:08,650 --> 00:02:12,100
What is it that we wish
from our estimators?

45
00:02:12,100 --> 00:02:14,110
And the particular
criterion, the one

46
00:02:14,110 --> 00:02:18,460
that will we be focusing on,
is the mean squared error.

47
00:02:18,460 --> 00:02:20,840
If you come up with
a certain estimate,

48
00:02:20,840 --> 00:02:24,170
you look at how far is your
estimate from the true value

49
00:02:24,170 --> 00:02:27,040
that you're trying to estimate,
take the square of that,

50
00:02:27,040 --> 00:02:28,360
and average it.

51
00:02:28,360 --> 00:02:30,850
And this leads us
to a formulation

52
00:02:30,850 --> 00:02:36,010
whereby we will try to find
an estimate theta hat that

53
00:02:36,010 --> 00:02:38,850
minimizes this
mean squared error

54
00:02:38,850 --> 00:02:42,230
over all possible estimates.

55
00:02:42,230 --> 00:02:44,400
Let us now look at
this formulation

56
00:02:44,400 --> 00:02:47,300
and see how we can solve it.

57
00:02:47,300 --> 00:02:50,650
This is a function of a
single variable theta hat.

58
00:02:50,650 --> 00:02:54,630
And we can try to minimize it
using conventional methods.

59
00:02:54,630 --> 00:02:56,820
To carry out this
minimization, let

60
00:02:56,820 --> 00:03:01,100
us first expand this
expectation into a sum of terms.

61
00:03:01,100 --> 00:03:03,220
We have the expected
value of the square

62
00:03:03,220 --> 00:03:06,940
of the random variable,
then a cross term minus 2

63
00:03:06,940 --> 00:03:12,360
the expected value of
Theta times theta hat.

64
00:03:12,360 --> 00:03:16,010
However, theta hat is a number
that we're trying to choose.

65
00:03:16,010 --> 00:03:17,250
It's not random.

66
00:03:17,250 --> 00:03:20,780
Therefore, we can pull it
outside the expectation.

67
00:03:20,780 --> 00:03:24,079
And similarly, the last term,
the expected value of theta hat

68
00:03:24,079 --> 00:03:28,970
squared is just theta
hat squared itself.

69
00:03:28,970 --> 00:03:30,610
This is what we
want to minimize.

70
00:03:30,610 --> 00:03:32,280
How do we minimize it?

71
00:03:32,280 --> 00:03:36,340
We take the derivative
with respect to theta hat

72
00:03:36,340 --> 00:03:38,710
and set it to 0.

73
00:03:38,710 --> 00:03:47,480
And this gives us minus 2
the expected value of Theta

74
00:03:47,480 --> 00:03:52,450
plus twice theta hat equal to 0.

75
00:03:52,450 --> 00:03:54,770
And when we solve
this, we find is

76
00:03:54,770 --> 00:03:58,180
that theta hat, the
optimal estimate

77
00:03:58,180 --> 00:04:01,250
is equal to the
expected value of Theta.

78
00:04:04,040 --> 00:04:08,250
So this is the answer to
our optimization problem.

79
00:04:08,250 --> 00:04:13,460
The optimal estimate, according
to the least squares criterion,

80
00:04:13,460 --> 00:04:17,709
is the expected value
of the random variable.

81
00:04:17,709 --> 00:04:20,690
Now, it's interesting to
derive this result, also,

82
00:04:20,690 --> 00:04:22,390
in a second way.

83
00:04:22,390 --> 00:04:25,530
Since it is a quite important
and fundamental result,

84
00:04:25,530 --> 00:04:29,465
let us see whether there is a
different way of establishing

85
00:04:29,465 --> 00:04:33,370
it that stays closer to the
probabilistic world rather

86
00:04:33,370 --> 00:04:35,700
than the calculus world.

87
00:04:35,700 --> 00:04:39,050
So let us look at this
criterion that we have here.

88
00:04:39,050 --> 00:04:43,180
The expected value of the
square of a random variable

89
00:04:43,180 --> 00:04:52,540
is always equal to the variance
of that random variable

90
00:04:52,540 --> 00:05:00,115
plus the expected value of
that random variable squared.

91
00:05:03,480 --> 00:05:06,600
This is what we're
trying to minimize.

92
00:05:06,600 --> 00:05:10,190
Now, theta hat is a constant.

93
00:05:10,190 --> 00:05:12,960
The variance of a random
variable plus or minus

94
00:05:12,960 --> 00:05:17,380
a constant is the
same as the variance

95
00:05:17,380 --> 00:05:20,020
of that random variable.

96
00:05:20,020 --> 00:05:23,160
And there is nothing that
we can do about this term.

97
00:05:23,160 --> 00:05:25,680
So what we're trying
to do is, essentially,

98
00:05:25,680 --> 00:05:29,900
just try to minimize this term
with respect to theta hat.

99
00:05:29,900 --> 00:05:32,980
Now, here we have the
square of something.

100
00:05:32,980 --> 00:05:37,810
The way to minimize this is to
try to make this quantity as

101
00:05:37,810 --> 00:05:41,490
small as possible,
make it 0 if we can.

102
00:05:41,490 --> 00:05:46,430
Well, we can make it 0 if we set
theta hat equal to the expected

103
00:05:46,430 --> 00:05:48,400
value of Theta.

104
00:05:48,400 --> 00:05:51,640
So this term here is minimized.

105
00:05:54,700 --> 00:06:00,150
When theta hat is equal to
the expected value of Theta,

106
00:06:00,150 --> 00:06:03,480
it is minimized because
this is a choice that

107
00:06:03,480 --> 00:06:06,620
will make this term equal to 0.

108
00:06:06,620 --> 00:06:09,080
So this was a second
derivation of why

109
00:06:09,080 --> 00:06:13,440
this is the optimal
estimate of Theta.

110
00:06:13,440 --> 00:06:16,670
Once we adopt this
particular estimate,

111
00:06:16,670 --> 00:06:19,650
the mean squared
error is going to be

112
00:06:19,650 --> 00:06:22,680
equal to this, because
this is our theta hat.

113
00:06:22,680 --> 00:06:25,980
And, of course, we recognize
that this is the variance.

114
00:06:25,980 --> 00:06:29,790
So the variance of Theta
is the least possible value

115
00:06:29,790 --> 00:06:32,120
of the mean squared
error that we

116
00:06:32,120 --> 00:06:35,500
can obtain using any
particular estimate.

117
00:06:35,500 --> 00:06:38,250
And this is our
final conclusion.

118
00:06:38,250 --> 00:06:42,230
And in the next segment, we
will exploit the conclusions

119
00:06:42,230 --> 00:06:44,720
that we found here
and apply them

120
00:06:44,720 --> 00:06:47,150
to a more general situation.