1
00:00:00,700 --> 00:00:03,880
We are now ready to
move on to a model which

2
00:00:03,880 --> 00:00:07,240
is quite interesting
and quite realistic.

3
00:00:07,240 --> 00:00:11,900
This is a model in which we have
an unknown parameter modeled

4
00:00:11,900 --> 00:00:14,970
as a random variable
that we try to estimate.

5
00:00:14,970 --> 00:00:17,080
This is the random
variable, Theta.

6
00:00:17,080 --> 00:00:18,880
And we have multiple
observations

7
00:00:18,880 --> 00:00:20,540
of that random variable.

8
00:00:20,540 --> 00:00:22,380
Each one of those
observations is

9
00:00:22,380 --> 00:00:24,720
equal to the unknown
random variable,

10
00:00:24,720 --> 00:00:27,490
plus some additive noise.

11
00:00:27,490 --> 00:00:30,900
This is a model that appears
quite often in practice.

12
00:00:30,900 --> 00:00:32,509
It is often the case
that we're trying

13
00:00:32,509 --> 00:00:35,730
to estimate a certain
quantity, but we can only

14
00:00:35,730 --> 00:00:39,300
observe values of that quantity
in the presence of noise.

15
00:00:39,300 --> 00:00:42,130
And because of the
noise, what we want to do

16
00:00:42,130 --> 00:00:45,280
is to try to measure
it to multiple times.

17
00:00:45,280 --> 00:00:48,470
And so we have multiple
such measurement equations.

18
00:00:48,470 --> 00:00:52,020
And then we want to combine all
of the observations together

19
00:00:52,020 --> 00:00:56,340
to come up with a good
estimate of that parameter.

20
00:00:56,340 --> 00:00:58,440
The assumption that
we will be making

21
00:00:58,440 --> 00:01:01,950
are that Theta is a
normal random variable.

22
00:01:01,950 --> 00:01:05,019
It has a certain mean
that we denote by x0.

23
00:01:05,019 --> 00:01:08,030
The reason for this strange
notation will be seen later.

24
00:01:08,030 --> 00:01:10,800
And it also has a
certain variance.

25
00:01:10,800 --> 00:01:14,130
The noise terms are also
normal random variables

26
00:01:14,130 --> 00:01:17,360
with 0 mean and a
certain variance.

27
00:01:17,360 --> 00:01:21,300
And finally, we assume that
these basic random variables

28
00:01:21,300 --> 00:01:26,240
that define our model
are all independent.

29
00:01:26,240 --> 00:01:29,680
Based on these assumptions, now
we would like to estimate Theta

30
00:01:29,680 --> 00:01:32,840
on the basis of the X's.

31
00:01:32,840 --> 00:01:36,140
And as usual, in the
Bayesian setting,

32
00:01:36,140 --> 00:01:38,759
what we want to do is to
calculate the posterior

33
00:01:38,759 --> 00:01:42,360
distribution of
Theta, given the X's.

34
00:01:42,360 --> 00:01:45,090
The Bayes rule
has the usual form

35
00:01:45,090 --> 00:01:47,590
for the case of continuous
random variables.

36
00:01:47,590 --> 00:01:51,740
The only remark that needs to
be made is that in this case,

37
00:01:51,740 --> 00:01:55,729
there are multiple X's,
so X up here stands

38
00:01:55,729 --> 00:01:58,265
for the vector of
the observations

39
00:01:58,265 --> 00:01:59,920
that we have obtained.

40
00:01:59,920 --> 00:02:02,160
And similarly,
little x will stand

41
00:02:02,160 --> 00:02:06,260
for the vector of the values
of these observations.

42
00:02:06,260 --> 00:02:09,070
So we need now to start
making some progress

43
00:02:09,070 --> 00:02:11,810
towards calculating
this term here.

44
00:02:11,810 --> 00:02:15,080
What is the distribution of
the vector of measurements

45
00:02:15,080 --> 00:02:16,600
given theta.

46
00:02:16,600 --> 00:02:19,030
Before we move to
the vector case,

47
00:02:19,030 --> 00:02:23,570
let us look at one of the
measurements in isolation.

48
00:02:23,570 --> 00:02:26,600
This is something that
we have already seen.

49
00:02:26,600 --> 00:02:30,770
If I tell you the value
of the random variable,

50
00:02:30,770 --> 00:02:35,210
Theta, which is what happens
in this conditional universe

51
00:02:35,210 --> 00:02:38,270
when you condition on
the value of Theta,

52
00:02:38,270 --> 00:02:42,300
then in that universe,
the random variable, Xi,

53
00:02:42,300 --> 00:02:45,610
is equal to the numerical
value that you gave me

54
00:02:45,610 --> 00:02:47,460
for Theta, plus Wi.

55
00:02:50,329 --> 00:02:55,360
And because Wi is independent
from the random variable Theta,

56
00:02:55,360 --> 00:02:57,579
knowing the value of
the random variable

57
00:02:57,579 --> 00:03:00,300
Theta does not change
the distribution of Wi.

58
00:03:00,300 --> 00:03:02,990
It will still have this
normal distribution.

59
00:03:02,990 --> 00:03:08,580
So Xi is a normal of this
kind plus a constant.

60
00:03:08,580 --> 00:03:12,780
And so Xi is a normal
random variable

61
00:03:12,780 --> 00:03:16,470
with mean equal to the
constant that we added,

62
00:03:16,470 --> 00:03:19,020
and variance equal to
the original variance

63
00:03:19,020 --> 00:03:21,420
of the random variable, Wi.

64
00:03:21,420 --> 00:03:25,750
And so we can now write down,
the PDF, the conditional PDF,

65
00:03:25,750 --> 00:03:27,325
of Xi.

66
00:03:27,325 --> 00:03:30,140
There's going to be a
normalizing constant.

67
00:03:30,140 --> 00:03:32,700
And then the usual
exponential term,

68
00:03:32,700 --> 00:03:39,730
which is going to be xi minus
the mean of the distribution,

69
00:03:39,730 --> 00:03:42,704
which is theta.

70
00:03:42,704 --> 00:03:45,620
And then we divide by
the usual variance term.

71
00:03:48,280 --> 00:03:53,730
Let us move next to
this distribution here.

72
00:03:53,730 --> 00:03:58,940
This is a shorthand
notation for the joint PDF

73
00:03:58,940 --> 00:04:03,480
of the random
variables X1 up to Xn,

74
00:04:03,480 --> 00:04:07,610
conditional on the
random variable Theta.

75
00:04:07,610 --> 00:04:11,446
So it's really a function
of multiple variables.

76
00:04:14,990 --> 00:04:17,820
And how do we proceed now?

77
00:04:17,820 --> 00:04:20,899
Here is the crucial observation.

78
00:04:20,899 --> 00:04:26,090
If I tell you the value of
the random variable capital

79
00:04:26,090 --> 00:04:32,400
Theta as before, then
you argue as follows.

80
00:04:32,400 --> 00:04:35,460
All of these random
variables are independent.

81
00:04:35,460 --> 00:04:39,180
So if I tell you the value
of the random variable Theta,

82
00:04:39,180 --> 00:04:43,490
this does not change the joint
distribution of the Wi's.

83
00:04:43,490 --> 00:04:46,680
The Wi's were independent
when we started,

84
00:04:46,680 --> 00:04:50,000
so they remain independent
in the conditional universe.

85
00:04:58,630 --> 00:05:01,110
And since the Wi's
are independent

86
00:05:01,110 --> 00:05:05,560
and Xi's are obtained from the
Wi's by just adding a constant,

87
00:05:05,560 --> 00:05:09,670
this means that
the Xi's are also

88
00:05:09,670 --> 00:05:13,420
independent in this
conditional universe.

89
00:05:13,420 --> 00:05:16,080
Once I tell you
the value of Theta,

90
00:05:16,080 --> 00:05:18,490
then because the
noises are independent,

91
00:05:18,490 --> 00:05:21,460
the observations are
also independent.

92
00:05:21,460 --> 00:05:27,810
But this means that the joint
PDF factors as a product

93
00:05:27,810 --> 00:05:32,330
of the individual
marginal PDFs of the Xi's.

94
00:05:37,900 --> 00:05:41,540
And these PDFs, we
have already found.

95
00:05:41,540 --> 00:05:44,409
So now, we can put
everything together

96
00:05:44,409 --> 00:05:48,560
to write down a formula
for the posterior PDF

97
00:05:48,560 --> 00:05:50,400
using the Bayes rule.

98
00:05:50,400 --> 00:05:57,300
We have this denominator term,
which I will write first,

99
00:05:57,300 --> 00:06:00,580
and which term we do
not need to evaluate.

100
00:06:00,580 --> 00:06:04,320
Then we have the
marginal PDF of Theta.

101
00:06:04,320 --> 00:06:08,440
Now since Theta is normal
with these parameters,

102
00:06:08,440 --> 00:06:16,120
this is of the form e to
the minus theta minus x0

103
00:06:16,120 --> 00:06:21,690
squared over 2 sigma 0 squared.

104
00:06:21,690 --> 00:06:27,500
And then we have this joint
density of the X's conditioned

105
00:06:27,500 --> 00:06:32,840
on Theta, which we have already
found, it is this product here.

106
00:06:32,840 --> 00:06:36,710
It's a product of n terms,
one for each observation.

107
00:06:36,710 --> 00:06:40,480
And each one of these terms
is what we have found earlier,

108
00:06:40,480 --> 00:06:43,780
so I'm just substituting
this expression up here.

109
00:06:52,060 --> 00:06:56,390
Now once we have obtained the
observations, so the value

110
00:06:56,390 --> 00:07:00,130
of the random variable capital
X, that is, the value little x,

111
00:07:00,130 --> 00:07:01,710
is fixed.

112
00:07:01,710 --> 00:07:07,430
Once it is fixed, then the x's
that appear here are constant.

113
00:07:07,430 --> 00:07:11,040
So in particular, this
term here is a constant.

114
00:07:11,040 --> 00:07:12,840
We do not bother with it.

115
00:07:12,840 --> 00:07:19,030
And what we have is a constant
times an exponential in terms

116
00:07:19,030 --> 00:07:21,720
that are quadratic in theta.

117
00:07:21,720 --> 00:07:25,020
So we recognize this
kind of expression.

118
00:07:25,020 --> 00:07:29,410
It has to correspond to
a normal distribution.

119
00:07:29,410 --> 00:07:34,120
And this is the first
conclusion of this exercise.

120
00:07:34,120 --> 00:07:38,770
That is, the posterior PDF
of the parameter, Theta,

121
00:07:38,770 --> 00:07:43,560
given our observations, this
posterior PDF is normal.

122
00:07:43,560 --> 00:07:47,890
We have e to a quadratic
function in theta.

123
00:07:47,890 --> 00:07:50,080
And that quadratic
function also involves

124
00:07:50,080 --> 00:07:53,659
the specific values of the
X's that we have obtained.

125
00:07:53,659 --> 00:07:57,710
Let us copy what we have
found and rearrange it.

126
00:07:57,710 --> 00:08:02,430
Once more, we have a
constant, then the exponential

127
00:08:02,430 --> 00:08:06,170
of the negative of some
quadratic function in theta.

128
00:08:06,170 --> 00:08:07,955
And the specific
quadratic function

129
00:08:07,955 --> 00:08:15,370
that we calculated just before
takes this particular form.

130
00:08:15,370 --> 00:08:19,250
What is the mean of this
normal distribution?

131
00:08:19,250 --> 00:08:23,340
The mean is same as the peak.

132
00:08:23,340 --> 00:08:30,180
And to find the peak, the
location at which this PDF is

133
00:08:30,180 --> 00:08:34,940
largest, what we do
is we try to find

134
00:08:34,940 --> 00:08:39,190
the place at which this
quadratic function is smallest.

135
00:08:39,190 --> 00:08:42,890
So what we do is to take
the derivative with respect

136
00:08:42,890 --> 00:08:50,520
to theta of this
quadratic, and set it to 0.

137
00:08:50,520 --> 00:08:54,530
This gives us a sum of terms.

138
00:08:54,530 --> 00:08:59,970
The derivative of
the typical term

139
00:08:59,970 --> 00:09:14,970
is going to be theta minus xi,
divided by sigma i squared.

140
00:09:14,970 --> 00:09:18,340
And this expression
must be equal to 0

141
00:09:18,340 --> 00:09:23,440
if theta is at the peak of
the posterior distribution.

142
00:09:23,440 --> 00:09:26,930
And so we now rearrange
this equation.

143
00:09:26,930 --> 00:09:30,700
We split and take first
the term involving theta,

144
00:09:30,700 --> 00:09:33,465
and gives us a
contribution of this kind.

145
00:09:39,190 --> 00:09:43,160
And we move the terms involving
x's to the other side.

146
00:09:43,160 --> 00:09:45,365
And this gives us
this expression.

147
00:09:50,830 --> 00:09:52,960
And finally, we
take this sum here

148
00:09:52,960 --> 00:09:56,630
and send it to the
denominator of the other side.

149
00:09:56,630 --> 00:10:00,940
And this gives us the
final form of the solution:

150
00:10:00,940 --> 00:10:05,280
the peak of the posterior
distribution, which is also

151
00:10:05,280 --> 00:10:08,400
the same as the conditional
expectation of the posterior

152
00:10:08,400 --> 00:10:09,240
distribution.

153
00:10:09,240 --> 00:10:12,670
Whenever we have a
normal distribution,

154
00:10:12,670 --> 00:10:15,550
the expected value is
the same as the place

155
00:10:15,550 --> 00:10:17,390
where the distribution
is largest.

156
00:10:21,280 --> 00:10:25,000
Let us now conclude with a few
comments and words about how

157
00:10:25,000 --> 00:10:28,120
to interpret the
result that we found.

158
00:10:28,120 --> 00:10:32,210
First, let me emphasize that the
same conclusions that we have

159
00:10:32,210 --> 00:10:35,680
obtained for the case
of a single observation

160
00:10:35,680 --> 00:10:40,110
go through in this case as well.

161
00:10:40,110 --> 00:10:43,250
The posterior distribution
of the unknown parameter

162
00:10:43,250 --> 00:10:47,480
is still a normal distribution.

163
00:10:47,480 --> 00:10:49,860
Our state of
knowledge about Theta

164
00:10:49,860 --> 00:10:52,000
after we obtain
the observations is

165
00:10:52,000 --> 00:10:54,450
described by a
normal distribution.

166
00:10:54,450 --> 00:10:56,190
Because it is a
normal distribution,

167
00:10:56,190 --> 00:11:00,690
the location of its peak is
the same as the expected value.

168
00:11:00,690 --> 00:11:03,070
And for this reason, the
conditional expectation

169
00:11:03,070 --> 00:11:05,640
estimate and the maximum
a posteriori probability

170
00:11:05,640 --> 00:11:07,490
estimates coincide.

171
00:11:07,490 --> 00:11:11,760
And finally, the form of
the estimates that we get is

172
00:11:11,760 --> 00:11:16,700
a linear function in the xi's.

173
00:11:16,700 --> 00:11:20,430
And this linearity is a very
convenient property to have,

174
00:11:20,430 --> 00:11:23,180
because it allows
further analysis

175
00:11:23,180 --> 00:11:26,630
of these ways of
obtaining estimates.

176
00:11:26,630 --> 00:11:30,620
How do we interpret
this formula?

177
00:11:30,620 --> 00:11:33,530
What we have here
is the following.

178
00:11:33,530 --> 00:11:35,880
Each one of the
xi's gets multiplied

179
00:11:35,880 --> 00:11:39,760
by a certain coefficient,
which is 1 over the variance.

180
00:11:39,760 --> 00:11:42,750
And in the denominator,
we have the sum

181
00:11:42,750 --> 00:11:44,870
of all of those coefficients.

182
00:11:44,870 --> 00:11:50,110
So what we really have here is
a weighted average of the xi's.

183
00:11:50,110 --> 00:11:52,720
Now keep in mind that
those xi's are not

184
00:11:52,720 --> 00:11:54,520
all of them of the same kind.

185
00:11:54,520 --> 00:11:58,170
One term is x0, which
is the prior mean,

186
00:11:58,170 --> 00:12:02,470
whereas the remaining xi's are
the values of the observations.

187
00:12:02,470 --> 00:12:05,080
So there's something
interesting happening here.

188
00:12:05,080 --> 00:12:08,350
We combine the values
of the observations

189
00:12:08,350 --> 00:12:10,790
with the value of
the prior mean.

190
00:12:10,790 --> 00:12:12,900
And in some sense,
the prior mean

191
00:12:12,900 --> 00:12:17,200
is treated as just one
more piece of information

192
00:12:17,200 --> 00:12:19,010
available to us.

193
00:12:19,010 --> 00:12:23,100
And it is treated in
a sort of equal way

194
00:12:23,100 --> 00:12:26,060
as the other observations.

195
00:12:26,060 --> 00:12:29,980
The weight that we have
in this weighted average

196
00:12:29,980 --> 00:12:36,410
are that each xi gets divided
by the corresponding variance.

197
00:12:36,410 --> 00:12:38,130
Does this make sense?

198
00:12:38,130 --> 00:12:42,090
Well, suppose that sigma
i squared is large.

199
00:12:45,290 --> 00:12:49,300
This means that the noise
term Wi is very large.

200
00:12:49,300 --> 00:12:53,190
So Xi is very noisy.

201
00:12:53,190 --> 00:12:56,940
And so it's not a useful
observation to have.

202
00:12:56,940 --> 00:13:00,590
And in that case, it
gets a small weight.

203
00:13:03,320 --> 00:13:06,510
So the weights are
determined by the variances

204
00:13:06,510 --> 00:13:09,480
in a way that is quite sensible.

205
00:13:09,480 --> 00:13:12,350
Those observations that
will get the most weight

206
00:13:12,350 --> 00:13:15,730
will be those observations for
which the corresponding noise

207
00:13:15,730 --> 00:13:18,760
variance is small.

208
00:13:18,760 --> 00:13:22,280
So the solution to this
estimation problem that we just

209
00:13:22,280 --> 00:13:26,350
went through has
many nice properties.

210
00:13:26,350 --> 00:13:30,960
First, we stay within the world
of normal random variables,

211
00:13:30,960 --> 00:13:33,760
because even the
posterior is normal.

212
00:13:33,760 --> 00:13:36,540
We stay within the world
of linear functions

213
00:13:36,540 --> 00:13:39,590
of normal random
variables, and the form

214
00:13:39,590 --> 00:13:42,670
of the formula that
we have, itself,

215
00:13:42,670 --> 00:13:46,120
has an appealing
intuitive content.