The following content is provided under a Creative Commons license. Your support will help MIT OpenCourseWare continue to offer high quality educational resources for free. To make a donation or view additional materials from hundreds of MIT courses, visit MIT OpenCourseWare at ocw.mit.edu.

PROFESSOR: So we're going to finish today our discussion of Bayesian inference, which we started last time. As you probably saw, there aren't a huge number of concepts that we're introducing at this point in terms of specific skills for calculating probabilities. Rather, it's more a matter of interpretation and of setting up the framework.

So the framework in Bayesian estimation is that there is some parameter which is not known, but we have a prior distribution on it. These are beliefs about what this variable might be, and then we obtain some measurements. And the measurements are affected by the value of that parameter that we don't know. This effect, the fact that X is affected by Theta, is captured by introducing a conditional probability distribution: the distribution of X depends on Theta. It's a conditional probability distribution.

So we have formulas for these two densities, the prior density and the conditional density. And given that we have these, if we multiply them we can also get the joint density of X and Theta. So at this point we have everything there is to know about the model. And now we observe the random variable X. Given this random variable, what can we say about Theta? Well, what we can do is we can always calculate the conditional distribution of Theta given X. And now that we have the specific value of X, we can plot this as a function of Theta.

OK. And this is the complete answer to a Bayesian inference problem. This posterior distribution captures everything there is to say about Theta; that's what we know about Theta. Given the X that we have observed, Theta is still random, it's still unknown.
And it might be here, there, or there, with various probabilities. On the other hand, if you want to report a single value for Theta, then you do some extra work. You continue from here, and you do some data processing on X. Doing data processing means that you apply a certain function to the data, and this function is something that you design. It's the so-called estimator. And once that function is applied, it outputs an estimate of Theta, which we call Theta hat.

So this is sort of the big picture of what's happening. Now one thing to keep in mind is that even though I'm writing single letters here, in general Theta or X could be vector random variables. So think of this: it could be a collection Theta1, Theta2, Theta3. And maybe we obtained several measurements, so this X is really a vector X1, X2, up to Xn.

All right, so now how do we choose a Theta to report? There are various ways of doing it. One is to look at the posterior distribution and report the value of Theta at which the density or the PMF is highest. This is called the maximum a posteriori estimate. So we pick a value of Theta for which the posterior is maximum, and we report it.

An alternative way is to try to be optimal with respect to a mean squared error. So what is this? If we have a specific estimator, g, this is the estimate it's going to produce. This is the true value of Theta, so this is our estimation error. We look at the square of the estimation error, and look at its average value. We would like this squared estimation error to be as small as possible. How can we design our estimator g to make that error as small as possible? It turns out that the answer is to produce, as an estimate, the conditional expectation of Theta given X. So the conditional expectation is the best estimate that you could produce if your objective is to keep the mean squared error as small as possible.
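[In symbols, the two point estimates just described are (this is only a restatement of the statements on the slides, in standard notation):

\[
\hat{\theta}_{\mathrm{MAP}} = \arg\max_{\theta} f_{\Theta \mid X}(\theta \mid x),
\qquad
\hat{\theta}_{\mathrm{LMS}} = \mathbb{E}[\Theta \mid X = x],
\]

and the least mean squares estimate is optimal in the sense that

\[
\mathbb{E}\big[(\Theta - \mathbb{E}[\Theta \mid X])^2\big] \le \mathbb{E}\big[(\Theta - g(X))^2\big]
\quad \text{for every estimator } g.]
\]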
So this statement here is a statement about what happens on the average, over all Thetas and all X's that may occur in our experiment. The conditional expectation as an estimator has an even stronger property. Not only is it optimal on the average, but it's also optimal given that you have made a specific observation, no matter what you observe. Let's say you observe a specific value of the random variable X. After that point, if you're asked to produce a best estimate Theta hat that minimizes this mean squared error, your best estimate would be the conditional expectation given the specific value that you have observed.

These two statements say almost the same thing, but this one is a bit stronger. This one tells you that no matter what specific X happens, the conditional expectation is the best estimate. This one tells you that on the average, over all X's that may happen, the conditional expectation is the best estimator. Now, this one is really a consequence of the other. If the conditional expectation is best for any specific X, then it's the best one even when X is left random and you are averaging your error over all possible X's.

OK, so now that we know what is the optimal way of producing an estimate, let's do a simple example to see how things work out. So we start with an unknown random variable, Theta, which is uniformly distributed between 4 and 10. And then we have an observation model that tells us that, given the value of Theta, X is going to be a random variable that ranges between Theta - 1 and Theta + 1. So think of X as a noisy measurement of Theta: Theta plus some noise, which is between -1 and +1. So really the model that we are using here is that X is equal to Theta plus U, where U is uniform on -1 to +1. So we have the true value of Theta, but X could be as low as Theta - 1, or it could be all the way up to Theta + 1. And X is uniformly distributed on that interval. That's the same as saying that U is uniformly distributed over this interval.
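[Written out as densities, the model just described is:

\[
f_\Theta(\theta) = \frac{1}{6}, \quad 4 \le \theta \le 10,
\qquad
f_{X \mid \Theta}(x \mid \theta) = \frac{1}{2}, \quad \theta - 1 \le x \le \theta + 1,
\]

which is the same as saying $X = \Theta + U$, with $U$ uniform on $[-1, 1]$ and independent of $\Theta$.]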
So now we have all the information that we need; we can construct the joint density. And the joint density is, of course, the prior density times the conditional density. We know both of these. Both of these are constants, so the joint density is also going to be a constant: 1/6 times 1/2, which is 1/12. But it is a constant not everywhere, only on the range of possible x's and Thetas. So Theta can take any value between 4 and 10; these are the values of Theta. And for any given value of Theta, X can take values from Theta minus 1 up to Theta plus 1. So here, if you can imagine, there is a line that goes with slope 1, and X can take values within plus or minus 1 of that line. So this object here is the set of possible x and Theta pairs. So the density is equal to 1/12 on this set, and it's zero everywhere else. So outside here the density is zero; the density is nonzero only on that set.

All right, so now we're asked to estimate Theta in terms of X. So we want to build an estimator, which is going to be a function from the x's to the Thetas. That's why I chose the axes this way, x on this axis and Theta on that axis, because the estimator we're building is a function of x. Based on the observation that we obtained, we want to estimate Theta. So we know that the optimal estimator is the conditional expectation, given the value of x.

So what is the conditional expectation? If you fix a particular value of x, let's say in this range, so this is our x, then what do we know about Theta? We know that Theta lies in this range. Theta can only be somewhere between those two values. And what kind of distribution does Theta have? What is the conditional distribution of Theta given x? Well, remember how we built conditional distributions from joint distributions? The conditional distribution is just a section of the joint distribution, taken at the place where we're conditioning. So the joint is constant.
So the conditional is also going to be a constant density over this interval. So the posterior distribution of Theta is uniform over this interval. And if the posterior of Theta is uniform over that interval, the expected value of Theta is going to be the midpoint of that interval. So the estimate which you report, if you observe that x, is going to be this particular point here; it's the midpoint. The same argument goes through even if you obtain an x somewhere here. Given this x, Theta can take a value between these two values. Theta is going to have a uniform distribution over this interval, and the conditional expectation of Theta given x is going to be the midpoint of that interval.

So now if we plot our estimator by tracing midpoints in this diagram, what you're going to obtain is a curve that starts like this, then changes slope, so that it keeps track of the midpoint, and then goes like that again. So this blue curve here is our g of x, which is the conditional expectation of Theta given that X is equal to little x. So it's a curve; in our example it consists of three straight segments. But overall it's nonlinear. It's not a single line through this diagram. And that's how things are in general. g of x, our optimal estimate, has no reason to be a linear function of x. In general it's going to be some complicated curve.

So how good is our estimate? I mean, you reported your estimate of Theta based on x, and your boss asks you what kind of error you expect to get. Having observed the particular value of x, what you can report to your boss is what you think the mean squared error is going to be. We observe the particular value of x, so we're conditioning, and we're living in this universe.
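[The quantity under discussion is the conditional mean squared error, given the particular observation:

\[
\mathbb{E}\big[(\hat{\Theta} - \Theta)^2 \mid X = x\big],
\qquad \text{where } \hat{\Theta} = \mathbb{E}[\Theta \mid X = x].]
\]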
Given that we have made this observation, this is the true value of Theta, this is the estimate that we have produced, and this is the expected squared error, given that we have made the particular observation. Now, in this conditional universe, this is the expected value of Theta given X. So this is the expected value of this random variable inside the conditional universe. And when you take the expected square of a random variable minus its expected value, that is the same thing as the variance of that random variable, except that here it's the variance inside the conditional universe. Having observed x, Theta is still a random variable. It's distributed according to the posterior distribution. Since it's a random variable, it has a variance. And that variance is our mean squared error. So this is the variance of the posterior distribution of Theta, given the observation that we have made.

OK, so what is the variance in our example? If X happens to be here, then Theta is uniform over this interval, and this interval has length 2. Theta is uniformly distributed over an interval of length 2. This is the posterior distribution of Theta. What is the variance? You remember the formula for the variance of a uniform random variable: it is the length of the interval squared, divided by 12, so this is 1/3. So the variance of Theta, the mean squared error, is going to be 1/3 whenever this kind of picture applies. This picture applies when X is between 5 and 9. If X is less than 5, then the picture is a little different, and Theta is going to be uniform over a smaller interval. And so the variance of Theta is going to be smaller as well.

So let's start plotting our mean squared error. Between 5 and 9 the variance of Theta, the posterior variance, is 1/3. Now when X falls in here, Theta is uniformly distributed over a smaller interval. The size of this interval changes linearly over that range.
And so when we take the squared size of that interval, we get a quadratic function of how much we have moved from that corner. So at that corner, what is the variance of Theta? Well, if I observe an X that's equal to 3, then I know with certainty that Theta is equal to 4. Then I'm in very good shape; I know exactly what Theta is going to be. So the variance, in this case, is going to be 0. If I observe an X that's a little larger, then Theta is now random, it takes values in a little interval, and the variance of Theta is going to be proportional to the square of the length of that little interval. So we get a curve that starts rising quadratically from here. It goes up toward 1/3.

At the other end of the picture the same is true. If you observe an X which is 11, then Theta can only be equal to 10. And so the error in Theta is equal to 0; there's 0 error variance. But as we obtain X's that are slightly less than 11, the mean squared error again rises quadratically. So we end up with a plot like this.

What this plot tells us is that certain measurements are better than others. If you see X equal to 3, then you're lucky, because you know exactly what Theta is. If you see an X which is equal to 6, then you're sort of unlucky, because it doesn't tell you Theta with great precision. Theta could be anywhere on that interval. And so the variance of Theta, even after you have observed X, is a certain number, 1/3 in our case.

So the moral to take out of that story is that the error variance, or the mean squared error, depends on what particular observation you happen to obtain. Some observations may be very informative; once you see a specific number, you know exactly what Theta is. Some observations might be less informative. You observe your X, but it could still leave a lot of uncertainty about Theta.
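[Here is a minimal numerical sketch of this example (not from the lecture, just an illustration of the formulas above; function names are made up). For an observed x between 3 and 11, the posterior of Theta is uniform on the interval [max(4, x - 1), min(10, x + 1)], so the LMS estimate is the midpoint of that interval and the conditional mean squared error is the interval length squared over 12.

```python
# Sketch of the uniform example: Theta ~ Uniform(4, 10), X = Theta + U, U ~ Uniform(-1, 1).

def posterior_interval(x):
    """Endpoints of the (uniform) posterior of Theta given X = x."""
    return max(4.0, x - 1.0), min(10.0, x + 1.0)

def lms_estimate(x):
    """E[Theta | X = x]: the midpoint of the posterior interval."""
    lo, hi = posterior_interval(x)
    return (lo + hi) / 2.0

def conditional_mse(x):
    """Var(Theta | X = x): (interval length)^2 / 12 for a uniform posterior."""
    lo, hi = posterior_interval(x)
    return (hi - lo) ** 2 / 12.0

if __name__ == "__main__":
    for x in [3.0, 4.0, 6.0, 9.5, 11.0]:
        print(f"x = {x:4.1f}  estimate = {lms_estimate(x):5.2f}  MSE = {conditional_mse(x):.3f}")
    # For x between 5 and 9 the MSE is 1/3; it drops quadratically to 0 at x = 3 and x = 11.
```

Running this reproduces the piecewise-linear estimator with three straight segments and the mean squared error curve that is flat at 1/3 in the middle and falls to 0 at the two ends.]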
So conditional expectations are really the cornerstone of Bayesian estimation. They're particularly popular in engineering contexts. They're used a lot in signal processing, communications, control theory, and so on. So that makes it worth playing a little bit with their theoretical properties, and getting some appreciation of a few subtleties involved here. There is no new math, really, in what we're going to do. But it's going to be a good opportunity to practice manipulating conditional expectations.

So let's look at the expected value of the estimation error that we obtain. So Theta hat is our estimator; it is the conditional expectation. Theta hat minus Theta is the error we have made. If Theta hat is bigger than Theta, then we have made a positive error. If not, if it's on the other side, we have made a negative error. It turns out that, on the average, the errors cancel each other out.

So let's do this calculation. Let's calculate the expected value of the error given X. Now, by definition, the error is the expected value of Theta hat minus Theta, given X. We use linearity of expectations to break it up as the expected value of Theta hat given X, minus the expected value of Theta given X. And now what? Our estimate is made on the basis of the data, of the X's. If I tell you X, then you know what Theta hat is. Remember that the conditional expectation is a random variable which is a function of the random variable on which you're conditioning. If you know X, then you know the conditional expectation given X; you know what Theta hat is going to be. So Theta hat is a function of X. And if it's a function of X, then once I tell you X, you know what Theta hat is going to be. So this conditional expectation is going to be Theta hat itself. And here, this is, just by definition, Theta hat. And so we get that this equals 0.
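[Writing out the calculation that was just described, with the estimation error denoted by $\tilde{\Theta} = \hat{\Theta} - \Theta$ and $\hat{\Theta} = \mathbb{E}[\Theta \mid X]$:

\[
\mathbb{E}[\tilde{\Theta} \mid X]
= \mathbb{E}[\hat{\Theta} - \Theta \mid X]
= \mathbb{E}[\hat{\Theta} \mid X] - \mathbb{E}[\Theta \mid X]
= \hat{\Theta} - \hat{\Theta} = 0.]
\]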
So what we have proved is that no matter what I have observed, given that I have observed something, on the average my error is going to be 0. This is a statement involving equality of random variables. Remember that conditional expectations are random variables, because they depend on the thing you're conditioning on. 0 is sort of a trivial random variable. This tells you that this random variable is identically equal to the zero random variable. More specifically, it tells you that no matter what value of X you observe, the conditional expectation of the error is going to be 0. And this takes us to this statement here, which is an equality between numbers. No matter what specific value of capital X you have observed, your error, on the average, is going to be equal to 0. So this is a less abstract version of these statements. This is an equality between two numbers. It's true for every value of little x, so it's also true in terms of this random variable being equal to that random variable. Because, remember, according to our definition, this random variable is the random variable that takes this specific value when capital X happens to be equal to little x.

Now, this doesn't mean that your error is 0. It only means that your error is as likely, in some sense, to fall on the positive side as to fall on the negative side. So sometimes your error will be positive, sometimes negative. And on the average these things cancel out and give you 0, on the average. So this is a property that is sometimes given a name: we say that Theta hat is unbiased. So Theta hat, our estimate, does not have a tendency to be on the high side. It does not have a tendency to be on the low side. On the average, it's just right.

So let's do a little more playing here. Let's see how our error is related to an arbitrary function of the data. Let's do this in a conditional universe and look at this quantity.
In a conditional universe where X is known, h of X is known. And so you can pull it outside the expectation. In the conditional universe where the value of X is given, this quantity becomes just a constant. There's nothing random about it. So you can pull it out of the expectation, and write things this way. And we have just calculated that this quantity is 0. So this number turns out to be 0 as well.

Now, having done this, we can take expectations of both sides. And now let's use the law of iterated expectations. The expectation of a conditional expectation gives us the unconditional expectation, and this is also going to be 0. So here we used the law of iterated expectations.

OK, why are we doing this? We're doing this because I would like to calculate the covariance between Theta tilde and Theta hat. That is, I ask the question: is there a systematic relation between the error and the estimate? So to calculate the covariance we use the property that we can calculate covariances as the expected value of the product minus the product of the expected values. And what do we get? This is 0, because of what we just proved. And this is 0, because of what we proved earlier, that the expected value of the error is equal to 0. So the covariance between the error and any function of X is equal to 0. Let's apply that to the case where the function of X we're considering is Theta hat itself. Theta hat is our estimate; it's a function of X. So this zero result still applies, and we get that this covariance is equal to 0. OK, so that's what we proved.

Let's see, what are the morals to take out of all this? First, you should be very comfortable with this type of calculation involving conditional expectations.
The main two things that we're using are, first, that when you condition on a random variable, any function of that random variable becomes a constant and can be pulled out of the conditional expectation. The other thing that we're using is the law of iterated expectations. So these are the skills involved.

Now, on the substance, why is this result interesting? This tells us that the error is uncorrelated with the estimate. What's a hypothetical situation in which this would not happen? Suppose that whenever Theta hat is positive, my error tends to be negative. Suppose that whenever Theta hat is big, you say, oh, my estimate is too big, maybe the true Theta is on the lower side, so I expect my error to be negative. That would be a situation that would violate this condition. This condition tells you that no matter what Theta hat is, you don't expect your error to be on the positive side or on the negative side. Your error will still be 0 on the average. So if you obtain a very high estimate, this is no reason for you to suspect that the true Theta is lower than your estimate.

If you suspected that the true Theta was lower than your estimate, you should have changed your Theta hat. If you make an estimate, and after obtaining that estimate you say, I think my estimate is too big, and so the error is negative — if you thought that way, then that would mean that your estimate is not the optimal one, that your estimate should have been corrected to be smaller. And that would mean that there's a better estimate than the one you used. But the estimate that we are using here is the optimal one in terms of mean squared error; there's no way of improving it. And this is really captured in that statement. That is, knowing Theta hat doesn't give you a lot of information about the error, and it gives you, therefore, no reason to adjust your estimate from what it was.

Finally, a consequence of all this. This is the definition of the error. Send Theta to this side, send Theta tilde to that side, and you get this relation.
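[The relation being referred to, together with the consequence of the zero covariance just proved, is:

\[
\Theta = \hat{\Theta} - \tilde{\Theta},
\qquad
\operatorname{cov}(\hat{\Theta}, \tilde{\Theta}) = 0
\quad\Longrightarrow\quad
\operatorname{var}(\Theta) = \operatorname{var}(\hat{\Theta}) + \operatorname{var}(\tilde{\Theta}).]
\]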
The true parameter is composed of two quantities: the estimate, and the error, taken with a minus sign. These two quantities are uncorrelated with each other. Their covariance is 0, and therefore the variance of this is the sum of the variances of these two quantities.

So what's an interpretation of this equality? There is some inherent randomness in the random variable Theta that we're trying to estimate. Theta hat tries to estimate it, tries to get close to it. And if Theta hat always stays close to Theta, then since Theta is random, Theta hat must also be quite random, so it has uncertainty in it. And the more uncertain Theta hat is, the more it moves together with Theta, and so the more uncertainty it removes from Theta. And this is the remaining uncertainty in Theta, the uncertainty that's left after we've done our estimation. So ideally, to have a small error, we want this quantity to be small, which is the same as saying that this quantity should be big. In the ideal case, Theta hat is the same as Theta. That's the best we could hope for. That corresponds to 0 error, and all the uncertainty in Theta is absorbed by the uncertainty in Theta hat.

Interestingly, this relation here is just another variation of the law of total variance that we have seen at some point in the past. I will skip that derivation, but it's an interesting fact, and it can give you an alternative interpretation of the law of total variance.

OK, so now let's return to our example. In our example we obtained the optimal estimator, and we saw that it was a nonlinear curve, something like this. I'm exaggerating the corners a little bit to show that it's nonlinear. This is the optimal estimator. It's a nonlinear function of X, and nonlinear generally means complicated. Sometimes the conditional expectation is really hard to compute, because whenever you have to compute expectations you need to do some integrals.
And if you have many random variables involved, it might correspond to a multi-dimensional integration. We don't like this. Can we come up, maybe, with a simpler way of estimating Theta? Of coming up with a point estimate which still has some nice properties, has some good motivation, but is simpler. What does simpler mean? Perhaps linear. Let's put ourselves in a straitjacket and restrict ourselves to estimators that are of this form. My estimate is constrained to be a linear function of the X's. So my estimator is going to be a curve, a linear curve. It could be this, it could be that, or maybe it would want to be something like this. I want to choose the best possible linear function.

What does that mean? It means that I write my Theta hat in this form. If I fix a certain a and b, I have fixed the functional form of my estimator, and this is the corresponding mean squared error. That's the error between the true parameter and the estimate of that parameter, and we take the square of this. And now the optimal linear estimator is defined as one for which this mean squared error is smallest possible over all choices of a and b. So we want to minimize this expression over all a's and b's.

How do we do this minimization? Well, this is a square; you can expand it. Write down all the terms in the expansion of the square. So you're going to get the term expected value of Theta squared. You're going to get another term, a squared times the expected value of X squared, another term which is b squared, and then you're going to get various cross terms. What you have here is really a quadratic function of a and b. So think of the quantity that we're minimizing as some function h of a and b, and it happens to be quadratic. How do we minimize a quadratic function? We set the derivatives of this function with respect to a and b to 0, and then do the algebra.
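[For reference, here is that algebra step written out, under the same setup of minimizing over estimators of the form $\hat{\Theta} = aX + b$. Setting the partial derivatives of $h(a,b) = \mathbb{E}[(\Theta - aX - b)^2]$ to zero gives

\[
\frac{\partial h}{\partial b} = 0 \;\Rightarrow\; b = \mathbb{E}[\Theta] - a\,\mathbb{E}[X],
\qquad
\frac{\partial h}{\partial a} = 0 \;\Rightarrow\; a\,\mathbb{E}[X^2] + b\,\mathbb{E}[X] = \mathbb{E}[\Theta X],
\]

and solving the two equations yields

\[
a = \frac{\operatorname{cov}(\Theta, X)}{\operatorname{var}(X)},
\qquad
b = \mathbb{E}[\Theta] - a\,\mathbb{E}[X],
\quad\text{so}\quad
\hat{\Theta}_L = \mathbb{E}[\Theta] + \frac{\operatorname{cov}(\Theta, X)}{\operatorname{var}(X)}\big(X - \mathbb{E}[X]\big).]
\]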
After you do the algebra, you find that the best choice for a is this one; this is the coefficient next to X. This is the optimal a. And the optimal b corresponds to the constant terms. So this term, and this times that, together give the optimal choice of b.

So the algebra itself is not very interesting. What is really interesting is the nature of the result that we get here. If we were to plot the result on this particular example, you would get a curve that's something like this. It goes through the middle of this diagram and is a little slanted. In this example, X and Theta are positively correlated. Bigger values of X generally correspond to bigger values of Theta. So in this example the covariance between X and Theta is positive, and our estimate can be interpreted in the following way.

The expected value of Theta is the estimate that you would come up with if you didn't have any information about Theta. If you don't make any observations, this is the best way of estimating Theta. But I have made an observation, X, and I need to take it into account. I look at this difference, which is the piece of news contained in X. That term is what X should be on the average. If I observe an X which is bigger than what I expected it to be, then since X and Theta are positively correlated, this tells me that Theta should also be bigger than its average value. Whenever I see an X that's larger than its average value, this gives me an indication that Theta should also probably be larger than its average value. And so I'm taking that difference and multiplying it by a positive coefficient. And that's what gives me a curve here that has a positive slope.

So this increment, the new information contained in X as compared to the average value we expected a priori, allows us to make a correction to our prior estimate of Theta, and the amount of that correction is guided by the covariance of X with Theta.
If the covariance of X with Theta were 0, that would mean there's no systematic relation between the two, and in that case obtaining some information from X doesn't give us a guide as to how to change the estimate of Theta. If that covariance were 0, we would just stay with this particular estimate. We're not able to make a correction. But when there's a nonzero covariance between X and Theta, that covariance works as a guide for us to obtain a better estimate of Theta.

How about the resulting mean squared error? In this context it turns out that there's a very nice formula for the mean squared error obtained from the best linear estimator. What's the story here? The mean squared error that we have has something to do with the variance of the original random variable. The more uncertain our original random variable is, the more error we're going to make. On the other hand, when the two variables are correlated, we exploit that correlation to improve our estimate. This rho here is the correlation coefficient between the two random variables. When this correlation coefficient is larger, this factor here becomes smaller, and our mean squared error becomes smaller.

So think of the two extreme cases. One extreme case is when rho is equal to 1, so X and Theta are perfectly correlated. When they're perfectly correlated, once I know X, then I also know Theta, and the two random variables are linearly related. In that case, my estimate is right on the target, and the mean squared error is going to be 0. The other extreme case is when rho is equal to 0. The two random variables are uncorrelated. In that case the measurement does not help me estimate Theta, and the uncertainty that's left, the mean squared error, is just the original variance of Theta. So the uncertainty in Theta does not get reduced.
So, the moral: the estimation error is a reduced version of the original amount of uncertainty in the random variable Theta, and the larger the correlation between those two random variables, the more uncertainty we can remove from the original random variable. I didn't derive this formula, but it's just a matter of algebraic manipulations. We have a formula for Theta hat; subtract Theta from that formula, take the square, take expectations, and do a few lines of algebra that you can read in the text, and you end up with this really neat and clean formula.

Now, I mentioned at the beginning of the lecture that we can do inference with Thetas and X's that are not just single numbers; they could be vector random variables. So, for example, we might have multiple data points that give us information about Theta. There are no vectors here, so this discussion was for the case where Theta and X were just scalar, one-dimensional quantities. What do we do if we have multiple data? Suppose that Theta is still a scalar, it's one-dimensional, but we make several observations. And on the basis of these observations we want to estimate Theta.
The optimal least mean squares estimator would again be the conditional expectation of Theta given X. That's the optimal one. And in this case X is a vector, so the general estimator we would use would be this one. But if we want to keep things simple, and we want our estimator to have a simple functional form, we might restrict ourselves to estimators that are linear functions of the data. And then the story is exactly the same as we discussed before. I constrain myself to estimating Theta using a linear function of the data, so my signal processing box just applies a linear function. And I'm looking for the best coefficients, the coefficients that are going to result in the least possible squared error. This is my squared error: it is (my estimate minus the thing I'm trying to estimate) squared, and then we take the average.

How do we do this? Same story as before. The X's and the Thetas get averaged out because we have an expectation. Whatever is left is just a function of the coefficients, of the a's and of b. As before, it turns out to be a quadratic function. Then we set the derivatives of this function of the a's and b with respect to the coefficients to 0. And this gives us a system of linear equations. It's a system of linear equations that's satisfied by those coefficients. It's a linear system because this is a quadratic function of those coefficients.

To get closed-form formulas in this particular case, one would need to introduce vectors, and matrices, and matrix inverses, and so on. The particular formulas are not so much what interests us here; rather, the interesting thing is that this is simply done using straightforward solvers of linear equations. The only thing you need to do is to write down the correct coefficients of those linear equations. And what would a typical coefficient be? Let's take a typical term of this quadratic, once you expand it. You're going to get terms such as a1 X1 times a2 X2. When you take expectations, you're left with a1 a2 times the expected value of X1 X2. So this would involve terms such as a1 squared times the expected value of X1 squared, terms such as a1 a2 times the expected value of X1 X2, and a lot of other terms of this kind. So you get something that's quadratic in your coefficients. And the constants that show up in your system of equations are things that have to do with the expected values of squares of your random variables, or of products of your random variables.
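[As a concrete illustration (a minimal sketch, not from the lecture; the function and variable names are made up), here is how the linear LMS coefficients for a vector of observations can be found by solving a linear system built only from means and covariances. Writing the estimator as Theta_hat = a^T X + b, the optimality conditions reduce to Cov(X) a = cov(X, Theta) and b = E[Theta] - a^T E[X].

```python
import numpy as np

def linear_lms_coefficients(mean_x, cov_x, mean_theta, cov_x_theta):
    """Coefficients of the linear LMS estimator Theta_hat = a^T X + b.

    mean_x      : (n,) vector of E[X_i]
    cov_x       : (n, n) covariance matrix of X
    mean_theta  : scalar E[Theta]
    cov_x_theta : (n,) vector of cov(X_i, Theta)

    Only first and second moments are needed -- no full distributions.
    """
    a = np.linalg.solve(cov_x, cov_x_theta)   # normal equations: Cov(X) a = cov(X, Theta)
    b = mean_theta - a @ mean_x
    return a, b

if __name__ == "__main__":
    # Hypothetical numbers, just to show the call.
    mean_x = np.array([1.0, 2.0])
    cov_x = np.array([[2.0, 0.5],
                      [0.5, 1.0]])
    a, b = linear_lms_coefficients(mean_x, cov_x, mean_theta=3.0,
                                   cov_x_theta=np.array([1.0, 0.3]))
    print("a =", a, "b =", b)
```]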
775 00:43:39,130 --> 00:43:43,060 To write down the numerical values for these, the only 776 00:43:43,060 --> 00:43:46,330 thing you need to know are the means and variances of your 777 00:43:46,330 --> 00:43:47,570 random variables. 778 00:43:47,570 --> 00:43:50,360 If you know the mean and variance then you know what 779 00:43:50,360 --> 00:43:51,760 this thing is. 780 00:43:51,760 --> 00:43:54,950 And if you know the covariances as well then you 781 00:43:54,950 --> 00:43:57,250 know what this thing is. 782 00:43:57,250 --> 00:44:02,080 So in order to find the optimal linear estimator in 783 00:44:02,080 --> 00:44:06,870 the case of multiple data you do not need to know the entire 784 00:44:06,870 --> 00:44:09,230 probability distribution of the random 785 00:44:09,230 --> 00:44:11,050 variables that are involved. 786 00:44:11,050 --> 00:44:14,690 You only need to know your means and covariances. 787 00:44:14,690 --> 00:44:18,670 These are the only quantities that affect the construction 788 00:44:18,670 --> 00:44:20,570 of your optimal estimator. 789 00:44:20,570 --> 00:44:23,840 We could see this already in this formula. 790 00:44:23,840 --> 00:44:29,650 The form of my optimal estimator is completely 791 00:44:29,650 --> 00:44:34,100 determined once I know the means, variances, and 792 00:44:34,100 --> 00:44:37,970 covariances of the random variables in my model. 793 00:44:37,970 --> 00:44:44,410 I do not need to know the detailed distribution of the 794 00:44:44,410 --> 00:44:46,570 random variables that are involved here. 795 00:44:46,570 --> 00:44:51,690 796 00:44:51,690 --> 00:44:55,110 So as I said, in general you find the form of the optimal 797 00:44:55,110 --> 00:44:59,550 estimator by using a linear equation solver. 798 00:44:59,550 --> 00:45:01,890 There are special examples in which you can 799 00:45:01,890 --> 00:45:05,210 get closed-form solutions. 800 00:45:05,210 --> 00:45:10,090 The nicest, simplest estimation problem one can think of is 801 00:45:10,090 --> 00:45:11,120 the following-- 802 00:45:11,120 --> 00:45:14,870 you have some uncertain parameter, and you make 803 00:45:14,870 --> 00:45:17,790 multiple measurements of that parameter in 804 00:45:17,790 --> 00:45:19,950 the presence of noise. 805 00:45:19,950 --> 00:45:22,520 So the Wi's are noises. 806 00:45:22,520 --> 00:45:25,130 i corresponds to your i-th experiment. 807 00:45:25,130 --> 00:45:27,810 So this is the most common situation that you encounter 808 00:45:27,810 --> 00:45:28,490 in the lab. 809 00:45:28,490 --> 00:45:31,240 If you are dealing with some process and you're trying to 810 00:45:31,240 --> 00:45:34,110 measure something, you measure it over and over. 811 00:45:34,110 --> 00:45:37,030 Each time your measurement has some random error. 812 00:45:37,030 --> 00:45:40,360 And then you need to take all your measurements together and 813 00:45:40,360 --> 00:45:43,550 come up with a single estimate. 814 00:45:43,550 --> 00:45:48,320 So the noises are assumed to be independent of each other, 815 00:45:48,320 --> 00:45:50,010 and also to be independent from the 816 00:45:50,010 --> 00:45:52,090 value of the true parameter. 817 00:45:52,090 --> 00:45:55,010 Without loss of generality we can assume that the noises 818 00:45:55,010 --> 00:45:58,890 have 0 mean, and they have some variances that we 819 00:45:58,890 --> 00:46:00,340 assume to be known. 820 00:46:00,340 --> 00:46:03,180 Theta itself has a prior distribution with a certain 821 00:46:03,180 --> 00:46:05,670 mean and a certain variance.
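[Editor's note: the model and formula on the slide are not reproduced in the transcript. A reconstruction in standard notation is given below, where mu and sigma_0^2 denote the prior mean and variance of Theta, and sigma_i^2 the variance of the noise W_i; this is the weighted-average form discussed next.]

\[
X_i = \Theta + W_i, \quad i = 1, \dots, n,
\qquad
\hat{\Theta} \;=\; \frac{\dfrac{\mu}{\sigma_0^{2}} + \displaystyle\sum_{i=1}^{n} \frac{X_i}{\sigma_i^{2}}}
{\dfrac{1}{\sigma_0^{2}} + \displaystyle\sum_{i=1}^{n} \frac{1}{\sigma_i^{2}}}.
\]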
822 00:46:05,670 --> 00:46:07,610 So the form of the optimal linear 823 00:46:07,610 --> 00:46:10,940 estimator is really nice. 824 00:46:10,940 --> 00:46:14,930 Well, maybe you cannot see it right away because this looks 825 00:46:14,930 --> 00:46:18,580 messy, but what is it really? 826 00:46:18,580 --> 00:46:24,590 It's a linear combination of the X's and the prior mean. 827 00:46:24,590 --> 00:46:28,560 And it's actually a weighted average of the X's and the 828 00:46:28,560 --> 00:46:30,250 prior mean. 829 00:46:30,250 --> 00:46:33,570 Here we collect all of the coefficients that 830 00:46:33,570 --> 00:46:35,920 we have at the top. 831 00:46:35,920 --> 00:46:42,060 So the whole thing is basically a weighted average. 832 00:46:42,060 --> 00:46:46,460 833 00:46:46,460 --> 00:46:51,110 1/(sigma_i-squared) is the weight that we give to Xi, and 834 00:46:51,110 --> 00:46:54,710 in the denominator we have the sum of all of the weights. 835 00:46:54,710 --> 00:46:59,260 So in the end we're dealing with a weighted average. 836 00:46:59,260 --> 00:47:03,760 If mu were equal to 1, and all the Xi's were equal to 1, then 837 00:47:03,760 --> 00:47:06,790 our estimate would also be equal to 1. 838 00:47:06,790 --> 00:47:10,670 Now the form of the weights that we have is interesting. 839 00:47:10,670 --> 00:47:16,050 Any given data point is weighted in inverse 840 00:47:16,050 --> 00:47:17,820 proportion to its variance. 841 00:47:17,820 --> 00:47:20,270 What does that say? 842 00:47:20,270 --> 00:47:26,920 If my i-th data point has a lot of variance, if Wi is very 843 00:47:26,920 --> 00:47:32,900 noisy, then Xi is not very useful, is not very reliable. 844 00:47:32,900 --> 00:47:36,840 So I'm giving it a small weight. 845 00:47:36,840 --> 00:47:41,870 Large variance, a lot of error in my Xi, means that I should 846 00:47:41,870 --> 00:47:44,200 give it a smaller weight. 847 00:47:44,200 --> 00:47:47,920 If two data points have the same variance, if they're of 848 00:47:47,920 --> 00:47:50,140 comparable quality, then I'm going to 849 00:47:50,140 --> 00:47:51,950 give them equal weight. 850 00:47:51,950 --> 00:47:56,200 The other interesting thing is that the prior mean is treated 851 00:47:56,200 --> 00:47:58,300 the same way as the X's. 852 00:47:58,300 --> 00:48:03,050 So it's treated as an additional observation. 853 00:48:03,050 --> 00:48:07,100 So we're taking a weighted average of the prior mean and 854 00:48:07,100 --> 00:48:09,850 of the measurements that we are making. 855 00:48:09,850 --> 00:48:13,380 The formula looks as if the prior mean were just another 856 00:48:13,380 --> 00:48:14,210 data point. 857 00:48:14,210 --> 00:48:17,440 So that's the way of thinking about Bayesian estimation. 858 00:48:17,440 --> 00:48:20,270 You have your real data points, the X's that you 859 00:48:20,270 --> 00:48:23,430 observe, and you also have some prior information. 860 00:48:23,430 --> 00:48:27,470 This plays a role similar to a data point. 861 00:48:27,470 --> 00:48:31,580 It is interesting to note that if all random variables in 862 00:48:31,580 --> 00:48:35,230 this model are normal, this optimal linear estimator happens to be 863 00:48:35,230 --> 00:48:36,950 also the conditional expectation. 864 00:48:36,950 --> 00:48:40,000 That's the nice thing about normal random variables: 865 00:48:40,000 --> 00:48:42,770 conditional expectations turn out to be linear. 866 00:48:42,770 --> 00:48:46,920 So the optimal estimate and the optimal linear estimate 867 00:48:46,920 --> 00:48:48,560 turn out to be the same.
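[Editor's note: a short sketch, not from the lecture, of the weighted-average computation just described; the function name and arguments are illustrative.]

    import numpy as np

    def weighted_average_estimate(x, noise_vars, prior_mean, prior_var):
        # Each measurement X_i = Theta + W_i gets weight 1/sigma_i^2;
        # the prior mean is treated as one extra "observation" with weight 1/sigma_0^2.
        x = np.asarray(x, dtype=float)
        w = 1.0 / np.asarray(noise_vars, dtype=float)
        numerator = prior_mean / prior_var + np.sum(w * x)
        denominator = 1.0 / prior_var + np.sum(w)
        return numerator / denominator

As noted above, noisier measurements (larger variance) receive smaller weight, and if the prior mean and all the measurements equal 1, the estimate is 1.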
868 00:48:48,560 --> 00:48:51,050 And that gives us another interpretation of linear 869 00:48:51,050 --> 00:48:52,100 estimation. 870 00:48:52,100 --> 00:48:54,660 Linear estimation is essentially the same as 871 00:48:54,660 --> 00:48:58,970 pretending that all random variables are normal. 872 00:48:58,970 --> 00:49:02,040 So that's a side point. 873 00:49:02,040 --> 00:49:04,230 Now I'd like to close with a comment. 874 00:49:04,230 --> 00:49:08,370 875 00:49:08,370 --> 00:49:11,760 You do your measurements and you estimate Theta on the 876 00:49:11,760 --> 00:49:17,040 basis of X. Suppose that instead you have a measuring 877 00:49:17,040 --> 00:49:20,970 device that measures X-cubed instead of measuring X, and 878 00:49:20,970 --> 00:49:23,350 you want to estimate Theta. 879 00:49:23,350 --> 00:49:26,760 Are you going to get a different estimate? 880 00:49:26,760 --> 00:49:31,790 Well, X and X-cubed contain the same information. 881 00:49:31,790 --> 00:49:34,730 Telling you X is the same as telling you 882 00:49:34,730 --> 00:49:36,640 the value of X-cubed. 883 00:49:36,640 --> 00:49:40,660 So the posterior distribution of Theta given X is the same 884 00:49:40,660 --> 00:49:44,160 as the posterior distribution of Theta given X-cubed. 885 00:49:44,160 --> 00:49:47,450 And so the means of these posterior distributions are 886 00:49:47,450 --> 00:49:49,390 going to be the same. 887 00:49:49,390 --> 00:49:52,850 So doing transformations of your data does not 888 00:49:52,850 --> 00:49:57,370 matter if you're doing optimal least mean squares estimation. 889 00:49:57,370 --> 00:50:00,100 On the other hand, if you restrict yourself to doing 890 00:50:00,100 --> 00:50:05,540 linear estimation, then using a linear function of X is not 891 00:50:05,540 --> 00:50:09,720 the same as using a linear function of X-cubed. 892 00:50:09,720 --> 00:50:14,720 So this is a linear estimator, but where the data are the 893 00:50:14,720 --> 00:50:19,250 X-cubes, and we have a linear function of the data. 894 00:50:19,250 --> 00:50:23,690 So this means that when you're using linear estimation you 895 00:50:23,690 --> 00:50:28,040 have some choices to make: linear in what? 896 00:50:28,040 --> 00:50:32,290 Sometimes you want to plot your data on an ordinary 897 00:50:32,290 --> 00:50:35,090 scale and try to fit a line through them. 898 00:50:35,090 --> 00:50:38,360 Sometimes you plot your data on a logarithmic scale, and 899 00:50:38,360 --> 00:50:40,480 try to fit a line through them. 900 00:50:40,480 --> 00:50:42,390 Which scale is the appropriate one? 901 00:50:42,390 --> 00:50:44,510 Here it would be a cubic scale. 902 00:50:44,510 --> 00:50:46,830 And you have to think about your particular model to 903 00:50:46,830 --> 00:50:51,180 decide which version would be the more appropriate one. 904 00:50:51,180 --> 00:50:55,830 Finally, when we have multiple data, sometimes these multiple 905 00:50:55,830 --> 00:50:59,910 data might contain the same information. 906 00:50:59,910 --> 00:51:02,800 So X is one data point, X-squared is another data 907 00:51:02,800 --> 00:51:05,610 point, X-cubed is another data point. 908 00:51:05,610 --> 00:51:08,540 The three of them contain the same information, but you can 909 00:51:08,540 --> 00:51:11,480 try to form a linear function of them. 910 00:51:11,480 --> 00:51:14,380 And then you obtain a linear estimator that has a more 911 00:51:14,380 --> 00:51:16,930 general form as a function of X.
912 00:51:16,930 --> 00:51:22,130 So if you want to estimate your Theta as a cubic function 913 00:51:22,130 --> 00:51:26,330 of X, for example, you can set up a linear estimation model 914 00:51:26,330 --> 00:51:29,480 of this particular form and find the optimal coefficients, 915 00:51:29,480 --> 00:51:32,900 the a's and the b. 916 00:51:32,900 --> 00:51:35,700 All right, so the last slide just gives you the big picture 917 00:51:35,700 --> 00:51:39,330 of what's happening in Bayesian Inference; it's for 918 00:51:39,330 --> 00:51:40,330 you to ponder. 919 00:51:40,330 --> 00:51:41,930 Basically we talked about three 920 00:51:41,930 --> 00:51:43,470 possible estimation methods. 921 00:51:43,470 --> 00:51:48,300 Maximum a posteriori estimation, mean squared error estimation, and 922 00:51:48,300 --> 00:51:51,070 linear mean squared error estimation, or linear least squares 923 00:51:51,070 --> 00:51:52,290 estimation. 924 00:51:52,290 --> 00:51:54,410 And there are a number of standard examples that you 925 00:51:54,410 --> 00:51:57,130 will be seeing over and over in the recitations, tutorials, 926 00:51:57,130 --> 00:52:00,950 homework, and so on, perhaps even on exams. 927 00:52:00,950 --> 00:52:05,630 Where we take some nice priors on some unknown parameter, we 928 00:52:05,630 --> 00:52:09,410 take some nice models for the noise or the observations, and 929 00:52:09,410 --> 00:52:11,880 then you need to work out the posterior distributions and the 930 00:52:11,880 --> 00:52:13,570 various estimates and compare them. 931 00:52:13,570 --> 00:52:15,040
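[Editor's note: a closing sketch, not from the lecture, of the cubic-in-X linear estimator mentioned above. It assumes samples of (Theta, X) are available to estimate the required means and covariances; if the true moments were known, the sample averages below would simply be replaced by them.]

    import numpy as np

    def cubic_lms_coefficients(theta_samples, x_samples):
        # Treat X, X^2, X^3 as three separate "observations" and find the best
        # estimator of the form a1*X + a2*X^2 + a3*X^3 + b in the least mean
        # squares sense: the weights solve Cov(F) a = Cov(F, Theta),
        # where F = (X, X^2, X^3).
        theta_samples = np.asarray(theta_samples, dtype=float)
        x_samples = np.asarray(x_samples, dtype=float)
        feats = np.column_stack([x_samples, x_samples**2, x_samples**3])
        cov_f = np.cov(feats, rowvar=False)
        cov_f_theta = np.array([np.cov(feats[:, j], theta_samples)[0, 1]
                                for j in range(3)])
        a = np.linalg.solve(cov_f, cov_f_theta)
        b = theta_samples.mean() - a @ feats.mean(axis=0)
        return a, b

This is the same normal-equation machinery as before; only the choice of what counts as "the data" has changed, which is exactly the design decision discussed at the end of the lecture.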