The following content is provided under a Creative Commons license. Your support will help MIT OpenCourseWare continue to offer high-quality educational resources for free. To make a donation or view additional materials from hundreds of MIT courses, visit MIT OpenCourseWare at ocw.mit.edu.

PROFESSOR: So for the last three lectures we're going to talk about classical statistics, the way statistics can be done if you don't want to assume a prior distribution on the unknown parameters. Today we're going to focus mostly on the estimation side and leave hypothesis testing for the next two lectures. There is one generic method that one can use to carry out parameter estimation, and that's the maximum likelihood method. We're going to define what it is. Then we will look at the most common estimation problem there is, which is to estimate the mean of a given distribution. And we're going to talk about confidence intervals, which refers to providing an interval around your estimate with a property of the kind that the parameter is highly likely to be inside that interval, although we will be careful about how to interpret that particular statement.

OK. So, the big framework first. The picture is almost the same as the one that we had in the case of Bayesian statistics. We have some unknown parameter, and we have a measuring device. There is some noise, some randomness, and we get an observation, X, whose distribution depends on the value of the parameter. However, the big change from the Bayesian setting is that here this parameter is just a number. It's not modeled as a random variable. It does not have a probability distribution. There's nothing random about it. It's a constant; it just happens that we don't know what that constant is. And in particular, this probability distribution here, the distribution of X, depends on Theta. But this is not a conditional distribution in the usual sense of the word.
Conditional distributions were defined when we had two random variables and we conditioned one random variable on the other, and we used the bar to separate the X from the Theta. To make the point that this is not a conditional distribution, we use a different notation: we put a semicolon here. What this is meant to say is that X has a distribution, that distribution has a certain parameter, and we don't know what that parameter is. So, for example, this might be a normal distribution with variance 1 but a mean of Theta. We don't know what Theta is, and we want to estimate it.

Now, once we have this setting, your job is to design this box, the estimator. The estimator is some data-processing box that takes the measurements and produces an estimate of the unknown parameter. The notation that's used here is as if X and Theta were one-dimensional quantities, but actually everything we say remains valid if you interpret X and Theta as vectors. So, for example, you may obtain several measurements, X1 up to Xn, and there may be several unknown parameters in the background.

Once more, we do not have, and we do not want to assume, a prior distribution on Theta. It's a constant. And if you want to think mathematically about this situation, it's as if you have many different probabilistic models: a normal with this mean, or a normal with that mean, or a normal with some other mean. These are alternative candidate probabilistic models, and we want to try to make a decision about which one is the correct model. In some cases, we have to choose between just a small number of models. For example, you have a coin with an unknown bias. The bias is either 1/2 or 3/4. You're going to flip the coin a few times, and you try to decide whether the true bias is this one or that one. So in this case, we have two specific alternative probabilistic models between which we want to distinguish. But sometimes things are a little more complicated.
For example, you have a coin, and you have one hypothesis that the coin is unbiased, while the other hypothesis is that the coin is biased. You do your experiments, and you want to come up with a decision about which of the two is true. In this case, we're not dealing with just two alternative probabilistic models. The first is a specific model for the coin, but the second actually corresponds to lots of possible alternative coin models: it includes the model where Theta is 0.6, the model where Theta is 0.7, Theta is 0.8, and so on. So we're trying to discriminate between one model and lots of alternative models. How does one go about this? Well, there are some systematic ways that one can approach problems of this kind, and we will start talking about those next time.

So today, we're going to focus on estimation problems. In estimation problems, Theta is a quantity that is a real number, a continuous parameter. We are to design this box so that what we get out of it is an estimate. Now notice that this estimate is a random variable. Even though Theta is deterministic, the estimate is random, because it's a function of the data that we observe. The data are random, and we're applying a function to the data to construct our estimate. So, since it's a function of random variables, it's a random variable itself. The distribution of Theta hat depends on the distribution of X. The distribution of X is affected by Theta. So in the end, the distribution of your estimate Theta hat will also be affected by whatever Theta happens to be.

Our general objective, when designing estimators, is to end up with an estimation error which is not too large, but we'll have to make specific what exactly we mean by that. So how do we go about this problem? One general approach is to pick a Theta under which the data that we observe, the X's, are most likely to have occurred.
So I observe X. For any given Theta, I can calculate this quantity, which tells me: under this particular Theta, the X that you observed had this probability of occurring; under that Theta, the X that you observed had that probability of occurring. You just choose the Theta which makes the data that you observed most likely.

It's interesting to compare this maximum likelihood estimate with the estimate you would have if you were in a Bayesian setting and were using maximum a posteriori probability estimation. In the Bayesian setting, what we do is, given the data, we use the prior distribution on Theta, and we calculate the posterior distribution of Theta given X. Notice that this is sort of the opposite of what we have here. This is the probability of X for a particular value of Theta, whereas that is the probability of Theta for a particular X. So it's the opposite type of conditioning. In the Bayesian setting, Theta is a random variable, so we can talk about the probability distribution of Theta.

So how do these two compare, apart from the syntactic difference that the order of the X's and Theta's is reversed? Let's write down, in full detail, what this posterior distribution of Theta is. By the Bayes rule, this conditional distribution is obtained from the prior and from the model of the measurement process that we have, and we get to this expression. So in Bayesian estimation, we want to find the most likely value of Theta, and we need to maximize this quantity over all possible Theta's. The first thing to notice is that the denominator is a constant; it does not involve Theta. So when you maximize this quantity, you don't care about the denominator. You just want to maximize the numerator. Now, here, things start to look a little more similar. And they would be exactly of the same kind if that term, the prior, were absent. The two are going to become the same if that prior is just a constant.
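[In symbols, writing the discrete case described above (densities work the same way), the posterior being maximized in MAP estimation is

$$p_{\Theta \mid X}(\theta \mid x) \;=\; \frac{p_{\Theta}(\theta)\, p_{X \mid \Theta}(x \mid \theta)}{p_X(x)},$$

and since the denominator does not involve $\theta$, MAP maximizes the numerator $p_{\Theta}(\theta)\, p_{X \mid \Theta}(x \mid \theta)$. If the prior $p_{\Theta}$ is constant, this reduces to maximizing $p_X(x;\theta)$ over $\theta$, which is exactly the maximum likelihood criterion.]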
So if that prior is a constant, then maximum likelihood estimation takes exactly the same form as Bayesian maximum a posteriori probability estimation. So you can give this particular interpretation of maximum likelihood estimation: it is essentially what you would have done if you were in a Bayesian world and you had assumed a prior on the Theta's that's uniform, with all the Theta's being equally likely.

Okay. So let's look at a simple example. Suppose that the Xi's are independent, identically distributed exponential random variables with a certain parameter Theta. So the distribution of each one of the Xi's is this particular term. Theta is a one-dimensional parameter, but we have several data points. We write down the formula for the probability of a particular X vector, given a particular value of Theta. But again, when I use the word "given" here, it's not in the conditioning sense; it's the value of the density for a particular choice of Theta. Earlier, I defined maximum likelihood estimation in terms of PMFs. That's what you would do if the X's were discrete random variables. Here, the X's are continuous random variables, so I'm using the PDF instead of the PMF. The definition generalizes to the case of continuous random variables; you use f's instead of p's, our usual recipe. So the maximum likelihood estimate is defined.

Now, since the Xi's are independent, the joint density of all the X's together is the product of the individual densities. So you look at this quantity. This is the density, or, loosely speaking, the probability, of observing a particular sequence of X's. And we ask the question: what is the value of Theta that makes the X's that we observed most likely? So we want to carry out this maximization. Now, this maximization is just a computational problem. We're going to do it by taking the logarithm of this expression.
Maximizing an expression is the same as maximizing its logarithm. The logarithm of a product is the sum of the logarithms. You get contributions from this Theta term; there are n of them, so we get an n log Theta. And then we have the sum of the logarithms of the exponential terms, which gives us minus Theta times the sum of the X's. So we need to maximize this expression with respect to Theta. The way to do this maximization is to take the derivative with respect to Theta and set it to zero. You get n over Theta equal to the sum of the X's, and then you solve for Theta. You find that the maximum likelihood estimate is this quantity, n divided by the sum of the X's. Which sort of makes sense, because this is the reciprocal of the sample mean of the X's, and for an exponential distribution we know that Theta is 1 over the mean of the distribution. So it looks like a reasonable estimate.

In any case, this is the estimate that the maximum likelihood estimation procedure tells us we should report. This formula, of course, tells you what to do if you have already observed specific numbers: you report this particular number as your estimate of Theta. If you want to describe your estimation procedure more abstractly, what you have constructed is an estimator, which is a box that takes in the random variables, capital X1 up to capital Xn, and produces your estimate, which is also a random variable, because it's a function of these random variables; it is denoted by an uppercase Theta hat to indicate that it is now a random variable. So this is an equality about numbers, whereas that is a description of the general procedure, an equality between two random variables. And this gives you the more abstract view of what we're doing here.

All right. So what can we tell about our estimate? Is it good or is it bad?
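[As an illustration, here is a minimal sketch of that computation in Python; the variable names and the use of NumPy are my own, not from the lecture:

```python
import numpy as np

# Minimal sketch: the ML estimate for i.i.d. Exponential(theta) samples,
# where the density is f(x; theta) = theta * exp(-theta * x).
rng = np.random.default_rng(0)
true_theta = 2.0
x = rng.exponential(scale=1.0 / true_theta, size=1000)  # mean of the distribution is 1/theta

# The log-likelihood is n*log(theta) - theta*sum(x); setting its derivative
# to zero gives theta_hat = n / sum(x), the reciprocal of the sample mean.
theta_hat = len(x) / x.sum()
print(theta_hat)  # should be close to 2.0
```
]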
So we should look at this particular random variable and talk about its statistical properties. What we would like is for this random variable to be close to the true value of Theta, with high probability, no matter what Theta is, since we don't know what Theta is. Let's make the properties that we want a little more specific.

So we cook up the estimator somehow. This estimator corresponds, again, to a box that takes data in, the capital X's, and produces an estimate, Theta hat. This estimate is random. Sometimes it will be above the true value of Theta; sometimes it will be below. Ideally, we would like it not to have a systematic error on the positive side or the negative side. So a reasonable wish to have, for a good estimator, is that, on average, it gives you the correct value.

Now, let's be a little more specific about what that expectation is. This is an expectation with respect to the probability distribution of Theta hat. The probability distribution of Theta hat is affected by the probability distribution of the X's, because Theta hat is a function of the X's. And the probability distribution of the X's is affected by the true value of Theta. So depending on which value is the true value of Theta, this is going to be a different expectation. If you were to write this expectation out in more detail, it would look something like this: you write down the probability distribution of Theta hat, which is going to be some function that depends on the true Theta, and then you integrate with respect to Theta hat. What's the point here? Again, Theta hat is a function of the X's, so the density of Theta hat is affected by the density of the X's. The density of the X's is affected by the true value of Theta. So the distribution of Theta hat is affected by the value of Theta.
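[Written out, this is just the expression described in words above:

$$\mathbb{E}_{\theta}\big[\hat{\Theta}\big] \;=\; \int \hat{\theta}\, f_{\hat{\Theta}}\big(\hat{\theta};\theta\big)\, d\hat{\theta},$$

and the "no systematic error" requirement, unbiasedness, asks that $\mathbb{E}_{\theta}[\hat{\Theta}] = \theta$ for every possible value of $\theta$.]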
Another way to put it is, as I mentioned a few minutes ago, that in this business it's as if we are considering different possible probabilistic models, one probabilistic model for each choice of Theta, and we're trying to guess which one of these probabilistic models is the true one. One way of emphasizing the fact that this expression depends on the true Theta is to put a little subscript here: the expectation under the particular value of the parameter Theta. So depending on what value the true parameter Theta takes, this expectation will have a different value. And what we would like is that, no matter what the true value is, our estimate will not have a bias to the positive or the negative side. So this is a property that's desirable. Is it always going to be true? Not necessarily; it depends on what estimator we construct.

Is it true for our exponential example? Unfortunately not. The estimate that we have in the exponential example turns out to be biased. One extreme way of seeing this is to consider the case where our sample size is 1. We're trying to estimate Theta, and the estimator from the previous slide, in that case, is just 1/X1. Now, X1 has a fair amount of density in the vicinity of 0, which means that 1/X1 has a significant probability of being very large. And if you do the calculation, this ultimately makes the expected value of 1/X1 infinite. Now, infinity is definitely not the correct value, so our estimate is biased upwards, and it's actually biased a lot upwards. So that's how things are: maximum likelihood estimates, in general, will be biased. But under some conditions, they will turn out to be asymptotically unbiased. That is, as you get more and more data, as your X vector gets longer and longer, with independent data, the expected value of your estimator is going to get closer and closer to the true value. So you do have some nice asymptotic properties, but we're not going to prove anything like this.
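[A quick Monte Carlo sketch of this upward bias, an illustration of my own: for Exponential(theta) data one can compute that the exact mean of the estimator is n*theta/(n-1) for n >= 2, so the bias is large for small n and fades as n grows.

```python
import numpy as np

# Estimate E[theta_hat] by simulation for a small sample size n,
# where theta_hat = n / sum(X) and the X_i are Exponential(true_theta).
rng = np.random.default_rng(1)
true_theta, n, trials = 2.0, 5, 200_000

samples = rng.exponential(scale=1.0 / true_theta, size=(trials, n))
theta_hat = n / samples.sum(axis=1)

# With n = 5 the average lands near n/(n-1) * true_theta = 2.5, not 2.0:
print(theta_hat.mean())  # biased upward; the bias shrinks as n increases
```
]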
Speaking of asymptotic properties, in general what we would like to have is that, as you collect more and more data, you get the correct answer, in some sense. And the sense that we're going to use here is the limiting sense of convergence in probability, since this is the only notion of convergence of random variables that we have in our hands. This is similar to what we had in the pollster problem, for example: if we had a bigger and bigger sample size, we could be more and more confident that the estimate we obtained is close to the unknown true parameter of the distribution. So this is a desirable property. If you have an infinitely large amount of data, you should be able to estimate an unknown parameter more or less exactly. So this is a desirable property of estimators. It turns out that maximum likelihood estimation, given independent data, does have this property under mild conditions. So maximum likelihood estimation, in this respect, is a good approach.

So let's see: do we have this consistency property in our exponential example? In our exponential example, we used this quantity to estimate the unknown parameter Theta. What properties does this quantity have as n goes to infinity? Well, this quantity is the reciprocal of that quantity up here, which is the sample mean. We know from the weak law of large numbers that the sample mean converges to the expectation; so this property comes from the weak law of large numbers. In probability, this quantity converges to the expected value, which, for exponential distributions, is 1/Theta. Now, if something converges to something, then the reciprocal should converge to the reciprocal. That's a property that's certainly correct for numbers. But we're not talking about convergence of numbers; we're talking about convergence in probability, which is a more complicated notion. Fortunately, it turns out that the same thing is true when we deal with convergence in probability.
One can show, although we will not bother doing this, that indeed the reciprocal of this, which is our estimate, converges in probability to the reciprocal of that, and that reciprocal is the true parameter Theta. So for this particular exponential example, we do have the desirable property that, as the number of data points becomes larger and larger, the estimate that we have constructed gets closer and closer to the true parameter value. And this is true no matter what Theta is: no matter what the true parameter Theta is, we're going to get close to it as we collect more data.

Okay. So these are two rough, qualitative properties that would be nice to have. If you want to get a little more quantitative, you can start looking at the mean squared error that your estimator gives. Now, once more, the comment I was making up there applies: namely, that this expectation here is an expectation with respect to the probability distribution of Theta hat that corresponds to a particular value of little theta. So fix a little theta, write down this expression, look at the probability distribution of Theta hat under that little theta, and do the calculation. You're going to get some quantity that depends on the little theta. And so all quantities in this equality here should be interpreted as quantities under that particular value of little theta. If you wanted to make this more explicit, you could start throwing little subscripts everywhere in those expressions.

So let's see what those expressions tell us. The expected value of the square of a random variable is always equal to the variance of that random variable plus the square of its expectation. This equality here is just our familiar formula, that the expected value of X squared is the variance of X plus the square of the expected value of X. So we apply this formula with X equal to Theta hat minus theta.
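[Spelled out, this is the decomposition the lecture is describing. Writing $b(\theta) = \mathbb{E}_\theta[\hat\Theta] - \theta$ for the bias:

$$\mathbb{E}_\theta\big[(\hat\Theta - \theta)^2\big] \;=\; \operatorname{var}_\theta\big(\hat\Theta - \theta\big) + \big(\mathbb{E}_\theta[\hat\Theta - \theta]\big)^2 \;=\; \operatorname{var}_\theta\big(\hat\Theta\big) + b(\theta)^2,$$

where the first step is the familiar formula $\mathbb{E}[Y^2] = \operatorname{var}(Y) + (\mathbb{E}[Y])^2$, and the second uses the fact, explained next, that subtracting the constant $\theta$ does not change the variance.]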
Now, remember that, in this classical setting, theta is just a constant. We have fixed theta, and we want to calculate the variance of this quantity under that particular theta. When you add or subtract a constant from a random variable, the variance doesn't change, so this is the same as the variance of our estimator. And what we've got here is the bias of our estimator: it tells us, on average, whether we fall above or below. And the bias enters as b squared. If we have an unbiased estimator, the bias term will be 0.

So, ideally, we want Theta hat to be very close to theta. And since theta is a constant, if that happens, the variance of Theta hat will be very small: if Theta hat has a distribution that's concentrated around little theta, then Theta hat has a small variance. So this is one desire that we have, a small variance. But we also want to have a small bias at the same time. So the general form of the mean squared error has two contributions: one is the variance of our estimator; the other is the bias. And one usually wants to design an estimator that simultaneously keeps both of these terms small.

Here's an estimation method that would do very well with respect to the variance term, but badly with respect to the bias term. Suppose that my distribution is, let's say, normal with an unknown mean Theta and variance 1, and I use as my estimator something very dumb: I always produce an estimate that says the value is 100. So I'm just ignoring the data and reporting 100. What does this do? The variance of my estimator is 0; there's no randomness in the estimate that I report. But the bias is going to be pretty bad. The bias is going to be Theta hat, which is 100, minus the true value of Theta. And for some Theta's, my bias is going to be horrible. If my true Theta happens to be 0, my bias squared is a huge term, and I get a large error.
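[A minimal simulation of this trade-off; the "dumb" constant estimator and the Normal(theta, 1) setup are exactly the example above, while the names and numbers are illustrative:

```python
import numpy as np

# Compare two estimators of the mean theta of a Normal(theta, 1) sample:
# the sample mean (small bias, some variance) versus the constant 100
# (zero variance, potentially huge bias).
rng = np.random.default_rng(2)
true_theta, n, trials = 0.0, 20, 100_000

data = rng.normal(loc=true_theta, scale=1.0, size=(trials, n))
sample_mean = data.mean(axis=1)
constant = np.full(trials, 100.0)

for name, est in [("sample mean", sample_mean), ("always 100", constant)]:
    mse = np.mean((est - true_theta) ** 2)
    print(name, "variance:", est.var(),
          "bias^2:", (est.mean() - true_theta) ** 2, "MSE:", mse)
# The constant estimator has variance 0 but bias^2 = 10000 when theta = 0.
```
]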
So what's the moral of this example? There are ways of making the variance very small, but, in those cases, you pay a price in the bias. So you want to do something a little more delicate, where you try to keep both terms small at the same time. These types of considerations become important when you start trying to design sophisticated estimators for more complicated problems. But we will not do this in this class; this belongs to further classes on statistics and inference. For this class, for parameter estimation, we will basically stick to two very simple methods. One is the maximum likelihood method we've just discussed. And the other method is what you would do if you were still in high school and didn't know any probability: you get data, these data come from some distribution with an unknown mean, and you want to estimate that unknown mean. What would you do? You would just take those data and average them out.

So let's make this a little more specific. We have X's that come from a given distribution. We know the general form of the distribution, perhaps. We may know the variance of that distribution, or perhaps we don't know it. But we do not know the mean, and we want to estimate the mean of that distribution. Now, we can represent this situation in a different form: each Xi is equal to Theta, which is the mean, plus a zero-mean random variable that you can think of as noise. So this corresponds to the usual situation you would have in a lab, where you go and try to measure an unknown quantity. You get lots of measurements, but each time you measure, your measurement has some extra noise in it. And you want to get rid of that noise. The way to try to get rid of the measurement noise is to collect lots of data and average them out. This is the sample mean.
And this is a very, very reasonable way of trying to estimate the unknown mean of the X's. So this is the sample mean. It's a reasonable, plausible, and, in general, pretty good estimator of the unknown mean of a certain distribution. We can apply this estimator without really knowing a lot about the distribution of the X's. Actually, we don't need to know anything about the distribution. We can still apply it, because the variance, for example, does not show up here; we don't need to know the variance to calculate that quantity.

Does this estimator have good properties? Yes, it does. What's the expected value of the sample mean? It's the expectation of this sum divided by n. The expected value of each one of the X's is Theta, so the expected value of the sample mean is just Theta itself. So our estimator is unbiased: no matter what Theta is, our estimator does not have a systematic error in either direction. Furthermore, the weak law of large numbers tells us that this quantity converges to the true parameter in probability, so it's a consistent estimator. This is good. And suppose you want to calculate the mean squared error corresponding to this estimator. Remember how we defined the mean squared error? It's this quantity. Then it's a calculation that we have done a fair number of times by now: the mean squared error is the variance of the distribution of the X's divided by n. So as we get more and more data, the mean squared error goes down to 0.

In some examples, it turns out that the sample mean is also the same as the maximum likelihood estimate. For example, if the X's are coming from a normal distribution, you can write down the likelihood, do the maximization with respect to Theta, and you'll find that the maximum likelihood estimate is the same as the sample mean. In other cases, the sample mean will be different from the maximum likelihood estimate.
When the two differ, you have a choice about which one of the two you would use. Probably, in most reasonable situations, you would just use the sample mean, because it's simple, easy to compute, and has nice properties.

All right. So you go to your boss, and you report and say: OK, I did all my experiments in the lab, and the average value that I got is a certain number, 2.37. Is that informative to your boss? Well, your boss would like to know how much they can trust this number, 2.37. I know that the true value is not going to be exactly that, but how close should it be? So: give me a range of what you think are possible values of Theta.

So the situation is like this. Suppose that we observe X's that are coming from a certain distribution, and we're trying to estimate the mean. We get our data; maybe our data look something like this. You calculate the sample mean, and let's suppose that the sample mean is a number, for concreteness taken to be 2.37. But you want to convey something to your boss about how spread out these data were. So the boss asks you to give him or her some kind of interval in which Theta, the true parameter, might lie. So the boss asked you for an interval, and what you do is end up reporting an interval, somehow using the data that you have seen to construct this interval. And you report to your boss the endpoints of this interval. Let's give names to these endpoints: Theta hat n-minus and Theta hat n-plus. The subscript n here just plays the role of keeping track of how many data points we're using. So what you report to your boss is this interval as well.

Are these Theta's here, the endpoints of the interval, lowercase or uppercase? What should they be? Well, you construct these intervals after you see your data. You take the data into account to construct your interval.
So these endpoints definitely should depend on the data, and therefore they are random variables. Same thing with your estimator: in general, it's going to be a random variable, although, when you go and report numbers to your boss, you give the specific realizations of the random variables, given the data that you got.

So instead of having just a single box that produces estimates, as in our previous picture, where the estimator takes X's and produces Theta hats, now our box will also be producing a Theta hat minus and a Theta hat plus. It's going to produce an interval as well. The X's are random; therefore these quantities are random. Once you go and do the experiment and obtain your data, your data will be some lowercase x's, specific numbers, and then your estimate and interval endpoints also become lowercase.

What would we like this interval to do? We would like it to be highly likely to contain the true value of the parameter. So we might impose some specs of the following kind. I pick a number, alpha; think of alpha as a probability of a large error. A typical value of alpha might be 0.05, in which case this number here, 1 minus alpha, is 0.95. And you're given specs that say something like this: I would like, with probability at least 0.95, the true parameter to lie inside the confidence interval.

Now let's try to interpret this statement. Suppose that you did the experiment, and that you ended up reporting to your boss a confidence interval from 1.97 to 2.56. That's what you report to your boss. And suppose that the confidence interval has this property. Can you go to your boss and say: with probability 95%, the true value of Theta is between these two numbers? Is that a meaningful statement? So the tentative statement is: with probability 95%, the true value of Theta is between 1.97 and 2.56. Well, what is random in that statement?
There's nothing random. The true value of theta is a constant. 1.97 is a number; 2.56 is a number. So it doesn't make any sense to talk about the probability that theta is in this interval. Either theta happens to be in that interval, or it happens not to be. But there are no probabilities associated with this, because theta is not random. Syntactically, you can see this because theta here is lowercase.

So what kind of probabilities are we talking about here? Where is the randomness? Well, the random thing is the interval; it's not theta. So the statement that is being made here is that the interval being constructed by our procedure should have the property that, with probability 95%, it's going to fall on top of the true value of theta.

So the right way of interpreting what the 95% confidence interval is, is something like the following. We have the true value of theta, which we don't know. I get data. Based on the data, I construct a confidence interval. I got lucky, and the true value of theta is in there. The next day, I do the same experiment, take my data, construct a confidence interval, and I get this confidence interval: lucky once more. The next day I get data, I use my data to come up with an estimate of theta and a confidence interval; that day I was unlucky, and I got a confidence interval out there. What the requirement says is that on 95% of the days where we use this particular procedure for constructing confidence intervals, we will be lucky, and we will capture the correct value of theta with our confidence interval. So it's a statement about the distribution of these random confidence intervals, about how likely they are to fall on top of the true theta, as opposed to how likely they are to fall outside. So it's a statement about probabilities associated with the confidence interval.
These are not probabilities about theta, because theta itself is not random. So this is what the confidence interval is, in general, and how we interpret it.

How do we construct a 95% confidence interval? Let's go through this exercise in a particular example. The calculations are exactly the same as the ones you did when we talked about laws of large numbers and the central limit theorem; there's nothing new computationally, but it's perhaps new in terms of the language that we use and the interpretation. So we got our sample mean from some distribution, and we would like to calculate a 95% confidence interval.

We know from the normal tables that the standard normal has 2.5% probability on the tail to the right of 1.96. Yes, by this time, the number 1.96 should be pretty familiar. So if this tail probability here is 2.5%, this number here is 1.96. Now look at this random variable here. This is the difference of the sample mean from the true mean, normalized by the usual normalizing factor. By the central limit theorem, this is approximately normal, so it has probability 0.95 of being less than 1.96 in magnitude. Now take this event and rewrite it. This is the event that Theta hat minus theta is bigger than this number and smaller than that number; this event here is equivalent to that event there. And so this suggests a way of constructing our 95% confidence interval. I'm going to report the interval which has this as its lower end and this as its upper end. In other words, at the end of the experiment, we report the sample mean, which is our estimate, and we also report an interval around the sample mean. And this is our 95% confidence interval.
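[A minimal sketch of this recipe, plus a Monte Carlo check of the interpretation from a moment ago, that the random interval covers the true theta on about 95% of "days"; the setup and names are illustrative:

```python
import numpy as np

# 95% CI for the mean: sample_mean +/- 1.96 * sigma / sqrt(n).
rng = np.random.default_rng(4)
theta, sigma, n = 2.37, 1.0, 100

def confidence_interval(x):
    m = x.mean()
    half = 1.96 * sigma / np.sqrt(len(x))
    return m - half, m + half

# One experiment: the interval you would report to the boss.
x = rng.normal(loc=theta, scale=sigma, size=n)
print(confidence_interval(x))

# Many repeated experiments: the fraction of intervals covering theta.
trials, covered = 20_000, 0
for _ in range(trials):
    lo, hi = confidence_interval(rng.normal(loc=theta, scale=sigma, size=n))
    covered += (lo <= theta <= hi)
print(covered / trials)  # close to 0.95
```
]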
783 00:40:26,050 --> 00:40:28,950 In some sense, we're more certain that we're doing a 784 00:40:28,950 --> 00:40:32,390 good estimation job, so we can have a small interval and 785 00:40:32,390 --> 00:40:36,000 still be quite confident that our interval captures the true 786 00:40:36,000 --> 00:40:37,520 value of the parameter. 787 00:40:37,520 --> 00:40:41,890 Also, if our data have very little noise, that is, when you have 788 00:40:41,890 --> 00:40:45,060 more accurate measurements, you're more confident that 789 00:40:45,060 --> 00:40:47,220 your estimate is pretty good. 790 00:40:47,220 --> 00:40:51,120 And that results in a smaller confidence interval, a smaller 791 00:40:51,120 --> 00:40:52,610 length of the confidence interval. 792 00:40:52,610 --> 00:40:56,040 And still you have 95% probability of capturing the 793 00:40:56,040 --> 00:40:57,650 true value of theta. 794 00:40:57,650 --> 00:41:01,660 So we did this exercise by taking 95% confidence 795 00:41:01,660 --> 00:41:04,010 intervals and the corresponding value from the 796 00:41:04,010 --> 00:41:06,670 normal tables, which is 1.96. 797 00:41:06,670 --> 00:41:11,390 Of course, you can do it more generally, if you set your 798 00:41:11,390 --> 00:41:13,730 alpha to be some other number. 799 00:41:13,730 --> 00:41:16,590 Again, you look at the normal tables. 800 00:41:16,590 --> 00:41:20,460 And you find the value here, so that the tail has 801 00:41:20,460 --> 00:41:22,640 probability alpha over 2. 802 00:41:22,640 --> 00:41:26,790 And instead of using this 1.96, you use whatever number 803 00:41:26,790 --> 00:41:31,380 you get from the normal tables. 804 00:41:31,380 --> 00:41:33,520 And this tells you how to construct 805 00:41:33,520 --> 00:41:36,680 a confidence interval. 806 00:41:36,680 --> 00:41:42,060 Well, to be exact, this is not necessarily a 807 00:41:42,060 --> 00:41:44,640 95% confidence interval. 808 00:41:44,640 --> 00:41:47,540 It's approximately a 95% confidence interval. 809 00:41:47,540 --> 00:41:48,950 Why is this? 810 00:41:48,950 --> 00:41:51,060 Because we've done an approximation. 811 00:41:51,060 --> 00:41:53,890 We have used the central limit theorem. 812 00:41:53,890 --> 00:41:59,990 So it might turn out to be a 95.5% confidence interval 813 00:41:59,990 --> 00:42:03,220 instead of 95%, because our calculations are 814 00:42:03,220 --> 00:42:04,740 not entirely accurate. 815 00:42:04,740 --> 00:42:08,230 But for reasonable values of n, using the central limit 816 00:42:08,230 --> 00:42:10,190 theorem is a good approximation. 817 00:42:10,190 --> 00:42:13,330 And that's what people almost always do. 818 00:42:13,330 --> 00:42:17,350 So just take the value from the normal tables. 819 00:42:17,350 --> 00:42:18,600 Okay, except for one catch. 820 00:42:22,830 --> 00:42:24,590 I used the data. 821 00:42:24,590 --> 00:42:26,440 I obtained my estimate. 822 00:42:26,440 --> 00:42:29,830 And I want to go to my boss and report this theta hat minus 823 00:42:29,830 --> 00:42:33,010 and this theta hat plus, the two endpoints of the confidence interval. 824 00:42:33,010 --> 00:42:35,720 What's the difficulty? 825 00:42:35,720 --> 00:42:37,540 I know what n is. 826 00:42:37,540 --> 00:42:40,790 But I don't know what sigma is, in general. 827 00:42:40,790 --> 00:42:44,750 So if I don't know sigma, what am I going to do? 828 00:42:44,750 --> 00:42:48,980 Here, there are a few options for what you can do.
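Before going through those options, here is how the general-alpha recipe from a moment ago might look, as a sketch. The standard library's NormalDist.inv_cdf (Python 3.8+) stands in for the normal tables, and sigma is still taken as an input, which is precisely the catch just raised.

```python
import math
from statistics import NormalDist

def confidence_interval(data, sigma, alpha=0.05):
    """Approximate (1 - alpha) confidence interval for the mean,
    via the CLT; inv_cdf plays the role of the normal tables."""
    n = len(data)
    theta_hat = sum(data) / n
    z = NormalDist().inv_cdf(1 - alpha / 2)   # e.g. 1.96 for alpha = 0.05
    half_width = z * sigma / math.sqrt(n)
    return theta_hat - half_width, theta_hat + half_width
```

The options that follow are about what to feed in for that sigma argument.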
829 00:42:48,980 --> 00:42:52,910 And the first option is familiar from what we did when 830 00:42:52,910 --> 00:42:55,020 we talked about the pollster problem. 831 00:42:55,020 --> 00:42:58,480 We don't know what sigma is, but maybe we have an upper 832 00:42:58,480 --> 00:43:00,030 bound on sigma. 833 00:43:00,030 --> 00:43:03,540 For example, if the Xi's are Bernoulli random variables, we 834 00:43:03,540 --> 00:43:06,910 have seen that the standard deviation is at most 1/2. 835 00:43:06,910 --> 00:43:10,220 So use the most conservative value for sigma. 836 00:43:10,220 --> 00:43:13,520 Using the most conservative value means that you take 837 00:43:13,520 --> 00:43:17,890 bigger confidence intervals than necessary. 838 00:43:17,890 --> 00:43:20,780 So that's one option. 839 00:43:20,780 --> 00:43:25,480 Another option is to try to estimate sigma from the data. 840 00:43:25,480 --> 00:43:27,630 How do you do this estimation? 841 00:43:27,630 --> 00:43:31,140 In special cases, for special types of distributions, you 842 00:43:31,140 --> 00:43:34,180 can think of heuristic ways of doing this estimation. 843 00:43:34,180 --> 00:43:38,390 For example, in the case of Bernoulli random variables, we 844 00:43:38,390 --> 00:43:42,420 know that the true value of sigma, the standard deviation 845 00:43:42,420 --> 00:43:45,120 of a Bernoulli random variable, is the square root 846 00:43:45,120 --> 00:43:47,670 of theta times (1 minus theta), where theta is 847 00:43:47,670 --> 00:43:50,290 the mean of the Bernoulli. 848 00:43:50,290 --> 00:43:51,900 Try to use this formula. 849 00:43:51,900 --> 00:43:54,140 But theta is the thing we're trying to estimate in the 850 00:43:54,140 --> 00:43:54,760 first place. 851 00:43:54,760 --> 00:43:55,880 We don't know it. 852 00:43:55,880 --> 00:43:57,150 What do we do? 853 00:43:57,150 --> 00:44:00,850 Well, we have an estimate for theta: the estimate produced 854 00:44:00,850 --> 00:44:04,195 by our estimation procedure, the sample mean. 855 00:44:04,195 --> 00:44:05,670 So I obtain my data. 856 00:44:05,670 --> 00:44:06,540 I produce the 857 00:44:06,540 --> 00:44:09,030 estimate theta hat. 858 00:44:09,030 --> 00:44:10,740 It's an estimate of the mean. 859 00:44:10,740 --> 00:44:14,770 Use that estimate in this formula to come up with an 860 00:44:14,770 --> 00:44:17,290 estimate of my standard deviation. 861 00:44:17,290 --> 00:44:20,210 And then use that standard deviation in the construction 862 00:44:20,210 --> 00:44:22,510 of the confidence interval, pretending 863 00:44:22,510 --> 00:44:24,180 that this is correct. 864 00:44:24,180 --> 00:44:29,050 Well, if the number of data points is large, then we know, from 865 00:44:29,050 --> 00:44:31,870 the law of large numbers, that theta hat is a pretty good 866 00:44:31,870 --> 00:44:33,130 estimate of theta. 867 00:44:33,130 --> 00:44:36,670 So sigma hat is going to be a pretty good estimate of sigma. 868 00:44:36,670 --> 00:44:42,380 So we're not making large errors by using this approach. 869 00:44:42,380 --> 00:44:47,980 So in this scenario here, things were simple, because we 870 00:44:47,980 --> 00:44:49,890 had an analytical formula. 871 00:44:49,890 --> 00:44:52,210 Sigma was determined by theta. 872 00:44:52,210 --> 00:44:54,420 So we could come up with a quick and 873 00:44:54,420 --> 00:44:57,340 dirty estimate of sigma. 874 00:44:57,340 --> 00:45:00,940 In general, if you do not have any nice formulas of this 875 00:45:00,940 --> 00:45:03,000 kind, what could you do?
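Before turning to the general case, both Bernoulli options just described fit in one sketch: the conservative bound sigma at most 1/2 from the pollster problem, and the quick-and-dirty plug-in estimate sqrt(theta hat (1 - theta hat)). The function name and flag are ours, for illustration.

```python
import math

def bernoulli_ci_95(data, conservative=False):
    """Approximate 95% CI for a Bernoulli mean; data is a list of 0s and 1s."""
    n = len(data)
    theta_hat = sum(data) / n
    if conservative:
        sigma_hat = 0.5   # option 1: the worst-case upper bound on sigma
    else:
        # option 2: plug the estimate theta hat into sqrt(theta(1 - theta))
        sigma_hat = math.sqrt(theta_hat * (1 - theta_hat))
    half_width = 1.96 * sigma_hat / math.sqrt(n)
    return theta_hat - half_width, theta_hat + half_width
```

With conservative=True the interval is guaranteed to be wide enough; with the plug-in estimate it is tighter, and for large n the law of large numbers makes sigma hat close to the true sigma.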
876 00:45:03,000 --> 00:45:04,920 Well, you still need to come up with an 877 00:45:04,920 --> 00:45:07,110 estimate of sigma somehow. 878 00:45:07,110 --> 00:45:08,950 What is a generic method for 879 00:45:08,950 --> 00:45:11,300 estimating a standard deviation? 880 00:45:11,300 --> 00:45:14,440 Equivalently, what could be a generic method for estimating 881 00:45:14,440 --> 00:45:16,920 a variance? 882 00:45:16,920 --> 00:45:19,360 Well, the variance is the expected value 883 00:45:19,360 --> 00:45:20,940 of some random variable. 884 00:45:20,940 --> 00:45:25,610 The variance is the mean of the random variable inside 885 00:45:25,610 --> 00:45:28,200 those brackets, (Xi minus theta) squared. 886 00:45:28,200 --> 00:45:33,160 How does one estimate the mean of some random variable? 887 00:45:33,160 --> 00:45:36,140 You obtain lots of measurements of that random 888 00:45:36,140 --> 00:45:40,210 variable and average them out. 889 00:45:40,210 --> 00:45:45,170 So this would be a reasonable way of estimating the variance 890 00:45:45,170 --> 00:45:47,310 of a distribution. 891 00:45:47,310 --> 00:45:50,590 And again, the weak law of large numbers tells us that 892 00:45:50,590 --> 00:45:55,370 this average converges to the expected value of this, which 893 00:45:55,370 --> 00:45:58,590 is just the variance of the distribution. 894 00:45:58,590 --> 00:46:01,700 So we got a nice and consistent way 895 00:46:01,700 --> 00:46:03,940 of estimating variances. 896 00:46:03,940 --> 00:46:08,100 But now, we seem to be getting into a vicious circle here, 897 00:46:08,100 --> 00:46:10,580 because to estimate the variance, we 898 00:46:10,580 --> 00:46:12,910 need to know the mean. 899 00:46:12,910 --> 00:46:16,075 And the mean is something we're trying to estimate in 900 00:46:16,075 --> 00:46:18,250 the first place. 901 00:46:18,250 --> 00:46:18,400 Okay. 902 00:46:18,400 --> 00:46:20,880 But we do have an estimate of the mean. 903 00:46:20,880 --> 00:46:24,640 So a reasonable approximation, once more, is to plug in, 904 00:46:24,640 --> 00:46:27,620 here, since we don't know the mean, the 905 00:46:27,620 --> 00:46:29,270 estimate of the mean. 906 00:46:29,270 --> 00:46:32,370 And so you get that expression, but with a theta 907 00:46:32,370 --> 00:46:35,130 hat instead of theta itself. 908 00:46:35,130 --> 00:46:37,980 And this is another reasonable way of 909 00:46:37,980 --> 00:46:40,180 estimating the variance. 910 00:46:40,180 --> 00:46:42,940 It does have the same consistency properties. 911 00:46:42,940 --> 00:46:44,050 Why? 912 00:46:44,050 --> 00:46:51,100 When n is large, this is going to behave the same as that, 913 00:46:51,100 --> 00:46:53,640 because theta hat converges to theta. 914 00:46:53,640 --> 00:46:57,890 And when n is large, this is approximately the same as 915 00:46:57,890 --> 00:46:58,820 sigma squared. 916 00:46:58,820 --> 00:47:02,220 So for large n, this quantity also converges to 917 00:47:02,220 --> 00:47:03,350 sigma squared. 918 00:47:03,350 --> 00:47:05,500 And we have a consistent estimate of 919 00:47:05,500 --> 00:47:07,000 the variance as well. 920 00:47:07,000 --> 00:47:09,490 And we can take that consistent estimate and use it 921 00:47:09,490 --> 00:47:12,360 back in the construction of the confidence interval. 922 00:47:12,360 --> 00:47:16,310 One little detail: here, we're dividing by n. 923 00:47:16,310 --> 00:47:19,590 Here, we're dividing by n-1. 924 00:47:19,590 --> 00:47:21,050 Why do we do this?
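While that question is on the table, here is a sketch of the two averages just described: one usable only if the true mean were known, and the plug-in version one actually computes in practice, with the n - 1 divisor that is about to be explained. The function names are ours, for illustration.

```python
import math

def variance_known_mean(data, theta):
    """Average of (Xi - theta)^2; usable only if the true mean theta is known."""
    return sum((x - theta) ** 2 for x in data) / len(data)

def sample_variance(data):
    """Plug in the sample mean for theta, and divide by n - 1."""
    n = len(data)
    theta_hat = sum(data) / n
    return sum((x - theta_hat) ** 2 for x in data) / (n - 1)

def sigma_hat(data):
    """Standard deviation estimate to feed into the confidence interval."""
    return math.sqrt(sample_variance(data))
```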
925 00:47:21,050 --> 00:47:24,630 Well, it turns out that's what you need to do for this 926 00:47:24,630 --> 00:47:28,590 estimate to be an unbiased estimate of the variance. 927 00:47:28,590 --> 00:47:32,080 One has to do a little bit of a calculation, and one finds 928 00:47:32,080 --> 00:47:36,650 that that's the factor that you need to have here in order 929 00:47:36,650 --> 00:47:37,770 to be unbiased. 930 00:47:37,770 --> 00:47:42,280 Of course, if you get 100 data points, whether you divide by 931 00:47:42,280 --> 00:47:46,070 100 or by 99, it's going to make only a tiny 932 00:47:46,070 --> 00:47:48,620 difference in your estimate of the variance. 933 00:47:48,620 --> 00:47:50,740 So it's going to make only a tiny difference in your 934 00:47:50,740 --> 00:47:52,670 estimate of the standard deviation. 935 00:47:52,670 --> 00:47:54,180 It's not a big deal. 936 00:47:54,180 --> 00:47:56,550 And it doesn't really matter. 937 00:47:56,550 --> 00:48:00,720 But if you want to show off your deeper knowledge of 938 00:48:00,720 --> 00:48:06,810 statistics, you throw in the 1 over n-1 factor in there. 939 00:48:06,810 --> 00:48:11,350 So now one basically needs to put together this whole story of 940 00:48:11,350 --> 00:48:15,260 how you estimate the variance. 941 00:48:15,260 --> 00:48:18,370 You first compute the sample mean. 942 00:48:18,370 --> 00:48:21,010 And then you do some extra work to come up with a 943 00:48:21,010 --> 00:48:23,020 reasonable estimate of the variance and 944 00:48:23,020 --> 00:48:24,640 the standard deviation. 945 00:48:24,640 --> 00:48:27,510 And then you use your estimate of the standard 946 00:48:27,510 --> 00:48:32,960 deviation to come up with a confidence interval, which has 947 00:48:32,960 --> 00:48:35,150 these two endpoints. 948 00:48:35,150 --> 00:48:39,130 In doing this procedure, there's basically a number of 949 00:48:39,130 --> 00:48:41,810 approximations that are involved. 950 00:48:41,810 --> 00:48:43,570 There are two types of approximations. 951 00:48:43,570 --> 00:48:46,170 One approximation is that we're pretending that the 952 00:48:46,170 --> 00:48:48,720 sample mean has a normal distribution. 953 00:48:48,720 --> 00:48:51,080 That's something we're justified in doing, by the 954 00:48:51,080 --> 00:48:52,470 central limit theorem. 955 00:48:52,470 --> 00:48:53,550 But it's not exact. 956 00:48:53,550 --> 00:48:54,910 It's an approximation. 957 00:48:54,910 --> 00:48:58,080 And the second approximation that comes in is that, instead 958 00:48:58,080 --> 00:49:01,260 of using the correct standard deviation, in general, you 959 00:49:01,260 --> 00:49:04,850 will have to use some approximation of 960 00:49:04,850 --> 00:49:06,100 the standard deviation. 961 00:49:08,390 --> 00:49:11,200 Okay, so you will be getting a little bit of practice with 962 00:49:11,200 --> 00:49:14,550 these concepts in recitation and tutorial. 963 00:49:14,550 --> 00:49:18,070 And we will move on to new topics next week. 964 00:49:18,070 --> 00:49:20,930 But the material that's going to be covered in the final 965 00:49:20,930 --> 00:49:23,570 exam is only up to this point. 966 00:49:23,570 --> 00:49:28,220 So next week is just general education. 967 00:49:28,220 --> 00:49:30,550 Hopefully useful, but it's not in the exam.
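As a closing footnote on that 1 over n - 1 factor from earlier: a quick simulation sketch, with the true distribution assumed standard normal so that sigma squared equals 1, makes the bias visible; n is kept deliberately tiny so the effect shows.

```python
import random

# True distribution assumed standard normal, so sigma squared = 1.
n, trials = 5, 200_000
divide_by_n = divide_by_n_minus_1 = 0.0
for _ in range(trials):
    data = [random.gauss(0.0, 1.0) for _ in range(n)]
    mean = sum(data) / n
    ss = sum((x - mean) ** 2 for x in data)
    divide_by_n += ss / n
    divide_by_n_minus_1 += ss / (n - 1)

print(divide_by_n / trials)           # near 0.8 = (n-1)/n * sigma^2: biased low
print(divide_by_n_minus_1 / trials)   # near 1.0 = sigma^2: unbiased
```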