The following content is provided under a Creative Commons license. Your support will help MIT OpenCourseWare continue to offer high-quality educational resources for free. To make a donation or view additional materials from hundreds of MIT courses, visit MIT OpenCourseWare at ocw.mit.edu.

PROFESSOR: So for the last three lectures we're going to talk about classical statistics, the way statistics can be done if you don't want to assume a prior distribution on the unknown parameters. Today we're going to focus mostly on the estimation side and leave hypothesis testing for the next two lectures. There is one generic method that one can use to carry out parameter estimation, and that's the maximum likelihood method. We're going to define what it is. Then we will look at the most common estimation problem there is, which is to estimate the mean of a given distribution. And we're going to talk about confidence intervals, which refers to providing an interval around your estimate that has some property of the kind that the parameter is highly likely to be inside that interval. But we will be careful about how to interpret that particular statement.

OK. So the big framework first. The picture is almost the same as the one that we had in the case of Bayesian statistics. We have some unknown parameter. And we have a measuring device. There is some noise, some randomness. And we get an observation, X, whose distribution depends on the value of the parameter. However, the big change from the Bayesian setting is that here, this parameter is just a number. It's not modeled as a random variable. It does not have a probability distribution. There's nothing random about it. It's a constant. It just happens that we don't know what that constant is. And in particular, this probability distribution here, the distribution of X, depends on Theta. But this is not a conditional distribution in the usual sense of the word.
Conditional distributions were defined when we had two random variables and we conditioned one random variable on the other. And we used the bar to separate the X from the Theta. To make the point that this is not a conditional distribution, we use a different notation. We put a semicolon here. And what this is meant to say is that X has a distribution. That distribution has a certain parameter. And we don't know what that parameter is. So for example, this might be a normal distribution with variance 1 but a mean Theta. We don't know what Theta is. And we want to estimate it.

Now once we have this setting, your job is to design this box, the estimator. The estimator is some data processing box that takes the measurements and produces an estimate of the unknown parameter. Now the notation that's used here is as if X and Theta were one-dimensional quantities. But actually, everything we say remains valid if you interpret X and Theta as vectors of parameters. So for example, you may obtain several measurements, X1 up to Xn. And there may be several unknown parameters in the background.

Once more, we do not have, and we do not want to assume, a prior distribution on Theta. It's a constant. And if you want to think mathematically about this situation, it's as if you have many different probabilistic models. So a normal with this mean, or a normal with that mean, or a normal with that mean: these are alternative candidate probabilistic models. And we want to try to make a decision about which one is the correct model. In some cases, we have to choose just between a small number of models. For example, you have a coin with an unknown bias. The bias is either 1/2 or 3/4. You're going to flip the coin a few times. And you try to decide whether the true bias is this one or is that one. So in this case, we have two specific, alternative probabilistic models between which we want to distinguish. But sometimes things are a little more complicated.
For example, you have a coin. And you have one hypothesis that my coin is unbiased. And the other hypothesis is that my coin is biased. And you do your experiments. And you want to come up with a decision that decides whether this one is true or that one is true. In this case, we're not dealing with just two alternative probabilistic models. This one is a specific model for the coin. But this one actually corresponds to lots of possible, alternative coin models. So this includes the model where Theta is 0.6, the model where Theta is 0.7, Theta is 0.8, and so on. So we're trying to discriminate between one model and lots of alternative models. How does one go about this? Well, there are some systematic ways that one can approach problems of this kind. And we will start talking about these next time.

So today, we're going to focus on estimation problems. In estimation problems, Theta is a quantity which is a real number, a continuous parameter. We're going to design this box, so that what we get out of this box is an estimate. Now notice that this estimate here is a random variable. Even though Theta is deterministic, this is random, because it's a function of the data that we observe. The data are random. We're applying a function to the data to construct our estimate. So, since it's a function of random variables, it's a random variable itself. The distribution of Theta hat depends on the distribution of X. The distribution of X is affected by Theta. So in the end, the distribution of your estimate Theta hat will also be affected by whatever Theta happens to be.

Our general objective, when designing estimators, is that we want to get, in the end, an estimation error which is not too large. But we'll have to make specific, again, exactly what we mean by that. So how do we go about this problem?
One general approach is to pick a Theta under which the data that we observed, that is, the X's, are most likely to have occurred. So I observe X. For any given Theta, I can calculate this quantity, which tells me, under this particular Theta, the X that you observed had this probability of occurring. Under that Theta, the X that you observed had that probability of occurring. You just choose the Theta that makes the data that you observed most likely.

It's interesting to compare this maximum likelihood estimate with the estimate that you would have if you were in a Bayesian setting and you were using maximum a posteriori probability estimation. In the Bayesian setting, what we do is, given the data, we use the prior distribution on Theta. And we calculate the posterior distribution of Theta given X. Notice that this is sort of the opposite of what we have here. This is the probability of X for a particular value of Theta, whereas this is the probability of Theta for a particular X. So it's the opposite type of conditioning. In the Bayesian setting, Theta is a random variable. So we can talk about the probability distribution of Theta.

So how do these two compare, except for this syntactic difference that the order of the X's and Theta's is reversed? Let's write down, in full detail, what this posterior distribution of Theta is. By the Bayes rule, this conditional distribution is obtained from the prior and the model of the measurement process that we have. And we get to this expression. So in Bayesian estimation, we want to find the most likely value of Theta. And we need to maximize this quantity over all possible Theta's. First thing to notice is that the denominator is a constant. It does not involve Theta. So when you maximize this quantity, you don't care about the denominator. You just want to maximize the numerator. Now, here, things start to look a little more similar.
And they would be exactly of the same kind if that term here, the prior, was absent. The two are going to become the same if that prior is just a constant. So if that prior is a constant, then maximum likelihood estimation takes exactly the same form as Bayesian maximum posterior probability estimation. So you can give this particular interpretation to maximum likelihood estimation. Maximum likelihood estimation is essentially what you would have done if you were in a Bayesian world and you had assumed a prior on the Theta's that's uniform, all the Theta's being equally likely.

Okay. So let's look at a simple example. Suppose that the Xi's are independent, identically distributed random variables with a certain parameter Theta. So the distribution of each one of the Xi's is this particular term. So Theta is one-dimensional. It's a one-dimensional parameter. But we have several data. We write down the formula for the probability of a particular X vector, given a particular value of Theta. But again, when I use the word "given" here, it's not in the conditioning sense. It's the value of the density for a particular choice of Theta. Here, I wrote down, I defined maximum likelihood estimation in terms of PMFs. That's what you would do if the X's were discrete random variables. Here, the X's are continuous random variables, so I'm using the PDF instead of the PMF. So the definition here generalizes to the case of continuous random variables. And you use f's instead of p's, our usual recipe. So the maximum likelihood estimate is defined.

Now, since the Xi's are independent, the joint density of all the X's together is the product of the individual densities. So you look at this quantity. This is the density, or sort of probability, of observing a particular sequence of X's. And we ask the question, what's the value of Theta that makes the X's that we observe most likely?
So we want to carry out this maximization. Now this maximization is just a calculational problem. We're going to do this maximization by taking the logarithm of this expression. Maximizing an expression is the same as maximizing its logarithm. So take the logarithm of this expression: the logarithm of a product is the sum of the logarithms. You get contributions from this Theta term. There's n of these, so we get an n log Theta. And then we have the sum of the logarithms of these terms, which gives us minus Theta times the sum of the X's. So we need to maximize this expression with respect to Theta. The way to do this maximization is to take the derivative with respect to Theta and set it to zero. And you get n over Theta equal to the sum of the X's. And then you solve for Theta. And you find that the maximum likelihood estimate is this quantity. Which sort of makes sense, because this is the reciprocal of the sample mean of the X's. And for an exponential distribution, we know that Theta is 1 over the mean of the distribution. So it looks like a reasonable estimate. So in any case, this is the estimate that the maximum likelihood estimation procedure tells us we should report.

This formula here, of course, tells you what to do if you have already observed specific numbers. If you have observed specific numbers, then you report this particular number as your estimate of Theta. If you want to describe your estimation procedure more abstractly, what you have constructed is an estimator, which is a box that takes in the random variables, capital X1 up to capital Xn, and produces your estimate, which is also a random variable, because it's a function of these random variables, and is denoted by an uppercase Theta to indicate that this is now a random variable. So this is an equality about numbers. And this is a description of the general procedure, which is an equality between two random variables.
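As a quick illustration of this exponential example, here is a minimal sketch (my own, not from the lecture) that computes the maximum likelihood estimate n divided by the sum of the x's from simulated data; the parameter values and the use of NumPy are arbitrary choices:

```python
import numpy as np

# Hypothetical setup: draw n i.i.d. samples from an exponential
# distribution with true rate theta = 2.0. NumPy parameterizes the
# exponential by its mean, which is 1/theta.
rng = np.random.default_rng(seed=0)
true_theta = 2.0
n = 1000
x = rng.exponential(scale=1.0 / true_theta, size=n)

# The estimate derived in the lecture: maximize n*log(theta) - theta*sum(x),
# which gives theta_hat = n / sum(x), the reciprocal of the sample mean.
theta_hat = n / np.sum(x)
print(theta_hat)  # should be close to 2.0 for large n
```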
This second description, the one in terms of random variables, gives you the more abstract view of what we're doing here.

All right. So what can we tell about our estimate? Is it good or is it bad? We should look at this particular random variable and talk about the statistical properties that it has. What we would like is for this random variable to be close to the true value of Theta, with high probability, no matter what Theta is, since we don't know what Theta is. Let's make the properties that we want a little more specific.

So we cook up the estimator somehow. This estimator corresponds, again, to a box that takes data in, the capital X's, and produces an estimate Theta hat. This estimate is random. Sometimes it will be above the true value of Theta. Sometimes it will be below. Ideally, we would like it to not have a systematic error, on the positive side or the negative side. So a reasonable wish to have, for a good estimator, is that, on the average, it gives you the correct value.

Now here, let's be a little more specific about what that expectation is. This is an expectation with respect to the probability distribution of Theta hat. The probability distribution of Theta hat is affected by the probability distribution of the X's, because Theta hat is a function of the X's. And the probability distribution of the X's is affected by the true value of Theta. So depending on which one is the true value of Theta, this is going to be a different expectation. So if you were to write this expectation out in more detail, it would look something like this. You need to write down the probability distribution of Theta hat. And this is going to be some function. But this function depends on the true Theta, is affected by the true Theta. And then you integrate this with respect to Theta hat. What's the point here? Again, Theta hat is a function of the X's. So the density of Theta hat is affected by the density of the X's.
The density of the X's is affected by the true value of Theta. So the distribution of Theta hat is affected by the value of Theta. Another way to put it is, as I mentioned a few minutes ago, in this business, it's as if we are considering different possible probabilistic models, one probabilistic model for each choice of Theta. And we're trying to guess which one of these probabilistic models is the true one. One way of emphasizing the fact that this expression depends on the true Theta is to put a little subscript here: expectation under the particular value of the parameter Theta. So depending on what value the true parameter Theta takes, this expectation will have a different value. And what we would like is that, no matter what the true value is, our estimate will not have a bias on the positive or the negative side. So this is a property that's desirable. Is it always going to be true? Not necessarily. It depends on what estimator we construct.

Is it true for our exponential example? Unfortunately not. The estimate that we have in the exponential example turns out to be biased. And one extreme way of seeing this is to consider the case where our sample size is 1. We're trying to estimate Theta. And the estimator from the previous slide, in that case, is just 1/X1. Now X1 has a fair amount of density in the vicinity of 0, which means that 1/X1 has significant probability of being very large. And if you do the calculation, this ultimately makes the expected value of 1/X1 infinite. Now infinity is definitely not the correct value. So our estimate is biased upwards. And it's actually biased a lot upwards. So that's how things are. Maximum likelihood estimates, in general, will be biased. But under some conditions, they will turn out to be asymptotically unbiased.
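As an aside, here is a minimal simulation sketch (my own, with arbitrary parameters, not from the lecture) that makes the upward bias of the exponential maximum likelihood estimate visible for a small sample size:

```python
import numpy as np

# Hypothetical check of the bias of the exponential MLE for small n.
# With X_i ~ Exp(theta), the estimate n / sum(X_i) tends to overshoot.
rng = np.random.default_rng(seed=0)
true_theta = 2.0
n = 5                 # small sample size, where the bias is pronounced
num_trials = 100_000

x = rng.exponential(scale=1.0 / true_theta, size=(num_trials, n))
theta_hat = n / x.sum(axis=1)

# The average estimate exceeds the true value of 2.0; one can show
# E[theta_hat] = n * theta / (n - 1), which is 2.5 here.
print(theta_hat.mean())
```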
Asymptotically unbiased means that, as you get more and more data, as your X vector gets longer and longer, with independent data, the expected value of your estimator is going to get closer and closer to the true value. So you do have some nice asymptotic properties, but we're not going to prove anything like this.

Speaking of asymptotic properties, in general, what we would like to have is that, as you collect more and more data, you get the correct answer, in some sense. And the sense that we're going to use here is the limiting sense of convergence in probability, since this is the only notion of convergence of random variables that we have in our hands. This is similar to what we had in the pollster problem, for example. If we had a bigger and bigger sample size, we could be more and more confident that the estimate that we obtained is close to the unknown true parameter of the distribution that we have. So this is a desirable property: if you have an infinitely large amount of data, you should be able to estimate an unknown parameter more or less exactly. So this is a desirable property of estimators. It turns out that maximum likelihood estimation, given independent data, does have this property, under mild conditions. So maximum likelihood estimation, in this respect, is a good approach.

So let's see, do we have this consistency property in our exponential example? In our exponential example, we used this quantity to estimate the unknown parameter Theta. What properties does this quantity have as n goes to infinity? Well, this quantity is the reciprocal of that quantity up here, which is the sample mean. We know from the weak law of large numbers that the sample mean converges to the expectation. So this property here comes from the weak law of large numbers. In probability, this quantity converges to the expected value, which, for exponential distributions, is 1/Theta.
Now, if something converges to something, then the reciprocal should converge to the reciprocal. That's a property that's certainly correct for numbers. But we're not talking about convergence of numbers. We're talking about convergence in probability, which is a more complicated notion. Fortunately, it turns out that the same thing is true when we deal with convergence in probability. One can show, although we will not bother doing this, that indeed the reciprocal of this, which is our estimate, converges in probability to the reciprocal of that. And that reciprocal is the true parameter Theta. So for this particular exponential example, we do have the desirable property that, as the number of data becomes larger and larger, the estimate that we have constructed will get closer and closer to the true parameter value. And this is true no matter what Theta is. No matter what the true parameter Theta is, we're going to get close to it as we collect more data.

Okay. So these are two rough qualitative properties that would be nice to have. If you want to get a little more quantitative, you can start looking at the mean squared error that your estimator gives. Now, once more, the comment I was making up there applies. Namely, this expectation here is an expectation with respect to the probability distribution of Theta hat that corresponds to a particular value of little theta. So fix a little theta. Write down this expression. Look at the probability distribution of Theta hat under that little theta. And do this calculation. You're going to get some quantity that depends on the little theta. And so all quantities in this equality here should be interpreted as quantities under that particular value of little theta. So if you wanted to make this more explicit, you could start throwing little subscripts everywhere in those expressions. And let's see what those expressions tell us.
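For reference, the identity being developed on the board can presumably be written out as follows (this is the standard bias-variance decomposition of the mean squared error, with the subscript theta making the fixed parameter explicit):

```latex
\mathbb{E}_\theta\!\left[(\hat{\Theta} - \theta)^2\right]
  = \mathrm{var}_\theta(\hat{\Theta} - \theta)
    + \left(\mathbb{E}_\theta[\hat{\Theta} - \theta]\right)^2
  = \mathrm{var}_\theta(\hat{\Theta}) + b(\theta)^2,
\qquad b(\theta) = \mathbb{E}_\theta[\hat{\Theta}] - \theta .
```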
The expected value of the square of a random variable, we know, is always equal to the variance of this random variable plus the square of the expectation of the random variable. This equality here is just our familiar formula, that the expected value of X squared is the variance of X plus the square of the expected value of X. So we apply this formula to X equal to Theta hat minus Theta.

Now, remember that, in this classical setting, Theta is just a constant. We have fixed Theta. We want to calculate the variance of this quantity under that particular Theta. When you add or subtract a constant to a random variable, the variance doesn't change. So this is the same as the variance of our estimator. And what we've got here is the bias of our estimator. It tells us, on the average, whether we fall above or below. And we're taking the bias squared, b squared. If we have an unbiased estimator, the bias term will be 0.

So ideally we want Theta hat to be very close to Theta. And since Theta is a constant, if that happens, the variance of Theta hat will be very small. So Theta is a constant. If Theta hat has a distribution that's concentrated just around little theta, then Theta hat will have a small variance. So this is one desire that we have: we want a small variance. But we also want to have a small bias at the same time. So the general form of the mean squared error has two contributions. One is the variance of our estimator. The other is the bias. And one usually wants to design an estimator that simultaneously keeps both of these terms small.

So here's an estimation method that would do very well with respect to this term, but badly with respect to that term. Suppose that my distribution is, let's say, normal with an unknown mean Theta and variance 1. And I use as my estimator something very dumb.
I always produce an estimate that says: my estimate is 100. So I'm just ignoring the data and reporting 100. What does this do? The variance of my estimator is 0. There's no randomness in the estimate that I report. But the bias is going to be pretty bad. The bias is going to be Theta hat, which is 100, minus the true value of Theta. And for some Theta's, my bias is going to be horrible. If my true Theta happens to be 0, my bias squared is a huge term. And I get a large error. So what's the moral of this example? There are ways of making the variance very small, but, in those cases, you pay a price in the bias. So you want to do something a little more delicate, where you try to keep both terms small at the same time.

These types of considerations become important when you start to try to design sophisticated estimators for more complicated problems. But we will not do this in this class. This belongs to further classes on statistics and inference. For this class, for parameter estimation, we will basically stick to two very simple methods. One is the maximum likelihood method we've just discussed. And the other method is what you would do if you were still in high school and didn't know any probability. You get data. And these data come from some distribution with an unknown mean. And you want to estimate the unknown mean. What would you do? You would just take those data and average them out.

So let's make this a little more specific. We have X's that come from a given distribution. We know the general form of the distribution, perhaps. We do know, perhaps, the variance of that distribution, or, perhaps, we don't know it. But we do not know the mean. And we want to estimate the mean of that distribution. Now, we can represent this situation in a different form.
The Xi's are equal to Theta, the mean, plus a zero-mean random variable that you can think of as noise. So this corresponds to the usual situation you would have in a lab, where you go and try to measure an unknown quantity. You get lots of measurements. But each time you measure, your measurements have some extra noise in there. And you want to get rid of that noise. The way to try to get rid of the measurement noise is to collect lots of data and average them out. This is the sample mean. And this is a very, very reasonable way of trying to estimate the unknown mean of the X's.

So this is the sample mean. It's a reasonable, plausible, and in general pretty good estimator of the unknown mean of a certain distribution. We can apply this estimator without really knowing a lot about the distribution of the X's. Actually, we don't need to know anything about the distribution. We can still apply it, because the variance, for example, does not show up here. We don't need to know the variance to calculate that quantity.

Does this estimator have good properties? Yes, it does. What's the expected value of the sample mean? The expectation of this is the expectation of the sum divided by n. The expected value of each one of the X's is Theta. So the expected value of the sample mean is just Theta itself. So our estimator is unbiased. No matter what Theta is, our estimator does not have a systematic error in either direction. Furthermore, the weak law of large numbers tells us that this quantity converges to the true parameter in probability. So it's a consistent estimator. This is good. And you can also calculate the mean squared error corresponding to this estimator. Remember how we defined the mean squared error? It's this quantity. This is a calculation that we have done a fair number of times by now. The mean squared error is the variance of the distribution of the X's divided by n.
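Here is a minimal sketch (my own, not from the lecture) that checks these two claims numerically: the sample mean is unbiased, and its mean squared error is sigma squared over n. The distribution, true mean, and sample size are arbitrary choices:

```python
import numpy as np

# Hypothetical check of the sample mean's properties.
rng = np.random.default_rng(seed=0)
true_theta = 2.37
sigma = 1.5
n = 50
num_trials = 200_000

# Each row is one experiment: n noisy measurements of true_theta.
x = true_theta + sigma * rng.standard_normal(size=(num_trials, n))
sample_means = x.mean(axis=1)

print(sample_means.mean())            # close to 2.37: unbiased
mse = np.mean((sample_means - true_theta) ** 2)
print(mse, sigma**2 / n)              # both close to sigma^2 / n = 0.045
```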
So as we get more and more data, the mean squared error goes down to 0. In some examples, it turns out that the sample mean is also the same as the maximum likelihood estimate. For example, if the X's are coming from a normal distribution, you can write down the likelihood and do the maximization with respect to Theta, and you'll find that the maximum likelihood estimate is the same as the sample mean. In other cases, the sample mean will be different from the maximum likelihood estimate. And then you have a choice about which one of the two you would use. Probably, in most reasonable situations, you would just use the sample mean, because it's simple, easy to compute, and has nice properties.

All right. So you go to your boss. And you report and say, OK, I did all my experiments in the lab. And the average value that I got is a certain number, 2.37. So is that informative to your boss? Well, your boss would like to know how much they can trust this number, 2.37. I know that the true value is not going to be exactly that. But how close should it be? So give me a range of what you think are possible values of Theta.

So the situation is like this. Suppose that we observe X's that are coming from a certain distribution. And we're trying to estimate the mean. We get our data. Maybe our data look something like this. You calculate the sample mean. So let's suppose that the sample mean is a number, which for some reason we take to be 2.37. But you want to convey something to your boss about how spread out these data were. So the boss asks you to give him or her some kind of interval in which Theta, the true parameter, might lie. So the boss asked you for an interval. So what you do is you end up reporting an interval. And you somehow use the data that you have seen to construct this interval.
And you report to your boss the endpoints of this interval as well. Let's give names to these endpoints, Theta_n- and Theta_n+. The n's here just play the role of keeping track of how many data we're using. So what you report to your boss is this interval as well.

Are these Theta's here, the endpoints of the interval, lowercase or uppercase? What should they be? Well, you construct these intervals after you see your data. You take the data into account to construct your interval. So these definitely should depend on the data. And therefore they are random variables. Same thing with your estimator: in general, it's going to be a random variable. Although, when you go and report numbers to your boss, you give the specific realizations of the random variables, given the data that you got.

So instead of having just a single box that produces estimates (our previous picture was that you have your estimator that takes X's and produces Theta hats), now our box will also be producing a Theta hat minus and a Theta hat plus. It's going to produce an interval as well. The X's are random, therefore these quantities are random. Once you go and do the experiment and obtain your data, then your data will be some lowercase x, specific numbers. And then your estimates and estimator become also lowercase.

What would we like this interval to do? We would like it to be highly likely to contain the true value of the parameter. So we might impose some specs of the following kind. I pick a number, alpha. Think of that alpha as a probability of a large error. A typical value of alpha might be 0.05, in which case this number here is 0.95. And you're given specs that say something like this. I would like, with probability at least 0.95, this to happen, which says that the true parameter lies inside the confidence interval. Now let's try to interpret this statement.
Suppose that you did the experiment, and you ended up reporting to your boss a confidence interval from 1.97 to 2.56. That's what you report to your boss. And suppose that the confidence interval has this property. Can you go to your boss and say, with probability 95%, the true value of Theta is between these two numbers? Is that a meaningful statement?

So the tentative statement is: with probability 95%, the true value of Theta is between 1.97 and 2.56. Well, what is random in that statement? There's nothing random. The true value of theta is a constant. 1.97 is a number. 2.56 is a number. So it doesn't make any sense to talk about the probability that theta is in this interval. Either theta happens to be in that interval, or it happens to not be. But there are no probabilities associated with this, because theta is not random. Syntactically, you can see this, because theta here is lowercase.

So what kind of probabilities are we talking about here? Where's the randomness? Well, the random thing is the interval. It's not theta. So the statement that is being made here is that the interval that's being constructed by our procedure should have the property that, with probability 95%, it's going to fall on top of the true value of theta.

So the right way of interpreting what the 95% confidence interval is, is something like the following. We have the true value of theta, which we don't know. I get data. Based on the data, I construct a confidence interval. I got lucky, and the true value of theta is in here. Next day, I do the same experiment, take my data, construct a confidence interval. And I get this confidence interval, lucky once more. Next day, I get data. I use my data to come up with an estimate of theta and a confidence interval. That day, I was unlucky.
And I got a confidence interval out there. What the requirement here is, is that on 95% of the days on which we use this particular procedure for constructing confidence intervals, 95% of those days, we will be lucky, and we will capture the correct value of theta inside our confidence interval. So it's a statement about the distribution of these random confidence intervals: how likely they are to fall on top of the true theta, as opposed to how likely they are to fall outside. So it's a statement about probabilities associated with the confidence interval. They're not probabilities about theta, because theta, itself, is not random. So this is what the confidence interval is, in general, and how we interpret it.

How do we construct a 95% confidence interval? Let's go through this exercise in a particular example. The calculations are exactly the same as the ones that you did when we talked about laws of large numbers and the central limit theorem. So there's nothing new calculationally, but it's, perhaps, new in terms of the language that we use and the interpretation.

So we got our sample mean from some distribution. And we would like to calculate a 95% confidence interval. We know from the normal tables that the standard normal has 2.5% probability on the tail beyond 1.96. Yes, by this time, the number 1.96 should be pretty familiar. So if this probability here is 2.5%, this number here is 1.96. Now look at this random variable here. This is the sample mean's difference from the true mean, normalized by the usual normalizing factor. By the central limit theorem, this is approximately normal. So it has probability 0.95 of being less than 1.96 in absolute value. Now take this event here and rewrite it. This is the event that Theta hat minus theta is bigger than this number and smaller than that number. This event here is equivalent to that event here. And so this suggests a way of constructing our 95% confidence interval.
788 00:39:52,130 --> 00:39:56,330 I'm going to report the interval, which gives this as 789 00:39:56,330 --> 00:40:00,350 the lower end of the confidence interval, and gives 790 00:40:00,350 --> 00:40:05,720 this as the upper end of the confidence interval. 791 00:40:05,720 --> 00:40:09,180 In other words, at the end of the experiment, we report the 792 00:40:09,180 --> 00:40:12,170 sample mean, which is our estimate. 793 00:40:12,170 --> 00:40:14,230 And we report, also, an interval 794 00:40:14,230 --> 00:40:16,080 around the sample mean. 795 00:40:16,080 --> 00:40:20,510 And this is our 95% confidence interval. 796 00:40:20,510 --> 00:40:22,800 The confidence interval becomes 797 00:40:22,800 --> 00:40:26,050 smaller when n is larger. 798 00:40:26,050 --> 00:40:28,950 In some sense, we're more certain that we're doing a 799 00:40:28,950 --> 00:40:32,390 good estimation job, so we can have a small interval and 800 00:40:32,390 --> 00:40:36,000 still be quite confident that our interval captures the true 801 00:40:36,000 --> 00:40:37,520 value of the parameter. 802 00:40:37,520 --> 00:40:41,890 Also, if our data have very little noise, when you have 803 00:40:41,890 --> 00:40:45,060 more accurate measurements, you're more confident that 804 00:40:45,060 --> 00:40:47,220 your estimate is pretty good. 805 00:40:47,220 --> 00:40:51,120 And that results in a smaller confidence interval, a smaller 806 00:40:51,120 --> 00:40:52,610 length of the confidence interval. 807 00:40:52,610 --> 00:40:56,040 And still you have 95% probability of capturing the 808 00:40:56,040 --> 00:40:57,650 true value of theta. 809 00:40:57,650 --> 00:41:01,660 So we did this exercise by taking 95% confidence 810 00:41:01,660 --> 00:41:04,010 intervals and the corresponding value from the 811 00:41:04,010 --> 00:41:06,670 normal tables, which is 1.96. 812 00:41:06,670 --> 00:41:11,390 Of course, you can do it more generally, if you set your 813 00:41:11,390 --> 00:41:13,730 alpha to be some other number. 814 00:41:13,730 --> 00:41:16,590 Again, you look at the normal tables. 815 00:41:16,590 --> 00:41:20,460 And you find the value here, so that the tail has 816 00:41:20,460 --> 00:41:22,640 probability alpha over 2. 817 00:41:22,640 --> 00:41:26,790 And instead of using this 1.96, you use whatever number 818 00:41:26,790 --> 00:41:31,380 you get from the normal tables. 819 00:41:31,380 --> 00:41:33,520 And this tells you how to construct 820 00:41:33,520 --> 00:41:36,680 a confidence interval. 821 00:41:36,680 --> 00:41:42,060 Well, to be exact, this is not necessarily a 822 00:41:42,060 --> 00:41:44,640 95% confidence interval. 823 00:41:44,640 --> 00:41:47,540 It's approximately a 95% confidence interval. 824 00:41:47,540 --> 00:41:48,950 Why is this? 825 00:41:48,950 --> 00:41:51,060 Because we've done an approximation. 826 00:41:51,060 --> 00:41:53,890 We have used the central limit theorem. 827 00:41:53,890 --> 00:41:59,990 So it might turn out to be a 95.5% confidence interval 828 00:41:59,990 --> 00:42:03,220 instead of 95%, because our calculations are 829 00:42:03,220 --> 00:42:04,740 not entirely accurate. 830 00:42:04,740 --> 00:42:08,230 But for reasonable values of n, using the central limit 831 00:42:08,230 --> 00:42:10,190 theorem is a good approximation. 832 00:42:10,190 --> 00:42:13,330 And that's what people almost always do. 833 00:42:13,330 --> 00:42:17,350 So just take the value from the normal tables. 834 00:42:17,350 --> 00:42:18,600 Okay, except for one catch.
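[Before getting to that catch, here is a minimal simulation sketch of the procedure and of the repeated-experiments interpretation above. It assumes normally distributed data with a known sigma; all names and numbers are illustrative, not from the lecture.]

```python
# Minimal sketch: repeat the experiment many "days", build a 95%
# confidence interval each day, and count how often the interval
# covers the true theta.  Assumes normal data with a KNOWN sigma;
# the case of unknown sigma is discussed next.
import numpy as np

rng = np.random.default_rng(0)
theta, sigma, n = 2.0, 1.0, 100    # true mean, known std, sample size
num_days = 10_000                  # number of repetitions of the experiment
z = 1.96                           # from the standard normal table

covered = 0
for _ in range(num_days):
    x = rng.normal(theta, sigma, size=n)
    theta_hat = x.mean()                    # the estimate: the sample mean
    half_width = z * sigma / np.sqrt(n)     # half-length of the interval
    if theta_hat - half_width <= theta <= theta_hat + half_width:
        covered += 1                        # a "lucky" day

print(covered / num_days)  # should come out close to 0.95
```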
835 00:42:18,600 --> 00:42:22,830 836 00:42:22,830 --> 00:42:24,590 I used the data. 837 00:42:24,590 --> 00:42:26,440 I obtained my estimate. 838 00:42:26,440 --> 00:42:29,830 And I want to go to my boss and report this interval, from 839 00:42:29,830 --> 00:42:33,010 theta hat minus this to theta hat plus this, which is the confidence interval. 840 00:42:33,010 --> 00:42:35,720 What's the difficulty? 841 00:42:35,720 --> 00:42:37,540 I know what n is. 842 00:42:37,540 --> 00:42:40,790 But I don't know what sigma is, in general. 843 00:42:40,790 --> 00:42:44,750 So if I don't know sigma, what am I going to do? 844 00:42:44,750 --> 00:42:48,980 Here, there are a few options for what you can do. 845 00:42:48,980 --> 00:42:52,910 And the first option is familiar from what we did when 846 00:42:52,910 --> 00:42:55,020 we talked about the pollster problem. 847 00:42:55,020 --> 00:42:58,480 We don't know what sigma is, but maybe we have an upper 848 00:42:58,480 --> 00:43:00,030 bound on sigma. 849 00:43:00,030 --> 00:43:03,540 For example, if the Xi's are Bernoulli random variables, we 850 00:43:03,540 --> 00:43:06,910 have seen that the standard deviation is at most 1/2. 851 00:43:06,910 --> 00:43:10,220 So use the most conservative value for sigma. 852 00:43:10,220 --> 00:43:13,520 Using the most conservative value means that you take 853 00:43:13,520 --> 00:43:17,890 bigger confidence intervals than necessary. 854 00:43:17,890 --> 00:43:20,780 So that's one option. 855 00:43:20,780 --> 00:43:25,480 Another option is to try to estimate sigma from the data. 856 00:43:25,480 --> 00:43:27,630 How do you do this estimation? 857 00:43:27,630 --> 00:43:31,140 In special cases, for special types of distributions, you 858 00:43:31,140 --> 00:43:34,180 can think of heuristic ways of doing this estimation. 859 00:43:34,180 --> 00:43:38,390 For example, in the case of Bernoulli random variables, we 860 00:43:38,390 --> 00:43:42,420 know that the true value of sigma, the standard deviation 861 00:43:42,420 --> 00:43:45,120 of a Bernoulli random variable, is the square root 862 00:43:45,120 --> 00:43:47,670 of theta times (1 minus theta), where theta is 863 00:43:47,670 --> 00:43:50,290 the mean of the Bernoulli. 864 00:43:50,290 --> 00:43:51,900 Try to use this formula. 865 00:43:51,900 --> 00:43:54,140 But theta is the thing we're trying to estimate in the 866 00:43:54,140 --> 00:43:54,760 first place. 867 00:43:54,760 --> 00:43:55,880 We don't know it. 868 00:43:55,880 --> 00:43:57,150 What do we do? 869 00:43:57,150 --> 00:44:00,850 Well, we have an estimate for theta, the estimate produced 870 00:44:00,850 --> 00:44:04,195 by our estimation procedure, the sample mean. 871 00:44:04,195 --> 00:44:05,670 So I obtain my data. 872 00:44:05,670 --> 00:44:06,540 873 00:44:06,540 --> 00:44:09,030 I produce the estimate theta hat. 874 00:44:09,030 --> 00:44:10,740 It's an estimate of the mean. 875 00:44:10,740 --> 00:44:14,770 Use that estimate in this formula to come up with an 876 00:44:14,770 --> 00:44:17,290 estimate of my standard deviation. 877 00:44:17,290 --> 00:44:20,210 And then use that standard deviation in the construction 878 00:44:20,210 --> 00:44:22,510 of the confidence interval, pretending 879 00:44:22,510 --> 00:44:24,180 that this is correct. 880 00:44:24,180 --> 00:44:29,050 Well, if the number of data points is large, then we know, from 881 00:44:29,050 --> 00:44:31,870 the law of large numbers, that theta hat is a pretty good 882 00:44:31,870 --> 00:44:33,130 estimate of theta.
883 00:44:33,130 --> 00:44:36,670 So sigma hat is going to be a pretty good estimate of sigma. 884 00:44:36,670 --> 00:44:42,380 So we're not making large errors by using this approach. 885 00:44:42,380 --> 00:44:47,980 So in this scenario here, things were simple, because we 886 00:44:47,980 --> 00:44:49,890 had an analytical formula. 887 00:44:49,890 --> 00:44:52,210 Sigma was determined by theta. 888 00:44:52,210 --> 00:44:54,420 So we could come up with a quick and 889 00:44:54,420 --> 00:44:57,340 dirty estimate of sigma. 890 00:44:57,340 --> 00:45:00,940 In general, if you do not have any nice formulas of this 891 00:45:00,940 --> 00:45:03,000 kind, what could you do? 892 00:45:03,000 --> 00:45:04,920 Well, you still need to come up with an 893 00:45:04,920 --> 00:45:07,110 estimate of sigma somehow. 894 00:45:07,110 --> 00:45:08,950 What is a generic method for 895 00:45:08,950 --> 00:45:11,300 estimating a standard deviation? 896 00:45:11,300 --> 00:45:14,440 Equivalently, what could be a generic method for estimating 897 00:45:14,440 --> 00:45:16,920 a variance? 898 00:45:16,920 --> 00:45:19,360 Well, the variance is an expected value 899 00:45:19,360 --> 00:45:20,940 of some random variable. 900 00:45:20,940 --> 00:45:25,610 The variance is the mean of the random variable inside 901 00:45:25,610 --> 00:45:28,200 those brackets. 902 00:45:28,200 --> 00:45:33,160 How does one estimate the mean of some random variable? 903 00:45:33,160 --> 00:45:36,140 You obtain lots of measurements of that random 904 00:45:36,140 --> 00:45:40,210 variable and average them out. 905 00:45:40,210 --> 00:45:45,170 So this would be a reasonable way of estimating the variance 906 00:45:45,170 --> 00:45:47,310 of a distribution. 907 00:45:47,310 --> 00:45:50,590 And again, the weak law of large numbers tells us that 908 00:45:50,590 --> 00:45:55,370 this average converges to the expected value of this, which 909 00:45:55,370 --> 00:45:58,590 is just the variance of the distribution. 910 00:45:58,590 --> 00:46:01,700 So we got a nice and consistent way 911 00:46:01,700 --> 00:46:03,940 of estimating variances. 912 00:46:03,940 --> 00:46:08,100 But now, we seem to be getting into a vicious circle here, 913 00:46:08,100 --> 00:46:10,580 because to estimate the variance, we 914 00:46:10,580 --> 00:46:12,910 need to know the mean. 915 00:46:12,910 --> 00:46:16,075 And the mean is something we're trying to estimate in 916 00:46:16,075 --> 00:46:18,250 the first place. 917 00:46:18,250 --> 00:46:18,400 Okay. 918 00:46:18,400 --> 00:46:20,880 But we do have an estimate of the mean. 919 00:46:20,880 --> 00:46:24,640 So a reasonable approximation, once more, is to plug in, 920 00:46:24,640 --> 00:46:27,620 here, since we don't know the mean, the 921 00:46:27,620 --> 00:46:29,270 estimate of the mean. 922 00:46:29,270 --> 00:46:32,370 And so you get that expression, but with theta 923 00:46:32,370 --> 00:46:35,130 hat instead of theta itself. 924 00:46:35,130 --> 00:46:37,980 And this is another reasonable way of 925 00:46:37,980 --> 00:46:40,180 estimating the variance. 926 00:46:40,180 --> 00:46:42,940 It does have the same consistency properties. 927 00:46:42,940 --> 00:46:44,050 Why? 928 00:46:44,050 --> 00:46:51,100 When n is large, this is going to behave the same as that, 929 00:46:51,100 --> 00:46:53,640 because theta hat converges to theta. 930 00:46:53,640 --> 00:46:57,890 And when n is large, this is approximately the same as 931 00:46:57,890 --> 00:46:58,820 sigma squared.
932 00:46:58,820 --> 00:47:02,220 So for a large n, this quantity also converges to 933 00:47:02,220 --> 00:47:03,350 sigma squared. 934 00:47:03,350 --> 00:47:05,500 And we have a consistent estimate of 935 00:47:05,500 --> 00:47:07,000 the variance as well. 936 00:47:07,000 --> 00:47:09,490 And we can take that consistent estimate and use it 937 00:47:09,490 --> 00:47:12,360 back in the construction of the confidence interval. 938 00:47:12,360 --> 00:47:16,310 One little detail: here, we're dividing by n. 939 00:47:16,310 --> 00:47:19,590 Here, we're dividing by n-1. 940 00:47:19,590 --> 00:47:21,050 Why do we do this? 941 00:47:21,050 --> 00:47:24,630 Well, it turns out that's what you need to do for this 942 00:47:24,630 --> 00:47:28,590 estimate to be an unbiased estimate of the variance. 943 00:47:28,590 --> 00:47:32,080 One has to do a little bit of a calculation, and one finds 944 00:47:32,080 --> 00:47:36,650 that that's the factor that you need to have here in order 945 00:47:36,650 --> 00:47:37,770 to be unbiased. 946 00:47:37,770 --> 00:47:42,280 Of course, if you get 100 data points, whether you divide by 947 00:47:42,280 --> 00:47:46,070 100 or divide by 99, it's going to make only a tiny 948 00:47:46,070 --> 00:47:48,620 difference in your estimate of the variance. 949 00:47:48,620 --> 00:47:50,740 So it's going to make only a tiny difference in your 950 00:47:50,740 --> 00:47:52,670 estimate of the standard deviation. 951 00:47:52,670 --> 00:47:54,180 It's not a big deal. 952 00:47:54,180 --> 00:47:56,550 And it doesn't really matter. 953 00:47:56,550 --> 00:48:00,720 But if you want to show off about your deeper knowledge of 954 00:48:00,720 --> 00:48:06,810 statistics, you throw in the 1 over n-1 factor in there. 955 00:48:06,810 --> 00:48:11,350 So now one basically needs to put together this story here, 956 00:48:11,350 --> 00:48:15,260 how you estimate the variance. 957 00:48:15,260 --> 00:48:18,370 You first compute the sample mean. 958 00:48:18,370 --> 00:48:21,010 And then you do some extra work to come up with a 959 00:48:21,010 --> 00:48:23,020 reasonable estimate of the variance and 960 00:48:23,020 --> 00:48:24,640 the standard deviation. 961 00:48:24,640 --> 00:48:27,510 And then you use your estimate of the standard 962 00:48:27,510 --> 00:48:32,960 deviation to come up with a confidence interval, which has 963 00:48:32,960 --> 00:48:35,150 these two endpoints. 964 00:48:35,150 --> 00:48:39,130 In doing this procedure, there are basically a number of 965 00:48:39,130 --> 00:48:41,810 approximations involved. 966 00:48:41,810 --> 00:48:43,570 There are two types of approximations. 967 00:48:43,570 --> 00:48:46,170 One approximation is that we're pretending that the 968 00:48:46,170 --> 00:48:48,720 sample mean has a normal distribution. 969 00:48:48,720 --> 00:48:51,080 That's something we're justified in doing, by the 970 00:48:51,080 --> 00:48:52,470 central limit theorem. 971 00:48:52,470 --> 00:48:53,550 But it's not exact. 972 00:48:53,550 --> 00:48:54,910 It's an approximation. 973 00:48:54,910 --> 00:48:58,080 And the second approximation that comes in is that, instead 974 00:48:58,080 --> 00:49:01,260 of using the correct standard deviation, in general, you 975 00:49:01,260 --> 00:49:04,850 will have to use some approximation of 976 00:49:04,850 --> 00:49:06,100 the standard deviation.
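[A sketch of the complete procedure in code, under both approximations just mentioned. The function name and data are illustrative, not from the lecture; ddof=1 is what gives the 1/(n-1) factor discussed above.]

```python
# Sketch of the full pipeline: estimate the mean, estimate the standard
# deviation from the same data (dividing by n-1), and report an
# approximate 95% confidence interval.
import numpy as np

def mean_confidence_interval(x, z=1.96):
    n = len(x)
    theta_hat = x.mean()          # sample mean: the estimate of theta
    sigma_hat = x.std(ddof=1)     # ddof=1 divides by n-1 (unbiased variance)
    half_width = z * sigma_hat / np.sqrt(n)
    return theta_hat - half_width, theta_hat + half_width

rng = np.random.default_rng(1)
data = rng.exponential(scale=2.0, size=200)   # made-up data for illustration
print(mean_confidence_interval(data))

# For Bernoulli data, the earlier heuristic instead plugs theta hat into
# the formula sigma = sqrt(theta * (1 - theta)):
coin = rng.integers(0, 2, size=200).astype(float)
p_hat = coin.mean()
sigma_hat_bern = np.sqrt(p_hat * (1 - p_hat))
half = 1.96 * sigma_hat_bern / np.sqrt(len(coin))
print(p_hat - half, p_hat + half)
```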
977 00:49:06,100 --> 00:49:08,390 978 00:49:08,390 --> 00:49:11,200 Okay, so you will be getting a little bit of practice with 979 00:49:11,200 --> 00:49:14,550 these concepts in recitation and tutorial. 980 00:49:14,550 --> 00:49:18,070 And we will move on to new topics next week. 981 00:49:18,070 --> 00:49:20,930 But the material that's going to be covered on the final 982 00:49:20,930 --> 00:49:23,570 exam is only up to this point. 983 00:49:23,570 --> 00:49:28,220 So next week is just general education. 984 00:49:28,220 --> 00:49:30,550 Hopefully useful, but it's not on the exam. 985 00:49:30,550 --> 00:49:31,800