The following content is provided under a Creative Commons license. Your support will help MIT OpenCourseWare continue to offer high quality educational resources for free. To make a donation or view additional materials from hundreds of MIT courses, visit MIT OpenCourseWare at ocw.mit.edu.

JOHN TSITSIKLIS: And we're going to continue today with our discussion of classical statistics. We'll start with a quick review of what we discussed last time, and then talk about two topics that cover a lot of the statistics that happen in the real world. So two basic methods. One is the method of linear regression, and the other one is the basic methods and tools for how to do hypothesis testing. OK, so these two are topics that any scientifically literate person should know something about. So we're going to introduce the basic ideas and concepts involved.

So in classical statistics we basically have a family of possible models about the world. So the world is the random variable that we observe, and we have a model for it, but actually not just one model, several candidate models. And each candidate model corresponds to a different value of a parameter theta that we do not know. So in contrast to Bayesian statistics, this theta is assumed to be a constant that we do not know. It is not modeled as a random variable; there are no probabilities associated with theta. We only have probabilities about the X's.

So in this context, what is a reasonable way of choosing a value for the parameter? One general approach is the maximum likelihood approach, which chooses the theta for which this quantity is largest. So what does that mean intuitively? I'm trying to find the value of theta under which the data that I observe are most likely to have occurred. So the thinking is essentially as follows. Let's say I have to choose between two choices of theta. Under this theta, the X that I observed would be very unlikely.
Under that theta, the X that I observed would have a decent probability of occurring. So I choose the latter as my estimate of theta.

It's interesting to do the comparison with the Bayesian approach, which we did discuss last time. In the Bayesian approach we also maximize over theta, but we maximize a quantity in which the relation between X's and thetas runs the opposite way. Here, in the Bayesian world, Theta is a random variable. So it has a distribution. Once we observe the data, it has a posterior distribution, and we find the value of Theta which is most likely under the posterior distribution. As we discussed last time, when you do this maximization the posterior distribution is given by this expression. The denominator doesn't matter, and if you were to take a prior which is flat, that is, a constant independent of Theta, then that term would go away. And syntactically, at least, the two approaches look the same. So syntactically, or formally, maximum likelihood estimation is the same as Bayesian estimation in which you assume a prior which is flat, so that all possible values of Theta are equally likely. Philosophically, however, they're very different things. Here I'm picking the most likely value of Theta. Here I'm picking the value of theta under which the observed data would have been more likely to occur.
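To put the comparison in symbols (the notation here is the usual one; the slides may write the densities slightly differently), the two estimates are

\[
\hat\theta_{\text{ML}} = \arg\max_{\theta} f_X(x;\theta),
\qquad
\hat\theta_{\text{MAP}} = \arg\max_{\theta} f_{\Theta\mid X}(\theta\mid x)
= \arg\max_{\theta} \frac{f_{\Theta}(\theta)\, f_{X\mid\Theta}(x\mid\theta)}{f_X(x)} .
\]

The denominator does not involve theta, so if the prior f_Theta is constant (flat), the two maximizations pick out the same value; this is the formal sense in which the two approaches coincide.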
So maximum likelihood estimation is a general purpose method, and it's applied all over the place in many, many different types of estimation problems.

There is a special kind of estimation problem in which you may forget about maximum likelihood estimation and come up with an estimate in a straightforward way. And this is the case where you're trying to estimate the mean of the distribution of X, where X is a random variable. You observe several independent identically distributed random variables X1 up to Xn. All of them have the same distribution as this X, so they have a common mean. We do not know the mean; we want to estimate it. What is more natural than just taking the average of the values that we have observed? So you generate lots of X's, take the average of them, and you expect that this is going to be a reasonable estimate of the true mean of that random variable. And indeed we know from the weak law of large numbers that this estimate converges in probability to the true mean of the random variable.

The other thing that we talked about last time is that besides giving a point estimate, we may want to also give an interval that tells us something about where we might believe theta to lie. A 1-alpha confidence interval is an interval generated based on the data. So it's an interval from this value to that value. These values are written with capital letters because they're random, because they depend on the data that we have seen. And this gives us an interval, and we would like this interval to have the property that theta is inside that interval with high probability. So typically we would take 1-alpha to be a quantity such as 95%, for example, in which case we have a 95% confidence interval.

As we discussed last time, it's important to have the right interpretation of what this 95% means. What it does not mean is the following: that the unknown value has 95% probability of being in the interval that we have generated. That's because the unknown value is not a random variable, it's a constant. Once we generate the interval, either it's inside or it's outside, but there are no probabilities involved. Rather, the probabilities are to be interpreted over the random interval itself. What a statement like this says is that if I have a procedure for generating 95% confidence intervals, then whenever I use that procedure I'm going to get a random interval, and it's going to have 95% probability of capturing the true value of theta. So most of the time when I use this particular procedure for generating confidence intervals, the true theta will happen to lie inside that confidence interval with probability 95%. So the randomness in this statement is with respect to my confidence interval; it's not with respect to theta, because theta is not random.
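In symbols, the defining property of a 1-alpha confidence interval (writing the data-dependent endpoints with hats) is

\[
\mathbf{P}\bigl(\hat\Theta_n^{-} \le \theta \le \hat\Theta_n^{+}\bigr) \ge 1-\alpha
\quad \text{for every possible value of } \theta ,
\]

where the probability is over the random endpoints, not over theta.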
How does one construct confidence intervals? There are various ways of going about it, but in the case where we're dealing with the estimation of the mean of a random variable, doing this is straightforward using the central limit theorem. Basically we take our estimated mean, that's the sample mean, and we take a symmetric interval to the left and to the right of the sample mean. And we choose the width of that interval by looking at the normal tables. So if this quantity, 1-alpha, is 95%, we're going to look at the 97.5th percentile of the normal distribution, find the constant that corresponds to that value from the normal tables, and construct the confidence interval according to this formula. So that gives you a pretty mechanical way of going about constructing confidence intervals when you're estimating the sample mean.

Constructing confidence intervals in this way involves an approximation. The approximation is the central limit theorem. We are pretending that the sample mean is a normal random variable, which is, more or less, right when n is large. That's what the central limit theorem tells us. And sometimes we may need to do some extra approximation work, because quite often we do not know the true value of sigma. Sigma is, of course, the standard deviation of the X's. We may need to do some work to estimate it from the data, or we may have an upper bound on sigma, and we just use that upper bound.
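Here is a minimal sketch of this recipe in Python, assuming the observations sit in a NumPy array x and that sigma is known (or has already been estimated or bounded); 1.96 is the 97.5th percentile of the standard normal, which gives a 95% interval.

import numpy as np

def mean_confidence_interval(x, sigma, z=1.96):
    # Sample mean of the observations.
    n = len(x)
    sample_mean = np.mean(x)
    # Central limit theorem approximation: the sample mean is treated as
    # normal with standard deviation sigma / sqrt(n).
    half_width = z * sigma / np.sqrt(n)
    return sample_mean - half_width, sample_mean + half_width

# Example with simulated data whose true mean is 2.0.
x = np.random.normal(loc=2.0, scale=1.0, size=100)
print(mean_confidence_interval(x, sigma=1.0))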
So now let's move on to a new topic. A lot of statistics in the real world are of the following flavor. Suppose that X is the SAT score of a student in high school, and Y is the MIT GPA of that same student. So you expect that there is a relation between these two. So you go and collect data for different students, and you record, for a typical student, this would be their SAT score, that could be their MIT GPA. And you plot all this data on an (X,Y) diagram.

Now it's reasonable to believe that there is some systematic relation between the two. So people who had higher SAT scores in high school may have higher GPAs in college. Well, that may or may not be true. You want to construct a model of this kind, and see to what extent a relation of this type is true. So you might hypothesize that the real world is described by a model of this kind: that there is a linear relation between the SAT score and the college GPA. So it's a linear relation with some parameters, theta0 and theta1, that we do not know. So we assume a linear relation for the data, and depending on the choices of theta0 and theta1 it could be a different line through those data.

Now we would like to find the best model of this kind to explain the data. Of course there's going to be some randomness, so in general it's going to be impossible to find a line that goes through all of the data points. So let's try to find the best line that comes closest to explaining those data. And here's how we go about it. Suppose we try some particular values of theta0 and theta1. These give us a certain line. Given that line, we can make predictions. For a student who had this x, the model that we have would predict that y would be this value. The actual y is something else, and so this quantity is the error that our model would make in predicting the y of that particular student. We would like to choose a line for which the predictions are as good as possible. And what do we mean by as good as possible? As our criterion we're going to take the following. We are going to look at the prediction error that our model makes for each particular student, take the square of that, and then add them up over all of our data points.
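In symbols, with data pairs (x_i, y_i) for i = 1, ..., n, the criterion just described is

\[
\min_{\theta_0,\theta_1} \; \sum_{i=1}^{n} \bigl(y_i - \theta_0 - \theta_1 x_i\bigr)^2 .
\]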
So what we're looking at is the sum of this quantity squared, that quantity squared, that quantity squared, and so on. We add all of these squares, and we would like to find the line for which the sum of these squared prediction errors is as small as possible. So that's the procedure. We have our data, the X's and the Y's, and we're going to find the thetas, the best model of this type, the best possible model, by minimizing this sum of squared errors.

So that's a method that one could pull out of a hat and say, OK, that's how I'm going to build my model. And it sounds pretty reasonable. And it sounds pretty reasonable even if you don't know anything about probability. But does it have some probabilistic justification? It turns out that yes, you can motivate this method with probabilistic considerations under certain assumptions. So let's make a probabilistic model that's going to lead us to this particular way of estimating the parameters.

So here's a probabilistic model. I pick a student who had a specific SAT score. And that could be done at random, but also could be done in a systematic way. That is, I pick a student who had an SAT of 600, a student of 610, all the way to 1,400 or 1,600, whatever the right number is. I pick all those students. And I assume that for a student of this kind there's a true model that tells me that their GPA is going to be a random variable, which is something predicted by their SAT score plus some randomness, some random noise. And I model that random noise by independent normal random variables with 0 mean and a certain variance.

So this is a specific probabilistic model, and now I can think about doing maximum likelihood estimation for this particular model. So to do maximum likelihood estimation here I need to write down the likelihood of the y's that I have observed. What's the likelihood of the y's that I have observed? Well, a particular w has a likelihood of the form e to the minus w squared over (2 sigma-squared). That's the likelihood of a particular w.
The probability, or the likelihood, of observing a particular value of y is the same as the likelihood that w takes the value of y minus this, minus that. So the likelihood of the y's is of this form. Think of this as just being the w_i-squared. So this is the density, and if we have multiple data points you multiply the likelihoods of the different y's. So you have to write something like this. Since the w's are independent, that means that the y's are also independent. The likelihood of a y vector is the product of the likelihoods of the individual y's. The likelihood of every individual y is of this form, where w is y_i minus these two quantities. So this is the form that the likelihood function is going to take under this particular model. And under the maximum likelihood methodology we want to maximize this quantity with respect to theta0 and theta1.

Now to do this maximization you might as well consider the logarithm and maximize the logarithm, which is just the exponent up here. Maximizing this exponent, because we have a minus sign, is the same as minimizing the exponent without the minus sign. Sigma squared is a constant. So what you end up doing is minimizing this quantity here, which is the same as what we had in our linear regression method.

So in conclusion, you might choose to do linear regression in this particular way just because it looks reasonable or plausible. Or you might interpret what you're doing as maximum likelihood estimation, in which you assume a model of this kind where the noise terms are normal random variables with the same distribution, independent identically distributed. So linear regression implicitly makes an assumption of this kind. It's doing maximum likelihood estimation as if the world were really described by a model of this form, with the W's being random variables. So this gives us at least some justification that this particular approach to fitting lines to data is not so arbitrary, but it has a sound footing.
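In symbols, the model says Y_i = theta0 + theta1 x_i + W_i, with the W_i independent normal random variables with mean 0 and variance sigma squared, so the likelihood of the observed y's is

\[
f_Y(y;\theta_0,\theta_1)
= \prod_{i=1}^{n} \frac{1}{\sqrt{2\pi}\,\sigma}
\exp\!\Bigl(-\frac{(y_i-\theta_0-\theta_1 x_i)^2}{2\sigma^2}\Bigr),
\]

and taking logarithms,

\[
\log f_Y(y;\theta_0,\theta_1)
= \text{constant} - \frac{1}{2\sigma^2}\sum_{i=1}^{n}(y_i-\theta_0-\theta_1 x_i)^2 ,
\]

so maximizing the likelihood over theta0 and theta1 is exactly the same as minimizing the sum of squared errors.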
OK, so then once you accept this formulation as being a reasonable one, what's the next step? The next step is to see how to carry out this minimization. This is not a very difficult minimization to do. The way it's done is by setting the derivatives of this expression to 0. Now because this is a quadratic function of theta0 and theta1, when you take the derivatives with respect to theta0 and theta1 you get linear functions of theta0 and theta1. And you end up solving a system of linear equations in theta0 and theta1. And it turns out that there are very nice and simple formulas for the optimal estimates of the parameters in terms of the data. And the formulas are these ones.

I said that these are nice and simple formulas. Let's see why. How can we interpret them? So suppose that the world is described by a model of this kind, where the X's and Y's are random variables, and where W is a noise term that's independent of X. So we're assuming that a linear model is indeed true, but not exactly true. There's always some noise associated with any particular data point that we obtain. So if a model of this kind is true, and the W's have 0 mean, then we have that the expected value of Y would be theta0 plus theta1 times the expected value of X. And because W has 0 mean, there's no extra term. So in particular, theta0 would be equal to the expected value of Y minus theta1 times the expected value of X.

So let's use this equation to try to come up with a reasonable estimate of theta0. I do not know the expected value of Y, but I can estimate it. How do I estimate it? I look at the average of all the y's that I have obtained. So I replace this, I estimate it with the average of the data I have seen. Here, similarly with the X's. I might not know the expected value of the X's, but I have data points for the x's. I look at the average of all my data points, and I come up with an estimate of this expectation.
Now I don't know what theta1 is, but my procedure is going to generate an estimate of theta1, called theta1 hat. And once I have this estimate, then a reasonable person would estimate theta0 in this particular way. So that's how my estimate of theta0 is going to be constructed. It's this formula here. We have not yet addressed the harder question, which is how to estimate theta1 in the first place. So to estimate theta0 I assumed that I already had an estimate for theta1.

OK, the right formula for the estimate of theta1 happens to be this one. It looks messy, but let's try to interpret it. What I'm going to do is take this model, and for simplicity let's assume that the random variables have 0 means, and see how we might try to estimate theta1. Let's multiply both sides of this equation by X. So we get Y times X equals theta0 times X, plus theta1 times X-squared, plus X times W. And now take expectations of both sides. If I have 0 mean random variables, the expected value of Y times X is just the covariance of X with Y. I have assumed that my random variables have 0 means, so the expectation of this is 0. This one is going to be the variance of X, so I have theta1 times the variance of X. And since I'm assuming that my random variables have 0 mean, and I'm also assuming that W is independent of X, this last term also has 0 mean.

So under such a probabilistic model this equation is true. If we knew the variance and the covariance, then we would know the value of theta1. But we only have data; we do not necessarily know the variance and the covariance, but we can estimate them. What's a reasonable estimate of the variance? The reasonable estimate of the variance is this quantity here divided by n, and the reasonable estimate of the covariance is that numerator divided by n. So this is my estimate of the mean. I'm looking at the squared distances from the mean, and I average them over lots and lots of data.
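In symbols, multiplying Y = theta0 + theta1 X + W by X and taking expectations (with the zero-mean simplification above) gives cov(X,Y) = theta1 var(X), so theta1 = cov(X,Y)/var(X). Replacing the covariance and the variance by their natural data-based estimates gives the standard formulas being referred to here:

\[
\hat\theta_1 = \frac{\sum_{i=1}^{n}(x_i-\bar x)(y_i-\bar y)}{\sum_{i=1}^{n}(x_i-\bar x)^2},
\qquad
\hat\theta_0 = \bar y - \hat\theta_1 \bar x ,
\]

where \bar x and \bar y are the averages of the x's and the y's; the 1/n factors in the covariance and variance estimates cancel in the ratio.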
This is the most reasonable way of estimating the variance of our distribution. And similarly, the expected value of this quantity is the covariance of X with Y, and when we have lots and lots of data points this quantity here is going to be a very good estimate of the covariance. So basically what this formula does, one way of thinking about it, is that it starts from this relation, which is true exactly, but estimates the covariance and the variance on the basis of the data, and then uses these estimates to come up with an estimate of theta1.

So this gives us a probabilistic interpretation of the formulas that we have for the way that the estimates are constructed. If you're willing to assume that this is the true model of the world, the structure of the true model of the world, except that you do not know the means and covariances and variances, then this is a natural way of estimating those unknown parameters. All right, so we have a closed-form formula, and we can apply it whenever we have data.
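As a minimal sketch, this is how one might apply those closed-form formulas in Python; the arrays here are made-up numbers, used only for illustration.

import numpy as np

def fit_line(x, y):
    # Closed-form least-squares estimates for the model y ~ theta0 + theta1 * x.
    x_bar, y_bar = np.mean(x), np.mean(y)
    # Estimated covariance over estimated variance (the 1/n factors cancel).
    theta1_hat = np.sum((x - x_bar) * (y - y_bar)) / np.sum((x - x_bar) ** 2)
    theta0_hat = y_bar - theta1_hat * x_bar
    return theta0_hat, theta1_hat

# Hypothetical SAT-like scores and GPA-like values.
x = np.array([600.0, 650.0, 700.0, 720.0, 760.0])
y = np.array([3.1, 3.3, 3.2, 3.6, 3.8])
print(fit_line(x, y))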
Now linear regression is a subject on which there are whole courses and whole books. And the reason for that is that there's a lot more that you can bring into the topic, and many ways that you can elaborate on the simple solution that we got for the case of two parameters and only two random variables. So let me give you a little bit of a flavor of the topics that come up when you start looking into linear regression in more depth.

In our discussion so far we made a linear model in which we're trying to explain the values of one variable in terms of the values of another variable. We're trying to explain GPAs in terms of SAT scores, or we're trying to predict GPAs in terms of SAT scores. But maybe your GPA is affected by several factors. For example, maybe your GPA is affected by your SAT score, also the income of your family, the years of education of your grandmother, and many other factors like that. So you might write down a model in which GPA is a linear function of all these other variables that I mentioned. So perhaps you have a theory of what determines performance at college, and you want to build a model of that type.

How do we go about it in this case? Well, again we collect the data points. We look at the i-th student, who has a college GPA. We record their SAT score, their family income, and their grandmother's years of education. So this is one data point, for one particular student. We postulate a model of this form. For the i-th student, this would be the mistake that our model makes if we have chosen specific values for those parameters. And then we go and choose the parameters that are going to give us, again, the smallest possible sum of squared errors. So philosophically it's exactly the same as what we were discussing before, except that now we're including multiple explanatory variables in our model instead of a single explanatory variable.

So that's the formulation. What do you do next? Well, to do this minimization you're going to take derivatives. Once you have your data, you have a function of these three parameters. You take the derivative with respect to each parameter, set the derivative equal to 0, and you get a system of linear equations. You throw that system of linear equations at the computer, and you get numerical values for the optimal parameters. There are no nice closed-form formulas of the type that we had in the previous slide when you're dealing with multiple variables, unless you're willing to go into matrix notation. In that case you can again write down closed-form formulas, but they will be a little less intuitive than what we had before. But the moral of the story is that numerically this is a procedure that's very easy. It's an optimization problem that the computer can solve for you, and it can solve it for you very quickly, because all that it involves is solving a system of linear equations.
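In matrix notation, these closed-form formulas (which the lecture mentions but does not write out) take the standard form: stack the data into a matrix X whose i-th row is (1, x_{i1}, x_{i2}, x_{i3}), the explanatory variables for student i preceded by a constant, and into a vector y of the observed y_i. The least-squares estimate then solves the normal equations

\[
X^{\top} X \,\hat\theta = X^{\top} y,
\qquad \text{so} \qquad
\hat\theta = (X^{\top} X)^{-1} X^{\top} y
\]

whenever X^{\top} X is invertible.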
Now when you choose your explanatory variables you may have some choices. One person may think that your GPA has something to do with your SAT score. Some other person may think that your GPA has something to do with the square of your SAT score. And that other person may want to try to build a model of this kind. Now when would you want to do this? Suppose that the data that you have look like this. If the data look like this, then you might be tempted to say, well, a linear model does not look right, but maybe a quadratic model will give me a better fit for the data. So if you want to fit a quadratic model to the data, then what you do is take X-squared as your explanatory variable instead of X, and you build a model of this kind.

There's nothing really different in models of this kind compared to models of that kind. They are still linear models, because we have thetas showing up in a linear fashion. What you take as your explanatory variables, whether it's X, whether it's X-squared, or whether it's some other function that you chose, some general function h of X, doesn't make a difference. So think of your h of X as being your new X. So you can formulate the problem exactly the same way, except that instead of using X's you use h of X's. So it's basically a question: do I want to build a model that explains Y's based on the values of X, or do I want to build a model that explains Y's on the basis of the values of h of X? Which is the right one to use? And with this picture here, we see that it can make a difference. A linear model in X might be a poor fit, but a quadratic model might give us a better fit.

So this brings us to the topic of how to choose your functions h of X if you're dealing with a real world problem. So in a real world problem you're just given X's and Y's, and you have the freedom of building models of any kind you want. You have the freedom of choosing a function h of X of any type that you want.
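Here is a minimal sketch of this idea in Python, fitting y as theta0 plus theta1 times h(x) with h(x) = x squared; the data are made up, and NumPy's least-squares solver plays the role of the computer that solves the linear equations.

import numpy as np

# Made-up data that curve upward, so a quadratic explanatory variable fits better.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([1.2, 4.1, 8.8, 16.3, 24.9])

# Design matrix with a constant column and h(x) = x**2 as the explanatory variable.
H = np.column_stack([np.ones_like(x), x ** 2])
theta_hat, *_ = np.linalg.lstsq(H, y, rcond=None)
print(theta_hat)  # theta_hat[0] is theta0, theta_hat[1] is theta1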
So this turns out to be a quite difficult and tricky topic, because you may be tempted to overdo it. For example, I've got my 10 data points, and I could say, OK, I'm going to choose an h of X, and actually multiple h's of X, to do a multiple linear regression in which I'm going to build a model that uses a 10th degree polynomial. If I choose to fit my data with a 10th degree polynomial, I'm going to fit my data perfectly, but I may obtain a model that does something like this, and goes through all my data points. So I can make my prediction errors extremely small if I use lots of parameters and if I choose my h functions appropriately. But clearly this would be garbage. If you get those data points and you say, here's my model that explains them, and it has a polynomial going up and down, then you're probably doing something wrong.

So choosing how complicated those functions, the h's, should be, and how many explanatory variables to use, is a very delicate and deep topic, on which there's deep theory that tells you what you should do and what you shouldn't do. But the main thing that one should avoid doing is having too many parameters in your model when you have too few data. So if you only have 10 data points, you shouldn't have 10 free parameters. With 10 free parameters you will be able to fit your data perfectly, but you wouldn't be able to really rely on the results that you are seeing.

OK, now in practice, when people run linear regressions they do not just give point estimates for the parameters theta. Similar to what we did for the case of estimating the mean of a random variable, you might want to give confidence intervals that sort of tell you how much randomness there is when you estimate each one of the particular parameters. There are formulas for building confidence intervals for the estimates of the thetas.
We're not going to look at them; it would take too much time. Also, you might want to estimate the variance of the noise that you have in your model. That is, if you are pretending that your true model is of the kind we were discussing before, namely Y equals theta0 plus theta1 times X plus W, and W has a variance sigma squared, you might want to estimate this, because it tells you something about the model, and this is called the standard error. It puts a limit on how good the predictions of your model can be. Even if you have the correct theta0 and theta1, and somebody tells you X, you can make a prediction about Y, but that prediction will not be accurate, because there's this additional randomness. And if that additional randomness is big, then your predictions will also have a substantial error in them.

There's another quantity that usually gets reported. It's part of the computer output that you get when you use a statistical package, and it's called R-square. It's a measure of the explanatory power of the model that you have built using linear regression. Instead of defining R-square exactly, let me give you a sort of analogous quantity that's involved. After you do your linear regression you can look at the following quantity. You look at the variance of Y, which is something that you can estimate from data. This is how much randomness there is in Y. And compare it with the randomness that you have in Y, but conditioned on X. So this quantity tells me, if I knew X, how much randomness would there still be in my Y? So if I know X, I have more information, so Y is more constrained. There's less randomness in Y. This is the randomness in Y if I don't know anything about X. So naturally this quantity would be less than 1, and if this quantity is small it would mean that whenever I know X, then Y is very well known. Which essentially tells me that knowing X allows me to make very good predictions about Y. Knowing X means that I'm explaining away most of the randomness in Y.
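In symbols, the quantity being described is the ratio

\[
\frac{\operatorname{var}(Y \mid X)}{\operatorname{var}(Y)} ,
\]

the fraction of the variability of Y that is left over once X is known. R-square is then, roughly speaking, one minus such a ratio, the fraction of the variance of Y that the regression does explain; this is an informal reading, since the lecture deliberately does not give the exact definition.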
So if you read a statistical study that uses linear regression, you might encounter statements of the form, 60% of a student's GPA is explained by the family income. When you read statements of this kind, they really refer to quantities of this kind. Out of the total variance in Y, how much variance is left after we build our model? So if only 40% of the variance of Y is left after we build our model, that means that X explains 60% of the variation in the Y's. So the idea is that randomness in Y is caused by multiple sources: our explanatory variable, and random noise. And we ask the question, what percentage of the total randomness in Y is explained by variations in the X variable, and how much of the total randomness in Y is attributed just to random effects? So if you have a model that explains most of the variation in Y, then you can think that you have a good model that tells you something useful about the real world.

Now there are lots of things that can go wrong when you use linear regression, and there are many pitfalls. One pitfall happens when you have the situation that's called heteroskedasticity. So suppose your data are of this kind. So what's happening here? You seem to have a linear model, but when X is small you have a very good model. So this means that W has a small variance when X is here. On the other hand, when X is there you have a lot of randomness. This would be a situation in which the W's are not identically distributed, but the variance of the W's, of the noise, has something to do with the X's. So in different regions of our x-space we have different amounts of noise. What will go wrong in this situation? Since we're trying to minimize the sum of squared errors, we're really paying attention to the biggest errors, which will mean that we are going to pay attention to these data points, because that's where the big errors are going to be.
So the linear regression formulas will end up building a model based on these data, which are the most noisy ones, instead of those data that are nicely stacked in order. Clearly that's not the right thing to do. So you need to change something, and use the fact that the variance of W changes with the X's, and there are ways of dealing with it. It's something that one needs to be careful about.

Another possibility for getting into trouble is if you're using multiple explanatory variables that are very closely related to each other. So for example, suppose that I tried to predict your GPA by looking at your SAT the first time that you took it plus your SAT the second time that you took it. I'm assuming that almost everyone takes the SAT more than once. So suppose that you had a model of this kind. Well, SAT on your first try and SAT on your second try are very likely to be fairly close. And you could think of coming up with a model in which this term is ignored and you make predictions based on the first SAT, or an alternative model in which that term is ignored and you make predictions based on the second SAT. And both models are likely to be essentially as good as the other one, because these two quantities are essentially the same. So in that case, the thetas that you estimate are going to be very sensitive to little details of the data. You have your data, and your data tell you that this coefficient is big and that coefficient is small. You change your data just a tiny bit, and your thetas would drastically change. So this is a case in which you have multiple explanatory variables, but they're redundant, in the sense that they're very closely related to each other, perhaps with a linear relation. So one must be careful about this situation, and do special tests to make sure that this doesn't happen.

Finally, the biggest and most common blunder is that you run your linear regression, you get your linear model, and then you say, oh, OK, Y is caused by X according to this particular formula.
Well, all that we did was to identify a linear relation between X and Y. This doesn't tell us anything about whether it's Y that causes X, or whether it's X that causes Y, or whether maybe both X and Y are caused by some other variable that we didn't think about. So building a good linear model that has small errors does not tell us anything about causal relations between the two variables. It only tells us that there's a close association between the two variables. If you know one, you can make predictions about the other. But it doesn't tell you anything about the underlying physics, that there's some physical mechanism that introduces the relation between those variables.

OK, that's it about linear regression. Let us start the next topic, which is hypothesis testing, and we're going to continue with it next time. So here, instead of trying to estimate continuous parameters, we have two alternative hypotheses about the distribution of the X random variable. So for example, our random variable could either be distributed according to this distribution, under H0, or it might be distributed according to this distribution, under H1. And we want to make a decision: which distribution is the correct one? So we're given those two distributions, and the common terminology is that one of them is the null hypothesis, sort of the default hypothesis, and the other is the alternative hypothesis, and we want to check whether this one is true or that one is true.

So you obtain a data point, and you want to make a decision. In this picture, what would a reasonable person do to make a decision? They would probably choose a certain threshold, xi, and decide that H1 is true if the data falls in this interval, and decide that H0 is true if it falls on the other side. So that would be a reasonable way of approaching the problem. More generally, you take the set of all possible X's and you divide the set of possible X's into two regions. One is the rejection region, in which you decide H1, or you reject H0.
755 00:44:13,170 --> 00:44:15,760 756 00:44:15,760 --> 00:44:21,640 And the complement of that region is where you decide H0. 757 00:44:21,640 --> 00:44:25,210 So this is the x-space of your data. 758 00:44:25,210 --> 00:44:28,350 In this example here, x was one-dimensional. 759 00:44:28,350 --> 00:44:31,770 But in general X is going to be a vector, and all the 760 00:44:31,770 --> 00:44:34,790 possible data vectors that you can get are 761 00:44:34,790 --> 00:44:36,600 divided into two types. 762 00:44:36,600 --> 00:44:40,400 If it falls in this set, you make one decision. 763 00:44:40,400 --> 00:44:43,770 If it falls in that set, you make the other decision. 764 00:44:43,770 --> 00:44:47,380 OK, so how would you characterize the performance 765 00:44:47,380 --> 00:44:49,690 of a particular way of making a decision? 766 00:44:49,690 --> 00:44:53,000 Suppose I chose my threshold. 767 00:44:53,000 --> 00:44:57,960 I may make mistakes of two possible types. 768 00:44:57,960 --> 00:45:03,360 Perhaps H0 is true, but my data happens to fall here. 769 00:45:03,360 --> 00:45:07,560 In which case I make a mistake, and this would be a 770 00:45:07,560 --> 00:45:10,730 false rejection of H0. 771 00:45:10,730 --> 00:45:15,070 If my data falls here I reject H0. 772 00:45:15,070 --> 00:45:16,890 I decide H1. 773 00:45:16,890 --> 00:45:19,510 Whereas H0 was true. 774 00:45:19,510 --> 00:45:21,690 The probability of this happening? 775 00:45:21,690 --> 00:45:24,890 Let's call it alpha. 776 00:45:24,890 --> 00:45:28,040 But there's another kind of error that can be made. 777 00:45:28,040 --> 00:45:32,810 Suppose that H1 was true, but by accident my data happens to 778 00:45:32,810 --> 00:45:34,250 fall on that side. 779 00:45:34,250 --> 00:45:36,610 Then I'm going to make an error again. 780 00:45:36,610 --> 00:45:40,540 I'm going to decide H0 even though H1 was true. 781 00:45:40,540 --> 00:45:42,570 How likely is this to occur? 782 00:45:42,570 --> 00:45:46,420 This would be the area under this curve here. 783 00:45:46,420 --> 00:45:50,600 And that's the other type of error that can be made, and 784 00:45:50,600 --> 00:45:55,400 beta is the probability of this particular type of error. 785 00:45:55,400 --> 00:45:57,550 Both of these are errors. 786 00:45:57,550 --> 00:45:59,640 Alpha is the probability of an error of one kind. 787 00:45:59,640 --> 00:46:02,110 Beta is the probability of an error of the other kind. 788 00:46:02,110 --> 00:46:03,510 You would like the probabilities 789 00:46:03,510 --> 00:46:05,050 of error to be small. 790 00:46:05,050 --> 00:46:07,550 So you would like to make both alpha and 791 00:46:07,550 --> 00:46:09,780 beta as small as possible. 792 00:46:09,780 --> 00:46:13,300 Unfortunately that's not possible; there's a trade-off. 793 00:46:13,300 --> 00:46:17,540 If I move my threshold this way, then alpha becomes 794 00:46:17,540 --> 00:46:20,760 smaller, but beta becomes bigger. 795 00:46:20,760 --> 00:46:22,770 So there's a trade-off. 796 00:46:22,770 --> 00:46:29,350 If I make my rejection region smaller, one kind of error is 797 00:46:29,350 --> 00:46:31,880 less likely, but the other kind of error 798 00:46:31,880 --> 00:46:34,670 becomes more likely. 799 00:46:34,670 --> 00:46:38,050 So we have this trade-off. 800 00:46:38,050 --> 00:46:39,620 So what do we do about it? 801 00:46:39,620 --> 00:46:41,570 How do we proceed systematically? 802 00:46:41,570 --> 00:46:45,680 How do we come up with rejection regions?
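[A small numerical sketch of this trade-off, under the illustrative assumption that X is a single observation distributed as N(0, 1) under H0 and N(2, 1) under H1, that we reject H0 when x exceeds a threshold xi, and that scipy is available; none of these specifics come from the lecture.]

from scipy.stats import norm

m0, m1, sigma = 0.0, 2.0, 1.0          # illustrative hypotheses: N(m0, sigma^2) vs N(m1, sigma^2)

for xi in [0.5, 1.0, 1.5, 2.0]:        # a few candidate thresholds
    alpha = norm.sf(xi, loc=m0, scale=sigma)   # P(reject H0 | H0 true): false rejection
    beta = norm.cdf(xi, loc=m1, scale=sigma)   # P(accept H0 | H1 true): missed detection
    print(f"xi = {xi:.1f}   alpha = {alpha:.3f}   beta = {beta:.3f}")

# Moving xi to the right makes alpha smaller but beta bigger, and vice versa:
# shrinking the rejection region trades one kind of error for the other.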
803 00:46:45,680 --> 00:46:48,900 Well, what the theory basically tells you is 804 00:46:48,900 --> 00:46:53,200 how you should create those regions. 805 00:46:53,200 --> 00:46:57,860 But it doesn't tell you everything. 806 00:46:57,860 --> 00:47:00,970 It tells you the general shape of those regions. 807 00:47:00,970 --> 00:47:05,120 For example here, the theory tells us that the right 808 00:47:05,120 --> 00:47:07,430 thing to do would be to put a threshold and make 809 00:47:07,430 --> 00:47:10,910 decisions one way to the right of it, the other way to the left. 810 00:47:10,910 --> 00:47:12,830 But it might not necessarily tell us 811 00:47:12,830 --> 00:47:15,020 where to put the threshold. 812 00:47:15,020 --> 00:47:18,890 Still, it's useful enough to know that the way to make a 813 00:47:18,890 --> 00:47:20,960 good decision would be in terms of 814 00:47:20,960 --> 00:47:22,400 a particular threshold. 815 00:47:22,400 --> 00:47:24,770 Let me make this more specific. 816 00:47:24,770 --> 00:47:27,380 We can take our inspiration from the solution of the 817 00:47:27,380 --> 00:47:29,820 hypothesis testing problem that we had in 818 00:47:29,820 --> 00:47:31,370 the Bayesian case. 819 00:47:31,370 --> 00:47:34,130 In the Bayesian case, we just pick the hypothesis which is 820 00:47:34,130 --> 00:47:37,480 more likely given the data. 821 00:47:37,480 --> 00:47:40,080 The posterior probabilities produced using the Bayes 822 00:47:40,080 --> 00:47:42,770 rule are written this way. 823 00:47:42,770 --> 00:47:45,240 And this term is the same as that term. 824 00:47:45,240 --> 00:47:49,500 They cancel out; then let me collect terms here and there. 825 00:47:49,500 --> 00:47:52,370 826 00:47:52,370 --> 00:47:54,030 I get an expression here. 827 00:47:54,030 --> 00:47:56,090 I think the version you have in your handout 828 00:47:56,090 --> 00:47:57,340 is the correct one. 829 00:47:57,340 --> 00:47:59,810 830 00:47:59,810 --> 00:48:02,082 The one on the slide was not the correct one, so 831 00:48:02,082 --> 00:48:03,730 I'm fixing it here. 832 00:48:03,730 --> 00:48:06,920 OK, so this is the form of how you make decisions in the 833 00:48:06,920 --> 00:48:08,720 Bayesian case. 834 00:48:08,720 --> 00:48:10,620 What you do in the Bayesian case is you 835 00:48:10,620 --> 00:48:13,270 calculate this ratio. 836 00:48:13,270 --> 00:48:17,110 Let's call it the likelihood ratio. 837 00:48:17,110 --> 00:48:20,770 And compare that ratio to a threshold. 838 00:48:20,770 --> 00:48:22,916 And the threshold that you should be using in the 839 00:48:22,916 --> 00:48:25,240 Bayesian case has something to do with the prior 840 00:48:25,240 --> 00:48:28,000 probabilities of the two hypotheses. 841 00:48:28,000 --> 00:48:31,840 In the non-Bayesian case we do not have prior probabilities, 842 00:48:31,840 --> 00:48:34,690 so we do not know how to set this threshold. 843 00:48:34,690 --> 00:48:38,350 But what we're going to do is keep this particular 844 00:48:38,350 --> 00:48:42,690 structure anyway, and maybe use some other considerations 845 00:48:42,690 --> 00:48:44,480 to pick the threshold. 846 00:48:44,480 --> 00:48:51,030 So we're going to use a likelihood ratio test, as 847 00:48:51,030 --> 00:48:54,260 it's called, in which we calculate a quantity of this 848 00:48:54,260 --> 00:48:56,830 kind, which we call the likelihood ratio, and compare it 849 00:48:56,830 --> 00:48:58,480 with a threshold. 850 00:48:58,480 --> 00:49:00,530 So what's the interpretation of this likelihood ratio?
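[As a sketch of what such a likelihood ratio test looks like, again assuming, purely for illustration, Gaussian hypotheses N(0, 1) under H0 and N(2, 1) under H1, and that scipy is available; the threshold value below is arbitrary.]

from scipy.stats import norm

m0, m1, sigma = 0.0, 2.0, 1.0

def likelihood_ratio(x):
    # f_X(x; H1) / f_X(x; H0)
    return norm.pdf(x, loc=m1, scale=sigma) / norm.pdf(x, loc=m0, scale=sigma)

def decide(x, threshold):
    # likelihood ratio test: choose H1 when the ratio exceeds the threshold
    return "H1" if likelihood_ratio(x) > threshold else "H0"

print(decide(0.3, threshold=1.0))   # data near m0: ratio below 1, decide H0
print(decide(1.8, threshold=1.0))   # data near m1: ratio above 1, decide H1

# For these two Gaussians the ratio equals exp((m1 - m0) * x / sigma**2 + constant),
# which is increasing in x, so comparing the ratio to a threshold is the same as
# comparing x itself to a (different) threshold xi, matching the picture in the lecture.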
851 00:49:00,530 --> 00:49:03,140 852 00:49:03,140 --> 00:49:04,290 We ask-- 853 00:49:04,290 --> 00:49:08,570 the X's that I have observed, how likely were they to occur 854 00:49:08,570 --> 00:49:10,460 if H1 was true? 855 00:49:10,460 --> 00:49:14,590 And how likely were they to occur if H0 was true? 856 00:49:14,590 --> 00:49:20,560 This ratio would be big if my data are plausible under H1, meaning they might 857 00:49:20,560 --> 00:49:22,400 well occur under H1, 858 00:49:22,400 --> 00:49:25,400 but they're very implausible, extremely unlikely, 859 00:49:25,400 --> 00:49:27,380 to occur under H0. 860 00:49:27,380 --> 00:49:30,060 Then my thinking would be: well, the data that I saw are 861 00:49:30,060 --> 00:49:33,300 extremely unlikely to have occurred under H0. 862 00:49:33,300 --> 00:49:36,780 So H0 is probably not true. 863 00:49:36,780 --> 00:49:39,820 I'm going to go for H1 and choose H1. 864 00:49:39,820 --> 00:49:43,920 So when this ratio is big, it tells us that the data that 865 00:49:43,920 --> 00:49:47,720 we're seeing are better explained if we assume H1 to 866 00:49:47,720 --> 00:49:50,620 be true rather than H0 to be true. 867 00:49:50,620 --> 00:49:53,970 So I calculate this quantity, compare it with a threshold, 868 00:49:53,970 --> 00:49:56,200 and that's how I make my decision. 869 00:49:56,200 --> 00:49:59,360 So in this particular picture, for example, the likelihood 870 00:49:59,360 --> 00:50:02,930 ratio goes 871 00:50:02,930 --> 00:50:07,230 monotonically with my x. So comparing the likelihood ratio 872 00:50:07,230 --> 00:50:10,150 to the threshold would be the same as comparing my x to a 873 00:50:10,150 --> 00:50:12,890 threshold, and we've got the question of how 874 00:50:12,890 --> 00:50:13,920 to choose the threshold. 875 00:50:13,920 --> 00:50:17,880 The threshold is usually chosen by 876 00:50:17,880 --> 00:50:21,560 fixing one of the two probabilities of error. 877 00:50:21,560 --> 00:50:26,710 That is, I say that I want my error of one particular type 878 00:50:26,710 --> 00:50:30,160 to have a given probability, so I fix this alpha. 879 00:50:30,160 --> 00:50:33,160 And then I try to find where my threshold should be, 880 00:50:33,160 --> 00:50:36,095 so that this probability, the tail probability out there, 881 00:50:36,095 --> 00:50:39,190 is just equal to alpha. 882 00:50:39,190 --> 00:50:42,050 And then the other probability of error, beta, will be 883 00:50:42,050 --> 00:50:44,190 whatever it turns out to be. 884 00:50:44,190 --> 00:50:48,140 So somebody picks alpha ahead of time. 885 00:50:48,140 --> 00:50:52,210 Based on that probability of a false rejection, 886 00:50:52,210 --> 00:50:55,890 alpha, I find where my threshold is going to be. 887 00:50:55,890 --> 00:50:59,890 I choose my threshold, and that determines subsequently 888 00:50:59,890 --> 00:51:01,270 the value of beta. 889 00:51:01,270 --> 00:51:07,340 So we're going to continue with this story next time, and 890 00:51:07,340 --> 00:51:08,590 we'll stop here. 891 00:51:08,590 --> 00:51:49,120
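[A closing sketch of the recipe just described, continuing the same illustrative Gaussian assumptions, N(0, 1) under H0 and N(2, 1) under H1, with scipy assumed and alpha = 0.05 chosen arbitrarily: fix alpha, back out the threshold from the H0 distribution, and then see what beta turns out to be.]

from scipy.stats import norm

m0, m1, sigma = 0.0, 2.0, 1.0
alpha = 0.05                                      # chosen ahead of time

# pick the threshold so that the false-rejection probability P(X > xi | H0) equals alpha
xi = norm.ppf(1.0 - alpha, loc=m0, scale=sigma)   # about 1.645 for these numbers

# beta is then whatever it turns out to be: P(X <= xi | H1)
beta = norm.cdf(xi, loc=m1, scale=sigma)          # about 0.36 for these numbers

print(f"threshold xi = {xi:.3f}, beta = {beta:.3f}")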